OpenCL program freezes when a high number of kernels is launched within a loop

Hi,

I have a loop (about 1 billion iterations) that launches OpenCL kernels. Each kernel is executed by a single thread and performs a very trivial operation. The problem is that after a few million iterations the program freezes and never terminates. It freezes in the call to clFinish(), and not always in the same iteration.

The problem disappears if clFinish() is called once every 1000 iterations instead of in every iteration, so my feeling is that clFinish() is waiting for the kernel to end but the kernel is (somehow) killed before clFinish() is called. Note also that when I insert many printf() calls inside the loop, the problem disappears!
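The "sync once every 1000 iterations" workaround described above can be sketched as follows. This is only an illustration of the control flow: `fake_queue`, `fake_enqueue` and `fake_finish` are hypothetical stand-ins (so that the snippet builds without an OpenCL SDK) for the command queue, clEnqueueNDRangeKernel() and clFinish(), and `run_batched` is a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins so the sketch builds without an OpenCL SDK.
   In the real host code these would be the cl_command_queue,
   clEnqueueNDRangeKernel() and clFinish(). */
typedef struct {
    size_t enqueued;  /* commands submitted so far */
    size_t pending;   /* commands submitted since the last sync */
    size_t syncs;     /* number of blocking syncs issued */
} fake_queue;

static void fake_enqueue(fake_queue *q) { q->enqueued++; q->pending++; }
static void fake_finish(fake_queue *q)  { q->pending = 0; q->syncs++; }

/* Submit `total` commands, blocking once every `sync_every` submissions
   instead of after every single one. */
static void run_batched(fake_queue *q, size_t total, size_t sync_every)
{
    assert(sync_every > 0);
    for (size_t i = 0; i < total; i++) {
        fake_enqueue(q);
        if ((i + 1) % sync_every == 0)
            fake_finish(q);
    }
    if (q->pending > 0)  /* drain the partial last batch */
        fake_finish(q);
}
```

With an in-order queue the kernels still execute one after another, so batching only changes how often the host blocks, not the order of the writes to the buffer.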

I get the problem when I execute the program on a CPU device (on my laptop, using the AMD SDK), and I also get it on a machine with an Nvidia Fermi GPU (Nvidia SDK and drivers; the AMD SDK is also installed on that machine).

I’m checking for errors after each OpenCL API call but no error is detected.
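The listing below omits those checks, so here is a minimal sketch of what per-call checking could look like. `cl_status_name`, `cl_check` and `CL_CHECK` are hypothetical names, not part of OpenCL; the numeric codes are the values defined in CL/cl.h, written out as plain integers only so that the snippet builds without the OpenCL headers.

```c
#include <stdio.h>

/* Hypothetical helper: map a few common OpenCL status codes to names.
   The numeric values are those defined in CL/cl.h. */
static const char *cl_status_name(int status)
{
    switch (status) {
    case 0:   return "CL_SUCCESS";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -36: return "CL_INVALID_COMMAND_QUEUE";
    case -48: return "CL_INVALID_KERNEL";
    case -54: return "CL_INVALID_WORK_GROUP_SIZE";
    default:  return "unknown OpenCL status";
    }
}

/* Returns 1 on success; otherwise logs the failing call and returns 0. */
static int cl_check(int status, const char *call, const char *file, int line)
{
    if (status == 0)
        return 1;
    fprintf(stderr, "%s:%d: %s failed: %s (%d)\n",
            file, line, call, cl_status_name(status), status);
    return 0;
}

/* Wrap any call that returns a status code, e.g. CL_CHECK(clFinish(queue)). */
#define CL_CHECK(call) cl_check((call), #call, __FILE__, __LINE__)
```

Note that a check like this can only report a call that returns; a clFinish() that never returns produces no status to inspect.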

My questions:

  • Is there any incorrect use of the OpenCL API below?
  • Is there any problem if a huge number of OpenCL kernels are launched simultaneously?

Host code:


    /* OpenCL initialization. */
    /* ... */
    cl_mem dev_acc = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(double), NULL, &err);

    for (int h0 = 1; h0 <= ni; h0 += 1)
      for (int h2 = 0; h2 < nj; h2 += 1)
        for (int h5 = 0; h5 < h2 - 1; h5 += 1) {
          size_t global_work_size[1] = {1};
          size_t block_size[1] = {1};
          cl_kernel kernel2 = clCreateKernel(program, "kernel2", &err);
          clSetKernelArg(kernel2, 0, sizeof(cl_mem), (void *) &dev_acc);
          clEnqueueNDRangeKernel(queue, kernel2, 1, NULL, global_work_size, block_size, 0, NULL, NULL);
          clFinish(queue);
          clReleaseKernel(kernel2);
        }

Kernel code:


__kernel void kernel2(__global double *acc)
{
      *acc = 1;
}

Compilation:
gcc -O3 -lm -std=gnu99 polybench.c ocl_utilities.c symm_host.c -lOpenCL -lm -I/opt/AMDAPP/include -L/opt/AMDAPP/lib/x86_64

I’m using Ubuntu 12.04, Kernel 3.2.0-29-generic, X86_64, RAM: 2 GB

Any comments on this problem?

I don’t see any errors in your approach, so I’m wondering if there is an error in the library. Did you figure it out?
I found your post because of my interest in calling clSetKernelArg many times.

In your example case, you can move the kernel object and argument code outside of the for loops, right? I wonder if this would reduce resource pressure within the library and run error-free?


    /* OpenCL initialization. */
    /* ... */
    cl_mem dev_acc = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(double), NULL, &err);
    cl_kernel kernel2 = clCreateKernel(program, "kernel2", &err);
    clSetKernelArg(kernel2, 0, sizeof(cl_mem), (void *) &dev_acc);

    for (int h0 = 1; h0 <= ni; h0 += 1)
      for (int h2 = 0; h2 < nj; h2 += 1)
        for (int h5 = 0; h5 < h2 - 1; h5 += 1) {
          size_t global_work_size[1] = {1};
          size_t block_size[1] = {1};
          clEnqueueNDRangeKernel(queue, kernel2, 1, NULL, global_work_size, block_size, 0, NULL, NULL);
          clFinish(queue);
        }
    clReleaseKernel(kernel2);

I noticed that the problem appears only with the Nvidia OpenCL library; the same code works fine with the AMD OpenCL library. This makes me think that the problem is in the library, but I don’t have any proof.

Did you experience a similar problem?

Hello,

I have a multi-threaded OpenCL code in which each thread uses GPUs.
It loops over:

  • get data
  • spawn: each thread (up to 16) processes data, running 2 subtasks on GPUs or a Xeon Phi, depending on the computer
  • join.

It works fine on the computer with Nvidia GPUs.
On the computer with the Xeon Phi, it sometimes runs to completion and sometimes one task freezes, and the code sits there forever.
Both use opencl-1.2-3.0.67279 (from Intel).

Any idea how to track down the problem?

Thanks

Claude