Executing a large kernel

Hi everyone.
I have been writing a raytracing kernel in OpenCL.
The kernel is large because it contains many procedures, and it takes an argument that specifies the number of iterations performed inside the kernel.


kernel void raytracing(global uchar* ..., uint spp, ...) {
    ...
    // spp controls how many iterations (samples) each work-item performs
    for (uint i = 0; i < spp; ++i) {
        ...
    }
    ...
}

Furthermore, kernel execution is also iterated on the host side through the runtime API.


for (int i = 0; i < iterations; ++i) {
    printf("[ %d ]\n", i);
    queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                               cl::NDRange{g_width, g_height},
                               cl::NDRange{32, 32}, nullptr, nullptr);
    queue.finish();
}

So the kernel argument controls how much work each launch performs.

When I set the argument to 1, the program works fine.
But when I set it to a larger value, for example 16, the program never returns from queue.finish().
To complicate matters, the program sometimes works fine even with a large value.

I have tried searching for this problem, and I may have found the cause:

if a kernel execution takes too long, the system resets the video driver, so the program stops working correctly.

I still plan to extend the kernel with richer rendering features, so this will remain a problem even if I keep the argument value low.

Is there a good way to solve this problem?

Thanks in advance.

------Environment------
MacBook Pro Retina late 2013
OS X 10.9.2
Core-i7 4850HQ
16GB RAM
512GB SSD
Iris Pro Graphics 5200 128MB
GeForce GT 750M 2GB

There are various ways of extending or removing the OS timeout / driver reset, but they are workarounds for the larger issue. Ideally, kernels should not exceed dozens of milliseconds if you want your system to remain interactive (windows, buttons, etc.). In the special case of a dedicated compute machine, perhaps hundreds of milliseconds. But if you’re approaching seconds, I’d recommend finding a way to divide your work into small units so your kernels don’t kill the system; the GPU is also needed for the UI. Alternatively, add a second GPU and use it only for compute (no monitor attached). NVIDIA Tesla cards have a mode for this, on Windows at least; they can compute for hours if need be.
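
To make that concrete in terms of the code above, here is a minimal sketch of splitting the spp samples across several short launches. The samplesPerLaunch variable, the extra sample-offset argument, and the setArg indices are all illustrative assumptions; the kernel is assumed to add each launch’s contribution into its output buffer.

const cl_uint samplesPerLaunch = 1;  // keep each launch well below the watchdog limit
for (cl_uint s = 0; s < spp; s += samplesPerLaunch) {
    kernel.setArg(1, samplesPerLaunch);  // the spp argument, now small per launch
    kernel.setArg(2, s);                 // hypothetical sample offset for RNG seeding
    queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                               cl::NDRange{g_width, g_height},
                               cl::NDRange{32, 32}, nullptr, nullptr);
    queue.finish();  // returns the GPU to the OS between launches
}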

Thank you for your reply.
I had a misconception:
the problem is not the execution time of each individual work-item, but the total time of a whole kernel launch.
So the problem can be solved by reducing the execution time of each enqueueNDRangeKernel() call, right?

I tried dividing the image plane into a few tiles, and the program works fine even when I set the argument to a large value.
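
For reference, here is a minimal sketch of that tiling approach, where tile_w and tile_h are illustrative tile sizes assumed to divide g_width and g_height evenly. The second argument to enqueueNDRangeKernel() is the global work offset (available since OpenCL 1.1), so the kernel itself needs no changes:

for (size_t ty = 0; ty < g_height; ty += tile_h) {
    for (size_t tx = 0; tx < g_width; tx += tile_w) {
        queue.enqueueNDRangeKernel(kernel,
                                   cl::NDRange{tx, ty},          // offset of this tile
                                   cl::NDRange{tile_w, tile_h},  // one tile per launch
                                   cl::NDRange{32, 32}, nullptr, nullptr);
        queue.finish();  // each tile is a short, watchdog-friendly launch
    }
}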

Thank you, and I’m sorry for my poor English.

Hi,

So here is something to think about. You can slice up your execution of the data many ways, but the total amount of computation stays the same; what changes are the memory access patterns and the load-balancing potential. To put it another way: the total number of instructions needed to run your algorithm is roughly the same no matter how you structure your code, but some structures will make memory accesses slower and some will make them faster. It is also the case that a few big chunks of work and many small chunks of work will take roughly the same time to complete.

Now that I have probably confused you, here is something to consider. Let your uint spp parameter dictate the number of “jobs” to compute, rather than consuming it in a for-loop inside each work-item. One option is a 2D execution of num_items x spp; another is a single big pool of num_items x spp calculations. Either way, you need to be careful with memory accesses to maintain good performance.
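
One way to instantiate this with the 2D image range used above is to make the sample index an extra NDRange dimension. This is only a sketch under the assumption of a per-sample buffer (sample_buf is an illustrative name) that a separate reduction pass later averages down to the final image:

kernel void raytracing(global float4* sample_buf, ...) {
    size_t x = get_global_id(0);
    size_t y = get_global_id(1);
    size_t s = get_global_id(2);  // sample index, replacing the old spp loop
    // ... trace exactly one sample for pixel (x, y), producing `color` ...
    // each (x, y, s) work-item writes its own slot; a later pass averages
    // the spp slots of each pixel
    sample_buf[(y * get_global_size(0) + x) * get_global_size(2) + s] = color;
}

The host launch then becomes:

queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                           cl::NDRange{g_width, g_height, spp},
                           cl::NullRange, nullptr, nullptr);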

You might also want to structure your code in such a way that you can reduce the amount of work you give to a single kernel launch. This can help eliminate issues with kernel execution limits, and probably won’t hurt your performance too much.

I hope this helps a little bit.