What is the recommended way to convert serial loops into NDRanges? The loop count is a starting value for global_size, but global_size must be a multiple of local_size. So either the kernel needs a conditional statement checking global_id < loop count, or the array argument(s) must be padded (with appropriate values) before being passed to the kernel so that their size becomes a multiple of local_size. The second method removes the condition inside the kernel but takes time to resize the array(s). So the first method seems better, but doesn't that preclude one from using a barrier(CLK_LOCAL_MEM_FENCE)?
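To make the first method concrete, here is a minimal sketch (the kernel name and arguments are placeholders). On the host, global_size is rounded up with ((n + local_size - 1) / local_size) * local_size, and the kernel masks off the padded tail:

    // Hypothetical element-wise kernel; global_size has been rounded up
    // to a multiple of local_size, so work-items past n are masked off.
    __kernel void scale(__global float *data, const uint n, const float factor)
    {
        size_t gid = get_global_id(0);
        if (gid < n)            // guard against the padded tail
            data[gid] *= factor;
    }

The guard itself is cheap; the worry is what happens once a barrier is involved.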
I recall the OpenCL 1.1 spec saying that if any work-item in a work-group encounters a barrier, then all work-items in that work-group must also encounter that barrier. But if a work-item has already returned, it's not an executing work-item, right?
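If I understand the spec correctly, the safe alternative to an early return is to keep every work-item alive and predicate only the memory accesses, so the whole work-group reaches the barrier. A sketch (the kernel is hypothetical):

    // Hypothetical local-memory staging kernel. Work-items past n stay
    // alive so that every work-item in the group reaches the barrier;
    // only the loads are predicated.
    __kernel void stage(__global const float *in,
                        __local  float       *tile,
                        const uint            n)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        // Padded work-items write a value that is neutral for the kernel.
        tile[lid] = (gid < n) ? in[gid] : 0.0f;

        // No early return before this point, so the barrier is safe.
        barrier(CLK_LOCAL_MEM_FENCE);

        // ... rest of the work would go here ...
    }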
I agree doing it from the beginning would be much more efficient. However, I'm trying to port a pre-existing library (the GNU Scientific Library) to OpenCL, so unfortunately I can't assume anything about the input array sizes.
Another thing is that the particular function takes a stride for the array elements; if the calculations are to be done on a GPU (whether host-side or attached over PCIe), the strided elements should be pre-coalesced. In that case I can temporarily pack and pad as desired. What's your opinion in this case?
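To be concrete, the pack-and-pad step I have in mind looks roughly like this (the helper is made up, and n, stride and local_size are placeholders; GSL stores element i of a strided vector at data[i * stride]):

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper: gather a strided vector into a contiguous
     * buffer whose length is padded to a multiple of local_size.
     * The caller frees the result. */
    static double *pack_and_pad(const double *data, size_t n, size_t stride,
                                size_t local_size, size_t *padded_n)
    {
        *padded_n = ((n + local_size - 1) / local_size) * local_size;
        double *packed = malloc(*padded_n * sizeof *packed);
        if (!packed)
            return NULL;

        for (size_t i = 0; i < n; ++i)
            packed[i] = data[i * stride];  /* coalesce the strided elements */

        /* Fill the tail with zeros, or whatever is neutral for the kernel. */
        memset(packed + n, 0, (*padded_n - n) * sizeof *packed);
        return packed;
    }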
I don't understand why porting an existing library prevents you from padding all buffers at creation time to a size that is a multiple of the maximum work-group size supported by the device. Surely the CL buffer objects are not visible to the application. If the issue is that you are currently using CL_MEM_USE_HOST_PTR, what I would do is replace it with CL_MEM_ALLOC_HOST_PTR and then enqueue a write command.
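As a sketch of what I mean (error handling abbreviated; context, queue and local_size come from the usual setup code):

    #include <CL/cl.h>

    /* Hypothetical helper: create a buffer padded to a multiple of the
     * work-group size, then copy in the real data with a write command. */
    static cl_mem create_padded_buffer(cl_context context,
                                       cl_command_queue queue,
                                       const double *host_data, size_t n,
                                       size_t local_size, cl_int *err)
    {
        size_t padded_n = ((n + local_size - 1) / local_size) * local_size;

        cl_mem buf = clCreateBuffer(context,
                                    CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    padded_n * sizeof(double), NULL, err);
        if (*err != CL_SUCCESS)
            return NULL;

        /* Blocking write of the real data; the padded tail still has to
         * be filled with values that are neutral for the kernel in use. */
        *err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                                    n * sizeof(double), host_data,
                                    0, NULL, NULL);
        return buf;
    }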
As for the strided case: packing and padding on the CPU prior to sending through PCIe sounds like a good idea to me.