Early returns and barriers inside the kernel

What is the recommended way to convert serial loops into NDRanges? The loop count is a natural starting value for global_size, but global_size must be a multiple of local_size. That leaves two options: either put a conditional statement inside the kernel testing global_id < loop count, or pad whatever array arguments (with appropriate values) before passing them to the kernel so that their length becomes a multiple of local_size. The second method removes the condition inside the kernel but takes time to resize array(s). So the first method seems better, but doesn't that preclude one from using a barrier(CLK_LOCAL_MEM_FENCE)?

Basically, is it possible to do something like:

if (get_global_id(0) >= n)
{
return;
}
else
{
…;
barrier(CLK_LOCAL_MEM_FENCE);
…;
}

I recall the OpenCL 1.1 spec saying that if any work-item in a work-group encounters a barrier, then all work-items in that work-group must also encounter that barrier. But if a work-item already returned, then it's not an executing work-item, right?

Your analysis of the situation is correct. The following is not valid:

if (get_global_id(0) >= n)
{
    return;
}
else
{
    ...;
    barrier(CLK_LOCAL_MEM_FENCE);
    ...;
}

A work-item that returns early is not excused: the barrier must be executed by every work-item of the work-group, so if some work-items return before it while others reach it, the behavior is undefined.

The second method removes the condition inside the kernel but takes time to resize array(s)

Wouldn’t it be possible to size the arrays properly from the beginning?

I agree doing it from the beginning would be much more efficient. However I’m trying to port a pre-existing library (GNU Scientific Library) to OpenCL so unfortunately I can’t assume anything about the input array sizes.

Another thing is that this particular function takes a stride for the array elements, which should be pre-coalesced (either host-side or over PCIe) if the calculations are to be done on a GPU. In that case I can temporarily pack and pad as desired. What's your opinion in this case?

I agree doing it from the beginning would be much more efficient. However I’m trying to port a pre-existing library (GNU Scientific Library) to OpenCL so unfortunately I can’t assume anything about the input array sizes.

I don't understand why that prevents you from padding all buffers at creation time to a size that is a multiple of the maximum work-group size supported by the device. Surely the CL buffer objects are not visible to the app. If the issue is that you are currently using CL_MEM_USE_HOST_PTR, what I would do is replace it with CL_MEM_ALLOC_HOST_PTR and then enqueue a write command.

Another thing is that this particular function takes a stride for the array elements, which should be pre-coalesced (either host-side or over PCIe) if the calculations are to be done on a GPU. In that case I can temporarily pack and pad as desired. What's your opinion in this case?

Packing and padding on the CPU prior to sending through PCIe sounds like a good idea to me.
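As a concrete illustration of that pre-coalescing step, here is a minimal host-side sketch (pack_strided is a hypothetical helper; the names and the choice of padding value are assumptions) that gathers a strided array into a contiguous buffer padded to a multiple of the work-group size:

```c
#include <stdlib.h>

/* Copy n strided doubles from src into a contiguous buffer whose
   length is rounded up to a multiple of local_size; the padding
   elements are set to pad_value. Caller frees the result.
   *out_len receives the padded length. */
double *pack_strided(const double *src, size_t n, size_t stride,
                     size_t local_size, double pad_value,
                     size_t *out_len)
{
    size_t padded = ((n + local_size - 1) / local_size) * local_size;
    double *dst = malloc(padded * sizeof *dst);
    if (!dst)
        return NULL;
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i * stride];   /* gather the strided elements */
    for (size_t i = n; i < padded; i++)
        dst[i] = pad_value;         /* pad to a work-group multiple */
    *out_len = padded;
    return dst;
}
```

The padded, contiguous buffer can then be written to the device in one transfer, and the kernel no longer needs a bounds check (provided pad_value is chosen so that padding elements are harmless to the computation).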

You could pad the buffers appropriately, and that solution works. If you decide you do not want to pad, then the code needs to be something like:

if (get_global_id(0) < n)
{
    // do work before the barrier
}

barrier(CLK_LOCAL_MEM_FENCE);

if (get_global_id(0) < n)
{
    // do work after the barrier
}

Basically the barrier needs to come out of the conditional and must be encountered by all work-items of the work-group.
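Putting it together, a guarded kernel with a uniform barrier might look like the following sketch (OpenCL C; the names input, output, scratch and the use of local memory are illustrative assumptions, not from this thread):

```
__kernel void guarded_example(__global const float *input,
                              __global float *output,
                              __local float *scratch,
                              const uint n)
{
    uint gid = get_global_id(0);
    uint lid = get_local_id(0);

    // Phase 1: only valid work-items read global memory, but every
    // work-item writes something to local memory.
    scratch[lid] = (gid < n) ? input[gid] : 0.0f;

    // The barrier sits outside any divergent conditional, so all
    // work-items of the work-group encounter it.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Phase 2: again guard only the global-memory access.
    if (gid < n)
        output[gid] = scratch[lid];
}
```

The out-of-range work-items do no useful work, but they still execute the barrier, which keeps the work-group's barrier semantics well defined.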