What is the recommended way to convert serial loops into NDRanges? The loop count is a starting value for global_size, but global_size must be a multiple of local_size. So either the kernel needs a conditional statement checking global_id < loop count, or the array argument(s) must be padded (with appropriate values) before being passed to the kernel so that their size becomes a multiple of local_size. The second method removes the condition inside the kernel but takes time to resize the array(s). So the first method seems better, but doesn't that preclude one from using a barrier(CLK_LOCAL_MEM_FENCE)?
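To make the first method concrete, here is a minimal sketch (the kernel name and arguments are placeholders). On the host, global_size is rounded up with ((n + local_size - 1) / local_size) * local_size, and the kernel masks off the padded tail:

    // Hypothetical element-wise kernel; global_size has been rounded up
    // to a multiple of local_size, so work-items past n are masked off.
    __kernel void scale(__global float *data, const uint n, const float factor)
    {
        size_t gid = get_global_id(0);
        if (gid < n)            // guard against the padded tail
            data[gid] *= factor;
    }

The guard itself is cheap; the worry is what happens once a barrier is involved.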
I recall the OpenCL 1.1 spec saying that if any work-item in a work-group encounters a barrier, then all work-items in that work-group must also encounter that barrier. But if a work-item has already returned, it's not an executing work-item, right?
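If I understand the spec correctly, the safe alternative to an early return is to keep every work-item alive and predicate only the memory accesses, so the whole work-group reaches the barrier. A sketch (the kernel is hypothetical):

    // Hypothetical local-memory staging kernel. Work-items past n stay
    // alive so that every work-item in the group reaches the barrier;
    // only the loads are predicated.
    __kernel void stage(__global const float *in,
                        __local  float       *tile,
                        const uint            n)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        // Padded work-items write a value that is neutral for the kernel.
        tile[lid] = (gid < n) ? in[gid] : 0.0f;

        // No early return before this point, so the barrier is safe.
        barrier(CLK_LOCAL_MEM_FENCE);

        // ... rest of the work would go here ...
    }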
I agree doing it from the beginning would be much more efficient. However, I'm trying to port a pre-existing library (the GNU Scientific Library) to OpenCL, so unfortunately I can't assume anything about the input array sizes.
Another thing is that the particular function takes a stride for the array elements; if the calculations are to be done on a GPU (whether host-side or attached over PCIe), the strided elements should be pre-coalesced. In that case I can temporarily pack and pad as desired. What's your opinion in this case?
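To be concrete, the pack-and-pad step I have in mind looks roughly like this (the helper is made up, and n, stride and local_size are placeholders; GSL stores element i of a strided vector at data[i * stride]):

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper: gather a strided vector into a contiguous
     * buffer whose length is padded to a multiple of local_size.
     * The caller frees the result. */
    static double *pack_and_pad(const double *data, size_t n, size_t stride,
                                size_t local_size, size_t *padded_n)
    {
        *padded_n = ((n + local_size - 1) / local_size) * local_size;
        double *packed = malloc(*padded_n * sizeof *packed);
        if (!packed)
            return NULL;

        for (size_t i = 0; i < n; ++i)
            packed[i] = data[i * stride];  /* coalesce the strided elements */

        /* Fill the tail with zeros, or whatever is neutral for the kernel. */
        memset(packed + n, 0, (*padded_n - n) * sizeof *packed);
        return packed;
    }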
I don't understand why porting an existing library prevents you from padding all buffers at creation time to a size that is a multiple of the maximum work-group size supported by the device. Surely the CL buffer objects are not visible to the application. If the issue is that you are currently using CL_MEM_USE_HOST_PTR, what I would do is replace it with CL_MEM_ALLOC_HOST_PTR and then enqueue a write command.
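As a sketch of what I mean (error handling abbreviated; context, queue and local_size come from the usual setup code):

    #include <CL/cl.h>

    /* Hypothetical helper: create a buffer padded to a multiple of the
     * work-group size, then copy in the real data with a write command. */
    static cl_mem create_padded_buffer(cl_context context,
                                       cl_command_queue queue,
                                       const double *host_data, size_t n,
                                       size_t local_size, cl_int *err)
    {
        size_t padded_n = ((n + local_size - 1) / local_size) * local_size;

        cl_mem buf = clCreateBuffer(context,
                                    CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    padded_n * sizeof(double), NULL, err);
        if (*err != CL_SUCCESS)
            return NULL;

        /* Blocking write of the real data; the padded tail still has to
         * be filled with values that are neutral for the kernel in use. */
        *err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                                    n * sizeof(double), host_data,
                                    0, NULL, NULL);
        return buf;
    }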
As for the strided case: packing and padding on the CPU prior to sending through PCIe sounds like a good idea to me.