Hello. I’m porting a fairly simple CUDA kernel to OpenCL, but I’m struggling to get the indexing right. I have a 4D array; call it A(isize,jsize,ksize,msize). In CUDA I can launch a ksize x msize grid of isize x jsize blocks, which results in:
i = threadIdx.x
j = threadIdx.y
k = blockIdx.x
m = blockIdx.y
I can then calculate the index from that. I’d like to convert the same thing to OpenCL. Since OpenCL defines the global and local work sizes, I expected that I could do something like this:
globalsize[0] = ksize * isize
globalsize[1] = msize * jsize
localsize[0] = isize
localsize[1] = jsize
and pass 2 as the number of dimensions (work_dim) for globalsize and localsize, then in the kernel:
i = get_local_id(0)
j = get_local_id(1)
k = get_group_id(0)
m = get_group_id(1)
If I then calculate my index in the same way and just do something simple, like set every element (i,j,k,m) = i, the CUDA code and OpenCL code don’t match. Am I doing something really dumb here? Is there a simpler way to migrate from the CUDA grid/block to OpenCL global/local work size?
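To make the question concrete, the mapping can be emulated on the host in plain C. This is only a sketch under stated assumptions: the linear layout (i varying fastest, then j, k, m) and the helper names flat_index and emulate_ndrange are hypothetical, not taken from the actual kernels. It relies on the OpenCL relation get_global_id(d) = get_group_id(d) * get_local_size(d) + get_local_id(d), which holds when no global offset is used.

```c
#include <stddef.h>

/* Hypothetical layout (an assumption): i varies fastest, then j, k, m. */
static size_t flat_index(size_t i, size_t j, size_t k, size_t m,
                         size_t isize, size_t jsize, size_t ksize)
{
    return i + isize * (j + jsize * (k + ksize * m));
}

/* Emulates a 2D NDRange with
 *   globalsize = { ksize * isize, msize * jsize }
 *   localsize  = { isize, jsize }
 * and stores i into A(i,j,k,m) for every work-item.
 * Returns the number of work-items that ran. */
static size_t emulate_ndrange(int *A, size_t isize, size_t jsize,
                              size_t ksize, size_t msize)
{
    const size_t global[2] = { ksize * isize, msize * jsize };
    const size_t local[2]  = { isize, jsize };
    size_t count = 0;

    for (size_t gy = 0; gy < global[1]; ++gy) {
        for (size_t gx = 0; gx < global[0]; ++gx) {
            size_t i = gx % local[0];  /* get_local_id(0) */
            size_t j = gy % local[1];  /* get_local_id(1) */
            size_t k = gx / local[0];  /* get_group_id(0) */
            size_t m = gy / local[1];  /* get_group_id(1) */

            A[flat_index(i, j, k, m, isize, jsize, ksize)] = (int)i;
            ++count;
        }
    }
    return count;
}
```

Calling emulate_ndrange on an int array of isize * jsize * ksize * msize elements visits each (i,j,k,m) exactly once, which is the same set of indices the CUDA launch produces. If that holds on the device as well, the mismatch more likely comes from the index calculation or the host-side setup than from the global/local sizes themselves.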
Thanks!