Making kernels aware of a large number of buffers

Hi all,

I posted this question on StackOverflow, but while researching the answer I saw someone there recommend this forum as the best place to get OpenCL questions answered. So hopefully you can give me some guidance.

In my OpenCL program, I am going to end up with 60+ global memory buffers that every kernel needs to be able to access. What’s the recommended way to let each kernel know the location of each of these buffers?

The buffers themselves are stable throughout the life of the application: we allocate them at application start, call multiple kernels multiple times, and only deallocate them at application end. Their contents, however, may change as the kernels read from and write to them.

In CUDA, the way I did this was to create 60+ program-scope global variables in my CUDA code. On the host, I would then write the addresses of the device buffers I had allocated into these global variables, and the kernels would simply use them to find the buffers they needed to work with.

What’s the recommended way to do this in OpenCL? It seems that OpenCL’s global variables are a bit different from CUDA’s, but I can’t find a clear answer on whether my CUDA method will work and, if so, how to transfer the buffer pointers into global variables. If it won’t work, what’s the best alternative?

Thanks!

clSetKernelArg

So each kernel should have 60+ arguments to it?

Even if the runtime allows it, won’t this mean that each thread now has (32 bits × 60) = 1920 bits = 240 bytes of local memory just for arguments? And these values are the same for every kernel, on every run.

There has to be a better way…

You do realise that using global variables in CUDA probably amounts to the same thing; it’s just a different syntax.

If the arrays share a data type, pass a single buffer and use offsets to index into each array. If they are also the same size, treat them as one (n+1)-dimensional array.

OpenCL buffers are quite different from global memory allocated with cudaMalloc(): there is no 1:1 mapping between an OpenCL buffer and a specific global memory area on the device. In theory, the OpenCL runtime could unmap and remap a buffer to different global memory addresses between kernel calls (although I doubt this happens in practice). So you cannot reliably do what you can do in CUDA, i.e. store the global memory addresses returned by cudaMalloc() in some constant space and stop passing them as kernel parameters. You will have to pass each buffer as a separate kernel argument to each kernel.