Overhead of passing the same buffer to different kernels

Hi, I’m a beginner in OpenCL and I have a (maybe) naive question. In my use case I have two kernels that are enqueued sequentially; both of them need the same buffer object as an input argument (amongst other arguments), and I’m worried about the overhead of transferring it to the GPU. Currently I do something like this:


// Create cl::Buffer objects
// Then:
kernel1.setArg(0, thatBuffer);
kernel1.setArg(1, bufferForKernel1);
queue.enqueueNDRangeKernel(kernel1, ...);
kernel2.setArg(0, thatBuffer);
kernel2.setArg(1, bufferForKernel2);
queue.enqueueNDRangeKernel(kernel2, ...);

My hope is that, done this way, thatBuffer is transferred to the GPU only once and consumed by both kernels, but I fear that the two setArg calls might each trigger a data transfer to the GPU. If so, how can I optimize the data transfer to avoid unnecessary overhead?
Thanks.

Buffers and Images (and other cl_mem objects on newer versions of OpenCL) passed to kernels are just handles to the memory object. Setting them as kernel arguments is therefore very fast, and you can use the same Buffer in multiple kernels with no extra transfer overhead.

The actual copying of the contents of the buffer happens with the clEnqueueWriteBuffer / clEnqueueReadBuffer / clEnqueueWriteImage / clEnqueueReadImage / clEnqueueMapBuffer / clEnqueueMapImage / clEnqueueUnmapMemObject commands. You want to minimize those.

If you create and write your buffer once, then run a series of kernels on it, it stays resident on the GPU for all of those kernel executions; only at the end do you copy the buffer with the results back.
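For example, something along these lines (just a sketch using the C++ wrapper; the context, queue, kernels, n, and host vectors are placeholder names, not from your post):

// Host-side data (sketch only; context, queue, kernel1, kernel2 and n are assumed to exist)
std::vector<float> hostData(n), hostResult(n);

cl::Buffer thatBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * n);
cl::Buffer resultBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * n);

// Host -> GPU transfer happens here, once
queue.enqueueWriteBuffer(thatBuffer, CL_TRUE, 0, sizeof(float) * n, hostData.data());

kernel1.setArg(0, thatBuffer);   // just passes a handle, no copy
kernel1.setArg(1, resultBuffer);
queue.enqueueNDRangeKernel(kernel1, cl::NullRange, cl::NDRange(n), cl::NullRange);

kernel2.setArg(0, thatBuffer);   // same handle again, still no copy
kernel2.setArg(1, resultBuffer);
queue.enqueueNDRangeKernel(kernel2, cl::NullRange, cl::NDRange(n), cl::NullRange);

// GPU -> host transfer happens here, once, after both kernels have run
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, sizeof(float) * n, hostResult.data());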

Thanks Dithermaster. But now I’m a bit confused: I don’t explicitly enqueue any buffer copy to the GPU. I simply create cl::Buffer objects (using the C++ wrapper API), then set kernel arguments with cl::Kernel::setArg, then enqueue the first kernel for execution with cl::Queue::enqueueNDRangeKernel. It actually works, so the copy from host memory to the GPU must happen somewhere under the hood, but it’s not clear to me when and where. That’s why I asked the question. Your answer makes sense, but it doesn’t quite fit my use case…

The initial data transfer could be part of buffer creation, depending on how you created your buffer. If you created it using the CL_MEM_USE_HOST_PTR flag, then on some architectures, you may not even have a copy at all at any point. On other architectures, or if you create your buffer with CL_MEM_COPY_HOST_PTR, you would have an initial copy in the context of the clCreateBuffer call.
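For illustration, a rough sketch of the two creation styles (context, n and hostData are placeholder names, assumed to be set up elsewhere):

// The copy happens inside buffer creation (clCreateBuffer with CL_MEM_COPY_HOST_PTR)
cl::Buffer copiedBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                        sizeof(float) * n, hostData.data());

// With CL_MEM_USE_HOST_PTR the runtime may use your host allocation directly,
// so on some architectures there may be no copy at all
cl::Buffer zeroCopyBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                          sizeof(float) * n, hostData.data());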

But typically, buffers used in two consecutive kernels don’t require additional copies, as Dithermaster explained.

Thanks to everybody, it is clear now.