Device-host memory communication

I’m writing a program that executes a dozen kernels in a command queue, and all of the kernels use the same input/output buffers. Of course I would like all of these computations to be performed in device memory, and I only want to access the computed results in host code after all kernels have finished.
How do I allocate the memory this way and still read it from the host afterwards?

Thank you

OpenCL should do this automatically. Simply create your cl_mem objects, write in the initial data, and then enqueue your kernels in the order you want them executed. The runtime will try to keep the data on the device for as long as possible. As long as all the data fits in device memory, you should get the best performance. If, for example, the data for kernel A fits all at once, but kernel B requires other data that does not fit alongside kernel A’s data, then the runtime will have to page data on and off the device.

My advice is to allocate your memory objects without CL_MEM_USE_HOST_PTR (that flag may incur extra work to keep the host pointer synchronized) and then just enqueue your kernels. As long as you don’t call clEnqueueReadBuffer or clEnqueueWriteBuffer in between, the data should stay on the card. Once all kernels have finished, a single clEnqueueReadBuffer copies the results back to the host. Make sure, however, that if your command queue is out-of-order, you use events to enforce the order in which your kernels execute. (If it’s in-order, you can just enqueue them in the order you want.)

Thank you for the clarification!