I'm writing a program that executes dozens of kernels in a command queue, and all of the kernels use the same input/output buffers. Of course, I would like all of these computations to be performed in device memory, and I only want to access the computed results from host code after all the kernels have finished.
How do I allocate the memory this way and still read it from the host afterwards?

Thank you