Questions about the usage of clEnqueueMapBuffer and CL_MEM_USE_HOST_PTR

hi All
What I’m trying to do is use OpenCL to do some pre-processing work on a picture on the GPU and then send the processed image to a video analytics module on the CPU.
I’m trying to avoid any kind of memory copy for performance reasons.

What I can think of is to allocate some memory on the CPU side and create a GPU cl_mem with CL_MEM_USE_HOST_PTR.
Before the CPU can use that part of memory, I need to call clEnqueueMapBuffer() first.
But I don’t know when the CPU side will finish its processing, so when should I call clReleaseMemObject()?
I don’t want to block there or add a callback function (boring :().
Since the processed data is already in system memory, is there any function that can just tell OpenCL to “release” that part of memory?

If I call clReleaseMemObject(), the data in the corresponding system memory becomes invalid.
If I never call clReleaseMemObject(), I guess there will be a lot of memory leaks.
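
Here is roughly the flow I have in mind (just a sketch; the names are made up, <CL/cl.h> is assumed to be included, and error checking is omitted):

[CODE]
cl_int err;
void *host_ptr = malloc(image_size);            /* CPU-side memory I allocate myself */

cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            image_size, host_ptr, &err);

/* ... enqueue the GPU pre-processing kernel(s) on buf ... */

/* Map before the CPU-side analytics touch host_ptr. */
void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                             0, image_size, 0, NULL, NULL, &err);

/* The analytics module now reads host_ptr for an unknown amount of time,
 * so when is it safe to call clReleaseMemObject(buf)? */
[/CODE]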

Thanks

In your scenario, you can use clEnqueueReadBuffer() with blocking_read=true and ptr set to the host memory pointer.
This will synchronize the (host) buffer with the GPU cache. You can then release the OpenCL memory object.
The user-allocated buffer is still valid and contains the result of the GPU computation.

If you call clEnqueueMapBuffer (with blocking_map == CL_TRUE), then immediately call clEnqueueUnmapMemObject and clReleaseMemObject, that should leave you with valid data in system memory. Does this sequence not work for you? It might be better than calling clEnqueueReadBuffer, because on many platforms clEnqueueMapBuffer on a buffer allocated with CL_MEM_USE_HOST_PTR will not perform any copies, whereas clEnqueueReadBuffer will always produce a copy.
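
Something like this (just a sketch; queue, size and err are assumed, and buf is the cl_mem created with CL_MEM_USE_HOST_PTR over your host_ptr):

[CODE]
void *mapped = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                  0, size, 0, NULL, NULL, &err);
/* After the blocking map the results are visible through host_ptr
 * (for CL_MEM_USE_HOST_PTR the mapped pointer is derived from host_ptr). */
clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);
clFinish(queue);            /* make sure the unmap has completed */
clReleaseMemObject(buf);    /* host_ptr still holds the processed data */
[/CODE]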

[QUOTE=utnapishtim;30601]In your scenario, you can use clEnqueueReadBuffer() with blocking_read=true and ptr set to the host memory pointer.
This will synchronize the (host) buffer with the GPU cache. You can then release the OpenCL memory object.
The user-allocated buffer is still valid and contains the result of the GPU computation.[/QUOTE]

I think this will involve an extra memory copy (copy 1 from cl_mem_in to cl_mem_out, copy 2 from cl_mem_out to the user-allocated host buffer).
Since I’m processing a decoded video sequence, the cost of such memory copies is very high, so I want to reduce copying as much as possible.

Oh, I haven’t tried this sequence yet. Is it a valid sequence on all platforms, or is it platform-dependent?

[QUOTE=lance0010;30604]I think this will involve an extra memory copy (copy 1 from cl_mem_in to cl_mem_out, copy 2 from cl_mem_out to the user-allocated host buffer).
Since I’m processing a decoded video sequence, the cost of such memory copies is very high, so I want to reduce copying as much as possible.[/QUOTE]

Calling clEnqueueReadBuffer() with the pointer used to create the host-allocated buffer won’t make any redundant copy; it will only synchronize memory between GPU and CPU if needed (hence the special requirements for this kind of call described in the note to §5.2.2 of the OpenCL specs).
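
For example (a sketch; ctx, queue, host_ptr, size and err are assumed to exist):

[CODE]
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            size, host_ptr, &err);

/* ... enqueue the kernel(s) that write into buf ... */

/* Blocking read back into host_ptr itself: on implementations that keep
 * the buffer in host memory this only synchronizes caches, no extra copy. */
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host_ptr, 0, NULL, NULL);

clReleaseMemObject(buf);    /* host_ptr keeps the result */
[/CODE]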

Is the claim that “calling clEnqueueReadBuffer() with the pointer used to create the host-allocated buffer won’t make any redundant copy but will only synchronize memory between GPU and CPU if needed” defined explicitly somewhere in the spec? I agree with you on this, but it would be better if it were stated or at least hinted at somewhere in the spec.

Honestly, if your buffer is to be accessed by a GPU kernel, you shouldn’t use a host buffer and expect that all transfers will be magically optimized.
If you need a buffer for your GPU kernel, then create a device-allocated buffer and handle the transfers between host and device manually. That way you have full control over what happens.

For instance, on an AMD device, if the alignment of your memory pointer is correct, a CL_MEM_USE_HOST_PTR buffer may be pre-pinned, giving fast transfers between host and device. However, once it is used by a device kernel, this buffer is no longer pre-pinned and the transfer from device to host will be slow (as will all subsequent transfers).
On NVIDIA devices, AFAIK only CL_MEM_ALLOC_HOST_PTR buffers are pinned. CL_MEM_USE_HOST_PTR buffers always take slow transfer paths (this may have changed, but once more, it depends on the alignment of the pointer and the device being used).

Because most modern GPUs can execute a kernel and copy memory between host and device at the same time, if the computation time of your kernel and the memory transfer time are of the same order of magnitude, you could split the computation in two parts: transfer the data for the second half while the first half is being computed, and transfer the result of the first half while the second half is being computed.
This requires two command queues and careful synchronization though.
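
A rough sketch of that idea (illustrative names; in-place processing of each half, no error checking; ctx, dev, kernel, host_in, host_out and frame_bytes are assumed):

[CODE]
cl_int err;
size_t half = frame_bytes / 2;

cl_command_queue q[2];
cl_mem dev_buf[2];
q[0] = clCreateCommandQueue(ctx, dev, 0, &err);
q[1] = clCreateCommandQueue(ctx, dev, 0, &err);

for (int i = 0; i < 2; ++i) {
    dev_buf[i] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, half, NULL, &err);

    /* Each in-order queue serializes write -> kernel -> read for its half,
     * but the two queues can overlap: while queue 0 computes the first half,
     * queue 1 uploads the second half, and so on. */
    clEnqueueWriteBuffer(q[i], dev_buf[i], CL_FALSE, 0, half,
                         (char *)host_in + i * half, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_buf[i]);
    size_t gsize = half;                       /* illustrative work size */
    clEnqueueNDRangeKernel(q[i], kernel, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q[i], dev_buf[i], CL_FALSE, 0, half,
                        (char *)host_out + i * half, 0, NULL, NULL);
}

clFinish(q[0]);
clFinish(q[1]);
clReleaseMemObject(dev_buf[0]);
clReleaseMemObject(dev_buf[1]);
[/CODE]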

[QUOTE=utnapishtim;30610]Honestly, if your buffer is to be accessed by a GPU kernel, you shouldn’t use a host buffer and expect that all transfers will be magically optimized.
If you need a buffer for your GPU kernel, then create a device-allocated buffer and handle the transfers between host and device manually. That way you have full control over what happens.

For instance, on an AMD device, if the alignment of your memory pointer is correct, a CL_MEM_USE_HOST_PTR buffer may be pre-pinned, giving fast transfers between host and device. However, once it is used by a device kernel, this buffer is no longer pre-pinned and the transfer from device to host will be slow (as will all subsequent transfers).
On NVIDIA devices, AFAIK only CL_MEM_ALLOC_HOST_PTR buffers are pinned. CL_MEM_USE_HOST_PTR buffers always take slow transfer paths (this may have changed, but once more, it depends on the alignment of the pointer and the device being used).

Because most modern GPUs can execute a kernel and copy memory between host and device at the same time, if the computation time of your kernel and the memory transfer time are of the same order of magnitude, you could split the computation in two parts: transfer the data for the second half while the first half is being computed, and transfer the result of the first half while the second half is being computed.
This requires two command queues and careful synchronization though.[/QUOTE]

The reason I want to use CL_MEM_USE_HOST_PTR is that it seems to involve fewer memory copies, but it may be slower than CL_MEM_ALLOC_HOST_PTR plus an explicit copy (depending on the implementation), right?

To put it simply, a kernel should never access a host memory buffer.

In that case, the OpenCL implementation will either:

  • make a copy of the buffer between host and device before the kernel starts
  • or (if supported) give the device direct access to host memory through the PCIe interconnect, at roughly 10x slower speed than a full copy.

The main benefit of host memory buffers is to allow memory transfers at full speed between host and device with clEnqueueCopyBuffer(), while keeping the CPU and GPU free to work during the transfer thanks to DMA.

Pinning memory (with CL_MEM_USE_HOST_PTR) or allocating pinned memory (with CL_MEM_ALLOC_HOST_PTR) is a slow process (as is unpinning and deallocating pinned memory), so using a host memory buffer is worthwhile only if you intend to use it many times (lots of kernel launches).

If you only intend to create a host buffer, call a kernel and delete the buffer immediately afterwards, you’d better create a device buffer, fill it with clEnqueueWriteBuffer(), call the kernel, and get the result with clEnqueueReadBuffer(). You will save a lot of overhead.
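
For example (a sketch; ctx, queue, kernel, host_in, host_out, size and work_items are assumed, and error checking is omitted):

[CODE]
cl_int err;
cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

/* Explicit upload, kernel launch, explicit download. */
clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0, size, host_in, 0, NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_buf);
size_t gsize = work_items;                     /* illustrative work size */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, dev_buf, CL_TRUE, 0, size, host_out, 0, NULL, NULL);

clReleaseMemObject(dev_buf);
[/CODE]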

[QUOTE=utnapishtim;30613]To put it simply, a kernel should never access a host memory buffer.

In that case, the OpenCL implementation will either:

  • make a copy of the buffer between host and device before the kernel starts
  • or (if supported) give the device direct access to host memory through the PCIe interconnect, at roughly 10x slower speed than a full copy.

The main benefit of host memory buffers is to allow memory transfers at full speed between host and device with clEnqueueCopyBuffer(), while keeping the CPU and GPU free to work during the transfer thanks to DMA.

Pinning memory (with CL_MEM_USE_HOST_PTR) or allocating pinned memory (with CL_MEM_ALLOC_HOST_PTR) is a slow process (as is unpinning and deallocating pinned memory), so using a host memory buffer is worthwhile only if you intend to use it many times (lots of kernel launches).

If you only intend to create a host buffer, call a kernel and delete the buffer immediately afterwards, you’d better create a device buffer, fill it with clEnqueueWriteBuffer(), call the kernel, and get the result with clEnqueueReadBuffer(). You will save a lot of overhead.[/QUOTE]

Regarding the possible implementations, is there a third option here? As far as I know, Intel CPUs and GPUs are on the same die and share the last-level cache, so they can read/write through the cache and flush it to memory when needed (maybe with clEnqueueMapBuffer?). In that case, wouldn’t mapping be more efficient?

Intel CPUs and GPUs share physical memory, so mapping a buffer is very efficient if the following conditions are fulfilled:

  • The buffer is created with CL_MEM_ALLOC_HOST_PTR, or with CL_MEM_USE_HOST_PTR and a pointer aligned to 4 KB
  • Use buffers instead of images (images cannot be mapped efficiently)
  • Use clEnqueueMapBuffer() and clEnqueueUnmapMemObject() instead of clEnqueueReadBuffer() and clEnqueueWriteBuffer()
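
For example, the map/unmap pattern would look like this (a sketch; ctx, queue, size and err are assumed):

[CODE]
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                            size, NULL, &err);

/* Map to fill the input on the CPU; no copy is expected on shared memory. */
void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                             0, size, 0, NULL, NULL, &err);
/* ... write input data into p ... */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

/* ... run the kernel(s) on buf ... */

/* Map again to read the results on the CPU. */
p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                       0, size, 0, NULL, NULL, &err);
/* ... consume the results from p ... */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
[/CODE]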

[QUOTE=utnapishtim;30616]Intel CPUs and GPUs share physical memory, so mapping a buffer is very efficient if the following conditions are fulfilled:

  • The buffer is created with CL_MEM_ALLOC_HOST_PTR, or with CL_MEM_USE_HOST_PTR and a pointer aligned to 4 KB
  • Use buffers instead of images (images cannot be mapped efficiently)
  • Use clEnqueueMapBuffer() and clEnqueueUnmapMemObject() instead of clEnqueueReadBuffer() and clEnqueueWriteBuffer()[/QUOTE]

I’ll do some verification on Intel CPUs.
Thanks, utnapishtim.