Clarify CL_MEM_USE_HOST_PTR

Note: This is based on my own understanding, which may be wrong! Please correct me if I am.

The specification is unclear on CL_MEM_USE_HOST_PTR concerning when the host memory contains valid values. The specification should probably state in the definition of the flag that it is only guaranteed to be valid after a clEnqueueMapBuffer has been performed. Everything else is implementation dependent. I’ll explain my problems with the current definition in more detail below.

Table 5.3 states that a property of CL_MEM_USE_HOST_PTR is:

OpenCL implementations are allowed to cache the buffer
contents pointed to by host_ptr in device memory. This
cached copy can be used when kernels are executed on a
device.

I had an argument with a co-worker who thought that this implies that whenever a device is not executing a kernel using the buffer, the host memory is guaranteed to be valid. The specification does not say this; it says nothing about the coherency of this caching behavior.

In fact, the only statement on the validity of the memory is given on page 74:

If the buffer object is created with CL_MEM_USE_HOST_PTR set in mem_flags, the following
will be true:
The host_ptr specified in clCreateBuffer is guaranteed to contain the latest bits in the
region being mapped when the clEnqueueMapBuffer command has completed.

Even though the definition of CL_MEM_USE_HOST_PTR never states that the host memory could be invalid, this passage implies that it is only valid after the buffer has been mapped. I understand that a CPU implementation might actually keep the host memory valid at all times, but surely the standard could say clearly what the guaranteed behavior is.
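To make the concern concrete, here is a toy simulation in plain C of one behavior the quoted text appears to permit: an implementation that caches the buffer in device memory and only writes it back when the buffer is mapped. Nothing here is an OpenCL call. sim_buffer and the sim_* functions are hypothetical stand-ins invented for this sketch, and a real implementation may behave differently (a CPU device might never make a copy at all).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_MAX 16

/* Hypothetical stand-in for a CL_MEM_USE_HOST_PTR buffer on an
   implementation that caches the contents in device memory. */
typedef struct {
    float *host_ptr;              /* the application's memory */
    float  device_copy[SIM_MAX];  /* the implementation's cached copy */
    size_t count;
} sim_buffer;

/* Stands in for clCreateBuffer: the implementation caches host_ptr. */
void sim_create(sim_buffer *b, float *host_ptr, size_t count)
{
    assert(count <= SIM_MAX);
    b->host_ptr = host_ptr;
    b->count = count;
    memcpy(b->device_copy, host_ptr, count * sizeof(float));
}

/* Stands in for a kernel: doubles every element of the cached copy.
   The host memory is not touched. */
void sim_run_kernel_double(sim_buffer *b)
{
    for (size_t i = 0; i < b->count; ++i)
        b->device_copy[i] *= 2.0f;
}

/* Stands in for a completed clEnqueueMapBuffer: the latest bits are
   written back, and the returned pointer is host_ptr itself. */
float *sim_map(sim_buffer *b)
{
    memcpy(b->host_ptr, b->device_copy, b->count * sizeof(float));
    return b->host_ptr;
}

/* Stands in for a completed clEnqueueUnmapMemObject: host writes made
   while mapped become visible to the device copy. */
void sim_unmap(sim_buffer *b)
{
    memcpy(b->device_copy, b->host_ptr, b->count * sizeof(float));
}
```

In this model, after sim_create and sim_run_kernel_double the array behind host_ptr still holds the old values even though no kernel is running; only sim_map brings it up to date. That is exactly the situation the co-worker's reading would not expect.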

Here is the lifecycle of buffer memory objects, and the access-cycle between host and device as I see it.

Prior to calling clCreateBuffer, the application running on the host allocates (typically using malloc or calloc) and initializes some host memory. At this time the application can freely read and write this host memory, and it contains valid values because the application set them. No device is aware of this memory yet, so no device has valid values.

After calling clCreateBuffer, the application should no longer access the host memory pointed to by the host pointer. The application should assume that the host memory no longer contains valid values. In other words, the clCreateBuffer call has turned over access from the host to the device. As you stated, some OpenCL implementations may actually copy the host memory pointed to by the host pointer into device memory. This is done so that a kernel running on the device can have high-speed access to it. The handover holds whether or not a kernel is executing: the device now has the valid values and the host does not.

After calling clEnqueueMapBuffer and waiting for it to complete (either by making the call blocking or by waiting on its event), the application can access the host memory through the pointer returned by clEnqueueMapBuffer. As you quoted, the host memory pointed to by the host pointer specified in clCreateBuffer is now guaranteed to contain valid values. Conversely, you should also assume that at this time only the host has valid values, and the device does not.

After calling clEnqueueUnmapMemObject and waiting for it to complete (by waiting on its event), the application once again loses access to the host memory, and access reverts to the device. The application should assume that the host memory no longer contains valid values. In other words, the device now has the valid values and the host does not.

After calling clReleaseMemObject, the application no longer needs access to the buffer memory object and, in turn, its host memory. This implies that the host no longer needs valid values. When the last enqueued kernel’s execution completes on all devices that use this buffer memory object, the buffer memory object is released by the OpenCL implementation’s runtime. This implies that the device no longer needs valid values. As part of this, the memory object’s destructor callback function is called (if one was registered with clSetMemObjectDestructorCallback). This function can then deallocate the host memory (typically using free).
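The lifecycle above can be sketched as a small state machine. This is only a plain-C model of the ownership rules as described in this post, not OpenCL code: buf_state, buf_event, next_state and host_contents_valid are names made up for illustration, and the comments note which real API call each event stands for.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of who holds the valid values at each stage. */
typedef enum {
    ST_HOST_ONLY,    /* before clCreateBuffer: only the host knows the memory */
    ST_DEVICE_OWNS,  /* after create, or after unmap: assume host memory is stale */
    ST_HOST_MAPPED,  /* map has completed: host memory holds the latest bits */
    ST_RELEASED      /* buffer released: nobody relies on the contents any more */
} buf_state;

/* The calls from the walkthrough, reduced to events. */
typedef enum {
    EV_CREATE_BUFFER,   /* clCreateBuffer with CL_MEM_USE_HOST_PTR */
    EV_MAP_COMPLETE,    /* clEnqueueMapBuffer has completed */
    EV_UNMAP_COMPLETE,  /* clEnqueueUnmapMemObject has completed */
    EV_RELEASED         /* clReleaseMemObject, runtime has released the object */
} buf_event;

/* Returns the state after ev. The asserts reject transitions that the
   walkthrough does not describe (for example, unmapping before mapping). */
buf_state next_state(buf_state cur, buf_event ev)
{
    switch (ev) {
    case EV_CREATE_BUFFER:
        assert(cur == ST_HOST_ONLY);
        return ST_DEVICE_OWNS;
    case EV_MAP_COMPLETE:
        assert(cur == ST_DEVICE_OWNS);
        return ST_HOST_MAPPED;
    case EV_UNMAP_COMPLETE:
        assert(cur == ST_HOST_MAPPED);
        return ST_DEVICE_OWNS;
    case EV_RELEASED:
        /* The walkthrough releases the buffer after it has been unmapped. */
        assert(cur == ST_DEVICE_OWNS);
        return ST_RELEASED;
    }
    return cur;
}

/* True only in the states where the host may rely on the contents. */
bool host_contents_valid(buf_state s)
{
    return s == ST_HOST_ONLY || s == ST_HOST_MAPPED;
}
```

Walking the events in order (create, map, unmap, release), host_contents_valid is true only before the create and between the map and the unmap, which is the whole point of the walkthrough.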

The specification should probably state in the definition of the flag that it is only guaranteed to be valid after a clEnqueueMapBuffer has been performed. Everything else is implementation dependent.

You got it partially right. The key phrase to search for is “synchronization point”, and the key section of the spec to read is 3.3.1 “Memory Consistency”:

Memory consistency for memory objects shared between enqueued commands is enforced at a synchronization point.

Synchronization points are: events in a wait list, clEnqueueMapBuffer, clEnqueueMapImage, clEnqueueBarrier and clFinish. I think that the list is complete; search the spec for the phrase “synchronization point” to verify I didn’t miss anything.

I had an argument with a co-worker who thought that this implies that whenever a device is not executing a kernel using the buffer, the host memory is guaranteed to be valid.

Unfortunately your coworker is not right.

Brian’s explanation above is a great way to think about this in less formal terms.

Thanks for clarifying the topic. I understood the subject well enough from the current standard to avoid undefined behavior, but I think that the standard could definitely state this more clearly. A single line could be added to the definition of CL_MEM_USE_HOST_PTR which says:

“Once the buffer is created, the memory referenced by host_ptr is only guaranteed to contain valid values while the buffer is mapped using clEnqueueMapBuffer.”

At least this is now discussed on the OpenCL forums, and hopefully the people concerned will find it.

Memory consistency for memory objects shared between enqueued commands is enforced at a synchronization point.

Correct me if I’m wrong, but I think that the “Memory consistency” discussed in section 3.3.1 only concerns buffers from a work-item’s perspective, not the host program trying to access CL_MEM_USE_HOST_PTR memory. It would be mad for a single clEnqueueMapBuffer synchronization point to cause all CL_MEM_USE_HOST_PTR buffers to be copied back to host memory.

In summary, the only way to guarantee the host memory is valid is to use clEnqueueMapBuffer.

I know this is an old thread, but (a) it’s a fantastic explanation for anybody who’s confused about how this works, and (b) I have a question about it. Does this apply only to buffers, or to all device memory? That is, if I am going to create a 2D image object, then create the image data, and then load that data into the image, should I map/unmap before/after I copy the data to account for any implementation-specific behavior?

Thanks in advance,
Spencer

Images should follow the same approach as described above for buffers.