Are the nvidia "best practices" violating the spec?

I am very confused about the conflicting descriptions on creating/mapping/transfering of memory buffers.

From reading the Khronos spec, I see two main ways of doing it. First is without mapping:

init()
{
    mem_object = clCreateBuffer(...);
    host_ptr = malloc(...); /* or just an array */
}

loop()
{
    (write stuff to host_ptr buffer)
    clEnqueueWriteBuffer(..., mem_object, ..., host_ptr, ...);
    clEnqueueNDRangeKernel(...);
}

Second is with mapping. From what I understand, the spec dictates that I unmap after using the host pointer, so for a loop I would map, write (or read), unmap, execute kernel, and then start over from the beginning:

init()
{
    mem_object = clCreateBuffer(...);
}

loop()
{
    host_ptr = clEnqueueMapBuffer(..., mem_object, ...);
    (write stuff to host_ptr buffer)
    clEnqueueUnmapMemObject(..., mem_object, host_ptr, ...);
    clEnqueueNDRangeKernel(...);
}

I’m with you this far. But when reading nvidia’s Best Practices, section 3.1.1 about pinned memory, I’m given a third variant that is a mix of the two:

init()
{
    mem_object = clCreateBuffer(..., CL_MEM_ALLOC_HOST_PTR, ...);
    host_ptr = clEnqueueMapBuffer(..., mem_object, ...);
}

loop()
{
    (write stuff to host_ptr buffer)
    clEnqueueWriteBuffer(..., mem_object, ..., host_ptr, ...);
    clEnqueueNDRangeKernel(...);
}

No matter how I read the spec, I can only end up with the conclusion they violate it here. They map the buffer and leave it mapped. And while mapped they do clEnqueueWriteBuffer(). Citing the spec for clEnqueueMapBuffer: “The contents of the regions of a memory object mapped for writing (i.e. CL_MAP_WRITE is set in map_flags argument to clEnqueueMapBuffer or clEnqueueMapImage) are considered to be undefined until this region is unmapped. Reads and writes by a kernel executing on a device to a memory region(s) mapped for writing are undefined.”

and “The behavior of OpenCL function calls that enqueue commands that write or copy to regions of a memory object that are mapped is undefined.”

So are they teaching us to violate the spec or am I reading it wrong? The reason I ask is we have done it the “nvidia way” so far, but on several fresh versions of AMD drivers, this will segfault. If we change the code to the first, non-mapped, alternative above, it ceases to segfault.

Thanks for any insights.

I think I found the problem… it was me reading nvidia wrong. The following is what they actually recommend:

init()
{
    pinned_mem = clCreateBuffer(..., CL_MEM_ALLOC_HOST_PTR, ...);
    device_mem = clCreateBuffer(...);
    host_ptr = clEnqueueMapBuffer(..., pinned_mem, ...);
}

loop()
{
    (write stuff to host_ptr buffer)
    clEnqueueWriteBuffer(..., device_mem, ..., host_ptr, ...);
    clEnqueueNDRangeKernel(...);
}

I’m hoping this will put an end to our problems.