Creating buffers

Hi, I have a question about the clCreateBuffer function. The second argument of this function describes how the data in the buffer will be used. Now, say it is CL_MEM_COPY_HOST_PTR: that means a straight copy happens from host to device at creation time, and after that there is no link or mapping between the host pointer and the device data. But the other two options, CL_MEM_USE_HOST_PTR and CL_MEM_ALLOC_HOST_PTR, are a bit confusing.

If I select CL_MEM_USE_HOST_PTR, does that mean the memory is allocated on the device? If so, when does it copy the data from host to device?
For CL_MEM_ALLOC_HOST_PTR, does it ever allocate memory on the device? It must copy memory across to the device for execution at some point; when does that happen?

I'm guessing that the most efficient method is CL_MEM_COPY_HOST_PTR, but of course that assumes you are not going to change the data.

This is a bit tricky. CL_MEM_USE_HOST_PTR instructs the driver to directly use the memory you have already allocated. This may improve efficiency by not requiring a copy to the device (for example, on a CPU device), but it may not, depending on the implementation: a GPU may still copy the data for performance rather than accessing it over the PCIe bus, although it then has to keep the two copies in sync. CL_MEM_ALLOC_HOST_PTR tells the runtime to allocate the memory for the object in host-accessible memory; this is useful if you don't want to provide the allocation yourself but still want the data kept host-local. As for what is fastest, it depends on the device and on what you are doing. The best strategy is to keep the data on the device for as long as possible, but if the device is a CPU you can avoid a copy by using CL_MEM_USE_HOST_PTR.
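To make the three flags concrete, here is a minimal sketch of what each variant looks like at creation time. The names `context`, `host_data`, and `size` are placeholders for an existing context and an application-owned allocation; error handling is trimmed for brevity.

```c
/* Sketch: the three host-pointer flags passed to clCreateBuffer.
   Assumes `context` is a valid cl_context and `host_data` points to
   `size` bytes already owned by the application. */
#include <CL/cl.h>

void create_buffer_variants(cl_context context, void *host_data, size_t size)
{
    cl_int err;

    /* (a) COPY_HOST_PTR: snapshots host_data when the buffer is created;
       afterwards there is no link between host_data and the buffer. */
    cl_mem copied = clCreateBuffer(context,
        CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, size, host_data, &err);

    /* (b) USE_HOST_PTR: the runtime uses host_data as the backing store;
       it may still cache a copy on the device but must keep it in sync. */
    cl_mem used = clCreateBuffer(context,
        CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, host_data, &err);

    /* (c) ALLOC_HOST_PTR: the runtime allocates host-accessible (often
       pinned) memory itself; the host_ptr argument must be NULL here. */
    cl_mem alloced = clCreateBuffer(context,
        CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);

    clReleaseMemObject(copied);
    clReleaseMemObject(used);
    clReleaseMemObject(alloced);
}
```

Note that combining CL_MEM_USE_HOST_PTR with CL_MEM_ALLOC_HOST_PTR is invalid, since the first says "use my pointer" and the second says "allocate one for me."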

What is the OpenCL equivalent of CUDA “pinned” memory. I.e. memory guaranteed not to be swapped to disk by the OS?

There are five valid combinations of the following three flags: CL_MEM_ALLOC_HOST_PTR, CL_MEM_COPY_HOST_PTR, and CL_MEM_USE_HOST_PTR.

The combinations are: (1) No flags specified, (2) CL_MEM_COPY_HOST_PTR, (3) CL_MEM_ALLOC_HOST_PTR, (4) CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR, and (5) CL_MEM_USE_HOST_PTR.

The first two, (1) no flags specified and (2) CL_MEM_COPY_HOST_PTR, are non-mappable: you typically transfer data between host and device with clEnqueueReadBuffer and clEnqueueWriteBuffer. The next two, (3) CL_MEM_ALLOC_HOST_PTR and (4) CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR, are mappable and can use clEnqueueMapBuffer and clEnqueueUnmapMemObject. The last, (5) CL_MEM_USE_HOST_PTR, is also mappable and should use clEnqueueMapBuffer and clEnqueueUnmapMemObject.
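A sketch of the two access styles described above, assuming a valid `queue` and a buffer `buf` of `size` bytes (names are placeholders, error handling omitted):

```c
/* Sketch: non-mappable vs. mappable access patterns. */
#include <CL/cl.h>

void transfer_styles(cl_command_queue queue, cl_mem buf,
                     size_t size, void *src)
{
    cl_int err;

    /* Combinations (1) and (2): explicit copies with read/write. */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE /* blocking */,
                         0, size, src, 0, NULL, NULL);

    /* Combinations (3), (4), (5): map the buffer into host address
       space, touch it directly, then unmap before kernels use it. */
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE /* blocking */,
                                 CL_MAP_WRITE, 0, size,
                                 0, NULL, NULL, &err);
    /* ... fill the buffer through p here ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}
```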

If you are porting an existing application which already allocates its own buffers, then the fifth combination, CL_MEM_USE_HOST_PTR, may be your only choice. However, this might not give the best performance, because the buffer may need to be copied between host memory and device memory. It is generally felt that the first, (1) no flags specified, and third, (3) CL_MEM_ALLOC_HOST_PTR, combinations are better, because they do not require a copy and because the buffer can be allocated internally with constraints such as alignment. Weighing the overhead of a "bulk" read/write transfer against the cost of a transfer for each mapped access can help you decide between the non-mappable (1) and mappable (3) combinations.

Naturally this depends upon the performance characteristics of each vendor’s hardware and software implementation, that is, your mileage may vary depending upon the OpenCL that you are using.

Thanks bwatt, that’s the most clear explanation I’ve seen of those flags. Maybe add it to the standard? :slight_smile:

Ok, but with CL_MEM_ALLOC_HOST_PTR, if the data is allocated on the host, doesn't it still have to be copied to the device (if it is a graphics card, say) when the kernel executes?

Another silly question: when using CL_MEM_ALLOC_HOST_PTR, how do you get a pointer to the memory the driver allocated in host memory so it can be accessed directly? Or do I always have to use clEnqueueReadBuffer to get the data back out?

jajce85, I can't answer that; I don't have the expertise concerning GPUs, to tell you the truth. However, I would guess that the vendor needs to allocate the CL_MEM_ALLOC_HOST_PTR memory somewhere it is mappable, so it could be allocated directly on the card and not copied in some cases (for example, when there is only one device in the context), provided that device memory can be mapped into host storage. If no flags are specified, implying non-mappable, then you have to use read/write to access it; in that case I would also guess that it could be allocated on the card (with the same caveat as above). Again, these are just thoughts on my part, and I would defer to experts from the device vendors.

coleb, you get the address by using clEnqueueMapBuffer, or you can always pay the overhead of a copy and use clEnqueueReadBuffer. However, I would assume the former performs better for CL_MEM_ALLOC_HOST_PTR memory. Bear in mind that mapped memory could be slower when accessed a scalar or vector word at a time if you make many repetitive accesses, whereas paying the price of a single bulk read into host memory, with faster per-word access afterwards, could be better. That is why I say your mileage may vary: you need to try both approaches to determine which is best for the hardware you are using.
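The two options described above can be sketched as follows; `queue`, `buf`, `size`, and `dst` are placeholder names, and error handling is omitted:

```c
/* Sketch: two ways to read back a CL_MEM_ALLOC_HOST_PTR buffer. */
#include <CL/cl.h>

void read_back(cl_command_queue queue, cl_mem buf, size_t size, void *dst)
{
    cl_int err;

    /* Option 1: map and access in place (no copy, but per-word
       access through the mapped pointer may be slower). */
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE /* blocking */,
                                 CL_MAP_READ, 0, size,
                                 0, NULL, NULL, &err);
    /* ... read through p directly here ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

    /* Option 2: one bulk copy into ordinary host memory, then
       fast scalar/vector access on dst afterwards. */
    clEnqueueReadBuffer(queue, buf, CL_TRUE /* blocking */,
                        0, size, dst, 0, NULL, NULL);
}
```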

The first two, (1) no flags specified and (2) CL_MEM_COPY_HOST_PTR, are non-mappable: you typically transfer data between host and device with clEnqueueReadBuffer and clEnqueueWriteBuffer. The next two, (3) CL_MEM_ALLOC_HOST_PTR and (4) CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR, are mappable and can use clEnqueueMapBuffer and clEnqueueUnmapMemObject. The last, (5) CL_MEM_USE_HOST_PTR, is also mappable and should use clEnqueueMapBuffer and clEnqueueUnmapMemObject.

I am not sure that is correct: one can create a buffer with no flags and use clEnqueueMapBuffer to transfer the data.

Regards,

Seb