mapping buffers vs using host pointers

Hello,

I’m currently considering writing an application that would use OpenCL to offload atom-distance calculations to the GPU (the specific situation being Monte Carlo simulations of protein folding), and I have a couple of questions in this regard.

To describe the problem briefly: in host memory I have a float array that stores the atom coordinates (3 * N elements) and another float array of N * N elements that stores the distances between all atom pairs.
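In code, the host-side data is laid out roughly like this (a sketch; N is the number of atoms, and I’m assuming the N * N array is stored row-major):

float *coords;    /* 3 * N floats: x, y, z for each atom */
float *distances; /* N * N floats: distances[i * N + j] is the distance between atoms i and j */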

Ideally, I’d like to write an OpenCL kernel that takes the coordinates array as input and writes the pair-wise distances into the output distances array. Since these arrays are also used by other parts of the host code (which make no assumptions about OpenCL acceleration), I’d like to map them onto OpenCL memory objects in device memory in order to avoid potentially expensive copies between host and device memory.

After reading the latest OpenCL reference, it seems to me that there are at least two ways of achieving this:

  1. creating the buffer objects by passing CL_MEM_USE_HOST_PTR to clCreateBuffer(), together with the pointers to the arrays in host memory. This should use the memory referenced by the host pointers as the storage bits for the memory objects.

  2. creating the buffers in device memory and then mapping them into the host address space with clEnqueueMapBuffer(). In this case, the coordinate and distance pointers on the host side would hold the pointers returned by the calls to clEnqueueMapBuffer().
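In code, I imagine the two options would look roughly as follows (an untested sketch; error handling omitted, and the variable names are just placeholders):

cl_int err;

/* Option 1: let the buffer use the existing host array as its storage. */
cl_mem cldistances = clCreateBuffer(clcontext,
    CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
    sizeof(float) * natoms * natoms, distances, &err);

/* Option 2: allocate the buffer in device memory and map it into host
   memory; the host code then works through the returned pointer. */
cl_mem cldistances2 = clCreateBuffer(clcontext, CL_MEM_WRITE_ONLY,
    sizeof(float) * natoms * natoms, NULL, &err);
float *distances2 = (float *)clEnqueueMapBuffer(clqueue, cldistances2,
    CL_TRUE, CL_MAP_READ, 0, sizeof(float) * natoms * natoms,
    0, NULL, NULL, &err);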

First of all: am I correct in arriving at these two conclusions? If so, which method should be preferred?

Many thanks!
Andres

Yes, you are correct.
The specific performance differences will depend on the implementation and the device. (E.g., mapping may be much faster on the CPU because it avoids a copy, but if you need to move the data from a GPU anyway, it might not be much faster.)

Do keep in mind that if you tell CL to use a host pointer, you need to make sure that your kernel has finished executing before you access the data, or it may not be valid.
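For example, if you launch the kernel with an event, the safe pattern is something along these lines (a sketch; the names are placeholders):

cl_event ev;
clEnqueueNDRangeKernel(clqueue, clkernel, 2, NULL, global, NULL, 0, NULL, &ev);
clWaitForEvents(1, &ev);   /* or simply clFinish(clqueue) */
/* only now is it safe to touch the host array behind the buffer */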

Hello,

Thanks for your response. Since my experience with OpenCL is very limited at the moment, I was drawing these conclusions from some OpenGL coding I did recently using the Vertex Buffer Object (VBO) extension. VBOs can be mapped to CPU-memory buffers to avoid explicit data copies between CPU and GPU, and although the mapping still involves moving data between the two, the performance is better than with explicit copies. But I should run some benchmarks to see what the actual consequences of such mappings are in OpenCL. These benchmarks will depend, of course, on the specific hardware, driver versions, OS, etc.

Andres

I managed to run my distance-calculation kernel using a host pointer. Basically, I create the CL memory object with:

cldistances = clCreateBuffer(clcontext, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, sizeof(float) * natoms * natoms, distances, NULL);

where distances is the float array in CPU memory. The kernel is quite simple, just:

__kernel void dist_calc(__global float4 *coords,
                        __global char *distmask,
                        __global float *distances,
                        uint natoms)
{
    /* One work-item per (i, j) atom pair in a 2D NDRange. */
    unsigned int i = get_global_id(0);
    unsigned int j = get_global_id(1);
    float4 vi = coords[i];
    float4 vj = coords[j];
    int idx = i * natoms + j;
    distances[idx] = distance(vi, vj);
}

where I set the argument distances with:

clSetKernelArg(cldistkernel, 2, sizeof(cl_mem), (void *)&cldistances);
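For reference, the kernel is launched over a 2D global range of natoms × natoms work-items, roughly like this (simplified, no error checking):

size_t global[2] = { natoms, natoms };
clEnqueueNDRangeKernel(clqueue, cldistkernel, 2, NULL, global, NULL, 0, NULL, NULL);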

However, on the CPU side the calculated values won’t be accessible in the host array distances unless I explicitly copy cldistances back to distances with a buffer read:

clEnqueueReadBuffer(clqueue, cldistances, CL_TRUE, 0, sizeof(float) * natoms * natoms, distances, 0, NULL, NULL);

My understanding is that the call to clEnqueueReadBuffer() shouldn’t be needed, since cldistances uses distances as its storage, so the data copying/caching should be resolved implicitly by CL. Am I missing something here?

You do not have to call clEnqueueReadBuffer. clEnqueueMapBuffer is sufficient. This is what the spec states:

If the buffer or image object is created with CL_MEM_USE_HOST_PTR set in mem_flags, the following will be true:

  • The host_ptr specified in clCreateBuffer or clCreateImage{2D|3D} is guaranteed to contain the latest bits in the region being mapped when the clEnqueueMapBuffer or clEnqueueMapImage command has completed.

  • The pointer value returned by clEnqueueMapBuffer or clEnqueueMapImage will be derived from the host_ptr specified when the buffer or image object is created.
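So in practice, after enqueuing the kernel, something like this should be all that is needed (a sketch, error checking omitted):

/* Blocking map: when this returns, the kernel has completed and the host
   array distances is guaranteed to hold the latest results. */
float *p = (float *)clEnqueueMapBuffer(clqueue, cldistances, CL_TRUE, CL_MAP_READ,
    0, sizeof(float) * natoms * natoms, 0, NULL, NULL, NULL);
/* ... read the distances on the host; per the spec, p is derived from distances ... */
clEnqueueUnmapMemObject(clqueue, cldistances, p, 0, NULL, NULL);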

Many thanks for bringing the clEnqueueMapBuffer() function to my attention. It seems it is exactly what I was looking for.

So, the proper order to call all these functions should be as follows:

// Initialization:
cldistances = clCreateBuffer(clcontext, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, …

// Calculation loop:
for (i = 0; i < numsteps; …
{
    // Enqueuing the kernel for execution:
    clerror = clEnqueueNDRangeKernel(clqueue, cldistkernel, …

    // Enqueuing the buffer mapping:
    clEnqueueMapBuffer(clqueue, cldistances, …
    …
}
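(I assume the buffer also has to be unmapped before the next iteration’s kernel launch, i.e. after using the mapped data in each step, something like:

clEnqueueUnmapMemObject(clqueue, cldistances, mappedptr, 0, NULL, NULL);

where mappedptr is the pointer returned by clEnqueueMapBuffer().)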

Do you agree?

This in fact reminds me of the VBO mapping in OpenGL with the glMapBufferARB() function.