Transfers between host and device memory

Hi there, I have two questions:

First question: I need to transfer data from GPU to CPU and from CPU to GPU. To compute the transfer rate I’m timing the transfers using OpenCL events, and it looks like the transfer from GPU to CPU is faster than the transfer from CPU to GPU (12.2 GB/s vs 11 GB/s). I read somewhere that this behavior is normal, but I don’t know why: is it because of restrictions imposed by PCIe or by the GPU? Any explanation and links would be useful. BTW: I’m using an NVIDIA C2070 GPU on a PCIe x16 Gen2 slot, and the host buffer is pinned memory.
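Roughly, the timing looks like this (a sketch; queue, devBuf, hostPtr and bytes are placeholder names, and the queue must be created with CL_QUEUE_PROFILING_ENABLE):

cl_event evt;
cl_int err = clEnqueueWriteBuffer(queue, devBuf, CL_TRUE, 0, bytes, hostPtr, 0, NULL, &evt);
cl_ulong t0 = 0, t1 = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
double seconds = (t1 - t0) * 1e-9;        // profiling timestamps are in nanoseconds
double gbPerSec = (double)bytes / seconds / 1e9;
clReleaseEvent(evt);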

Second question: what I actually need is to transfer data from GPU1 to GPU2, so I’m currently doing it in two transfers, GPU-CPU and then CPU-GPU, using pinned memory. Is there any way to transfer GPU-GPU directly? Both GPUs are C2070s.

Thanks.

Is there any way to transfer GPU-GPU directly?

You may want to read about clEnqueueCopyBuffer() and clEnqueueCopyImage() to perform an explicit copy.

Alternatively, you can just create a context with the two GPUs in it and let the OpenCL runtime move data from one device to the other automatically for you. All you need is one command queue for GPU 1 and another command queue for GPU 2. When you enqueue an NDRange on queue 1, all necessary data will be transferred to GPU 1 automatically if it was not already there.
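For example, a minimal sketch of that setup (error handling omitted; devices[] is assumed to already hold the two GPUs returned by clGetDeviceIDs(), and bytes is an illustrative size):

cl_device_id devices[2];               // assumed: the two C2070s from clGetDeviceIDs()
cl_int err;
size_t bytes = 1024 * sizeof(float);   // illustrative size

// One context that contains BOTH GPUs
cl_context ctx = clCreateContext(NULL, 2, devices, NULL, NULL, &err);

// One command queue per device, both attached to the same context
cl_command_queue q0 = clCreateCommandQueue(ctx, devices[0], 0, &err);
cl_command_queue q1 = clCreateCommandQueue(ctx, devices[1], 0, &err);

// Buffers belong to the context, not to a queue or a device
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

// Enqueueing a kernel on q1 that uses buf lets the runtime move the buffer's
// contents to GPU 2 if they are not already there.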

I actually tried using clEnqueueCopyBuffer(); however, the performance was not good:

queue[0] = clCreateCommandQueue(context, device[0], QUEUE_OPTS , &result);
queue[1] = clCreateCommandQueue(context, device[1], QUEUE_OPTS , &result);
// Kernel creation and argument passing:
// - coefx[0], coefy[0] and res[0] "were created in" queue[0] using clEnqueueCopyBuffer(queue[0], ...)
// - coefx[1], coefy[1] and res[1] "were created in" queue[1] using clEnqueueCopyBuffer(queue[1], ...)
kernel[0] = clCreateKernel(OpenCLProgram, "kernelName", &err);  // "kernelName" stands in for the actual kernel name string
err = clSetKernelArg(kernel[0], 0, sizeof(cl_mem), (void*)&(coefx[0]));
err = clSetKernelArg(kernel[0], 1, sizeof(cl_mem), (void*)&(coefy[0]));
err = clSetKernelArg(kernel[0], 2, sizeof(cl_mem), (void*)&(res[0]));
// variables with index 1 were created in queue[1]
kernel[1] = clCreateKernel(OpenCLProgram, "kernelName", &err);
err = clSetKernelArg(kernel[1], 0, sizeof(cl_mem), (void*)&(coefx[1]));
err = clSetKernelArg(kernel[1], 1, sizeof(cl_mem), (void*)&(coefy[1]));
err = clSetKernelArg(kernel[1], 2, sizeof(cl_mem), (void*)&(res[1]));

err = clEnqueueNDRangeKernel(queue[0], kernel[0], 1, NULL, WSize, GSize, 0, NULL, NULL);
err = clEnqueueNDRangeKernel(queue[1], kernel[1], 1, NULL, WSize, GSize, 0, NULL, NULL);
size = Ntotal * sizeof(float);
offset = size;
// Barriers before timing
clFinish(queue[0]);
clFinish(queue[1]);
start = gettimeofday();
// res[0] and res[1] were created in queue[0] and queue[1] respectively
err = clEnqueueCopyBuffer(queue[1], res[1], res[0], 0, offset, size, 1, &eventTmp, &event);
clWaitForEvents(1, &event);
finish = gettimeofday();
wallTime = finish - start;
openCLTime = clGetEventProfilingInfo(event);  // pseudocode: COMMAND_END - COMMAND_START queried from the event

In the above code, wallTime is the time the copy takes, and it is larger than the time of a GPU1-CPU transfer followed by a CPU-GPU0 transfer.

Alternatively, you can just create a context with the two GPUs in it and let the OpenCL runtime move data from one device to the other automatically for you. All you need is one command queue for GPU 1 and another command queue for GPU 2. When you enqueue an NDRange on queue 1, all necessary data will be transferred to GPU 1 automatically if it was not already there.

I’m not really sure this will actually work, because the variable (say res in the above code) exists on both GPUs. The main idea in the example above is that each GPU does the same work but on a different piece of data (i.e., the same array with a different offset), so I’m guessing the OpenCL runtime will believe the data is already there and won’t try to copy it automatically, correct?

Thanks.

Can you elaborate on this part?

// - coefx[0], coefy[0] and res[0] "were created in" queue[0] using clEnqueueCopyBuffer(queue[0],...)

First, coefx, coefy and res are buffer objects, and buffer objects belong to a context, not to a queue. Why do you call clEnqueueCopyBuffer() at all here? It seems like you simply load data into coefx and coefy and then let the GPU do some computations. I don’t see why clEnqueueCopyBuffer() would be necessary.

clFinish(queue[0]);
clFinish(queue[1]);

This may not be a very good idea. What you want is both devices running simultaneously. If you call clFinish() on each device separately, it’s possible that the second GPU will not be doing anything while clFinish(queue[0]) is waiting for the first GPU to finish.

Instead, you may want to call clWaitForEvents() on two events, one from each queue.
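For example (a sketch reusing the names from your code; error checks omitted):

cl_event done[2];
err = clEnqueueNDRangeKernel(queue[0], kernel[0], 1, NULL, WSize, GSize, 0, NULL, &done[0]);
err = clEnqueueNDRangeKernel(queue[1], kernel[1], 1, NULL, WSize, GSize, 0, NULL, &done[1]);
clFlush(queue[0]);         // make sure both submissions are issued to the devices
clFlush(queue[1]);
clWaitForEvents(2, done);  // blocks until BOTH kernels have finished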

Separately, I don’t quite get this either:

// res[0] and res[1] were created in queue[0] and queue[1] respectively
err = clEnqueueCopyBuffer(queue[1], res[1], res[0], 0, offset, size, 1, &eventTmp, &event);

Is it truly necessary for your algorithm to copy the results into a single buffer? If the transfer is not strictly required, you may want to keep the two buffers separate.

In the above code, wallTime is the time the copy takes, and it is larger than the time of a GPU1-CPU transfer followed by a CPU-GPU0 transfer.

Can you show us the two alternative versions of the code for comparison?

I’m not really sure this will actually work, because the variable (say res in the above code) exists on both GPUs. The main idea in the example above is that each GPU does the same work but on a different piece of data (i.e., the same array with a different offset), so I’m guessing the OpenCL runtime will believe the data is already there and won’t try to copy it automatically, correct?

I understand that you want each device to do part of the work. I don’t quite get the rest of the statement. There is not a single variable “res”. Instead, you have already divided it into res[0] and res[1]. The OpenCL runtime is fully aware at all times of where each buffer is located, whether it’s on the first or the second GPU, and it will use that knowledge to minimize data transfers.

Put more simply, the kernel takes two inputs (coefx and coefy), performs some computations, and updates the output (res).

I’m partitioning the kernel execution across my two available GPUs in such a way that each GPU performs part of the processing and produces half of the array “res”. Once the kernel execution on both GPUs is done, the array res[1] on GPU1 is copied into the array res[0] on GPU0; no data is overwritten, since the data are copied into the unused part of the array.

When the variables coefx, coefy and res are populated before kernel execution, they are tied to a specific queue as follows (it is the same for coefx[.] and coefy[.]):


res[0] = clCreateBuffer(context, mode, size, NULL, &err);
// h_Data is the host buffer with the information to be copied into res[0]
// This line "ties" the variable res[0] to queue[0] and hence to GPU0
err = clEnqueueCopyBuffer(queue[0], h_Data, res[0], 0, 0, size, 0, NULL, NULL);

First, coefx, coefy and res are buffer objects, and buffer objects belong to a context, not to a queue. Why do you call clEnqueueCopyBuffer() at all here? It seems like you simply load data into coefx and coefy and then let the GPU do some computations.

In the above code I specified which queue (and hence which GPU) performs the copy, so the array will belong to that GPU.

I don’t see why clEnqueueCopyBuffer() would be necessary.

Because I want to specify where to “store” the buffer by specifying the queue (GPU)

clFinish(queue[0]);
clFinish(queue[1]);

This may not be a very good idea. What you want is both devices running simultaneously. If you call clFinish() on each device separately, it’s possible that the second GPU will not be doing anything while clFinish(queue[0]) is waiting for the first GPU to finish.

Instead, you may want to call clWaitForEvents() on two events, one from each queue.

Yes, you are right; in fact I use clWaitForEvents() in my actual code.

Separately, I don’t quite get this either:

// res[0] and res[1] were created in queue[0] and queue[1] respectively
err = clEnqueueCopyBuffer(queue[1], res[1], res[0], 0, offset, size, 1, &eventTmp, &event);

Hopefully the above explanations/code clarifies this as well.

Is it truly necessary for your algorithm to copy the results into a single buffer? If the transfer is not strictly required, you may want to keep the two buffers separate.

Yes, it is necessary to copy the results back into a single buffer.

In the above code, wallTime is the time the copy takes, and it is larger than the time of a GPU1-CPU transfer followed by a CPU-GPU0 transfer.

Can you show us the two alternative versions of the code for comparison?

This piece of code copies from GPU1 to CPU and then from CPU to GPU0


// Define size and offset (both in bytes)
size = N/2;
offset = size;
// Creates pinned host buffer
cl_mem PinnedBuf = NULL;
PinnedBuf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size, NULL, NULL);
float *temp  = NULL; 
temp = (float *) clEnqueueMapBuffer(queue[1], PinnedBuf, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, size, 0, NULL, NULL, NULL);
// GPU1 to CPU 
err = clEnqueueReadBuffer(queue[1], res[1], CL_FALSE, 0, size, temp, 0, NULL, &event);
// CPU to GPU0
err = clEnqueueWriteBuffer(queue[0], res[0], CL_FALSE, offset, size, temp, 1, &event, NULL);

This piece of code copies directly from GPU1 to GPU0


// GPU1 to GPU0
// Assumes that eventTmp is the event tied to the kernel execution on queue[1]
err = clEnqueueCopyBuffer(queue[1], res[1], res[0], 0, offset, size, 1, &eventTmp, NULL);

I understand that you want each device to do part of the work. I don’t quite get the rest of the statement. There is not a single variable “res”. Instead, you have already divided it into res[0] and res[1].

Correct

The OpenCL runtime is fully aware at all times of where each buffer is located, whether it’s on the first or the second GPU, and it will use that knowledge to minimize data transfers.

Say I defined res[0] of size N/2 to belong to queue[0] (holding array indexes 0 to N/2 - 1) and res[1] of size N/2 to belong to queue[1] (holding array indexes N/2 to N - 1), and at some point kernel[0] running on queue[0] references the location at index “N-1”. Does the OpenCL runtime 1) “bring in” the correct data, 2) have some undefined behavior, or 3) seg fault?

// This line “ties” the variable res[0] to queue[0] and hence to GPU0
err = clEnqueueCopyBuffer(queue[0], h_Data, res[0], 0, 0, size, 0, NULL, NULL);

That line is not necessary to “tie” the variable to a device. In fact it’s causing an additional data transfer that is not needed in the first place.
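For example, to get the initial data onto GPU 0 it is enough to write it from the host; a sketch (h_input is a hypothetical host pointer holding the data to load):

res[0] = clCreateBuffer(context, mode, size, NULL, &err);
err = clEnqueueWriteBuffer(queue[0], res[0], CL_TRUE, 0, size, h_input, 0, NULL, NULL);

// Or, equivalently, let clCreateBuffer copy the host data at creation time:
// res[0] = clCreateBuffer(context, mode | CL_MEM_COPY_HOST_PTR, size, h_input, &err);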

In the above code I specified which queue (and hence which GPU) performs the copy, so the array will belong to that GPU.

That’s not necessary.

Because I want to specify where to “store” the buffer by specifying the queue (GPU)

Not necessary.

Yes, it is necessary to copy the results back into a single buffer.

OK, do that at the end. Don’t call clEnqueueCopyBuffer() at the beginning.

temp = (float *) clEnqueueMapBuffer(queue[1], PinnedBuf, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, size, 0, NULL, NULL, NULL);
// GPU1 to CPU
err = clEnqueueReadBuffer(queue[1], res[1], CL_FALSE, 0, size, temp, 0, NULL, &event);
// CPU to GPU0
err = clEnqueueWriteBuffer(queue[0], res[0], CL_FALSE, offset, size, temp, 1, &event, NULL);

That code may or may not work, depending on the implementation.

First, OpenCL 1.1 doesn’t specify whether a pointer returned by clEnqueueMapBuffer() may be used in other APIs such as clEnqueueReadBuffer().

Second, you didn’t call clEnqueueUnmapMemObject(). Without a call to clEnqueueUnmapMemObject() there is no guarantee that the data that is read from PinnedBuf will match the data that was written into it (you can search the specification for the term “synchronization point”).

In other words, the code above is non-portable.
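For what it’s worth, one portable way to write the two-hop version is to stage through an ordinary host allocation and skip the map entirely; a sketch reusing the names from your code (it gives up the explicitly pinned staging buffer in exchange for staying within the spec):

float *staging = (float *) malloc(size);
cl_event readDone;

// GPU1 -> host (waits for the kernel on queue[1] via eventTmp)
err = clEnqueueReadBuffer(queue[1], res[1], CL_FALSE, 0, size, staging, 1, &eventTmp, &readDone);
// host -> GPU0, into the second half of res[0]
err = clEnqueueWriteBuffer(queue[0], res[0], CL_FALSE, offset, size, staging, 1, &readDone, &event);
clWaitForEvents(1, &event);
clReleaseEvent(readDone);
free(staging);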

Say I defined res[0] of size N/2 to belong to queue[0] (holding array indexes 0 to N/2 - 1) and res[1] of size N/2 to belong to queue[1] (holding array indexes N/2 to N - 1), and at some point kernel[0] running on queue[0] references the location at index “N-1”. Does the OpenCL runtime 1) “bring in” the correct data, 2) have some undefined behavior, or 3) seg fault?

res[0] is one buffer object. res[1] is an entirely separate buffer object. They both belong to the same context. They do not belong to any particular device. If you access an index out of bounds in res[0] or res[1] on either device, the results are undefined.
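To make that concrete, here is a sketch (OpenCL C) of what each device’s kernel might look like under this split; the body is only an illustrative computation. Each work-item indexes 0 .. N/2-1 within the buffer it was given, so there is no index that reaches the element held by the other GPU’s buffer.

__kernel void update(__global const float *coefx,
                     __global const float *coefy,
                     __global float *res)
{
    size_t i = get_global_id(0);     // 0 .. N/2-1 on each device
    res[i] = coefx[i] * coefy[i];    // placeholder for the real computation
}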