Running kernel on host vs device

jimmyz500 · November 30, 2012, 9:42am

In the Codeproject example:

// create data for the run
float* data = new float[DATA_SIZE];

// Create the device memory vectors
input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * count, NULL, NULL);

// Transfer the input vector into device memory
err = clEnqueueWriteBuffer(commands, input, CL_TRUE, 0, sizeof(float) * count, data, 0,
NULL, NULL);

// Set the arguments to the compute kernel
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);

// Execute the kernel
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

Question is if I can choose between CL_DEVICE_TYPE_GPU or CL_DEVICE_TYPE_CPU, when executing on the host, how would the kernel use data on the host? It seems to me that in clSetKernelArg, the kernel is always set to use &input, which is on the device, and that doesn’t make sense when running on the CPU.

Any clarification is much appreciated.
-J

Dithermaster · December 1, 2012, 12:51pm

With AMD and Intel OpenCL platform drivers, you can select OpenCL devices that are the CPU instead of the GPU.

The rest of OpenCL works just like it would with a GPU.

With your code, clEnqueueWriteBuffer copies data from CPU memory to another part of CPU memory, and then when you execute your kernel on the CPU, it access that memory.

If you know you are running on the CPU, using clEnqueueMapBuffer can be faster because memory isn’t copied, just ownership changes (when mapped you can access the buffer from your main code, when unmapped from kernels; the map and unmap calls are fast).

jimmyz500 · December 1, 2012, 1:55pm

Dithermaster, thanks very much for your response.

So only when device type is set to CL_DEVICE_TYPE_GPU, does clEnqueueWriteBuffer actually copies the data to the device over PCIe, causing the long delay?

Thx

Dithermaster · December 2, 2012, 7:03am

Yes, for GPU clEnqueueWriteBuffer enqueues a command which will asynchronously copy the data over the PCIe bus. If speed is paramount here, read the vendor documentation on how to maximum speed, for example, by using pinned buffers. You could also switch to a model where you use clEnqueueMapBuffer, which always runs at full PCIe bandwidth.

jimmyz500 · December 2, 2012, 7:42pm

Dithermaster, thanks very much for the explanation.

clint3112 · December 5, 2012, 8:47am

Dithermaster,

where are your infos abput the full PCIe bandwith from? Do you have a source for that?

Thanks in Advance,
Clint3112

Dithermaster · December 8, 2012, 11:41am

From each manufacturer’s OpenCL documentation. The each have guides that recommend the fastest way to transfer data to their devices.

clint3112 · December 11, 2012, 7:32am

And where did you get those specs? I could not find them on the nvidia page. need to know what binary i till get from CL_PROGRAM_BINARY :?