I’ve run into one more obstacle in the specification, even when trying to implement a “secure” data transfer method with multiple buffers.
Since clCreateKernel returns one kernel object for all the devices the program has been built for, it’s impossible to use clSetKernelArg with different buffers for different devices. This forces you to create multiple cl_program objects (one per device), build each program for its device, and create separate kernels. Ugly.
clSetKernelArg could take an optional cl_device_id parameter, since a single kernel object shared by all devices limits what you can do with the kernel, as in this case.
That said, NVIDIA’s OpenCL SDK includes a multi-GPU example with the following illogical solution to the problem:
for (unsigned int i = 0; i < ciDeviceCount; ++i)
{
    workSize[i] = ...;

    // Input buffer
    d_Data[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, workSize[i] * sizeof(float), NULL, &ciErrNum);

    // Copy data from host to device
    ciErrNum = clEnqueueCopyBuffer(commandQueue[i], h_DataBuffer, d_Data[i], workOffset[i] * sizeof(float), 0, workSize[i] * sizeof(float), 0, NULL, NULL);

    // Output buffer
    d_Result[i] = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, ACCUM_N * sizeof(float), NULL, &ciErrNum);

    // Create kernel
    reduceKernel[i] = clCreateKernel(cpProgram, "reduce", &ciErrNum);

    // Set the args values and check for errors
    ciErrNum |= clSetKernelArg(reduceKernel[i], 0, sizeof(cl_mem), &d_Result[i]);
    ciErrNum |= clSetKernelArg(reduceKernel[i], 1, sizeof(cl_mem), &d_Data[i]);
    ciErrNum |= clSetKernelArg(reduceKernel[i], 2, sizeof(int), &workSize[i]);

    workOffset[i + 1] = ...;
}
The use of reduceKernel[i] and clSetKernelArg in this example makes no sense to me.
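Side note on the parts I cut out: the workSize[i] and workOffset[i + 1] assignments are elided in the loop above. A minimal sketch of how such a split could be computed, assuming an even division of the input with the remainder going to the first devices. That policy is my assumption, not necessarily what the SDK sample does, and split_work with its parameters is a hypothetical helper, not part of OpenCL or the SDK:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the SDK sample): splits total_elems
 * elements across device_count devices.
 *   work_size   must hold device_count entries
 *   work_offset must hold device_count + 1 entries; the last entry
 *               ends up equal to total_elems
 * Assumed policy: even split, with the first
 * (total_elems % device_count) devices getting one extra element. */
void split_work(size_t total_elems, unsigned int device_count,
                size_t *work_size, size_t *work_offset)
{
    assert(device_count > 0);

    size_t base  = total_elems / device_count;
    size_t extra = total_elems % device_count;

    work_offset[0] = 0;
    for (unsigned int i = 0; i < device_count; ++i) {
        /* Devices with index below 'extra' take one more element. */
        work_size[i] = base + (i < extra ? 1 : 0);
        /* The next device starts where this one ends. */
        work_offset[i + 1] = work_offset[i] + work_size[i];
    }
}
```

With 10 elements and 3 devices this gives sizes 4, 3, 3 and offsets 0, 4, 7, 10. These are element counts, so they would still be multiplied by sizeof(float) as in the clEnqueueCopyBuffer call above.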