enqueueNDRangeKernel - parallel execution on OpenCL devices?

Say I have n OpenCL devices, and the data, of size d2, has been partitioned into sections that complement the compute topology; memory buffers have been allocated, etc.

Given something like this:


int loopUnroll = 4; // or whatever you want
cl::CommandQueue queue = cl::CommandQueue(context, device[i]); // i in {0, ..., n-1}
cl::NDRange globalRange(d2 / n);
cl::NDRange localRange(loopUnroll);
queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalRange, localRange);

Now, after executing the above, I am under the impression that this initiates kernel execution on a subset of the data of size d2/n.

I am unsure whether just repeating those steps for device i+1 will execute the computation on the OpenCL devices concurrently. Is it the case that I do that and then wait until some OpenCL function returns a “done computing” signal?

I am not sure where it is indicated (i.e. which specific API function) that computation on a particular global work group is finished. I’m not having much luck looking through the spec and the C++ wrapper literature, and so, I turn to le internet.

I would prefer answers using the C++ API (I don’t really like C syntax, just a personal preference), but please don’t expend any extra effort on it; I can translate if required.

You are right that you can enqueue the kernels in a loop over all the devices. This will result in the devices processing the data concurrently.

You find out when a kernel has finished by using the event object returned by enqueueNDRangeKernel. There are functions to wait for events to finish and to query the status of an event, i.e. queued, submitted to the device, running on the device, or finished. Section 5.9 of the OpenCL 1.2 specification explains it all.

Edit: section 3.7 of the OpenCL 1.1 C++ specification lists the C++ equivalent functions.
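A minimal sketch of the event approach (the context, queue, kernel, and ranges are assumed to be set up already; error handling omitted):

```cpp
// Sketch: wait on the event returned by enqueueNDRangeKernel.
// Assumes an existing queue, kernel, globalRange, and localRange.
cl::Event kernelDone;
queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalRange, localRange,
                           NULL, &kernelDone);

// Block the host until this kernel has finished...
kernelDone.wait();

// ...or poll its execution status instead:
cl_int status = kernelDone.getInfo<CL_EVENT_COMMAND_EXECUTION_STATUS>();
if (status == CL_COMPLETE) {
    // kernel has finished
}
```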

Gorgeous response, thank you.

So I’ve been trying to allocate this stuff dynamically like so:

std::vector<cl::CommandQueue> deviceQueues;

cl::NDRange globalRange(d2 / nDevices);
cl::NDRange localRange(LOOP_UNROLL);

for(std::vector<cl::Device>::iterator dit = clDevices.begin(); dit != clDevices.end(); ++dit){
    deviceQueues.push_back(cl::CommandQueue(context, *dit));

    // not sure about this last one here, read OCL spec
    deviceQueues.back().enqueueNDRangeKernel(N2kernel, cl::NullRange, globalRange, localRange);
}

Now, for every command queue in the deviceQueues vector, you say that an event object gets returned by enqueueNDRangeKernel. Is it the case, then, that after the above for loop, I run through a for loop like this:

for(std::vector<cl::CommandQueue>::iterator qit = deviceQueues.begin(); qit != deviceQueues.end(); ++qit){
    qit->enqueueBarrier();
}

In my grandiose imagination, it seems to me that the for loop will get stuck on this command until the first qit kernel finishes, then move on to the next, and so on through the command queue vector until they are all done. Is that sufficient? Or is it the case that I have to do something like:

while( qit->enqueueBarrier() != CL_SUCCESS){}

inside that for loop? I don’t want to spam the GPUs with requests; I’m not sure from the online documentation how enqueueBarrier works exactly.

I just did some more investigating, and came up with the following:

std::vector<cl::CommandQueue> deviceQueues;
std::vector<cl::Event> eventVector;

// Global Range:
cl::NDRange globalRange(N/nDevices);

// Local Range: Number of elements processed by each thread?
cl::NDRange localRange(LOOP_UNROLL);

for(std::vector<cl::Device>::iterator dit = clDevices.begin(); dit != clDevices.end(); ++dit){
    deviceQueues.push_back(cl::CommandQueue(context, *dit));
    eventVector.push_back(cl::Event());

    // not sure about this last one here, read OCL spec
    deviceQueues.back().enqueueNDRangeKernel(N2kernel, cl::NullRange, globalRange, localRange, NULL, &eventVector.back());
}

for(std::vector<cl::CommandQueue>::iterator qit = deviceQueues.begin(); qit != deviceQueues.end(); ++qit){
    qit->flush();
}

for(std::vector<cl::Event>::iterator eit = eventVector.begin(); eit != eventVector.end(); ++eit){
    eit->wait();
}

I’m confused over whether I need to flush every single CommandQueue object or if I can call a global

cl::CommandQueue::flush();

Also,

can I possibly replace that entire flush/wait sequence with a global

cl::CommandQueue::finish(); ?

I can’t seem to find this in the spec, but does cl::CommandQueue::finish() also perform the functionality of flush()?

I’m guessing I can do away with the event vector altogether and just have three (or two, if what I’m asking about above turns out to be true) separate for loops:

for loop 1:
enqueueNDRangeKernel on all devices

for loop 2:
flush on all devices

for loop 3:
finish on all devices
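A sketch of that three-loop variant, reusing the identifiers from the snippets above. Per the spec, finish() blocks the host until all previously queued commands on that queue have completed, and it implies a flush of that queue, so the explicit flush loop mainly serves to get every device started before the host blocks on any single queue:

```cpp
// Sketch: replace the event vector with per-queue finish() calls.
// Assumes deviceQueues, N2kernel, globalRange, localRange as above.

// Loop 1: enqueue the kernel on every device (returns immediately;
// it does not wait for the kernel to run).
for(std::vector<cl::CommandQueue>::iterator qit = deviceQueues.begin(); qit != deviceQueues.end(); ++qit){
    qit->enqueueNDRangeKernel(N2kernel, cl::NullRange, globalRange, localRange);
}

// Loop 2: flush every queue so all devices start working before the
// host blocks on any one of them.
for(std::vector<cl::CommandQueue>::iterator qit = deviceQueues.begin(); qit != deviceQueues.end(); ++qit){
    qit->flush();
}

// Loop 3: finish() blocks until that queue's commands have all
// completed (and implies a flush, so loop 2 is belt-and-braces).
for(std::vector<cl::CommandQueue>::iterator qit = deviceQueues.begin(); qit != deviceQueues.end(); ++qit){
    qit->finish();
}
```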