Say I have n OpenCL devices, and that the data, of size d2, has been partitioned into sections that complement the compute topology, memory buffers have been allocated, etc.
Given something like this:
int loopUnroll = 4; // or whatever you want
cl::CommandQueue queue(context, device[i]); // i in {0, ..., n-1}
cl::NDRange globalRange(d2 / n);
cl::NDRange localRange(loopUnroll);
queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalRange, localRange);
Now, after executing the above, I am under the impression that this initiates kernel execution on device i over a data set of size d2/n.
I am unsure whether simply repeating those steps for devices i+1, i+2, ... will execute the computation on the OpenCL devices concurrently. Is it the case that I do that and then wait until some OpenCL function returns a "done computing" signal?
I am not sure where it is indicated (i.e., which specific API function tells me) that computation on a particular global work range is finished. I'm not having much luck looking through the spec and the C++ wrapper literature, and so I turn to le internet.
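In case it helps clarify what I'm after, here is roughly what I imagine the multi-device version would look like. This is an untested sketch: `queues` and `kernels` are hypothetical per-device arrays, and I'm guessing that the trailing cl::Event* out-parameter of enqueueNDRangeKernel, together with cl::Event::waitForEvents (or per-queue finish()), is the intended completion mechanism.

```cpp
// Untested sketch: one command queue per device, one event per enqueue,
// then block until every device reports completion.
// `queues`, `kernels`, `d2`, `n`, `loopUnroll` are assumed to exist.
std::vector<cl::Event> done(n);
for (int i = 0; i < n; ++i) {
    queues[i].enqueueNDRangeKernel(kernels[i], cl::NullRange,
                                   cl::NDRange(d2 / n),
                                   cl::NDRange(loopUnroll),
                                   nullptr,      // no wait-list
                                   &done[i]);    // event signalled on completion
}
cl::Event::waitForEvents(done);  // or: for each i, queues[i].finish();
```

Is that the right idea, or is there a preferred pattern for synchronising across devices?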
I would prefer answers using the C++ API (I don't really like C syntax, just a personal preference), but please don't expend any extra effort on that; I can translate if required.