Two levels of parallelism

I am confused by a problem, and I want to know whether we can do the following:

I have an algorithm that should run in parallel:
Suppose we have an array a[][], and the kernel code executes once for each element of this array. That part is fine. But if I have 3 different data sets for a, how can we run the 3 sets of a[][] in parallel? So here we have two levels of parallelism:
1. The elements of a[][] execute the code in parallel.
2. 3 different a[][] run in parallel (a[][] on three different data sets).

I hope it's clear.

Perhaps the most natural way to achieve what you describe is this: call clEnqueueNDRangeKernel() once for each of the arrays you have. If you have three arrays there would be three calls to that function.

Hi David,
Thank you for your reply, and sorry for the belated response; I noticed your reply late. I solved it by putting the 3 arrays together in one large array. Which is better, your suggestion or the approach I used? And if I use one array, is it better to process it in one work group or in more than one, and how do I decide how many? Please note that the 3 arrays differ in size.

Thank you.
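
For readers following along, the "one large array" approach mentioned above can be sketched in plain C, with no OpenCL calls. This is only one possible layout, not necessarily what the poster implemented. The helper names pack_offsets and find_array are hypothetical. The idea is to store the start index of each sub-array in a small offsets table, so that a work-item can map its flat index back to the data set it belongs to.

```c
#include <stddef.h>

/* Hypothetical helper: fill offsets[0..count] with the start index of each
   sub-array inside one packed buffer. offsets[count] ends up holding the
   total number of elements. The caller provides count + 1 slots. */
void pack_offsets(const size_t *sizes, size_t count, size_t *offsets)
{
    offsets[0] = 0;
    for (size_t i = 0; i < count; ++i)
        offsets[i + 1] = offsets[i] + sizes[i];
}

/* Hypothetical helper: map a flat index in the packed buffer back to the
   index of the sub-array that contains it. Returns count if gid is past
   the end of the packed data. */
size_t find_array(const size_t *offsets, size_t count, size_t gid)
{
    for (size_t k = 0; k < count; ++k)
        if (gid < offsets[k + 1])
            return k;
    return count;
}
```

With three arrays of 6, 12 and 4 elements, the offsets table would be 0, 6, 18, 22. A kernel using this layout would do the same lookup on its global ID and subtract offsets[k] to get the position inside data set k.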

I would say that running the kernel three times is conceptually better, since it is easier to understand what is happening. It will also make things easier should you ever need to run a number of kernels other than three at a time. Proper event and dependency handling will make it clear that the three kernel invocations may run in parallel. The overhead of launching two extra kernels should be small compared to the run time of the kernels themselves.

The work group question is a bit unclear. It is almost always better to have more than one work group. Do you have any particular reason for choosing only a single work group?

I am new to OpenCL (and maybe this is clear from my questions :) ). OK, I was a bit afraid that launching the kernel several times would affect performance a lot.
So I should set up the buffers separately for each of the arrays (since they differ in size) and call clEnqueueNDRangeKernel 3 times? Can you please give more hints about running the kernels in parallel? How do I do that?

Sorry, my question was unclear. I was asking how to decide the number of work groups I need and the number of work-items in each work group. Is there a way to pick the best choice so I can increase performance?

Creating the command queue with the out-of-order flag (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) allows the implementation to run the three kernel invocations in parallel, but the actual parallelism achieved is up to the implementation. Not all hardware can run two kernel invocations concurrently. In that case your combined kernel could be faster, since it gives the hardware more information and more work at once. How the kernel is actually implemented will have a large influence on this. The upside of the multi-invocation strategy is that you can place the buffers on three different devices, giving true heterogeneous computing if you process some of the buffers on a GPU and some of them on a CPU or whatever. This is the most parallel way to do it.

This is a rather difficult question, and one can only give hints about what to look for when deciding on a work group size. This was discussed recently on the official NVIDIA forums.
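
One mechanical part of the work group question can at least be sketched. In OpenCL 1.x the global work size passed to clEnqueueNDRangeKernel must be a multiple of the local (work group) size in each dimension, so a common pattern is to round the element count up and have the kernel skip the padding work-items. The plain C helpers below are hypothetical names, and the value 64 in the usage note is only an example, not a recommendation; the best local size depends on the device and the kernel.

```c
#include <stddef.h>

/* Hypothetical helper: number of work groups needed to cover n elements
   when each work group holds `local` work-items (local must be > 0). */
size_t num_work_groups(size_t n, size_t local)
{
    return (n + local - 1) / local;
}

/* Hypothetical helper: smallest multiple of `local` that is >= n.
   This is the value one would pass as the global work size. */
size_t round_up_global(size_t n, size_t local)
{
    return num_work_groups(n, local) * local;
}
```

For example, 1000 elements with a local size of 64 give 16 work groups and a global size of 1024. The kernel then needs a guard such as checking that its global ID is less than 1000 before touching the data.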