Nested loops with dependencies and serial operations

I have a solver written in FORTRAN, which I would like to port to OpenCL. Its outer loop contains several inner loops at the same nesting level, each depending on the output of the previous one, separated by serial operations (see pseudo code below).


for( iter )
{
   serial operation first
   for( i ) { generates value A }
   serial operation second
   for( j ) { uses value A & generates value B }
   serial operation third
   for( k ) { uses value B }
   serial operation fourth
}

Would it be better to create a kernel for the outer loop that performs the serial operations and calls separate kernels for the 3 inner loops (see pseudo code below), rather than pass data to and from the host at the start and end of each parallel section?

The serial operations are compute-light, but running them on the host would mean copying a large amount of data to and from the device around each parallel loop. I am assuming that this transfer overhead will cost far more than the serial operations themselves, so keeping the data on the device would be more efficient.


kernel outer_loop(in_data, out_data)
{
    first_serial;
    call kernel parallel_i_loop(in_data);
    second_serial;
    call kernel parallel_j_loop(in_data);
    third_serial;
    call kernel parallel_k_loop(in_data);
    fourth_serial;
    transfer_to_host(out_data);
}

Or should the outer loop run on the host, calling the kernels for the inner loops and copying the data to and from the host several times?


host outer_loop(data)
{
    first;
    call kernel parallel_i_loop(in_data, out_data);
    in_data = out_data;
    second;
    call kernel parallel_j_loop(in_data, out_data);
    in_data = out_data;
    third;
    call kernel parallel_k_loop(in_data, out_data);
    in_data = out_data;
    fourth;
}

David

There’s no single correct answer to your question. But I can give you some things to consider:

  • OpenCL kernels are typically most effective when you launch hundreds of work items or more, especially if you want the OpenCL code to run on a GPU. So, if the loops you convert into kernels have fewer iterations than that, you won’t use the machine effectively.
  • Launching an OpenCL kernel has some fixed overhead, which is amortized over the kernel's execution time if the kernel runs long enough. So if your loops represent only a small amount of work, converting them into kernels won’t use the machine effectively.

Beyond these, there are considerations around data movement, locality, and lane coherence that can be important for your decision. But the considerations above may already rule out one of your solutions.

And of course, the parameters that govern these decisions will vary from one OpenCL platform implementation to another.
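
To make the trade-off concrete, here is a rough back-of-envelope model in C that compares one outer iteration under the two schemes from the question. It is only a sketch: the struct, the function names, and every number you feed it are hypothetical placeholders, and it assumes transfers and kernels do not overlap. Measure the real values on your own platform.

```c
/* Hypothetical per-iteration cost model; all fields must be measured, not guessed. */
typedef struct {
    double launch_overhead_s;  /* fixed cost of one kernel launch, in seconds   */
    double bandwidth_bytes_s;  /* host <-> device transfer rate, bytes/second   */
    double kernel_time_s;      /* compute time of one parallel loop             */
    double serial_host_s;      /* one serial step executed on the host          */
    double serial_device_s;    /* same step as a single work-item kernel        */
    double bytes;              /* amount of data moved by one transfer          */
} cost_params;

/* Serial steps on the host: each of the 3 parallel loops needs
 * one copy to the device and one copy back. */
double iter_time_host_serial(const cost_params *p)
{
    double transfer = p->bytes / p->bandwidth_bytes_s;
    return 3.0 * (2.0 * transfer + p->launch_overhead_s + p->kernel_time_s)
         + 4.0 * p->serial_host_s;
}

/* Data stays on the device: 7 launches per iteration
 * (3 parallel loops + 4 serial steps) and no transfers. */
double iter_time_device_resident(const cost_params *p)
{
    return 7.0 * p->launch_overhead_s
         + 3.0 * p->kernel_time_s
         + 4.0 * p->serial_device_s;
}
```

With made-up inputs such as 100 MB per transfer at 1 GB/s, the six transfers dominate the host-driven scheme even if each serial step runs ten times slower on the device. With small data or very short loops the balance can tip the other way, which is why the inputs matter.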

Hope this helps!

If I have understood your problem correctly, value A is used in the second serial operation, right? Is that why you have to sync your data with the host after the first inner loop? You should consider doing the serial work in OpenCL as well, to ensure that the data can stay on the GPU as long as possible.
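
As a minimal sketch of that idea, a serial step can be written as a kernel that is launched with a single work-item (for example via clEnqueueNDRangeKernel with a global work size of 1), so it works directly on the buffer the previous parallel kernel filled. The kernel name follows the pseudo code in the question, but its body (a running sum) is a made-up placeholder for the real serial work, and the shim at the top is only an assumption of mine so that the same source also compiles as plain C for testing on the host.

```c
/* Host-side shim: OpenCL C compilers define __OPENCL_VERSION__, plain C
 * compilers do not, so the address-space keywords are blanked out there. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

/* Hypothetical serial step. Intended to run as ONE work-item, so the loop
 * below executes sequentially on the device and the data never leaves it. */
__kernel void second_serial(__global float *a, const int n)
{
    /* Placeholder work with a loop-carried dependency: in-place running sum. */
    for (int i = 1; i < n; ++i) {
        a[i] += a[i - 1];
    }
}
```

A single work-item uses only a tiny fraction of a GPU, so this only pays off when the serial step is cheap compared with the transfers it avoids.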