Generic overlap of data transfers and kernel computation

Hi,

I’ve a program (C code + OpenCL API) with 3 main operations, in a big for cycle:

  1. send data to GPU, (clEnqueueWriteBuffer)
  2. process that data with kernel (clEnqueueNDRangeKernel) and
  3. send output back to CPU. (clEnqueueReadBuffer)

The dependency in each iteration is 1->2->3. Iterations do not interfere with each other.
The kernel is big enough to hide the data transfers times.

Like this:

LINK: Dropbox - Error

As you see in figure, the top one is if I perform everything in-order.
The bottom one, is with overlap, as I desire.

I implement a double buffer scheme (meaning, having 2 input buffers and 2 output buffers and using them interleaved), so that I can perform the data transfers of the next kernel execution while performing the current kernel.
You see, I want to make that (1) of penultimate line while making (2) from previous iteration.

My ambition is to have a time that only depends on the kernel, and everything else is “hidden”.

I’ve out-of-order execution on a single command queue, for a single GPU device.
I do not explicitly wait for events whatsoever.
I just make sure that every operation doesn’t start before the previous dependent operations have finished. I made that using events returned from every single operation I perform.

I’ve defined the dependencies:
in same iteration: 2 not before 1 finish, 3 not before 2 finish.
across 2-2 iterations (as i use double buffer, and every 2 iterations, the kernel uses the same buffer): 1 not before 2 finish (i can’t write on the buffer the kernel still uses), and 2 not before 3 finish (i can’t execute kernel if i haven’t retrieved the data).

The problem is that profiling, the time the cycle takes is the time of the summed operations and not only from the kernel operations (except first and last cycle overhead), meaning, that the overlap is not being done.
Should also be noted that each transfer of data is between 200 and 500MB.

So, it appears that I’ve the event synchronization well done, but the overlap is not occurring. How to you would solve this?
Any ideas? I’ve read in a forum (stackoverflow probably) that I would need more than one command queue to perform multiple operations at the same time, but from the the books I’ve read, it was stated that data transfers can occur simultaneously on a single queue (using DMA).

Hints?

From a NVIDIA OpenCL guide I got this:

"Transfer/Compute Overlap

Separate command queues can always overlap

  • Can use this to overlap transfer and compute
  • Generally best when transfer and compute time is balanced
  • Most useful when data has high reuse

Or directly pass ALLOC_HOST memory to kernel

  • Uses GPUs latency hiding to ensure maximal bus usage
  • Generally best when data has low/no reuse
  • No events needed to synchronize between copy and kernel"

But I also read that CUDA streams, using CUDA API can’t deal with operations at the same time, and so, there’s the need to create many CUDA streams to overlap transfer/Compute, but that OpenCL command queues offered that, without the need to create many of them.

Most drivers don’t implement out-of-order command queues.

The typical way to do what you want is to have separate command queues for the data transfer and compute and use events to create dependencies between the steps. Then, for each iteration, enqueue the following:

  1. upload of N+1
  2. processing of N
  3. download of N-1

Of course you’ll have to prime the pump before the first iteration and drain after the last.

This will enable overlap of transfer and compute.

It is exactly how it is shown in the NVIDIA examples, and I’ve seen it work on my device.

Could you just answer this too?

a) But it should have the double buffering too or not?

b) Thanks for that scheme, but that would only “work” if both 1. , 2. and 3. have similar time durations, right? If one of them is much bigger than the others, the pipeline is stalled at each step by that duration.

c) Regarding the events to create dependencies between the steps, I already had that implemented. Besides the dependencies of 2[N] after 1[N] and 3[N] after 2[N], it is also needed between 1[N] after 2[N-1] (so it doesn’t overwrite the input) and 2[N] after 3[N-1] so it doesn’t overwrite the output.

d) Could you point me to one of those examples (in C) where you have seen this implemented?