Problems with multiple commandQueues

Hello everyone,

I’m struggling with multiple commandQueues using the OpenCL C++ bindings.

In a multi-GPU environment, I create a CommandQueue for each device. Each CommandQueue “contains” the following commands:

  1. Non-blocking enqueueWriteBuffer for input data 1
  2. Non-blocking enqueueWriteBuffer for input data 2
  3. Non-blocking enqueueWriteBuffer for input data 3
  4. Kernel execution
  5. Non-blocking enqueueReadBuffer for output data

Input data 1-3 are written to the same buffer. Each device gets different data, but the same kernel is executed.

When I run my code on two devices, NVIDIA’s Visual Profiler shows the following ordering of OpenCL calls:

  1. Non-blocking enqueueWriteBuffer for input data 1 to device 0 (CmdQueue 0)
  2. Non-blocking enqueueWriteBuffer for input data 2 to device 0 (CmdQueue 0)
  3. Non-blocking enqueueWriteBuffer for input data 3 to device 0 (CmdQueue 0)
  4. Kernel execution on device 0 (CmdQueue 0)
  5. Non-blocking enqueueReadBuffer for output data from device 0 (CmdQueue 0)
  6. Non-blocking enqueueWriteBuffer for input data 4 to device 1 (CmdQueue 1)
  7. Non-blocking enqueueWriteBuffer for input data 5 to device 1 (CmdQueue 1)
  8. Non-blocking enqueueWriteBuffer for input data 6 to device 1 (CmdQueue 1)
  9. Kernel execution on device 1 (CmdQueue 1)
  10. Non-blocking enqueueReadBuffer for output data from device 1 (CmdQueue 1)

With this serial ordering, however, the total runtime of my application on 2 devices is longer than on 1 device.

I would like the ordering to be as follows:

  1. Non-blocking enqueueWriteBuffer for input data 1 to device 0 (CmdQueue 0)
  2. Non-blocking enqueueWriteBuffer for input data 2 to device 0 (CmdQueue 0)
  3. Non-blocking enqueueWriteBuffer for input data 3 to device 0 (CmdQueue 0)
  4. Non-blocking enqueueWriteBuffer for input data 4 to device 1 (CmdQueue 1)
  5. Non-blocking enqueueWriteBuffer for input data 5 to device 1 (CmdQueue 1)
  6. Non-blocking enqueueWriteBuffer for input data 6 to device 1 (CmdQueue 1)
  7. Execute kernel on device 0 as soon as all data is available (CmdQueue 0)
  8. Execute kernel on device 1 as soon as all data is available (CmdQueue 1)
  9. Non-blocking enqueueReadBuffer for output data from device 0 as soon as data is calculated (CmdQueue 0)
  10. Non-blocking enqueueReadBuffer for output data from device 1 as soon as data is calculated (CmdQueue 1)

I tried to synchronize the commandQueues with the help of cl::Event as follows:


std::vector<cl::Event> events;
events.push_back(f()); //f is the kernel functor

In the method where I download the data:


cl::WaitForEvents(events);
//enqueueReadBuffer...
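
In plain C API terms, what I am doing corresponds roughly to this (a simplified sketch; most arguments are omitted and the event array name is just a placeholder):


cl_event kernel_done[2];

for(int i = 0; i < 2; ++i)
    clEnqueueNDRangeKernel(queue[i], kernel, ..., 0, NULL, &kernel_done[i]);

// later, in the method that downloads the data:
clWaitForEvents(2, kernel_done);

for(int i = 0; i < 2; ++i)
    clEnqueueReadBuffer(queue[i], output, ...);   // non-blocking read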

This does not seem to work, since I still get the unwanted ordering mentioned above.

How can I get the ordering I would like to have?

Any help is greatly appreciated!

Could there be something else at play? It should not be necessary to force the ordering of commands: since you have two command queues, they should run independently of each other and progress more or less at the same time.

Are you absolutely certain that the two command queues are associated with different devices? Can you double and triple check? clGetCommandQueueInfo() with CL_QUEUE_DEVICE should give us the answer.
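
For example, something along these lines should confirm it (just a quick sketch; I’m assuming your queues are in an array called queue):


cl_device_id dev[2];

for(int i = 0; i < 2; ++i)
{
    clGetCommandQueueInfo(queue[i], CL_QUEUE_DEVICE,
                          sizeof(cl_device_id), &dev[i], NULL);
    printf("queue %d -> device %p\n", i, (void*)dev[i]);
}

// dev[0] and dev[1] should be two different handles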

The other thing I can think of would be some blocking call like WaitForEvents(), a blocking read, clFinish() or similar after step 4 or 5. Can you verify through your profiler that commands 1-10 are CL_QUEUED one right after another without having to wait for any of the previous commands to be CL_COMPLETE?
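
If the profiler doesn’t give you that level of detail, you could also attach an event to each enqueued command and read the timestamps yourself, roughly like this (just a sketch; it assumes the queues were created with CL_QUEUE_PROFILING_ENABLE and that the command has reached CL_COMPLETE before you query):


cl_event ev;
cl_ulong queued, submitted, started, ended;

clEnqueueNDRangeKernel(queue[i], kernel, ..., 0, NULL, &ev);

// once the command is CL_COMPLETE:
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &queued, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &submitted, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &started, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ended, NULL);

Comparing the QUEUED timestamps of the commands within one queue would at least tell you whether they are being queued back to back or whether something blocks in between.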

Also, I’m a bit confused by descriptions like “enqueueWriteBuffer for input data 1”. Do you mean that “data 1” is one buffer object and “data 2” is a different buffer object?

Hi david.garcia,

thanks for the quick reply!

Are you absolutely certain that the two command queues are associated with different devices? Can you double and triple check? clGetCommandQueueInfo() with CL_QUEUE_DEVICE should give us the answer.

I checked this with clGetCommandQueueInfo, and I am certain that each command queue is associated with a different device.

The other thing I can think of would be some blocking call like WaitForEvents(), a blocking read, clFinish() or similar after step 4 or 5. Can you verify through your profiler that commands 1-10 are CL_QUEUED one right after another without having to wait for any of the previous commands to be CL_COMPLETE?

After each download (enqueueReadBuffer) there is a clFinish for each command queue. This clFinish is necessary to be able to merge the data (when using multiple devices, the input data is split up and after the calculation has to be put back together again). Do you think the problem could be here?

I could not verify that

commands 1-10 are CL_QUEUED one right after another without having to wait for any of the previous commands to be CL_COMPLETE

because I could not find such a fine-grained view in NVIDIA Visual Profiler.

Also, I’m a bit confused by descriptions like “enqueueWriteBuffer for input data 1”. Do you mean that “data 1” is one buffer object and “data 2” is a different buffer object?

Data 1 to 3 are simple arrays created on the host, which are all copied into the same buffer on the device. I hope this makes it clearer.

After each download (enqueueReadBuffer) there is a clFinish for each command queue. This clFinish is necessary to be able to merge the data (when using multiple devices, the input data is split up and after the calculation has to be put back together again). Do you think the problem could be here?

If I understood correctly, the code looks roughly like this?


for(int i = 0; i < 2; ++i)
{
    clEnqueueWriteBuffer(queue[i], input1[i], ...);
    clEnqueueWriteBuffer(queue[i], input2[i], ...);
    clEnqueueWriteBuffer(queue[i], input3[i], ...);

    // calls to clSetKernelArg() here

    clEnqueueNDRangeKernel(queue[i], kernel, ...);
    clEnqueueReadBuffer(queue[i], output, ...);
    clFinish(queue[i]);
}

If that’s the case, the call to clFinish() in the first iteration of the loop is forcing all commands in queue[0] to be complete before any of the commands for queue[1] are even enqueued.

Sorry if I misunderstood something. Please correct me if I’m wrong.

The code you posted does not exactly match mine.

It is more like this:


for(int i = 0; i < 2; ++i)
{
    clEnqueueWriteBuffer(queue[i], input1[i], ...);
    clEnqueueWriteBuffer(queue[i], input2[i], ...);
    clEnqueueWriteBuffer(queue[i], input3[i], ...);
}

for(int i = 0; i < 2; ++i)
{
    // calls to clSetKernelArg() here
    clEnqueueNDRangeKernel(queue[i], kernel, ...);
}

for(int i = 0; i<2; ++i)
{
    clEnqueueReadBuffer(queue[i], output, ...);
    clFinish(queue[i]);
}

In your version it is clear to me that queue 0 is completed before any command for queue 1 is even enqueued. That’s exactly why it puzzles me so much that my code above behaves the same way…

Thanks! Your app looks great from the OpenCL standard’s point of view. There must be something else going on inside the NVIDIA driver that is deciding to serialize the execution.

I would contact them with the version of the drivers you are using, the exact model of the GPUs you are using, whether they are connected in SLI, and the OS on your system.

I just looked over this thread again and noticed a small mistake on my part.

In my version of the “rough” code, the last for-loop is not exactly how it is in my program. In my program this for-loop is split into two loops.

So, here is the complete, corrected rough code:


for(int i = 0; i < 2; ++i)
{
    clEnqueueWriteBuffer(queue[i], input1[i], ...);
    clEnqueueWriteBuffer(queue[i], input2[i], ...);
    clEnqueueWriteBuffer(queue[i], input3[i], ...);
}

for(int i = 0; i < 2; ++i)
{
    // calls to clSetKernelArg() here
    clEnqueueNDRangeKernel(queue[i], kernel, ...);
}

for(int i = 0; i<2; ++i)
{
    clEnqueueReadBuffer(queue[i], output, ...);
}

for(int i = 0; i<2; ++i)
{
    clFinish(queue[i]);
}

But since clEnqueueReadBuffer is non-blocking in my program, there should be no difference from the OpenCL spec’s point of view, I think!?!

But since clEnqueueReadBuffer is non-blocking in my program, there should be no difference from the OpenCL spec’s point of view, I think!?!

Even if clEnqueueReadBuffer() were blocking, there would be no difference: the code is correct either way.

If the queue execution is lazy, there may be nothing before the finish that triggers any execution at all. By doing a read and then a blocking flush (clFinish), there is nothing to trigger the second queue’s execution until the first queue has completed.

Does tweaking it to the following help?

for(int i = 0; i < 2; ++i)
{
    clEnqueueReadBuffer(queue[i], output, ...);
    clFlush(queue[i]);
}

for(int i = 0; i < 2; ++i)
{
    clFinish(queue[i]);
}

The spec doesn’t say that anything has to leave the queue until a flush.

I feel dumb now :) LeeHowes is right.

Alternatively you could do this:


cl_event final_event[2];

for(int i = 0; i < 2; ++i)
{
    clEnqueueWriteBuffer(queue[i], input1[i], ...);
    clEnqueueWriteBuffer(queue[i], input2[i], ...);
    clEnqueueWriteBuffer(queue[i], input3[i], ...);

    // calls to clSetKernelArg() here

    clEnqueueNDRangeKernel(queue[i], kernel, ...);
    clEnqueueReadBuffer(queue[i], output, ..., &final_event[i]);
}
clWaitForEvents(2, final_event);

clWaitForEvents() performs an implicit flush.

Hey,

thanks for sticking with the problem.

Neither clFlush nor clWaitForEvents solved the problem… The commandQueues are still executed one after another…

Oh, what a bummer! Did you get any attention from NVIDIA’s customer support? Please let us know if this ever gets resolved.

I just posted a thread in NVIDIA’s OpenCL forum. I will keep you up to date!

I just had an idea the other day:
Is it possible that I am implicitly synchronizing my CommandQueues without noticing it? Since they are executed one after another every single time, could that be a possibility? How could I recognize such a synchronization?