Non-blocking write buffer problem with multiple contexts

I encounter a problem with non-blocking clEnqueueWriteBuffer when I use multiple contexts concurrently. Within each program I run, there is one in-order-execution cl_command_queue and one cl_context. Each program has at least one GPU task, and tasks can run concurrently; note that tasks within one program share the same command queue. When I run multiple programs at the same time, some of them sometimes generate wrong outputs.

The following code is one gpu task:


cl_int err, ret;
cl_event eventout;
cl_mem _clmem1 = clCreateBuffer(context, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);
cl_mem _clmem2 = clCreateBuffer(context, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);
clEnqueueWriteBuffer(queue, _clmem1, CL_FALSE, 0, n * sizeof(float), input, 0, NULL, NULL); // non-blocking write
clSetKernelArg(clkern, 0, sizeof(cl_mem), &_clmem1);
clSetKernelArg(clkern, 1, sizeof(cl_mem), &_clmem2);
size_t workdim[] = {N};
clEnqueueNDRangeKernel(queue, clkern, 1, 0, workdim, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, _clmem2, CL_FALSE, 0, n * sizeof(float), output, 0, NULL, &eventout); // non-blocking read
do {
    clGetEventInfo(eventout, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &ret, NULL);
} while (ret != CL_COMPLETE);
// print output
clReleaseMemObject(_clmem1);
clReleaseMemObject(_clmem2);

I ran some experiments to find out what’s wrong. The programs sometimes misbehave when both of these conditions hold:

  1. non-blocking write buffer
  2. multiple contexts (or command queues)
    When only one program runs at a time (one context), it always does the right thing. When multiple programs run concurrently but with blocking write, they always do the right thing as well.

To narrow down the problem a bit more, I call a blocking read on “_clmem1” before calling clEnqueueNDRangeKernel to see where the data gets changed. I find that the buffer contents already differ from “input” before the kernel runs, while “input” itself, which resides in host memory, is still unchanged. Therefore, something goes wrong in clEnqueueWriteBuffer.
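The read-back check above boils down to a host-side comparison; a minimal sketch (the OpenCL blocking read itself is elided, and the function name is mine):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Compare a buffer read back from the device against the original
 * host data. Returns true when the contents are identical. */
static bool buffers_match(const float *readback, const float *input, size_t n)
{
    return memcmp(readback, input, n * sizeof(float)) == 0;
}
```

In the experiment above, this comparison failing before the kernel has ever run is what points at the write, not the kernel.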

I tested more to see whether it’s really because of the multiple contexts, so now I run only one program that has one context but multiple command queues. It also sometimes generates a wrong output.

I’m using OpenCL 1.0 in CUDA 3.2 on NVIDIA driver 260.19.36. My machine is Linux x86_64.

Thank you so much for reading this long description of the problem. I really appreciate any attempt to help. I’ll be very happy if anyone knows what’s going on and can suggest how to make the programs work properly. It’s crucial for me to get this working.

First, the term “task” has a different meaning in OpenCL. What you describe are multiple commands, such as clEnqueueWriteBuffer, clEnqueueNDRangeKernel, etc.

Second, there is no synchronization between different contexts. If you use multiple contexts, commands will be executed in any order.

Also, there is no synchronization between different command queues even if they are in the same context. This is a simplification but it’s good enough for today.

I strongly recommend calling clFinish() before calling clGetEventInfo(). That way you don’t need to call clGetEventInfo() inside a loop.

Finally, is it possible that you are freeing variable “input” right after calling clEnqueueWriteBuffer()? When do you free that variable?
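With a non-blocking write, the runtime may still read “input” after the call returns, so the host buffer must outlive the transfer. A minimal sketch of that ownership rule in plain C, with the event-completion check reduced to a flag (all names here are mine):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Tracks a host buffer handed to a non-blocking transfer. The buffer
 * may only be freed once the transfer has signalled completion. */
struct pending_write {
    float *host_ptr;   /* buffer passed to clEnqueueWriteBuffer */
    bool   complete;   /* set when the write's event reaches CL_COMPLETE */
};

/* Frees the host buffer only if the transfer is done.
 * Returns true when it freed, false when the caller must wait. */
static bool try_release(struct pending_write *w)
{
    if (!w->complete)
        return false;      /* too early: the runtime may still read it */
    free(w->host_ptr);
    w->host_ptr = NULL;
    return true;
}
```

With a blocking write (CL_TRUE) this bookkeeping is unnecessary, because the call only returns after the data has been copied out of the host buffer.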

Thank you so much for responding.

Thanks for clarification, and sorry for misusing the term.

Yes, I do know that. However, each program has its own context, and the programs are independent, so I don’t need any kind of synchronization between them.

Same thing, when I use different command queues on the same context, they are completely independent, or else I explicitly pass the events as event_wait_list in the function.

The code I showed above is only part of my program. The real code is extremely complicated. Anyhow, the reason I use clGetEventInfo is that while the event is not complete, the CPU thread can go do something else. If I used clFinish(), it would behave like a blocking read, which defeats the purpose of what I’m trying to do.

I’m pretty sure I don’t, because after everything finishes (after clEnqueueReadBuffer is complete), I print the input, and it still has the same values.

However, each program has its own context, and the programs are independent, so I don’t need any kind of synchronization between them.

Are we talking about different applications here? That is, different processes? When you say “program” it makes me think of OpenCL program object.

If there’s an issue where running one process works fine but running multiple processes causes one of them to fail, that sounds like a bug in the driver. You may want to contact your hardware vendor; they will need a short application that reproduces the issue.

Anyhow, the reason I use clGetEventInfo is that when the event is not complete, the cpu thread can go do something else while waiting.

That’s cool. You will still need a call to clFlush() because otherwise there’s no guarantee that the event will ever complete. Your code could enter an infinite loop.
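The flush-then-poll pattern can be sketched with the event query abstracted into a callback, so the loop itself is visible. In real code the callback would wrap clGetEventInfo(…, CL_EVENT_COMMAND_EXECUTION_STATUS, …), and clFlush(queue) must be issued before entering the loop; all names below are mine:

```c
#include <stdbool.h>

/* Status values mirroring OpenCL's execution states. */
enum status { STATUS_RUNNING, STATUS_COMPLETE };

/* Polls a status source until it reports completion, invoking
 * do_other_work() between polls so the CPU thread stays busy.
 * Returns the number of polls it took. Without a prior clFlush(),
 * the enqueued commands may never be submitted to the device and
 * this loop would never terminate. */
static int poll_until_complete(enum status (*get_status)(void *ctx),
                               void (*do_other_work)(void *ctx),
                               void *ctx)
{
    int polls = 0;
    while (get_status(ctx) != STATUS_COMPLETE) {
        polls++;
        if (do_other_work)
            do_other_work(ctx);   /* overlap CPU work with the GPU */
    }
    return polls;
}

/* Example usage: a fake "event" that completes after three rounds
 * of other work, standing in for a real OpenCL event. */
static int fake_remaining = 3;
static enum status fake_status(void *ctx)
{
    (void)ctx;
    return fake_remaining == 0 ? STATUS_COMPLETE : STATUS_RUNNING;
}
static void fake_work(void *ctx)
{
    (void)ctx;
    fake_remaining--;
}
```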

The above code that I showed is only parts of my program. The real code is extremely complicated.

Can you reproduce the issue even with the simplified code you showed us? The problem could be somewhere else.

Yeah, sorry I didn’t make that clear. By program, I mean application. One application has one context and one command queue (it is multi-threaded, but only one thread, the GPU manager thread, makes GPU calls). When I run one application, it works fine, but when I run more than one, some of them generate wrong outputs.

I see. I’ll do that.

I haven’t tried, but I should. That’s a very good suggestion. I’ll let you know what the outcome is. Thanks a lot.

It’s fixed. The bug is on my side. Nothing to do with gpu. Thank you so much for helping anyway.