I am running into a problem with non-blocking clEnqueueWriteBuffer when I use multiple contexts concurrently. Each program I run has one cl_context and one in-order cl_command_queue. Each program contains at least one GPU task, and tasks can run concurrently. Note that all tasks within one program share the same command queue. When I run multiple programs at the same time, some of them occasionally produce wrong output.
The following code is one GPU task:
cl_mem _clmem1 = clCreateBuffer(context, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);
cl_mem _clmem2 = clCreateBuffer(context, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);
clEnqueueWriteBuffer(queue, _clmem1, CL_FALSE, 0, n * sizeof(float), input, 0, NULL, NULL); //non-blocking write
clSetKernelArg(clkern, 0, sizeof(cl_mem), &_clmem1);
clSetKernelArg(clkern, 1, sizeof(cl_mem), &_clmem2);
size_t workdim[] = {N};
clEnqueueNDRangeKernel(queue, clkern, 1, NULL, workdim, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, _clmem2, CL_FALSE, 0, n * sizeof(float), output, 0, NULL, &eventout); //non-blocking read
do { //poll until the read has completed
clGetEventInfo(eventout, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &ret, NULL);
} while (ret != CL_COMPLETE);
//print output
clReleaseMemObject(_clmem1);
clReleaseMemObject(_clmem2);
I ran some experiments to find out what is wrong. The programs sometimes produce wrong results when both of these conditions hold:
- non-blocking write buffer
- multiple contexts (or command queues)
When only one program runs at a time (one context), the output is always correct. When multiple programs run concurrently but use a blocking write, the output is always correct as well.
To narrow the problem down further, I issued a blocking read of “_clmem1” before calling clEnqueueNDRangeKernel to see where the data changes. The data read back already differs from “input” before the kernel runs, while “input”, which resides in host memory, is still unchanged. Therefore, something goes wrong in clEnqueueWriteBuffer.
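The host-side half of that check can be sketched in plain C as follows. This is only a minimal sketch: the helper names first_mismatch and report_mismatch are hypothetical and not part of OpenCL, and the comparison is bitwise on the assumption that a buffer write followed by a read should return the exact same bytes.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns the index of the first element whose bytes
 * differ between the two arrays, or n if all n elements are identical.
 * memcmp is used so that the comparison is bitwise (NaNs compare equal
 * to themselves, +0.0f and -0.0f are treated as different). */
static size_t first_mismatch(const float *expected, const float *actual, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (memcmp(&expected[i], &actual[i], sizeof(float)) != 0)
            return i;
    }
    return n;
}

/* Hypothetical helper: prints a one-line report for the comparison.
 * Returns 1 if the arrays match and 0 otherwise. */
static int report_mismatch(const char *label, const float *expected,
                           const float *actual, size_t n)
{
    size_t i = first_mismatch(expected, actual, n);
    if (i == n) {
        printf("%s: all %zu elements match\n", label, n);
        return 1;
    }
    printf("%s: first mismatch at [%zu]: expected %g, got %g\n",
           label, i, (double)expected[i], (double)actual[i]);
    return 0;
}
```

In the task above, this would be called right after a blocking clEnqueueReadBuffer of “_clmem1” into a scratch array, with “input” as the expected data. Printing the index of the first mismatch may help show whether the whole buffer is wrong or only part of it.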
I tested further to see whether multiple contexts are really the cause, so I now run a single program that has one context but multiple command queues. It also sometimes generates wrong output.
I’m using OpenCL 1.0 in CUDA 3.2 on NVIDIA driver 260.19.36. My machine is Linux x86_64.
Thank you for reading this long description of the problem. I really appreciate any help. I would be very happy if anyone knows what is going on and can suggest how to make the programs work properly. Getting this to work is crucial for me.