Deadlock occurs with N threads, N command queues & N Kernels

Hi there,

I have a fairly simple setup: one program with one kernel that is called from many threads. At each call site, the thread acquires a command queue and a kernel object that are not used by any other thread. It then allocates a read-only buffer and a write-only buffer, as shown in the code below. After a while the application locks up, with all threads stopped in WaitForSingleObject (Windows 7). Most, if not all, of the threads are inside clEnqueueWriteBuffer or clReleaseMemObject. The latter surprises me, as I would expect clReleaseMemObject to do very little. Is there anything wrong with the pattern I am using? I am using AMD OpenCL.

Thx,
S.


		program = getProgram( programId ); // returns the already-created program
		kernel = aquireKernel( program, kernelID ); // only one thread can use this kernel object at any one time

		entryCount = width * height;
		inputSize = entryCount * channelCount * sizeof( inType );
		outputSize = entryCount * channelCount * sizeof( outType );

		commandQueue = aquireCommandQueue(); // only one thread can use this command queue at any one time

		err = CL_SUCCESS;

		inMem = clCreateBuffer( openclContext, CL_MEM_READ_ONLY, inputSize, NULL, &err );
		err |= clEnqueueWriteBuffer( commandQueue, inMem, CL_TRUE, 0, inputSize, in, 0, NULL, NULL ); // often locks up
		assert( err == CL_SUCCESS );

		outMem = clCreateBuffer( openclContext, CL_MEM_WRITE_ONLY, outputSize, NULL, &err );

		err |= clSetKernelArg( kernel,  0, sizeof(cl_mem), &inMem );
		err |= clSetKernelArg( kernel,  1, sizeof(cl_mem), &outMem );
		err |= clEnqueueNDRangeKernel( commandQueue, kernel, 1, NULL, &entryCount, NULL, 0, NULL, NULL );

		// actual enqueue call == render
		err |= clFinish( commandQueue );

		// read buffer back
		err |= clEnqueueReadBuffer( commandQueue, outMem, CL_TRUE, 0, outputSize, out, 0, NULL, NULL );
		assert( err == CL_SUCCESS );

		err |= clReleaseMemObject( outMem ); // often locks up
		err |= clReleaseMemObject( inMem );

		releaseCommandQueue( commandQueue );

		releaseKernel( kernelID, kernel );

I have tried different things, some of which seem to help:

1. Use thread-local storage so that there is a 1-to-1 mapping between thread and command queue. Buffers are still allocated and deallocated after running the kernels.
   - Did not seem to make a difference.
2. Use thread-local storage so that there is a 1-to-1 mapping between thread and command queue, but re-use the buffers instead of deallocating them.
   - Runs a lot longer but eventually crashes trying to do a copy deep inside amdocl64.dll. I was not able to track down the source of the memory.
3. Use thread-local storage and wrap the entire code block in a critical section.
   - That seems to work, but it completely destroys the parallelism. Performance becomes horrendous.

At this point, I think it will be better to have a single thread do the scheduling and use events to figure out when the work is done.
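As a minimal sketch of that single-scheduler idea, the host-side threading could look roughly like the following. It uses POSIX threads rather than Windows primitives, and every name in it (`Scheduler`, `Job`, `scheduler_submit_and_wait`, `demo_square`) is hypothetical. The `run` callback is a placeholder for the OpenCL work; it has not been tried against the AMD runtime.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical job record, one per render request. In a real application
 * run() would issue the clEnqueue* calls on the scheduler's command queue. */
typedef struct Job {
    void (*run)(void *arg);  /* work executed on the scheduler thread */
    void *arg;
    int done;                /* set to 1 once run() has returned */
    struct Job *next;
} Job;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t has_work; /* signalled when a job is queued or on stop */
    pthread_cond_t job_done; /* broadcast whenever a job completes */
    Job *head, *tail;
    int stopping;
    pthread_t thread;
} Scheduler;

/* Body of the single scheduler thread: pop jobs in FIFO order and run them. */
void *scheduler_main(void *p)
{
    Scheduler *s = p;
    pthread_mutex_lock(&s->lock);
    for (;;) {
        while (s->head == NULL && !s->stopping)
            pthread_cond_wait(&s->has_work, &s->lock);
        if (s->head == NULL)         /* stopping and the queue is drained */
            break;
        Job *job = s->head;
        s->head = job->next;
        if (s->head == NULL)
            s->tail = NULL;
        pthread_mutex_unlock(&s->lock);
        job->run(job->arg);          /* only this thread talks to the device */
        pthread_mutex_lock(&s->lock);
        job->done = 1;
        pthread_cond_broadcast(&s->job_done);
    }
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

int scheduler_start(Scheduler *s)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->has_work, NULL);
    pthread_cond_init(&s->job_done, NULL);
    s->head = s->tail = NULL;
    s->stopping = 0;
    return pthread_create(&s->thread, NULL, scheduler_main, s);
}

/* Called from any worker thread: queue the job, then block until it has run. */
void scheduler_submit_and_wait(Scheduler *s, Job *job)
{
    assert(job->run != NULL);
    job->done = 0;
    job->next = NULL;
    pthread_mutex_lock(&s->lock);
    if (s->tail)
        s->tail->next = job;
    else
        s->head = job;
    s->tail = job;
    pthread_cond_signal(&s->has_work);
    while (!job->done)
        pthread_cond_wait(&s->job_done, &s->lock);
    pthread_mutex_unlock(&s->lock);
}

/* Let queued jobs finish, then join the scheduler thread. */
void scheduler_stop(Scheduler *s)
{
    pthread_mutex_lock(&s->lock);
    s->stopping = 1;
    pthread_cond_signal(&s->has_work);
    pthread_mutex_unlock(&s->lock);
    pthread_join(s->thread, NULL);
    pthread_mutex_destroy(&s->lock);
    pthread_cond_destroy(&s->has_work);
    pthread_cond_destroy(&s->job_done);
}

/* Stand-in for the real work: squares an int in place. */
void demo_square(void *arg)
{
    int *v = arg;
    *v = *v * *v;
}
```

A worker thread would fill in a `Job` on its own stack, call `scheduler_submit_and_wait`, and read the results once it returns. In this sketch the scheduler runs each job to completion before taking the next one; overlapping work on the device would need `run` to enqueue non-blocking commands and mark the job done from an OpenCL event, which is not shown here.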

Thoughts?

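The command queue is a host-side entity, and in my experience you shouldn't really need more than one of them. The OpenCL API calls are thread-safe (http://www.khronos.org/message_boards/showthread.php/6788-Multiple-host-threads-with-single-command-queue-and-device). From OpenCL 1.1 onward the main exception is clSetKernelArg, which is only safe when concurrent calls operate on different cl_kernel objects, as yours do. You should be able to push all of the work through one shared command queue.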

This thread might help with more detail on how thread-safe the OpenCL APIs are: http://www.khronos.org/message_boards/showthread.php/6788-Multiple-host-threads-with-single-command-queue-and-device