clEnqueueMarker

I have modified an example program from Episode Six to insert extensive error checking. I finally got it to execute under Mac OS X 10.6.3
and inserted this code:

	cl_event kernelEvents;
	if(clEnqueueMarker(cmd_queue, &kernelEvents) != CL_SUCCESS)
		printf("\nclEnqueueMarker failed!\n\n");
	else
		printf("\nclEnqueueMarker succeeded!\n\n");
	cl_command_type kEventinfo;
	size_t kEventinfoRet;
	int printcount = 0;
	for(int i = 0;i < itmax;i++) {
		do {
			++printcount;
			clGetEventInfo(kernelEvents, CL_EVENT_COMMAND_TYPE, (size_t)sizeof(kEventinfo), &kEventinfo, (size_t *) &kEventinfoRet);
		} while(kEventinfo != CL_COMMAND_MARKER);
	}
	clReleaseEvent(kernelEvents);
	printf("Printcount = %d\n", printcount);

I hoped to prove that I could do other program housekeeping while the GPU completed its task, but the program seems to block.
The clEnqueueMarker call succeeds, but things seem to be locked up until the GPU is done. I thought it might just be that display output was
locked out, but printcount is only incremented once.

I have tried creating the command queue with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, but that fails with a message that it is
a valid property but unsupported. My iMac has an ATI 4670, for which the device stats report “Commands execute in order”, so that is probably
the reason clCreateCommandQueue refuses to accept the property.
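
For anyone wanting to check this in code rather than from the stats listing, here is a minimal sketch. It assumes the device's CL_DEVICE_QUEUE_PROPERTIES bitfield is the right thing to look at; the helper names (supports_out_of_order, pick_queue_properties, device_queue_properties) and the USE_OPENCL switch are made up for illustration, and the stand-in definitions are only there so it compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stddef.h>

#ifdef USE_OPENCL
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#else
/* Stand-ins so the sketch builds without OpenCL; values mirror cl.h. */
typedef unsigned long long cl_command_queue_properties;
#define CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE (1 << 0)
#define CL_QUEUE_PROFILING_ENABLE (1 << 1)
static_assert(sizeof(cl_command_queue_properties) == 8,
              "stand-in should be 64 bits wide like cl_bitfield");
#endif

/* Nonzero if the device's queue-properties bitfield advertises out-of-order execution. */
int supports_out_of_order(cl_command_queue_properties device_props)
{
    return (device_props & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) != 0;
}

/* Properties to request from clCreateCommandQueue: the out-of-order bit only
   when the device advertises it, otherwise 0 (a plain in-order queue). */
cl_command_queue_properties pick_queue_properties(cl_command_queue_properties device_props)
{
    if (supports_out_of_order(device_props))
        return CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
    return 0;
}

#ifdef USE_OPENCL
/* Reads the bitfield from a real device; returns 0 (nothing supported) on error. */
cl_command_queue_properties device_queue_properties(cl_device_id dev)
{
    cl_command_queue_properties props = 0;
    if (clGetDeviceInfo(dev, CL_DEVICE_QUEUE_PROPERTIES,
                        sizeof(props), &props, NULL) != CL_SUCCESS)
        return 0;
    return props;
}
#endif
```

Passing the result of pick_queue_properties to clCreateCommandQueue would at least avoid the creation failure on an in-order device, though it does nothing about the blocking itself.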

I also tried passing a cl_event to clEnqueueNDRangeKernel, but that did not prevent the blocking.

It seems hard to believe that this implementation demands the equivalent of a clFinish after every command if the GPU does not support
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE!

What am I doing wrong here? Any help would be appreciated.

Jim

I don’t know if it’s your problem, but note that queuing a command does not actually execute it.
You have to flush your command queue to be sure that execution of all queued commands begins.
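
A rough sketch of that pattern, flush first and then poll while doing other work, is below. Note that the loop in the original post queries CL_EVENT_COMMAND_TYPE, which only reports what kind of command the event belongs to (CL_COMMAND_MARKER for a marker event, from the moment it is created), so it cannot show progress; CL_EVENT_COMMAND_EXECUTION_STATUS is the query whose value changes as the command runs. The names poll_until_complete, status_fn, event_status and fake_status are made up for illustration, and the OpenCL part is only compiled when USE_OPENCL is defined (also an assumption of this sketch), so the loop can be tried without a GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: returns >0 while the command is pending, 0 when it
   is complete, <0 on error. This follows OpenCL's own ordering, where
   CL_COMPLETE is 0 and CL_RUNNING, CL_SUBMITTED, CL_QUEUED are positive. */
typedef int (*status_fn)(void *ctx);

/* Polls status(ctx) until it reports completion, calling housekeeping()
   (if given) after every poll that is still pending.
   Returns the number of polls made, or -1 if status() reported an error. */
long poll_until_complete(status_fn status, void *ctx, void (*housekeeping)(void))
{
    long polls = 0;
    assert(status != NULL);
    for (;;) {
        int s = status(ctx);
        ++polls;
        if (s < 0)
            return -1;
        if (s == 0)
            return polls;
        if (housekeeping != NULL)
            housekeeping();
    }
}

#ifdef USE_OPENCL
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

/* Real status source: ctx points at a cl_event. Intended use, after the
   kernel and marker are queued:
       clFlush(cmd_queue);
       poll_until_complete(event_status, &kernelEvents, do_housekeeping);
   where do_housekeeping is whatever host work should overlap the GPU. */
int event_status(void *ctx)
{
    cl_int s = 0;
    cl_int err = clGetEventInfo(*(cl_event *)ctx,
                                CL_EVENT_COMMAND_EXECUTION_STATUS,
                                sizeof(s), &s, NULL);
    return (err != CL_SUCCESS) ? -1 : (int)s;
}
#endif

/* Simulated status source: ctx points at an int holding how many more polls
   should report "pending" before the command counts as complete. */
int fake_status(void *ctx)
{
    int *remaining = (int *)ctx;
    if (*remaining > 0) {
        --*remaining;
        return 1;
    }
    return 0;
}
```

With fake_status and a counter of 3, the loop polls four times: three pending results followed by one complete. Whether the host really gets any time between polls on a given driver is a separate question.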

I thought of that, but the program was executing. Just to try to even out the delay, I went back and put in a clFlush after each command was queued. This had very little effect, since most of the time was spent in the actual kernel execution.

The OpenCL documentation on this website says

“clFlush only guarantees that all queued commands to command_queue get issued to the appropriate device. There is no guarantee that they will be complete after clFlush returns.”
This implies that clFlush should submit the queued commands to the device and then return immediately. Of clFinish it says:
“clFinish does not return until all queued commands in command_queue have been processed and completed.”

That is a distinct difference in operation, but it is not what happens with the implementation in Mac OS X 10.6.3 on the ATI Radeon 4670. I have proven that today.

During the execution of the program there is a 4 second delay while the GPU executes the kernel. I have print statements before and after clFlush so that I can see the delay. One caveat: the number of print statements that actually appear before the delay varies from run to run, probably because they occur on a different thread, but the 4 second delay is easily detected.

Using the debugger I inserted a breakpoint on the clFlush line and the 4 second delay did not occur before the breakpoint but did occur immediately after I pressed continue. I removed that breakpoint and inserted one on the line after the clFlush. Then the 4 second delay would occur before the breakpoint.
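
The same check can be done without the debugger by timing the call directly. This is only a sketch under a few assumptions: seconds_in_call, timed_fn, flush_queue, finish_queue and bump_counter are made-up names, the USE_OPENCL switch is hypothetical, and timespec_get needs a C11 library.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type for the call being timed. */
typedef void (*timed_fn)(void *ctx);

/* Wall-clock seconds spent inside fn(ctx). Wall time is used rather than
   clock(), because the host thread may be idle while it waits on the GPU
   and idle time does not show up as CPU time. */
double seconds_in_call(timed_fn fn, void *ctx)
{
    struct timespec t0, t1;
    assert(fn != NULL);
    timespec_get(&t0, TIME_UTC);
    fn(ctx);
    timespec_get(&t1, TIME_UTC);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

#ifdef USE_OPENCL
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

/* ctx points at a cl_command_queue in both wrappers. */
void flush_queue(void *ctx)  { clFlush(*(cl_command_queue *)ctx); }
void finish_queue(void *ctx) { clFinish(*(cl_command_queue *)ctx); }
#endif

/* Stand-in workload for trying the timer without a GPU: bumps a counter. */
void bump_counter(void *ctx)
{
    ++*(int *)ctx;
}
```

Timing flush_queue and then finish_queue on the same queued work would show in which of the two calls the 4 seconds are spent.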

So clFlush() is blocking program execution just as if it were a clFinish() on my iMac. That is going to make OpenCL very limiting. The only way I see around this is to queue up small kernels which execute in about a second, but the downside will be a lot of overhead in moving the data back and forth between the CPU and the GPU.

Unless I am missing something, and I wish I were.

Jim H

The problem is perhaps Apple’s driver.
Their drivers seem to be a bit buggy, and I hope the next Apple update will fix this.