Wondering when I should use clFlush or clFinish.

My kernels take about 5 seconds to run with a clFinish() after each of them is enqueued. When I removed all the clFinish() calls, it took only 2.2 seconds and the results were exactly the same. I only use a single command queue; in this case, do I have to call clFinish or clFlush?

The spec doesn’t seem to explain how a command queue works in detail. According to it, although clEnqueueReadBuffer performs an implicit flush, there is no guarantee that the queue will be complete after clFlush returns. That suggests to me that clFinish() has to be called anyway, to ensure all the tasks in the queue are finished before calling clEnqueueReadBuffer to transfer the data back to the CPU.

So could anyone tell me why I still got correct results after all the clFinish() calls had been removed? Is it just an accident, or is this the right way to use OpenCL?

Thanks in advance.

The spec doesn’t seem to explain how a command queue works in detail. According to it, although clEnqueueReadBuffer performs an implicit flush, there is no guarantee that the queue will be complete after clFlush returns.

Quick rule of thumb: for most applications it’s not necessary to call clFlush() or clFinish() at all. Doing a final blocking call to clEnqueueReadBuffer() or clEnqueueMapBuffer() to read back your data is enough.

Only blocking calls to clEnqueueReadBuffer() (and similar) perform an implicit flush. They also guarantee that the command will be complete before the call returns to the application. See these snippets from the spec:

“Any blocking commands queued in a command-queue and clReleaseCommandQueue perform an implicit flush of the command-queue. These blocking commands are clEnqueueReadBuffer, clEnqueueReadBufferRect, clEnqueueReadImage, with blocking_read set to CL_TRUE; clEnqueueWriteBuffer, clEnqueueWriteBufferRect, clEnqueueWriteImage with blocking_write set to CL_TRUE; clEnqueueMapBuffer, clEnqueueMapImage with blocking_map set to CL_TRUE; or clWaitForEvents.” [5.13]
In other words, only blocking reads implicitly flush the queue. Non-blocking reads do not flush the queue.

“If blocking_read is CL_TRUE i.e. the read command is blocking, clEnqueueReadBuffer does not return until the buffer data has been read and copied into memory pointed to by ptr.” [5.2.2]
That is, when a blocking call to clEnqueueReadBuffer() has returned to the application, the event associated with that command is already complete.

Think of the command queue as a shopping list.

You don’t write one item down, go to the shop, find it, buy it, bring it back, pack it in the fridge, then sit down and add another item to the list and repeat, do you? It would take forever, but this is precisely what your first test case is doing. You’re adding the ‘travelling time’ on top of the ‘finding and buying time’ for every item.

clEnqueue*() - adding an item to the bottom of the list.
clFlush() - leaving the house to go to the shop.
clFinish() - returning to the house with a basket of stuff.

If you just write down all the items on the list, and then go to the shop, you only have to count the travel time once. This grouping happens at every stage - e.g. you use a basket in the shop so you only have to check out once, a car so you can carry it all, etc.

If you’re doing a really large amount of work, clFinish() can send the kids down on a bicycle with a partial shopping list to get started whilst you’re still busy completing the list or doing other things. If you break it up properly and have enough to do (and enough kids), you can end up with everyone being busy along the ‘pipeline’, fully utilising every part of the system. e.g. someone at the checkout, someone scanning the shelves, a few people in transit in each direction, etc. It might take a while to get the first result, but after that you get a steady ‘stream’ of ‘stuff’ coming back - at a rate much higher than if you did it in individual trips, even with a car, and with higher efficiency to boot.

clFinish() can send the kids down on a bicycle with a partial shopping list to get started whilst you’re still busy completing the list or doing other things

In that scenario you want a clFlush(), not a clFinish(). A clFinish() won’t return to the app until all work has finished completely, thereby preventing the app from doing work concurrently while the GPU is busy. clFlush() does allow both the app and the device to work at the same time.

Even then, for beginners it’s better to stick to the simple rule of never calling either clFlush() or clFinish(). I’ve never seen an example in this forum of a clFinish() that was truly necessary nor a clFlush() that was justified.

Thanks a lot.
It seems to me that a blocking call to clEnqueueReadBuffer() only ‘blocks’ itself, but cannot make other tasks, which are already in the command queue, blocking. So is there any possibility that the ReadBuffer operation starts before all the other tasks are finished?

Cheers

Thanks for your simile. So my situation is this: I just wrote down a shopping list, without leaving the house to shop or returning with a basket, and the things I wanted to buy turned up in my house automatically.

Ahh yeah, just a typo there.

It seems to me that a blocking call to clEnqueueReadBuffer() only ‘blocks’ itself, but cannot make other tasks, which are already in the command queue, blocking. So is there any possibility that the ReadBuffer operation starts before all the other tasks are finished?

When the application calls clCreateCommandQueue() it selects whether the queue will have in-order or out-of-order execution. By default queues have in-order execution, which means that if you enqueue a command A and later a command B it is guaranteed that B will not start running until A is finished. This is explained in the glossary and section 5.1.

In other words, enqueuing a sequence of commands and ending with a blocking ReadBuffer will behave in a sensible way.

Fair enough. This answers exactly what confused me. Thanks.

Short question: say I want to call the same kernel multiple times, for whatever reason. Do I have to call clFinish between the NDRange kernel enqueues, or am I guaranteed that the previous kernel will have finished before the next one (here, the same kernel) starts executing?