clEnqueueReadBuffer is incredibly slow when called infrequently

Details: I’m on OS X with an Iris Pro GPU, and I’m fairly new to OpenCL.

I have a few different buffers created through clCreateBuffer and some kernel tasks that operate on them.

What I am trying to do is run my kernel tasks as many times as I can within 1/60th of a second, and then copy one of the buffers to host memory so that I can render the result. I don’t want to copy the buffer out after every kernel run, since there’s no need to do that more often than the frame rate requires.

Here is the weird thing: if I call clEnqueueReadBuffer() after every kernel run, it takes about 6 milliseconds to complete. However, if I run my kernel code in a loop until 1/60th of a second has elapsed (so, many iterations) and then call clEnqueueReadBuffer(), it takes about 4 to 5 SECONDS to complete.
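For reference, the driving loop looks roughly like this; runKernel() and renderFrame() are simplified placeholders for my actual EasyCL and rendering calls, not real functions from the wrapper:

#include <chrono>

void runKernel();  // placeholder: enqueues one kernel pass via EasyCL

// Sketch of the frame loop (placeholder names). The slow case is the one
// shown: enqueue kernels until the frame budget runs out, then read once.
void renderFrame(CLWrapper *wrapper) {
    using clock = std::chrono::steady_clock;
    const auto frameBudget = std::chrono::duration<double>(1.0 / 60.0);
    const auto start = clock::now();

    while (clock::now() - start < frameBudget) {
        runKernel();            // enqueues one kernel pass on the same queue
    }
    wrapper->copyToHost();      // single read of the result buffer for rendering
}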

Why is this happening, and how can I avoid this massive hit?

Incidentally, I’m using the EasyCL wrapper, so this is what is actually being called:

void CLWrapper::copyToHost() {
    if(!onDevice) {
        throw std::runtime_error("copyToHost(): not on device");
    }
    //cl->finish();

    cl_event event = NULL;

    // Blocking read: copies the device buffer into the host array
    error = clEnqueueReadBuffer(*(cl->queue), devicearray, CL_TRUE, 0, getElementSize() * N, getHostArray(), 0, NULL, &event);
    cl->checkError(error);
    cl_int err = clWaitForEvents(1, &event);
    clReleaseEvent(event);
    if (err != CL_SUCCESS) {
        throw std::runtime_error("wait for event on copytohost failed with " + easycl::toString(err));
    }
    deviceDirty = false;
}

When you say “if I run my kernel code in a loop until 1/60 seconds have elapsed”, what you are really doing is enqueueing as many kernels as you can in 1/60th of a second. The enqueue calls return almost immediately; they don’t wait for the kernels to execute, so the queue builds up far more work than the GPU can finish in a frame. Then when you try to read the results back, the blocking read has to wait for all of those queued kernels to finish first, which is where the seconds go. If you really only want 1/60th of a second of kernel execution, you need to use events to track kernel completion and stop enqueueing when you’re approaching 1/60th of a second, then enqueue your read, roughly as sketched below.
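A minimal sketch of that event-paced approach, assuming a plain cl_command_queue, a ready-to-run cl_kernel, and a 1-D global work size (the names and omitted error checks are illustrative, not part of EasyCL):

#include <OpenCL/opencl.h>   // <CL/cl.h> on other platforms
#include <chrono>

void runForOneFrame(cl_command_queue queue, cl_kernel kernel, size_t gws,
                    cl_mem deviceBuf, void *hostBuf, size_t bytes) {
    using clock = std::chrono::steady_clock;
    const auto budget = std::chrono::duration<double>(1.0 / 60.0);
    const auto start = clock::now();

    while (clock::now() - start < budget) {
        cl_event done = NULL;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, &done);
        // Wait for this kernel to actually finish before checking the clock;
        // otherwise the loop only measures how fast kernels can be enqueued,
        // not how much work the GPU has completed.
        clWaitForEvents(1, &done);
        clReleaseEvent(done);
    }

    // No backlog of queued kernels remains, so the blocking read is cheap.
    clEnqueueReadBuffer(queue, deviceBuf, CL_TRUE, 0, bytes, hostBuf, 0, NULL, NULL);
}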

Also, I’ve had issues with using events on macOS, so an alternative is to just call clFinish() after every handful of kernel enqueues. That blocks the CPU thread until the enqueued kernels have finished.
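The same sketch with clFinish() instead of per-kernel events (again, hypothetical names, error checks omitted, and the batch size of 8 is arbitrary):

#include <OpenCL/opencl.h>   // <CL/cl.h> on other platforms
#include <chrono>

void runForOneFrameWithFinish(cl_command_queue queue, cl_kernel kernel, size_t gws,
                              cl_mem deviceBuf, void *hostBuf, size_t bytes) {
    using clock = std::chrono::steady_clock;
    const auto budget = std::chrono::duration<double>(1.0 / 60.0);
    const auto start = clock::now();

    int enqueued = 0;
    while (clock::now() - start < budget) {
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
        // Every few enqueues, drain the queue so the timing check reflects
        // work the GPU has actually completed.
        if (++enqueued % 8 == 0) {
            clFinish(queue);
        }
    }
    clFinish(queue);  // make sure nothing is still queued before the read
    clEnqueueReadBuffer(queue, deviceBuf, CL_TRUE, 0, bytes, hostBuf, 0, NULL, NULL);
}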

I’ve noticed this issue too. If it’s an issue, that is. I don’t see how this would fix it, Dithermaster. Can someone explain this in a bit more detail, please?