Details: I’m on OS X with an Iris Pro GPU, and I’m fairly new to OpenCL.
I have a few different buffers created through clCreateBuffer and some kernel tasks that operate on them.
What I am trying to do is run my kernel tasks as many times as I can within 1/60th of a second, and then copy one of the buffers to host memory so that I can render the result. I don’t want to copy the buffer out after every kernel run, since one copy per frame is all that rendering needs.
Here is the weird thing. If I call clEnqueueReadBuffer() after every run of my kernel code, it takes about 6 milliseconds to complete. However, if I run my kernel code in a loop until 1/60th of a second has elapsed (so, many iterations) and then call clEnqueueReadBuffer(), it takes about 4 to 5 SECONDS to complete.
Why is this happening, and how can I avoid this massive hit?
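To make the second pattern concrete, here is a minimal sketch of how the frame loop is structured. It is not my actual code: runFrame is a hypothetical helper, and the two callbacks are stand-ins for enqueueing my kernel tasks and for the EasyCL copyToHost() call.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper (not part of EasyCL or OpenCL): call `runKernels`
// repeatedly until `budget` has elapsed, then call `copyToHost` exactly once.
// Returns the number of times `runKernels` was called.
inline int runFrame(
    const std::function<void()>& runKernels,
    const std::function<void()>& copyToHost,
    std::chrono::steady_clock::duration budget =
        std::chrono::duration_cast<std::chrono::steady_clock::duration>(
            std::chrono::duration<double>(1.0 / 60.0))) {
    // Both callbacks must be set; calling an empty std::function throws.
    assert(runKernels && copyToHost);

    const auto deadline = std::chrono::steady_clock::now() + budget;
    int iterations = 0;
    do {
        runKernels();  // stand-in for enqueueing my kernel tasks
        ++iterations;
    } while (std::chrono::steady_clock::now() < deadline);

    copyToHost();      // stand-in for CLWrapper::copyToHost(), once per frame
    return iterations;
}
```

The loop only measures host-side wall-clock time, so the iteration count reflects how many times the kernel calls returned within the budget.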
Incidentally, I’m using the EasyCL wrapper, so this is the code that actually gets called:
void CLWrapper::copyToHost() {
    if (!onDevice) {
        throw std::runtime_error("copyToHost(): not on device");
    }
    //cl->finish();
    cl_event event = NULL;
    error = clEnqueueReadBuffer(*(cl->queue), devicearray, CL_TRUE, 0, getElementSize() * N, getHostArray(), 0, NULL, &event);
    cl->checkError(error);
    cl_int err = clWaitForEvents(1, &event);
    clReleaseEvent(event);
    if (err != CL_SUCCESS) {
        throw std::runtime_error("wait for event on copytohost failed with " + easycl::toString(err));
    }
    deviceDirty = false;
}