
Thread: clEnqueueReadBuffer is incredibly slow when called infrequently

  1. #1
    Join Date
    Jun 2018

    clEnqueueReadBuffer is incredibly slow when called infrequently

    Details: I'm on OS X, Iris Pro GPU - and I'm fairly new to OpenCL.

    I have a few different buffers created through clCreateBuffer and some kernel tasks that operate on them.

    What I am trying to do is run my kernel tasks as many times as I can within 1/60th of a second, and then copy one of the buffers to host memory so that I can render the result. I don't want to copy the buffer out after every kernel run, since the copy is only needed once per frame.

    Here is the weird thing. If I call clEnqueueReadBuffer() every time after running my kernel code, it takes about 6 milliseconds to complete. However, if I run my kernel code in a loop until 1/60 of a second has elapsed (so many iterations) and then call clEnqueueReadBuffer(), it takes about 4 to 5 SECONDS to complete.

    Why is this happening, and how can I avoid this massive hit?

    Incidentally, I'm actually using the EasyCL wrapper, so this is what is actually being called.

    void CLWrapper::copyToHost() {
        if (!onDevice) {
            throw std::runtime_error("copyToHost(): not on device");
        }

        cl_event event = NULL;

        // CL_TRUE makes this a blocking read: it returns only after the copy has completed.
        error = clEnqueueReadBuffer(*(cl->queue), devicearray, CL_TRUE, 0, getElementSize() * N, getHostArray(), 0, NULL, &event);
        cl_int err = clWaitForEvents(1, &event);
        if (err != CL_SUCCESS) {
            throw std::runtime_error("wait for event on copytohost failed with " + easycl::toString(err));
        }
        deviceDirty = false;
    }

  2. #2
    Senior Member
    Join Date
    Dec 2011
    When you say "if I run my kernel code in a loop until 1/60 seconds have elapsed", what you are really doing is _enqueueing_ as many kernels as you can in 1/60 of a second. Enqueueing a kernel returns before the kernel has run, so in that time you can queue up far more than 1/60 of a second of GPU work. Then you try to read the results back, but the read has to wait for all of the queued kernels to finish. If you really only want 1/60 second of kernel execution, you need to use events to track kernel completion and stop enqueueing when you're approaching 1/60 second, then enqueue your read.

    Also, I've had issues with using events on macOS, so maybe just put clFinish after every handful of kernel enqueues. This will block the CPU thread until the enqueued kernels have finished.
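
A minimal sketch of the "clFinish after every handful of enqueues" idea, written as plain C++ with no OpenCL calls so the control flow is easy to see. The names `run_frame`, `enqueue_kernel`, `finish_queue`, `read_back`, and the `batch` parameter are all hypothetical and not part of OpenCL or EasyCL; in real code the three callables would presumably wrap the kernel enqueue, `clFinish` on the queue, and the blocking read.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: submit kernels in small batches until the frame
// budget is spent, draining the queue after every batch so the backlog
// never grows beyond `batch` kernels. Returns the number of launches.
template <class Enqueue, class Finish, class Read>
std::size_t run_frame(std::chrono::steady_clock::duration budget,
                      std::size_t batch,
                      Enqueue enqueue_kernel,
                      Finish finish_queue,
                      Read read_back) {
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + budget;
    std::size_t launched = 0;

    do {
        // Enqueue a handful of kernels; these calls are assumed to return
        // immediately without waiting for the device.
        for (std::size_t i = 0; i < batch; ++i) {
            enqueue_kernel();
            ++launched;
        }
        // Block until the batch has actually executed, so the elapsed
        // time measured below reflects device work, not just enqueueing.
        finish_queue();
    } while (clock::now() < deadline);

    // The queue is empty at this point, so the read only pays for the copy.
    read_back();
    return launched;
}
```

With this shape the frame can overrun its budget by at most one batch, so a smaller `batch` gives tighter timing at the cost of more synchronization stalls; the right value would have to be found by measurement on the target GPU.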
