Can I stream data from the device to the host?

I am writing a program that runs a statistical test over a large space of problems, where the result of each computation is thrown away if it is not higher than a specific threshold. For the statisticians: I am doing a chi-square test and calculating the p-value. Since the space I am working on is so large that it could require almost a petabyte of total data transfers, it would be profitable to take advantage of the high memory bandwidth available on GPUs. What I'm having difficulty figuring out is how the host can query the device so that it can read from the output buffer. By this I mean: if a specific ordering is imposed on the space, so that each computation analyzes the n-th member of the space, then I could have two output buffers,


__kernel void statTest(
    __global float *output_p_values,
    __global int   *output_indices,
    ...)
{
    // kernel program here
}

where the i-th entry of output_p_values would contain the p-value of the n-th test, and the i-th entry of output_indices would contain n, the index of that test. It seems like I could use an atomic counter that is incremented each time a value passes the threshold. What is not clear is how I can move data from the device to the host when the buffer is full, or when all the computations are complete. I've read about clEnqueueMapBuffer, but it's not clear whether it stops kernel operations while reading a buffer from the device. How can I implement this?
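
To make this concrete, here is a rough sketch of the kind of kernel I have in mind; computePValue, threshold, and max_results are placeholder names I made up for illustration, and the actual chi-square computation is stubbed out:

// Placeholder for the real chi-square test; returns a dummy p-value.
float computePValue(int n)
{
    return (float)(n % 100) / 100.0f;   // real computation goes here
}

__kernel void statTest(
    __global float *output_p_values,   // p-values of the tests that passed
    __global int   *output_indices,    // index n of each passing test
    __global int   *result_count,      // atomic counter, zeroed by the host
    const float     threshold,
    const int       max_results)       // capacity of the output buffers
{
    const int n = (int)get_global_id(0);   // which test this work item runs
    const float p = computePValue(n);

    if (p > threshold) {
        int slot = atomic_inc(result_count);   // reserve a slot (OpenCL 1.1 atomics)
        if (slot < max_results) {              // drop results past capacity
            output_p_values[slot] = p;
            output_indices[slot]  = n;
        }
    }
}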

Only two things can stop a kernel: The kernel completes, or the OS kills it because it takes more than a few seconds to run. That’s it.

The kind of “interactive” thing you’re looking for is not currently an OpenCL feature.

You also can’t make long running kernels that stay resident on the GPU just computing away. The OS will kill them after a few seconds. The GPU is a shared resource and it is needed for UI.

I’d suggest instead dividing your work into chunks that each complete in tens of milliseconds. You can keep track of how many results have been produced using a counter in a buffer: kernels can update it, and in between kernel executions the host can read it. But they can’t both access it at the same time, so you need to arbitrate access in the host code.

So something like:

  1. Let’s say approximately 5% of the work items produce results and the result buffer can hold 10 results.
  2. Zero the results counter and result buffer.
  3. Queue up 50 work items.
  4. Map the counter and result buffer and copy out the results (on average about 2 or 3 at a 5% pass rate, well under the buffer’s capacity of 10).
  5. Loop back to step 2.
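
In host code, that loop might look roughly like the sketch below. The names queue, counterBuf, pValueBuf, statTestKernel, total_tests, and saveResult are assumptions I’ve made up; kernel-argument setup and error checking are omitted. Note that clEnqueueMapBuffer is only called after clFinish, so no kernel is touching the buffers at that point.

/* Assumed to already exist: an OpenCL context and queue, statTestKernel
   with its arguments set, counterBuf (one cl_int), pValueBuf
   (MAX_RESULTS floats), total_tests, and a host-side saveResult(). */
enum { CHUNK_SIZE = 50, MAX_RESULTS = 10 };

cl_int zero = 0;
for (size_t offset = 0; offset < total_tests; offset += CHUNK_SIZE) {
    /* Step 2: zero the results counter. */
    clEnqueueWriteBuffer(queue, counterBuf, CL_TRUE, 0, sizeof(cl_int),
                         &zero, 0, NULL, NULL);

    /* Step 3: queue up one chunk of work items; the global work offset
       (OpenCL 1.1+) makes get_global_id(0) produce the test index n. */
    size_t global = CHUNK_SIZE;
    clEnqueueNDRangeKernel(queue, statTestKernel, 1, &offset, &global,
                           NULL, 0, NULL, NULL);
    clFinish(queue);   /* kernels are done; the host now owns the buffers */

    /* Step 4: map the counter and result buffer and copy out the results
       (output_indices would be mapped the same way, omitted here). */
    cl_int *count = (cl_int *)clEnqueueMapBuffer(queue, counterBuf, CL_TRUE,
        CL_MAP_READ, 0, sizeof(cl_int), 0, NULL, NULL, NULL);
    float *pvals = (float *)clEnqueueMapBuffer(queue, pValueBuf, CL_TRUE,
        CL_MAP_READ, 0, MAX_RESULTS * sizeof(float), 0, NULL, NULL, NULL);

    for (cl_int i = 0; i < *count && i < MAX_RESULTS; ++i)
        saveResult(offset, i, pvals[i]);   /* hypothetical host-side store */

    clEnqueueUnmapMemObject(queue, pValueBuf, pvals, 0, NULL, NULL);
    clEnqueueUnmapMemObject(queue, counterBuf, count, 0, NULL, NULL);
    /* Step 5: loop back and run the next chunk. */
}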

In the future you could take advantage of OpenCL “pipes” for this, but they are an OpenCL 2.0 feature and there are no implementations yet.