I am writing a program that runs a statistical test over a large space of problems, where the result of each computation is discarded unless it is higher than a specific threshold. For the statisticians: I am doing a chi-square test and calculating the p-value. Since the space is so large and could require almost a petabyte in total data transfers, it would pay to take advantage of the high memory bandwidth available on GPUs. What I’m having difficulty figuring out is how the host can query the device so that it can read from the output buffer. Concretely, if an ordering is imposed on the space so that every test has an index n, then I could have two output buffers,
__kernel void statTest(
    __global float *output_p_values,  // p-values that passed the threshold
    __global int *output_indices,     // index n of the test behind each p-value
    ...)
{
    // kernel program here
}
where the i-th entries of output_p_values and output_indices would hold the p-value of the n-th test and the index n itself, respectively. It seems like I could use an atomic counter that is incremented each time a value passes the threshold, and use its value as the slot i to write to. What is not clear is how to transfer the data from the device to the host when the buffer is full or when all the computations are complete. I’ve read about clEnqueueMapBuffer, but it’s not clear whether it stops kernel operations while the host reads a buffer from the device. How can I implement this?
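To make the atomic-counter idea concrete, here is a minimal sketch of what I have in mind, written as plain C11 that emulates the work-items serially on the host rather than as real OpenCL code. Everything in it is an assumption for illustration: ResultBuffer, fake_p_value, stat_test_item and run_batch are hypothetical names, and fake_p_value is only a stand-in for the chi-square computation. In an actual kernel I would expect the slot reservation to be an atomic_inc on a __global counter instead of atomic_fetch_add.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical pair of output buffers plus the shared counter. */
typedef struct {
    float       *p_values;  /* p-values that passed the threshold */
    unsigned    *indices;   /* index n of the test behind each p-value */
    unsigned     capacity;  /* number of slots in each array */
    atomic_uint  count;     /* append attempts so far; may exceed capacity */
} ResultBuffer;

static void result_buffer_init(ResultBuffer *out, float *p, unsigned *idx,
                               unsigned capacity)
{
    assert(capacity > 0);
    out->p_values = p;
    out->indices  = idx;
    out->capacity = capacity;
    atomic_init(&out->count, 0u);
}

/* Stand-in for the real chi-square p-value (hypothetical). */
static float fake_p_value(unsigned n)
{
    return (float)(n % 10) / 10.0f;
}

/* Body of one work-item: run test n and append the result if it passes.
 * Returns 1 if stored, 0 if discarded, -1 if the buffer was already full. */
static int stat_test_item(ResultBuffer *out, unsigned n, float threshold)
{
    float p = fake_p_value(n);
    if (p <= threshold)
        return 0;                                   /* not above threshold */

    /* Reserve a unique slot; the old counter value is the slot index i. */
    unsigned slot = atomic_fetch_add(&out->count, 1u);
    if (slot >= out->capacity)
        return -1;                                  /* full: result is lost */

    out->p_values[slot] = p;
    out->indices[slot]  = n;
    return 1;
}

/* Serial stand-in for one kernel launch over tests [first, first + n_items).
 * Returns the number of valid entries the host could then read back. */
static unsigned run_batch(ResultBuffer *out, unsigned first, unsigned n_items,
                          float threshold)
{
    atomic_store(&out->count, 0u);                  /* reset before launch */
    for (unsigned n = first; n < first + n_items; ++n)
        stat_test_item(out, n, threshold);

    unsigned attempts = atomic_load(&out->count);
    return attempts < out->capacity ? attempts : out->capacity;
}
```

In this sketch the counter keeps counting past capacity, so after a batch the host can tell that results were dropped (attempts greater than capacity) and rerun that range with a larger buffer or a smaller batch. Whether the host can safely read the buffers while a kernel is still writing to them is exactly the part I am unsure about.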