Sorting List & Counting Valid Entries

Hello,

I have a large 1D array of data I want to process (I am OK up to this point with OpenCL) After I have processed the data some of the elements will be valid and some not, I would like to then pack them into an output buffer so that the valid ones come first and the invalid ones follow. e.g.:

[1, 0, 1, 0, 1, 1, 0, 0] after sorting would be: [1, 1, 1, 1, 0, 0, 0, 0]
If possible I would like the number of the number of valid elements also.

I have seen some sorting examples in Nvidia’s OpenCL SDK and I guess that one of those might suit my needs as far as the sorting goes, though if anyone reading this had any other suggestions (Or have any links to simpler sorting examples) i would be very happy to hear them.

Regarding returning the number of valid elements, would this be something best suited to the CPU? I have read about atomic operations which seem to be on the right path for counting in OpenCL, but have heard it is slow.

The output buffer will be used for instancing in OpenGL so ideally i’d like to avoid reading back to the Host if possible.

Thanks for any help,

After I have processed the data some of the elements will be valid and some not, I would like to then pack them into an output buffer so that the valid ones come first and the invalid ones follow. e.g.:

[1, 0, 1, 0, 1, 1, 0, 0] after sorting would be: [1, 1, 1, 1, 0, 0, 0, 0]
If possible I would like the number of the number of valid elements also.

Good news! This problem is well researched. Google “parallel prefix sum” or “parallel scan opencl” and you should get some good explanations of what this is and how to implement it efficiently in OpenCL/CUDA.

Once you understand prefix sums you will see the packing operation you described above is trivially implemented on top of them (see this lecture for details).

Regarding returning the number of valid elements, would this be something best suited to the CPU? I have read about atomic operations which seem to be on the right path for counting in OpenCL, but have heard it is slow.

Atomics won’t be necessary. Prefix sums are computed very efficiently in GPUs so it won’t be necessary to use the CPU if you don’t want to.

Thank you very much, I’ll check out the links and post back if i have any questions.