I have a large 1D array of data that I want to process (I am OK with OpenCL up to this point). After processing, some of the elements will be valid and some will not. I would then like to pack them into an output buffer so that the valid ones come first and the invalid ones follow, e.g.:

[1, 0, 1, 0, 1, 1, 0, 0] after sorting would be: [1, 1, 1, 1, 0, 0, 0, 0]
If possible, I would also like the number of valid elements.
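
To make the desired result concrete, here is a minimal CPU reference sketch in plain C of what I am after. The function name `pack_valid_first` and the convention that a non-zero element means "valid" are only assumptions for illustration; my real validity test is different, and this is not how I expect the GPU version to be written.

```c
#include <stddef.h>

/* Hypothetical CPU reference for the behaviour I want on the GPU.
 * Assumption: an element is "valid" when it is non-zero.
 * Copies valid elements to the front of `out`, invalid ones after them,
 * and returns the number of valid elements. */
size_t pack_valid_first(const int *in, int *out, size_t n)
{
    size_t valid = 0;

    /* First pass: valid elements go to the front, keeping their order. */
    for (size_t i = 0; i < n; ++i) {
        if (in[i] != 0) {
            out[valid++] = in[i];
        }
    }

    /* Second pass: invalid elements fill the remainder of the buffer. */
    size_t tail = valid;
    for (size_t i = 0; i < n; ++i) {
        if (in[i] == 0) {
            out[tail++] = in[i];
        }
    }

    return valid;
}
```

For the example input above, this fills the output with the four valid elements followed by the four invalid ones and returns a count of 4.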

I have seen some sorting examples in Nvidia's OpenCL SDK, and I guess one of those might suit my needs as far as the sorting goes. If anyone reading this has other suggestions (or links to simpler sorting examples), I would be very happy to hear them.

Regarding returning the number of valid elements, is this something best suited to the CPU? I have read about atomic operations, which seem to be on the right track for counting in OpenCL, but I have heard they are slow.
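
For reference, my understanding of the atomic approach is roughly the following, sketched here in plain C11 with `<stdatomic.h>` rather than as an actual OpenCL kernel. The function name `append_if_valid` is made up, "valid" again just means non-zero, and I am assuming each work-item would run this body once, with the increment done by something like OpenCL's `atomic_inc` on a counter in global memory.

```c
#include <stdatomic.h>

/* Hypothetical per-element step, standing in for one work-item.
 * Assumption: an element is "valid" when it is non-zero.
 * A valid element atomically reserves the next free output slot,
 * so the shared counter ends up holding the number of valid elements. */
void append_if_valid(int value, int *out, atomic_uint *counter)
{
    if (value != 0) {
        /* atomic_fetch_add returns the counter value before the increment. */
        unsigned int slot = atomic_fetch_add(counter, 1u);
        out[slot] = value;
    }
}
```

As far as I can tell, this would give me the count for free, but the valid elements would land in whatever order the work-items happen to run, and every valid element would contend on the same counter, which I assume is where the slowness comes from.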

The output buffer will be used for instancing in OpenGL, so ideally I'd like to avoid reading back to the host if possible.

Thanks for any help,