Hi everyone,

I was wondering if you guys are aware of any algorithm that can help me fill a global array that is smaller than the total number of threads.

This is what I'm trying to do:

Say you have an input buffer which is the same size as the total number of threads, and an output buffer with a smaller size.

Code :
__kernel void Kernel1(__global const int *input, __global int *output) {
	// One work-item per input element
	int thread_id = get_global_id(0);
	if(input[thread_id] < 1000) {
		// Fill output buffer (which index to write to is the open question)
	}
}

So this is the main idea. The problem is that the output buffer is smaller, so I have no idea how to synchronize all the threads that write to it. Some threads won't write anything at all.
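To make the goal concrete, here is a plain sequential C version of the result I'm after. This is only a sketch: the function name `compact_below`, the `capacity` parameter, and the idea that the output should be packed with no gaps are my own assumptions for illustration, not part of the kernel above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sequential reference, not the kernel itself.
 * Copies every input value below `threshold` into `output`,
 * packed from index 0 with no gaps, and returns how many
 * values were written. Values that do not fit are dropped. */
size_t compact_below(const int *input, size_t n, int threshold,
                     int *output, size_t capacity)
{
    assert(input != NULL && output != NULL);

    size_t count = 0; /* next free slot in output */
    for (size_t i = 0; i < n; ++i) {
        if (input[i] < threshold && count < capacity) {
            output[count++] = input[i];
        }
    }
    return count;
}
```

On the GPU the hard part is `count`: every thread that passes the test needs its own unique slot, and here that only works because the loop runs one element at a time.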

Do you guys know of any way I can do this while still getting reasonable performance? I've looked at the Scan and Reduction papers, but I'm not sure whether they can help in a situation like this.
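For what it's worth, this is my rough understanding of how a scan might fit, written sequentially in C so it is easy to check. It is a sketch under my own assumptions: the helper names `exclusive_scan_flags` and `scatter` are made up, and I don't know how well this maps onto a real parallel scan on the device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: exclusive prefix sum over 0/1 flags.
 * offsets[i] = number of elements before i that pass the test,
 * which is exactly the output slot element i would use.
 * Returns the total number of passing elements. */
size_t exclusive_scan_flags(const int *input, size_t n, int threshold,
                            size_t *offsets)
{
    assert(input != NULL && offsets != NULL);

    size_t running = 0;
    for (size_t i = 0; i < n; ++i) {
        offsets[i] = running;
        running += (input[i] < threshold) ? 1 : 0;
    }
    return running;
}

/* Hypothetical helper: each iteration stands in for one thread.
 * Iterations are independent because every passing element
 * already has a unique slot in offsets[]. */
void scatter(const int *input, size_t n, int threshold,
             const size_t *offsets, int *output, size_t capacity)
{
    for (size_t i = 0; i < n; ++i) {
        if (input[i] < threshold && offsets[i] < capacity) {
            output[offsets[i]] = input[i];
        }
    }
}
```

If this is right, the threads would not need to synchronize on the output at all in the second step, and the cost moves into computing the scan itself.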