Fill array

Hi everyone,

I was wondering if you guys are aware of any algorithm that can help me fill a global array that is smaller than the total number of threads…

This is what I’m trying to do:

Say you have an input buffer which is the same size as the total number of threads, and an output buffer with a smaller size.


__kernel void Kernel1(__global const int *input, __global int *output) {
	// One work-item per input element.
	int thread_id = get_global_id(0);

	if (input[thread_id] < 1000) {
		// Fill output buffer
	}
}

So this is the main idea. The problem is that the output buffer is smaller, so I have no idea how to synchronize all the threads that write to it… Some threads will not write at all.

Do you guys know of any way I can do this while still achieving some degree of performance? I’ve looked at the Scan and Reduction papers, but I’m not sure if they can help in a situation like this…

Regards

Not sure that I’ve understood what you are trying to do, but my understanding is that you want each thread to check its input value and, based on that value, perhaps write something to the output buffer. So you cannot determine beforehand whether a particular thread will write or not, and you want the output buffer to be packed at the end, i.e. all writes go to consecutive locations in memory.

There is an example in the NVIDIA CUDA SDK that does this when performing the marching cubes algorithm. In short, it runs the kernel twice: the first run outputs the number of elements each thread will produce, and the second run does the actual writing, using the output of the first run to know where each thread should write.
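
In case it helps, here is a rough sequential sketch in plain C of that two-pass idea as I understand it. It is not the SDK code: the function names (count_pass, exclusive_scan, write_pass, compact_below_threshold) are made up for illustration, and the "input < 1000" test is taken from the kernel in the question. I am also assuming that a prefix sum over the per-thread counts is what turns the first pass's output into write positions. On a GPU each loop would be its own kernel, with the loop index playing the role of the thread id, and the scan would be a parallel one.

```c
#include <assert.h>
#include <stdlib.h>

/* Pass 1: each "thread" i records how many elements it will write (0 or 1). */
static void count_pass(const int *input, int *counts, int n) {
    for (int i = 0; i < n; ++i)
        counts[i] = (input[i] < 1000) ? 1 : 0;
}

/* Exclusive prefix sum: offsets[i] = counts[0] + ... + counts[i-1].
   Returns the total number of elements that will be written. */
static int exclusive_scan(const int *counts, int *offsets, int n) {
    int running = 0;
    for (int i = 0; i < n; ++i) {
        offsets[i] = running;
        running += counts[i];
    }
    return running;
}

/* Pass 2: each "thread" that has something to write uses its own offset,
   so no two threads ever target the same output slot. */
static void write_pass(const int *input, const int *counts,
                       const int *offsets, int *output, int n) {
    for (int i = 0; i < n; ++i)
        if (counts[i])
            output[offsets[i]] = input[i];
}

/* Hypothetical driver: packs all input values below 1000 into output.
   Returns the number of values written, or -1 if allocation fails.
   output must have room for every value that passes the test. */
int compact_below_threshold(const int *input, int *output, int n) {
    assert(input != NULL && output != NULL && n >= 0);
    if (n == 0)
        return 0;

    int *counts = malloc((size_t)n * sizeof *counts);
    int *offsets = malloc((size_t)n * sizeof *offsets);
    if (counts == NULL || offsets == NULL) {
        free(counts);
        free(offsets);
        return -1;
    }

    count_pass(input, counts, n);
    int total = exclusive_scan(counts, offsets, n);
    write_pass(input, counts, offsets, output, n);

    free(counts);
    free(offsets);
    return total;
}
```

The returned total tells you how many slots of the output buffer were filled, and the packed values keep the same relative order as in the input. Since every writer gets a distinct offset, the write pass needs no synchronization between threads.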