Hey everyone!

I'm still learning the details of OpenCL, but I can write some basic programs. However, I'm stuck on an issue that appears when I add a "for" loop inside a kernel. Up to that point, results are consistent across the kernel and across repetitions (multiple calls to the kernel execution). Once I add the last part of the kernel, the "for" loop, the results are wrong. The kernel code is the following:

Code :
#define Npart 400
// Resample: build vector of cumulative offspring of particles
__kernel void resample( __global float *prob, __global int *offspring, __global float *u){
	int gid = get_global_id(0);
	float r = Npart*prob[gid];
	offspring[gid] = min(Npart,r + u[gid]);
}
// Conversion: build vector of ancestors given cumulative offspring of particles
__kernel void conversion( __global int *offspring, __global int *index){
	int gid = get_global_id(0);
	int start;
	int offs;
	if(gid == 0){
		start = 0;
	}else{
		start = offspring[gid-1];
	}
	offs = offspring[gid] - start;
	for(int i = 0; i<offs; i++)
		index[start+i] = gid;
}

The kernels are executed sequentially (first the "resample" kernel, then the "conversion" kernel), and the final loop is the part of the code that causes problems. The final values of "index" end up in the wrong positions, and the output changes from one call of the kernel pair to the next. I understand that the memory accesses in this case are not coalesced, since some items of the "offspring" buffer write several copies of their value to "index" while others write none.
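For comparison, a serial version of the two steps can be run on the host and checked against the buffers read back from the device. The following is only a minimal sketch in plain C, not the kernels themselves. The names `resample_ref`, `conversion_ref` and `offspring_is_valid` are hypothetical. It assumes that `prob` holds cumulative probabilities, that the number of work-items equals Npart (passed here as `n`), and that the float result is truncated when stored as an int. The last function checks a property the "conversion" loop seems to rely on: the write ranges only stay disjoint if "offspring" is non-decreasing.

```c
/* Serial sketch of the "resample" kernel (assumption: n plays the role of Npart).
   Each entry is n*prob[i] + u[i], capped at n and truncated to int. */
static void resample_ref(const float *prob, const float *u, int *offspring, int n)
{
    for (int i = 0; i < n; i++) {
        float r = (float)n * prob[i] + u[i];
        offspring[i] = (r < (float)n) ? (int)r : n;
    }
}

/* Serial sketch of the "conversion" kernel: "work-item" gid writes its own id
   into index[start .. offspring[gid]-1], where start is the previous entry. */
static void conversion_ref(const int *offspring, int *index, int n)
{
    for (int gid = 0; gid < n; gid++) {
        int start = (gid == 0) ? 0 : offspring[gid - 1];
        int offs = offspring[gid] - start;
        for (int i = 0; i < offs; i++)
            index[start + i] = gid;
    }
}

/* Returns 1 if offspring is non-decreasing, stays within [0, n] and ends at n.
   Under those conditions the ranges written by different work-items do not
   overlap and together cover index[0 .. n-1]; otherwise returns 0. */
static int offspring_is_valid(const int *offspring, int n)
{
    int prev = 0;
    for (int i = 0; i < n; i++) {
        if (offspring[i] < prev || offspring[i] > n)
            return 0;
        prev = offspring[i];
    }
    return prev == n;
}
```

For example, with n = 4, prob = {0.5, 0.5, 1.0, 1.0} and every u equal to 0.5, this sketch gives offspring = {2, 2, 4, 4} and index = {0, 0, 2, 2}. An offspring vector such as {2, 1, 4, 4} fails the check, because work-items 0 and 2 would both write index[1], and on the device the order of those two writes is not defined.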

If you need more information, please let me know. Any help here is useful.

Thanks in advance!