Loop inside kernel generating different results

Hey everyone!

I’m still learning the details of OpenCL, but I can write some basic programs. However, I’m stuck on an issue that shows up when I add a “for” loop inside a kernel. Up to that point, results are consistent across the kernel and across repetitions (multiple calls to the kernel execution). Once I add the last part of the kernel, the “for” loop, the results are wrong. The kernel code is the following:

#define Npart 400

// Resample: build vector of cumulative offspring of particles
__kernel void resample( __global float *prob, __global int *offspring, __global float *u){

	int gid = get_global_id(0);
	
	float r = Npart*prob[gid];

	offspring[gid] = min(Npart,r + u[gid]);
}

// Conversion: build vector of ancestors given cumulative offspring of particles
__kernel void conversion( __global int *offspring, __global int *index){

	int gid = get_global_id(0);
	int start;
	int offs;
	
	if(gid == 0){
		start = 0;
	}
	else{
		start = offspring[gid-1];
	}
	offs = offspring[gid] - start;
	
	for(int i = 0; i<offs; i++)
		index[start+i] = gid;
}

Kernels are executed sequentially (first the “resample” kernel, then the “conversion” kernel), and the final loop is the part of the code giving problems. The final values of “index” end up in the wrong positions, and the output changes from one call of the pair of kernels to the next. I understand that the memory accesses in this case are not “coalesced”, since some items of the “offspring” buffer write several copies of their value to “index” while others write none.

If you need more information, please let me know. Any help here is useful.

Thanks in advance!

If I’m correct, you have a concurrent write in that for loop. This has to be an atomic operation to be correct.
The result of concurrent writes to the same memory location is not defined.
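
To make the failure mode concrete: work-item gid in “conversion” writes the slots from offspring[gid-1] up to offspring[gid]-1. Those ranges are guaranteed to be disjoint only if offspring never decreases; wherever it does decrease, a later work-item can start writing inside a range that an earlier one also writes. A rough way to check this, assuming you read the offspring buffer back to the host, is the small plain C helper below. The name first_decrease is made up for illustration; it is not part of OpenCL or of the original code.

```c
#include <stddef.h>

/* Hypothetical host-side helper (not part of the original kernels).
 * Work-item gid in "conversion" writes index[offspring[gid-1]] through
 * index[offspring[gid]-1], so the write ranges are guaranteed disjoint
 * only if offspring is non-decreasing.
 * Returns the first gid with offspring[gid] < offspring[gid-1],
 * or -1 if the sequence never decreases. */
int first_decrease(const int *offspring, size_t n)
{
    for (size_t gid = 1; gid < n; gid++) {
        if (offspring[gid] < offspring[gid - 1])
            return (int)gid;
    }
    return -1;
}
```

If this returns -1, the loop in “conversion” has no overlapping writes and the problem is probably elsewhere; otherwise the returned position is a good place to start looking.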

Thanks for your reply! That is true: I just double-checked the results of the kernel and there is a concurrent write in the loop. I’m rewriting the code in order to avoid this issue.

Thanks again!

I might need some extra help here: I’ve been thinking about how to rewrite the last “for” loop of the “conversion” kernel, but the only idea I had seems to be quite inefficient. To summarize, consider:

  1. offs is a sequence of numbers giving how many times each entry is repeated in the output vector.
  2. offspring is a cumulative sequence of starting points used to build the output vector.

The idea I had is to build a kernel that is executed max(offs) times, passing the values of offspring as arguments. In each execution, the kernel operates over all items with a common value of offs. It doesn’t sound good, but that’s the best I got.

Anybody with another idea?

Thanks in advance again!
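
One alternative worth considering, sketched here under the assumption that offspring really is a non-decreasing cumulative count: turn the scatter into a gather. Instead of each particle writing a variable number of slots, launch one work-item per output slot j and let it find its own ancestor, i.e. the smallest gid with offspring[gid] > j, using a binary search. Every slot is then written by exactly one work-item, so there are no concurrent writes and no loop of data-dependent length. The code below is plain C with a serial loop standing in for the NDRange, so treat it as a sketch of the idea rather than a tuned kernel; find_ancestor and conversion_gather are illustrative names, not existing API.

```c
#include <stddef.h>

/* Ancestor of output slot j: the smallest i with offspring[i] > j.
 * Assumes offspring[0..n-1] is a non-decreasing cumulative count.
 * Returns -1 if no particle covers slot j (j >= offspring[n-1]). */
int find_ancestor(const int *offspring, int n, int j)
{
    int lo = 0, hi = n;                 /* search window [lo, hi) */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (offspring[mid] > j)
            hi = mid;                   /* mid may be the answer */
        else
            lo = mid + 1;               /* answer is to the right */
    }
    return (lo < n) ? lo : -1;
}

/* Serial stand-in for the kernel launch: in an OpenCL kernel, j would be
 * get_global_id(0) and the loop would disappear. Each slot of index is
 * written exactly once. */
void conversion_gather(const int *offspring, int *index, int n)
{
    for (int j = 0; j < n; j++)
        index[j] = find_ancestor(offspring, n, j);
}
```

For example, with offspring = {0, 2, 2, 5, 5} this fills index with {1, 1, 3, 3, 3}, which is what the original loop produces when no writes overlap. Each work-item does about log2(Npart) reads of offspring, and slots that no particle covers are marked with -1 instead of being left uninitialized.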