Newbie question about lock step

Hello, I’m just starting to program with OpenCL. I’m running it on my CPU because I don’t currently have a compatible GPU.

I think my question should be easy to answer.

Suppose I have an array of 4 integers [0, 2, 0, 0] and a kernel function like this:


__kernel void
tst(__global int *s,
	__global int *answer)
{
	int gid = get_global_id(0);
	/* gid > 0 keeps work-item 0 from reading s[-1] */
	if(gid > 0 && s[gid] == 0 && s[gid - 1] > 0) {
		s[gid] = s[gid - 1];
	}
	answer[gid] = s[gid];
}

I expected the 4 elements to be evaluated in parallel, giving [0, 2, 2, 0]. Instead I get [0, 2, 2, 2], as if the elements were processed sequentially.

What am I doing wrong?

Many thanks for your help

On a CPU device, an OpenCL implementation will typically run many of the work-items one after another on the same core, so seeing sequential behaviour is not surprising. (If you have multiple cores, some work-items may run in parallel, but you can’t know which ones.)

It looks like your program relies on a particular ordering of execution and memory accesses, that is, on synchronization between work-items (e.g., the result for work-item 2 depends on whether work-item 1 ran before or after it). OpenCL only permits such synchronization between work-items of the same work-group, not across all work-items, and you must use explicit barrier calls to enforce it.

Basically, your code has a data race: the result differs depending on the order in which the work-items execute, and the only control OpenCL gives you over that order is explicit synchronization between work-items in a single work-group. If you rewrote your code under any other threading model, you’d see similarly unpredictable results.
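
To make the order dependence concrete, here is a minimal sketch in plain C (not OpenCL) that emulates the four work-items with an ordinary loop. The names item_in_place, item_out_of_place and run_items are hypothetical, and the sketch guards gid > 0 so that the first work-item never reads s[-1]. It only illustrates the idea and is not how any OpenCL runtime actually schedules work-items.

```c
#include <assert.h>

#define N 4

/* Racy body, mirroring the kernel in the question: reads AND writes s in place. */
static void item_in_place(int *s, int *answer, int gid)
{
	assert(gid >= 0 && gid < N);
	if (gid > 0 && s[gid] == 0 && s[gid - 1] > 0)
		s[gid] = s[gid - 1];
	answer[gid] = s[gid];
}

/* Race-free body: s is read-only and each work-item writes only answer[gid]. */
static void item_out_of_place(const int *s, int *answer, int gid)
{
	assert(gid >= 0 && gid < N);
	int v = s[gid];
	if (gid > 0 && v == 0 && s[gid - 1] > 0)
		v = s[gid - 1];
	answer[gid] = v;
}

/*
 * Emulate N work-items on the input [0, 2, 0, 0].
 * in_place selects the racy or the race-free body;
 * ascending selects the (emulated) execution order of the work-items.
 */
static void run_items(int in_place, int ascending, int answer[N])
{
	int s[N] = {0, 2, 0, 0};
	for (int i = 0; i < N; i++) {
		int gid = ascending ? i : N - 1 - i;
		if (in_place)
			item_in_place(s, answer, gid);
		else
			item_out_of_place(s, answer, gid);
	}
}
```

With the in-place body, run_items(1, 1, answer) yields [0, 2, 2, 2] as in the question, while run_items(1, 0, answer) yields [0, 2, 2, 0]. With the out-of-place body, both orders yield [0, 2, 2, 0]. The same approach should carry over to the kernel: read only from s and write only to answer[gid], so that no work-item depends on what another one has already written.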