Bool AND reduction

Hello everyone.

I’m working on an algorithm that has to iterate, running some computation (a specific kernel) over an image until no more changes are made (it always converges, no worries). In OpenCL, I’d need each thread to report whether it changed anything or not, so I know if I need to run the kernel again.

The most trivial way I can think of to solve this is by using a global boolean. If a thread changes anything, it would set that bool to true, and the host would simply check that value. However, this sounds quite costly. I know I could split this reduction using shared memory to optimize it, but there must be an easier and better way to do this.

Is there a way to send some sort of signal from a thread to the host? There’s no extra data I need to transfer, I just need to be notified with a signal or anything.

Thanks everyone!

Why do you think it would be costly?

Mainly for two reasons:

  1. Many threads could have made changes, so that’s many accesses to global memory.
  2. I need to transfer that boolean back and forth every time.

But maybe I’m wrong.
PS: I think you’re answering both this and my SO question. You are awesome.

if (*global_bool == 0) {
    // global_bool has to be a volatile __global int *: atomic_cmpxchg works on int or unsigned int, not bool
    atomic_cmpxchg(global_bool, 0, 1);
}

This ensures that, on average, 1 atomic store will be performed. And since every thread is going to read this bool, it is likely to be cached most of the time. In theory, anyway, you should profile on your own. And remember you can do workgroup-wide reduction without atomics, though it might be a premature optimization. It’s probably a good idea to use host memory to store this bool. Or better yet, use device-side enqueue if OCL 2.0 is available.
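For anyone who wants to try the check-then-swap idea outside of a kernel, here is a minimal CPU-side analogue using C11 atomics. It is plain C, not OpenCL C, and `mark_changed` is a hypothetical name picked for this sketch. The cheap read filters out callers that arrive after the flag is already set, and the compare-and-swap return value tells you whether this particular caller was the one that flipped it.

```c
#include <stdatomic.h>

/* Hypothetical helper: set *flag to 1 if it is still 0.
   Returns 1 only for the caller that actually performed the 0 -> 1 swap. */
int mark_changed(atomic_int *flag)
{
    /* Cheap read first: once the flag is set, nobody pays for the atomic swap. */
    if (atomic_load_explicit(flag, memory_order_relaxed) == 0) {
        int expected = 0;
        /* Succeeds for exactly one caller, even if several race past the check. */
        return atomic_compare_exchange_strong(flag, &expected, 1);
    }
    return 0;
}
```

Several callers can still pass the read at the same moment, so more than one swap may be attempted, but only one of them succeeds.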

That looks like a good idea. I see how the bool being kept in cache helps with the conditional, but I don’t understand a few things though: Why does it have to be atomic when it’s only a write? How is the ‘cmpxchg’ part helping here? Thanks for your answer!

Atomic ensures every workitem immediately sees this change. I’m not entirely sure cmpxchg achieves anything when atomic_store exists, but what happens without atomic is undefined behavior and may vary from device to device. Maybe it will work, maybe it won’t, maybe it works now and will break after a driver update. Literally no reason to waste your time finding out what will happen.

If your work-items only make a transition from false to true (or 0 to 1), a simple write to global memory is enough as long as you always write the same value:

__global int *flag;  // kernel argument, set to 0 before the kernel is launched

if (<my_condition>) *flag = 1;  // every writer stores the same value

In that case, you are guaranteed that *flag will be set to 1 at the end of the kernel execution if at least one work-item has fulfilled <my_condition>.
This means that you have to set *flag to zero before executing your kernel (with a host write or a device task), and that your work-items cannot read an up-to-date value of *flag.

However, this works well to signal that at least one work-item has done something. I have already used this pattern successfully for iterative picture analysis.
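The reset / run / read-back cycle described above can be sketched in plain C. This is a CPU emulation of the control flow only, with assumptions: `run_step` is a hypothetical stand-in for one kernel launch (a toy step where each element takes the maximum of itself and its left neighbour), and `iterate_until_stable` is a made-up name for the host loop. On a real device the reset would be something like clEnqueueWriteBuffer or clEnqueueFillBuffer, and the check a clEnqueueReadBuffer of a single int.

```c
#include <stddef.h>

/* Hypothetical stand-in for one kernel launch. Each loop iteration plays the
   role of one work-item: it computes out[i] and writes 1 to *flag if the
   value changed. The flag only ever goes from 0 to 1. */
void run_step(const int *in, int *out, size_t n, int *flag)
{
    for (size_t i = 0; i < n; ++i) {
        int v = in[i];
        if (i > 0 && in[i - 1] > v)
            v = in[i - 1];
        out[i] = v;
        if (v != in[i])
            *flag = 1;
    }
}

/* Host loop: reset the flag, run one step, check the flag, repeat.
   Returns the number of steps launched, including the final one that
   reported no change. */
int iterate_until_stable(int *a, int *b, size_t n)
{
    int steps = 0;
    int flag;
    do {
        flag = 0;                      /* reset before every launch */
        run_step(a, b, n, &flag);
        for (size_t i = 0; i < n; ++i) /* a real host would swap buffers instead */
            a[i] = b[i];
        ++steps;
    } while (flag);                    /* read back: did anyone change anything? */
    return steps;
}
```

Note that the loop always pays for one extra launch at the end, the one that confirms nothing changed.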

Find an exact line in the OpenCL standard that states “guaranteed”. Writing into the same memory location without synchronisation is undefined behavior, and driver developers love to screw with undefined behavior. I’m well aware that there is no reason to believe it does not actually work on any current or future hardware, but when the difference is so minuscule, why take risks? There is too much software and too many games lost forever because of “I dunno, it works for me”, and this adds more potential corpses to the pile.

Don’t confuse memory state with memory operation. Writing some data to a memory location is a memory operation that will be realized whether there is synchronization or not. Without synchronization, you cannot guarantee ordering or consistency among reads and writes. But if the only action on a memory location is exactly the same write by one or several work-items, you are assured that this memory location will contain the result of the write once memory has been flushed, whether by an explicit flush or an implicit one (such as done at the end of the kernel execution).