Shared Data Structures

Is there a way to calculate a max between a large number of processes. For instance I have a global datastructure that I want to store the max in and 19 Billion threads to calculate the max over. Obviously storage of the entire array of numbers is impossible. It would seem that the atomic_max function isn’t defined for floats, so if I do this:


float tmp = max(overall_max, local_max);
if(tmp == local_max)
    atomic_xchg(&overall_max, tmp);

it seems like there would be some chance that threads would interfere with each other. I am also trying to save another statistic about the max, called maxt, but this also suffers from the same problem


float tmp = max(overall_max, local_max);
if(tmp == local_max) {
    atomic_xchg(&overall_max, tmp);
    atomic_xchg(&overallt, maxt);
}

This is filled with race conditions. For instance: What if two threads simultaneously find that they are the new max, then the higher one beats the lower one to the atomic_xchg?

Is there any way to do this?

Have you considered using atom_cmpxchg() instead of atom_xchg()?

If I use atomic_cmpxchg it would still leave open the option that I am working updating based on out of date information.

I think I figured out how to do it using __local. I will post with the results. I’ve spent too much time with normal threads, this is completely different. Is there any guarantee that all members of a work group will execute at the same speed - aka 1 clock for one clock? If that is the case, then barring different branches one thread can never get ahead of another. In that case this would work (I think):


while(local_max < overall_max) {
    atomic_xchg(&overall_max, local_max);
}

if(local_max == overall_max)
    atomic_xchg(&overallt, maxt);

If I use atomic_cmpxchg it would still leave open the option that I am working updating based on out of date information.

This is basically how to do it (there could be syntax errors):


float old_max = local_max;
do
{
    old_max = as_float(atomic_cmpxchg((global int*)global_max, as_int(old_max), as_int(local_max)));
}
while(old_max < local_max);

That said, instead of making all work-items try to fight to access the global max it would be better if you did a parallel reduction.

Is there any guarantee that all members of a work group will execute at the same speed - aka 1 clock for one clock?

No, there’s no guarantee about the order of execution of work-items or work-groups.

Gotcha, very helpful. Yeah I am switching to __local memory which gets reduced. That should remove all these race conditions.