Thanks for the reply!
You mention:
Each work item in a subgroup will reach a subgroup function
Does this mean each work-item will wait at the subgroup function for the rest of the work-items, even if this waiting doesn’t provide memory ordering guarantees? I understand your example: the undefined behaviour arises from a data race on the ‘array’ memory locations. So sub-group functions do not rule out data races (this is very useful to know!).
However, I think my question is better captured by the piece of code below, which has no potential data races. Assume the output array is initialised to 0 and has one element per work-item in the subgroup. Additionally, assume that only one subgroup executes the code.
a: int x = 0;
b: if (get_sub_group_local_id() == 0) { x = 1; }
c: while (sub_group_any(x)) {
d:   output[get_sub_group_local_id()] = 1;
e:   x = 0;
   }
Is this piece of code well-defined? And is it guaranteed that ‘output’ will now contain all 1’s? The execution we are worried about is this:
Say subgroup work-item 0 gets priority in executing. It executes statement b and then reaches statement c. Its local x is 1, so it already knows that sub_group_any will return true. If there is no implied barrier, subgroup work-item 0 could continue executing (without waiting) based on this local knowledge. It continues to statement d and then e. When subgroup work-item 0 returns to c, its local x is now 0, and it cannot continue until it acquires more information (i.e. until the other subgroup work-items execute). Now the other subgroup work-items start executing. They reach statement c, where sub_group_any(x) evaluates to false (based on the current values of x in the subgroup). This means that the other subgroup work-items never execute statement d, and ‘output’ contains only a single 1.
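To make that interleaving concrete, here is a small host-side C sketch (plain C, not OpenCL) that replays it by hand. The names SUBGROUP_SIZE, any_nonzero and run_without_barrier are made up for illustration, and the model assumes that a work-item may pass sub_group_any as soon as its own x decides the result, which is exactly the behaviour I am unsure the specification permits.

```c
#include <assert.h>
#include <stddef.h>

#define SUBGROUP_SIZE 8 /* hypothetical subgroup size, for illustration only */

/* "any" over the x values that currently exist in the subgroup. */
static int any_nonzero(const int *x, size_t n) {
    assert(x != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (x[i]) return 1;
    }
    return 0;
}

/* Replays the worrying schedule: work-item 0 runs ahead on local knowledge,
 * and only afterwards do the other work-items reach statement c. */
void run_without_barrier(int output[SUBGROUP_SIZE]) {
    int x[SUBGROUP_SIZE] = {0};                 /* a: every work-item has x = 0 */
    for (size_t i = 0; i < SUBGROUP_SIZE; ++i) output[i] = 0;

    x[0] = 1;                                   /* b: only work-item 0 sets x = 1 */

    /* Work-item 0 at c: its own x is 1, so (by assumption) it does not wait. */
    if (x[0]) {
        output[0] = 1;                          /* d */
        x[0] = 0;                               /* e */
    }
    /* Work-item 0 is back at c with x == 0 and now has to wait. */

    /* The other work-items reach c and see only the current x values. */
    for (size_t i = 1; i < SUBGROUP_SIZE; ++i) {
        if (any_nonzero(x, SUBGROUP_SIZE)) {    /* c: false, every x is 0 by now */
            output[i] = 1;                      /* d: never reached in this replay */
            x[i] = 0;                           /* e */
        }
    }
}
```

Calling run_without_barrier on an array of SUBGROUP_SIZE ints leaves a 1 in element 0 only, which is the outcome I would like the specification to rule out.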
If an implicit execution barrier is provided, then the above execution is disallowed, because subgroup work-item 0 will have to wait at the first instance of sub_group_any(), even though locally it knows that it can continue. Likewise, in SIMT execution (e.g. CUDA warps), the above execution is disallowed because work-items will execute in lock-step, disallowing the interleaving described above.
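The barrier (or lock-step) reading can be modelled in plain host-side C (not OpenCL) roughly as follows. SUBGROUP_SIZE, any_nonzero and run_with_barrier are hypothetical names, and the model assumes that sub_group_any is evaluated once per loop iteration, only after every work-item has arrived and contributed its x.

```c
#include <assert.h>
#include <stddef.h>

#define SUBGROUP_SIZE 8 /* hypothetical subgroup size, for illustration only */

/* "any" over the x values contributed by all work-items in the subgroup. */
static int any_nonzero(const int *x, size_t n) {
    assert(x != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (x[i]) return 1;
    }
    return 0;
}

/* Models an implicit execution barrier at sub_group_any: the result is
 * computed once per round and every work-item takes the same branch.
 * Returns the number of times the loop body ran. */
int run_with_barrier(int output[SUBGROUP_SIZE]) {
    int x[SUBGROUP_SIZE] = {0};                 /* a: every work-item has x = 0 */
    int rounds = 0;
    for (size_t i = 0; i < SUBGROUP_SIZE; ++i) output[i] = 0;

    x[0] = 1;                                   /* b: only work-item 0 sets x = 1 */

    /* c: all work-items arrive first, then share one result. */
    while (any_nonzero(x, SUBGROUP_SIZE)) {
        for (size_t i = 0; i < SUBGROUP_SIZE; ++i) {
            output[i] = 1;                      /* d: executed by every work-item */
            x[i] = 0;                           /* e */
        }
        ++rounds;
    }
    return rounds;
}
```

Under this model the loop body runs exactly once and every element of output ends up as 1, which is the result I would expect if the implicit barrier exists.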
The other option I see is that the above code is undefined, violating this line of the specification:
These built-in functions must be encountered by all work-items in a subgroup executing the kernel
If this is the case, how would we make the above code defined? I imagine it might be done by placing a sub_group_barrier immediately after statement c, inside the while loop, but this isn’t clear to me from reading the specification.
Thanks again!