clarification request: barrier()

rickbagley · May 8, 2013, 3:09pm

Hello,

We request some clarification regarding implementation of OpenCL
barriers, and in particular, barriers inside conditionals.

Some setup for our question:

The OpenCL 1.2 (revision 19) specification (section 6.12.8
Synchronization Functions) states:

“If barrier is inside a conditional statement, then all
work-items must enter the conditional if any work-item enters the
conditional statement and executes the barrier.”

Let’s call this “the OpenCL barrier rule”.

This wording appears to state a program-counter level
restriction. Each barrier in the code is treated as a different
execution barrier.

Therefore, we believe the OpenCL code:

__kernel
void foo(…)
{
…
if (<condition>)
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE) // barrier A
else
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE) // barrier B
…
}

would NOT execute in compliance with OpenCL if work-items from any
particular work group evaluate the condition differently, even though
all work-items from that group would be guaranteed to execute either
barrier A or barrier B.

For example, if there are 3 thread groups in one work group, one
thread group may execute barrier A, and two thread groups may execute
barrier B. We believe this is NOT compliant with OpenCL. Correct?

The results of incorrect execution of the barrier appear to be
implicitly undefined. We did not find any language in the
specification which specifically described how implementation
should handle incorrect barrier execution.

We know that failure to satisfy the OpenCL barrier rule could well
result in a race condition on hardware. Naturally, vendors may seek to
protect their devices from hangs due to incorrect programs.

Now our question:

Would it be a correct implementation of OpenCL if some
combination of software and hardware contrived to enforce the
OpenCL barrier rule.

That is, if the OpenCL code as programmed fails to execute in a
way to satisfy the OpenCL barrier rule, nevertheless, the
implementation enforces it: if one work-item in a work-group
reaches a barrier, all work-items in that workgroup reach that
same barrier.

Such a method, if it can be devised, appears to go beyond merely
protecting hardware from hangs, to enabling forward progress on
code that would otherwise halt (or cause an exception to be
thrown). This has the subtle consequence of perhaps changing the
semantics of the input program.

We ask for a ruling in this matter, and otherwise would
appreciate any further comments or corrections.

Thank you.

roger512 · May 9, 2013, 4:11pm

Hello,

the compiler will prolly factorize it that way :

barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE)

if (<condition>) {
…
}else{

…
}

I don’t think it is possible. AFAIK barrier are called to ensure write in local/global before the barrier are done by all work items. So it is a synchronization point in the kernel. I suppose it works like that,
when one work items encounter a barrier, the scheduller swap to another work items until he finds the barrier the first work items encountered. If it doesn’t encounter one, the work items won’t be able to synchronize (error).
I don’t see how the program/hardware could guess a good reconvergence point for those two work items by itself.

rickbagley · May 13, 2013, 9:32pm

Thank you for taking the time to respond to our query. But your reply was not helpful to us.

[QUOTE=roger512;29273]
the compiler will prolly factorize it that way :

barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE)

if (<condition>) {
…
}else{

…
}.[/QUOTE]

We appreciate there is redundancy in this example. For the purposes of this discussion, let’s assume that the code will be executed as shown.

[QUOTE=roger512;29273] I don’t think it is possible. AFAIK barrier are called to ensure write in local/global before the barrier are done by all work items. So it is a synchronization point in the kernel. I suppose it works like that,
when one work items encounter a barrier, the scheduller swap to another work items until he finds the barrier the first work items encountered. If it doesn’t encounter one, the work items won’t be able to synchronize (error).
I don’t see how the program/hardware could guess a good reconvergence point for those two work items by itself.[/QUOTE]

We appreciate these comments, which question our scenario. But you do not address our question. If our scenario could be implemented, would the resulting execution of code compiled from an OpenCL program be correct?