Do i have divergence with (bool)?val1:val2 operator?

o4kareg2 · September 12, 2012, 12:11pm

Hi GPGPU hackers!
Is that any point to try using (boolean expression) ? val1 : val2
instead of if… else… , if the ()? operator still does have the same divergence?

Photovore · September 13, 2012, 8:11pm

I’d like to hear someone answer your question definitively, but here’s what I have to say:

where I had:

if ( a < b ) c=30;
else if ( a > b ) c=40;
else c=35;

… which is the equivalent of:

c = ( a < b ) ? 30 : ( a > b ) ? 40 : 35;

what i ended up implementing is:

c = ( a < b ) * 30 + ( a > b ) * 40 + ( a == b ) * 35;

… which of course does all three calculations and then selects only the one that it wants, by multiplying the others by zero and adding them all together, with no branching. [And, for what it’s worth, my calculations are much more complex than simple assignments, lots of trigonometry. Also, some of my instances of this have 6 or 8 options, 6 or 8 complex calculations to perform…]

… AND, for me, this was faster, by just a teeeeensy little bit, than your trinary operator. (The trinary operator had been suggested to me by someone, and I did try it, but what I wrote above was just a tiny bit faster, by which I mean it was hardly noticeable, perhaps 1ms on a 24ms kernel. But, I kept it!..)

…

my take-home message for you? Try it every way you can think of and keep the one that’s best in your situation!

[there is also a Select() statement that for me seemed equivalent to (a?b:c) in performance terms iirc]

[edit: when i vectorized into float4s i had to account for “true” becoming -1 instead of 1; just something to be aware of if you’re going to dig in that far!]

chippies · September 16, 2012, 11:51am

For Nvidia GPUs, if-then-else produces branching, unless the compiler optimises away a simple if-statement. Using the ternary operator form results in both code paths being scheduled but only one actually stores a result. There is no warp divergence in this case since every thread executes all instructions. There is no guarantee that any other hardware works this way - check the specific vendor’s documentation.

Using Photovore’s method is essentially the same, but will always be branch-free on all types of GPUs. Be careful of type conversions as logical operations generate integers, which will the typecast to floats if you are working with floats. Photovore’s example doesn’t suffer from that since all his numbers are integers. Depending on the architecture, these typecasts may be very costly operations, making the ternary a better choice, possibly.

As Photovore put it, the best is to test the different methods, and don’t forget to read the hardware vendor’s documentation.

Photovore · September 16, 2012, 3:31pm

Thanks for your input, chippies! I have nothing to add to your excellent comment but wish to expand on, and clarify, my experience:

>Photovore’s example doesn’t suffer from that since all his numbers are integers.
… the example doesn’t , but my code sure does! o.o

I was vastly oversimplifying; I use plenty of floats, float4s, and transcendental functions. Here is an actual, simple example from my code:

#define FRUGIC(l,m) as_float4( (l) & as_int4( (float4) (m) ) ) // (handles TRUE being -1 for float4s)
#define FROTYP float4 // running in vectorized mode (float4s instead of floats)
#define nata(x) atan( (FROTYP) x ) // casting a float to float4
#define sbaf(x) fabs(x) // well, I thought I had a problem with plain-old fabs at some point earlier!..
…
viv =
( FRUGIC( ( xd == 0.f && yd > 0.f ), (FROTYP) ( PI - PId2 ) ) \

FRUGIC( ( xd == 0.f && yd < 0.f ), (FROTYP) ( PI + PId2 ) ) \
FRUGIC( ( yd > 0.f && xd > 0.f ), (FROTYP) ( nata( yd/xd ) ) ) \
FRUGIC( ( yd > 0.f && xd < 0.f ), (FROTYP) ( PI - nata( yd/sbaf(xd) ) ) ) \
FRUGIC( ( yd < 0.f && xd > 0.f ), (FROTYP) ( PIx2 - nata( sbaf(yd)/xd ) ) ) \
FRUGIC( ( yd < 0.f && xd < 0.f ), (FROTYP) ( PI + nata( sbaf(yd)/sbaf(xd) ) ) )
);

… so, plenty of casts, but still a bit more efficient!

And, this is a rather simple example … some of these constructs have 8 options, full of #defines on multiple lines, that expand the construct after pre-processing into dozens of lines of code! And, still the most economical for my task! ~O!O~

o4kareg · September 18, 2012, 6:24am

Ok i will for sure remember the trick, and probably one can make this a little better
for example one can remember the compare (a> b) result instead of calculating this twice and maybe using XOR instead of equal (XOR takes usually less enstructions on CPU,maybe on GPU as well)?

Photovore, this macros are realy horrible, isn’t in better to use
inline funktions since the functions will be inlined anyway?

Is that mo efficient, if you have some branches to do with one digit, convert
in to vector and make only one vector operation?

Photovore · September 19, 2012, 5:21am

o4kareg, you’re right, my macros are super ugly!

You gave me one idea and I spent a few hours replacing the “+” above with “|” where possible and making them even uglier (you have to use as_int4() to OR together float4s, and then do as_float4() when you’re done) – but, there was no improvement on i7, xeon, nvidia, amd. I had thought that OR might be faster than addition.

Don’t understand xor vs == but it’s been a long day and will think more later. You’re definitely right about pre-calculating, not only (a>b), but also many other intermediate values. Will do that some day.

I will look at “inline” functions; second time in two days that I have heard that term. But I worry about the overhead of passing arguments; maybe inlining eliminates that.

There is definitely efficiency in doing things as a single vector operation – when I went from floats to float4s, it didn’t speed up 4x, but over 2x. I forget the exact numbers but it got much better on both cpus and both gpus.

o4kareg2 · September 19, 2012, 5:38pm

Oh, at this night( IMAO) i have found a nice solution with cheap operations even without addition and multiply:

You have:


if(a>b) c = 30
else if(a<b) c = 35
else c = 40; //,so by a = b

You make:


unsigned int a_bigger = a <b;
a_bigger --; // now a_bigger = 0xFF...  if a > b ,else 0 if a <= b

unsigned int not_equal = a^b; // 0 if a == b,  else some digit if a!= b
not_equal = ! not_equal; // 1 if a == b , else 0
not_equal --; // now its 0 if a == b ,else 0xFF...

unsigned int b_bigger = (~a_bigger) & not_equal;

unsigned int equal = ~not_equsal;

result = (30 &a_bigger) | (35 & b_bigger) | (40 & equal);

/////////{added} OR YOU CAN AS WELL:

b_bigger = a>b;
b_bigger --;
equal = ~(a_bigger |b_bigger); /////// here ::::::( ~(0 | 0 ) = 0xFFF, else ~(0| 0xFFFFF) = 0;

You have:
2 compare, 2 sub operations, 7 binar-logical operations. Are there instructions
for compare and logical operation with vectors? Are floats realy nesessary or you can perform
your task with integers? As sayd, conversions costs a lot. If you have some long operations like tangens or what is it, probably are there integer calculations algorithms with magical numbers or something, or even some paralel

Photovore · September 26, 2012, 1:34pm

Hi o4kareg2!

I will take your suggestions into consideration, and I thank you for the effort and thought involved. (The past few days, though, I have concentrated on a different area that should result in much more speed-up; see my other post, later today!)
> Are there instructions for compare and logical operation with vectors?

Yes, all of the logical operations in opencl that work with (i.e.) ints will also work with int2, int4, int 8…
But, if I want to use these logical operations on floats, I have to use (for example) as_int4( my_float4_value ). As I understand it, this is not a conversion, but simply instructs the compiler to ignore the fact that it’s a float4 and treat it as if it were an int4, so you can do the logical operation, and then as_float4( the_int4_value ) just tells the compiler to treat is as if it were a float4 – with no actual conversion and no instruction cycles lost.

> Are floats realy nesessary or you can perform your task with integers?

Floats are pretty much needed. Well, to be more precise I do lots of sin, cos, tan, many operations that only make sense to do with floating-point values, and I have many values that are fine-grained over a range from 0.0 to 1.0. I suppose the whole thing could_possibly be rewritten to use integer arithmetic – I could multiply all my float values by 10000, for example, and assume an implied decimal place at render time. But then I’d have to find (or write) a library for all the transcendental functions, with maybe lookup tables and linear interpolation instead of (i.e.) doing a real “sine” calc, or use some other simulation algorithm as you suggest. Not worth it, in my opinion, because: my kernel already runs in 24 ms, and any improvement only matters to me if it crosses a 16.7ms boundary. By which I mean that I’m generating live video frames, and a millisecond saved doesn’t help unless it gives me a whole extra frame. (That’s why I’m focusing on the subject of my other post, which I will put up later as a fresh new question.)

>conversions costs a lot

Yes they do, but I just want to point out that most of those casts [all the (FROTYP) stuff] do not actually get invoked every time I use the macro. By far most of the vars are already float4 so no conversion needs to be done. Mostly, that’s in the macro to handle the few cases when the macro is called with a float1 that needs to be fanned out to a float4.