I’m working on an OpenCL kernel and I’m seeing a DRASTIC performance difference between these two versions of the same four lines:
c0o.xyzw += (float4)(fa,fb,fc,fd);
c1o.xyzw += (float4)(fa2,fb2,fc2,fd2);
c2o.xyzw += (float4)(fa3,fb3,fc3,fd3);
c3o.xyzw += (float4)(fa4,fb4,fc4,fd4);
versus
c0o.xyzw = (float4)(fa,fb,fc,fd);
c1o.xyzw = (float4)(fa2,fb2,fc2,fd2);
c2o.xyzw = (float4)(fa3,fb3,fc3,fd3);
c3o.xyzw = (float4)(fa4,fb4,fc4,fd4);
The first version (with +=) runs lightning fast (0.01 sec). The second version (with plain =) slows my kernel down to 18 seconds.
Note that c0o … c3o are uninitialized float4s. Is the compiler just discarding the memory write in the first case because it is accumulating into uninitialized memory? Does OpenCL initialize stack (private) variables at all?
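For comparison, here is a minimal sketch of what I assume a well-defined += version would look like, with the accumulator given an explicit starting value and the result stored to global memory. The kernel name `accumulate`, the `out` buffer, and passing fa..fd as kernel arguments are made up for illustration; they are not from my real kernel.

```c
// Hypothetical kernel, for illustration only.
__kernel void accumulate(__global float4 *out,
                         const float fa, const float fb,
                         const float fc, const float fd)
{
    // Give the private accumulator a defined starting value
    // instead of relying on whatever happens to be in it.
    float4 c0o = (float4)(0.0f);

    // Reading c0o in the += is now well-defined.
    c0o.xyzw += (float4)(fa, fb, fc, fd);

    // Store the result so the computation has an observable effect.
    out[get_global_id(0)] = c0o;
}
```

If the timing of this variant matches the slow (=) case, that would suggest the fast += result comes from reading uninitialized values rather than from the addition itself being cheaper.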