Hello again, one last time. Everything works except my kernel’s write to the output buffer object.
Note that this works without vectorization on both AMD and NVIDIA, and works WITH vectorization on an Intel Xeon CPU.
Without vectors, I do this:
__kernel void MyKernel( __global mystruct * vh, __global uint * obuf )
...
__global uint *dad = obuf + offset;
... figure stuff ...
unsigned int dout;
dout = (unsigned char)round(pif.r*255.f) << 8 | (unsigned char)round(pif.g*255.f) << 16 | (unsigned char)round(pif.b*255.f) << 24;
dad[0] = dout;
(pif in this instance is a struct with float elements r, g, and b.)
(Maybe the assignment to dout shouldn’t work, since the uchars are shifted by more than 8 bits before the OR, but it does… and I eliminated it as a source of the problem; see below. Also, I don’t need the intermediate pointer dad; it’s a legacy holdover, but removing it doesn’t help.)
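For reference, here is a minimal host-side C sketch of the same packing. As far as I understand it, the shifts are safe because C-style integer promotion widens each uchar to int before the shift is applied, so no bits fall off at 8 or 16; the sketch widens to uint32_t explicitly so the shift by 24 does not depend on a signed int. The names pixel_f and pack_pixel are made up for illustration, and rounding is done by adding 0.5f and truncating, which I am assuming matches round() for the non-negative values used here.

```c
#include <stdint.h>

/* Hypothetical host-side stand-in for the kernel's pif struct. */
typedef struct { float r, g, b; } pixel_f;

/* Scale a channel in [0, 1] to 0..255. Adding 0.5f and truncating is an
   assumption of this sketch; it rounds half up for non-negative input. */
static uint32_t to_byte(float channel)
{
    return (uint8_t)(channel * 255.0f + 0.5f);
}

/* Pack r, g, b into bits 8-15, 16-23 and 24-31, leaving bits 0-7 zero,
   which is the layout the scalar kernel writes to dad[0]. Each channel is
   already a uint32_t here, so the shifts cannot lose bits. */
static uint32_t pack_pixel(pixel_f p)
{
    return (to_byte(p.r) << 8) | (to_byte(p.g) << 16) | (to_byte(p.b) << 24);
}
```

Calling pack_pixel on a pixel with r = 1.0, g = 0.5, b = 0.0 gives 0x0080FF00, i.e. the low byte is unused and r, g, b follow in ascending byte order.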
… vectorizing now. Mind you, all of these methods work on the CPU with vectors, and on all devices without vectors.
One method:
uchar4 d0, d1, d2, d3; // ( pif now has float4 elements r, g, and b )
d0 = (uchar4) ( (uchar)0, (uchar)round(pif.r.s0*255.f), (uchar)round(pif.g.s0*255.f), (uchar)round(pif.b.s0*255.f) );
d1 = (uchar4) ( (uchar)0, (uchar)round(pif.r.s1*255.f), (uchar)round(pif.g.s1*255.f), (uchar)round(pif.b.s1*255.f) );
d2 = (uchar4) ( (uchar)0, (uchar)round(pif.r.s2*255.f), (uchar)round(pif.g.s2*255.f), (uchar)round(pif.b.s2*255.f) );
d3 = (uchar4) ( (uchar)0, (uchar)round(pif.r.s3*255.f), (uchar)round(pif.g.s3*255.f), (uchar)round(pif.b.s3*255.f) );
dad[0] = as_uint(d0);
dad[1] = as_uint(d1);
dad[2] = as_uint(d2);
dad[3] = as_uint(d3);
I can do the calculations of d0 - d3 above, but the assignments to dad[0] - dad[3] result in a system hang and a hard reboot.
… !BUT! …
I can perform any TWO of the assignments, so long as they’re not adjacent. That is, I can set dad[0] and dad[2], or 0 and 3, or 1 and 3, but not 0 and 1, etc.
(This is how I know that the kernel’s calculations are correct: I can pick and choose which columns I want to see, just not more than 2 of every 4!)
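To show why I expect as_uint on a (0, r, g, b) uchar4 to produce the same word as the scalar shifts, here is a small host-side C sketch. It assumes a little-endian layout, and uchar4_host, as_uint_host, pack_shift and host_is_little_endian are illustrative names, not OpenCL API; memcpy stands in for the bit reinterpretation that as_uint performs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side stand-in for OpenCL's uchar4. */
typedef struct { uint8_t s0, s1, s2, s3; } uchar4_host;

/* Emulates as_uint(uchar4): reinterpret the four bytes, no conversion. */
static uint32_t as_uint_host(uchar4_host v)
{
    uint32_t out;
    memcpy(&out, &v, sizeof out);
    return out;
}

/* Shift-based packing with the same channel positions as the scalar kernel. */
static uint32_t pack_shift(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)r << 8) | ((uint32_t)g << 16) | ((uint32_t)b << 24);
}

/* Returns 1 on a little-endian host, 0 otherwise. */
static int host_is_little_endian(void)
{
    uint32_t one = 1;
    uint8_t first;
    memcpy(&first, &one, 1);
    return first == 1;
}
```

On a little-endian host, as_uint_host applied to the bytes (0, r, g, b) equals pack_shift(r, g, b), so the two kernels should be writing identical values; only the way the stores are issued differs.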
Similarly:
uint4 dout;
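// as_uint reinterprets the four bytes of each uchar4; a plain (uint) cast of a vector is not valid OpenCL C
dout.s0 = as_uint(d0);
dout.s1 = as_uint(d1);
dout.s2 = as_uint(d2);
dout.s3 = as_uint(d3);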
vstore4( dout, 0, (__global uint *) dad );
The vstore works, but again I can’t do all the assignments to dout’s components. I can do any one, and probably any two non-adjacent ones (sorry; I’ve tried a dozen different ways to do this, and it’s all running together now).
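For anyone comparing the two methods, this host-side C sketch shows what I understand vstore4(data, offset, p) to do: it writes four consecutive uints starting at p + offset * 4, so with offset 0 it touches the same four locations as the dad[0] - dad[3] assignments. uint4_host and vstore4_host are illustrative names for this sketch, not part of OpenCL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for OpenCL's uint4. */
typedef struct { uint32_t s0, s1, s2, s3; } uint4_host;

/* Emulates vstore4(data, offset, p): the destination is p + offset * 4,
   and the four components are written to consecutive elements. */
static void vstore4_host(uint4_host data, size_t offset, uint32_t *p)
{
    uint32_t *dst = p + offset * 4;
    dst[0] = data.s0;
    dst[1] = data.s1;
    dst[2] = data.s2;
    dst[3] = data.s3;
}
```

Note that the offset is counted in units of four elements, so vstore4_host(v, 1, buf) fills buf[4] through buf[7] and leaves buf[0] through buf[3] untouched.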
The failure mode is that Lion hangs when clFinish is called.
It doesn’t help if I make obuf __global uchar*, or __global uchar16*, and appropriately arrange my data for the assignment. It always works on the CPU, and always hangs Lion on clFinish…
Sorry for rambling; I’ve been chasing down this last little hitch all day after making the major breakthroughs in fixing the logic earlier this weekend. I have tried many variations, but maybe I’m doing something obviously wrong that will ring a bell with one of you readers…
Any thoughts off the top of your head?