Vectorizing for AMD; can't write output buffer elements

Hello again, one last time. Everything works but my kernel’s write to the output buffer object.

Note that this works without vectorization on both AMD and nVidia, and works WITH vectorization on an Intel Xeon CPU.

Without vectors, I do this:


__kernel void MyKernel( __global mystruct * vh, __global uint * obuf )
...
__global uint *dad = obuf + offset;
... figure stuff ...

unsigned int dout;
dout = (unsigned char)round(pif.r*255.f) << 8
     | (unsigned char)round(pif.g*255.f) << 16
     | (unsigned char)round(pif.b*255.f) << 24;
dad[0] = dout;

(pif in this instance is a struct with float elements r, g, and b)

(Maybe the assignment to dout shouldn’t work, since the uchars are shifted more than 8 bits before the OR, but it does; the uchar operands are promoted to int before the shift. In any case it was eliminated as a source of the problem; see below. Also, I don’t need the intermediate pointer dad; it’s a legacy holdover, but removing it doesn’t help.)

… vectorizing now; mind you, all these methods work on the CPU with vectors, and on all devices without vectors

one method:


uchar4 d0, d1, d2, d3;			// ( pif now has float4 elements r, g, and b )
d0 = (uchar4) ( (uchar)0, (uchar)round(pif.r.s0*255.f), (uchar)round(pif.g.s0*255.f), (uchar)round(pif.b.s0*255.f) );
d1 = (uchar4) ( (uchar)0, (uchar)round(pif.r.s1*255.f), (uchar)round(pif.g.s1*255.f), (uchar)round(pif.b.s1*255.f) );
d2 = (uchar4) ( (uchar)0, (uchar)round(pif.r.s2*255.f), (uchar)round(pif.g.s2*255.f), (uchar)round(pif.b.s2*255.f) );
d3 = (uchar4) ( (uchar)0, (uchar)round(pif.r.s3*255.f), (uchar)round(pif.g.s3*255.f), (uchar)round(pif.b.s3*255.f) );
dad[0] = as_uint(d0);			// reinterpret each uchar4 as a uint
dad[1] = as_uint(d1);
dad[2] = as_uint(d2);
dad[3] = as_uint(d3);

I can do the calculations of d0 - d3 above, but the assignments to dad[0] - dad[3] result in a system hang and hard boot.

… !BUT! …

I can perform any TWO of the assignments, so long as they’re not adjacent; i.e., I can set dad[0] and dad[2], or 0 and 3, or 1 and 3, but not 0 and 1, etc.
(This is how I know that the kernel’s calculations are correct; I can pick and choose which columns I want to see, just not more than 2 of every 4!)

similarly:

uint4 dout;
dout.s0 = as_uint(d0);			// reinterpret; a scalar cast from uchar4 won't compile
dout.s1 = as_uint(d1);
dout.s2 = as_uint(d2);
dout.s3 = as_uint(d3);
vstore4( dout, 0, (__global uint *) dad );

The vstore works, but again I can’t do all the assignments to dout’s subelements. I can do any one, and probably any two non-adjacent (sorry; I’ve tried a dozen different ways to do this, and it’s all running together now).

The failure mode is that Lion hangs when clFinish is called.

It doesn’t help if I make obuf __global uchar *, or __global uchar16 *, and appropriately arrange my data for the assignment; it always works on the CPU, and always hangs Lion at clFinish…
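(For concreteness, a sketch of the uchar16 arrangement I mean, reusing the pif/round pattern from above; the pointer and index names here are hypothetical:)

uchar16 d;	// four pixels' bytes in one 16-byte store, zero byte first in each group
d = (uchar16)( 0, (uchar)round(pif.r.s0*255.f), (uchar)round(pif.g.s0*255.f), (uchar)round(pif.b.s0*255.f),
               0, (uchar)round(pif.r.s1*255.f), (uchar)round(pif.g.s1*255.f), (uchar)round(pif.b.s1*255.f),
               0, (uchar)round(pif.r.s2*255.f), (uchar)round(pif.g.s2*255.f), (uchar)round(pif.b.s2*255.f),
               0, (uchar)round(pif.r.s3*255.f), (uchar)round(pif.g.s3*255.f), (uchar)round(pif.b.s3*255.f) );
obuf16[offset16] = d;	// obuf16: __global uchar16 *; offset16: index in 16-byte units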

Sorry this is so rambling; I’ve been chasing this last little hitch all day after making the major breakthroughs in fixing the logic earlier this weekend… I’ve tried many variations, but maybe I’m doing something obviously wrong that will ring a bell with one of you readers…

any thoughts off the top of your head?

Crashes are from out of bounds access. Make sure dad is in the range of obuf.

e.g. if you’re doing 4x as much data per threadid, did you remember to reduce your global work size by 4 as well?
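(In host code the point is just this; a sketch with illustrative names:)

// Each work-item now produces 4 output uints instead of 1, so the
// NDRange must shrink by the same factor:
size_t global_work_size = num_pixels / 4;	// was num_pixels for the scalar kernel
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                             &global_work_size, NULL, 0, NULL, NULL);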

The fact that it works with non-adjacent elements is probably a coincidence; since there’s no way to allocate unaligned device memory, it can’t be an alignment problem.

Note that if you don’t use a result the compiler will usually optimise its calculation away too, so commenting out the writes has more of an impact than just eliminating them; e.g. it might be something to do with reading the struct input.

Incidentally, AMD’s compiler works just fine with most scalar code for GPUs, so if that’s all you’re trying to do you’re probably wasting your time. Their current devices are VLIW at the instruction level, not SIMD, so vectorisation isn’t very important, unlike for CPUs. You’d probably get more benefit on their platform for this problem by using the AMD media-ops extensions.

(Thanks for your reply; I intended to respond sooner, but my laptop drive crashed; I’m rebuilding now but want to address this.)

> Crashes are from out of bounds access. Make sure dad[x] is in the range of obuf.

It is. I can write to the last four-byte item if I don’t first assign adjacent subelements of that item.

>e.g. if you’re doing 4x as much data per threadid, did you remember to reduce your global work size by 4 as well?

Yes.

>The fact that it works with non-adjacent elements is probably a coincidence; since there’s no way to allocate unaligned device memory, it can’t be an alignment problem.

>Note that if you don’t use a result the compiler will usually optimise its calculation away too, so commenting out the writes has more of an impact than just eliminating them; e.g. it might be something to do with reading the struct input.

Understood. But the writes are no longer commented out; they do execute; it hangs only if I first assign all adjacent subelements within the item that is then written. Also, I see example code, vectorized, which writes to the output buffer as I do. It seems to be something about assembling my individual bytes into 4-byte (or 16-byte) chunks before writing them. Perhaps I’ll try just going to byte-length writes altogether, though I thought consolidating writes to global memory, to reduce their number, was desirable. [Yes, I will try one-byte writes. “Make it work first, make it pretty later.”]
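(Roughly what I have in mind for the byte-length writes; a sketch, with db as a hypothetical alias:)

// Byte-at-a-time fallback: treat the output as bytes and skip the packing.
// Only the .s0 group is shown; .s1/.s2/.s3 would continue at offsets 5..7,
// 9..11, and 13..15.
__global uchar *db = (__global uchar *)dad;
db[1] = (uchar)round(pif.r.s0*255.f);
db[2] = (uchar)round(pif.g.s0*255.f);
db[3] = (uchar)round(pif.b.s0*255.f);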

>Incidentally, AMD’s compiler works just fine with most scalar code for GPUs, so if that’s all you’re trying to do you’re probably wasting your time. Their current devices are VLIW at the instruction level, not SIMD, so vectorisation isn’t very important, unlike for CPUs. You’d probably get more benefit on their platform for this problem by using the AMD media-ops extensions.

Now this I do not completely grok. More reading is perhaps necessary.

The nVidia 330M in the laptop is rated at 182 GFLOPS. The 5870 in the big machine is rated at 1.72 TFLOPS (numbers from my memory). So: a theoretical maximum of almost 10x the throughput. Yes, the scalar code, as you say, works just perfectly on the 5870, but at only about twice the throughput of the 330M. This is about right if I’m only using one element of the VLIW ALU in each SP (not sure I’m using the right term); each can handle four floats and one double simultaneously. I have read in a few places about the importance of vectorizing code for the 5870 in order to take advantage of those resources. Your opinion differs on this detail; I obviously have some reading to do.

In any case, it’s all done except for the final write at completion of kernel execution. It’s doing all the calculations correctly; I just can’t see all of them at the same time.

And (if I remember correctly) it is faster than the scalar code, but it’s not 4 times as fast. I’m not at home now to recheck this and state it more precisely, but I think it’s less than twice as fast. More optimizations are in order. I’m using Xcode under OS X, and I don’t know how (or whether it’s possible) to use the precise profiling tools available to those using the nVidia or AMD SDKs; I’m looking at msecs per frame, which is chunked to my display’s temporal resolution.

[Just glanced at the media ops extensions; they may be useful…]
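(For anyone following along: the builtin that looks relevant there is amd_pack, which converts four floats and packs a byte from each into one uint. A sketch, assuming the cl_amd_media_ops extension is available and that my guess at the byte order matches the shifts above:)

#pragma OPENCL EXTENSION cl_amd_media_ops : enable

// amd_pack((float4)(a,b,c,d)) packs (uint)a into byte 0, (uint)b into
// byte 1, and so on; (0, r, g, b) mirrors the <<8/<<16/<<24 layout.
uint p0 = amd_pack( (float4)(0.0f, pif.r.s0, pif.g.s0, pif.b.s0) * 255.f );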

Thank you for your help and insights. I really need to air this topic over on the AMD board, don’t I?

Oh fun.

If you do post to the AMD forums, I’d suggest putting together a test case that demonstrates the problem, so you can include it.

Anyway, I don’t really have anything to add apart from about this:

>Incidentally, AMD’s compiler works just fine with most scalar code for GPUs, so if that’s all you’re trying to do you’re probably wasting your time. Their current devices are VLIW at the instruction level, not SIMD, so vectorisation isn’t very important, unlike for CPUs. You’d probably get more benefit on their platform for this problem by using the AMD media-ops extensions.

>Now this I do not completely grok. More reading is perhaps necessary.

>The nVidia 330M in the laptop is rated at 182 GFLOPS. The 5870 in the big machine is rated at 1.72 TFLOPS (numbers from my memory). So: a theoretical maximum of almost 10x the throughput. Yes, the scalar code, as you say, works just perfectly on the 5870, but at only about twice the throughput of the 330M. This is about right if I’m only using one element of the VLIW ALU in each SP; each can handle four floats and one double simultaneously. I have read in a few places about the importance of vectorizing code for the 5870 in order to take advantage of those resources. Your opinion differs on this detail; I obviously have some reading to do.

If you set the environment variable GPU_DUMP_DEVICE_KERNEL=2 when you run your AMD code, it will spit out the ISA dump (i.e. annotated assembly language) for the board you’re running on. From this it will be obvious how much of the VLIW processor you’re using. You’ll get the instruction number, and then the x/y/z/w/t unit that each part of the instruction executes on; you won’t get many that have only x: parts. This shows the peak ALU activity your code will run at, but doesn’t show memory latencies. If you try this you’ll notice vector elements aren’t always processed by the same lane; they’re basically scheduled independently.

OTOH accessing memory by vector types is (generally) more efficient than by scalar types, so the increased performance could just be because of that.

FLOPS are one thing, but you need to get the data to the ALU first, so it’s hard to compare devices with different designs. Quite apart from the compilers, the hardware is just so different.

  1. Have not yet tried your suggestion; it looks good. I have not yet learnt how to set environment variables under Cocoa (one option is sketched in the snippet after this list); however, I have earlier, through other means, looked at the IR on nVidia… perhaps the same will work for AMD…

  2. Still reconstructing the computer; using a borrowed, much smaller drive from my day job at the Uni. BUT, since this is a temp disk, I decided to go straight to Lion and see what I might see…

  3. SO, under Lion everything works on both CPU and GPU. SO, SO, SO… this is a problem with AMD (under Apple) only, and not with OpenCL in general… (I think!) I.e., all vectorized code under Snow Leopard gave “llvm compiler has failed to compile a function”; BUT, under Lion on nVidia everything works like it should. (I am very, very far from a beginning C programmer!!)

  4. So, I will post this on the AMD forum. Can’t have this stopping the show. I am positively surprised by the performance on Xeon CPUs, and if necessary I will pay the bucks for an 8-core or 12-core Mac Pro; the 8-core might do the job itself (only when vectorized), or it might need 12. Stunning that the CPU performance is high enough to consider not using the GPU at all. BUT, I’d rather not spend those bucks! An obvious workaround is to send half of the job to the GPU and do half of each frame on the CPU; but for that I’d need to solve the AMD vectorizing problem first!
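(Re item 1: one way to set an environment variable that works from inside a Cocoa app is plain old setenv(), called before the OpenCL context is created. Whether Apple’s OpenCL stack actually passes this AMD SDK variable through is another question; this is just a sketch:)

#include <stdlib.h>

int main(int argc, char **argv)
{
	// Set the variable before any OpenCL context exists; untested
	// whether Apple's runtime honours this AMD SDK variable.
	setenv("GPU_DUMP_DEVICE_KERNEL", "2", 1);
	/* ... create context, build program, run kernels ... */
	return 0;
}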

Thanks for your suggestions. I’ll look again at the AMD media-ops extensions; they looked like they might be in the right area, but at first glance I didn’t see anything that immediately looked like a solution… however, I am absolutely driven at this point, and shall keep pushing until this is done and dusted.

All fingers and toes crossed!

edit: almost no data at all going in; a bunch of data coming out… it’s all generated algorithmically based on the X,Y position derived from the global ID…
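(i.e., indexing of roughly this shape; a guess at the setup, with width a hypothetical kernel argument:)

// Assumed position-from-ID indexing; the real kernel may instead use
// a 2-D NDRange with get_global_id(0)/get_global_id(1).
size_t gid = get_global_id(0);
int x = (int)(gid % width);
int y = (int)(gid / width);
// everything written to obuf is then computed from (x, y)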

Weirdness update . . .

If I do this:

uint i0,i1,i2,i3;											//\\ FIGURE UINTs
	i0 = (uint)round(pif.r.s0*255.f) << 8 | (uint)round(pif.g.s0*255.f) << 16 | (uint)round(pif.b.s0*255.f) << 24;
	i1 = (uint)round(pif.r.s1*255.f) << 8 | (uint)round(pif.g.s1*255.f) << 16 | (uint)round(pif.b.s1*255.f) << 24;
	i2 = (uint)round(pif.r.s2*255.f) << 8 | (uint)round(pif.g.s2*255.f) << 16 | (uint)round(pif.b.s2*255.f) << 24;
	i3 = (uint)round(pif.r.s3*255.f) << 8 | (uint)round(pif.g.s3*255.f) << 16 | (uint)round(pif.b.s3*255.f) << 24;

	dad[0] = i0;											//\\ WRITE UINTs
	dad[1] = i1;
	dad[2] = i2;
	dad[3] = i3;

then I can NOW do any THREE of the above writes without crashing the 5870; perfect results. ANY three, but if I do all four, it hangs Lion. Closer, right? :)

Now, I’ve used this exact same code before and been restricted to (at first) a single write, then (later) any two non-adjacent writes; now I can do three? I have moved things around and made trivial changes, but I have no concrete idea which change has made this work further than it did before.

I have yet to post on the AMD forum; I should even be able to give a shortened kernel that does just those writes as a failure example. I’ve been focusing on other stuff, so this message is just an update for anyone who may have been following along.

Well, I think the crashes are to be expected if the writes go out of bounds; what you would need to do here is make sure dad stays inside the range of obuf. The graphics card itself could also be responsible for the crashes; try and see whether the problem persists after replacing it with another one.

Hi, I have a similar…ish problem (on OS X 10.6.8 with an AMD graphics card, OpenCL 1.0).

The following code works fine:


__kernel void testScalar(__global float4 *P){

  int i = get_global_id(0);
 
  float f;
  f = 137; 
    
  if(i > 100){
    f = 14;
  }else{ 
    f = 17;
  }

  P[i].x = f;
//  vstore4(f,0,(__global float*)&P[i].x);
}

I’m getting the values back and it just works. However, if I try to use vector types I get the “cvmsErrorCompilerFailure: LLVM compiler has failed to compile a function” error.

Namely the following code:


__kernel void testVector(__global float4 *P){

  int i = get_global_id(0);
 
  float2 f;
  f.x = 137; 
    
  if(i > 100){
    f.x = 14;
  }else{ 
    f.x = 17;
  }

  P[i].x = f.x;
}

It does not compile. Both testScalar() and testVector() work fine on my Linux box with an Nvidia card (OpenCL 1.1).

A slightly different program, with this kind of procedure wrapped inside a local function, did compile; however, the float variable ended up being set to zero whenever the if statement was present (and was not trivial).

Am I experiencing the same possible bug or do I violate some rules here?

> Am I experiencing the same possible bug or do I violate some rules here?

It definitely looks like a bug. I would report it to the hardware vendor.
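In the meantime, one thing that might be worth trying is removing the control flow the compiler seems to trip on, e.g. with select(). A sketch, untested against this particular bug:

__kernel void testVectorNoBranch(__global float4 *P){

  int i = get_global_id(0);

  float2 f;
  // select(a, b, c) returns b when c is non-zero, a otherwise,
  // so this matches the original if/else without a branch:
  f.x = select(17.0f, 14.0f, i > 100);

  P[i].x = f.x;
}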

Hi, folks. I apologize for not posting my solution here before now; I had posted on the AMD board and nobody helped, but that’s where I was when I fixed it. Here is an excerpt from my last post on that thread:

"
GOT IT!

Well, that one was weird from nose to tail, let me tell you:

You’ll recall that I was doing something like


uint i0,i1,i2,i3; float4 r,g,b; // (obuf is __global uint *)

i0 = (uint) round( r.s0 * 255.f ) << 8 | (uint) round( g.s0 * 255.f ) << 16 | (uint) round( b.s0 * 255.f ) << 24;
i1 = (uint) round( r.s1 * 255.f ) << 8 | (uint) round( g.s1 * 255.f ) << 16 | (uint) round( b.s1 * 255.f ) << 24;
i2 = (uint) round( r.s2 * 255.f ) << 8 | (uint) round( g.s2 * 255.f ) << 16 | (uint) round( b.s2 * 255.f ) << 24;
i3 = (uint) round( r.s3 * 255.f ) << 8 | (uint) round( g.s3 * 255.f ) << 16 | (uint) round( b.s3 * 255.f ) << 24;

obuf[offset ] = i0;
obuf[offset+1] = i1;
obuf[offset+2] = i2;
obuf[offset+3] = i3;

. . . what I could do without crashing was either: 1) all four of the assignments, but only three of the writes, or 2) all four of the writes, but only three of the assignments. If I tried to do all four of both, Lion hung.

SO!

What fixed it?

Removing the call to round().

Makes perfect sense, doesn’t it? Well, it might if I had access to the source code of the compiler, and I might also need detailed schematics of the 5870. … It turns out that the call to round() is not needed anyway.

Of course, now that it works, it’s been collapsed to a single uint4 operation on the float4s, followed by a single vstore4().
"

… SO, my original problem on this thread looks like it could be a bug too, but I don’t have the luxury of waiting for a fix; I’m just happy that I found a way around it!

tronin, yes, I would have tried swapping cards if I had another AMD card around, because the bug made me think the errors were caused only when I was using all adjacent elements of a VLIW ALU, or something like that. I wish I knew why removing round() fixed it.

notzed, I do want to look at the binaries again now, so I went back to what worked before, calling clGetProgramInfo with CL_PROGRAM_BINARIES, but I’m not getting it to work now. I don’t know whether your environment variable would work under Xcode. I do get something back, but it appears to be an actual binary file with just a little bit of (human-readable) delimiting information at either end: a big one for the GPU, a smaller one for the CPU, but with similar-looking wrappers.
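(For reference, the retrieval being described is the standard two-step clGetProgramInfo dance; a sketch, with error checking omitted and a two-device GPU+CPU context assumed:)

#include <stdlib.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

// Fetch the per-device binaries of a built program: first the sizes,
// then the blobs themselves.
void dump_binaries(cl_program program)
{
	size_t sizes[2];
	clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
	                 sizeof(sizes), sizes, NULL);

	unsigned char *bins[2];
	bins[0] = malloc(sizes[0]);
	bins[1] = malloc(sizes[1]);
	clGetProgramInfo(program, CL_PROGRAM_BINARIES,
	                 sizeof(bins), bins, NULL);

	// bins[] now hold the device binaries; on Apple's stack these are
	// wrapped blobs, not bare readable ISA.
	free(bins[0]);
	free(bins[1]);
}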

This used to work under Snow Leopard; I have an old binary for the nVidia card that I pulled back then. Perhaps it’s time to try updating from Xcode 4.1 to 4.2…