What is a uint4?

Hi,

I’m somewhat confused by the uint4/float4 data type. I was told previously to use uint4’s for extra speed when coding OpenCL kernels for my HD4870 GPU, but there seems to be a little ambiguity in terms of what a uint4 is. A quick Google search tells me it’s simply a 4-byte unsigned integer, but deeper research reveals in the context of OpenCL it’s actually a vector type, consisting of 4 integers woven into one variable. How does this work? Is it like a typedef struct where you can access each element of the object? How big are the four integers? Can anyone show me or link me to some example code showing how to set and retrieve these (including the four integer elements) and also show me how to define the uint4 data type correctly in non-OpenCL C code (so I can declare uint4’s in my calling application and pass them into the OpenCL kernel).

Thanks

OpenCL has vector types for all the basic integer and floating-point types, but not for booleans.
E.g., for float there are float2, float4, float8, and float16; likewise uchar2, uchar16, etc.

These vector types are designed to map well to the underlying hardware. On a CPU with SSE, for example, float4 operations might map to SSE instructions and therefore run faster. AMD cards such as yours are 4-way vector devices, so you want to use 4-way vectors to get the best performance. If you use only scalar floats, the compiler may end up using just 1 of the 4 ALUs at a time.

You can read how to access vector types in the OpenCL spec, but the basics are:

float4 vec;
vec.x = 0;
vec.y = 0;
vec.z = 0;
vec.w = 0;

or

vec.s0 = 0;
vec.s1 = 0;
vec.s2 = 0;
vec.s3 = 0;

And you can do:

float4 vecA, vecB;
vecA = vecB.xxyy;
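To spell out what that last swizzle does, here is a minimal plain-C sketch. The type `float4_sketch` and the function `swizzle_xxyy` are hypothetical names made up for illustration; they are not part of OpenCL.

```c
#include <assert.h>

/* Hypothetical stand-in for an OpenCL float4 (not the real type). */
typedef struct { float x, y, z, w; } float4_sketch;

/* The stand-in should be exactly four packed floats. */
static_assert(sizeof(float4_sketch) == 4 * sizeof(float), "unexpected padding");

/* Plain-C equivalent of the OpenCL swizzle `b.xxyy`:
   the result holds (b.x, b.x, b.y, b.y). */
float4_sketch swizzle_xxyy(float4_sketch b)
{
    float4_sketch a = { b.x, b.x, b.y, b.y };
    return a;
}
```

So with vecB = (1, 2, 3, 4), the assignment above leaves vecA = (1, 1, 2, 2).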

I’m not sure how to get them to work transparently in C, but you can always do the following:

cl_float* data = (cl_float*)malloc(sizeof(cl_float)*4*length);

then to access X:
data[4*i+0] = …
Y:
data[4*i+1] = …

etc.
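On the host side, the OpenCL headers (CL/cl.h, via cl_platform.h) also define types such as cl_uint4 and cl_float4, whose components can be reached through an `s` array member (e.g. `v.s[0]`). The sketch below assumes no OpenCL SDK is installed, so it uses a hypothetical stand-in, `host_uint4`, with the same four-by-32-bit layout. The helper names are invented for illustration as well.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for cl_uint4 from the OpenCL headers:
   four 32-bit unsigned integers, 16 bytes in total. */
typedef struct { uint32_t s[4]; } host_uint4;

static_assert(sizeof(host_uint4) == 16, "host_uint4 must be 16 bytes");

/* Allocate `length` vectors and set vector i to (4i, 4i+1, 4i+2, 4i+3).
   Returns NULL on allocation failure; the caller frees the result. */
host_uint4 *make_uint4_buffer(size_t length)
{
    host_uint4 *data = malloc(length * sizeof *data);
    if (data == NULL)
        return NULL;
    for (size_t i = 0; i < length; i++)
        for (uint32_t c = 0; c < 4; c++)
            data[i].s[c] = (uint32_t)(4 * i) + c;
    return data;
}

/* The same memory viewed as a flat array: component c of vector i
   lives at index 4*i + c, as in the flat-array approach above. */
uint32_t flat_component(const host_uint4 *data, size_t i, size_t c)
{
    const uint32_t *flat = (const uint32_t *)data;
    return flat[4 * i + c];
}
```

Either view describes the same bytes, so a buffer filled this way can be copied to the device as-is and read in the kernel as an array of uint4.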

Okay, thanks. So those are 4-byte ints?

ints are defined to be 32 bits in OpenCL. The spec fixes all the type sizes so that they are uniform across platforms.

I should clarify my previous statement: on all platforms, char is 8 bits, short is 16 bits, int is 32 bits, float is 32 bits, and long is 64 bits. This holds regardless of whether the platform is 32-bit or 64-bit. (E.g., an int is always 32 bits on both a 32-bit CPU and a 64-bit CPU.)
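One consequence for the calling application: ordinary C does not guarantee these sizes (a host `long` may be only 32 bits, for instance), so host code should use fixed-width types. The real OpenCL headers provide cl_char, cl_short, cl_int, cl_long and cl_float for this. The sketch below uses hypothetical `k_*` aliases built on stdint.h so that it compiles without the SDK, and `vector_bytes` is an invented helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side aliases mirroring the OpenCL C scalar sizes. */
typedef int8_t  k_char;
typedef int16_t k_short;
typedef int32_t k_int;
typedef int64_t k_long;
typedef float   k_float;

/* Compile-time checks that the host types match the kernel-side sizes. */
static_assert(sizeof(k_char)  == 1, "char must be 8 bits");
static_assert(sizeof(k_short) == 2, "short must be 16 bits");
static_assert(sizeof(k_int)   == 4, "int must be 32 bits");
static_assert(sizeof(k_long)  == 8, "long must be 64 bits");
static_assert(sizeof(k_float) == 4, "float assumed to be 32 bits on this host");

/* Size in bytes of an n-component vector, for n = 2, 4, 8 or 16. */
size_t vector_bytes(size_t elem_size, size_t n)
{
    return elem_size * n;
}
```

Under these sizes a uint4 or float4 occupies 16 bytes, and a float16 occupies 64.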

Okay, so do you get a speed benefit from simply switching data types or do you have to use them in a certain way?

For example, supposing I have some code which declares four integers, then loops through them doing stuff, e.g.:

int a=1, b=2, c=3, d=4;
int i, j, k, l;

for (i=0; i<1000; i++) {
    a += i;
    for (j=0; j<1000; j++) {
        b += j;
        for (k=0; k<1000; k++) {
            c += k;
            for (l=0; l<1000; l++) {
                d += l;
            }
        }
    }
}

Obviously this is a really crude example (it’s really early in the morning here), but assuming that instead of just += there was something a bit more involved going on within the loops, would I see a benefit from changing the four ints to a uint4, or would I need to use special OpenCL math functions specifically designed to operate on uint4’s?

If you are doing the same operation on all four elements, then you can (on vector architectures) get up to 4x the performance. E.g.,

int4 a, b;
a += b;

is equivalent to:

a.x += b.x;
a.y += b.y;
a.z += b.z;
a.w += b.w;

So if your architecture supports it, you’ll do one instruction instead of 4. There’s nothing magical here. This is just the standard SIMD/short-vector performance benefit.
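For comparison, this is roughly what the same operation looks like when written out in scalar C on the host. `int4_sketch` and `int4_add_assign` are hypothetical names for illustration; in an OpenCL kernel you would just write `a += b;` on real int4 values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an OpenCL int4 (not the real type). */
typedef struct { int32_t x, y, z, w; } int4_sketch;

static_assert(sizeof(int4_sketch) == 16, "int4_sketch must be 16 bytes");

/* Scalar spelling of the OpenCL statement `a += b;` for int4 operands:
   four independent adds, one per component. */
void int4_add_assign(int4_sketch *a, int4_sketch b)
{
    a->x += b.x;
    a->y += b.y;
    a->z += b.z;
    a->w += b.w;
}
```

The four adds do not depend on each other, which is what lets vector hardware issue them together.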

And what if I’m doing a different operation on all four vector elements? No performance benefit?

Depends on what the hardware and compiler support. Some SIMD implementations (e.g., Larrabee) can do lots of fancy stuff via masks if the compiler can figure it out. I don’t know what AMD does or doesn’t support, unfortunately.
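To give a rough idea of what "via masks" can mean, here is a scalar C sketch of one common pattern: compute both candidate results for every lane, then keep one per lane according to a mask. The type `lanes4` and the function `add_or_sub_masked` are hypothetical, and this is only an illustration of the general idea, not a description of what any particular compiler or GPU actually does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-lane integer vector (not a real OpenCL type). */
typedef struct { int32_t s[4]; } lanes4;

static_assert(sizeof(lanes4) == 16, "lanes4 must be 16 bytes");

/* Per lane: keep a + b where the mask lane is non-zero, a - b otherwise.
   Both results are computed for every lane, then one is selected. */
lanes4 add_or_sub_masked(lanes4 a, lanes4 b, lanes4 mask)
{
    lanes4 r;
    for (int i = 0; i < 4; i++) {
        int32_t sum  = a.s[i] + b.s[i];
        int32_t diff = a.s[i] - b.s[i];
        r.s[i] = mask.s[i] ? sum : diff;
    }
    return r;
}
```

The cost is that every lane does the work of both branches, so the benefit shrinks as the per-element operations diverge. OpenCL C exposes a similar per-component choice through its built-in select function.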