options for passing short array

I want to pass a short array of floats as input to every work item – say, fewer than ten floats.

I gather you can’t do the following, because the array parameter fruits is really a __private float pointer, which is not allowed as a kernel argument (spec section 6.5).

__kernel void simpleKernel( global float* notrelavant , float fruits[2] )
{
  ...  // not allowed, but gives the most natural syntax for my problem,
  ...  // e.g.  for (i = 0; i < 2; i++) sum += fruits[i];
}

I can think of two alternatives. Are there others? What are the performance implications?

First, we can do the exact same thing using a structure.


typedef struct {
   float one;
   float two;
} structOfFruits_t;

__kernel void simpleKernel( global float* notrelavant, structOfFruits_t fruits )
{
  ...  // this method is confirmed to work
}

Or, we can declare a small (~10-element) buffer in constant or global memory.


__kernel void simpleKernel( global float* notrelavant, constant float* fruits )
{
  ...  //all work items will likely access fruits[0], fruits[1], etc. nearly simultaneously
}

Can we speculate how the compiler would treat this case on a GPU? Which would win out in a race, the private memory structure, or the constant memory array?

By the way, the AMD OpenCL compiler catches the error in the first case:
“error: kernel pointer arguments must point to addrSpace global, local, or constant.
float fruits[2]”
The Apple OpenCL implementation does not (I’m on OS X 10.6.8). Instead I get an EXC_BAD_ACCESS from within the clSetKernelArg() call on the host. I’m thinking about reporting it…

Thanks for your thoughts.
-Noah

I wish the first method worked, by the way. A fixed-size array[N] and a struct should be interoperable: both resolve to some number of bytes in private memory that clSetKernelArg() must fill.
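One way to get close to the array syntax is to wrap the fixed-size array in a struct and pass that struct by value. The following is only a sketch in plain C, not tested against any particular OpenCL implementation: fruits_t, NUM_FRUITS, pack_fruits and sum_fruits are hypothetical names, and sum_fruits is a host-side stand-in for what the kernel body would do. It assumes the same struct definition is repeated in the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_FRUITS 4  /* hypothetical length; any short, fixed size would do */

/* Hypothetical wrapper type: a struct whose only member is a fixed-size
 * array. The same definition would be repeated in the kernel source. */
typedef struct {
    float v[NUM_FRUITS];
} fruits_t;

/* Host side: copy a plain float array into the by-value argument. */
static fruits_t pack_fruits(const float *src)
{
    fruits_t packed;
    assert(src != NULL);
    memcpy(packed.v, src, sizeof packed.v);
    return packed;
}

/* Plain-C stand-in for the kernel body, showing the loop syntax the
 * wrapper allows. In OpenCL C the signature would look like
 *   __kernel void simpleKernel(global float *out, fruits_t fruits)
 * and the host would set the argument with
 *   clSetKernelArg(kernel, 1, sizeof(fruits_t), &packed);
 */
static float sum_fruits(fruits_t fruits)
{
    float sum = 0.0f;
    for (int i = 0; i < NUM_FRUITS; i++)
        sum += fruits.v[i];
    return sum;
}
```

Since the struct holds nothing but floats, its size should equal NUM_FRUITS * sizeof(float) on common platforms, so the host passes exactly the bytes of the array.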

I think you’re making this more complicated than it needs to be, or not quite understanding the spec.

Just use global float *fruits or constant float *fruits (if the values are truly constant, this will be faster on some hardware). The requirement is simply that pointer kernel arguments be non-private, which makes obvious sense: the host has no way to point into a work item’s private memory. If the array is indexed with something that is not a compile-time constant, private arrays typically end up in global memory anyway, so the performance characteristics should be about the same.

For a kernel argument, though: if you can use constant, use it; otherwise use global. For handing array data to the kernel through a pointer, they’re the only options! Whether you use structs or arrays is up to you, but arrays are often more efficient on parallel hardware (search ‘structure of arrays’).
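For anyone who hasn’t met the term, here is a minimal plain-C sketch of the two layouts. The type names, the field names x and y, and N_ITEMS are all hypothetical, chosen only for illustration. The usual argument is that with a structure of arrays, neighbouring work items reading x[gid] touch consecutive addresses, which GPUs can typically service more efficiently; how much this matters depends on the hardware.

```c
#include <assert.h>
#include <stddef.h>

#define N_ITEMS 4  /* hypothetical element count */

/* Array of structures (AoS): the fields of one item sit next to each other. */
typedef struct {
    float x;
    float y;
} item_aos_t;

/* Structure of arrays (SoA): all x values are contiguous, then all y values. */
typedef struct {
    float x[N_ITEMS];
    float y[N_ITEMS];
} items_soa_t;

/* Convert N_ITEMS AoS elements into a single SoA block. */
static void aos_to_soa(const item_aos_t *in, items_soa_t *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < N_ITEMS; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
    }
}
```

In the AoS layout, consecutive x values are sizeof(item_aos_t) bytes apart; in the SoA layout they are sizeof(float) bytes apart.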