I want to pass a short array of floats as input to every work item, say fewer than ten floats.
I gather you can’t do the following, because fruits decays to a __private float pointer, which is not allowed as a kernel argument (section 6.5 of the OpenCL spec).
__kernel void simpleKernel( __global float* notRelevant, float fruits[2] )
{
    ... // not allowed, but gives the most natural coding syntax for my
        // case, e.g. for (i = 0; i < 2; i++) sum += fruits[i];
}
I can think of two alternatives. Are there others? What are the performance implications?
First, we can do the exact same thing using a structure.
typedef struct {
float one;
float two;
} structOfFruits_t;
__kernel void simpleKernel( __global float* notRelevant, structOfFruits_t fruits )
{
... //this method confirmed to work
}
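One variation on the struct idea that might recover the loop syntax: put a fixed-size array inside the struct and pass the struct by value. This is only a sketch. The names fruitArray_t, NUM_FRUITS and sumFruits are made up for illustration, and the #ifndef block at the top is a host-side shim (my assumption, not part of OpenCL) so the same file also compiles as plain C for a quick CPU check.

```c
/* Host-side shim (assumption, for illustration only): an OpenCL compiler
   defines __OPENCL_VERSION__ and skips this block; a plain C compiler uses
   it so the kernel body can be exercised on the CPU. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t get_global_id(unsigned int dim) { (void)dim; return 0; }
#endif

#define NUM_FRUITS 2  /* hypothetical size; anything small should do */

/* The array is a member of the struct, so the argument is a by-value
   struct rather than a pointer in the __private address space. */
typedef struct {
    float v[NUM_FRUITS];
} fruitArray_t;

__kernel void sumFruits(__global float* out, fruitArray_t fruits)
{
    float sum = 0.0f;
    for (int i = 0; i < NUM_FRUITS; i++)
        sum += fruits.v[i];          /* array-style indexing in the kernel */
    out[get_global_id(0)] = sum;     /* each work item stores its own sum */
}
```

On the host, the struct would be set like any other by-value argument, e.g. clSetKernelArg(kernel, 1, sizeof(fruitArray_t), &fruits), with the host-side struct declared using cl_float so the layout matches. Whether the copy ends up in registers or somewhere slower is up to the implementation.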
Second, we can declare a small (~10 element) buffer in constant or global memory.
__kernel void simpleKernel( __global float* notRelevant, __constant float* fruits )
{
... //all work items will likely access fruits[0], fruits[1], etc. nearly simultaneously
}
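For the buffer alternative, here is a minimal sketch of what the kernel body could look like. The kernel name sumFruitsConstant and the extra numFruits argument are hypothetical, and the #ifndef block is again a host-side shim (an assumption for illustration) that lets the code compile as plain C.

```c
/* Host-side shim (assumption, for illustration only): skipped by an OpenCL
   compiler, used by a plain C compiler to run the kernel body on the CPU. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
#define __constant const
static size_t get_global_id(unsigned int dim) { (void)dim; return 0; }
#endif

__kernel void sumFruitsConstant(__global float* out,
                                __constant float* fruits,
                                int numFruits)  /* hypothetical length arg */
{
    float sum = 0.0f;
    /* Every work item reads the same index in the same iteration. */
    for (int i = 0; i < numFruits; i++)
        sum += fruits[i];
    out[get_global_id(0)] = sum;
}
```

The host would create the buffer with clCreateBuffer (for example with CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR) and pass it with clSetKernelArg(kernel, 1, sizeof(cl_mem), &buffer). Since all work items read the same element at the same time, this is the uniform access pattern constant memory is generally meant for, though how fast it is in practice depends on the device and compiler.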
Can we speculate how the compiler would treat this case on a GPU? Which would win in a race: the private-memory structure or the constant-memory array?
By the way, the AMD OpenCL compiler catches the error in the first case:
“error: kernel pointer arguments must point to addrSpace global, local, or constant.
float fruits[2]”
The Apple OpenCL implementation does not (I’m on OS X 10.6.8). Instead I get an EXC_BAD_ACCESS from within the clSetKernelArg() call on the host. I’m thinking about reporting it…
Thanks for your thoughts.
-Noah