CL_INVALID_WORK_GROUP_SIZE on OSX Lion

Hi All,

I am using JOCL as the binding for OpenCL.

The exact same code runs OK (and fast) on Snow Leopard (OpenCL 1.0) but throws the following error on Lion (OpenCL 1.1):

com.jogamp.opencl.CLException$CLInvalidWorkGroupSizeException: can not enqueue 1DRange CLKernel [id: 140699706121896 name: IntegrateHHStep]
with gwo: null gws: {256} lws: {256}
cond.: null events: null [error: CL_INVALID_WORK_GROUP_SIZE]

I am using the following code to define the local workgroup size and global worksize for the I/O buffers:

// Length of array to process
int elementCount = models.size();
// Local work size dimensions for the selected device
int localWorkSize = min(device.getMaxWorkGroupSize(), 256);
// rounded up to the nearest multiple of the localWorkSize
int globalWorkSize = roundUp(localWorkSize, elementCount);
// results buffers are bigger as we are capturing every value for every item for every time-step
int globalWorkSize_Results = roundUp(localWorkSize, elementCount*timeConfigSteps);
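
The roundUp helper is not shown above. A minimal sketch of what such a helper typically does, assuming the (groupSize, globalSize) argument order used in the calls above; the class name WorkSizeMath is hypothetical and not part of JOCL:

```java
public final class WorkSizeMath {
    private WorkSizeMath() {}

    /** Rounds globalSize up to the nearest multiple of groupSize. */
    public static int roundUp(int groupSize, int globalSize) {
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        int remainder = globalSize % groupSize;
        // Already a multiple: nothing to pad. Otherwise pad up to the next multiple.
        return remainder == 0 ? globalSize : globalSize + groupSize - remainder;
    }

    public static void main(String[] args) {
        System.out.println(roundUp(256, 1000)); // prints 1024
        System.out.println(roundUp(256, 200));  // prints 256
    }
}
```

Because the padded global size can be larger than elementCount, the kernel would need to skip work-items whose global id falls past the real element count.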

If I set localWorkSize to 0 so that a work-group size is picked automatically, this eventually works on Lion, but performance drops considerably compared to Snow Leopard:

int elementCount = models.size();
int localWorkSize = 0;
int globalWorkSize = elementCount;
int globalWorkSize_Results = elementCount*timeConfigSteps;

Can someone explain what could be going on here? Is the error about the local or the global work size, and does anyone have a clue how to troubleshoot or fix this?

Any help appreciated!

Thanks!

int localWorkSize = min(device.getMaxWorkGroupSize(), 256);

That part is the problem. The maximum work-group size actually depends on the kernel you are trying to run. You are only querying the maximum that the device can theoretically support.

In the C bindings you would need to call clGetKernelWorkGroupInfo(…, CL_KERNEL_WORK_GROUP_SIZE, …) to query the maximum work-group size that you can use in a particular kernel.
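
As a rough sketch of how that query would feed into the sizes from the question: clamp the local size to the per-kernel limit first, then round the global size up to a multiple of it. The code below is plain Java with no OpenCL calls. kernelMax stands in for whatever CL_KERNEL_WORK_GROUP_SIZE reports for the kernel on the chosen device, and the class and method names are hypothetical. In JOCL I believe the value is exposed through CLKernel.getWorkGroupSize(CLDevice), but treat that as an assumption and check it against the API of your version.

```java
public final class KernelWorkSizes {
    private KernelWorkSizes() {}

    /**
     * Picks a local work size that respects both the device-wide limit and
     * the per-kernel limit (CL_KERNEL_WORK_GROUP_SIZE), capped at 'preferred'.
     */
    public static int chooseLocalSize(long deviceMax, long kernelMax, int preferred) {
        long limit = Math.min(deviceMax, kernelMax);
        if (limit < 1 || preferred < 1) {
            throw new IllegalArgumentException("sizes must be >= 1");
        }
        return (int) Math.min(limit, preferred);
    }

    /** Pads elementCount up to the next multiple of localSize. */
    public static int paddedGlobalSize(int localSize, int elementCount) {
        return ((elementCount + localSize - 1) / localSize) * localSize;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: the device allows 1024, but this kernel only 128.
        int local = chooseLocalSize(1024, 128, 256);
        int global = paddedGlobalSize(local, 1000);
        System.out.println(local + " " + global); // prints "128 1024"
    }
}
```

With a kernel limit of 1 (hypothetical here), the same code would fall back to a local size of 1 instead of failing at enqueue time with 256.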

Hi David, thanks for your help.

That part is the problem. The maximum work-group size actually depends on the kernel you are trying to run. You are only querying the maximum that the device can theoretically support.

Can you elaborate on this? How would my kernel affect the maximum work-group size? Yeah, I am a noob!

In the C bindings you would need to call clGetKernelWorkGroupInfo(…, CL_KERNEL_WORK_GROUP_SIZE, …) to query the maximum work-group size that you can use in a particular kernel.

This sounds promising, so basically I need to find the equivalent of this in JOCL.

Can you elaborate on this? How would my kernel affect the maximum work-group size? Yeah, I am a noob!

One of the key factors is that the hardware has a limited supply of general-purpose registers (GPRs). The more work-items you want to execute simultaneously, the more GPRs you will need. Other factors, such as how much local memory the kernel uses, can limit it as well.
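
To make the register argument concrete, here is a deliberately simplified back-of-the-envelope calculation. The register counts are made up for illustration, the class and method names are hypothetical, and real hardware allocates registers in coarser granules and applies further limits, so take this only as a sketch of why a heavier kernel gets a smaller maximum work-group size:

```java
public final class RegisterBudget {
    private RegisterBudget() {}

    /**
     * Simplified model: every work-item in flight needs its own registers,
     * so the register file size bounds how many work-items fit at once.
     */
    public static int maxWorkItems(int registersAvailable, int registersPerWorkItem) {
        if (registersPerWorkItem <= 0) {
            throw new IllegalArgumentException("registersPerWorkItem must be positive");
        }
        return registersAvailable / registersPerWorkItem; // integer division rounds down
    }

    public static void main(String[] args) {
        int registerFile = 16384; // made-up register file size
        System.out.println(maxWorkItems(registerFile, 40));  // light kernel: prints 409
        System.out.println(maxWorkItems(registerFile, 100)); // heavy kernel: prints 163
    }
}
```

In this toy model the heavier kernel could no longer run with a work-group size of 256, which is the same kind of failure as in the question.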