CL_DEVICE_MAX_WORK_GROUP_SIZE vs. CL_KERNEL_WORK_GROUP_SIZE

Hi, after a lot of reading I would like to clarify following:

I’m getting following values:
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_KERNEL_WORK_GROUP_SIZE: 256

So if I understand everything correctly, then CL_KERNEL_WORK_GROUP_SIZE gives as the ‘ultimate’ number of work-items that can be assigned to 1 work-group. And this we can find out only after we create a kernel. Anything above this value will couse errors, right? So basically returned value by CL_DEVICE_MAX_WORK_GROUP_SIZE can be ignored (no use of it if in reality we can use 1/4th of the size).

Additionally to this the spec. says “The OpenCL implementation uses the resource requirements of the kernel (register usage etc.)” So in theory does this mean that if I rewrite my kernel the returned value of CL_KERNEL_WORK_GROUP_SIZE can differ? If yes, are there some unwritten rules of how to achieve maximum value? For example the same as CL_DEVICE_MAX_WORK_GROUP_SIZE?

Thank you very much for any kind of feedback.

So if I understand everything correctly, then CL_KERNEL_WORK_GROUP_SIZE gives as the ‘ultimate’ number of work-items that can be assigned to 1 work-group. And this we can find out only after we create a kernel. Anything above this value will couse errors, right? So basically returned value by CL_DEVICE_MAX_WORK_GROUP_SIZE can be ignored (no use of it if in reality we can use 1/4th of the size).

That is correct.

So in theory does this mean that if I rewrite my kernel the returned value of CL_KERNEL_WORK_GROUP_SIZE can differ? If yes, are there some unwritten rules of how to achieve maximum value? For example the same as CL_DEVICE_MAX_WORK_GROUP_SIZE?

Right. Different kernels will have different maximum work-group sizes. This mostly depends on the number of general-purpose registers that are needed to run your kernel, which in turn depends on the number of __private (function scope) variables in your kernel.

David: thank you very much for the reply!

If we query CL_KERNEL_WORK_GROUP_SIZE, do we also have to query CL_KERNEL_LOCAL_MEM_SIZE and CL_KERNEL_PRIVATE_MEM_SIZE and check that they are in fact less than CL_DEVICE_LOCAL_MEM_SIZE and CL_DEVICE_PRIVATE_MEM_SIZE (actually this last one doesn’t seem to exist, is there any reason for that?).

The way that I’m interpreting CL_KERNEL_WORK_GROUP_SIZE is that passing any local_size less than or equal to it is guaranteed to work (maybe that’s an overly bold statement because other unrelated things could go wrong). Anyway, just wanted to double check my understanding.

If we query CL_KERNEL_WORK_GROUP_SIZE, do we also have to query CL_KERNEL_LOCAL_MEM_SIZE and CL_KERNEL_PRIVATE_MEM_SIZE and check that they are in fact less than CL_DEVICE_LOCAL_MEM_SIZE and CL_DEVICE_PRIVATE_MEM_SIZE (actually this last one doesn’t seem to exist, is there any reason for that?).

Good question. If you want to increase the probability that your call to clEnqueueNDRangeKernel() will succeed, then yes, you should call clSetKernelArg() for all arguments and then check whether CL_KERNEL_LOCAL_MEM_SIZE is less than or equal to CL_DEVICE_LOCAL_MEM_SIZE.

There is no CL_DEVICE_PRIVATE_MEM_SIZE because realistically any device will spill private memory into global memory under the hood if it doesn’t have enough private memory on chip.