clEnqueueNDRangeKernel max global_work_size

Is there any maximum for the global_work_size other than 2^address_bits?

I tried to pass a bit more than 100,000,000 and clEnqueueNDRangeKernel returns CL_INVALID_VALUE.

I am working on 64-bit Vista with the latest nVidia drivers.

No. The spec puts no limits on the size except for the size of the type holding the data, which is a size_t. The OpenCL implementation is responsible for breaking up your requested global size into something that will work on the hardware. So if your size is not being accepted it sounds like a bug with the Nvidia implementation.

I would suggest you make sure that your global size is a multiple of 32, though. Otherwise the driver may be forced to use a local size that is non-optimal for the card and you will get substantially worse performance.
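A minimal sketch of that rounding in plain C, in case it helps. The helper name round_up_global is made up for illustration, and 32 is just the multiple suggested above. If you pad the global size this way, the kernel needs its own bounds check so the extra work-items do nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round n up to the next multiple of `multiple`.
   Returns n unchanged when it is already a multiple. */
static size_t round_up_global(size_t n, size_t multiple)
{
    assert(multiple > 0);
    size_t remainder = n % multiple;
    return remainder == 0 ? n : n + (multiple - remainder);
}
```

For example, round_up_global(100000001, 32) gives 100000032, while 134217728 is already a multiple of 32 and comes back unchanged.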

Thanks.

To be precise, I use (134217728,1,1) for the global size and (512,1,1) for the local size.

Without any nVidia account, is it possible to report OpenCL driver bugs somewhere? Perhaps nVidia guys read this forum?

Did you verify via clGetKernelWorkGroupInfo that this local size is valid for your kernel? Since that’s the maximum the hardware supports, it will only be okay if your kernel is using very few registers. (Or you can just pass in NULL for the local size.)

Very interesting. I had actually used 512, the maximum my device reports. But now I’ve tried letting the implementation choose the local size, and the problem is the same.

If the kernel uses too many registers, shouldn’t the return value be CL_OUT_OF_RESOURCES?

You have to put in a number for the total local work-group size that is less than or equal to what is returned by the clGetKernelWorkGroupInfo call (the CL_KERNEL_WORK_GROUP_SIZE query). I believe you should get CL_INVALID_WORK_GROUP_SIZE if the size is not valid.
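A rough sketch of choosing a local size under that rule, assuming kernel_max holds the value you got back from the CL_KERNEL_WORK_GROUP_SIZE query. The helper name largest_valid_local_size is hypothetical. It also assumes the OpenCL 1.x requirement that the global size be evenly divisible by the local size, and it only covers the one-dimensional case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest local size that is <= kernel_max
   and divides global evenly. kernel_max is assumed to come from
   clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...). */
static size_t largest_valid_local_size(size_t global, size_t kernel_max)
{
    assert(global > 0 && kernel_max > 0);
    for (size_t local = kernel_max; local > 1; --local) {
        if (global % local == 0) {
            return local;
        }
    }
    return 1; /* 1 always divides the global size */
}
```

With the sizes from this thread, a kernel limit of 512 keeps 512 for a global size of 134217728, while a limit of 384 would drop it to 256.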

Thank you very much, I forgot this “clGetKernelWorkGroupInfo” call… It’s essential.

In fact I get CL_INVALID_VALUE as soon as the ratio of global size to local size (the number of work-groups) is over 65535… I guess the driver stores this ratio in a variable that is too small :)…
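To make the arithmetic concrete, here is a small sketch in plain C. The function names are made up for illustration, and 65535 is only the threshold observed in this thread, not something the OpenCL API reports.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups in one dimension.
   Assumes global is a multiple of local. */
static size_t work_group_count(size_t global, size_t local)
{
    assert(local > 0 && global % local == 0);
    return global / local;
}

/* Returns 1 if the work-group count stays within max_groups, else 0. */
static int fits_group_limit(size_t global, size_t local, size_t max_groups)
{
    return work_group_count(global, local) <= max_groups;
}
```

With global 134217728 and local 512 the count is 262144, well over 65535. The largest global size that stays within 65535 groups of 512 is 65535 * 512 = 33553920.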

That sounds like a hardware limit that the Nvidia driver is not handling correctly. I would suggest filing a bug against them if you can.

I can’t without an account (I just tried again to create one…).

195.181 nVidia beta drivers don’t fix the problem.

Would it be possible to create a section on this forum for reporting implementation bugs, which manufacturers could read?

In the NVIDIA OpenCL guide we can read:

The maximum size of each dimension of a grid of thread blocks is 65535;

So there is a limitation, set by nVidia, which isn’t defined by the specification and which we can’t query through the API… This constraint limits portability a bit.

It is strange that you can use arbitrary sizes on Nvidia cards on Mac OS X but not with Nvidia’s own drivers. All they would have to do is break up the runtime kernel execution into 65k-sized chunks, since each execution is necessarily independent.
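Until the driver does this itself, the same splitting can be done by hand on the host. Below is a sketch of the chunk planning only, in plain C. The chunk_t type and both function names are hypothetical, the global size is assumed to be a multiple of the local size, and max_groups would be 65535 here. Each chunk would then go to its own clEnqueueNDRangeKernel call. Note that OpenCL 1.0 requires global_work_offset to be NULL, so on 1.0 the offset has to be passed as an ordinary kernel argument and added to get_global_id(0) inside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one launch: where it starts
   and how many work-items it covers. */
typedef struct {
    size_t offset;
    size_t size;
} chunk_t;

/* Number of launches needed so that no launch exceeds
   max_groups work-groups of `local` work-items each. */
static size_t chunk_count(size_t global, size_t local, size_t max_groups)
{
    assert(local > 0 && max_groups > 0 && global % local == 0);
    size_t items_per_chunk = local * max_groups;
    return (global + items_per_chunk - 1) / items_per_chunk;
}

/* Offset and size of launch number `index` (0-based). */
static chunk_t chunk_at(size_t global, size_t local,
                        size_t max_groups, size_t index)
{
    assert(index < chunk_count(global, local, max_groups));
    size_t items_per_chunk = local * max_groups;
    chunk_t c;
    c.offset = index * items_per_chunk;
    c.size = global - c.offset;
    if (c.size > items_per_chunk) {
        c.size = items_per_chunk; /* every chunk but the last is full */
    }
    return c;
}
```

For the sizes in this thread (global 134217728, local 512, limit 65535) this gives five launches: four of 33553920 work-items and a final one of 2048.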