OpenCL NDRange Global Size/Local Size

BiS · January 3, 2013, 3:01am

Hello,

As far as I understand, the NDRange global size should be a multiple of the local size.
But if it’s not, how does OpenCL handle it? (Or rather, does OpenCL handle it at all?)

I mean, how many “groups” of size local_size will be launched?

For example, which of these would be right for a global size of 1000 and a local size of 512?
a) We’ll have 2 groups of size 512.
b) We’ll have 1 group of size 512.
c) We’ll have 1 group of size 512 and 1 group of size 488.

And is this strictly defined by the standard, or is it implementation-dependent?

Thanks!

BiS · January 3, 2013, 4:00am

Well, on AMD’s implementation it looks like the kernel won’t even launch, so I think that answers my question.

utnapishtim · January 3, 2013, 7:18am

According to the OpenCL specification, clEnqueueNDRangeKernel should fail in that case and return CL_INVALID_WORK_GROUP_SIZE.
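
To make the rule concrete, here is a minimal sketch in plain C of the divisibility check a host program could run before enqueueing. The function name global_is_multiple_of_local is hypothetical and not part of the OpenCL API, and the check covers only the "multiple of local size" rule discussed here; a real runtime can reject a launch for other reasons as well (for example, a local size larger than the device supports).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of the OpenCL API: returns true when every
   global size is an exact multiple of the matching local size. */
static bool global_is_multiple_of_local(size_t work_dim,
                                        const size_t *global,
                                        const size_t *local)
{
    assert(global != NULL && local != NULL);
    for (size_t d = 0; d < work_dim; ++d) {
        /* A zero local size is invalid; a non-zero remainder breaks the rule. */
        if (local[d] == 0 || global[d] % local[d] != 0)
            return false;
    }
    return true;
}
```

With the sizes from the question (global 1000, local 512) this returns false, since 1000 is not a multiple of 512.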

Dithermaster · January 3, 2013, 9:33am

It is defined by the standard. You must make it a multiple.

The standard way of dealing with a desired global work size that is not a multiple is to pass the rounded-up value to clEnqueueNDRangeKernel, pass the desired global size to the kernel as parameters, and then check the global ID inside the kernel to see whether it falls inside the desired work size. For example, to process a 1920x1080 image with a 32x32 local work size, the global work size must be 1920x1088 (1080 rounded up to the next multiple of 32). The kernel might look like:

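__kernel void Example_Kernel
(
    __read_only   image2d_t imgSrc,
    __write_only  image2d_t imgDst,
    int       width,    // desired (unpadded) width, e.g. 1920
    int       height    // desired (unpadded) height, e.g. 1080
)
{
    int x = get_global_id(0);
    int y = get_global_id(1);

    // Skip the padding work-items that fall outside the desired size.
    if ((x < width) && (y < height))
    {
        ... // do work here
    }
}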
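
On the host side, the rounding-up itself can be sketched in plain C as below. The helper names round_up_global and num_work_groups are hypothetical (they are not OpenCL API functions); the rounded value is what would be passed as the global work size, while the original size goes to the kernel as width/height.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round `desired` up to the next multiple of `local`. */
static size_t round_up_global(size_t desired, size_t local)
{
    assert(local > 0);
    size_t remainder = desired % local;
    return remainder == 0 ? desired : desired + (local - remainder);
}

/* Hypothetical helper: number of work-groups launched along one dimension
   once the global size has been rounded up. */
static size_t num_work_groups(size_t desired, size_t local)
{
    return round_up_global(desired, local) / local;
}
```

For the 1920x1080 image with a 32x32 local size this gives 1920x1088, i.e. 60x34 work-groups. For the original question (1000 with local size 512) it gives a global size of 1024 and 2 work-groups, with the last 24 work-items skipped by the bounds check in the kernel.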

To get started, you can leave the local work size unspecified (pass NULL as local_work_size) and let the runtime choose one, but if your global sizes are odd or prime, it might pick 1x1, which is not optimal.

BiS · January 4, 2013, 6:52am

Yeah thanks, did so :).

I’m working on an OpenCL “middleware”, so I needed to know every possible combination, but it looks like rounding up works fine. That’s good for me, I think; after all, it’s the same approach as in CUDA :).