As far as I understand, the NDRange global size should be a multiple of the local size.
But if it is not, how does OpenCL handle it? (Or rather, does OpenCL handle it at all?)
I mean, how many “groups” of size local_size will be launched?
For example, which of these would be right for a global size of 1000 and a local_size of 512?
a) We’ll have 2 groups of size 512.
b) We’ll have 1 group of size 512.
c) We’ll have 1 group of size 512 and 1 group of size 488.
And is this strictly defined by the standard, or is it implementation-dependent?
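It is defined by the standard: you must make it a multiple. In OpenCL 1.x, clEnqueueNDRangeKernel returns CL_INVALID_WORK_GROUP_SIZE if the global work size is not evenly divisible by the local work size, so none of (a), (b) or (c) happens; nothing is launched.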
The standard way of dealing with a desired global work size that is not a multiple is to pass the rounded-up value to clEnqueueNDRangeKernel, pass the desired global size to the kernel as parameters, and check the global ID inside the kernel to see whether it falls inside the desired work size. For example, to process a 1920x1080 image with a 32x32 local work size, the global work size must be 1920x1088 (1080 rounded up to the next multiple of 32). The kernel might look like:
__kernel void Example_Kernel
(
    __read_only image2d_t imgSrc,
    __write_only image2d_t imgDst,
    int width,
    int height
)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    // Skip the padding work-items that fall outside the desired work size
    if ((x < width) && (y < height))
    {
        ... // do work here
    }
}
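On the host side, the rounding-up step can be sketched as below. This is a minimal sketch in plain C, and the helper name round_up_to_multiple is hypothetical (it is not part of the OpenCL API); the result is what you would pass as the global work size for that dimension.

```c
#include <stddef.h>

/* Hypothetical helper: round `desired` up to the next multiple of `local`.
 * Assumes local > 0. */
static size_t round_up_to_multiple(size_t desired, size_t local)
{
    size_t remainder = desired % local;
    /* Already a multiple: keep it; otherwise pad up to the next multiple. */
    return remainder == 0 ? desired : desired + (local - remainder);
}
```

With the numbers from this thread, round_up_to_multiple(1080, 32) gives 1088, and round_up_to_multiple(1000, 512) gives 1024, i.e. 2 groups of 512 in which the last 24 work-items fail the bounds check and do nothing.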
To get started, you can leave the local work size unspecified (pass NULL for local_work_size) and let the runtime come up with one, but if you have odd or prime global sizes, it might use 1x1, which is not optimal.
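How a runtime picks a local size is implementation-defined, so the following is only an illustration of why awkward global sizes can end up with a local size of 1. It assumes the local size must divide the global size evenly and must not exceed some maximum; the helper name largest_dividing_local_size is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: largest value <= max_local that divides `global`
 * evenly. Assumes global > 0 and max_local > 0. */
static size_t largest_dividing_local_size(size_t global, size_t max_local)
{
    size_t candidate = max_local < global ? max_local : global;
    /* Walk downwards until a divisor is found; 1 always divides. */
    while (candidate > 1 && global % candidate != 0) {
        candidate--;
    }
    return candidate;
}
```

For a limit of 32, a global size of 1920 gives 32 and 1080 gives 30, but a prime size such as 1009 gives 1, which is the case where padding the global size and checking bounds in the kernel is the better choice.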
I’m working on an OpenCL “middleware”, so I needed to know every possible combination, but it looks like rounding up works fine. That’s good for me, I think; after all, it’s the same approach as CUDA :).