Can the host determine the number of workgroups? (related to reduce)

I need to perform a reduce operation (e.g. maximum) on a set of scalar values that are ancillary results of each work item. Each work item produces one scalar. I want the maximum among all the work items.

I was thinking I would break the reduce into two parts.
First, work items in a workgroup use shared local memory to produce the reduced value for that workgroup.
Second, the single value per workgroup is transferred back to the host. And the host performs the final reduce among all workgroups.
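The two-part plan above can be sketched on the CPU as follows. This is only a stand-in for the real thing, assuming float values and a global size that is a multiple of the local size: reduce_groups_max plays the role of the kernel (a plain loop instead of shared local memory), and reduce_host_max is the final host-side pass. Both function names are made up for illustration.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Stage 1 (stands in for the kernel): one partial maximum per workgroup.
   Assumption: global_size is a multiple of local_size. */
static void reduce_groups_max(const float *values, size_t global_size,
                              size_t local_size, float *partial)
{
    assert(local_size > 0 && global_size % local_size == 0);
    size_t num_groups = global_size / local_size;
    for (size_t g = 0; g < num_groups; ++g) {
        float m = -FLT_MAX;
        for (size_t l = 0; l < local_size; ++l) {
            float v = values[g * local_size + l];
            if (v > m)
                m = v;
        }
        partial[g] = m; /* same slot the kernel would pick via get_group_id(0) */
    }
}

/* Stage 2 (host): final maximum over the per-workgroup results. */
static float reduce_host_max(const float *partial, size_t num_groups)
{
    float m = -FLT_MAX;
    for (size_t g = 0; g < num_groups; ++g) {
        if (partial[g] > m)
            m = partial[g];
    }
    return m;
}
```

The host only needs num_groups and the layout of partial to run the second stage, which is exactly the question below.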

But how can the host know how many workgroup results there are to reduce? (and the arrangement in memory?)

Inside the kernel, we have get_num_groups(d) and get_group_id(d) which could be used to index into the global array for output.

I don’t see anything for host-side API calls that can produce the equivalent of get_num_groups(d).
Not before, during, or after kernel ND range execution. Am I missing something?

The closest I’ve seen, the CL_KERNEL_WORK_GROUP_SIZE property, represents the maximum work group size and thus doesn’t help.

I’ve got one idea so far, which is to give the first work item the special responsibility to write the values of get_num_groups(d) to the first DIM elements of the global output array. With this approach, the host would have to allocate the array with maximum size (assuming a minimum workgroup size of 1). And it will be trickier (multi-step) to transfer the values to host memory while avoiding unused memory locations.

Oh, and I am aware of the reqd_work_group_size attribute. I just don’t want to use it.

Thanks.

The developer generally knows the number of work-groups when they enqueue a kernel and specify the global and local sizes. For each dimension it’s global[i]/local[i]; for a multi-dimensional kernel, take the product of those over all dimensions.
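As a rough host-side sketch of that arithmetic, assuming each global size is an exact multiple of the matching local size: count_work_groups and flat_group_index are hypothetical helper names, not OpenCL API calls. The second helper shows one possible way to lay out the per-group results in memory (dimension 0 varying fastest); the kernel would have to compute the same index from get_group_id(d) and get_num_groups(d).

```c
#include <assert.h>
#include <stddef.h>

/* Total number of work-groups: product of global[i] / local[i].
   Assumption: global[i] is a multiple of local[i] in every dimension. */
static size_t count_work_groups(unsigned dims, const size_t *global,
                                const size_t *local)
{
    size_t total = 1;
    for (unsigned i = 0; i < dims; ++i) {
        assert(local[i] > 0 && global[i] % local[i] == 0);
        total *= global[i] / local[i];
    }
    return total;
}

/* One possible output layout: flatten the group id with dimension 0
   varying fastest. This is a choice, not something OpenCL mandates. */
static size_t flat_group_index(unsigned dims, const size_t *group_id,
                               const size_t *num_groups)
{
    size_t index = 0;
    size_t stride = 1;
    for (unsigned i = 0; i < dims; ++i) {
        index += group_id[i] * stride;
        stride *= num_groups[i];
    }
    return index;
}
```

For example, global sizes {1024, 512} with local sizes {16, 8} give 64 x 64 = 4096 work-groups, so the output buffer needs 4096 elements.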

Some SDKs allow you to enqueue kernels specifying only the global sizes. In my opinion it is better to set the local sizes manually rather than letting the SDK’s runtime do it for you. If you let the runtime set the local sizes, the host won’t know how many work-groups there are before executing the kernel. You could allocate an array sized for the maximum number of work-groups and have the kernel write the actual number of work-groups to a kernel argument. However, the maximum number of work-groups is not well defined. There is at least one work-item per work-group, so there are at most global_size work-groups, but that bound is not very helpful: there are generally many work-items per work-group, so the actual number of work-groups is usually far below it.
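To make the fallback concrete, here is a CPU-only sketch of that worst-case scheme, assuming int values and a 1-D range. fake_kernel stands in for a kernel whose local size was picked by the runtime (the chosen_local parameter is how the sketch models that); it reports the group count through an output argument. All names here are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the kernel: writes one partial maximum per group and
   reports how many groups actually ran.
   Assumption: global_size is a multiple of chosen_local. */
static void fake_kernel(const int *values, size_t global_size,
                        size_t chosen_local, int *partial,
                        size_t *num_groups_out)
{
    size_t num_groups = global_size / chosen_local;
    for (size_t g = 0; g < num_groups; ++g) {
        int m = INT_MIN;
        for (size_t l = 0; l < chosen_local; ++l) {
            int v = values[g * chosen_local + l];
            if (v > m)
                m = v;
        }
        partial[g] = m;
    }
    *num_groups_out = num_groups; /* what "group 0" would write back */
}

/* Host side: allocate for the worst case (one work-item per group),
   then reduce only the slots the kernel reported as used. */
static int max_with_unknown_groups(const int *values, size_t global_size,
                                   size_t chosen_local)
{
    assert(global_size > 0 && chosen_local > 0);
    int *partial = malloc(global_size * sizeof *partial);
    if (partial == NULL)
        return INT_MIN; /* allocation failed; caller must treat as error */
    size_t num_groups = 0;
    fake_kernel(values, global_size, chosen_local, partial, &num_groups);
    assert(num_groups >= 1 && num_groups <= global_size);
    int best = partial[0];
    for (size_t g = 1; g < num_groups; ++g) {
        if (partial[g] > best)
            best = partial[g];
    }
    free(partial);
    return best;
}
```

The cost is visible in the malloc call: the buffer is global_size elements even when only a handful of groups run, which is why fixing the local size up front is the simpler route.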

In summary, always specify the local sizes so you’ll know the number of work-groups.

Unless this is the final result of your algorithm and you need it on the CPU side, you’re better off doing both stages on the GPU.

For the second stage, just use a single work-group (which runs on a single compute unit), i.e. make global size == local size.