I need to perform a reduce operation (specifically, a maximum) on a set of scalar values that are ancillary results of the work items. Each work item produces one scalar, and I want the maximum across all work items.

I was thinking of breaking the reduce into two parts.
First, the work items in each workgroup use shared local memory to produce the reduced value for that workgroup.
Second, the single value per workgroup is transferred back to the host, and the host performs the final reduce across all workgroups.
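To make the two stages concrete, here is a minimal host-side C emulation of the scheme, not actual OpenCL code. It assumes a 1-D range, float scalars, and a group size that is already known to the caller and divides the item count evenly. The names reduce_per_group and reduce_final are made up for illustration.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Stage 1 (emulated on the host): reduce each workgroup's slice of `values`
 * to one maximum, as the kernel would do in local memory.
 * Assumption: local_size is known and divides n evenly.
 * `group_max` must have room for n / local_size entries.
 * Returns the number of per-group results written. */
static size_t reduce_per_group(const float *values, size_t n,
                               size_t local_size, float *group_max)
{
    assert(local_size > 0 && n % local_size == 0);
    size_t num_groups = n / local_size;
    for (size_t g = 0; g < num_groups; ++g) {
        float m = -FLT_MAX;
        for (size_t l = 0; l < local_size; ++l) {
            float v = values[g * local_size + l];
            if (v > m)
                m = v;
        }
        group_max[g] = m; /* one result per workgroup */
    }
    return num_groups;
}

/* Stage 2: the host reduces the per-group results to the final maximum. */
static float reduce_final(const float *group_max, size_t num_groups)
{
    float m = -FLT_MAX;
    for (size_t g = 0; g < num_groups; ++g) {
        if (group_max[g] > m)
            m = group_max[g];
    }
    return m;
}
```

Stage 2 needs num_groups as an input, and that is exactly the number I don't know how to obtain on the host.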

But how can the host know how many workgroup results there are to reduce, and how they are arranged in memory?

Inside the kernel, we have get_num_groups(d) and get_group_id(d) which could be used to index into the global array for output.
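As far as I can tell, the arrangement is whatever the kernel chooses when it combines those two values into an index. One possible convention, with dimension 0 varying fastest, is sketched below in plain C. The function name flat_group_index and the ordering are my own assumptions for illustration; group_id[d] and num_groups[d] stand in for get_group_id(d) and get_num_groups(d).

```c
#include <assert.h>
#include <stddef.h>

/* Flatten a 1-, 2- or 3-D group id into a single output index.
 * Assumed convention: dimension 0 varies fastest, i.e.
 *   index = id0 + n0 * (id1 + n1 * id2)                         */
static size_t flat_group_index(const size_t *group_id,
                               const size_t *num_groups,
                               unsigned dims)
{
    assert(dims >= 1 && dims <= 3);
    size_t index = 0;
    /* Walk from the slowest-varying dimension down to dimension 0. */
    for (unsigned d = dims; d-- > 0; ) {
        assert(group_id[d] < num_groups[d]);
        index = index * num_groups[d] + group_id[d];
    }
    return index;
}
```

With this convention the per-group results occupy a contiguous block whose length is the product of the group counts over all dimensions.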

I don't see any host-side API call that provides the equivalent of get_num_groups(d), whether before, during, or after the kernel's ND-range execution. Am I missing something?

The closest thing I've seen is the CL_KERNEL_WORK_GROUP_SIZE property, but it reports the maximum workgroup size and thus doesn't help.

I've got one idea so far: give the first work item the special responsibility of writing the values of get_num_groups(d) to the first DIM elements of the global output array. With this approach, the host would have to allocate the array at its maximum possible size (assuming a minimum workgroup size of 1), and transferring the values to host memory without also copying the unused locations would be trickier (multi-step).
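Here is a rough host-side sketch of that layout in plain C, with no OpenCL calls. It assumes int scalars for simplicity, a header of DIM group counts followed by the per-group results stored contiguously, and hypothetical helper names (worst_case_elements, max_from_packed).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Worst-case element count for the output array: if every workgroup had
 * size 1 there would be one group per work item, plus `dims` header slots
 * for the group counts. */
static size_t worst_case_elements(const size_t *global_size, unsigned dims)
{
    assert(dims >= 1 && dims <= 3);
    size_t items = 1;
    for (unsigned d = 0; d < dims; ++d)
        items *= global_size[d];
    return dims + items;
}

/* Decode a buffer laid out as
 *   [num_groups(0), ..., num_groups(dims-1), result_0, result_1, ...]
 * and return the maximum of the per-group results. */
static int max_from_packed(const int *buf, unsigned dims)
{
    assert(dims >= 1 && dims <= 3);
    size_t total = 1;
    for (unsigned d = 0; d < dims; ++d) {
        assert(buf[d] > 0);
        total *= (size_t)buf[d]; /* number of results actually written */
    }
    int m = INT_MIN;
    for (size_t i = 0; i < total; ++i) {
        if (buf[dims + i] > m)
            m = buf[dims + i];
    }
    return m;
}
```

In a real program the multi-step part would be reading the header first and then reading back only the results that were actually written. Those transfers are left out of the sketch.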

Oh, and I am aware of the reqd_work_group_size attribute. I just don't want to use it.