get_group_offset

I have a use case for a get_group_offset function. Suppose you want to implement a reduction algorithm across multiple devices. One option is to rely on implicit buffer transfers by giving each device its own sub-buffer (i.e., the zero-offset method). The more classic option is to manage buffer transfers explicitly and share one common buffer among all the devices (i.e., the non-zero-offset method). This works rather well for the input buffers because there is a global_offset, but the output buffer of a two-step reduction needs a group_offset so that each work-group writes its partial result to the right slot. Currently there are a couple of relatively easy workarounds: pass a compiler option -D GROUP_OFFSET (which may require rebuilding the kernel), or calculate group_offset = (global_offset + global_size) / local_size - num_groups. I propose including a get_group_offset function to simplify this, for both usability and completeness.
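To make the second workaround concrete, here is a minimal host-side C sketch of the arithmetic for one dimension. The helper names group_offset and partial_index are hypothetical, made up for illustration. Inside a kernel, the inputs would presumably come from get_global_offset(0), get_global_size(0), get_local_size(0), and get_num_groups(0). It assumes global_size is a multiple of local_size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of OpenCL): index of this launch's first
   work-group within the combined multi-device range, computed with the
   workaround formula from the post. Assumes global_size is a multiple of
   local_size, so num_groups == global_size / local_size. */
size_t group_offset(size_t global_offset, size_t global_size,
                    size_t local_size, size_t num_groups)
{
    assert(local_size > 0);
    return (global_offset + global_size) / local_size - num_groups;
}

/* Hypothetical helper: slot in the shared partial-results buffer that a
   work-group with the given local group id would write to. */
size_t partial_index(size_t group_offset_value, size_t group_id)
{
    return group_offset_value + group_id;
}
```

For example, with two devices that each process 1024 work-items in groups of 256, the second device is launched with a global offset of 1024, so its group offset is (1024 + 1024) / 256 - 4 = 4, and its work-groups write partial results to slots 4 through 7.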

BTW, I strongly think it was a mistake to use “num_groups” instead of “group_size”, as it doesn’t follow the naming convention expected from global_{id, size}, local_{id, size}, and group_id. People hate it when languages establish patterns and then buck them.