global work offset in OpenCL 1.1

Hi,

With OpenCL 1.1 it is possible to define an offset for your NDRange when launching a kernel. However, according to the spec (see 3.2), this offset only affects the global ID, not the work-group ID. In other words, your work-group IDs will always start at 0, no matter what the offset is.

It was always my intuition that the following is true: get_global_id() / get_local_size() = get_group_id()
Therefore, when I first saw the offset parameter I thought that it would also affect the work-group IDs accordingly (assuming that the offset is a multiple of the work-group size). But this is not the case.
As explained by Micah Villmow from AMD here, this may be because it is easier/faster to implement in hardware.

But I’m wondering how the offset can be used given that the workgroup IDs will ignore the offset. Imagine you want to partition a task between two GPUs, e.g. half of the work on GPU1 and the rest on GPU2. Now I would assume that setting the offset for GPU2 to half the problem size would do the trick. However, if a work-item uses get_group_id() to identify the part of work it’s been assigned this wouldn’t work.

Is there any other (non-hardware-related) reason for this behaviour?

Refer to section 3.2 (specifically the following equation on page 24) of the 1.1 spec:

Given a global ID and the work-group size, the work-group ID for a work-item is computed as:
(wx, wy) = ( (gx - sx - Fx) / Sx, (gy - sy - Fy) / Sy )

where (gx, gy) is the global ID, (sx, sy) the local ID, (Fx, Fy) the global work offset, and (Sx, Sy) the work-group size.

The number of work-groups is not affected by the offset, and therefore the work-group ID is not impacted either.

Consider, for example, enqueuing a kernel to operate on a 2D image. The global work-size is w (width of the image) x h (height of the image). You could tile the image across multiple devices. For example, to tile the image across 2 devices, you would call clEnqueueNDRangeKernel on device 0 with global_work_offset (0, 0) and global work-size (w/2, h), and clEnqueueNDRangeKernel on device 1 with global_work_offset (w/2, 0) and global work-size (w/2, h).

The number of groups executing the kernel on device 0 (or 1) is not impacted by the global_work_offset. You do need the global_work_offset to compute the correct global_id, which in this case is the pixel location in the image that the work-item is operating on.

With regards to your comment "However, if a work-item uses get_group_id() to identify the part of work it’s been assigned this wouldn’t work": why would this be the case? Can you elaborate?

The number of work-items is also not affected by the offset, but the work-item IDs are…

With regards to your comment "However, if a work-item uses get_group_id() to identify the part of work it’s been assigned this wouldn’t work": why would this be the case? Can you elaborate?

Say you want to compute a matrix-vector multiplication where one work-group operates on each row. Then you would use get_group_id() to identify the row and get_local_id() to identify the elements in a row for a specific work-item.
If you now partition the work between two GPUs by setting the offset for the 2nd kernel launch to height/2 ( * local_work_size), this offset will be ignored by get_group_id() unless you explicitly call get_global_offset(). In my opinion this shouldn’t be necessary, just as you don’t have to explicitly query the offset when calculating the global ID.