Work-groups scheduled to/on compute units

If I recall correctly, a work-group runs on one and only one compute unit, and a compute unit may or may not execute multiple work-groups concurrently.

However, I can’t seem to find details about how work-groups are scheduled to/on compute units. What, if anything, can be said about the order of execution between work-groups (and similarly work-items)? Is work-group k guaranteed to be submitted to a compute unit before work-group k+1? I think it’s obvious that nothing can be said about the order in which they finish, but it should be possible to say something about the order in which they start.

What can be said about how work-groups are assigned to compute units? Is it a static or dynamic schedule? Is there any concept of a chunk size of work-groups (e.g., assigning two work-groups to a compute unit at a time)? If there are m compute units and k work-groups with k less than or equal to m, do increasing work-groups get scheduled to non-decreasing (or strictly increasing) compute units, both starting from zero? I mention non-decreasing compute units because I don’t know if a compute unit could get multiple work-groups in such an example.

I think it would be nice if the specification mentioned more about these concepts, or explicitly said they are implementation-dependent. OpenMP has a section on scheduling and MPI mentions orders of communications.

One other question: what happens in the case of a 2D or 3D NDRange? Do things get scheduled just as if the work-group ND index array was flattened into a 1D index array? So does (1, 0) start before (0, 1), or vice versa?

I can’t seem to find details about how work-groups are scheduled to/on compute units.

That is going to be implementation-dependent.

Is work-group k guaranteed to be submitted to a compute unit before work-group k+1?

No, it’s not. There are no guarantees. The absence of guarantees actually gives implementations more freedom, which translates into better performance, so it’s a good thing for you.
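If an algorithm really needs to know the order in which work-groups got going, one workaround is to stop relying on the work-group ID and have each group draw a ticket from a shared atomic counter instead. Below is a minimal host-side sketch of that idea in plain C11; `claim_start_ticket` is a made-up name for illustration, not part of any API.

```c
#include <stdatomic.h>

/* Hypothetical helper: draw the next start ticket from a shared counter.
 * Returns the counter's value before the increment, so the first caller
 * gets 0, the next caller gets 1, and so on. */
unsigned claim_start_ticket(atomic_uint *counter)
{
    return atomic_fetch_add(counter, 1u);
}
```

In a kernel, the rough equivalent would be one work-item per work-group calling `atomic_inc` on a `__global` counter that the host has zeroed, then sharing the result with the rest of the group through `__local` memory and a barrier. Note that the ticket only records the order in which the increments happened; it says nothing about which groups are still running or when they finish.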

What can be said about how work-groups are assigned to compute units?

It’s also going to be implementation-dependent.

Do things get scheduled just as if the work-group ND index array was flattened into a 1D index array?

You know my answer by now :) Implementation-dependent.
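Since the launch order is unspecified, a kernel that needs a linear work-group index has to compute it from the group coordinates rather than infer it from scheduling. Here is a minimal sketch in plain C, assuming a flattening in which dimension 0 varies fastest; `flat_group_index` is a hypothetical helper, and in a kernel its arguments would come from `get_group_id` and `get_num_groups`.

```c
#include <stddef.h>

/* Hypothetical helper: flatten a 3D work-group coordinate into one index,
 * with dimension 0 varying fastest. n0 and n1 are the work-group counts in
 * dimensions 0 and 1; the count in dimension 2 is not needed. */
size_t flat_group_index(size_t g0, size_t g1, size_t g2,
                        size_t n0, size_t n1)
{
    return (g2 * n1 + g1) * n0 + g0;
}
```

With this convention, group (1, 0) gets a smaller index than group (0, 1). That is purely a numbering choice made by the application and implies nothing about which of the two starts first.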

I think it would be nice if the specification mentioned more about these concepts, or explicitly said they are implementation-dependent.

I agree. You could try filing a bug report at http://www.khronos.org/bugzilla. Believe it or not, the group actively listens to those bug reports, even if it doesn’t always follow up with the original reporter after action is taken.

Thank you David for your reply.

If everything pertaining to scheduling work-groups to compute units is implementation-dependent, then it seems impossible for an ISV to load balance between compute units. It also somewhat limits our control over memory access patterns.

Must we rely entirely on the hardware vendor’s implementation to perform all load balancing? Is there anything we can do?

If you mean load balancing within a given device, I would not worry about it – either the hardware or the driver will do the job for you. If you mean load balancing across different devices, it’s unfortunately a HardProblem™.

Actually I was thinking about load balancing within a given device. Since the hardware and driver can’t know a priori about all the algorithms that will be submitted to them, wouldn’t it be best for the user to manage the load balancing whenever possible? In reality it seems like a trade-off: the hardware/driver knows more about the device’s current loads, but the user knows more about the device’s future loads. And if the hardware/driver manages all the load balancing, that also seems to favor fine-grained over coarse-grained parallelization of the algorithm.

Does that mean it’s better to submit requests as bug reports or in this forum’s “Suggestions for next release” section? It looks like there is very little discussion or feedback there and everything is marked as NEW, even things more than a year old.

Actually I was thinking about load balancing within a given device.

It boils down to this: if it’s a GPU, the hardware will do load balancing for you. If it’s a CPU, the runtime will do load balancing for you. It’s in the implementation’s best interest to do load balancing as well as possible.

Within a device I don’t think there’s much the application could do to improve the status quo. It’s a different story when multiple devices are involved.

Does that mean it’s better to submit requests as bug reports or in this forum’s “Suggestions for next release” section?

That’s not my intention. Notice that in this case you are looking for clarification on the spec. Currently the spec is very vague about load balancing and you wanted to know at least whether scheduling is implementation-defined.