Assignment of work_items to work_groups

AMD’s MatrixMultiplication example uses shared (i.e., __local) memory to reduce the number of fetch operations. In doing so, the code implicitly assumes that a workgroup of blocksize x blocksize work_items covers a contiguous block of the input matrices. More precisely, it assumes that within a workgroup, if the work_item with get_local_id() = (0,0) has get_global_id() = (m, n), then the work_item with get_local_id() = (a, b) must have get_global_id() = (m+a, n+b). The various requirements about the divisibility of global dims by work_group dims hint at this layout, but as far as I can tell the spec doesn’t come out and require that work_groups be assigned as contiguous blocks. Am I missing something, or is this missing from the spec? Knowing how work_items are assigned to workgroups would enable optimizations like the one in the MatrixMultiplication example (sketched below).
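For concreteness, here is the kind of tile load the example depends on. This is my own sketch, not AMD’s actual code; BLOCKSIZE and the single-tile load are simplifications:

    #define BLOCKSIZE 16

    __kernel void load_tile(__global const float *A, const uint widthA)
    {
        __local float tileA[BLOCKSIZE][BLOCKSIZE];

        size_t gx = get_global_id(0), gy = get_global_id(1);
        size_t lx = get_local_id(0),  ly = get_local_id(1);

        /* One global fetch per work_item; the whole tile is then read
         * from __local memory, cutting global fetches by a factor of
         * BLOCKSIZE. This indexing is only correct if global IDs are
         * contiguous within the work_group. */
        tileA[ly][lx] = A[gy * widthA + gx];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* ... each work_item now multiplies using tileA ... */
    }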

Going one step further, knowing how workgroups are assigned to SIMD engines is also important. While I don’t have an example in mind that needs the absolute SIMD engine ID, knowing that two workgroups will execute on the same SIMD engine (at different times, obviously) would be very useful for some code I’m working on.

> More precisely, it assumes that within a workgroup, if the work_item with get_local_id() = (0,0) has get_global_id() = (m, n), then the work_item with get_local_id() = (a, b) must have get_global_id() = (m+a, n+b). The various requirements about the divisibility of global dims by work_group dims hint at this layout, but as far as I can tell the spec doesn’t come out and require that work_groups be assigned as contiguous blocks.

Yes, this is explained in section 3.2 of the spec.
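For reference, that section defines the mapping explicitly: given a work-group ID (w_x, w_y), work-group size (S_x, S_y), and local ID (s_x, s_y), the global ID is

    (g_x, g_y) = (w_x * S_x + s_x, w_y * S_y + s_y)

which is exactly the contiguous-block layout the MatrixMultiplication example relies on.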

> Going one step further, knowing how workgroups are assigned to SIMD engines is also important. While I don’t have an example in mind that needs the absolute SIMD engine ID, knowing that two workgroups will execute on the same SIMD engine (at different times, obviously) would be very useful for some code I’m working on.

Standard OpenCL certainly doesn’t provide such information. Chances are that on most implementations the assignment of work-groups to compute units is dynamic anyway (i.e., work-groups are executed on whichever compute unit becomes available).

Thanks for pointing out section 3.2. I had overlooked it while focusing on the section about enqueueing kernels.

As to the SIMD engine issue, yes, dynamic scheduling is likely. However, a function that identifies which SIMD engine a workgroup is running on at run time seems like it should be possible.

That is an interesting idea. Could you elaborate on how it would improve performance in your use case?

I’m writing a simulation involving a large number of “sites” (> 10^6) arranged as an array. At each update, every site must generate a random number. Even modestly good pseudo-random number generators (RNGs) require about 25 words of state, and good practice seeks to minimize the number of independent RNGs. Using a single RNG on the GPU would be extremely inefficient, since you face considerable contention for the same memory locations (and that assumes you have enough synchronization to do this at all, which I’m not sure you do). If you knew which SIMD engine you were on, you could get away with work_group_size * number_of_SIMD_engines RNGs, assuming each work_group is processed to completion before another work_group starts; see the sketch below. Using atomic operations it may be possible to reduce this even further, but I haven’t given that much thought. Lacking the SIMD ID, about all I can come up with is a separate RNG for each site. That is bad practice (as noted above), and it incurs a huge space overhead, since a site itself is represented by a single word.
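Roughly what I have in mind, as a sketch only: get_simd_id() below is a made-up built-in standing in for whatever would report the SIMD engine at run time (no such function exists in standard OpenCL), and RNG_WORDS matches the ~25 words of state mentioned above:

    #define RNG_WORDS 25  /* ~25 words of state per generator */

    /* get_simd_id() is hypothetical; it stands for "which SIMD engine /
     * compute unit is executing this work_group". */
    __kernel void update_sites(__global uint *sites,
                               __global uint *rng_state)
    {
        size_t gid  = get_global_id(0);
        size_t lid  = get_local_id(0);
        size_t simd = get_simd_id();               /* hypothetical */

        /* One generator per (SIMD engine, local id) pair, so only
         * work_group_size * number_of_SIMD_engines generators in total,
         * reused serially by the work_groups scheduled on each engine.
         * This is only safe if each work_group runs to completion
         * before the next one starts on the same engine. */
        __global uint *rng = rng_state
                           + (simd * get_local_size(0) + lid) * RNG_WORDS;

        /* ... advance *rng and use the draw to update sites[gid] ... */
    }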