A few technical questions about work-items and wavefronts

Hello!

I’ve been reading this document http://developer.amd.com/gpu_assets/ATI … _Guide.pdf to get to know more about how OpenCL actually works on a GPU.

There are still a few things I don’t understand or am not sure about.

First, is it necessary that one work-group runs on a single compute unit? As I have understood it, a compute unit can only execute one task at a time.

Regarding work-group sizes, can they be much larger than the number of cores on a compute unit?

Now, about wavefronts: if I take a card whose compute units have 16 cores each, what is the maximum size of a wavefront?

Thank you very much if you can help!

Arthur

Yes, it is absolutely necessary. For local memory (not to mention caching, barriers and the other synchronisation mechanisms) to work efficiently, all the work-items of a group have to execute on the same physical compute unit.

It’s only an abstraction, too: a compute unit doesn’t need to run just one task. Newer GPUs support running different workloads concurrently, and even older ones can run multiple instances of the same code concurrently.

Regarding work-group sizes, can they be much larger than the number of cores on a compute unit?

The global work size is effectively unbounded. If it’s bigger than what the device can execute at once, the work is simply run in multiple passes until it is all done.
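As a rough sketch (queue, kernel and N here are placeholders, not from the thread), the host just gives the total number of work-items and can leave the work-group size to the runtime:

    /* Sketch: enqueue N work-items in one dimension and let the runtime
     * pick the work-group size (local_work_size = NULL).  N can be far
     * larger than the number of cores on the device. */
    size_t global_work_size = N;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        1,     /* work_dim */
                                        NULL,  /* no offset */
                                        &global_work_size,
                                        NULL,  /* runtime picks the local size */
                                        0, NULL, NULL);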

Now, about wavefronts: if I take a card whose compute units have 16 cores each, what is the maximum size of a wavefront?

A wavefront has nothing to do with the number of cores. It’s an AMD GPU-specific term that refers to internal instruction dispatch and scheduling within an individual CU, and is always 64 work-items on existing hardware.

You probably mean the maximum work-group size. On AMD this is 256 work-items. Note that this is not the maximum number of work-items running concurrently on a given CU, just the maximum size of a work-group, which is a programmer’s view of the task execution. Running multiple work-groups concurrently on the same CU is how memory latency is hidden; it’s basically equivalent to ‘hyper-threading’.
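If you’d rather query than hard-code that 256 limit, both the device-wide and the per-kernel maximum can be read at runtime; a minimal sketch, assuming a valid device and kernel object:

    /* Device-wide upper bound on the work-group size. */
    size_t dev_max_wg;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(dev_max_wg), &dev_max_wg, NULL);

    /* Per-kernel limit; can be lower if the kernel needs many registers
     * or a lot of local memory. */
    size_t kern_max_wg;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kern_max_wg), &kern_max_wg, NULL);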

The maximum global work size is effectively unbounded, and the work will be spread across all cores iteratively until complete. The driver will try to fit as many work-groups as possible on a given CU, and across all CUs on the device, but how many fit depends on the resources they require and on the hardware capabilities.

The OpenCL spec, together with the introductions of the AMD and NVIDIA programming guides, makes it pretty clear how this all works. These machines are NOT just ‘very wide SMP’ CPUs, which seems to be the assumption behind your questions.

Thank you very much, notzed, for your answer; I think I understand better how it works now! And sorry for my late reply, I was having problems with my Internet connection!

Just to be sure I understood it correctly: if I work with data of total size N, and I know I have x compute units, the size of my work-groups should be smaller than N/x. For instance, if I define a work-group size of N, all the data will be computed by only one CU, right? Which is not the best way…

So should I always define a work-group size of N/x for better performance?

Ideally, if you have x compute units (i.e. streaming multiprocessors), then yes, N/x would be the target, but it isn’t that simple. For best performance from the architecture, the work-items should form warps or wavefronts, as they are called in the NVIDIA and AMD programming guides respectively.

For example, if you have 32 work-items in a work-group on an NVIDIA Fermi (and say there are 16 work-groups), then this maps exactly onto the architecture. However, the programming guide recommends using multiples of 32 (64, etc.) so that the hardware can hide latency better (by doing other things while a work-group waits on a read from global memory). Basically, you should experiment and (micro)benchmark. It does make sense, you just have to read around a bit :slight_smile: The NVIDIA OpenCL Programming Guide is an excellent starting point.
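Rather than hard-coding 32 or 64, you can also ask the runtime for the preferred multiple for a particular kernel (OpenCL 1.1 and later); in practice this reports the warp size on NVIDIA and the wavefront size on AMD. A small sketch, assuming kernel and device already exist:

    /* Preferred work-group size multiple for this kernel on this device
     * (the warp/wavefront size in practice). */
    size_t preferred;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred), &preferred, NULL);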

Normally you just set 1 work-item == 1 data element to calculate, and that is your only constraint. If it’s a small problem it won’t take long, and if it’s a big one it will be parallelised as much as the hardware allows. If you do this, I would suggest you always make sure the global work size is a multiple of 64 (round up), so it fits well on all GPUs.
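One common way to do the rounding, sketched here with a made-up kernel: pad the global size up to the next multiple of 64 on the host, and let the padded work-items exit early in the kernel.

    /* Host side: round the global work size up to a multiple of 64. */
    size_t global_work_size = ((N + 63) / 64) * 64;

    /* Kernel side: guard against the padded work-items. */
    __kernel void scale(__global float *data, const uint n)
    {
        size_t gid = get_global_id(0);
        if (gid < n)            /* padded work-items simply do nothing */
            data[gid] *= 2.0f;
    }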

For more advanced algorithms, for instance those that require local memory, you may have a specific local size requirement; use a multiple of 64 for the local work size here too.
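As an illustration (a hypothetical kernel, not from the guides), a block-wise partial-sum reduction has exactly this kind of requirement: each work-group shares a local buffer and synchronises with barriers, so the local size is fixed when the kernel is enqueued, and 64 is a natural choice.

    /* Hypothetical partial-sum reduction: each work-group reduces 64
     * elements into one result through a shared local buffer. */
    __kernel void partial_sum(__global const float *in,
                              __global float *out,
                              __local float *scratch)
    {
        size_t lid = get_local_id(0);

        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Tree reduction within the work-group. */
        for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)
            out[get_group_id(0)] = scratch[0];
    }

On the host side the local buffer would be supplied with clSetKernelArg(kernel, 2, 64 * sizeof(float), NULL) and the local work size passed to clEnqueueNDRangeKernel as 64.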

For very advanced algorithms you might fit the problem to the CU count of the specific card. In this case you are still only worrying about having ‘N’ wavefronts per CU, where N should be >= 4 or so; it depends on the algorithm and its memory access/ALU interleave.
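The CU count itself is easy to query if you do want to size a launch that way; a sketch, with the ‘4 wavefronts of 64 per CU’ figure taken from the rule of thumb above:

    /* Number of compute units on the device. */
    cl_uint num_cus;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(num_cus), &num_cus, NULL);

    /* Aim for roughly 4 wavefronts of 64 work-items per CU. */
    size_t global_work_size = (size_t)num_cus * 4 * 64;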

Note that in either case the work size is independent of the hardware CU count; it’s more important to worry about the code running well on a single CU than about how the other CUs are used. It should then scale up/down well depending on the problem size, and if CUs are idle they can be utilised for other purposes by the system.