Global workgroup size and performance

So I’m playing around with a many-body system to test out what kind of processing power I can get out of my FX 3800.

I can get a decent framerate with 10,000 particles or fewer when the global work size matches the particle count (or, to be exact, is rounded up slightly to 10240, since the local size is 512).

Now, in the kernel I have to loop over all the other particles and compute the attraction/repulsion and so on, and this loop is what uses most of the computing power.
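
For reference, the brute-force loop looks roughly like the sketch below (simplified and untested here; buffer names like pos/acc and the softening term eps are just placeholders, not my actual code):

__kernel void nbody_forces(__global const float4* pos,   // xyz = position, w = mass
                           __global float4* acc,         // output: acceleration per particle
                           const int n,
                           const float eps)              // softening to avoid division by zero
{
    int i = get_global_id(0);
    if (i >= n) return;                                  // padded work-items do nothing

    float4 pi = pos[i];
    float3 a = (float3)(0.0f);

    for (int j = 0; j < n; ++j) {                        // this O(N) loop dominates the cost
        float3 d = pos[j].xyz - pi.xyz;
        float r2 = dot(d, d) + eps * eps;
        float invr = rsqrt(r2);
        a += d * (pos[j].w * invr * invr * invr);        // G folded into the masses
    }

    acc[i] = (float4)(a, 0.0f);
}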

Since the GPU can handle a much larger global work size than 10k, I tried using a multiple of the particle count as the global work size. For example:

global_ws = local_ws*static_cast<size_t>(ceil(static_cast<float>(N*M)/static_cast<float>(local_ws)));

where N is the particle count and M is the multiple. In the kernel I would then divide the particles into M groups and use

get_global_id(0)%N

to determine which particle I’m working on, and

(get_global_id(0) - get_global_id(0)%N)/N

to determine which subset of the particles to compute the interaction from.
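
In rough, simplified form the indexing looks like this (buffer names are placeholders, and the M partial sums per particle still have to be combined in a separate reduction step afterwards):

__kernel void nbody_partial(__global const float4* pos,
                            __global float4* acc_partial,   // N*M partial sums
                            const int n,                     // particle count N
                            const int m,                     // number of subsets M
                            const float eps)
{
    int gid    = get_global_id(0);
    int i      = gid % n;                    // which particle this work-item updates
    int subset = (gid - gid % n) / n;        // which chunk of particles it sums over
    if (subset >= m) return;                 // padding from rounding global_ws up

    int chunk = (n + m - 1) / m;             // particles per subset, rounded up
    int j0 = subset * chunk;
    int j1 = min(j0 + chunk, n);

    float4 pi = pos[i];
    float3 a = (float3)(0.0f);
    for (int j = j0; j < j1; ++j) {
        float3 d = pos[j].xyz - pi.xyz;
        float r2 = dot(d, d) + eps * eps;
        float invr = rsqrt(r2);
        a += d * (pos[j].w * invr * invr * invr);
    }

    acc_partial[subset * n + i] = (float4)(a, 0.0f);
}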

However, the performance gain is much smaller than I expected. At M = 2 it is slightly faster, but at M = 10 or higher it is actually slower.

Does anyone have a clue as to why this happens? Shouldn’t more work-groups lead to more parallel computation and thus make it faster?

You’ve already got the hardware pretty busy with that much work, so there probably isn’t much extra to be gained. Adding many more threads just increases the overhead, and might even cause the job to be batched across multiple runs (== much slower). Extra threads can really only hide memory latency, and if memory bandwidth is already saturated, or you are ALU bound, they can’t help.

If you are able to set the sizes arbitrarily, then try to match the hardware details: make the local work-group size some multiple of the ‘kernel preferred work-group size multiple’, and (total global work size / local work size) some integer multiple of ‘max compute units’. But once you’re past a certain problem size this won’t make much difference either (at least from a few tests).
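
Both of those numbers can be queried at runtime, something like this (error checking omitted, assuming the kernel and device handles are already set up; the preferred-multiple query needs OpenCL 1.1):

size_t preferred_multiple = 0;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferred_multiple), &preferred_multiple, NULL);

cl_uint compute_units = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);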

Depending on the problem and how the solution is implemented, reducing the LWS to 64 or 128 might have more of an impact than increasing the global work size and having each work-group do less work.
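Host-side that’s just a matter of passing the smaller local size and rounding the global size up to a multiple of it, along these lines (queue and kernel assumed to exist, and the kernel has to ignore the padded work-items):

const size_t local_ws  = 128;  // try 64 or 128 instead of 512
const size_t global_ws = ((N + local_ws - 1) / local_ws) * local_ws;  // round up

clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_ws, &local_ws, 0, NULL, NULL);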

OT: Do you really have to do all those acrobatics for simple arithmetic in C++? Damn. I missed a bullet there …

There are several ways to write that expression in C++. For example, you can use constructors instead of static casts:

global_ws = local_ws*size_t(ceil(float(N*M)/local_ws));

Thanks, you are right. Reducing the local work size to 128 more than halved the time spent on the computations.

Could also be done like this (at the risk of having one superfluous multiple of local_ws):

local_ws*((N*M)/local_ws + 1)

However, shorter code isn’t always better or clearer, I’d say.