Compute shaders extremely slow

I need to do some heavy calculations on the GPU, where I need to iterate through a huge amount of data (particles), so I was going to use a compute shader.
However, I noticed that the execution of the compute shader takes extremely long, far more than it should in my eyes.
I wrote a very simple test-shader to narrow the problem down:


#version 440

layout(local_size_x = 1,local_size_y = 1,local_size_z = 1) in;

void main(void)
{
	for(int i=0;i<1000000;++i)
	{

	}
}

The above code just runs that for-loop a million times and does absolutely nothing else. On my CPU, that would take about 0.4ms.
If I time the execution of the compute shader using timestamp queries, it ends up at approximately 32ms, which is consistent with my framerate.
Decreasing the iteration count, or removing the loop entirely, improves the performance, so I can say for a fact that the compute shader is the cause.
I’ve tried this in both my own program and one of Sascha Willems’ examples, with the same result. (Comparison: with the for-loop and without the for-loop.)

How is that possible?

Seems like the driver fails to optimize.
If the posted code is complete, then the whole loop should have been elided (i.e. the loop variable i can be scratched from the dependency graph, because no output relies on it).

Some would disagree, but IMO this kind of optimization should be done even earlier, at the glslangValidator level. Indeed, when I try this with SDK 1.0.46, it does generate a SPIR-V loop.

But either way the driver should expect any kind of SPIR-V from any kind of source and optimize this away too.
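
For a fair timing test it also helps to give the loop an observable result, so it cannot be removed as dead code in the first place. A minimal sketch (the SSBO binding and the result member are my assumptions, not part of the original shader):

#version 440

layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

// Hypothetical output buffer, only there so the loop has a visible side effect.
layout(std430, binding = 0) buffer Result
{
	uint result;
} u_result;

void main(void)
{
	uint v = 0;
	for(uint i = 0; i < 1000000; ++i)
	{
		v += i;
	}
	u_result.result = v; // the write keeps the loop in the dependency graph
}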

As for GPU vs CPU (assuming the CPU code won't throw the loop away, for a fair comparison), you must also consider that:

  1. GPUs generally run at lower clock speeds.
  2. The GPU has to perform the whole shader for each work-item (that is N times, where you control N), while the CPU only has to compute it once. The difference shows when N is higher than the number of threads the GPU can run concurrently (see the sketch below).
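
To make point 2 concrete: the total number of invocations (work-items) is the workgroup count passed to the dispatch multiplied by the local_size declared in the shader. A rough sketch (commandBuffer and the counts are just illustrative):

// With local_size_x/y/z = 1 in the shader, this records exactly one invocation:
vkCmdDispatch(commandBuffer, 1, 1, 1);    // 1*1*1 workgroups * 1 thread each = 1 work-item

// With local_size_x = 64 in the shader, this records 64000 invocations,
// each of which runs the whole shader body:
vkCmdDispatch(commandBuffer, 1000, 1, 1); // 1000 workgroups * 64 threads = 64000 work-items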

[QUOTE=krOoze;42222]
But either way the driver should expect any kind of SPIR-V from any kind of source and optimize this away too.[/QUOTE]
Well, even if it doesn’t optimize anything, the execution time really shouldn’t be that bad. My GPU isn’t that old (AMD R9 200 series); a loop like that should never take 32ms (it should certainly be faster than the CPU?).

//EDIT:

[QUOTE=krOoze;42222]
As for GPU vs CPU (assuming the CPU code won't throw the loop away, for a fair comparison)[/QUOTE]
The CPU-code I used looked like this, to make sure nothing’s optimized away:


	uint32_t v = 0;
	auto t = std::chrono::high_resolution_clock::now();
	for(uint32_t i = 0; i < 1'000'000; ++i)
	{
		v += i;
	}
	auto tDelta = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - t).count() / 1'000'000.0;
	std::cout << v << std::endl; // Output v, to make sure the compiler doesn't optimize the loop away
	std::cout << "Delta time in ms: " << tDelta << std::endl;

[QUOTE=krOoze;42222]
2) The GPU has to perform the whole shader for each work-item (that is N times, where you control N), while the CPU only has to compute it once. The difference shows when N is higher than the number of threads the GPU can run concurrently.[/QUOTE]
I’m not sure what you mean by ‘work-item’. The shader is invoked exactly once each frame, so it should only be computed once?

https://imgur.com/a/X2hXM
Each iteration of your loop takes 24 cycles, and a million of those is 24 × 10^6 cycles. Your GPU runs at roughly 1 GHz, i.e. 1000 × 10^6 cycles per second, so the loop alone takes 24 × 10^6 / (1000 × 10^6) s = 24 ms, which is perfectly in line with your result considering launch latency and measurement error (and your GPU probably runs closer to ~800 MHz, which brings the estimate to about 30 ms). GPUs are meant for massively parallel computations, not very long serial loops.


layout(local_size_x = 1,local_size_y = 1,local_size_z = 1) in;

This line means that only one out of 64 threads (or 32 on Nvidia) has any work at all; all the other threads do nothing.


layout(local_size_x = 64,local_size_y = 1,local_size_z = 1) in;

or

layout(local_size_x = 8,local_size_y = 8,local_size_z = 1) in;

With those settings all 64 threads of a wavefront have work and you should see a boost in performance.
(You need to make sure the workgroup size is a multiple of 64)


layout(local_size_x = 256,local_size_y = 1,local_size_z = 1) in;

With a setting like this, an AMD GPU would join 4 wavefronts, so 256 threads can access the same LDS memory.

But to really saturate a GPU you need to feed it with a LOT more work.
For example, an AMD Fury has 64 Compute Units, each of which can work on 10 wavefronts of 64 threads, so in the best case you need work for 40960 threads.
(The GPU switches to another of those 10 wavefronts in flight whenever one waits on memory operations, and uses this mechanism to hide memory latency and bandwidth issues.)
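
A rough sketch of what that means on the API side, assuming one thread per element and a 64-wide workgroup (the variable names and the element count are just illustrative):

// One thread per element, 64 threads per workgroup.
uint32_t elementCount = 40960;                    // e.g. enough to saturate a Fury in the best case
uint32_t groupCount   = (elementCount + 63) / 64; // round up so every element is covered
vkCmdDispatch(commandBuffer, groupCount, 1, 1);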

Hm… maybe I should explain what I’m actually trying to do.
I’ve implemented a water surface simulation using a height field on the CPU:

All of the particles are then mapped to a vertex buffer each frame.
However, I have no actual use for the particle data on the CPU, so I thought I might as well move the entire simulation to the GPU, especially since the particle count grows quickly with the surface area of the water.

Is there any reason not to make use of all threads?
Do I need to synchronize the data somehow?
How do I know what to choose for the ‘local_size_*’ values? Just test and see what works best?

Seems you’re new to compute; I always recommend reading the compute chapter from the OpenGL SuperBible. It tells you all the important stuff in a few pages.
(But they use the barriers in the wrong order: memoryBarrierShared(); barrier(); is correct, the other way around may fail on some hardware.)

The reason you need to think about threads in groups is that this is how the hardware works.
The CU operates all those 64 threads in lockstep. Each thread executes the same instructions; only the data differs. (Branches can become expensive because both sides get serialized, unless ALL threads make the same decision - see the sketch below.)
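
A small illustration of that, assuming a simple SSBO of floats (the binding and names are made up for the example):

#version 440

layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

layout(std430, binding = 0) buffer Data { float value[]; } u_data;

void main(void)
{
	uint i = gl_GlobalInvocationID.x;
	float v = u_data.value[i];

	// This branch diverges inside the wavefront (lanes 0..31 vs 32..63),
	// so the hardware runs both paths and masks out the inactive lanes.
	if (gl_LocalInvocationID.x < 32)
		v = v * 2.0;
	else
		v = v + 1.0;

	// A uniform branch, where all 64 threads make the same decision,
	// stays cheap because only one path is executed.

	u_data.value[i] = v;
}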

For your actual work, an efficient solution would be:

Each workgroup operates on a cluster of 8x8 particles, and you dispatch enough workgroups that the whole fluid is covered.
There is no loop inside the shader; instead, each thread operates on a single particle. (You use things like gl_LocalInvocationID and gl_WorkGroupID to do the proper indexing - it can get a bit confusing initially :slight_smile: )
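
The indexing itself is short; a minimal sketch (the push constant for the grid width is an assumption, your data layout may differ):

#version 440

layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

// Illustrative only: the grid width has to come from somewhere, e.g. a push constant or UBO.
layout(push_constant) uniform Push { uint gridWidth; } u_push;

void main(void)
{
	// gl_GlobalInvocationID == gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
	uint x = gl_GlobalInvocationID.x;
	uint y = gl_GlobalInvocationID.y;
	uint particleIndex = y * u_push.gridWidth + x; // the single particle this thread simulates
	// ... read, simulate and write back the particle at particleIndex ...
}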

I assume your particles also need the data from their neighbours. This is where it gets interesting once you’re through with the basics:
The 8x8 workgroup loads its whole block of particle data into LDS (called ‘shared’ in GLSL).
Now each thread can access all the neighbouring data without having to read it from memory again, which would be very bandwidth heavy.
(You still have special cases at the borders where information is missing; you could solve this by loading a 10x10 block - see the sketch below.)
The point is: you load each particle only once instead of 1 + 4 or 8 times. The performance improvement will be huge.
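
A rough sketch of that idea for an 8x8 workgroup with a one-particle border (10x10 values in LDS); the buffer layout is an assumption and the border loads are omitted for brevity:

#version 440

layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

// Illustrative buffer layout - your real particle struct would go here.
layout(std430, binding = 0) buffer Heights
{
	uint width;      // grid width, used to flatten 2D coordinates
	float height[];
} u_heights;

// 8x8 particles plus a 1-particle border on each side = 10x10 values in LDS.
shared float s_height[10][10];

void main(void)
{
	uvec2 local  = gl_LocalInvocationID.xy;  // 0..7 inside the workgroup
	uvec2 global = gl_GlobalInvocationID.xy; // the particle handled by this thread

	// Each thread loads its own particle into the interior of the tile...
	s_height[local.y + 1][local.x + 1] =
		u_heights.height[global.y * u_heights.width + global.x];

	// ...and a few threads would additionally load the 36 border values
	// (the surrounding rows/columns plus corners) - omitted here for brevity.

	memoryBarrierShared(); barrier();        // all loaded values are now visible to the group

	// Neighbours can now be read from fast LDS instead of going back to memory:
	float left  = s_height[local.y + 1][local.x    ];
	float right = s_height[local.y + 1][local.x + 2];
	// ... the rest of the simulation would use these values ...
}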

For further details this is a nice one: developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/
(There is no difference between OpenCL and compute shaders that matters here.)

How do I know what to choose for the ‘local_size_*’ values? Just test and see what works best?

In cases where you have a choice (like yours) you may need to use different numbers for different hardware, if you really need every bit of performance you can get.
It’s just a matter of profiling and target hardware; mostly 64, 128 or 256 (and 32 only on NV).

But this is important:
Larger workgroups mean you get more LDS per workgroup, because LDS is per CU (about 64KB) and can be joined.
So LDS usage and problem size are the important factors in cases where you run into such limits.

Thank you, that helped me quite a bit. :slight_smile:
I’m still not quite sure how to resolve the neighbor issue however.
My shader currently looks like this:


#version 440

layout(local_size_x = 8,local_size_y = 8,local_size_z = 1) in;

[...]

void main(void)
{
	uint x = gl_GlobalInvocationID.x;
	uint y = gl_GlobalInvocationID.y;
	if(x >= u_surfaceInfo.width || y >= u_surfaceInfo.height)
		return;
	// Simulate the particle
	uint ptIdx = y *u_surfaceInfo.width +x;
	WaterParticle pt = u_particles.particles[ptIdx];

	pt.height = min(pt.height *pt.velocity,u_surfaceInfo.maxHeight);

	// solve_edges(pt);

	float v = pt.targetHeight -pt.height;
	pt.height = min(pt.height -v *u_surfaceInfo.stiffness,u_surfaceInfo.maxHeight);

	float delta = pt.height -pt.oldHeight;
	pt.velocity = delta;
	pt.oldHeight = pt.height;

	// Output position for rendering
	u_vertices.vertices[ptIdx].position.xyz = vec3(
		u_surfaceInfo.origin.x +y *u_surfaceInfo.spacing,
		u_surfaceInfo.origin.y +pt.height,
		u_surfaceInfo.origin.z +x *u_surfaceInfo.spacing
	);
}


The size of the particle field can be somewhat arbitrary. In my test case it has a width of 181 and a length of 267. With a local workgroup size of 8 in x and y (1 for z), I’ve set the workgroup counts (in the dispatch call) to 23, 34 and 1.
Since the total number of particles is not a multiple of 64, I have to make sure it doesn’t try to simulate particles that don’t actually exist:


	uint x = gl_GlobalInvocationID.x;
	uint y = gl_GlobalInvocationID.y;
	if(x >= u_surfaceInfo.width || y >= u_surfaceInfo.height)
		return;

Would it be better to increase the particle field to a multiple of 64 and get rid of the if-statement, simulate all particles and just ignore the excess particles that I don’t need?

The code I’ve posted works; however, the ‘solve_edges’ part is still missing. This is where the neighbors come in: I need to make sure that all neighbors of the current particle have reached this point in the code (but no further).
Now, as I understand it, all particles within a work group would be computed in parallel, so within a work group this would apply.
As you’ve mentioned, that will be problematic at the edges of an 8x8 block, but I’m not quite sure what you mean by this part:

If I change the local workgroup size from 8x8 to 10x10, it’s no longer a multiple of 64; I thought that was a problem?
Also, since I would now have overlaps for each block, all particles that lie near edges would be simulated multiple times.
I think I’m just misunderstanding what to do here…

There are certain ‘events’ that affect the water surface, e.g. a water splash when an object hits it.
What would be the best way to handle something like that? I could write a separate compute shader and execute it whenever a splash occurs, but I’m not sure if that’s the ‘right’ way of doing it.

Would it be better to increase the particle field to a multiple of 64 and get rid of the if-statement, simulate all particles and just ignore the excess particles that I don’t need?

Probably yes, but it should not make a big difference. Range checks like the one you currently do are not bad, because in most cases all threads will make the same decision.
So you lose only the performance of the check itself; but more importantly, you need to load width and height to do the check. That load should have the worst effect on performance, so this is where your win might come from.

One thing I see is that you seem to use AOS-like data, like this:



// AOS:
struct Particle
{
	float height;
	float oldHeight;
	//...
} particles[100];

// SOA:
float particleHeight[100];
float particleOldHeight[100];


// example (SOA):

float v = particleHeight[x] - particleOldHeight[x];

Now looking at the example, note that first ALL threads load from particleHeight, and then all threads load from particleOldHeight, in parallel.
This is why SOA is usually better (if your struct is large enough): this way we load from adjacent (or at least closer) memory locations.
Those things can matter a lot and are worth trying out.
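
In GLSL terms the SOA variant could look roughly like this (bindings and names are made up; the write at the end is only there so the loads are not dead code):

#version 440

layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

// SOA: one tightly packed array per attribute instead of an array of structs.
layout(std430, binding = 0) buffer Heights    { float height[];    } u_height;
layout(std430, binding = 1) buffer OldHeights { float oldHeight[]; } u_oldHeight;

void main(void)
{
	uint i = gl_GlobalInvocationID.x;
	// All 64 threads of the wavefront read adjacent height[] elements,
	// then adjacent oldHeight[] elements - nicely coalesced loads.
	float v = u_height.height[i] - u_oldHeight.oldHeight[i];
	u_height.height[i] += v;
}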

I need to make sure that all neighbors of the current particle have reached this point in the code (but no further).

You usually solve this at the API level:
Do one dispatch that processes all data up to the point where you need to sync.
Insert a memory barrier to ensure all data has been written.
Do the next dispatch that proceeds from there.

To make this work you may need multiple copies of the data, use double buffering, divide your algorithm into smaller parts, etc. (see the sketch below).
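
A minimal sketch of that pattern on the Vulkan side (only the command buffer recording; cmd, groupsX and groupsY are placeholders, and the pipeline/descriptor setup is omitted):

// Pass 1: e.g. integrate the particle heights.
vkCmdDispatch(cmd, groupsX, groupsY, 1);

// Make the writes of pass 1 visible to the reads of pass 2.
VkMemoryBarrier barrier = {};
barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
vkCmdPipelineBarrier(cmd,
	VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // producing stage
	VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // consuming stage
	0, 1, &barrier, 0, nullptr, 0, nullptr);

// Pass 2: e.g. solve the edges using the neighbour data written by pass 1
// (typically a second compute pipeline would be bound here first).
vkCmdDispatch(cmd, groupsX, groupsY, 1);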

If I change the local workgroup size from 8x8 to 10x10, it’s no longer a multiple of 64; I thought that was a problem?
Also, since I would now have overlaps for each block, all particles that lie near edges would be simulated multiple times.
I think I’m just misunderstanding what to do here…

Yep, that’s a misunderstanding.
Let me give a simpler, one-dimensional example, where we want to blur a long horizontal line of pixels:



layout(local_size_x = 64,local_size_y = 1,local_size_z = 1) in;
 
[...]

shared float lds[64+2];

void main(void)
{
	uint threadID = gl_LocalInvocationID.x;
	uint leftmostPixelIndex = gl_WorkGroupID.x * 64;

	// Note: for the very first workgroup this reads one pixel out of bounds
	// (the border special case mentioned earlier) - ignored here for simplicity.
	float leftNeighbour = pixelBuffer[leftmostPixelIndex + threadID - 1]; // we keep this value in a register

	lds[threadID] = leftNeighbour; // we also put the value into LDS so the other threads have access

	if (threadID < 2) lds[threadID + 64] = pixelBuffer[leftmostPixelIndex + threadID + 64 - 1];

	memoryBarrierShared(); barrier(); // now we have loaded 64 pixels plus the left and right neighbour

	float blurredResult =
		leftNeighbour     * 0.25 + // we could load this one from LDS, but it's faster to use the register we already have anyway
		lds[threadID + 1] * 0.5  + // our pixel
		lds[threadID + 2] * 0.25;  // right neighbour

	blurredPixelBuffer[leftmostPixelIndex + threadID] = blurredResult;
}


So I still use a 64-wide workgroup, but two threads need to do an extra read to get the 66 values.
Another option would be to read only 64 pixels and output just 62, at the cost of using a few more workgroups and having 2 idle threads.

Not sure which is better, and it should not matter much, but we could assume that using 256-wide workgroups would be better than using just 64, because there would be less overlap.
However, this means joining several wavefronts together, and sharing LDS between them has some additional cost,
so it’s again a matter of profiling to find the sweet spot.

What is really important to understand is that we use fast LDS both for inter-thread communication and as a kind of user-controlled cache.
There is nothing quite like it on CPUs, so together with the lock-stepped threadgroup behaviour these are the two major things to think about.
They define why and how we implement algorithms differently on CPU and GPU.