Would it be better to increase the particle field to a multiple of 64 and get rid of the if-statement, simulate all particles and just ignore the exceeding particles that I don’t need?
Probably yes, but it should not make a big difference. Range checks like the one you currently do are not bad, because in most cases all threads will make the same decision.
So you lose only the performance of the check itself, but more importantly you need to load width and height to do the check. The load should have the worst effect on performance, so this is where your win might come from.
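For the padding itself, here is a minimal host-side sketch in C. The function names are made up for this example, and the group sizes (64, or 8 per axis for an 8x8 group) are taken from the question, so adjust them to your setup.

```c
#include <assert.h>
#include <stddef.h>

/* Round value up to the next multiple of `multiple` (e.g. the workgroup size). */
size_t round_up_to_multiple(size_t value, size_t multiple)
{
    assert(multiple > 0);
    return (value + multiple - 1) / multiple * multiple;
}

/* Number of workgroups needed along one axis (hypothetical helper). */
size_t workgroup_count(size_t value, size_t group_size)
{
    return round_up_to_multiple(value, group_size) / group_size;
}
```

With a field of 100 particles along one axis and 8 threads per axis you would allocate 104 and dispatch 13 groups; the 4 extra particles get simulated but are never read back.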
One thing I see is that you seem to use AOS-like data:
// AOS:
struct Particle
{
float hight;
float oldhight;
//...
} particles[100];
// SOA:
float particleHight[100];
float particleOldHight[100];
// example:
float v = particleHight[x] - particleOldHight[x];
Looking at the example, note that first ALL threads load from particleHight and then all threads load from particleOldHight in parallel.
This is why SOA is usually better (if your struct is large enough): the threads load from adjacent (or at least closer) memory locations.
These things can matter a lot, so it is worth trying out.
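If you want to try it without rewriting everything at once, a repacking step like the following may help. This is only a sketch in plain C: the struct and function names are my own, and the particle count of 100 is just the number from the snippet above.

```c
#include <assert.h>
#include <stddef.h>

#define PARTICLE_COUNT 100

/* AOS: the fields of one particle sit next to each other in memory. */
struct ParticleAoS {
    float height;
    float oldHeight;
};

/* SOA: each field gets its own tightly packed array. */
struct ParticlesSoA {
    float height[PARTICLE_COUNT];
    float oldHeight[PARTICLE_COUNT];
};

/* Repack an AOS array into the SOA layout. */
void aos_to_soa(const struct ParticleAoS *in, struct ParticlesSoA *out, size_t count)
{
    assert(count <= PARTICLE_COUNT);
    for (size_t i = 0; i < count; ++i) {
        out->height[i] = in[i].height;
        out->oldHeight[i] = in[i].oldHeight;
    }
}

/* Same computation as the example above. On a GPU, x would be the thread
   index, so neighbouring threads read neighbouring floats from each array. */
float height_delta(const struct ParticlesSoA *p, size_t x)
{
    assert(x < PARTICLE_COUNT);
    return p->height[x] - p->oldHeight[x];
}
```

In the AOS version the same two reads would be a whole struct apart for neighbouring threads, which is the difference you would be measuring.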
I need to make sure that all neighbors of the current particle have reached this point in the code (but no further).
You usually solve this on the API level:
Do one dispatch that processes all data up to the point where you need to sync.
Insert a memory barrier to ensure all data has been written.
Do the next dispatch that proceeds from there.
To make this work you may need multiple copies of the data, double buffering, splitting your algorithm into smaller parts, etc.
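The dispatch / barrier / dispatch pattern with double buffering can be emulated on the CPU like this. It is a sketch, not your simulation: the smoothing step is a stand-in for whatever your pass computes, the function names are invented, and the comments assume OpenGL, where glDispatchCompute and glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) would take the marked places.

```c
#include <assert.h>
#include <stddef.h>

/* One "dispatch": every element reads its neighbours from src and writes
   only its own slot in dst, so nobody can see a half-updated neighbour. */
void smooth_pass(const float *src, float *dst, size_t count)
{
    assert(count >= 1);
    for (size_t i = 0; i < count; ++i) {
        size_t left = (i == 0) ? 0 : i - 1;            /* clamp at the edges */
        size_t right = (i + 1 == count) ? i : i + 1;
        dst[i] = 0.25f * src[left] + 0.5f * src[i] + 0.25f * src[right];
    }
}

/* Run `passes` passes, ping-ponging between two buffers.
   Returns the buffer that holds the final result. */
float *run_passes(float *a, float *b, size_t count, int passes)
{
    float *src = a, *dst = b;
    for (int p = 0; p < passes; ++p) {
        smooth_pass(src, dst, count);  /* GPU: glDispatchCompute(...) */
        /* GPU: glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) goes here, so
           the next dispatch sees everything this one has written. */
        float *tmp = src;              /* swap the roles of the two buffers */
        src = dst;
        dst = tmp;
    }
    return src;
}
```

The point is that the sync happens between passes, not inside one: within a pass nothing is read that the same pass writes.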
If I change the local workgroup size from 8x8 to 10x10, it’s not a multiple of 64 anymore, I thought that was a problem?
Also, since I now would have overlaps for each block, that would mean that all particles which lie near edges would be simulated multiple times.
I think I’m just misunderstanding what to do here…
Yep, that’s a misunderstanding.
Here is a simpler one-dimensional example where we want to blur a long horizontal line of pixels:
layout(local_size_x = 64,local_size_y = 1,local_size_z = 1) in;
[...]
shared float lds[64+2];
void main(void)
{
uint threadID = gl_LocalInvocationID.x;
uint leftmostPixelIndex = gl_WorkGroupID.x * 64;
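// note: for the very first workgroup thread 0 reads index -1 (and the last workgroup reads past the end), so real code must clamp the index or pad the buffer
float leftNeighbour = pixelBuffer[leftmostPixelIndex + threadID - 1]; // we store this value in a register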
lds[threadID] = leftNeighbour; // we put the value also to LDS so other threads have access
if (threadID < 2) lds[threadID + 64] = pixelBuffer[leftmostPixelIndex + threadID + 64 - 1];
memoryBarrierShared(); barrier(); // now we have loaded 64 pixels plus left and right neighbour
float blurredResult =
leftNeighbour * 0.25 + // we could load this one from LDS, but it's a lot faster to use the register we have anyways.
lds[threadID + 1] * 0.5 + // our pixel
lds[threadID + 2] * 0.25; // right neighbour
blurredPixelBuffer[leftmostPixelIndex + threadID] = blurredResult;
}
So I still use a 64 wide workgroup, but two threads need to do an extra read to get 66 values.
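If the indexing is hard to follow, it may help to emulate one workgroup on the CPU. The sketch below is plain C, not shader code: the local array plays the role of LDS, the loops play the role of the 64 threads, and the clamping at the line ends is my own assumption since the shader above does not handle the borders. All function names are invented.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP 64

/* Read with clamping so the halo of the first and last group stays in bounds. */
float load_clamped(const float *pixels, size_t count, long index)
{
    if (index < 0) index = 0;
    if (index >= (long)count) index = (long)count - 1;
    return pixels[index];
}

/* Emulates one workgroup: fill a 64+2 tile (the "LDS"), then blur from it. */
void blur_group(const float *pixels, float *blurred, size_t count, size_t group)
{
    float lds[GROUP + 2];
    long leftmost = (long)(group * GROUP);

    /* Load phase: lds[i] holds pixel leftmost + i - 1. On the GPU, thread t
       loads lds[t], and threads 0 and 1 also load lds[64] and lds[65]. */
    for (int i = 0; i < GROUP + 2; ++i)
        lds[i] = load_clamped(pixels, count, leftmost + i - 1);

    /* barrier() would sit here on the GPU. */

    /* Compute phase: thread t owns pixel leftmost + t, stored at lds[t + 1]. */
    for (int t = 0; t < GROUP; ++t) {
        size_t out = (size_t)leftmost + (size_t)t;
        if (out < count)
            blurred[out] = lds[t] * 0.25f + lds[t + 1] * 0.5f + lds[t + 2] * 0.25f;
    }
}

/* Runs as many groups as needed to cover the whole line. */
void blur_line(const float *pixels, float *blurred, size_t count)
{
    assert(count > 0);
    size_t groups = (count + GROUP - 1) / GROUP;
    for (size_t g = 0; g < groups; ++g)
        blur_group(pixels, blurred, count, g);
}
```

Each group writes exactly its own 64 pixels, so nothing is processed twice; only the two halo reads overlap with the neighbouring groups.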
Another option would be to read only 64 pixels and output just 62, at the cost of using a few more workgroups and having 2 idle threads.
Not sure what’s better and it should not matter much, but we could assume using 256 wide workgroups would be better than using just 64, because there would be less overlap.
However, a workgroup that wide is made of several wavefronts that have to synchronize at the barrier, which has additional cost,
so it’s again a matter of profiling to find the sweet spot.
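To put a number on the overlap, here is a tiny C helper (the name is made up) that computes the share of extra reads for a given workgroup width, assuming a halo of 2 pixels as in the blur example.

```c
#include <assert.h>

/* Fraction of extra reads per workgroup: halo pixels / pixels written. */
double halo_overhead(unsigned group_width, unsigned halo)
{
    assert(group_width > 0);
    return (double)halo / (double)group_width;
}
```

For a width of 64 this gives 2/64, about 3.1% extra reads, and for 256 it gives 2/256, about 0.8%. Both are small for a 3-tap blur, but the gap grows with wider kernels.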
What is really important to understand is that we use fast LDS both for inter-thread communication and as a kind of user-controlled cache.
CPUs have nothing quite like it, so together with the lockstepped threadgroup behaviour these are the two major things to think about.
They define why and how we implement algorithms differently on CPU and GPU.