Ok, i mentioned this earlier, but i’ll give a more detailed example of why i want this functionality in Vulkan.
I’ll replace my tree approach with an example that is easier to imagine: Say we have a nested grid of 16 levels with volumetric light information.
As the camera moves forward the grid follows and new cells become visible, but we don’t have the compute power to calculate the data for all of them immediately. Instead we interpolate their values from their parent cells, which have already been in view since the last frame.
So we have a compute shader that writes a list of new cells and their count to an indirect dispatch buffer for each grid level.
Then for each level we do the interpolation from parent with indirect dispatch, followed by a memory barrier to guarantee that we have valid data for the next lower level.
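To make the cost of that pattern concrete, here is a minimal CPU-side sketch that only models the prerecorded command stream; it makes no Vulkan calls. The names `recordAllLevels`, `replay`, `Recorded` and `ReplayStats` are made up for illustration, and it assumes level 15 is the coarsest level and level 0 the finest. Only the three-field layout of `DispatchArgs` is taken from `VkDispatchIndirectCommand`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kLevels = 16;

// Same layout as VkDispatchIndirectCommand: three 32-bit group counts.
struct DispatchArgs { uint32_t x, y, z; };

enum class Cmd { DispatchIndirect, Barrier };
struct Recorded { Cmd cmd; int level; };

// Recorded once and reused every frame: one indirect dispatch plus one
// barrier per level, from the coarsest level (assumed 15) down to level 0.
std::vector<Recorded> recordAllLevels() {
    std::vector<Recorded> commands;
    for (int level = kLevels - 1; level >= 0; --level) {
        commands.push_back({Cmd::DispatchIndirect, level}); // stands in for vkCmdDispatchIndirect
        commands.push_back({Cmd::Barrier, level});          // stands in for vkCmdPipelineBarrier
    }
    return commands;
}

struct ReplayStats {
    int usefulDispatches = 0;
    int zeroDispatches = 0;
    int barriers = 0;
};

// Walks the fixed command stream against this frame's GPU-written arguments
// and counts how many dispatches launch no workgroups at all.
ReplayStats replay(const std::vector<Recorded>& commands,
                   const std::array<DispatchArgs, kLevels>& args) {
    ReplayStats stats;
    for (const Recorded& r : commands) {
        assert(r.level >= 0 && r.level < kLevels);
        if (r.cmd == Cmd::Barrier) {
            ++stats.barriers; // executed whether or not the level had work
            continue;
        }
        const DispatchArgs& a = args[r.level];
        if (a.x == 0 || a.y == 0 || a.z == 0) ++stats.zeroDispatches;
        else ++stats.usefulDispatches;
    }
    return stats;
}
```

With arguments that are non-zero only for levels 5, 6 and 7, `replay(recordAllLevels(), args)` reports 3 useful dispatches, 13 zero dispatches and 16 barriers.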
I have lots of cases like this. They handle camera movement, interpolation to hide popping when switching levels, building mip map pyramids and so forth.
Most of them are guaranteed to produce work only for 2-4 of 16 levels per frame, but i can’t tell which levels. So for all of those tasks i need to record all indirect dispatches and the memory barriers for all 16 levels.
At the moment the only option to avoid those zero dispatches would be to download stats, regenerate the command buffer and upload it to GPU every frame.
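A rough sketch of the CPU side of that workaround, assuming the per-level counts have already been read back into host memory. `LevelCounts` and `levelsToRecord` are hypothetical names, and the readback itself (with whatever synchronization or extra frame of latency it needs) is not shown.

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr int kLevels = 16;

// Hypothetical per-frame stats downloaded from the GPU: new cells per level.
using LevelCounts = std::array<uint32_t, kLevels>;

// Returns the levels that actually need a dispatch and a barrier this frame,
// coarsest first, so the command buffer can be re-recorded without the
// empty levels.
std::vector<int> levelsToRecord(const LevelCounts& newCells) {
    std::vector<int> levels;
    for (int level = kLevels - 1; level >= 0; --level) {
        if (newCells[level] != 0) levels.push_back(level);
    }
    return levels;
}
```

The filtering is trivial; the expensive part is the round trip around it, which is the reason to want a GPU-side way to skip commands.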
So if i could skip over prerecorded commands from GPU it would be a big win.
To test this, i changed the max level count for my algorithm. My current scene requires at least 13 levels - runtime is 2.0 ms. Changing the level count to 20 i get 2.1 ms. So +5% even though the workload is the same (the added levels have no data, the slowdown comes from zero dispatches and pointless barriers).
Notice that with this test most zero dispatches are still there, so i assume i could win at least 10% with the ability to remove them all.
I know other developers working on completely different algorithms but with the exact same problem, it’s a common issue and modern APIs are not modern enough here. I don’t know of anything that needs more improvement than this.
I’m not sure if DX12 can really fix it with ExecuteIndirect. Maybe we need to set not only a command count, but also a start index.
E.g. i compile all my level-dependent shaders 16 times, changing the level by altering a define.
If i can set only the count, i always have to start either from level zero or from level 15, so this would fix only half the problem.
To fix this i’d need to write the level as well from the work planning shader and read it each time when doing the work. This costs additional time and a scalar register, and worst of all it removes the possibility to optimize level 0 for the fact that it has no children.
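The difference between "count only" and "start index plus count" can be sketched like this. It is a plain C++ model, not DX12 or Vulkan API; `executeRange` and `executeCountOnly` are hypothetical, and command index i is assumed to map to the shader variant compiled for level i.

```cpp
#include <cassert>
#include <vector>

constexpr int kLevels = 16;

// Hypothetical: runs `count` prerecorded per-level commands starting at
// `firstLevel` and returns the levels whose specialized shader would run.
std::vector<int> executeRange(int firstLevel, int count) {
    assert(firstLevel >= 0 && count >= 0 && firstLevel + count <= kLevels);
    std::vector<int> executed;
    for (int i = 0; i < count; ++i) executed.push_back(firstLevel + i);
    return executed;
}

// With only a count available, execution is pinned to one end of the array
// (level 0 here), so every level below the first busy one still runs.
std::vector<int> executeCountOnly(int count) {
    return executeRange(0, count);
}
```

If only levels 5 to 7 have work, `executeCountOnly(8)` still runs 8 commands (5 of them empty), while `executeRange(5, 3)` runs exactly the 3 that matter and each shader keeps its level as a compile-time constant. This only covers a contiguous range; busy levels with gaps between them would still leave some zero dispatches.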
Probably this is very different from what typical graphics devs love about ExecuteIndirect (i’m no graphics pipeline expert), but it’s worth pointing out.
The more we move away from the ‘brute force everything’ paradigm towards more work-efficient algorithms, the more obvious the problem becomes. And we need to do so to solve open problems like proper realtime GI.