DirectX 12 ExecuteIndirect equivalent

Hi,

I have found that NVIDIA introduced the VK_NVX_device_generated_commands extension, which seems equivalent to or better than the DirectX 12 ExecuteIndirect counterpart. Is there a timeline for when this will become part of the standard Vulkan API rather than a per-vendor extension? Is there anything currently in the standard Vulkan API that comes close to DirectX 12’s ExecuteIndirect feature?

cmdDispatchIndirect does exactly what ExecuteIndirect does. The VK_NVX_device_generated_commands extension was frowned upon by one of AMD’s architects about a year ago, but there is nothing stopping Khronos from adding it if NVIDIA convinces the rest of the committee that it’s useful. I have seen slides mentioning that they are considering adding the Device Side Enqueue feature from OpenCL 2.0.
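For reference, this is roughly how cmdDispatchIndirect is used today (a minimal sketch; cmdBuffer, computePipeline and indirectArgsBuffer are placeholder handles, and buffer/pipeline setup is omitted):

// The buffer must contain a VkDispatchIndirectCommand (three uint32_t workgroup
// counts). A compute shader, or the host, writes these counts before execution.
// The dispatch itself is still pre-recorded on the CPU; only its workgroup
// counts are fetched from GPU memory when the command executes.
vkCmdBindPipeline(cmdBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, computePipeline);
vkCmdDispatchIndirect(cmdBuffer, indirectArgsBuffer, /*offset*/ 0);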

cmdDispatchIndirect is not on par with ExecuteIndirect.
With ExecuteIndirect we can create a variable number of commands from the GPU, so we can avoid things like zero-work dispatches followed by useless memory barriers. (Correct me if I’m wrong, I have not used DX12 yet.)
With cmdDispatchIndirect we need to bake everything that could possibly happen into the command buffer, even if it’s pointless at runtime.

So ExecuteIndirect is a reason why DX12 is ahead. I second the request for something similar. I’d get about 10% better performance if I could cull zero-work dispatches.

The problem goes hand in hand with the current bad solution to utilize async compute: we need to split command buffers and use multiple queues with semaphores to synchronize.
The additional overhead makes it impossible for my fine-grained use case to make async compute a win.
Although experiments show it could be a huge win and two dispatches with small workloads CAN execute in parallel in half the time, I can’t get there in practice.

I wish both problems could be solved with the same mechanism in a future Vulkan version.
It’s the final thing missing; only then can we unleash all GPU power.

we can avoid things like zero-work dispatches followed by useless memory barriers

I can’t find anything about that in the docs: ID3D12GraphicsCommandList::ExecuteIndirect (d3d12.h) - Win32 apps | Microsoft Learn
Also, I wasn’t aware that DX12 uses only one function instead of cmdDrawIndirect + cmdDispatchIndirect.

With cmdDispatchIndirect we need to bake everything that could possibly happen into the command buffer, even if it’s pointless at runtime.

You need to optimize for the worst case scenario when doing realtime graphics. Nothing happened + barrier cannot possibly be slower than everything happened + barrier.

The problem goes hand in hand with the current bad solution to utilize async compute

Mantle, DX12 and Vulkan are all exactly the same in this regard: in order to use async, you need to utilize multiple queues, and in order to utilize multiple queues you have to use heavyweight semaphores. I’m not sure where you’re coming from.
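For readers following along, this is roughly the multi-queue pattern being referred to (a minimal sketch; computeQueue, graphicsQueue, the command buffers and computeDoneSemaphore are placeholders, and all setup is omitted):

// Submit the compute work on its own queue and signal a semaphore when done.
VkSubmitInfo computeSubmit = {};
computeSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
computeSubmit.commandBufferCount = 1;
computeSubmit.pCommandBuffers = &computeCmdBuffer;
computeSubmit.signalSemaphoreCount = 1;
computeSubmit.pSignalSemaphores = &computeDoneSemaphore;
vkQueueSubmit(computeQueue, 1, &computeSubmit, VK_NULL_HANDLE);

// The graphics submission waits on that semaphore before it starts
// (conservatively at the top of the pipe) and then consumes the results.
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;
VkSubmitInfo graphicsSubmit = {};
graphicsSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
graphicsSubmit.waitSemaphoreCount = 1;
graphicsSubmit.pWaitSemaphores = &computeDoneSemaphore;
graphicsSubmit.pWaitDstStageMask = &waitStage;
graphicsSubmit.commandBufferCount = 1;
graphicsSubmit.pCommandBuffers = &graphicsCmdBuffer;
vkQueueSubmit(graphicsQueue, 1, &graphicsSubmit, VK_NULL_HANDLE);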

After reading a little more into it:
ExecuteIndirect also binds descriptor sets, but that only affects CPU overhead, which is already low in these APIs.

If I get it right, the arguments for ExecuteIndirect come from an ID3D12Resource, which can be a GPU buffer?
So a compute shader can write the commands this way.
Anyone with better DX12 experience may clarify…

quote: You need to optimize for the worst case scenario when doing realtime graphics. Nothing happened + barrier cannot possibly be slower than everything happened + barrier.

Good point, but it does not apply.
E.g. I have a tree of samples with 16 levels. Some interpolation needs to be done on some samples; mostly they are in levels 3 and 4, but some of them may fall to a higher or lower level.
But I have to dispatch a shader for each level because I can’t be sure. It’s guaranteed I have dozens of zero dispatches per frame even in the worst case.
Maybe a bad example, but trust me, your argument does not apply to MANY use cases.

quote: Mantle, DX12 and Vulkan are all exactly the same in this regard

Not sure about Mantle (AFAIK it has device-side enqueue, but I don’t know if it operates asynchronously), but if you’re right they are all equally inefficient. (It’s not my point to favor or criticize existing APIs; I’m making a feature request for the future.)
I assume GCN can do synchronization on its own, efficiently and fine-grained. Other vendors will follow.
We need access to this to parallelize small workloads and, if possible, to do small workloads while memory barriers are executed.

ID3D12Resource is a host-side structure, so nope, the whole array has to be filled beforehand.

If I get it right, the arguments for ExecuteIndirect come from an ID3D12Resource, which can be a GPU buffer?
So a compute shader can write the commands this way.

I’m not really sure what you’re getting at here. The data for the commands being indirectly executed are stored in GPU memory (that is after all what indirect execution means). But the parameters to ExecuteIndirect are obviously just like any other function’s parameters.

There are only two differences between ExecuteIndirect and vkCmd*Indirect:

  1. ExecuteIndirect is able to get the number of commands to execute from GPU memory, while vkCmd*Indirect must take it from a parameter.

  2. ExecuteIndirect can execute things other than draw/dispatch operations. Specifically, it can change most kinds of resource/buffer bindings. Though any state changes it makes are nullified at the end of the command.

#1 is available via an extension: VK_AMD_draw_indirect_count. FYI: in OpenGL, there is an ARB extension providing the same thing. The GL extension is surprisingly widely implemented, so Vulkan-capable desktop hardware is able to do this.
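For reference, the extension’s entry point looks roughly like this (a sketch; cmdBuffer, argumentBuffer, countBuffer and maxDrawCount are placeholders, and the count is assumed to have been written by an earlier compute pass):

// The draw count is read from countBuffer on the GPU at execution time and
// clamped to maxDrawCount, so a shader can decide how many of the
// pre-recorded draws actually run.
vkCmdDrawIndirectCountAMD(
    cmdBuffer,
    argumentBuffer, 0,               // array of VkDrawIndirectCommand structs
    countBuffer, 0,                  // uint32_t draw count written on the GPU
    maxDrawCount,
    sizeof(VkDrawIndirectCommand));  // stride between argument structs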

#2 is available (in a substantially more flexible way) in the aforementioned experimental NVIDIA extension.

I would not expect the NVX functionality to become core Vulkan behavior anytime soon. It’s just too big.

It should also be noted that it is not exactly clear whether ExecuteIndirect is truly natively reading the buffer-changing commands or whether it is internally performing what the NVX extension does: running a GPU operation that reads those commands and turns them into the actual platform-specific GPU data. By contrast, if you’re not doing any state changes (which ExecuteIndirect can detect by the fact that your CommandSignature has no buffer targets), then indirect rendering will almost certainly not use this two-phase approach.
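For readers unfamiliar with the D3D12 side, a command signature for case #2 looks roughly like this (a sketch; device, commandList, the buffers and the counts are placeholders, and error handling is omitted):

// Each record in the argument buffer rebinds vertex buffer slot 0, then draws.
D3D12_INDIRECT_ARGUMENT_DESC args[2] = {};
args[0].Type = D3D12_INDIRECT_ARGUMENT_TYPE_VERTEX_BUFFER_VIEW;
args[0].VertexBuffer.Slot = 0;
args[1].Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW;

D3D12_COMMAND_SIGNATURE_DESC sigDesc = {};
sigDesc.ByteStride = sizeof(D3D12_VERTEX_BUFFER_VIEW) + sizeof(D3D12_DRAW_ARGUMENTS);
sigDesc.NumArgumentDescs = 2;
sigDesc.pArgumentDescs = args;

// No root arguments are changed, so no root signature is needed here.
ID3D12CommandSignature* commandSignature = nullptr;
device->CreateCommandSignature(&sigDesc, nullptr, IID_PPV_ARGS(&commandSignature));

// The command count is read from countBuffer on the GPU, clamped to maxCommandCount.
commandList->ExecuteIndirect(commandSignature, maxCommandCount,
                             argumentBuffer, 0, countBuffer, 0);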

quote: 1) ExecuteIndirect is able to get the number of commands to execute from GPU memory

But this can be used to avoid zero dispatches:
Record N commands, but at runtime decide on the GPU to execute only 3 of them (using a compute shader to set the command count).
Wouldn’t this work? Sorry if I still get it wrong.

I do not assume this is on par with NVIDIA’s extension, and I agree that’s too complex, but the ability to skip over pre-recorded commands would be awesome.

Sadly, VK_AMD_draw_indirect_count works only for draws, not for compute shader dispatches.

How do you guys do a proper quote on this forum? And how do you edit a post?

But this can be used to avoid zero dispatches:
Record N commands, but at runtime decide on the GPU to execute only 3 of them (using a compute shader to set the command count).

I didn’t say it couldn’t.

Personally though, I just don’t think it’s that big of a deal. Especially since it requires a degree of synchronization between the code writing the command count and the indirect call.
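In Vulkan terms, that synchronization would be a pipeline barrier recorded between the compute pass that writes the count and the indirect execution, entirely on the GPU timeline (a minimal sketch; cmdBuffer is a placeholder):

// Make the count written by a compute shader visible to the indirect command read.
VkMemoryBarrier barrier = {};
barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;

vkCmdPipelineBarrier(cmdBuffer,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // count written here
                     VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,   // count consumed here
                     0, 1, &barrier, 0, nullptr, 0, nullptr);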

How do you guys do a proper quote on this forum? And how do you edit a post?

There is a button called “reply with quote” right under each post (not by you). Failing that, you can manually add [ quote ] tags.

15 minutes after making a post, you’re no longer allowed to further edit it.

[QUOTE=Alfonse Reinheart;42045]I didn’t say it couldn’t.
Personally though, I just don’t think it’s that big of a deal. Especially since it requires a degree of synchronization between the code writing the command count and the indirect call.
[/QUOTE]

Ok, got it. Thanks.
So if this sync has to happen between CPU and GPU, I would agree there is no urgent need for this.
But if some vendors are able to keep it entirely on the GPU, it’s a top priority to add.

[QUOTE=JoeJ___;42046]Ok, got it. Thanks.
So if this sync has to happen between CPU and GPU, I would agree there is no urgent need for this.[/quote]

OK, I said that wrong. When I said “indirect call”, I meant “the GPU’s execution of the indirect command”, not the actual API function call.

The CPU doesn’t have to get involved.

… why?

OK, I mentioned it earlier, but I’ll give a more detailed example of why I want this functionality in Vulkan.

I’ll replace my tree approach with an easier example to imagine: think of a nested grid of 16 levels with volumetric lighting information.
As the camera moves forward, the grid follows and new cells become visible, but we don’t have the compute power to calculate the data for all of them immediately; instead we interpolate their values from their parent cells, which have already been in view since the last frame.

So we have a compute shader that writes a list of new cells and their count to an indirect dispatch buffer for each grid level.
Then for each level we do the interpolation from the parent with an indirect dispatch, followed by a memory barrier to guarantee that we have valid data for the next lower level.
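Roughly, the recorded commands look like this per level (a sketch; cmdBuffer, indirectArgsBuffer and levelCount are placeholders, pipeline binding omitted):

// One indirect dispatch per grid level, each followed by a barrier so the next
// (lower) level sees the interpolated results of its parent level.
for (uint32_t level = 0; level < levelCount; ++level)
{
    vkCmdDispatchIndirect(cmdBuffer, indirectArgsBuffer,
                          level * sizeof(VkDispatchIndirectCommand));

    VkMemoryBarrier barrier = {};
    barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;

    vkCmdPipelineBarrier(cmdBuffer,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         0, 1, &barrier, 0, nullptr, 0, nullptr);
}
// If a level's arguments are (0,0,0), the dispatch and the barrier are still
// recorded and executed; that is the zero-work overhead described above.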

I have lots of cases like this. They handle camera movement, interpolation to hide popping when switching levels, building mipmap pyramids and so forth.

Most of them are guaranteed to produce work only for 2-4 of 16 levels per frame, but I can’t tell which levels. So for all of those tasks I need to record all indirect dispatches and the memory barriers for all 16 levels.
At the moment the only option to avoid those zero dispatches would be to download stats to the CPU, regenerate the command buffer, and submit it again every frame.
So if I could skip over pre-recorded commands from the GPU it would be a big win.

To test this, I change the max level count for my algorithm. My current scene requires at least 13 levels; runtime is 2.0 ms. Changing the level count to 20, I get 2.1 ms. So +5%, even though the workload is the same (the added levels have no data; the slowdown comes from zero dispatches and pointless barriers).
Notice that with this test most zero dispatches are still there, so I assume I could win at least 10% with the ability to remove them all.

I know other developers working on completely different algorithms but with the exact same problem; it’s a common issue, and modern APIs are not modern enough here. I don’t know of anything that needs improvement more than this.

I’m not sure if DX12 can really fix it with ExecuteIndirect. Maybe we need to set not only a command count, but also a start index.
E.g. I compile all my level-dependent shaders 16 times, changing the level by altering a define.
If I can set only the count, I always have to start either from level zero or from level 15, so this would fix only half the problem.
To fix this I’d need to write the level from the work-planning shader as well and read it each time when doing the work. This costs additional time, a scalar register and, worst of all, the possibility to optimize level 0 for the fact that it has no children.

Probably this is very different from what typical graphics devs love about ExecuteIndirect (I’m no graphics pipeline expert), but it’s worth pointing out.
The more we move away from the ‘brute force everything’ paradigm towards more work-efficient algorithms, the more obvious the problem becomes. And we need to do so to solve open problems like proper real-time GI.

To test this, I change the max level count for my algorithm. My current scene requires at least 13 levels; runtime is 2.0 ms. Changing the level count to 20, I get 2.1 ms. So +5%, even though the workload is the same (the added levels have no data; the slowdown comes from zero dispatches and pointless barriers).
Notice that with this test most zero dispatches are still there, so I assume I could win at least 10% with the ability to remove them all.

I don’t think your conclusion follows from the data. By adding 7 levels, you’re adding them to the bottom part of the hierarchy, correct? So you’ve effectively increased the number of nodes (which, as I understand it, correlates to the number of dispatches) exponentially. And yet, you only increase the cost of the operation by 5%, despite exponentially increasing the amount of empty work. That sounds like evidence that zero dispatches are not terribly expensive.

Maybe I have misunderstood what you’re trying to do and what a “level” consists of. But given my understanding, I can’t see this as evidence that you could get a 10% performance gain by eliminating other empty work.

In any case, the pipeline barriers can’t actually go away; not even VK_NVX_device_generated_commands allows you to add barriers. At that point, what you’re asking for is nothing less than complete GPU control of itself, the ability for the GPU to build command buffers on its own.

And that’s just not feasible at present.

Ultimately, what you have is a hierarchical chain of CS dispatches. I think I would rather see a more complete solution to that particular problem than to require such GPU-to-GPU command building.

No, I don’t add any nodes; I only add empty levels above the root (in case the scene complexity grows for some reason, I need to reserve some empty levels to ensure the system can still build a tree).
As said, for my test the additional levels add nothing more than zero-work dispatches and barriers, which cause the additional runtime.
(But I need to mention that my barriers actually cover more memory than necessary. I set them for whole buffers although only parts of them change. I hope I can limit the loss by adjusting this properly.)

[QUOTE=Alfonse Reinheart;42050]
Ultimately, what you have is a hierarchical chain of CS dispatches. I think I would rather see a more complete solution to that particular problem than to require such GPU-to-GPU command building.[/QUOTE]

Not sure what you mean by a complete solution. The hierarchical dependencies are unavoidable because they are key to performance.
Changing the algorithm to avoid them would be many times slower than just accepting the zero-work dispatches.

Yes, that’s what I want.
If no hardware can do it, then of course there is no point in discussing this further.
So GPUs need programmable command generation, and I would have to bring this up with the hardware vendors.

However, knowing it just isn’t possible, I feel a lot better. Thanks for clearing this up! :slight_smile:

The problem is that a compute shader cannot issue additional work orders to do more stuff. Indirect dispatches are an incomplete solution to that problem, since they are ultimately reliant on the CPU to actually start those work items.

A complete solution to that problem would be to simply allow a compute shader invocation to provoke the execution of another dispatch operation, probably with some way of specifying ranges of CS buffer/image memory which are made available to the new operation. Even if it is limited to executing the same shader, it would give you a lot of what you need.

[QUOTE=JoeJ___;42051]Yes, that’s what I want.
If no hardware can do it, then of course there is no point in discussing this further.[/quote]

I don’t know that no hardware can do it. But I do know that no API (extension or otherwise) exposes anything remotely like that.

Good luck with that :wink:

I wonder if something I’d call a “conditional barrier” is possible. It’s an operation that inserts an execution dependency but allows the hardware to loosen memory dependencies based on a value in a buffer. If we only write into that buffer using atomic operations (we could even add a buffer layout that forces all associated memory operations to bypass the cache), we don’t have to use an actual barrier. This is how it could work for the OP’s use case:

  1. We run a shader:

SomethingCool();
if (SomethingHappenedThatRequiresBarrier) { // needs to be performed by a single invocation per workgroup
  if (SignalFlag == 0)                      // no need to perform the atomic more than once
    atomicCompSwap(SignalFlag, 0, 1);
}

  2. cmdConditionalBarrier(SignalFlag == 1, /* everything that normal barriers need */);

The problem with the proposal is that it requires atomics to work the way I think they work, which, admittedly, is a stretch :smiley:

[QUOTE=Alfonse Reinheart;42052]The problem is that a compute shader cannot issue additional work orders to do more stuff. Indirect dispatches are an incomplete solution to that problem, since they are ultimately reliant on the CPU to actually start those work items.

A complete solution to that problem would be to simply allow a compute shader invocation to provoke the execution of another dispatch operation, probably with some way of specifying ranges of CS buffer/image memory which are made available to the new operation. Even if it is limited to executing the same shader, it would give you a lot of what you need.[/QUOTE]

Yes, that would do it. The limitation to use the same shader would be no problem.
I would be able to set memory ranges from an earlier work-planning shader before any hierarchical dispatches start their work.

A requirement to set memory ranges from the CPU would require either a slow GPU<->CPU interaction (but it may still be a win),
or the need to allocate enough worst-case memory for each tree level (too much in my case).

[QUOTE=Salabar;42053]I wonder if something I’d call a “conditional barrier” is possible. It’s an operation that inserts an execution dependency but allows the hardware to loosen memory dependencies based on a value in a buffer. If we only write into that buffer using atomic operations (we could even add a buffer layout that forces all associated memory operations to bypass the cache), we don’t have to use an actual barrier. This is how it could work for the OP’s use case:

  1. We run a shader:

SomethingCool();
if (SomethingHappenedThatRequiresBarrier) { // needs to be performed by a single invocation per workgroup
  if (SignalFlag == 0)                      // no need to perform the atomic more than once
    atomicCompSwap(SignalFlag, 0, 1);
}

  2. cmdConditionalBarrier(SignalFlag == 1, /* everything that normal barriers need */);

The problem with the proposal is that it requires atomics to work the way I think they work, which, admittedly, is a stretch :D[/QUOTE]

Maybe another option.
It reminds me of what I’m doing for multithreading on the CPU, and I was considering this for the GPU as well.
Here we would use only one dispatch per task instead of one dispatch per level, and a queue where the nodes are sorted by level:



shared uint nodeIndex;
shared uint level;

if (threadID == 0)
{
    // pop the next work item from a global queue in which nodes are sorted by level
    uint queueItem = globalWorkQueue[atomicAdd(globalQueueIndex, 1)];

    level = (queueItem >> 24) & 0xff;
    nodeIndex = queueItem & 0x00ffffff;

    // wait until all nodes of the parent level have been processed
    while (globalProcessedNodesPerLevelCount[level - 1] < globalExpectedNodesPerLevelCount[level - 1])
    {
        // ... busy waiting on GPU ?!?
    }
}

barrier(); // make nodeIndex and level visible to the whole workgroup

// process node...

memoryBarrier(); barrier(); // ensure all writes to global memory are visible to following nodes with a different level

if (threadID == 0) atomicAdd(globalProcessedNodesPerLevelCount[level], 1);


The problem is I don’t think it’s guaranteed that a processing wavefront can finish if lots of waiting wavefronts are in flight.
Also, this approach requires a global memory barrier within each shader invocation.
I guess even if it works it’s slower, but I’ll give it a try sometime.

Bypassing the cache may be interesting in general.
My project is large, but I never read data from buffers that get modified within a dispatch, and I never use barriers on global memory in a shader. (The only exception is the work-planning shader, which does atomics on global memory.)
If disabling the cache for writes did not slow things down and reduced some cache thrashing, that would be a nice option to have (no matter whether we still need API-side memory barriers or not).
AFAIK AMD is considering this for OpenCL.

My understanding of vkCmdDispatchIndirect is that it reads the three parameters used to determine the number of compute shader threads to execute from a buffer that resides on the GPU. This seems to be the same functionality as DispatchIndirect from DirectX 11. From what I can tell, vkCmdDispatchIndirect is the DirectX 11 functionality and not DX12’s ExecuteIndirect.

I’m looking for the equivalent of DX12’s ExecuteIndirect in the standard Vulkan API. The ExecuteIndirect method is essentially capable of executing a list of commands, such as changing vertex buffers, changing index buffers and issuing draw calls, all as a command buffer that can be executed with a single call from the CPU; conceptually there is a for loop happening on the GPU that interprets those commands:

// Read draw count out of count buffer
UINT CommandCount = pCountBuffer->ReadUINT32(CountBufferOffset);

CommandCount = min(CommandCount, MaxCommandCount);

// Get pointer to the first command argument
BYTE* Arguments = pArgumentBuffer->GetBase() + ArgumentBufferOffset;

for (UINT CommandIndex = 0; CommandIndex < CommandCount; CommandIndex++)
{
    // Interpret the data contained in *Arguments
    // according to the command signature
    pCommandSignature->Interpret(Arguments);

    Arguments += pCommandSignature->GetByteStride();
}

Does that kind of functionality exist in Vulkan without needing to rely on NVIDIA’s extension?