Signal semaphore on present

Simple question: is there a way for a swapchain present operation (i.e. vkQueuePresentKHR()) to signal (not wait on) a semaphore?

I got things running fine single-threaded, and wanted to move to multithreaded/task-based rendering. The general idea was that each frame would spawn its own task (which in turn could spawn more in the future if necessary) with associated command pools/framebuffers/etc… After recording the command buffer I would prefer to submit/present in that frame’s task instead of marshaling the data back to the main thread. The only real issue is that presents may not occur in exactly the same order the tasks were spawned, so I need to synchronize them. I could do it manually (some sort of atomic counter + spinloop), but it seems to me that semaphores would be perfect here (each frame’s presentation waiting on the previous via a semaphore). However, I know of no way to signal a semaphore from a present operation. Is there some extension that can do this? Am I missing something? Is there some way to do this using another method?

After recording the command buffer I would prefer to submit/present in that frame’s task instead of marshaling the data back to the main thread.

All vkQueue* calls must externally synchronize their access to the VkQueue objects they act on. Put simply, two threads cannot call any vkQueue* function on the same VkQueue object at the same time. So you’re going to have to either marshal that data around or otherwise synchronize these operations. Marshaling the data is the preferred option, since vkQueue* operations tend to be pretty heavyweight, so quick synchronization methods like spinlocks are not appropriate for them. Marshaling can usually happen with lockless synchronization.
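To illustrate the kind of lockless marshaling meant here, below is a minimal sketch (names like SubmitRequest and SpscRing are purely illustrative, not Vulkan API): a single-producer/single-consumer ring that a recording task pushes into and a dedicated submit thread drains, with no OS locks involved.

```cpp
#include <vulkan/vulkan.h>
#include <atomic>
#include <cstddef>

// Hypothetical payload: everything the submit thread needs for one frame.
struct SubmitRequest {
    VkCommandBuffer cmd;
    VkSemaphore     imageAvailable;  // the submit waits on this
    VkSemaphore     renderFinished;  // the present waits on this
    uint32_t        imageIndex;
};

// Single-producer/single-consumer ring: a recording task pushes, the
// dedicated submit thread pops. No OS locks, just two atomic counters.
template <std::size_t N>
class SpscRing {
    SubmitRequest            slots[N];
    std::atomic<std::size_t> head{0};  // advanced only by the consumer
    std::atomic<std::size_t> tail{0};  // advanced only by the producer
public:
    bool push(const SubmitRequest& r) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N)
            return false;  // full; the caller can retry
        slots[t % N] = r;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(SubmitRequest& out) {
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;  // empty; the submit thread polls again later
        out = slots[h % N];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```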

So what you’re trying to do won’t work, even if you could signal a semaphore on vkQueuePresentKHR.

Having a dedicated thread/task for a frame’s worth of operations on a specific queue is perfectly fine. Once all of the tasks that build graphics commands have completed, that task can activate, grab their work, and queue it up.

This is also particularly important if you need to be able to do transfer work on GPUs that don’t have dedicated transfer queues. Your transfer CBs will need to be submitted to the same queue as your graphics work, but you don’t want to completely restructure your application around single-queue devices. So you try to make such devices look like they have dedicated transfer queues from the perspective of your application. You generate transfer operations and marshal them to the transfer submission system. It just so happens that the transfer submit operation is handled by the graphics submit processing task rather than a dedicated transfer processing task.
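A rough sketch of that routing idea, assuming a thin wrapper of your own design (TransferBackend and enqueueForGraphicsSubmitTask are hypothetical names, not anything from Vulkan):

```cpp
#include <vulkan/vulkan.h>

// Illustrative routing layer: submit to a real transfer queue when one
// exists, otherwise marshal the batch to whichever task owns the graphics
// queue. Callers of submitTransfer() never know which path was taken.
struct TransferBackend {
    VkQueue transferQueue = VK_NULL_HANDLE;  // VK_NULL_HANDLE => no dedicated queue

    // Hypothetical hand-off to the graphics submission task (not shown).
    void enqueueForGraphicsSubmitTask(const VkSubmitInfo& batch, VkFence fence);

    void submitTransfer(const VkSubmitInfo& batch, VkFence fence) {
        if (transferQueue != VK_NULL_HANDLE)
            vkQueueSubmit(transferQueue, 1, &batch, fence);
        else
            enqueueForGraphicsSubmitTask(batch, fence);
    }
};
```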

The plan was for each task to have its own VkQueue (from the same queue family). For architectures that don’t support that many VkQueues, I would use a mutex to ensure access is synchronized. Isn’t that the point of VkQueues, to allow multiple threads to issue commands simultaneously?
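For concreteness, something along these lines (a sketch; QueueSlot and makeQueueSlots are just illustrative names):

```cpp
#include <vulkan/vulkan.h>
#include <algorithm>
#include <mutex>
#include <vector>

// One slot per task; the mutex is only ever contended when the family
// exposes fewer queues than there are tasks and a slot ends up shared.
struct QueueSlot {
    VkQueue    queue = VK_NULL_HANDLE;
    std::mutex lock;  // held around any vkQueue* call on this queue
};

std::vector<QueueSlot> makeQueueSlots(VkDevice device, uint32_t family,
                                      uint32_t familyQueueCount,
                                      uint32_t desiredTasks) {
    uint32_t count = std::min(familyQueueCount, desiredTasks);
    std::vector<QueueSlot> slots(count);
    for (uint32_t i = 0; i < count; ++i)
        vkGetDeviceQueue(device, family, i, &slots[i].queue);
    return slots;  // tasks index this modulo 'count'
}
```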

I would use a mutex to ensure access is synchronized.

If you’re talking about an OS mutex, that is not conducive to performance. If you’re operating in small millisecond budgets, genuine mutex locking is to be avoided whenever possible.

Isn’t that the point of VkQueues, to allow multiple threads to issue commands simultaneously?

No. The point of queues is to have a firm delineation between the creation of commands and the submission of commands for execution (which, prior to command-buffer APIs, were not obviously distinct operations). Having multiple queues is useful, but it is primarily useful for dealing with operations that are fundamentally asynchronous. That is, where the two operations you’re submitting can overlap execution. Transferring data while rendering, or doing some compute work towards the next frame while the current frame is doing rendering. And so forth.

If the operations cannot overlap execution, then there is little point in submitting them on separate queues. So using multiple queues to submit commands that have to be executed sequentially just isn’t helpful.

Mutexes are fast unless you have contention. On x86 they are little more than an atomic access, which (worst case) requires cache syncing/flushing. Either way they are certainly faster than marshaling data across threads.
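For reference, the kind of lock meant here is little more than an atomic flag; a minimal sketch:

```cpp
#include <atomic>

// A spin-only lock: no OS calls, no thread yield. Under low contention the
// worst case is a brief busy-wait rather than a trip through the kernel.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag.clear(std::memory_order_release); }
};
```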

No. The point of queues is to have a firm delineation between the creation of commands and the submission of commands for execution (which, prior to command-buffer APIs, were not obviously distinct operations). Having multiple queues is useful, but it is primarily useful for dealing with operations that are fundamentally asynchronous. That is, where the two operations you’re submitting can overlap execution. Transferring data while rendering, or doing some compute work towards the next frame while the current frame is doing rendering. And so forth.

Rendering multiple frames is overlapping/asynchronous. They use different resources (i.e. framebuffers, command buffers, etc…) and have no dependencies.

If the operations cannot overlap execution, then there is little point in submitting them on separate queues. So using multiple queues to submit commands that have to be executed sequentially just isn’t helpful.

Except they can be executed asynchronously. Only the presentation needs to be synchronized, which is no different from, say, a transfer or compute job. All asynchronous submissions will at some point have to join/be synchronized. When a compute job is finished you eventually use the data (i.e. synchronize on it). Same with transfers; this is no different.

There’s really no point in having multiple VkQueues per queue family if they are not meant for asynchronous submissions. Semaphores already succinctly describe all the inter-submission dependencies required to parallelize submission; the device/driver certainly doesn’t need submissions on separate VkQueues to determine dependencies. If that was the point of them, then they are completely redundant.

I think there is a misunderstanding of the nature of Vulkan’s queues within a queue family here.

Allow me to pose an analogy. Take CPUs: you have cores and you have threads. Threads represent the sequence of stuff to be executed. Cores represent the actual execution resources of the CPU. Threads get executed on cores, and each core can execute one thread at a time. A single core can make progress on multiple threads, but only by switching between them: executing some of one, then some of another, etc.

The pattern of rendering you want to do only makes sense if you assume that queues within a family are like CPU cores: that each one represents specific GPU execution resources, and that not using a queue means that some resources are lying fallow.

A more accurate analogy is that queue families are like CPU cores, with queues within a family being more like threads. That is, if there is one graphics queue family, then there is one piece of hardware that can do graphics stuff. No matter how many queues/threads it provides, there is still only one actual piece of hardware that executes graphics commands.

Now this analogy is not accurate, due to how superscalar and massively parallel GPUs are. That is, a GPU could theoretically execute commands from multiple command queues at the same time. Plus, different queue families share some execution resources. Compute-only queue families will almost certainly not have shader cores dedicated solely to them; they share them with the graphics queue family.

But even a GPU that can render from multiple command queues can only do so by taking away rendering resources from someone else. That is, let’s say your GPU has 8 shader cores (each of which may have multiple invocations executing in lock-step). Sending rendering data through one queue can saturate all 8 shader cores by itself. You don’t need to send data through 8 queues to achieve that saturation. Indeed, if you did, then each command would only get 1 core processing its work.

So in that case, you can do the work of one command 8 times as fast, or you can do the work of 8 separate commands. But in both cases, all available computational resources are being consumed.

What that boils down to is this. If we have two frames of equivalent complexity, and it takes the GPU 16ms to render either of them, it will take ~32ms to render both frames, regardless of how many queues you use to execute that work.

So you have a choice. If you submit these commands on the same queue, you will get the first frame’s result in 16ms, then get the second frame’s result in 16ms. If you submit them on separate queues, it may take as long as 32ms to get the first frame’s results. But even if it’s not that bad, any progress made towards the second frame will delay the first frame. And that’s bad if you need to keep a solid 60fps.

Now, are there cases where a shader core or two might go unused in a rendering command, which could possibly be “task switched” to work on something from another queue? Maybe. But GPUs do a lot of work to ensure the maximum possible utilization of processing resources, even under APIs that don’t have multiple queues. And even when a core might get freed up, it usually doesn’t stay free for very long, so once it task-switched out, it wouldn’t be available to switch back. So long as the workload is sufficiently heavy, a single queue is capable of saturating the GPU’s execution hardware.

In short, what you’re doing will not increase the GPU’s per-frame throughput. Nor the CPU’s throughput (you can achieve equivalent CPU performance by threading the building of the CBs within each frame). It does however potentially increase latency for each frame. Which is a bad thing.

Marshaling data involves setting a couple of pointers (to the marshaled data) and an atomic spinlock. It has essentially the same performance regardless of contention. By contrast, the worst-case for an OS lock is that it causes the thread trying to acquire the lock to sleep until the lock is released. Will the thread immediately wake up when the lock can be acquired? Maybe, maybe not.

So, worst-case performance will always favor lock-less marshaling. And consistency of performance is more important than maximum performance (though I argue that lock-less marshaling will likely give you that too).

They could only use “different resources” if they’re entirely different scenes. If they share buffer objects, textures, descriptors, and the like, then they’re not using different resources. Such sharing can lead to resource contention; for example, rendering often needs scratch buffer memory (for GUIs, text rendering, or whatever). Contention is particularly important if you need to stop using some resources, and create new ones (streaming content from disk). You can’t do that process until everyone using those resources has stopped using them.

Also, I’m curious as to exactly how you’re generating the data for those multiple frames. Doesn’t computing a frame’s worth of data take time? Even if all of your animation is 100% deterministic, you have to take time to do that determination, to figure out where the objects are, which are in view, etc.

And in order to generate the commands for each frame, you have to store the data for that frame’s per-object data (matrices, etc). So, do you actually store multiple frames worth of data? That seems wasteful.

GPUs, and Vulkan implementations, are not as smart as you think they are. With the exception of renderpass sub-pass structures, the queue submission command is not going to spend an inordinate amount of time reordering your commands. They are certainly not going to spend time reordering your commands with respect to commands that were submitted to that queue in a previous batch.

Renderpasses and such can do more radical rearrangement, but that is done a priori, based on the nature of the subpass. It’s not something the system has to analyze every time you submit a batch. Being able to do this stuff is why the renderpass system requires all of that up-front information, rather than being required to deduce it from a stream of random commands.

That’s not to say that command reordering doesn’t happen. But generally speaking, commands sent to a queue will be executed in the order you submit them. You need synchronization primitives to ensure that caches get invalidated and/or to ensure that pipelined command execution does not overlap when that would be a problem. But most of these cases do not involve radical alterations of submission order.

Queues within a family are for asynchronous submissions. But they’re not for asynchronous submissions where it’s really important to you that the overlapping commands execute in a specific order. And if you’re talking about frames of rendering, that is really important.

The question was simple; whether you think I should or should not queue presents on multiple frames isn’t the question. The question was: can you signal a semaphore, or use some other synchronization primitive, to ‘activate’ or ‘signal’ when a present has completed? At this point I’m going to assume your answer is ‘I don’t know’…

I think there is a misunderstanding of the nature of Vulkan’s queues within a queue family here.

Allow me to pose an analogy. Take CPUs: you have cores and you have threads. Threads represent the sequence of stuff to be executed. Cores represent the actual execution resources of the CPU. Threads get executed on cores, and each core can execute one thread at a time. A single core can make progress on multiple threads, but only by switching between them: executing some of one, then some of another, etc.

The pattern of rendering you want to do only makes sense if you assume that queues within a family are like CPU cores: that each one represents specific GPU execution resources, and that not using a queue means that some resources are lying fallow.

In SMT/hyperthreading systems a single core actually executes instructions from (potentially) two threads, allowing resources that would ‘lie fallow’ to be utilized. On average it gives around a 10%–30% performance increase (highly dependent on hardware/program), but it is certainly useful. A 10–30% performance increase is non-trivial. Given that GPUs have far more resources that could be under-utilized, the potential for ‘hyperthreading’ two or more command buffers is substantial. Whether modern cards/drivers utilize it remains to be seen, but there’s no reason we can’t do our best in the meantime. Vulkan has all the tools… well, except this one, it would seem.

A more accurate analogy is that queue families are like CPU cores, with queues within a family being more like threads. That is, if there is one graphics queue family, then there is one piece of hardware that can do graphics stuff. No matter how many queues/threads it provides, there is still only one actual piece of hardware that executes graphics commands.

Now this analogy is not accurate, due to how superscalar and massively parallel GPUs are. That is, a GPU could theoretically execute commands from multiple command queues at the same time. Plus, different queue families share some execution resources. Compute-only queue families will almost certainly not have shader cores dedicated solely to them; they share them with the graphics queue family.

But even a GPU that can render from multiple command queues can only do so by taking away rendering resources from someone else. That is, let’s say your GPU has 8 shader cores (each of which may have multiple invocations executing in lock-step). Sending rendering data through one queue can saturate all 8 shader cores by itself. You don’t need to send data through 8 queues to achieve that saturation. Indeed, if you did, then each command would only get 1 core processing its work.

So in that case, you can do the work of one command 8 times as fast, or you can do the work of 8 separate commands. But in both cases, all available computational resources are being consumed.

What that boils down to is this. If we have two frames of equivalent complexity, and it takes the GPU 16ms to render either of them, it will take ~32ms to render both frames, regardless of how many queues you use to execute that work.

No, that’s just not true. Simply because frames have similar complexity does not mean resources can’t be shared between them. It depends on the workload, the hardware, and the drivers.

So you have a choice. If you submit these commands on the same queue, you will get the first frame’s result in 16ms, then get the second frame’s result in 16ms. If you submit them on separate queues, it may take as long as 32ms to get the first frame’s results. But even if it’s not that bad, any progress made towards the second frame will delay the first frame. And that’s bad if you need to keep a solid 60fps.

Again, not true. It all depends on the scheduler, and I highly doubt any scheduler would delay the first frame to execute the second. If the driver/hardware can’t share any resources, then they will be serialized, and the worst-case scenario is that you gained nothing by submitting them asynchronously (but lost nothing either). But if the driver/hardware can, then it’s quite possible that you achieve greater hardware utilization and a small but significant performance boost.

Now, are there cases where a shader core or two might go unused in a rendering command, which could possibly be “task switched” to work on something from another queue? Maybe. But GPUs do a lot of work to ensure the maximum possible utilization of processing resources, even under APIs that don’t have multiple queues. And even when a core might get freed up, it usually doesn’t stay free for very long, so once it task-switched out, it wouldn’t be available to switch back. So long as the workload is sufficiently heavy, a single queue is capable of saturating the GPU’s execution hardware.

Completely speculative as to the nature of the workload. Some workloads are compute-heavy, others tax the memory subsystem, others the ROPs, or the texture units, or… There’s a reason why AMD suggests mixing compute with graphics, and that’s because most workloads do not saturate the hardware. If half a frame is rendering to a light buffer (i.e. it uses very little compute and is mostly waiting on memory) and half the frame is rendering with a complex shader, it could certainly be the case that the second half of the first frame executes simultaneously with the first half of the second. This would increase throughput and decrease latency.

In short, what you’re doing will not increase the GPU’s per-frame throughput. Nor the CPU’s throughput (you can achieve equivalent CPU performance by threading the building of the CBs within each frame). It does however potentially increase latency for each frame. Which is a bad thing.

Completely untrue. Worst case you’ve gained nothing (and there isn’t a driver out there that would intentionally increase frame latency just because you submitted asynchronously), best case you’ve gained performance.

Marshaling data involves setting a couple of pointers (to the marshaled data) and an atomic spinlock. It has essentially the same performance regardless of contention. By contrast, the worst-case for an OS lock is that it causes the thread trying to acquire the lock to sleep until the lock is released. Will the thread immediately wake up when the lock can be acquired? Maybe, maybe not.

So, worst-case performance will always favor lock-less marshaling. And consistency of performance is more important than maximum performance (though I argue that lock-less marshaling will likely give you that too).

I’ve written lock-less data structures… this is not the place to go into the details; suffice it to say light-weight mutexes can spin-loop as well (i.e. there’s no need to yield the thread). The real performance issue with a low-contention resource is cache coherency (i.e. writing as little to it as possible), not whether it’s a lock, lock-free, or wait-free synchronization. On top of that, this is all tangential to the question.

They could only use “different resources” if they’re entirely different scenes. If they share buffer objects, textures, descriptors, and the like, then they’re not using different resources. Such sharing can lead to resource contention; for example, rendering often needs scratch buffer memory (for GUIs, text rendering, or whatever). Contention is particularly important if you need to stop using some resources, and create new ones (streaming content from disk). You can’t do that process until everyone using those resources has stopped using them.

Also, I’m curious as to exactly how you’re generating the data for those multiple frames. Doesn’t computing a frame’s worth of data take time? Even if all of your animation is 100% deterministic, you have to take time to do that determination, to figure out where the objects are, which are in view, etc.

And in order to generate the commands for each frame, you have to store the data for that frame’s per-object data (matrices, etc). So, do you actually store multiple frames worth of data? That seems wasteful.

Do I really need to submit a 100-page proof on the pros and cons of my approach simply to get an answer? A few kBs of data on a modern system is trivial. Reducing synchronization points, good algorithms, reducing memory allocations/frees, and accessing memory in predictable, cache-friendly patterns are what dictate performance. Duplicating a few data structures is trivial if it allows me to process frames asynchronously.

But that’s beside the point. If you want to run the simulation asynchronously as well, you’re going to need to copy the relevant data to render off it. The copy has to happen somewhere; otherwise you’re forced to serialize it all. If it’s trivial enough to serialize it all (rendering, simulation, input processing), then none of this matters, the performance wasn’t needed either way, just do whatever’s fun/convenient. If the performance matters, then you’d better multi-thread it.

Streaming content off disk does not force serialization. You simply keep the old data around until the last frame that requires it has been rendered (i.e. the fence of the submit is signaled), and then you’re free to destroy it… there’s nothing funny, weird, or difficult about it; I don’t understand the issue. The vast majority of shared resources are read-only, which can be trivially shared without issue.
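A sketch of that fence-gated cleanup (illustrative names; it assumes entries are retired in the order their fences will signal, e.g. FIFO submits on one queue):

```cpp
#include <vulkan/vulkan.h>
#include <deque>

// A retired resource paired with the fence of the last submit that read it.
struct Retired {
    VkBuffer       buffer;
    VkDeviceMemory memory;
    VkFence        lastUse;
};

std::deque<Retired> graveyard;  // entries pushed in submission order

// Called once per frame: destroy anything whose last-use fence has signaled.
void collectGarbage(VkDevice device) {
    while (!graveyard.empty() &&
           vkGetFenceStatus(device, graveyard.front().lastUse) == VK_SUCCESS) {
        Retired& r = graveyard.front();
        vkDestroyBuffer(device, r.buffer, nullptr);
        vkFreeMemory(device, r.memory, nullptr);
        graveyard.pop_front();
    }
}
```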

GPUs, and Vulkan implementations, are not as smart as you think they are. With the exception of renderpass sub-pass structures, the queue submission command is not going to spend an inordinate amount of time reordering your commands. They are certainly not going to spend time reordering your commands with respect to commands that were submitted to that queue in a previous batch.

Whether they do and whether they can are two different things. AMD and NVidia do, to an extent; the rest I am unsure of.

Renderpasses and such can do more radical rearrangement, but that is done a priori, based on the nature of the subpass. It’s not something the system has to analyze every time you submit a batch. Being able to do this stuff is why the renderpass system requires all of that up-front information, rather than being required to deduce it from a stream of random commands.

That’s not to say that command reordering doesn’t happen. But generally speaking, commands sent to a queue will be executed in the order you submit them. You need synchronization primitives to ensure that caches get invalidated and/or to ensure that pipelined command execution does not overlap when that would be a problem. But most of these cases do not involve radical alterations of submission order.

You don’t need ‘radical alterations of submission order’ to extract parallelism from multiple submissions made to the same queue. A simple walk over the queue will suffice.

Queues within a family are for asynchronous submissions. But they’re not for asynchronous submissions where it’s really important to you that the overlapping commands execute in a specific order. And if you’re talking about frames of rendering, that is really important.

“Queues within a family are for asynchronous submissions. But they’re not for asynchronous submissions”

Honestly… the mental gymnastics to avoid simply saying ‘I don’t know’… why??!

You know there are these things called semaphores? They ensure that multiple submissions that overlap execute in the correct order. If you weren’t supposed to use asynchronous submissions, they wouldn’t need semaphores. The ‘importance’ of a submission really has nothing to do with this at all, because for correct execution all submissions are ‘important’.

I do know the answer. I had assumed that “even if you could signal a semaphore on vkQueuePresentKHR,” was good enough, since the answer is pretty simple. But I’ll answer the question more directly.

You cannot signal when a present has completed. Nor is there any way to make subsequent batches wait on a specific present operation. vkQueuePresentKHR doesn’t take a semaphore to signal (unlike most queue operations), nor does “presentation” count as a pipeline stage, so you can’t even wait on it via a barrier or event or something. And there is no extension that gives vkQueuePresentKHR the ability to signal a semaphore.

Vulkan instead provides a way to signal the completion of the acquire of an image. By implication, if the image has completed acquisition, it has also completed presentation.
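For reference, that acquire-side signal looks like this (a minimal sketch; variable names are illustrative):

```cpp
#include <vulkan/vulkan.h>

// The semaphore (a fence may be passed as well) is signaled once the
// presentation engine is done with the image and it is safe to render to
// again. Note that you cannot ask for a *specific* image.
uint32_t acquireFrame(VkDevice device, VkSwapchainKHR swapchain,
                      VkSemaphore imageAvailable) {
    uint32_t imageIndex = 0;
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          imageAvailable, VK_NULL_HANDLE, &imageIndex);
    return imageIndex;  // the implementation chooses which image you get
}
```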

But of course, that is probably not particularly useful for your case. You are trying to make two present operations execute sequentially despite being submitted on multiple queues. Which means you need to know when a specific present has completed. Since vkAcquireNextImageKHR doesn’t allow you to ask for a specific image, you cannot use it as a way to test for the completion of a specific present.

Lastly, there is a practical consideration to be made with regard to semaphores that, even if Vulkan allowed present to signal a semaphore, makes them less than ideal for your desired use case.

Consider threads 1 and 2. Thread 1 generates and submits data for frame 1, and thread 2 generates and submits data for frame 2. Now, let’s say we have a hypothetical extension to vkQueuePresent2KHR that takes semaphores to signal when the present operation is complete. So thread 1 will present its data and use a semaphore to signal thread 2. And when thread 2 calls its vkQueuePresent2KHR, it will wait on the semaphore that thread 1 used.

However, there is a problem. In Vulkan, you cannot submit any semaphore wait operation until a batch of work that signals that semaphore has been submitted to a queue. That is, thread 2 cannot call its vkQueuePresent2KHR that waits on the semaphore until thread 1 has called its vkQueuePresent2KHR that will signal the semaphore. Even though they happen on separate threads, if you submit a batch that waits on a semaphore, the spec requires that the semaphore either already be signaled or have a signal operation pending execution.

And a semaphore signal operation is not “pending execution” until it has actually been submitted to a queue.

So there was always going to have to be some CPU cross-talk between threads, one way or the other. Given that this was always going to be the case, just present all of your images from a thread/task dedicated to that service. If you were going to have cross-talk anyway, then have cross-talk.

Indeed, this limitation is another reason why I say that “Queues within a family are for asynchronous submissions. But they’re not for asynchronous submissions where it’s really important to you that the overlapping commands execute in a specific order.” Because you literally cannot do that, since the batch that waits on a semaphore has to be issued after the batch that signals the semaphore. If you’re submitting these batches from different threads, that requires synchronization between those threads; thread 2 cannot submit its waiting batch until thread 1 has submitted its signaling batch.

What I’m saying is that if you want to do some transfer work and some rendering work, you don’t submit rendering work and transfer work at the same time (from different threads) if the rendering work needs to wait on the transfer work.

That is, if a frame of rendering needs some transfer work done before it can begin, then you submit that transfer work the frame before you’re going to submit the work that uses it. You have to do it that way, since Vulkan won’t let you submit a batch that waits on a semaphore that has not been submitted for signaling. You still need a semaphore, but the semaphore gets waited on next frame, not this frame.
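A sketch of that frame-ahead ordering using plain vkQueueSubmit batches (function and variable names are illustrative): the batch signaling transferDone is submitted during frame N, so by the time the frame N+1 batch waits on it, the signal operation is already pending execution, which is exactly what the spec demands.

```cpp
#include <vulkan/vulkan.h>

// Frame N: submit the transfer batch that signals 'transferDone'.
void submitTransferForNextFrame(VkQueue queue, VkCommandBuffer transferCmd,
                                VkSemaphore transferDone) {
    VkSubmitInfo transfer{};
    transfer.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    transfer.commandBufferCount   = 1;
    transfer.pCommandBuffers      = &transferCmd;
    transfer.signalSemaphoreCount = 1;
    transfer.pSignalSemaphores    = &transferDone;
    vkQueueSubmit(queue, 1, &transfer, VK_NULL_HANDLE);
}

// Frame N+1: the wait is now legal, because the signal was submitted last frame.
void submitGraphicsUsingTransfer(VkQueue queue, VkCommandBuffer drawCmd,
                                 VkSemaphore transferDone) {
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT;
    VkSubmitInfo draw{};
    draw.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    draw.waitSemaphoreCount = 1;
    draw.pWaitSemaphores    = &transferDone;
    draw.pWaitDstStageMask  = &waitStage;  // stage that must wait for the transfer
    draw.commandBufferCount = 1;
    draw.pCommandBuffers    = &drawCmd;
    vkQueueSubmit(queue, 1, &draw, VK_NULL_HANDLE);
}
```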

How would the scheduler know which batch is the first frame vs. the second? After all, you aren’t submitting them to the same queue (so there’s no implicit order), and you aren’t necessarily submitting them in any particular order (as that would involve CPU synchronization). There is no semaphore between the two submits (since that would require a CPU sync, as detailed above).

So how does the scheduler know which is the “first frame”? You can’t even use queue priorities to handle this, because priorities are defined at device creation time and cannot be changed at runtime. Whereas your batches become more important as time goes on: the first is more important than the second, but once the first is done, the second is more important than the third, which is more important than the fourth, etc. It’s not a queue property; it’s a property of what has been submitted before.
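For reference, priorities are supplied exactly once, through VkDeviceQueueCreateInfo at device-creation time (a sketch; the family index is assumed to have been queried earlier):

```cpp
#include <vulkan/vulkan.h>

// Two queues from one family with fixed relative priorities. These values
// are baked in at vkCreateDevice time and can never be changed afterwards.
VkDeviceQueueCreateInfo makeQueueInfo(uint32_t graphicsFamily) {
    static const float priorities[2] = { 1.0f, 0.5f };
    VkDeviceQueueCreateInfo info{};
    info.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    info.queueFamilyIndex = graphicsFamily;  // assumed queried earlier
    info.queueCount       = 2;
    info.pQueuePriorities = priorities;
    return info;  // goes into VkDeviceCreateInfo::pQueueCreateInfos
}
```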

And even the semaphores on the vkQueuePresentKHR calls don’t help. Why? Because those calls are separate batches from the commands that generate them. vkQueueSubmit cannot present images, so whatever work vkQueueSubmit does has to be independent of anything subsequent commands will create. That is, it’s highly unlikely that a command can retroactively change the priority of a batch’s execution.

As such, which batch is more important is dependent on information that simply is not available to the scheduler. So I have a hard time believing that the scheduler will be able to make accurate decisions based on information it doesn’t have.

I am a very defensive programmer when it comes to performance, likely due to long experience with OpenGL. I use Vulkan precisely because I don’t trust the performance of code outside of my control. Vulkan is attractive to me because it allows me to be the scheduler; I am not subject to its whims. It’s not that I assume that such code is a priori malicious, but I certainly don’t assume it is beneficial either. Because if I trust the scheduler, and I’m wrong, the result can easily be worse than not trusting the scheduler at all. If I don’t trust the scheduler, I may lose possible performance, but I have complete control over how low that minimum gets.

Coupled with the fact that the scheduler just doesn’t have the information needed to actually know how to handle this, I would never wager my application’s performance on this kind of thing. Vulkan is supposed to be a low-level API; that means its performance is primarily due to things you control, not due to what you may not control. If you are not absolutely certain that the driver/GPU will make sure that the “second frame” will in no way slow down the execution of the “first frame”, and the “third frame” will in no way slow down the execution of the “second frame”, then you run too much of a risk of contention and latency.

There’s a big difference between adding async compute tasks alongside a graphics operation, and doing multiple graphics operations at the same time. Compute tasks take up a small, specific set of resources: compute cores and memory resources. Compute tasks tend to be short, and compute tasks where the workgroup size is <= the wavefront size can be executed in shader-core-sized chunks. This makes it comparatively easy to schedule them alongside graphics work.

If a shader core gets freed up early (let’s say that the last vertices you process use only 7 VS cores and you have 8 dedicated to them), you can quickly slot in a work group from some compute operation. And even if the scheduler gets it wrong and ends up delaying the graphics process (let’s say that the next batch of vertices comes in before the compute operation has finished with its core, thus leaving only 7 cores for VS activity), because compute pipelines are short, the delay to the graphics process will be minimal (the compute operation will not hog that core for very long).

Graphics tasks cannot so easily be broken up and scheduled like that. Since graphics tasks are pipelined, they’re ferrying data from stage to stage. These stages are usually not meant to pause and be resumed later; they’re meant to move smoothly from location to location. So unless the system writes a series of partially-transformed vertices to some memory when it needs to pause the pipeline’s execution, you can’t really stop and start graphics tasks efficiently. They don’t just swipe an unused shader core or two.

Overall, it’s a lot harder for the scheduler to find a way to make multiple graphics tasks cohabitate without using up each others’ resources than it is for a single graphics task and some simple compute operations.

vkQueueSubmit is already a heavyweight call. I don’t want it wasting precious CPU time “walking over the queue” to figure out how to reorder operations based on some complex semaphore gymnastics. I want it to take the compiled data in those batches, shove them down the GPU’s throat, and move on. Vulkan is a low-level API; vkQueueSubmit should do what it is told to do, nothing more. If I want some significant reordering to happen, I can do it myself.

Let me lay this out directly. You have two frames, 1 and 2. What you want is for frame 1 to execute, then for frame 2 to execute. But you also want any small stalls, holes, etc that happen during the execution of frame 1 to be filled in where possible by data from frame 2. But this should not happen in any way such that frame 2 will delay the execution of frame 1.

But to me, submitting two batches on two different queues is supposed to mean that the driver/GPU should split the available resources between both batches more or less evenly. That’s what “asynchronous execution” means: to execute the two things simultaneously as much as possible. Since both queues have equal priority, tasks submitted to both queues should be given essentially equal resources.

Why do you believe that asynchronously executing two batches does not mean to asynchronously execute the two batches? That sounds very much like OpenGL users who misused buffer object hints so often that many drivers just stopped looking at them altogether. You’re telling the system that you want to execute some things in parallel, but you specifically expect the system to execute them in series, but with one of the tasks being able to get a bit of work done here and there if there’s room for it.

We should not encourage drivers to do this kind of rearrangement. Let Vulkan be a simple wrapper over the GPU. If you want series execution, even if it’s mostly series execution, you submit in series.

You analogized queues within families to hyperthreading. And you’re right; HT can give some overall performance improvement. But it does impact latency. HT doesn’t recognize that either thread is more important than the other. So while the performance of the application may be faster overall, the latency has gotten worse. You might go from 16ms+16ms in the single-threaded case to a total 25ms in the asynchronous execution case, but there’s no guarantee you’ll get that first frame’s data in 16ms.

And guarantees are kind of important in real-time applications.

“Queues within a family are for asynchronous submissions. But they’re not for asynchronous submissions”

Honestly… the mental gymnastics to avoid simply saying ‘I don’t know’… why??!

I didn’t stop my sentence there, so I have no idea why you are surprised that half of a complete thought might not make sense.

However, if you would like a better statement, “queues within a family are for asynchronous execution, not for asynchronous submission”. They’re for when you want two processes to execute concurrently, not just because you don’t feel like marshaling some data across the CPU.

Fair enough. Sounds like a huge oversight to me, but it is what the spec states. Why not start with that next time instead of 18 pages of philosophy…

What I’m saying is that if you want to do some transfer work and some rendering work, you don’t submit rendering work and transfer work at the same time (from different threads) if the rendering work needs to wait on the transfer work.

That is, if a frame of rendering needs some transfer work done before it can begin, then you submit that transfer work the frame before you’re going to submit the work that uses it. You have to do it that way, since Vulkan won’t let you submit a batch that waits on a semaphore that has not been submitted for signaling. You still need a semaphore, but the semaphore gets waited on next frame, not this frame.

Yeah, I get that. It seems counterintuitive that a multi-threaded synchronization primitive is hampered in such a way as to be rather useless as a synchronization primitive. But if it is what the spec states, then it is what we do, regardless of how stupid it may be…

I am a very defensive programmer when it comes to performance, likely due to long experience with OpenGL. I use Vulkan precisely because I don’t trust the performance of code outside of my control. Vulkan is attractive to me because it allows me to be the scheduler; I am not subject to its whims. It’s not that I assume that such code is a priori malicious, but I certainly don’t assume it is beneficial either. Because if I trust the scheduler, and I’m wrong, the result can easily be worse than not trusting the scheduler at all. If I don’t trust the scheduler, I may lose possible performance, but I have complete control over how low that minimum gets.

This is untrue. There is still a scheduler. If you don’t trust the driver, Vulkan alone can’t save you. OpenGL might be worse, but this whole notion of ‘close to the metal’ / ‘low-level control’ is just marketing BS. Vulkan gives you MORE control, but you’re still nowhere near what actually goes on under the hood.

Coupled with the fact that the scheduler just doesn’t have the information needed to actually know how to handle this, I would never wager my application’s performance on this kind of thing. Vulkan is supposed to be a low-level API; that means its performance is primarily due to things you control, not due to what you may not control. If you are not absolutely certain that the driver/GPU will make sure that the “second frame” will in no way slow down the execution of the “first frame”, and the “third frame” will in no way slow down the execution of the “second frame”, then you run too much of a risk of contention and latency.

Unless you have examples of this actually happening, I find this VERY difficult to believe. Both NVidia and AMD have been pushing async rendering heavily. I highly doubt that they would mess things up that bad.

There’s a big difference between adding async compute tasks alongside a graphics operation

Overall, it’s a lot harder for the scheduler to find a way to make multiple graphics tasks cohabitate without using up each others’ resources than it is for a single graphics task and some simple compute operations.

Low-hanging fruit is often picked first; it would make perfect sense that they would use the easier methods before working on harder ones. But hard != impossible, or even improbable. And you’re grossly misrepresenting what goes on and how it works.

vkQueueSubmit is already a heavyweight call. I don’t want it wasting precious CPU time “walking over the queue” to figure out how to reorder operations based on some complex semaphore gymnastics. I want it to take the compiled data in those batches, shove them down the GPU’s throat, and move on. Vulkan is a low-level API; vkQueueSubmit should do what it is told to do, nothing more. If I want some significant reordering to happen, I can do it myself.

This is wrong on every level. First, it’s not complex; it’s a simple linear procedure, no ‘complex semaphore gymnastics’ involved. Second, they already do it. Third, it isn’t a ‘low-level API’; you’re not controlling the hardware directly, it’s simply a more explicit API (i.e. the driver does less guessing). There is still a ton that happens between the commands you submit and the actual hardware executing things. If you’re that paranoid that drivers will mess things up, then you need to be writing your own drivers, because that’s the only way you can get the guarantees you desire. ‘Defensive programming’ has bought you nothing.

Let me lay this out directly. You have two frames, 1 and 2. What you want is for frame 1 to execute, then for frame 2 to execute. But you also want any small stalls, holes, etc that happen during the execution of frame 1 to be filled in where possible by data from frame 2. But this should not happen in any way such that frame 2 will delay the execution of frame 1.

But to me, submitting two batches on two different queues is supposed to mean that the driver/GPU should split the available resources between both batches more or less evenly. That’s what “asynchronous execution” means: to execute the two things simultaneously as much as possible. Since both queues have equal priority, tasks submitted to both queues should be given essentially equal resources.

Doesn’t matter what you ‘believe’ or ‘want’; it’s a simple mathematical algorithm. Semaphores define a directed acyclic dependency graph. There’s no guessing needed; it’s precisely described, EXCEPT for the lack of signal semaphores on present. IF we could signal a semaphore off present, the driver would have all the information to render as fast as possible, with as low latency as possible, no guessing needed.

Why do you believe that asynchronously executing two batches does not mean to asynchronously execute the two batches? That sounds very much like OpenGL users who misused buffer object hints so often that many drivers just stopped looking at them altogether. You’re telling the system that you want to execute some things in parallel, but you specifically expect the system to execute them in series, but with one of the tasks being able to get a bit of work done here and there if there’s room for it.

We should not encourage drivers to do this kind of rearrangement. Let Vulkan be a simple wrapper over the GPU. If you want series execution, even if it’s mostly series execution, you submit in series.

If that was what they wanted, there’s no point in representing rendering using a DAG. It’s entirely superfluous, and we could remove semaphores entirely (along with queues). It looks like that’s what they had in mind, but along the way they got lost in the details and lost sight of the bigger picture.

You analogized queues within families to hyperthreading. And you’re right; HT can give some overall performance improvement. But it does impact latency. HT doesn’t recognize that either thread is more important than the other. So while the performance of the application may be faster overall, the latency has gotten worse. You might go from 16ms+16ms in the single-threaded case to a total 25ms in the asynchronous execution case, but there’s no guarantee you’ll get that first frame’s data in 16ms.

And guarantees are kind of important in real-time applications.

You have no guarantees in Vulkan either. Command buffers within the same queue can, and will be, interleaved during execution. If you can’t trust the driver to make decent decisions, then Vulkan isn’t for you, because you can’t get the level of ‘guarantees’ you are claiming from it.

I didn’t stop my sentence there, so I have no idea why you are surprised that half of a complete thought might not make sense.

However, if you would like a better statement, “queues within a family are for asynchronous execution, not for asynchronous submission”. They’re for when you want two processes to execute concurrently, not just because you don’t feel like marshaling some data across the CPU.

I don’t agree with you on how it ‘should’ work; if what you are describing was how Vulkan was supposed to work, you could remove about a third of the spec without loss of functionality. Somewhere along the line they lost their way… but that’s not important; if the spec says it, then so be it.
