Low latency

Hello,

Use case: I would like to perform headless offscreen rendering, where the path camera MVP --> rendering --> readback is optimized for the lowest latency.

I also need to upload lots of textures to the GPU, but this can happen asynchronously.

What would be a recommendation in terms of memory types, queues, synchronization etc.?

Target platform is a desktop with, e.g., a GTX 1080.

Stupid Q: What’s the point of low latency, if it is offscreen?

It is then streamed over the network, and eventually displayed to the user. Like Google Maps, but different 🙂
You might say that network latency dominates the latency which may arise from suboptimal Vulkan practices, but still, if only for “pedagogical” reasons, I would like to understand this.

Ah, so it is practically not offscreen in the end and needs real-time user input.

You might say that network latency dominates the latency

I wouldn’t dare. I have < 10 ms to some sites on the internet, while 60 Hz is ~17 ms. So rendering and internet latency can be quite comparable nowadays. In fact, if all of this work is to be done on the server for multiple users, I would even be afraid that the rendering HW shared among several users may become the bottleneck.

What would be a recommendation in terms of memory types, queues, synchronization etc.?

Device-local memory (if applicable).
AMD has this small amount of weird device-local host-coherent memory. Probably worth investigating for some uses.
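
A minimal sketch of how one might look for such a memory type, scanning `VkPhysicalDeviceMemoryProperties` (the helper name is mine, and the caller has to handle the case where no such type exists, e.g. on NVIDIA):

```c
#include <vulkan/vulkan.h>

/* Hypothetical helper: find a memory type that is both device-local and
 * host-visible/coherent (the small heap AMD exposes). Returns UINT32_MAX
 * if the device has no memory type with the wanted properties. */
uint32_t find_memory_type(VkPhysicalDevice gpu, uint32_t allowed_type_bits,
                          VkMemoryPropertyFlags wanted)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(gpu, &props);
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        if ((allowed_type_bits & (1u << i)) &&
            (props.memoryTypes[i].propertyFlags & wanted) == wanted)
            return i;
    }
    return UINT32_MAX;
}

/* usage, with requirements from vkGetBufferMemoryRequirements:
 *   uint32_t idx = find_memory_type(gpu, reqs.memoryTypeBits,
 *       VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
 *       VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
 *       VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
 */
```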

Rendering = the GRAPHICS queue. It may be beneficial to use a separate COMPUTE queue if there is some clearly defined compute subproblem. Separate TRANSFER queues should be beneficial for the appropriate operations.
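
For the dedicated TRANSFER queue, a sketch of the usual family-selection loop (assuming that a family with only the transfer bit set maps to the copy engines on discrete GPUs):

```c
#include <stdlib.h>
#include <vulkan/vulkan.h>

/* Find a queue family that supports transfer but neither graphics nor
 * compute; returns UINT32_MAX if no dedicated transfer family exists. */
uint32_t find_transfer_only_family(VkPhysicalDevice gpu)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, NULL);
    VkQueueFamilyProperties *fams =
        malloc(count * sizeof(VkQueueFamilyProperties));
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, fams);

    uint32_t family = UINT32_MAX;
    for (uint32_t i = 0; i < count; ++i) {
        VkQueueFlags f = fams[i].queueFlags;
        if ((f & VK_QUEUE_TRANSFER_BIT) &&
            !(f & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT))) {
            family = i;
            break;
        }
    }
    free(fams);
    return family;
}
```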

Well, you always have to synchronize. The trick is to keep the GPU (and, well, even the CPU) always fed, i.e. to have alternative work ready while waiting on some Semaphore or Fence.

I’ll try to explain a bit more how I (barely) understand this should work – it would be great if you could provide some more input:

  1. I’ve heard there are those copy engines for async data transfer to/from the GPU. I did not understand whether everything goes through them or not. Anyway, I thought it would be best to upload textures (for the map data) asynchronously via a dedicated transfer queue.

  2. I would upload the camera MVP via a staging buffer to a device-local buffer using the gfx queue.
    They say push constants are faster than uniforms, so maybe I should use them for just the MVP?

  3. There is no swapchain, so I would create a device-local image for the framebuffer color attachment, then download its contents via host-visible non-coherent memory.

  4. Without the swapchain, there is no need to acquire images from it. However, I somehow need to wait until the framebuffer readback is complete before overwriting it. Not sure what else? (I’ve tried to sketch one such frame as code below.)
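
Here is roughly how I imagine one single-buffered frame. All handles are created elsewhere, `cmd` is assumed to be re-recordable, and the render pass is assumed to leave `color_image` in TRANSFER_SRC_OPTIMAL – please correct me if this is wrong:

```c
#include <vulkan/vulkan.h>

/* One single-buffered frame: wait for the previous readback, record the
 * frame with the freshest MVP, submit, and let the fence mark completion. */
void render_one_frame(VkDevice device, VkQueue gfx_queue,
                      VkCommandBuffer cmd, VkPipelineLayout pipeline_layout,
                      VkImage color_image, VkBuffer readback_buffer,
                      VkFence readback_fence, const float mvp[16],
                      uint32_t width, uint32_t height)
{
    /* Step 4: wait until the previous frame's readback has landed. */
    vkWaitForFences(device, 1, &readback_fence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &readback_fence);

    VkCommandBufferBeginInfo begin =
        { VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
    vkBeginCommandBuffer(cmd, &begin);

    /* Step 2: the freshest camera MVP via push constants. */
    vkCmdPushConstants(cmd, pipeline_layout, VK_SHADER_STAGE_VERTEX_BIT,
                       0, 16 * sizeof(float), mvp);

    /* ... begin render pass, draw the scene, end render pass ... */

    /* Step 3: copy the color attachment into the host-visible buffer. */
    VkBufferImageCopy region = {0};
    region.imageSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
    region.imageSubresource.layerCount = 1;
    region.imageExtent = (VkExtent3D){ width, height, 1 };
    vkCmdCopyImageToBuffer(cmd, color_image,
                           VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                           readback_buffer, 1, &region);
    vkEndCommandBuffer(cmd);

    /* The fence signals once rendering AND the readback copy are done. */
    VkSubmitInfo submit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    submit.commandBufferCount = 1;
    submit.pCommandBuffers = &cmd;
    vkQueueSubmit(gfx_queue, 1, &submit, readback_fence);
}
```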

Thank you very much

Sorry, for 2. it’s either push constants or vkCmdUpdateBuffer.

“somehow”? Vulkan really only has one way for the CPU to wait for something to finish. Well, yeah, you can WaitIdle on the queue or the device directly, but those are obviously the wrong answers. Which leaves the only other tool Vulkan has: fences.
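
In code, both flavours of that (a sketch):

```c
#include <stdbool.h>
#include <vulkan/vulkan.h>

/* Block until the GPU signals the fence (e.g. the readback has finished). */
void wait_for_readback(VkDevice device, VkFence fence)
{
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &fence); /* re-arm it for the next submit */
}

/* Non-blocking alternative: poll the fence and keep the CPU fed with
 * other work while the GPU is not done yet. */
bool readback_done(VkDevice device, VkFence fence)
{
    return vkGetFenceStatus(device, fence) == VK_SUCCESS;
}
```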

Maybe a semaphore? Anyway, I’m trying to understand “the big picture” here, however ignorant that may sound…

(Maybe Events? But those seem designed for the opposite direction of dependency. And they don’t have a host-side wait, only polling.)

Anyway, Semaphores are only for inter-queue synchronization; they won’t work for readback (GPU -> CPU). Yeah, you need Fences.

ad 1) good plan.
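
A sketch of what that could look like: the transfer queue signals a semaphore that the graphics submit waits on. The handle names are mine, and with VK_SHARING_MODE_EXCLUSIVE resources a queue-family ownership transfer barrier would also be required (omitted here for brevity):

```c
/* Submit the recorded texture upload (vkCmdCopyBufferToImage etc.) on the
 * dedicated transfer queue; signal a semaphore when it completes. */
VkSubmitInfo upload = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
upload.commandBufferCount   = 1;
upload.pCommandBuffers      = &transfer_cmd;
upload.signalSemaphoreCount = 1;
upload.pSignalSemaphores    = &upload_done;
vkQueueSubmit(transfer_queue, 1, &upload, VK_NULL_HANDLE);

/* The graphics submit waits on that semaphore, but only at the fragment
 * stage, so vertex work can start before the texture has arrived. */
VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
VkSubmitInfo draw = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
draw.waitSemaphoreCount = 1;
draw.pWaitSemaphores    = &upload_done;
draw.pWaitDstStageMask  = &wait_stage;
draw.commandBufferCount = 1;
draw.pCommandBuffers    = &gfx_cmd;
vkQueueSubmit(gfx_queue, 1, &draw, frame_fence);
```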

ad 2) obvious choices.
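
For completeness, the two options side by side (a sketch; `mvp` is a 4×4 float matrix, the other handles are assumed to exist):

```c
/* Option A: push constants -- no buffer memory to manage at all. */
vkCmdPushConstants(cmd, pipeline_layout, VK_SHADER_STAGE_VERTEX_BIT,
                   0, 16 * sizeof(float), mvp);

/* Option B: vkCmdUpdateBuffer -- inline update of a device-local UBO.
 * Limited to 65536 bytes and must be recorded outside a render pass. */
vkCmdUpdateBuffer(cmd, mvp_ubo, 0, 16 * sizeof(float), mvp);
```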

ad 3) that sounds reasonable
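
One detail worth spelling out for the non-coherent route: after the fence signals, the mapped range has to be invalidated before the CPU reads it. A sketch, assuming the readback buffer's memory was mapped once at startup into `mapped_ptr`:

```c
#include <string.h>

/* Make the GPU's writes visible to the host before reading the pixels;
 * required because the memory type is not host-coherent. */
VkMappedMemoryRange range = { VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE };
range.memory = readback_memory;
range.offset = 0;
range.size   = VK_WHOLE_SIZE;
vkInvalidateMappedMemoryRanges(device, 1, &range);

memcpy(dst_pixels, mapped_ptr, readback_size); /* now safe to read */
```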

ad 4) you will basically be creating a swap chain of sorts. You do not have to use the same image: while doing the readback, you can already render the next frame to another image.
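
A sketch of such a two-slot “swap chain of sorts”; `record_frame`, `consume_pixels`, and `latest_mvp` are hypothetical helpers, and each slot owns its own image, readback buffer, command buffer, and fence:

```c
enum { SLOTS = 2 };

/* Frame N renders into slot N % 2 while slot (N + 1) % 2 is still being
 * read back. The loop runs forever here; a real server would have an
 * exit condition. */
for (uint64_t frame = 0; ; ++frame) {
    uint32_t slot = frame % SLOTS;

    /* Wait only for THIS slot's previous use, not for the whole GPU. */
    vkWaitForFences(device, 1, &fence[slot], VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &fence[slot]);

    consume_pixels(mapped[slot]); /* stream out this slot's last frame */
    record_frame(cmd[slot], image[slot], readback_buf[slot], latest_mvp());

    VkSubmitInfo submit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    submit.commandBufferCount = 1;
    submit.pCommandBuffers    = &cmd[slot];
    vkQueueSubmit(gfx_queue, 1, &submit, fence[slot]);
}
```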

Ok, thanks, I guess that sums it up pretty well.

Regarding the “swap chain of sorts”: if I understand it correctly, double buffering would not improve the latency of a single frame here, only throughput, right?

If we are talking about one pipe and the same workload, then latency and bandwidth are really the same thing, i.e. the faster the result is computed, the faster you get the result.

For the single-buffered approach the worst case would be when the input is received just after the rendering has already started, so the latency would be:
rendering of the previous frame + readback (waiting for the framebuffer to become available) + rendering of this frame + readback

For the double-buffered approach the worst case would be:
rendering of the previous frame + rendering of this frame + readback

There’s a chance that the latency would be hidden in the single-buffered case (i.e. you don’t need the framebuffer ready until the writing phase), but the double-buffered approach is more predictable about it.
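
To put some made-up numbers on it: if rendering takes 8 ms and readback 4 ms, the single-buffered worst case above is 8 + 4 + 8 + 4 = 24 ms from input to pixels, while the double-buffered worst case is 8 + 8 + 4 = 20 ms, because the previous frame’s readback overlaps this frame’s rendering.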