Transient SSBO Question for Full Screen Resolve Pass

Hi everyone,

I’m benchmarking different order-independent transparency (OIT) shaders and came across an interesting question.

  1. OIT does both load and store ops to a fragment buffer (“a-buffer”) to collect fragments

  2. A final resolve pass uses a fragment shader that applies the sorted transparent fragments in order (back to front) and writes to the framebuffer
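For anyone unfamiliar with the resolve step, here is a minimal CPU-side sketch of what step 2 does for a single pixel. The `Fragment` layout, the "larger depth is farther" convention, and the use of straight (non-premultiplied) alpha are assumptions for illustration; the real pass would be a fragment shader reading from the a-buffer.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical fragment record; the actual a-buffer layout is not specified here.
struct Fragment {
    float depth;                // assumption: larger value = farther from the camera
    std::array<float, 4> rgba;  // assumption: straight (non-premultiplied) alpha
};

// Sort far-to-near, then blend each fragment over the running color.
// If the list is already kept sorted during collection, the sort is redundant.
std::array<float, 3> resolvePixel(std::vector<Fragment> frags,
                                  std::array<float, 3> background) {
    std::sort(frags.begin(), frags.end(),
              [](const Fragment& a, const Fragment& b) { return a.depth > b.depth; });

    std::array<float, 3> color = background;
    for (const Fragment& f : frags) {
        const float alpha = f.rgba[3];
        for (std::size_t c = 0; c < 3; ++c) {
            // Standard "over" operator: source over the color accumulated so far.
            color[c] = f.rgba[c] * alpha + color[c] * (1.0f - alpha);
        }
    }
    return color;
}
```

Note that the function touches only the fragments of one pixel, which is the property the question below relies on.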

Step 1 requires load/store, so the a-buffer is an SSBO.

Step 2 is a full screen fragment shader but only reads from the a-buffer and writes to the framebuffer.

Question: Is there a way to use a transient a-buffer SSBO? I’m looking to reduce the a-buffer memory size.

Step 2 does not read from neighbor pixels. There is no need to store the entire a-buffer covering the full screen if the hardware uses tiled rendering.

If the hardware does not use tiled rendering I plan to treat it as if the full screen was a single tile.

Some numbers might illustrate the question:

I’ve developed some different formats for the a-buffer. Sizes range from 64 to 148 bytes per pixel.

Some mobile devices have 4K screens (3840 * 2160) and 4 GB of RAM shared between CPU and GPU. Even the smallest format needs 64 bytes * 3840 * 2160 ≈ 506 MiB, which is too much for a transient buffer like the a-buffer.
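The arithmetic can be checked with a small sketch. The helper names `aBufferBytes` and `toMiB` are made up for this post; only the screen size and the 64 to 148 byte range come from the numbers above.

```cpp
#include <cstdint>

// Total a-buffer size for a full-screen allocation (hypothetical helper).
constexpr std::uint64_t aBufferBytes(std::uint32_t width,
                                     std::uint32_t height,
                                     std::uint32_t bytesPerPixel) {
    return static_cast<std::uint64_t>(width) * height * bytesPerPixel;
}

// Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes).
constexpr double toMiB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// 64 bytes per pixel at 3840 x 2160: 530,841,600 bytes = 506.25 MiB.
static_assert(aBufferBytes(3840, 2160, 64) == 530841600ULL,
              "smallest format at 4K");
// 148 bytes per pixel at 3840 x 2160: 1,227,571,200 bytes, about 1170.7 MiB.
static_assert(aBufferBytes(3840, 2160, 148) == 1227571200ULL,
              "largest format at 4K");
```

So the largest format would need well over 1 GiB at 4K, more than a quarter of the 4 GB that CPU and GPU share.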

I haven’t yet resorted to 4 render passes (subdivide the screen into quadrants using a VkViewport), but that is one way to reduce the allocated size of the a-buffer.
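A rough sketch of that fallback, to show how the allocation shrinks. `Rect` is a plain stand-in for `VkRect2D` so the snippet compiles without the Vulkan headers, and `splitScreen` and `largestRegionBytes` are hypothetical helpers. How each rectangle is applied per pass (viewport, scissor, or render area) is left open here.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for VkRect2D (offset + extent), kept local so no Vulkan headers are needed.
struct Rect {
    std::int32_t x, y;
    std::uint32_t width, height;
};

// Split the screen into cols x rows regions; integer division spreads any remainder.
std::vector<Rect> splitScreen(std::uint32_t width, std::uint32_t height,
                              std::uint32_t cols, std::uint32_t rows) {
    std::vector<Rect> rects;
    for (std::uint32_t r = 0; r < rows; ++r) {
        for (std::uint32_t c = 0; c < cols; ++c) {
            const std::uint32_t x0 = width * c / cols;
            const std::uint32_t x1 = width * (c + 1) / cols;
            const std::uint32_t y0 = height * r / rows;
            const std::uint32_t y1 = height * (r + 1) / rows;
            rects.push_back({static_cast<std::int32_t>(x0),
                             static_cast<std::int32_t>(y0), x1 - x0, y1 - y0});
        }
    }
    return rects;
}

// The a-buffer only has to be as large as the biggest region, since it is reused per pass.
std::uint64_t largestRegionBytes(const std::vector<Rect>& rects,
                                 std::uint32_t bytesPerPixel) {
    std::uint64_t largest = 0;
    for (const Rect& rect : rects) {
        const std::uint64_t bytes =
            static_cast<std::uint64_t>(rect.width) * rect.height * bytesPerPixel;
        largest = std::max(largest, bytes);
    }
    return largest;
}
```

With a 2 x 2 split of 3840 * 2160 and 64 bytes per pixel, each quadrant is 1920 * 1080 and the buffer drops to 132,710,400 bytes (about 126.6 MiB), at the cost of submitting the transparent geometry four times.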

> Is there a way to use a transient a-buffer SSBO?

If you mean “transient” like the way tile-based renderers (TBRs) can use lazily-allocated memory and transient attachments, no.

> Step 2 does not read from neighbor pixels.

Only because of what your stored data just so happens to be. Transient images and lazy memory work because Vulkan’s render pass architecture limits what you can actually do with such images and memory. There’s no way to communicate to Vulkan that the particular data structure you’re using in some instance just so happens to be used in such a way that each tile’s data is independent.

Especially since the independent data for each tile is not contiguous.

Also, it wouldn’t really help. Lazily allocated memory and transient images work because the amount of data per tile is fixed. While neighboring pixels may not read from one another’s data, the amount of data per pixel, and therefore per tile, depends on exactly what you rendered.

> I’ve developed some different formats for the a-buffer. Sizes range from 64 - 148 bytes per pixel.

64 bytes per pixel? That’s a lot. Even a full RGBA color made of four 32-bit IEEE-754 floats would only take up 16 bytes. What all are you storing in each pixel?

Thanks, Alfonse. Much appreciated.

64 bytes seems like a lot for a single fragment, but an a-buffer collects fragments in a sorted list and defers resolving the transparency until the final pass.

There are some good research publications on OIT a-buffering around how to use a small list, e.g. 8 * 32-bit values (RGBA), looking at what to do when the list fills up.
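To make the "32-bit value (RGBA)" part concrete, here is a minimal sketch of packing a color into 8 bits per channel. The function names are hypothetical, and the byte order (red in the lowest byte) is an assumption chosen to mirror what GLSL’s `packUnorm4x8` does; this is not the layout of any particular published scheme.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>

// Pack four [0, 1] floats into one 32-bit word, 8 bits per channel.
// Assumption: component 0 (red) goes into the least significant byte.
std::uint32_t packRgba8(const std::array<float, 4>& color) {
    std::uint32_t packed = 0;
    for (int i = 0; i < 4; ++i) {
        const float clamped = std::min(std::max(color[i], 0.0f), 1.0f);
        const auto byte = static_cast<std::uint32_t>(std::lround(clamped * 255.0f));
        packed |= byte << (8 * i);
    }
    return packed;
}

// Inverse of packRgba8: expand each byte back to a float in [0, 1].
std::array<float, 4> unpackRgba8(std::uint32_t packed) {
    std::array<float, 4> color{};
    for (int i = 0; i < 4; ++i) {
        color[i] = static_cast<float>((packed >> (8 * i)) & 0xFFu) / 255.0f;
    }
    return color;
}
```

Eight such packed colors take 32 bytes per pixel; any depth or count stored alongside them adds to that.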

Re: fixed or data-dependent amounts –

The idea is to allocate 8 (or more) entries up front using a const list dimension. All fragments “output” the same amount of data, though within the data, you encode how many fragments were actually added to the list.
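A minimal single-threaded sketch of such a fixed-capacity list follows. The `Entry` layout, the far-to-near ordering, and the "reject when full" overflow policy are all assumptions for illustration; the overflow policy in particular is exactly what the papers mentioned above vary. A shader version would also need some form of synchronization between fragments, which is left out here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxFragments = 8;  // the const list dimension

// Hypothetical entry: a depth key plus a packed 32-bit color.
struct Entry {
    float depth;          // assumption: larger value = farther from the camera
    std::uint32_t rgba8;
};

// Every pixel occupies the same storage; `count` says how many entries are valid.
struct PixelList {
    std::uint32_t count = 0;
    std::array<Entry, kMaxFragments> entries{};
};

// Insertion sort step that keeps entries ordered far-to-near.
// Returns false when the list is full (assumed policy: drop the new fragment).
bool insertSorted(PixelList& list, Entry e) {
    if (list.count == kMaxFragments) {
        return false;
    }
    std::size_t i = list.count;
    // Shift nearer entries one slot to the right to make room for `e`.
    while (i > 0 && list.entries[i - 1].depth < e.depth) {
        list.entries[i] = list.entries[i - 1];
        --i;
    }
    list.entries[i] = e;
    ++list.count;
    return true;
}
```

Because the list is kept sorted on insert, the resolve pass can walk `entries[0 .. count)` directly and blend back to front without sorting again.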

Doing anything more fancy than that amounts to some kind of “list compression” scheme. (Use neighboring pixels and try to combine the data? Reduce precision as the list gets deeper?) Nobody has published about that yet, if you want to get the jump. :)