Best practices for ping-pong shading

CadBud · October 13, 2015, 12:50am

I’m looking at applying post process effects to my scene by using ping-pong shading.
Code should be multi-platform and support a wide range of devices over OpenGL ES 2.0.

My question is: what is the preferable strategy for implementing ping-ponging (both for the entire screen and for small renderbuffers we read pixels from)?

1: Create 2 framebuffers with the same parameters. Attach a texture to each of them and bind a different framebuffer on each pass.
2: Create one framebuffer for each texture size we ping-pong into, and alternate between 2 textures as its color attachment.
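
To make the two options concrete, here is a minimal sketch in C that models only the call pattern, not real rendering. It assumes texture 0 already holds the rendered scene. The CallCount struct and both function names are hypothetical; the counters stand in for the glBindFramebuffer and glFramebufferTexture2D calls that real code would issue.

```c
#include <assert.h>

/* Counters standing in for real GL calls (an assumption of this sketch):
   binds    ~ glBindFramebuffer(GL_FRAMEBUFFER, fbo)
   attaches ~ glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                                     GL_TEXTURE_2D, tex, 0) */
typedef struct {
    int binds;
    int attaches;
} CallCount;

/* Option 1: FBO i permanently owns texture i. Each pass binds the FBO
   whose texture is NOT the one being sampled. Returns the index of the
   texture that holds the final image. */
static int pingpong_two_fbos(int passes, CallCount *c)
{
    int src = 0;                 /* texture 0 holds the rendered scene */
    assert(passes >= 0);
    for (int i = 0; i < passes; ++i) {
        int dst = 1 - src;
        c->binds++;              /* bind FBO[dst], sample texture[src] */
        src = dst;               /* the texture just written is read next */
    }
    return src;
}

/* Option 2: one FBO stays bound. Each pass re-attaches the texture
   that is not being sampled. */
static int pingpong_one_fbo(int passes, CallCount *c)
{
    int src = 0;
    assert(passes >= 0);
    for (int i = 0; i < passes; ++i) {
        int dst = 1 - src;
        c->attaches++;           /* attach texture[dst], sample texture[src] */
        src = dst;
    }
    return src;
}
```

Either way the render target changes once per pass; the two options differ only in which call performs the change.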

I guess both methods would force the framebuffer to be revalidated (whenever I bind a different framebuffer or attach a different texture).

Sources are a bit contradictory on this matter:

NVIDIA (p.29) recommends switching texture attachments -

The OpenGL Wiki recommends alternating between framebuffers -
https://www.opengl.org/wiki/Framebuffer_Object_Examples#1_FBO_or_more

And all sources relate to OpenGL, not OpenGL ES, so I guess the whole underlying infrastructure could be different.

Alfonse_Reinheart · October 13, 2015, 7:07am

Code should be multi-platform and support a wide range of devices over OpenGL ES 2.0.

My best advice is this:

Don’t

Most modern OpenGL ES devices hate any changes to render targets. Whether it’s attaching different textures or changing FBOs entirely is irrelevant. Tile-based deferred rasterizers lose a huge amount of performance when you have to read from an image you just wrote to.

There’s a reason why Apple’s Metal API makes the framebuffer state part of the render pass descriptor used to create a render command encoder, where it stays fixed for the lifetime of that encoder. Changing framebuffer state is expensive on mobile devices, more so than on desktop hardware, so they make it harder for you.

Ping-ponging will hurt. A lot. It’s better to flat-out avoid ping-ponging on mobile hardware. Generally speaking, whatever graphical effect you’re doing won’t be worth it.

CadBud · October 13, 2015, 8:07am

Thanks for the answer.
I’m a bit surprised, you would rather execute your post-processing on the CPU manually, ALWAYS?

What if, for example, a mobile app aims to apply FXAA as a post process effect to every frame?

Alfonse_Reinheart · October 13, 2015, 11:32am

I’m a bit surprised, you would rather execute your post-processing on the CPU manually, ALWAYS?

No. I would simply not have that post-processing happen at all. The performance loss just isn’t worth whatever it was that required ping-ponging.

What if, for example, a mobile app aims to apply FXAA as a post process effect to every frame?

FXAA does not require ping-ponging. The term “ping-ponging” refers to swapping back and forth between two textures. You read from A and write to B, then switch to reading from B and writing to A.
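
As a rough CPU-side illustration of that read-A/write-B/swap pattern, here is a minimal sketch in C. Plain float arrays stand in for the two textures, and a 3-tap box blur stands in for the shader pass; blur_pass and pingpong_blur are hypothetical names chosen for the example, not part of any API.

```c
#include <assert.h>
#include <stddef.h>

/* One "pass": read every sample from src and write the 3-tap average
   to dst. Out-of-range neighbours clamp to the edge sample. */
static void blur_pass(const float *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float left  = src[i > 0 ? i - 1 : 0];
        float right = src[i + 1 < n ? i + 1 : n - 1];
        dst[i] = (left + src[i] + right) / 3.0f;
    }
}

/* Runs `passes` blur passes, ping-ponging between a and b.
   a holds the input; the return value is whichever buffer holds
   the final result. */
static float *pingpong_blur(float *a, float *b, size_t n, int passes)
{
    float *read = a;
    float *write = b;
    assert(n > 0 && passes >= 0);
    for (int p = 0; p < passes; ++p) {
        blur_pass(read, write, n);
        /* Swap roles: what was just written becomes the next input. */
        float *tmp = read;
        read = write;
        write = tmp;
    }
    return read;
}
```

With an odd number of passes the result ends up in b, with an even number in a, which is why the caller has to use the returned pointer rather than assume a fixed buffer.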

FXAA requires:

1: a single rendering of the scene (color & depth)
2: reads from the scene’s depth buffer to detect edges, producing the scene’s edge buffer
3: reads from the scene’s image and edge buffers to produce the final image

That’s not ping-ponging.

Ping-ponging is typically used for read/modify/write operations, which FXAA is not. Or at least, not in the way you mean.

If you’re talking about any “write stuff to a texture, then read it and write elsewhere” operation, it really doesn’t matter whether you changed FBOs entirely or just attached a different texture. The reason is that the overhead of the tile-based resolve operation will completely overshadow any CPU overhead from the specifics of how you change the framebuffer’s state.

Mobile hardware and desktop hardware are different. Things that are fast on one are not necessarily fast on another.

CadBud · October 14, 2015, 4:08am

Alfonse Reinheart;39527:

No. I would simply not have that post-processing happen at all. The performance loss just isn’t worth whatever it was that required ping-ponging.

I’d prefer to find a better solution in my app logic than give up on the entire post-process effects.
E.g: A paused game scene which is grayscaled and blurred. The final scene background can be generated once as a texture and then reused.
If GPU resolves make the FPS bleed, then post-process effects don’t necessarily have to be applied at 60 fps. That’s a plausible solution on my part.
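
A minimal sketch of that "generate once, then reuse" idea, written in C on the CPU so it can be run without a GL context. Everything here is hypothetical: the PauseBackground struct, the fixed MAX_PIXELS size, and the integer luma weights (a common approximation of the Rec. 601 coefficients) are assumptions for illustration. On the GPU the equivalent would be rendering the grayscale/blur passes into a texture once when the pause begins and drawing that texture on later frames.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_PIXELS = 64 };        /* tiny fixed size, just for the sketch */

typedef struct {
    unsigned char gray[MAX_PIXELS];
    size_t count;
    int valid;                   /* nonzero once the cached image is usable */
    int builds;                  /* how many times the expensive pass ran */
} PauseBackground;

/* Integer approximation of the Rec. 601 luma weights (77 + 150 + 29 = 256). */
static unsigned char to_gray(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)((77 * r + 150 * g + 29 * b) >> 8);
}

/* Returns the cached grayscale image, rebuilding it only when the cache
   has been invalidated. rgb holds `count` tightly packed RGB triples. */
static const unsigned char *pause_background(PauseBackground *bg,
                                             const unsigned char *rgb,
                                             size_t count)
{
    assert(count <= MAX_PIXELS);
    if (!bg->valid) {
        for (size_t i = 0; i < count; ++i)
            bg->gray[i] = to_gray(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
        bg->count = count;
        bg->valid = 1;
        bg->builds++;
    }
    return bg->gray;
}
```

Clearing the valid flag (for example when the game is paused again on a different scene) forces one rebuild; every other frame just reuses the stored result.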

Thanks for a great answer!
Allow me to respond: in the context of my engine, I think of FXAA as a single post-process effect in a chain of effects, which is why I included it in this discussion of ping-pong shading.
If applied as a single effect, no ping ponging is required of course (and the same notion would apply to most simple post-process effects that perform convolution).

But while we’re at it, FXAA requires switching the render target at least twice per frame.
Do you suggest FXAA is not optimal for mobile hardware then?
(Even though MSAA has hardware support, the extra memory requirement bothers me.)

Alfonse_Reinheart · October 14, 2015, 6:23am

I see your point on chaining post-processing effects.

As to the primary thrust of your question, ultimately I don’t think it will matter to mobile hardware. FBOs are ultimately an abstraction. The primary cost burden of the abstraction itself is borne by the CPU, whereas the cost of doing the conceptual operation is on the GPU.

When NVIDIA recommends that you only change image attachments, that’s primarily because of how their hardware works. Notice that they specifically say that the new attachment should retain the same image format as the old. This is because their hardware, and therefore the driver code on the CPU that manages it, seems to have some explicit dependency on this.

You can see this as well in NV_command_list, where they allow changing render targets in the middle of a command list. But they require that an entire rendering pass use the same image format, so you’re only allowed to change targets if the format doesn’t change.

Those kinds of things are likely to be hardware-specific. Maybe NVIDIA’s GPUs have ways to change where a texture comes from without having to incur a full pipeline stall and clear caches. Or whatever.

For tile-deferred GPUs, all that is irrelevant next to the gigantic pipeline stall that must be incurred when you change a render target into a texture. However you do this, whether it’s changing FBO attachments or the entire FBO itself, you’re still going to pay a serious price when you try to read from that texture. And that price will overshadow pretty much everything else.

That’s not to say that desktop-style GPUs have no pipeline stalls to pay when you initiate a readback from a previously written texture. But it’s not nearly as huge, so the CPU penalty of it is more likely to be important.

So generally speaking, the specific style of the change won’t matter. It’ll hurt a lot either way.

But ultimately, I don’t think there’s much practical, comparative experience out there on mobile GPUs. If you think it could help performance, you could always profile it.

Do you suggest FXAA is not optimal for mobile hardware then?

Optimal relative to what? Just as on desktops, it’s probably faster than MSAA. But the advantage won’t be as large as on desktop hardware, due to the larger cost of switching from writing to reading.
