Depth Testing

In the official OpenGL fragment pipeline, depth testing is performed after texturing and many other per-pixel operations.

I was just wondering if, inside their chipsets, some companies cheat a little and do the depth test before texturing. It seems to me that with the upcoming pixel shaders, which can be very complex, a program might be doing a lot of work that never shows up on the screen. Depth testing early would save some of that work, and I think it should give the same results as depth testing afterwards, since the fragment is culled or passed based only on its depth, not on any texture or fragment ops.
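
To make concrete what I'm imagining, here's a rough sketch in C of a hypothetical pipeline with the depth compare hoisted in front of texturing. This isn't any real chipset, and all the names are made up:

    /* Hypothetical per-fragment flow with the depth compare done before
       texturing; not any real chipset, all names are made up.           */

    #define W 640
    #define H 480

    static float    depth_buffer[H][W];
    static unsigned color_buffer[H][W];

    /* Stand-in for the expensive part: texture fetches, combiners,
       pixel shaders.                                                    */
    static unsigned texture_and_shade(int x, int y) { return 0xFFFFFFFFu; }

    void process_fragment(int x, int y, float z)
    {
        if (z >= depth_buffer[y][x])   /* early GL_LESS-style test        */
            return;                    /* culled before any texture reads */

        color_buffer[y][x] = texture_and_shade(x, y);
        depth_buffer[y][x] = z;        /* same outcome as a late test,
                                          since pass/fail depends only on
                                          depth                           */
    }

The pass/fail decision only looks at the depth, so moving it earlier shouldn't change the picture, only the amount of work done for hidden fragments.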

This would save a whole lot on memory bandwidth as well, which would really help cards like the GeForce series.

Anybody have an idea whether or not some chipsets do this?

j

Depth testing is per pixel, and those pixels come from a fragment. You have to have the fragment before you can test the pixels of the fragment.

Think about it: if depth testing came earlier, you would be forcing the hardware into a tight spot. It would have to cache intermediate data for each pixel of the fragment, perform the test, potentially split the fragment into multiple fragments, and then rerasterize all the resulting fragments to apply the texture. That would seem to require a bit more silicon and time.

[This message has been edited by DFrey (edited 02-02-2001).]

> Depth testing is per pixel, and those pixels come from a fragment. You have to have the fragment before you can test the pixels of the fragment.

But the only part of the pixel that the depth test needs is the depth, right?

Would it be that hard to calculate the depth of a pixel first, from the information we have at the beginning of the fragment operations, and then to continue with the fragment operations after the test?

> Think about it: if depth testing came earlier, you would be forcing the hardware into a tight spot. It would have to cache intermediate data for each pixel of the fragment, perform the test, potentially split the fragment into multiple fragments, and then rerasterize all the resulting fragments to apply the texture. That would seem to require a bit more silicon and time.

I don’t mean to be impertinent or anything, but why would a processor have to cache intermediate data for the fragment? I can’t see what intermediate data would need to be saved if the depth test came first. The processor would calculate the depth and check it, which would happen anyway, just later on in the pipeline. If it passes, continue with the fragment operations; if not, cut it off there.

And isn’t a fragment simply a pixel that hasn’t quite made it on to the screen? If each fragment consisted of multiple pixels, then display resolution wouldn’t matter very much, I think.

Anyway, as you can see, I’m not understanding your explanation. Is there another way you can explain it?

j

Ok, maybe a little difference in nomenclature. When I say fragment, I’m referring to a series of frame-buffer addresses and associated data that results from rasterization. But after thinking about what you were saying, it sort of made sense, until I remembered that frame buffer memory is slow compared to internal registers. You want to minimize the amount of data pulled from the frame buffer, and that includes the depth buffer. Now remember that some fragments may be clipped by alpha testing, which must necessarily come after texturing. If those fragments get clipped, and other fragments get clipped by the scissor test, then you have fewer fragments to depth test, and thus less data pulled from the depth buffer. It’s all about memory bandwidth, it appears.
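
To spell out the ordering I mean (just a rough sketch with made-up stand-in functions, not how any actual chip is wired): the alpha test compares the textured alpha, so it can only run after texturing, and the depth test comes after that.

    /* Rough sketch of the ordering described above; the functions are
       made-up stand-ins. The key point is that the alpha test uses the
       textured alpha, so it cannot run before texturing.                */

    typedef struct { int x, y; float z; } Fragment;
    typedef struct { float r, g, b, a; } Color;

    static int   scissor_test(Fragment f)        { return 1; }
    static Color texture_and_combine(Fragment f) { Color c = {1.0f, 1.0f, 1.0f, 1.0f}; return c; }
    static int   alpha_test(float a)             { return a > 0.5f; }  /* e.g. GL_GREATER, 0.5 */
    static int   depth_test(Fragment f)          { return 1; }         /* depth buffer read    */
    static void  write_pixel(Fragment f, Color c) { }

    void per_fragment(Fragment f)
    {
        if (!scissor_test(f)) return;        /* needs no texture data     */

        Color c = texture_and_combine(f);    /* produces the final alpha  */

        if (!alpha_test(c.a)) return;        /* needs that textured alpha */
        if (!depth_test(f))   return;        /* depth buffer access here  */

        write_pixel(f, c);
    }

Fragments killed by the scissor or alpha test never reach the depth test, so they never pull anything from the depth buffer.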

[This message has been edited by DFrey (edited 02-02-2001).]

Yes, it’s about memory bandwidth. But when you are texturing, the data comes from main memory anyway, which is the same speed as the frame buffer. So if you are doing two textures with mip-maps and trilinear filtering (a very bandwidth-killing example), you will be doing 16 memory accesses, assuming none of the memory gets cached. If you do the depth test before you texture, you only have one memory access.
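
To spell out where the 16 comes from (just back-of-the-envelope, assuming nothing is cached):

    trilinear = 2 mip levels x 4 texels (bilinear each) = 8 texel reads per texture
    2 textures x 8 texel reads = 16 memory accesses per fragment

versus a single depth read per fragment if the test is done first and the fragment fails.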

Now in real life, textures get cached, so it isn’t quite as bad as the example I gave, but I still think depth testing before texturing would be faster.

j

> But when you are texturing, the data comes from main memory anyway, which is the same speed as the frame buffer.

Ummm… not really. I think each kind of memory is optimized for different things. On-card memory is often “faster” than DRAM for certain operations.

I would assume that the color buffer could be optimized for writing (which makes destination alpha somewhat slow), whereas the depth buffer could be optimized for reading, because you get more depth (and stencil) reads than writes in a scene with reasonable overdraw. Meanwhile, texture memory depends on whether it comes from on-card memory (probably optimized for reading) or from (S/R/D)DRAM (which is sort-of general purpose).

Or it may be that on-card memory is just faster than typical DRAM in general, and the slowness of reading the color buffer comes from the reads having to share bandwidth with screen refresh.

Anyway, as far as fragments vs. depth testing goes, perhaps the old, initial GL was written before texturing was an option and was designed more with Gouraud shading in mind, and we now have a pipeline that works well enough anyway?

I meant that on-card memory is used for texturing, but I guess I stated it the wrong way.

I always thought that the memory on a graphics card is pretty much all the same, and the frame buffer is allocated out of this. This makes sense because the amount of framebuffer memory required varies a lot. It could be less than a meg for 640x480x16 with no depth buffer, all the way up to dozens of megs for 1600x1200x32 with a 32 bit depth buffer.

Either way, I think that one depth read would be faster than a bunch of texture reads, especially when the card is going to do the depth read anyway, no matter what.

I have heard that many games have two or three times overdraw. If the scene were drawn front to back and the card checked the depth buffer before texturing, that would save about half of the texture reads. That’s almost like doubling the effective memory bandwidth.
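
Roughly, assuming 2x overdraw and perfect front-to-back order:

    test after texturing:  2 fragments per pixel get textured  -> 2x the texture reads
    test before texturing: only the 1 visible fragment textures -> 1x the texture reads,
                           plus a depth read for the rejected fragment

So about half the texture traffic goes away, and with 3x overdraw it would be about two thirds.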

I think either chipsets already do this, or there is something legal that is stopping the manufacturers, such as SGI not licensing these implementations.

j

J, technology to do this exists in the consumer market today. Check out ATI’s white paper on its “Hyper-Z” capabilities. In certain scenarios, Hyper-Z can amount to as much as a 30% performance increase, using:

  1. Z compression
  2. Fast Z clear
  3. Hierarchical Z

Nvidia will be adding these features in an upcoming hardware release IIRC.
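
Here’s my rough guess at the concept behind the hierarchical Z part; this is only a sketch of the idea, not ATI’s actual implementation:

    /* Guess at the idea behind a hierarchical Z reject; not ATI's actual
       implementation. Each tile of the depth buffer keeps the farthest
       depth stored anywhere in it. If the nearest depth an incoming block
       of fragments can have is still farther than that, every fragment in
       the block fails a GL_LESS test, so the per-pixel depth values never
       need to be read at all.                                             */

    #define TILE 8
    #define W    640
    #define H    480

    static float tile_max_z[H / TILE][W / TILE];  /* farthest z per tile; must be
                                                     kept up to date (conservatively)
                                                     as pixels are written           */

    /* Returns 1 if the whole block covering this tile can be thrown away. */
    int tile_rejects(int tx, int ty, float block_min_z)
    {
        return block_min_z >= tile_max_z[ty][tx];
    }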

Glossifah

That’s sort of what I mean, but not quite. I read the paper, and the closest technique to what I’m talking about is Hierarchical Z. However, the way they word it, it seems to be more like a preprocessing step, rather than something that is done per-pixel as the triangles are drawn. And a 30% performance increase indicates that whatever method they are using is not eliminating all the overdrawn pixels.

It could be that their marketing is aimed at people who don’t know anything about graphics pipelines, and that they are trying to make it understandable to the general public.

They could simply be doing the depth test before texturing, or it could be something else.

It would be great if cass, matt, or somebody from ATI could explain what is really going on with “Hyper-Z.” I’m still sorta confused.

j

There is so much misinformation out there about these techniques… my favorite is the claim that HyperZ allows a 1.2 Gtexels/s (theoretical maximum) card to be rated at 1.5 Gtexels/s by saving memory bandwidth. Clearly this is nonsense – 200 MHz and 2 pixel pipes, each applying 3 textures per pixel, imply a maximum of 1.2 Gtexels/s, even if memory bandwidth were not a limitation. Improving memory bandwidth efficiency will only increase the degree to which this maximum can be achieved, and never increase the actual maximum.
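
Spelled out, the arithmetic is just:

    200 MHz x 2 pixel pipes x 3 textures per pixel = 1,200 Mtexels/s = 1.2 Gtexels/s

Saving memory bandwidth changes none of those three factors, so the peak cannot go up.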

It is possible for similar techniques to increase fill rate, but you have to look carefully at the claims to separate the wheat from the chaff.

In the end, these techniques are not terribly relevant to users, and only somewhat relevant to developers. For an end user, all that matters in the end is actual results, not specs or claims of revolutionary technology – so such a technique should be graded on its performance impact and not on its technical details. For developers, yes, these things matter, because they have implications for how you should perform your rendering passes for highest efficiency, but you can safely ignore them unless you are trying to optimize.

I should also note that I have seen benchmarks with HyperZ “disabled”. It’s not clear what this means, but (1) no one in their right mind would ever turn it off, and (2) you can never trust IHVs to provide you with a fair comparison of this kind.

If we provide you with a switch to disable performance feature X, which in theory should have no impact on the actual images produced, what stops us from checking for whether feature X is disabled and adding some delay loops in the driver? If you’ll always turn it on anyway, we can exaggerate the real performance impact.

We don’t have a switch in our OpenGL driver to “disable T&L”. Sure, such a switch would allow us to say “T&L helps performance by this much”, but the results would be questionable at best, because you don’t know exactly what is being compared.

  • Matt

There are actually reasons why you should be able to turn parts of HyperZ off. The main one is that, in the current state of the Radeon drivers, it can sometimes cause artifacting in some games when enabled. For this reason Hierarchical Z is actually disabled by default. You have to manually edit the registry to enable it, or use a tweaker (for example my own tweaker, which can be found here if someone’s interested).

I can assure you that there are no internal loops just to prove that HyperZ is enhancing performance. It does what it’s supposed to do, and when you compare numbers with it on and off across some apps, you can see that it affects them differently, in a way you would expect from the technology. Apps with a huge overdraw factor (such as VillageMark) see a significant performance boost (~40%) from enabling Hierarchical Z, while the average game gets about 20%.

Soapbox mode…

Well, if a performance feature is broken, it should be disabled. It’s as simple as that.

And if a performance feature is broken, the numbers you get when it is turned on are pretty much irrelevant. The “HSR” 3dfx drivers are an example of this. Who cares what the framerate is if all the geometry is flashing? To argue about the performance benefits of the “hierarchical Z” setting is silly, because, in the end, you can’t enable it safely.

I’ll repeat my previous statement:

In the end, these techniques are not terribly relevant to users, and only somewhat relevant to developers. For an end user, all that matters in the end is actual results, not specs or claims of revolutionary technology – so such a technique should be graded on its performance impact and not on its technical details. For developers, yes, these things matter, because they have implications for how you should perform your rendering passes for highest efficiency, but you can safely ignore them unless you are trying to optimize.

Technology is irrelevant. Results are what count.

  • Matt

Ok, this is all interesting, but going back to my original question, for curiosity’s sake:

Do chipsets do the depth test before or after the texture stage?

j

I won’t claim to know what all chipsets do, and I’m not at liberty to discuss what our (NVIDIA’s) chipsets do “under the covers” in this forum, but I can say that under some (probably most) circumstances, a conformant OpenGL implementation can perform the depth compare prior to texturing. A notable (and somewhat common) exception occurs when using alpha test in conjunction with texturing to create billboarded trees and such. Nevertheless, it’s not a bad idea for the OpenGL app to coarsely sort from front-to-back to minimize the framebuffer bandwidth required for rendering depth-buffered primitives. Also, make sure you clear the depth buffer when you start a frame, rather than just drawing a skybox, as many chips out there (hmm, who could that be?) can optimize for that too!
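
As a rough sketch of that per-frame pattern (the two draw_* helpers are hypothetical application code, not anything from our drivers):

    #include <GL/gl.h>

    /* Hypothetical helpers standing in for the application's own drawing code. */
    extern void draw_opaque_front_to_back(void);
    extern void draw_alpha_tested_billboards(void);

    void draw_frame(void)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);  /* really clear the depth buffer */

        glEnable(GL_DEPTH_TEST);
        glDepthFunc(GL_LESS);

        /* Opaque geometry, coarsely sorted near to far, so fragments behind
           what is already drawn can be rejected before their textures are read. */
        draw_opaque_front_to_back();

        /* Alpha-tested cutouts (billboarded trees and such): the alpha test
           needs the textured alpha, so this is the case where an early depth
           compare may not apply.                                               */
        glEnable(GL_ALPHA_TEST);
        glAlphaFunc(GL_GREATER, 0.5f);
        draw_alpha_tested_billboards();
        glDisable(GL_ALPHA_TEST);
    }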

John Spitzer

Originally posted by mcraighead:
[b]Soapbox mode…

Well, if a performance feature is broken, it should be disabled. It’s as simple as that.

And if a performance feature is broken, the numbers you get when it is turned on are pretty much irrelevant. The “HSR” 3dfx drivers are an example of this. Who cares what the framerate is if all the geometry is flashing? To argue about the performance benefits of the “hierarchical Z” setting is silly, because, in the end, you can’t enable it safely.

I’ll repeat my previous statement:

Technology is irrelevant. Results are what count.

  • Matt[/b]

Yes, and it is disabled. But there’s nothing wrong with letting the advanced user enable it if he knows what he is doing. There’s no checkbox or anything like that in any control panel; you need to go into the registry and enable it yourself.

But the comparison with 3dfx HSR is not quite fair. It’s not so broken that it’s barely usable; in fact, in 90% of cases Hierarchical Z works without any problems or artifacts. But in a few games a small black box (something like 16x16 pixels, or whatever size the tiles are) can pop up somewhere on the screen for a single frame, maybe every 30th frame on average. It’s just a little annoying, but the performance difference between that and a mode where it worked perfectly would be insignificant. And in games where it works perfectly, the performance increase is simply good.

I’d say the performance you benchmark with Hierarchical Z on is valid. If you had, say, a 16-bit Z buffer and got some small Z-fighting problems once in a while, would you think the score was invalid?