Performance of Non-Power-of-2 Textures

Hi guys :slight_smile:
While working on a Vulkan app that heavily relies on rendering to offscreen textures (optionally supporting MSAA), I’m currently “limited” to testing on a 1050 Ti running Windows.
Since my hardware is not very restricted in either feature limits or feature performance, and since I’m aiming at cross-platform development, I’m wondering whether to design the renderable-texture manager to (also) support power-of-2 textures (PoTT) with potentially slower hardware in mind.
As a side note, the offscreen content is text rendered in a “vector fashion” using pre-optimized glyph data. As this may be a very expensive draw call, the rendered text is cached in the RT and used as a billboard in further calls until the text and/or its transformation changes (= rarely, in general).

I'm aware of the fact that non-PoTT support is part of Vulkan’s hardware requirement specs, but AFAIK there is no requirement regarding the performance behaviour. E.g. I’m concerned that some hardware rasterizers may perform slower (or the MSAA resolve may be slower) when handling non-PoTT.

Long story short: Does anybody know whether using non-PoTT rather than PoTT incurs a performance penalty on some (exotic/mobile) hardware?

Thanks in advance.

Maik


If this were a performance hazard for some GPU vendor in Khronos, they’d have vetoed such a proposal. Modern GPUs are a lot like CPUs in terms of memory-access capabilities, and texture sampler units have caches of their own, so I doubt non-PoTTs pose a problem. One thing I can think of is that (for example) if a tile-based GPU renders images in portions of 64x64 pixels and you render an image at 100x100 resolution, the hardware may have no choice but to render a 128x128 (multiple of 64) image and simply discard some pixels, but this is barely a concern at modern resolutions.

Thanks for your fast response, Salabar.

Tile-based frame buffers are a good point, which could really have at least a small impact on performance (if any at all).
I can imagine the GPU handles those cases somehow using _some_kind_of scissor-rect rendering - just an assumption.

Anyway, in the current implementation I make sure that at least the width is divisible by 2 (for no plausible reason - I just feel better doing so^^).

In the worst case, i.e. when future testing really shows that using only sizes divisible by X (probably 64) achieves a performance gain on (probably mobile) platforms that is worth the optimization, I’ll simply do so - shouldn’t be a big deal.

The bigger deal will be finding out which platforms/GPUs cause other problems, but that’s another story and not related to this issue…

[Offtopic]
That said, a helpful resource for indie developers like me is a site I stumbled upon yesterday night, which most of you probably already know: [EDIT: not allowed to insert link to vulkan gpuinfo 0rg] - a really nice resource to check the Vulkan specs of many devices and also their behaviour on different OSes/platforms. Very valuable information when one is coding at such a low level to enable apps to run on as many devices* as possible.
BTW: The site seems to have been initially released by Sascha Willems, who has also done so much more to push Vulkan - not only but also his framework/examples on GitHub, another great resource. At this point, a giant thanks to Sascha and of course everybody else involved in developing and supporting Vulkan.
(*except Apple, sadly :rolleyes: ).
[/Offtopic]

Thanks again & have a nice weekend,
Maik

Yeah, I have no idea where he gets the free time for all that. Anyway, pretty-printing it for ya:
http://vulkan.gpuinfo.org

This claims PoTT still makes sense performance-wise (in OpenGL):
https://software.intel.com/en-us/articles/opengl-performance-tips-power-of-two-textures-have-better-performance

Because interpolation of float numbers can be done very quickly with power-of-two textures,

This sounds weird. Hardware samplers are baked into silicon; it’s not like a hardware design can restructure transistors on the fly. I can argue that lower-resolution textures are nicer on cache utilization (even more important for integrated GPUs, which are often bandwidth-starved), i.e. when 256x256 is not enough and 512x512 is excessive, you may use a 400x400 picture as a compromise. In the benchmark they
a) draw a single static image
b) using the nearest-neighbour filter (a single memory fetch per fragment!)
while in real-world scenarios you’d at least want to use 4x anisotropic filtering with mipmapping. I’m guessing Intel engineers know their hardware better than me, yet their example does not convince me.

Hey KrOoze,

This claims POTT still makes sense performace-wise (in OpenGL):
https://software.intel.com/en-us/art...er-performance

Oh, that’s interesting, even if I don’t see why this issue is especially related to float-number interpolation:

“Because interpolation of float numbers can be done very quickly with power-of-two textures, these textures will render faster than ones that are not a power of two. […]”

Maybe I’m just too tired to understand, or the writer was, or he/she just meant the interpolation itself (min/magFilter != nearest) rather than the value type that is interpolated.

Anyhow, good to know. And one more reason to measure and compare performance. Even if it’s mentioned that this is more relevant to older Intel GPUs (which surely are still widely used).

Maybe I’m wrong, but I’ve always had the feeling that Intel GPUs and their drivers are sometimes something “very special” ;), compared to the market leaders AMD & NVIDIA. Not to mention even more exotic mobile GPUs, where every transistor counts so a mobile phone won’t turn into a radiator. Probably that’s just the nature of things (and money).

Following the Intel link and using the code sample, I’m surprised that even on my desktop GPU there is a difference of ~3% in performance:

    OpenGL renderer string: GeForce GTX 1050 Ti/PCIe/SSE2
    OpenGL version string: 4.5.0 NVIDIA 376.33

    This lesson compares the read performance between using Power-of-Two textures and Non-Power-of-Two textures.
    Press <esc> to exit; <space bar> to switch between texture sizes …

    *** Non-Power-of-Two Texture – 640 x 426
    frames rendered = 11166, uS = 2000089, fps = 5582.751568, milliseconds-per-frame = 0.179123
    frames rendered = 11206, uS = 2000035, fps = 5602.901949, milliseconds-per-frame = 0.178479

    *** Power-of-Two Texture – 1024 x 1024
    frames rendered = 10964, uS = 2000122, fps = 5481.665618, milliseconds-per-frame = 0.182426
    frames rendered = 10969, uS = 2000070, fps = 5484.308049, milliseconds-per-frame = 0.182338

The big question is whether, and if so how much, performance loss is caused by the OpenGL driver. Luckily, there is a new API with extremely low driver overhead that can be used to find this out :wink:

Another issue (which I mixed up in my last post) is the difference between rendering to such (“framebuffer”/“renderable”) textures, sampling/fetching from them, and last but not least resolving them in case MSAA is used. Not to mention the case where mip-mapping comes into play. But the latter is, fortunately, not related to my topic and IMHO not recommended anyway.

So, many assumptions have been made up to this point, and there is little really reliable measurement data, or even rules of thumb, for real-life cases. As soon as I find out more, I’ll let you know (I need to install the Android dev tools for VS and port my code to Android in order to report at least how a Snapdragon/Adreno GPU behaves dealing with non-PoTT). Of course, I’d be happy if you have any further info on this topic to share.

The difference between sampling a PoT texture and an nPoT texture is two floating-point multiplies, to map the 0-1 texture coordinates of the shader to the 0-(w,h) range of the texture.

With PoT textures, that can be replaced with some integer math on the exponent.

Yeah, that’s all true. I dare you, though, to find a better article that is sufficiently recent.
I guess it’s back to “you have to measure yourself”. :doh: Would be awesome if you shared some results.

IMHO the biggest concern was that in old OpenGL some cards would switch to a software renderer if they saw an NPOTT. I hope that is history (though AFAIK there’s nothing in the Vulkan spec saying some things can’t be emulated in software).

@ratchet Is there any performance difference on modern HW between fmul and add? Maybe some more stalls here and there due to higher latency. And maybe some of that exotic/mobile HW would choose a simpler mul circuit?

Is there any performance difference on modern HW between fmul and add? Maybe some more stalls here and there due to higher latency. And maybe some of that exotic/mobile HW would choose a simpler mul circuit?

It’s going to be a few cycles of latency at most, given that fmul is the backbone of what GPUs actually do. However, that will drown in the noise of the memory read plus all the other things you are calculating.

That assumes there is any significant processing to drown out the memory read. If you’re just blasting text/sprites/particles/whatever to the screen, it’s just read/write. Maybe with a 4-vector multiply to do color tinting, but that’s it. The texture access will dominate performance in such cases.

The difference between sampling a PoT texture and an nPoT texture is two floating-point multiplies, to map the 0-1 texture coordinates of the shader to the 0-(w,h) range of the texture.

With PoT textures, that can be replaced with some integer math on the exponent.

No idea how this is handled by hardware - but in case raw IEEE 754 floats are used instead of pre-converting them to fixed point before rasterization, there should be (as always: in theory) no difference, because a float’s exponent is base 10 and not base 2 (so no optimization is [easily] possible). In case you meant by “integer math in the exponent” the “exponent” of a fixed-point value (“the digits after the comma” - sorry, my English), you are absolutely right.

IMHO the biggest concern was that in old OpenGL some cards would switch to a software renderer if they saw an NPOTT. I hope that is history

Indeed, this would be the absolute worst case. It’s unlikely, but if such a GPU officially supporting Vulkan existed, I wouldn’t take it seriously and would just ignore it. These days, developers have to be able to rely on common rendering being performed in hardware (except when explicitly asking for a SW reference rasterizer).

Wait, what? It is true that IEEE 754 (also) specifies a decimal system. But I don’t think there exists a computer using it.
Anyway, the math is x = u * w, y = v * h. So if w = 2^N, the optimization ratchet meant is adding N to the float’s exponent (a plain integer add) instead of the multiplication.

Man, for the first seconds after reading your post I thought that I’ve chosen the wrong job.

Then I did a simple test:

    struct FltCast {
        FltCast(float f) : fValue(f) {}
        union {
          float fValue;
          struct {
            unsigned int mantissa : 23;
            unsigned int exponent : 8;
            unsigned int sign : 1;
          };
        };
    };
    FltCast flts[4] = { 256.0f, 200.0f, 1.0f / 256.0f, 1.0f / 200.0f };
    for (auto f : flts) std::cout << "Fl.32:" << f.fValue << " mant.:" << f.mantissa << " exp.:" << (int)f.exponent - 127 << std::endl;

So now, after seeing the results, I’m convinced that I’ve chosen the wrong job :frowning: The exponent of floats is base 2 and not base 10, even though it’s common to write the exponent of literals in a base-10 manner… How could I have missed that all these years? So sorry for my statement! Time to stop working for today and have a beer or two…

Cheers!

Yeah, and I bet nobody has complained at your job yet. It is the benefit of a formal education to know some such pointless factoids. Glad I got to brandish what I learned - my 15 minutes of fame. :smiley:

For a second mind blow factoid: the commonly used (and commonly working well) code snippet you posted is “undefined behavior” under C++ spec. :wink:

Hehe, getting off-topic again; it would probably be easier to start marking only the sections of this thread that are related to the initial topic rather than the off-topic ones :wink:

For a second mind blow factoid: the commonly used (and commonly working well) code snippet you posted is “undefined behavior” under C++ spec.

It was intended that the shown code is undefined in behaviour, since my professional state seems to be similarly undefined ;p

Seriously, what’s wrong with the code, except that the iostream include and a surrounding main() function are missing? Let me guess:

1.) Something related to the for(var : iterOver) loop, since I used a raw array whose size may be undefined for some compilers? [Only a recommendation: according to Scott Meyers, usage of raw arrays should be avoided in modern C++]
2.) The brace-initialized array is wrongly initialized (initializer-list issues, {} vs. ())?
3.) Missing typedefs for the union or struct (I’m pretty sure they’re no longer necessary, as since C++11 it’s done by default?)
4.) Using an old-style cast (int) vs. static_cast<int>() (-> cosmetic stuff)
5.) Unnamed struct and/or union?
6.) I made the assumption that float is 32 bit (23:8:1) – which, as far as I can remember, is wrong at least by the C++ specs (they only say that sizeof float <= double <= long double AND float must be >= 4, double >= 8)
7.) Usage of auto is wrong because the FltCast[4] array deduction is ambiguous?
8.) Missing using std::cout (which is only recommended)?
9.) Missing noexcept on the constructor? (-> only a performance impact; the other way round would be a mistake)
10.) Most likely: something regarding the good old POD struct.

This feels a bit like playing Jeopardy^^ and I’m really looking forward to your answer… We could also make this a quiz instead, taking stakes from others who bet on the correct answer :smiley:

[EDIT: 10 things that could possibly go wrong in a code snippet of 10 lines - I want my C64 back!]

This is rather off-topic, but:

Seriously, what’s wrong with the code except that the iostream include and a surrounding main() function is missing? Let me guess:

1.) No, that’s fine.
2.) No, that’s also fine. Since the constructor is not explicit, you can implicitly convert from a float to the union type through the constructor.
3.) He said “C++”, not C. C++ has never needed typedef struct.
4.) C++ did not remove C-style casts.
5.) A pedantic ISO C++ compiler should have complained, as unnamed structs are not part of any C++ standard. But that would be a compile error, not undefined behavior.
6.) float does in fact have a size. The fact that different implementations give it a different size makes your code “implementation-defined”, not “undefined”. There’s a difference.
7.) No, that’s fine.
8.) … huh?
9.) No, that’s fine.
10.) No, that’s fine too.

If you really want to know, you’re not allowed to use unions to do type-punning (reading the bits of one value as though it were another value). At all. You cannot write to one member of a union and then read that value’s data through another union member.

It should be noted that any form of introspecting the bits of a float was going to have to be at least implementation-defined, since C++ does not require float to be IEEE-754 BINARY32 values.

Thanks a lot for your detailed answer, Alfonse.

Regarding:
5.) The code was compiled using VC++, which is (still) far from behaving like a pedantic ISO compiler. Regarding compile error vs. undefined behaviour, you’re right.
6.) Since float may in theory also be a 64/128/123456-bit type, the output would at least look pretty undefined (which, again, as you mentioned, isn’t undefined behaviour in the general sense)
8.) That was meant with regard to ADL/namespace qualification. There is a good chapter on this in Gottschling’s “Discovering Modern C++”, section 3.2.3 (I’m pretty sure he, resp. Addison-Wesley, is OK with me making this tiny part public):
Use using
Do not qualify namespaces of function templates for which user-type overloads might exist. Make the name visible instead and call the function unqualified.

Not doing so could, in contrast, really end in undefined behaviour (though hardly in the case of my small code snippet).

OK, the real reason is surprising. I understand not doing so with floating-point values. Also, there might be issues in cross-platform development related to endianness, “union{uint32_t x1; struct{uint8_t x[4];};};”, but I can’t see a real reason why type-punning is forbidden altogether (e.g. defining a single byte and making its two nibbles accessible this way).
But, to be honest, I’m currently not really interested in diving deeper into this topic, as I’m not planning to use punning. I just grabbed a few lines of sample code (from a top-rated answer on Stack Overflow) and modified them slightly, to finally discover that the exponent of a float in memory is base 2 and not, as I thought all the time, base 10. That in turn explains why ratchet’s suggestion of doing integer math on the exponent is a) possible and b) may be the reason hardware handles PoTT faster. And boom, we’re back on topic :slight_smile:

That’s an A for Alfonse. And bit-field packing, order, alignment and such are also largely implementation-defined. And float is an implementation-defined type too, as you said, not necessarily IEEE 754/IEC 559.

Anyway, it is possible, but how beneficial can replacing an fmul with an iadd be? Considering both tend to have a throughput of 1 CPI on modern hardware.

Hmm, what would be a good design to test it? Would something like a Hello Textured Cube suffice? What would be the problematic areas to test for specifically? In this thread I’ve heard mip-maps, resolve, and conventional sampling so far.

You’re assuming that the conversion from normalized to integer coordinates happens in the fragment shader rather than in dedicated texture lookup hardware. If it’s in the texture fetch unit itself, then it could certainly have a faster throughput with integer math, with a slower floating-point unit for those cases where it is needed.