Improving the performance of current OpenGL code.

I am conceptualizing an approach to improve the current OpenGL implementation in an application. The application displays high-resolution images (the largest I have seen so far is 40k x 24k). These images are made of 512x512 tiles. (Ignore issues like 40k not being divisible by 512.)

Current approach:
[ul]
[li]If there are 4 textures of size 512x512 on screen and the window size is 1024x1024, then I have 4 threads holding 4 texture IDs. The textures are rendered with immediate mode so as to fill the entire window.
[/li][li]The textures change depending on panning, so new textures are created on a separate thread.
[/li][li]Zooming also happens, which changes the displayed image size, and new high-resolution image tiles are loaded.
[/li][/ul]
Conceptual approach:
[ul]
[li]I am thinking of getting rid of immediate-mode rendering.
[/li][li]First, I am planning to use a 2D texture array to store the textures and reduce the number of draw calls. (I am still a bit unclear about this approach; see the sketch after this list.)
[/li][li]Second, I am planning to use PBOs to load textures from RAM to the GPU via DMA.
[/li][li]Third, can I generate the textures once and then use glTexSubImage2D with a PBO to update them with new texture data as I pan the image? (One thing to worry about here is what happens when you zoom in/out, since the number of textures on the window changes.)
[/li][/ul]
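Roughly what I have in mind for the texture-array part is sketched below (untested, written against the C API for clarity even though my application is in Java; the layer count and function names are just placeholders, and I understand that sampling an array texture needs a shader with sampler2DArray rather than the fixed pipeline):

[code]
#include <GL/glew.h>   /* assumes a loader such as GLEW provides the GL 3.0 entry points */

/* Sketch: allocate one 2D texture array holding the visible 512x512 tiles.
 * 64 layers is an arbitrary example count. */
GLuint create_tile_array(void)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
    glTexImage3D(GL_TEXTURE_2D_ARRAY, 0, GL_RGBA8,
                 512, 512, 64,                        /* width, height, layers */
                 0, GL_BGRA, GL_UNSIGNED_BYTE, NULL); /* allocate storage only */
    glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    return tex;
}

/* Sketch: overwrite one layer in place when a new tile scrolls into view. */
void upload_tile(GLuint tex, int layer, const void *bgraPixels)
{
    glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
    glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0,
                    0, 0, layer,             /* x offset, y offset, layer */
                    512, 512, 1,             /* width, height, depth */
                    GL_BGRA, GL_UNSIGNED_BYTE, bgraPixels);
}
[/code]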
Any suggestions and tips on the above approach are helpful. If there are other methods to improve performance, they are welcome too. I am a beginner in OpenGL, so correct me if I have misunderstood some concept.

Profile and determine where your bottlenecks actually are before making guesses as to what approach to take. We can make some informed deductions, however.

In your case I can say with complete 100% absolute hand-on-heart confidence that for drawing 4 screen-aligned quads per frame, immediate mode is not a bottleneck. It doesn’t matter a sweet damn, in fact. Quake drew 100s of polygons using immediate mode back in 1996, it wasn’t a bottleneck then, and it’s not a bottleneck now.

With only 4 draw calls per frame, number of draw calls is not a bottleneck. (Once again, quake, 100s, 1996.) Also, it looks as though you possibly have one draw call per thread in your example; if so, this design is gaining you absolutely nothing and likely just making your code more complex than it need be. In fact threading overhead is probably going to outweigh any theoretical advantage.

You don’t seem to have any overdraw; you’re filling the entire screen each frame.

So your bottleneck is most probably going to be texture uploads then. It’s unclear from your post if you stream textures from disk, from memory, or if they’re coming from some other source, so I’m going to ignore that unless or until you can provide more detail.

A PBO will give you absolutely nothing if you must use the textures in the same frame as you load them. The whole point of a PBO is to be able to get an asynchronous upload from the PBO to the texture object, so you do the upload, then wait a frame or two, then use the texture: the application doesn’t need to block when doing the upload. It will probably need to block when loading from source to the PBO however. My point is: PBOs aren’t magic bullets; you can’t just say “use a PBO” and have an expectation of performance improvements without sitting down and thinking carefully about how you use the PBO.
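To make that concrete, the usual pattern is a small ring of PBOs that you fill on one frame and source the texture update from, so the transfer can overlap with other work. A rough sketch (names are illustrative; assumes the GL 1.5/2.1 buffer entry points are available through a loader such as GLEW):

[code]
#include <GL/glew.h>
#include <string.h>

#define TILE_BYTES (512 * 512 * 4)   /* one 512x512 BGRA tile */

static GLuint pbo[2];                /* double-buffered upload buffers */

void init_pbos(void)
{
    glGenBuffers(2, pbo);
    for (int i = 0; i < 2; ++i) {
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[i]);
        glBufferData(GL_PIXEL_UNPACK_BUFFER, TILE_BYTES, NULL, GL_STREAM_DRAW);
    }
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}

/* Copy a decoded tile into one PBO and kick off the PBO-to-texture transfer;
 * the texture is then actually used a frame or two later, so the driver does
 * not have to stall on this upload. */
void stream_tile(GLuint tex, int frame, const void *bgraPixels)
{
    GLuint upload = pbo[frame & 1];

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, upload);
    /* Orphan the previous contents so we never wait on an in-flight transfer. */
    glBufferData(GL_PIXEL_UNPACK_BUFFER, TILE_BYTES, NULL, GL_STREAM_DRAW);
    void *dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    if (dst) {
        memcpy(dst, bgraPixels, TILE_BYTES);   /* this copy is still CPU work */
        glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    }

    /* With a PBO bound, the last parameter is an offset into the PBO. */
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_BYTE, (const void *)0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}
[/code]

Note that the memcpy into the mapped buffer is still CPU work; the part you win is the buffer-to-texture transfer.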

Creating textures once then using glTexSubImage is a good approach. Creating and destroying resources at runtime is expensive.

You can get fast texture uploads if you select your source data formats and glTex(Sub)Image parameters carefully. First of all, if you’re uploading source data as 24-bit (GL_RGB) you’ll never get peak performance, as the driver will always need to do a slow format conversion as part of the upload. The best approach is to have your source data as 32-bit in BGRA ordering. On some hardware you’ll also need to use GL_UNSIGNED_INT_8_8_8_8_REV instead of GL_UNSIGNED_BYTE for the type parameter. See https://www.opengl.org/wiki/Common_Mistakes#Texture_upload_and_pixel_reads for more info on this.

If your source data is coming in as GL_RGB and you have little or no control over that, you’ll get better performance by converting it to BGRA yourself (using a one-time-only allocated chunk of memory), but ideally you want it coming in as BGRA with no conversions happening at all.
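Something along these lines, for example (rough sketch; the scratch buffer and function names are just illustrative):

[code]
#include <stdint.h>
#include <stdlib.h>

#define TILE_W 512
#define TILE_H 512

static uint8_t *g_bgraScratch;   /* allocated once, reused for every tile */

void init_scratch(void)
{
    g_bgraScratch = malloc((size_t)TILE_W * TILE_H * 4);
}

/* Repack 24-bit RGB tile data into 32-bit BGRA before the upload. */
const uint8_t *rgb_to_bgra(const uint8_t *rgb)
{
    for (size_t i = 0; i < (size_t)TILE_W * TILE_H; ++i) {
        g_bgraScratch[i * 4 + 0] = rgb[i * 3 + 2];  /* B */
        g_bgraScratch[i * 4 + 1] = rgb[i * 3 + 1];  /* G */
        g_bgraScratch[i * 4 + 2] = rgb[i * 3 + 0];  /* R */
        g_bgraScratch[i * 4 + 3] = 0xFF;            /* unused alpha */
    }
    return g_bgraScratch;
}
[/code]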

So I suggest that you switch over to creating textures once only, then (if necessary) fix up your glTex(Sub)Image parameters and source data before exploring other avenues.

Adding clarifications based on mhagain’s post.

The number of textured tiles I need to render per frame ranges from 4 to 200, depending on the zoom level of the image.

In my application, the user interacts with the image by zooming or panning it. The closest analogy is Google Maps: think of panning and zooming on it.

Thus, if I am on the current frame ‘f’, I don't know what the next frame ‘f+1’ will look like, as it depends on the user's action in the current frame (zoom, pan, etc.). What, then, is the best possible way to load the textures for frame ‘f+1’? From your line “A PBO will give you absolutely nothing…”, I understand that my PBO-based approach won’t lead to any performance benefits! Please clarify whether my understanding here is correct.

Since a PBO texture upload is an asynchronous call and doesn't consume CPU cycles, I am banking on it being faster than having the CPU do the upload without a PBO. Assuming this, does a PBO transfer still help for rendering the current frame, while the CPU is busy doing something else in my application?

Regarding the source data: they are JPEG RGB images (512x512 pixels) which are fetched from the hard disk depending on the user's action and form part of the 40k x 24k pixel image to be rendered. The loaded tile images are converted to an integer array (in Java) upon loading into RAM and are eventually uploaded as textures through OpenGL.

Thanks for your help.

[QUOTE]Since a PBO texture upload is an asynchronous call and doesn't consume CPU cycles, I am banking on it being faster than having the CPU do the upload without a PBO. Assuming this, does a PBO transfer still help for rendering the current frame, while the CPU is busy doing something else in my application?[/QUOTE]

If you’re not using a PBO to upload texture data, then what OpenGL will do is read data from your pointer into an internal buffer. It will then DMA from that buffer into video memory asynchronously.

So you’ll be getting an async DMA either way.

[QUOTE]they are JPEG RGB images (512x512 pixels)[/QUOTE]

Stop right there. 3-channel byte-wise formats are death when it comes to pixel uploads. You need to give OpenGL a 4-channel format. Even if you never use that fourth channel, you absolutely need to provide it.

So your OpenGL internal format should be GL_RGBA8. Your pixel transfer format should be GL_RGBA or GL_BGRA, depending on what the hardware likes best, and your data should be generated accordingly. The pixel transfer type should either be GL_UNSIGNED_BYTE or one of the 32-bit packed types.

Of course, this all assumes that your JPEG decoding library can give you 4-byte aligned pixels. If it can’t… find a better JPEG library.

If texture uploads are the bottleneck, uploading the YUV420 data and converting on the GPU will halve the bandwidth (and also avoid the alignment issue without requiring a padding byte). That assumes that the JPEG decoder will actually let you have that data rather than forcibly upscaling and converting to RGB.
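If you go down that route, the GPU side is roughly this (a sketch only: it assumes the decoder hands you the three planes separately, that you upload them as single-channel textures at full and half resolution, and that full-range BT.601 is close enough for your data):

[code]
/* Fragment shader that recombines separately uploaded Y, U and V planes.
 * texY is full resolution; texU/texV are half resolution, which works
 * transparently because texture coordinates are normalized. */
static const char *yuv_fragment_shader =
    "uniform sampler2D texY;\n"
    "uniform sampler2D texU;\n"
    "uniform sampler2D texV;\n"
    "varying vec2 uv;\n"
    "void main()\n"
    "{\n"
    "    float y = texture2D(texY, uv).r;\n"
    "    float u = texture2D(texU, uv).r - 0.5;\n"
    "    float v = texture2D(texV, uv).r - 0.5;\n"
    "    gl_FragColor = vec4(y + 1.402 * v,\n"
    "                        y - 0.344 * u - 0.714 * v,\n"
    "                        y + 1.772 * u,\n"
    "                        1.0);\n"
    "}\n";
[/code]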

Also, if you’re going to be zooming out by a large factor, it’s better to store mipmap levels on disk. When zoomed fully out, you’ll need all of the tiles, but only at a low resolution. You don’t want to have to upload the entire data set at full resolution for this case.
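Selecting the level to load can then be as simple as this (sketch; the zoom convention, where 1.0 means one image pixel per screen pixel, is an assumption):

[code]
#include <math.h>

/* Pick which on-disk mip level of the tile pyramid to load for a given
 * zoom factor; each halving of the zoom moves one level down the pyramid. */
int mip_level_for_zoom(double zoom, int maxLevel)
{
    int level = (int)floor(log2(1.0 / zoom));
    if (level < 0)        level = 0;
    if (level > maxLevel) level = maxLevel;
    return level;
}
[/code]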

Ok so I profiled the code and got the following results.

https://www.dropbox.com/home/Public?preview=profile.png

SwapBuffers is taking most of the time. Hope this gives you guys more info on what is happening.

Profiling was done over a period of time without much zooming/panning (i.e. glGenTextures wasn’t called much).

Without a Dropbox account, I cannot see the image. I assume this will be the same for other people. Please put it somewhere public. You can try this: https://postimage.org/

You certainly have vsync enabled; try disabling it and run your profiling again.
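On Windows that means setting the swap interval to 0, for example via WGL_EXT_swap_control (sketch only; check for the extension in real code, and with a Java binding such as LWJGL or JOGL use its own vsync setter instead):

[code]
#include <windows.h>
#include <GL/gl.h>

typedef BOOL (WINAPI *PFNWGLSWAPINTERVALEXTPROC)(int interval);

/* Disable vsync for profiling so SwapBuffers no longer waits for the
 * vertical blank. */
void disable_vsync(void)
{
    PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
        (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");
    if (wglSwapIntervalEXT)
        wglSwapIntervalEXT(0);   /* 0 = don't wait for the vertical blank */
}
[/code]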

Public link for dropbox image:
https://dl.dropboxusercontent.com/u/54943107/profile.png

Profiling was done with VisualVM.

[QUOTE=Remaldeep;1283459]Ok so I profiled the code and got the following results.

SwapBuffers is taking most of the time. Hope this gives you guys more info on what is happening.

Profiling was done over a period of time without much zooming/panning (i.e. glGenTextures wasn’t called much).[/QUOTE]

Just to clear up what seems a misunderstanding: glGenTextures doesn’t actually create a texture object. All that it does is give you unused texture names; no textures are actually created until you first call glBindTexture (which creates a default object in a default state) and first call glTexImage or glTexStorage (which actually specifies the texture and allocates storage for it).

Likewise glDeleteTextures makes no promises about freeing memory. An implementation is able to leave the memory allocated for subsequent reuse if its internal heuristics determine that to be the optimal approach.

Yes, you do have vsync enabled; your SwapBuffers has 20,594 invocations over a total of 314,757 milliseconds, which is just over 15 milliseconds each: this is vsync at 60 fps.

Your results for glBegin demonstrate what I mentioned above: over 500,000 invocations and its total time is still significantly lower than that of your wglMakeCurrent calls (which look like 2 per frame; glVertex and glTexCoord are effectively down in the noise); i.e. threading overhead far outweighs the cost of using immediate mode, and immediate mode is not a bottleneck for you.

It’s puzzling that you say you’re using immediate mode, yet from looking at your test results I also see calls to glBufferSubData and glDrawArrays; this suggests to me that you’re doing things a little differently from what you originally indicated, and perhaps you haven’t told us the full story about what you’re actually doing?

[QUOTE]It’s puzzling that you say you’re using immediate mode, yet from looking at your test results I also see calls to glBufferSubData and glDrawArrays; this suggests to me that you’re doing things a little differently from what you originally indicated, and perhaps you haven’t told us the full story about what you’re actually doing?[/QUOTE]

Since I am not the only one working on this, I think those calls are related to a text rendering library somewhere. The texture drawing doesn’t use those commands.

Profiling like this is pretty useless in OpenGL; you can't tell how long something really takes this way, because a GL call returning only means your end of the work is done, not that the driver or GPU has finished it. Obviously it doesn't really take that long to swap buffers. You will have to time the whole frame, change things around, and see whether the timing improves.

But if this is really what you are doing, then the image transfer is obviously most of your overhead. Maybe you can keep some of the images resident on the card, depending on how often they change out.
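The simplest way to time a whole frame is to force the GPU to finish before stopping the clock, for example (rough sketch in C; draw_frame and swap_buffers stand in for your own code, and in Java you would use System.nanoTime instead):

[code]
#include <GL/gl.h>
#include <stdio.h>
#include <time.h>

extern void draw_frame(void);    /* placeholders for your own code */
extern void swap_buffers(void);

/* Brute-force whole-frame timing: glFinish() blocks until the driver and
 * GPU have completed all queued work, so the measured interval reflects
 * the real cost of the frame rather than the cost of issuing the calls. */
void timed_frame(void)
{
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);

    draw_frame();
    swap_buffers();
    glFinish();                  /* wait until everything is actually done */

    timespec_get(&t1, TIME_UTC);
    double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("frame: %.2f ms\n", ms);
}
[/code]

The glFinish() is only for measurement; don't leave it in for normal rendering.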