glClientWaitSync always times out, glMapBufferRange always stalls

Background

I completely don’t understand what’s wrong with my attempts at asynchronous frame downloading from an FBO. I’m currently trying to get high FPS in fullscreen rendering with OpenGL.

Well, it’s OpenGL ES 3.0; I use the GL functions from QOpenGLExtraFunctions (Qt framework), but I don’t think that context matters here.

I have a background OpenGL rendering thread, which for now draws nothing and just reads frames from the FBO without pauses.

My screen resolution is 1920x1080 pixels, so the FBO has the same size.

I realised that glReadPixels is too slow to transfer frames this big over PCIe from the NVIDIA video card to RAM: I get about 55 FPS, but I want 60 FPS.

Then I learned about PBOs and got an idea: I can copy frames from the FBO into PBOs and asynchronously transfer them to my storage in RAM after calling glMapBuffer. (Create a buffer with GL_PIXEL_PACK_BUFFER, bind it and call glReadPixels; in this case glReadPixels copies the pixels into the PBO in video card memory, not into storage in RAM, and returns immediately because a GL_PIXEL_PACK_BUFFER is bound.) And while the latest frame is being written to RAM, I map and draw the previous frame, which (as we hope) has already been completely transferred to RAM.

I also read about shared contexts for multithreading, but as I understood it, the best solution for performance is one thread for one context with asynchronous data downloading/uploading; just forget about shared contexts.

glMapBufferRange stalling issue

So, as a simple example, I have two buffers. Here is what I observed:


glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[0]);
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0); // ~50 microsecs

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[1]);
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0); // ~50 microsecs
    
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[0]);
glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, width * height * 4, GL_MAP_READ_BIT); // ~20000 microsecs
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0); // ~50 microsecs
    
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[1]);
glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, width * height * 4, GL_MAP_READ_BIT); // ~15 microsecs
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0); // ~50 microsecs

// keep mapping and glReadPixel'ing pbo[0] and pbo[1] with same call durations

As you can see, mapping the first PBO stalls the CPU for 20 ms, but mapping the second PBO is practically a no-op.

But I need the mappings of the two PBOs to take about the same time.

As I understand it, this means that mapping the first buffer forces a synchronization: OpenGL has to finish the glReadPixels into the 1st PBO before glMapBufferRange can return, because I’m trying to map the same PBO that is still in use by a queued GL command (glReadPixels). And instead of waiting only for the 1st glReadPixels to finish, GL apparently flushes all already-queued commands, including the 2nd glReadPixels.

But! When I place std::this_thread::sleep_for(10ms) before every glMapBufferRange, I get the same durations. So even when my CPU thread has waited long enough before calling glMapBufferRange, the call for the 1st PBO still takes 20 ms! That’s why the title says “glMapBufferRange always stalls”.

Otherwise, I have no idea what’s happening. So, did I understand this right?

glClientWaitSync timeout issues

Then I learned about OpenGL synchronization objects, which are inserted into the GL command queue; when such an object has been processed by GL and becomes signaled, it means that all commands queued before it have been processed.

So I wanted to insert glFenceSync just after glMapBufferRange and glClientWaitSync just before glMapBufferRange, or after/before glReadPixels, to make my frames update evenly. But I haven’t tried that yet, because my sync objects simply don’t work properly.

For now I’m trying to execute just this:


GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
while (true)
{
	GLenum syncRes = glClientWaitSync(fence, 0, 1000); // timeout is in nanoseconds: 1000 ns = 1 µs
	switch (syncRes)
	{
		case GL_ALREADY_SIGNALED: qDebug() << "ALREADY"; break;
		case GL_CONDITION_SATISFIED: qDebug() << "EXECUTED"; break;
		case GL_TIMEOUT_EXPIRED: qDebug() << "TIMEOUT"; break;
		case GL_WAIT_FAILED: qDebug() << "FAIL"; break;
	}
	if (syncRes == GL_CONDITION_SATISFIED || syncRes == GL_ALREADY_SIGNALED) break;
}
glDeleteSync(fence);

This loop becomes infinite and always prints “TIMEOUT”. So, as I understand it, GL just never processes this sync fence, although I’ve inserted it into the command queue.

So what’s wrong with my use of sync fences?

As you can see, mapping the first PBO stalls the CPU for 20 ms, but mapping the second PBO is practically a no-op.

What exactly do you expect to happen here? You told OpenGL that you wanted to do an async transfer into a buffer. Then you told OpenGL that you’re going to read from that buffer. Which means you have to be able to see all of the data in that buffer, which includes the results of the transfer. Therefore, OpenGL must synchronize with the async process you just started.

You may as well have just used glReadPixels into client memory directly.

Remember: OpenGL is a synchronous API. It allows things to behave asynchronously, but only so far as everything still works “as if” it were synchronous. Which means that, so long as you don’t look at the result of a process, it can be executed asynchronously. If you actually look, the implementation must synchronize.

So if you want to make an async transfer actually improve performance, you have to wait before you access the buffer. Ideally at least one frame long. And if you’re going to busy-wait on a fence issued after the transfer, there’s really no point in having the fence; just map the buffer.
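A minimal sketch of that advice, in my own words (not the original poster’s code), assuming the two-PBO setup from the question: start the transfer into one PBO, but map the other one, whose transfer was issued a whole frame earlier.

```cpp
// Sketch only: assumes a current GL context, an FBO bound for reading, and
// two pre-allocated GL_PIXEL_PACK_BUFFER objects pbo[0] and pbo[1], each
// width * height * 4 bytes. Cannot run standalone without a GL context.
int write = 0;                       // PBO receiving this frame's pixels
for (;;) {
    // Start the async transfer of the current frame into pbo[write].
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[write]);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0);

    // Map the OTHER buffer, whose transfer was started one frame ago, so the
    // driver (usually) no longer has to stall to satisfy the map.
    const int read = write ^ 1;
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[read]);
    void* ptr = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0,
                                 width * height * 4, GL_MAP_READ_BIT);
    if (ptr) {
        // ... copy ptr into CPU-side storage here ...
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    write = read;                    // ping-pong between the two PBOs
}
```

The cost of this scheme is one frame of extra latency on the readback, which is exactly the trade the answer above describes.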

So I wanted to insert glFenceSync just after glMapBufferRange and glClientWaitSync just before glMapBufferRange, or after/before glReadPixels, to make my frames update evenly. But I haven’t tried that yet, because my sync objects simply don’t work properly.

Sync objects have to be properly flushed; if you don’t flush, they may never become signaled. This is why glClientWaitSync can take the GL_SYNC_FLUSH_COMMANDS_BIT flag.
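Applied to the loop from the question, a hedged sketch of what that looks like (my illustration, not the original poster’s code): pass GL_SYNC_FLUSH_COMMANDS_BIT on the first wait, and note that the timeout argument is in nanoseconds, so the original 1000 was only 1 µs per wait.

```cpp
// Sketch only: assumes a current GL context; cannot run standalone.
// glClientWaitSync's timeout is in nanoseconds: 1'000'000 ns = 1 ms per wait.
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
GLbitfield flags = GL_SYNC_FLUSH_COMMANDS_BIT;  // flush once, on the first wait
for (;;) {
    GLenum res = glClientWaitSync(fence, flags, 1'000'000);
    flags = 0;  // the queue has been flushed; no need to request it again
    if (res == GL_ALREADY_SIGNALED || res == GL_CONDITION_SATISFIED)
        break;                                  // fence reached: commands done
    if (res == GL_WAIT_FAILED)
        break;                                  // GL error: check glGetError()
    // GL_TIMEOUT_EXPIRED: keep waiting, or go do useful work and retry later
}
glDeleteSync(fence);
```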

Thank you for reply!

But I want to repeat myself:

When I place std::this_thread::sleep_for(10ms) before every glMapBufferRange, I get the same durations. So even when my CPU thread has waited long enough before calling glMapBufferRange, the call for the 1st PBO still takes 20 ms! That’s why the title says “glMapBufferRange always stalls”.

So I give OpenGL enough time to finish all commands before I map my first PBO: 10 ms, or 30 ms, or 100 ms, whatever. But the map still takes 20 ms! Why? I really have no idea. I’m sorry if I’m missing something obvious.

Well, someone else answered that I should call glFlush() before std::this_thread::sleep_for(), to force the queued commands to be submitted; I hadn’t thought about that at all. Now, by the time the thread wakes up, the buffers are already loaded.

And yes, I totally missed GL_SYNC_FLUSH_COMMANDS_BIT in glClientWaitSync().

Thank you!

This topic was automatically closed 183 days after the last reply. New replies are no longer allowed.