Minimal latency video rendering with real-time input

Hi -

The title generally describes what I am trying to figure out…

I have spent a lot of time looking through forums and Google trying to come up with just the right scheme for jitter- and tear-free rendering of live HD video with the least amount of latency. Here is what I am currently trying to do:

  1. Acquire an uncompressed HD video frame (RGB) on a capture thread at 29.97 FPS.
  2. Using a shared context, transfer this frame to graphics memory with glBufferSubData() into a PBO created with GL_DYNAMIC_COPY usage.
  3. A separate thread does the rendering; after a glXSwapBuffers() I assume I have ~16 ms (VSYNC at 60 Hz) to update the texture from the PBO before glXSwapBuffers() must be called again.

The pseudo-code below is incomplete, but should give you an idea of what I am likely doing wrong.


acquisition_thread {
  wait_for_frame()
  glXMakeCurrent(CTX2)        // shared context, made current every frame
  glBindBuffer()
  glBufferSubData(GL_RGB)     // copy the new frame into the PBO
  glBindBuffer(0)
  glXMakeCurrent(0)
}

render_thread {
  glXMakeCurrent(CTX1)
  glBindTexture()
  glBegin(GL_QUADS)
  ...
  glEnd()
  glBindTexture(0)
  glXSwapBuffers()
  
  if (new_PBO) {
    glBindTexture()
    glBindBuffer()
    glTexSubImage2D(GL_RGB, 0)
    glBindBuffer(0)
    glBindTexture(0)
  }
  
  glXMakeCurrent(0)
}

Because I am using VSYNC, I have avoided the tearing issue, but have created a very choppy video scene. I am developing this on Linux with nVidia Quadro hardware and very recent drivers from nVidia.

Thanks in advance for the help, I am really at a loss here!

Why are you constantly unbinding and rebinding the context in each thread? Swapping contexts is expensive. I'd suggest making the context for each thread current once, and then leaving it.

Also, I'd suggest having only one GL context for both threads, owned by the render thread. Have the render thread map and unmap the buffer used by the acquisition thread, and in between just tell the acquisition thread what pointer to dump the image data into. The background thread should not be using the GL context or issuing GL commands.

There may be more efficient ways, but that is one way to get rid of “all” of your GL context swapping.

Also, I would not try to render with a new frame in the render thread until a frame or two after you have given it to the driver (i.e. unmapped the buffer). Give the driver time to get the frame over to the GPU before you try to draw with it; otherwise you risk blocking the entire rendering pipeline on a full app-GPU synchronization while the driver shuffles the image over to the GPU. To keep things smooth, you might find you need to buffer 3 or 4 frames ahead (in multiple buffers or in the same buffer).
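To make the buffering-ahead idea a bit more concrete, here is a rough (untested) sketch of a small ring of PBOs: you fill the newest buffer each frame but source the texture update from the one you filled a couple of frames back. Names like video_tex, dpy, win, and draw_quad() are just placeholders for whatever you already have:

#define NUM_PBOS 3                  // fill one, draw from one filled ~2 frames ago

GLuint pbo[NUM_PBOS];               // created/allocated elsewhere
int    fill_index = 0;

void upload_and_draw_frame(const void *pixels, int width, int height)
{
  // Newest captured frame goes into the "fill" PBO.
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[fill_index]);
  glBufferSubData(GL_PIXEL_UNPACK_BUFFER, 0, width * height * 3, pixels);

  // The texture update sources the *oldest* PBO in the ring, so the driver
  // has already had a couple of frames to move that data to the GPU.
  int draw_index = (fill_index + 1) % NUM_PBOS;
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[draw_index]);
  glBindTexture(GL_TEXTURE_2D, video_tex);             // video_tex: your texture
  glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                  GL_RGB, GL_UNSIGNED_BYTE, 0);        // offset 0 into bound PBO
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

  draw_quad();                      // your textured quad
  glXSwapBuffers(dpy, win);         // dpy/win: your display and window

  fill_index = (fill_index + 1) % NUM_PBOS;
}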

One other thing: there are more efficient ways to stream data over to the GPU than map/fill/unmap. Definitely check out the [b]OpenGL Insights[/b] chapter called [i]Asynchronous Buffer Transfers[/i] (available as a PDF). If you find it useful, buy the book!

glMapBufferRange with MAP_WRITE, MAP_UNSYNCHRONIZED, and MAP_INVALIDATE_RANGE for filling a buffer can be pretty darn efficient at avoiding costly app-GPU synchronizations (with MAP_WRITE and MAP_INVALIDATE_BUFFER for orphaning). But before you get too attached to that, be sure to check out the hot-off-the-presses presentation from Cass Everitt and John McDonald @ NVidia where they describe how to do even better with ARB_buffer_storage, avoiding even the implicit sync between the app and the driver that occurs with unsynchronized maps. Kudos to Prune for pointing this one out recently.
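In GL-speak those tokens are GL_MAP_WRITE_BIT, GL_MAP_UNSYNCHRONIZED_BIT, GL_MAP_INVALIDATE_RANGE_BIT, and GL_MAP_INVALIDATE_BUFFER_BIT. Very roughly (untested sketch; pbo, offset, frame_bytes, buffer_bytes, and frame_pixels are placeholders):

// Fill path: write straight into the PBO without forcing a sync. With
// UNSYNCHRONIZED you are promising the GL that the GPU is no longer reading
// this region (e.g. because you are a few buffers ahead in a ring).
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
void *dst = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, offset, frame_bytes,
                             GL_MAP_WRITE_BIT |
                             GL_MAP_INVALIDATE_RANGE_BIT |
                             GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(dst, frame_pixels, frame_bytes);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

// Orphaning variant: invalidate the whole buffer so the driver can hand you
// fresh storage instead of waiting for the GPU to finish with the old data.
void *dst2 = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, buffer_bytes,
                              GL_MAP_WRITE_BIT |
                              GL_MAP_INVALIDATE_BUFFER_BIT);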

First, thanks so much for taking the time to respond.

As a beginner with OpenGL, I was under the impression that any gl* API calls had to be made within a current context, hence the shared contexts and constant swapping between threads. I had actually read about the expense of constant swapping, but couldn't figure out how to do without it (I haven't tried yet, but I can see how mapping would allow this).

I hadn't considered mapping, as I thought it was reserved for direct drawing (updating a portion of pixels) and would perform worse. OK, enough words; here is some code to make sure I am on the same page.


class GLCanvas {

  GLvoid *mPixelMap;

acquisition_thread {
  while(1) {
    wait_for_frame(tFrame)
    memcpy(mPixelMap, tFrame)   // mPixelMap points into the currently mapped PBO
    cond_signal()
  }
}
 
render_thread {
  glXMakeCurrent(CTX)

  while (1) {
    glBindTexture()
    glBegin(GL_QUADS)
    ...
    glEnd()
    glBindTexture(0)
    glXSwapBuffers()
 
    if (cond_timedwait()) {
      glBindTexture()
      glBindBuffer()
      glUnmapBuffer()               // release the pointer before sourcing from the PBO
      glTexSubImage2D(GL_RGB, 0)    // update texture from the (now unmapped) PBO
      glBindTexture(0)
      mPixelMap = glMapBuffer()     // re-map for the acquisition thread's next frame
      glBindBuffer(0)
    }
  }
 
  glXMakeCurrent(0)
}

};

This is what I think you mean by removing all the GL commands from the acquisition thread. In the meantime, I'll write this up and give it a shot. I had actually come across the OpenGL Insights PDF you mentioned, but because I had tunnel vision about not needing mapping, it just didn't make sense at the time. Too bad that book isn't available on the Apple store, or I would have already picked up a copy.

Thanks again for the great suggestions!

Yes, something like that, with appropriate map flags and buffering added.
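For example (rough sketch only, untested), with two PBOs and the map flags from my earlier post, your update block might end up looking something like this; pbo[], cur, tex, width, height, and frame_bytes are placeholders for your own names:

if (cond_timedwait()) {
  // Unmap the PBO the acquisition thread just finished filling and use it
  // as the pixel source for the texture update.
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[cur]);
  glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
  glBindTexture(GL_TEXTURE_2D, tex);
  glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                  GL_RGB, GL_UNSIGNED_BYTE, 0);
  glBindTexture(GL_TEXTURE_2D, 0);

  // Hand the *other* PBO to the acquisition thread for the next frame.
  // MAP_INVALIDATE_BUFFER lets the driver orphan it rather than stall.
  cur = 1 - cur;
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[cur]);
  mPixelMap = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, frame_bytes,
                               GL_MAP_WRITE_BIT |
                               GL_MAP_INVALIDATE_BUFFER_BIT);
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}

With something like that, the acquisition thread still only ever sees a raw pointer (mPixelMap) and never touches GL.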