PBO DMA transfers and completion

I’m using the commonly proposed leapfrogging PBO transfer technique, and I’m doing it on a separate thread (+ context) for each movie I play. Then each screen has its own context and accesses those frames.

My question is, how can I tell when the PBO transfer is complete? I know that if I access the texture while the transfer is still happening (on Linux, at least) then I can see an incomplete texture. So is there a way to ‘touch’ the pixels on the PBO that will force the transfer to complete, but not take any more time than needed to do that? I know it will block but I will only be doing this after the PBO has had some transfer time already.

Bruce

If you see an incomplete texture then it is probably a bug in your code. I guess the PBO buffer is not fully loaded with the movie frame, and that is why you see an incomplete texture.

That’s what I said. Since the PBO upload is one thread, and the use on another, I need a way to force completion of the transfer before I let the displaying thread use the texture.

Bruce

First, I don’t know whether any such mechanism exists. NVIDIA has an extension called fence (http://www.opengl.org/registry/specs/NV/fence.txt); have you looked at that? Have you tried reading back a single texel (e.g. the last) from the texture? Isn’t it conceivable that it would block, as you need?

Maybe you could give a more detailed explanation of what you’re doing and why? To me it sounds like you would be better off NOT using a PBO transfer if you need to be notified when it has completed.

Thanks, I’m aware of fence, but I think it blocks the whole context; I was hoping to force completion of just the one texture transfer.

I have a playback application, and each movie has a transfer thread, in which I copy the pixels to the PBO and start the DMA. It might be doing multiple frames at once though, so I use a job system that starts each off then sees what else needs to be done. For instance, a DMA transfer would start and then a CPU pixel copy can happen in the same thread. I could afford to block, but it would be less efficient. I can’t really spin another thread off, since the extra context needed would cause even more problems (and Nvidia advises keeping the context count down anyway).

If I were using the PBO in the same context/thread, then OpenGL would force completion, but since I need it in another thread, it ‘doesn’t know’.

Something like a single pixel read occurred to me (unless there’s something specific). Possibly I would want to read the top left and bottom right pixels, since I don’t know what order the upload happens in? Would that be a glReadPixels? Are there prerequisites that make that unworkable? For instance, what do I read ‘to’?

Bruce

Are you doing something like this?

Thread A (with OpenGL context):
while (…) {
    get available buffer ID

    glBindBuffer
    glMapBuffer
    copy video frame
    glUnmapBuffer
    send the buffer ID to another thread
}

Thread B (with OpenGL context):
while (…) {
    receive buffer ID from thread A (queue)
    glTexSubImage
    draw
    send buffer ID back to the pool of free buffers
}

Right?
If so, change it: the thread that loads data into the PBO (on the CPU) does not need an OpenGL context at all. You can do something like this:

Thread A (no OpenGL context):
while (…) {
    get available, already mapped buffer (memory address)

    copy video frame

    send the buffer to another thread
}

Thread B (with OpenGL context):
while (…) {
    receive buffer from thread A (queue)
    glUnmapBuffer
    glTexSubImage
    draw
    glMapBuffer
    send the remapped buffer back to the pool of free buffers
}

In this case you have only one OpenGL thread, so there cannot be any synchronization issues.
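For illustration, the second outline can be modelled without any GL at all. The following is a minimal C++ sketch, not anyone's actual code from this thread: `MappedBuffer`, `BlockingQueue` and `run_handoff` are hypothetical names, a `std::vector` stands in for the PBO storage, and the places where glMapBuffer, glUnmapBuffer and glTexSubImage would go are only marked by comments.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for one PBO. `mapped` plays the role of the pointer
// that glMapBuffer would return in the render thread.
struct MappedBuffer {
    int id = 0;
    std::vector<std::uint8_t> storage;
    std::uint8_t* mapped = nullptr;  // non-null while the buffer is "mapped"
};

// Minimal blocking queue, used in both directions between the two threads.
template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(value));
        }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T value = std::move(q_.front());
        q_.pop();
        return value;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

// Pushes `frames` frames through the pool and returns how many arrived on the
// render side with the expected contents.
int run_handoff(int frames, std::size_t frame_bytes, int pool_size) {
    assert(frame_bytes > 0 && pool_size > 0);
    std::vector<MappedBuffer> pool(pool_size);
    BlockingQueue<MappedBuffer*> free_buffers, filled_buffers;
    for (int i = 0; i < pool_size; ++i) {
        pool[i].id = i;
        pool[i].storage.resize(frame_bytes);
        pool[i].mapped = pool[i].storage.data();  // render thread: glMapBuffer
        free_buffers.push(&pool[i]);
    }

    // Thread A: no GL context, it only writes through already-mapped pointers.
    std::thread decoder([&] {
        for (int f = 0; f < frames; ++f) {
            MappedBuffer* b = free_buffers.pop();
            std::memset(b->mapped, f & 0xFF, frame_bytes);  // "copy video frame"
            filled_buffers.push(b);
        }
    });

    // Thread B (here the calling thread) would own every GL call.
    int good = 0;
    for (int f = 0; f < frames; ++f) {
        MappedBuffer* b = filled_buffers.pop();
        b->mapped = nullptr;  // glUnmapBuffer would go here
        // glTexSubImage from the bound PBO, then the draw, would go here.
        if (b->storage.front() == (f & 0xFF) && b->storage.back() == (f & 0xFF)) {
            ++good;
        }
        b->mapped = b->storage.data();  // glMapBuffer again
        free_buffers.push(b);
    }
    decoder.join();
    return good;
}
```

The two queues are the only synchronization in this model: the decoder never touches a buffer that the render side has not handed back, which is the property the outline relies on.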

Btw, NV fence does not work across two OpenGL contexts (threads).
That is why we are waiting for ARB_sync to do it. So far there are no signs of this extension.

>btw. NV fence does not work across two opengl contexts (threads).
>that is why we are waiting to get ARB_sync to do it. So far no signs of this extension.

You are not the only one, we would LOVE to see ARB_sync reinstated, but from the general deafening silence on what is a core OpenGL problem, I am not holding my breath.

I’m doing:
Thread A (+ AA etc.): decodes movie frames

Thread B (+ BB etc.): multiple working frame transfers interleaved, in which each frame does:
get PBO, bindBuffer, get buffer data, copy pixels, unmap buffer (x3 with planar), glTexImage2D to start DMA, unbind buffer, post frame to queue

Thread C, D, E, F: use frames from queue to draw by just binding the textures

Thanks for the fence news, but I would be using it on the same one, to test whether the frame was down before I post to the queue.

As for your proposed split, I think I see what you mean, but it puts the actual transfer in the same thread as drawing. I’d rather the copy and the transfer be on one thread, and my understanding is that the actual CPU to GPU transfer doesn’t start until glTexImage2D.

I was hoping that there’s a trivial command I can put after glTexImage2D that will force the thread to block until that one transfer finished.

Actually, I’m thinking of reworking things so that I do a post-buffer stage that uses the texture in the same thread (render to a half format FBO), which will force the texture to complete. Then I just have to use glFlush or glFinish to get the result safe to use on another thread, AFAIK.

Bruce

Do not worry about my code outline. The glTexSubImage call takes ~0 ms when using a PBO. The draw thread is not affected at all. I am loading several HD videos this way in my draw thread without a glitch.

Profile it if you don’t believe me.

BTW, I have had very bad experiences with the way you share textures, at least on NVIDIA Quadro cards. Sometimes the texture is not updated at all in the second context.

So, to check my understanding, you are triggering the DMA transfer in your draw thread (which takes no time), doing some other stuff, then using the texture after it’s had time to complete?

Can you explain how you use the memory address then? Don’t you have to map the data in the context? Then you leave that address mapped? I didn’t know you could do that. So can you do anything else in the draw thread while you’re mapped to that buffer?

Anyway, if I wanted to use that approach, I think I would have trouble since I have multiple draw threads (2 screens plus 2 offscreen) and multiple movies heading to the card that I want to keep separate.

That’s bad news about Quadro cards though. What platform? I’m using ATI & NV on Mac and Nvidia (non Quadro) on Linux.

Bruce

You map the PBO in the draw thread and pass the memory address to another thread. The memory address is the same for all threads in your process, even without an OpenGL context. Once the loading thread has filled the memory, you pass that information to the draw thread, which unmaps the buffer.

I do not know when the DMA is triggered. It is somewhere under the hood. I only know that the memory has strange access performance. It is very slow to read from and a little slower to write to than normal system memory. So, for example, some decompression algorithms that read back recently decompressed data are very slow (10x or more).

There should be no problem having more draw threads. You can share the textures that are already loaded.

I am mostly using NVIDIA on WinXP.

/Marek

In the GL (render) thread, you can create a pool of PBOs and map their pointers.
In the decoder thread, ask the pool for an unused PBO pointer, write the image data into it, and notify the pool that the PBO is filled with image data.
In the render thread, once per frame, check whether there is any pool notification, unmap the filled PBO and call glTexSubImage. glTexSubImage will be instant because it copies data from the currently bound PBO. After that, mark that PBO as map_its_buffer_in_next_frame (assume that the transfer will finish in the current frame). Do not try to map its pointer right after the glTexSubImage call… It will stall the CPU.
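The bookkeeping in those three steps can be sketched as a small state machine. This is only a model under stated assumptions: `PboPool`, `SlotState` and the method names are hypothetical, no GL is called (comments mark where glUnmapBuffer, glTexSubImage and glMapBuffer would sit), and the class is single-threaded, so a real pool shared with decoder threads would need a mutex around it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Life cycle of one PBO in the pool described above.
enum class SlotState {
    MappedFree,  // mapped and idle, a decoder may claim it
    Filling,     // a decoder is writing pixels through the mapped pointer
    Filled,      // decoder has notified the pool, waiting for the render thread
    InFlight     // unmapped, upload issued, must not be remapped this frame
};

class PboPool {
public:
    explicit PboPool(std::size_t count) : slots_(count) {}

    // Decoder side: claim an unused mapped slot, or -1 if none is available.
    int acquire() {
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            if (slots_[i].state == SlotState::MappedFree) {
                slots_[i].state = SlotState::Filling;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Decoder side: notify the pool that the slot holds a complete image.
    void mark_filled(int index) {
        Slot& s = slots_.at(static_cast<std::size_t>(index));
        assert(s.state == SlotState::Filling);
        s.state = SlotState::Filled;
    }

    // Render side, once per frame. Returns the number of uploads issued.
    int begin_frame(int frame) {
        int uploads = 0;
        for (Slot& s : slots_) {
            if (s.state == SlotState::InFlight && s.upload_frame < frame) {
                // Upload was issued on an earlier frame and is assumed done:
                // glMapBuffer would go here.
                s.state = SlotState::MappedFree;
            } else if (s.state == SlotState::Filled) {
                // glUnmapBuffer followed by glTexSubImage would go here.
                s.state = SlotState::InFlight;
                s.upload_frame = frame;
                ++uploads;
            }
        }
        return uploads;
    }

    SlotState state(int index) const {
        return slots_.at(static_cast<std::size_t>(index)).state;
    }

private:
    struct Slot {
        SlotState state = SlotState::MappedFree;
        int upload_frame = -1;  // frame on which the upload was issued
    };
    std::vector<Slot> slots_;
};
```

The point of the `InFlight` state is the last sentence above: a slot whose upload was issued on frame N is not remapped until frame N+1 at the earliest, so the map call is less likely to stall.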

Now, the question is: when is the texture data available for rendering? Well, you don’t know, but you can still use that texture object for texturing… The GPU will wait until all pending operations on that texture have finished before it uses it as a source for texturing.

You can reorder the operations in your render loop so that you process pool notifications before the SwapBuffers call. In this case the driver will initiate the DMA transfers and then swap buffers. If you turn on vsync, the vsync waiting time on the GPU should be used for the texture data transfer, and the CPU will be free to spend some time on the decoders.

Another idea is to use fences. Just set a fence right after each glTexSubImage call and later check the fence status. If it’s finished, you can use that texture; if not, use your CPU for something else.

Yes, I think that’s what mfort suggested, and I had my head around something like that - I do like the fact that since the GL stuff is in one thread, it wasn’t a problem. But in my case, I have 2-4 drawing threads, so even if I do that in one thread I still have to see when the other threads can use it.

Meanwhile, I’ve been back and forth about whether to take the next step of using shaders to convert the PBOs to floating point FBOs on the draw threads or the transfer thread. My current thinking (as of 4pm ; ) is that since I will also need to recompile shaders and find uniforms etc every time I change the movie - and pixel format, I can’t risk the time of doing that on the draw threads, therefore I need to draw to FBOs on the transfer threads.

So my problem actually changes from the esoteric PBO completion to the simpler question of when the FBO drawing is complete.

I know that I can first just glFinish, and hopefully that one thread only will stall, or I can (flush then) use a fence - using test to visit a few times, then wait if needed.

Thanks for the answers and extra knowledge. Love this forum.

Bruce

Just to check back in, I am now using fences (on Mac and Linux Nvidia) to detect when drawing is finished, and it’s working very well. It means I can break down and recreate shaders quite happily on the side thread, and do some nice overlapping (with a job pool) copies and transfers, then use a shader to convert to an RGB HALF from whatever the transfer format was. In general, testing the fence a few times (in a round robin way between tasks) shows when the operations are complete, and I only need to finish up to the fence occasionally.
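The "test a few times, round robin, then finish only if needed" pattern can be sketched roughly as below. It is a model, not the code used in the application above: `PendingJob` and `drain_round_robin` are hypothetical names, and the two callbacks are stand-ins for a non-blocking fence test (such as glTestFenceNV from the NV fence extension) and a blocking wait (such as glFinishFenceNV), so the block itself makes no GL calls.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// One outstanding GPU operation guarded by a fence.
struct PendingJob {
    std::function<bool()> test;    // non-blocking: has the fence been reached?
    std::function<void()> finish;  // blocking: wait until the fence is reached
    bool done = false;
    bool forced = false;           // true if we had to block for this job
};

// Polls every unfinished job once per round, for at most `max_rounds` rounds,
// then blocks on whatever is still outstanding. Returns how many jobs needed
// the blocking wait.
std::size_t drain_round_robin(std::vector<PendingJob>& jobs, int max_rounds) {
    std::size_t remaining = 0;
    for (const PendingJob& job : jobs) {
        if (!job.done) ++remaining;
    }

    for (int round = 0; round < max_rounds && remaining > 0; ++round) {
        for (PendingJob& job : jobs) {
            if (!job.done && job.test()) {
                job.done = true;
                --remaining;
            }
            // A real job system would do other useful work here (for example
            // the next CPU pixel copy) instead of polling back to back.
        }
    }

    std::size_t forced = 0;
    for (PendingJob& job : jobs) {
        if (!job.done) {
            job.finish();
            job.done = true;
            job.forced = true;
            ++forced;
        }
    }
    return forced;
}
```

Keeping the blocking call as a last resort means one slow transfer does not hold up jobs that have already completed, at the cost of a few extra polls per job.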

Avoiding contention is the hardest bit, but my current effort seems reasonable.

Thanks for all help.

Bruce