How to change one VBO based on the content of another?

Hi All,

In order to avoid the unnecessary stress of walking an unknown path, I would rather ask the community to help me answer the following question:

How can I change the content of one VBO based on the content of another, using only the OpenGL API?

It is not ordinary copying. For example, I need to create an upsampled/downsampled version of data already stored in another VBO, or something similar. I have implemented this using CUDA, but that approach has several drawbacks:

  1. The registration process (cudaGraphicsGLRegisterBuffer) is a pretty lengthy operation. Application initialization time becomes unacceptably long if I register all the VBOs I want to use later.

  2. Whenever a transformation should occur, the buffers involved have to be mapped (cudaGraphicsMapResources + cudaGraphicsResourceGetMappedPointer). Mapping requires all pending GL calls to finish before CUDA starts to use the resources.

  3. Unmapping (cudaGraphicsUnmapResources) also creates a stall in the pipeline while waiting for CUDA to finish all pending calls (see the sketch below).
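To illustrate, the interop round-trip looks roughly like this (a sketch using the CUDA runtime API; the kernel launch is elided):


//One-time (and, as noted above, slow) registration of a VBO.
cudaGraphicsResource* res;
cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsRegisterFlagsNone);

//Per-transformation: map, get a device pointer, transform, unmap.
float* devPtr;
size_t size;
cudaGraphicsMapResources(1, &res);   //waits for pending GL calls
cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &size, res);
//... launch the CUDA kernel on devPtr ...
cudaGraphicsUnmapResources(1, &res); //subsequent GL calls wait for CUDA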

Much of the Map/Unmap time can be hidden by smart code reorganization, but it cannot be completely eliminated. Application initialization (registration) is the most irritating issue. There are also other issues, but I believe they are driver dependent and currently cause no trouble (like different VBO memory footprints after mapping with CUDA). With all of the above in mind, I would like to use only the GL API to achieve similar results and avoid the stalls imposed by CUDA/GL interoperability.

Before I start banging my head against the wall, I would like to hear your opinions and suggestions.

Thank you in advance!

For example, I need to create an upsampled/downsampled version of data already stored in another VBO, or something similar.

What do you mean specifically by “upsampled” or “downsampled?”

If what you’re doing has a 1:1 correspondence between the input and the output, then you can just use transform feedback and do your processing in the vertex shader. You’ll have to decide how to pass the data as vertex attributes and then pipe the transform feedback outputs to the buffer in the order you require.

The 1:1 correspondence doesn’t mean one vertex attribute in to one value out. You could easily have two attributes input that combine to form one output. Or 4 attributes to make 6 outputs. Each attribute can be any size, up to a vec4, and each output can be any size up to a vec4. It’s all about how you set your attributes and outputs up.
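For completeness, here is a minimal sketch of how you declare which outputs get captured (the varying names here are hypothetical):


//Declare, before linking, which vertex shader outputs transform
//feedback should capture, and in what order.
const char* varyings[] = { "outValueA", "outValueB" };
glTransformFeedbackVaryings(program, 2, varyings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(program);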

There are also games you can play with attribute definitions that may allow you to get around some of the attribute value issues. That should be good enough for many uses.

So if that limitation is too much, you’ll need to get into extensions like EXT_shader_image_load_store. That extension might be a GL 4.0+ class feature, though.

Thank you, Alfonse!

I’ll consider using transform feedback for the purpose, but combined with a texture buffer object.

By downsampling I mean “creating one out of four”, generally; but there can also be various other combinations (1->1 or 2->1). Upsampling practically makes the buffer denser by inserting new vertices using appropriate interpolation (similar to tessellation). In both cases the sizes of the source and destination buffers differ (e.g. (n+1)x(n+1) -> (2n+1)x(2n+1) for upsampling, and vice versa for downsampling).

Maybe I could use tessellation for upsampling?

It is not a problem to restrict the solution to SM5 graphics cards if it is efficient enough. Currently, I’ve got a very efficient execution on the CPU. I just want to remove the data transfer to graphics card memory and reduce the main memory footprint (I keep a copy of each VBO in main memory to avoid reading back from graphics card memory). The source data is already in graphics card memory, and the transformed data should end up in graphics card memory too, so it seems that moving the recalculation to the GPU could be beneficial.

By downsampling I mean “creating one out of four”, generally; but there can also be various other combinations (1->1 or 2->1).

That didn’t really answer the question. What algorithm are you trying to use? Are you saying that every group of 4 floats in the input translates to 1 float in the output? Or is it a sliding window, where each consecutive group of 4 floats becomes a single float in the output?

In the first case, an input data stream of 8 floats would build an output data stream of 2 floats. In the second case, the 8-float input would build 5 floats of output: the first output value is constructed from floats 0, 1, 2, 3, the second from 1, 2, 3, 4, and so forth.

You can technically do both. The first case is pretty simple to set up.


//Set up the input buffer object and attributes in C++.
glBindBuffer(GL_ARRAY_BUFFER, buffer);
glEnableVertexAttribArray(0);
glEnableVertexAttribArray(1);
glEnableVertexAttribArray(2);
glEnableVertexAttribArray(3);
glVertexAttribPointer(0, 1, GL_FLOAT, GL_FALSE, 16, (void*)0);
glVertexAttribPointer(1, 1, GL_FLOAT, GL_FALSE, 16, (void*)4);
glVertexAttribPointer(2, 1, GL_FLOAT, GL_FALSE, 16, (void*)8);
glVertexAttribPointer(3, 1, GL_FLOAT, GL_FALSE, 16, (void*)12);

//Set up transform feedback: bind the output buffer to the TF
//binding point and discard rasterization ('tfBuffer' assumed to exist).
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, tfBuffer);
glEnable(GL_RASTERIZER_DISCARD);

glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, 2);
glEndTransformFeedback();

glDisable(GL_RASTERIZER_DISCARD);

The second case is similar, but the array stride is different.


//Set up the input buffer object and attributes in C++.
glBindBuffer(GL_ARRAY_BUFFER, buffer);
glEnableVertexAttribArray(0);
glEnableVertexAttribArray(1);
glEnableVertexAttribArray(2);
glEnableVertexAttribArray(3);
glVertexAttribPointer(0, 1, GL_FLOAT, GL_FALSE, 4, (void*)0);
glVertexAttribPointer(1, 1, GL_FLOAT, GL_FALSE, 4, (void*)4);
glVertexAttribPointer(2, 1, GL_FLOAT, GL_FALSE, 4, (void*)8);
glVertexAttribPointer(3, 1, GL_FLOAT, GL_FALSE, 4, (void*)12);

//Set up transform feedback as before: bind the output buffer
//and discard rasterization ('tfBuffer' assumed to exist).
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, tfBuffer);
glEnable(GL_RASTERIZER_DISCARD);

glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, 5);
glEndTransformFeedback();

glDisable(GL_RASTERIZER_DISCARD);

In both cases your shader looks like this:


#version 330

layout(location = 0) in float first;
layout(location = 1) in float second;
layout(location = 2) in float third;
layout(location = 3) in float fourth;

out float result; //"output" is a reserved word in GLSL

void main()
{
  //Do stuff; for example, average the four inputs:
  result = (first + second + third + fourth) * 0.25;
}

As for upsampling, that can be achieved easily enough by writing multiple outputs from the shader. The number of outputs that can be written by transform feedback is implementation defined. And the written data is limited to 32-bit floats or 32-bit integers (you can’t pack the data the way you can with attributes).
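As a sketch of the multiple-output idea (a hypothetical 1-in/2-out upsample; the names are made up):


#version 330
//Each input vertex carries a value and its right neighbor.
layout(location = 0) in vec2 pair;

//Two captured outputs per input vertex: the original value
//and its midpoint with the neighbor.
out float original;
out float midpoint;

void main()
{
  original = pair.x;
  midpoint = 0.5 * (pair.x + pair.y);
}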

Thank you, Alfonse!

Downsampling is pretty easy to implement just by setting up the index buffer to jump appropriately. I’m sorry, but I only figured it out after the previous post.
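(Roughly the idea, assuming the grid layout discussed below; the loop builds an index buffer that picks every other vertex of every other row.)


//Select every other vertex of every other row of a
//(2n+1) x (2n+1) grid, yielding (n+1) x (n+1) points.
std::vector<GLuint> indices;
const int srcDim = 2 * n + 1;
for (int i = 0; i < srcDim; i += 2)
    for (int j = 0; j < srcDim; j += 2)
        indices.push_back(i * srcDim + j);
//Draw with glDrawElements(GL_POINTS, ...) under transform feedback.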

The problem with upsampling is (thus far) more complicated. This is, in fact, a 2D problem. Let’s denote a vertex attribute value in the destination buffer with V’, and the value in the source buffer with V. The source buffer has (n+1)x(n+1) vertices, and the destination has (2n+1)x(2n+1) vertices.
If ind is an index in the source buffer, then the 2D indices are i = ind / (n+1) and j = ind % (n+1).

The values in the destination buffer should be calculated this way:


V'[(4n+2)i+2j]      =  V[(n+1)i+j]
V'[(4n+2)i+2j+1]    = (V[(n+1)i+j] + V[(n+1)i+j+1])   / 2
V'[(4n+2)i+2j+2n+1] = (V[(n+1)i+j] + V[(n+1)(i+1)+j]) / 2
V'[(4n+2)i+2j+2n+2] = (V[(n+1)i+j] + V[(n+1)i+j+1] + V[(n+1)(i+1)+j] + V[(n+1)(i+1)+j+1]) / 4

/*---------------------------------------------------------------
 O - Original value
 X - Value calculated from two neighboring O
 W - Value calculated from four neighboring O

 Source     Destination
 O----O      O---X---O
 |    | ->   |   |   |
 O----O      X---W---X
             |   |   |
             O---X---O
-----------------------------------------------------------------*/

I still don’t have any idea how to solve this, even with a TBO. Each vertex is not aware of its position inside the matrix, and I don’t want to introduce new attributes. :(

Maybe several iterations would solve the problem, each with a separate calculation formula and a different output buffer layout.

The source buffer has (n+1)x(n+1) vertices, and the destination has (2n+1)x(2n+1) vertices.

You can’t do that. It would require the shader to not output certain values at certain locations (the right and bottom edges) or to overwrite old data. Neither of these is possible; the stride magic that allows you to read an input value into several invocations of the shader doesn’t work with transform feedback.

If you have n input values, you can only get k*n output values. If you were content with not outputting the right and bottom edges at all, you could do it: each input quad returns an output quad. The top-left of the output is the top-left of the input, and the other 3 values are the interpolations based on the input.
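A sketch of what such a shader might look like (attribute and output names are made up; the four attributes hold the corners of one input quad):


#version 330
layout(location = 0) in float topLeft;
layout(location = 1) in float topRight;
layout(location = 2) in float bottomLeft;
layout(location = 3) in float bottomRight;

//Four captured values per input quad: the original top-left
//corner plus the three interpolants.
out float outCorner;
out float outTopEdge;
out float outLeftEdge;
out float outCenter;

void main()
{
  outCorner   = topLeft;
  outTopEdge  = 0.5  * (topLeft + topRight);
  outLeftEdge = 0.5  * (topLeft + bottomLeft);
  outCenter   = 0.25 * (topLeft + topRight + bottomLeft + bottomRight);
}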

Honestly, you should probably go back to CUDA or OpenCL. You could do what you want with a geometry shader (take a quad list as input, then output 4 quads), but that will likely be slower than any GPU stalls you might get from CUDA locking.

Maybe several iterations would solve the problem, each with a separate calculation formula and a different output buffer layout.

Do you think that this will be faster than the stalls you get with CUDA?

So, let’s go back to CUDA… :(

I don’t know. I have to try.

Thank you very much for the fast response!

I have one more question to ask. It is a little off-topic, but I won’t start a new thread for it.

Do you have any experience with updating VBOs from multiple threads using shared contexts?

From everything I have read about NV GeForce drivers, I think they serialize access to the GPU. Quadro drivers have some level of parallelization, but GeForce drivers don’t (or at least it is disabled).
Apart from better CPU utilization, do you think GPU execution can benefit at all from such parallel data loading?
I’ll certainly try it by myself, but your opinion would be valuable.

I have implemented pure OpenGL resampling in my application.
Upsampling requires 5 passes, but it is fast enough.
However, there are some observations I want to share with you.

  1. The glBeginTransformFeedback() call is far more expensive than I thought.

  2. The duration of glDeleteBuffers() calls varies stochastically, from almost immediate execution to several orders of magnitude slower (from 1.7 µs to 297 µs on a GTX 470). I tried to find out why this happens, but I failed. :(

The second problem made me allocate the transform feedback buffers statically at application startup and reuse them whenever I need TF. Although that solved the problem in this case, the issue remains in general, and it is severe: deleting the buffers can take three times longer than creating 5 buffers plus running 5 iterations with shader changes (the whole upsampling procedure).

Does anyone know why this peculiar behavior occurs?

I tried to find out why this happens, but I failed.

Because you deleted a buffer. There is no reason at all for you to explicitly be deleting a buffer. Orphaning can help in some cases, but not full-on glDeleteBuffers.

Can you elaborate on this? I really don’t understand why deletion time varies, and especially why it can be more expensive than a whole (and pretty complicated) sequence of other OpenGL commands.

Can you elaborate on this?

I would wager that deleting objects is not something implementations expect you to do in the middle of performance-critical code. IHVs can write many kinds of optimizations. But, like good programmers, they only bother to write optimizations for cases that are actually likely to occur. And people generally do not destroy buffer objects in the middle of their render loops.

Furthermore, there’s no reason for you to be creating/deleting any buffer objects in this case. So I don’t know why you even encountered this.

I have already said that I switched to static VBOs, so deletion is not an issue in this particular case. But in general, it is much easier to create/delete buffers on the fly whenever they are needed than to use complex synchronization to prevent simultaneous usage (you’ve already mentioned orphaning).

On the other hand, I don’t think deletion is so difficult to deal with in performance-critical code (whatever that means; does OpenGL do anything other than drawing? There is no assumption about how critical some code is). The driver could mark the object for deletion and do it when it is more convenient. If you read the memory allocation just before and after buffer creation, you’ll see that in almost all cases it is the same. The drivers are highly optimized, and I really don’t see why deletion should cause a pipeline stall.

But in general, it is much easier to create/delete buffers on the fly whenever they are needed than to use complex synchronization to prevent simultaneous usage (you’ve already mentioned orphaning).

As I said in another thread, performance requires effort. Easy things are, more often than not, slower than doing things the fast way.

And there’s nothing particularly complex about buffer object orphaning.
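For reference, orphaning is just re-specifying the buffer’s data store instead of deleting the buffer (a minimal sketch):


//Re-specify the buffer's storage. The driver detaches ("orphans")
//the old storage, freeing it once pending commands stop using it,
//and hands back fresh memory without a sync point.
glBindBuffer(GL_ARRAY_BUFFER, buffer);
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW); //orphan
glBufferSubData(GL_ARRAY_BUFFER, 0, size, newData);        //refill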

The driver could mark the object for deletion and do it when it is more convenient.

And the time that is spent writing and testing this code is not being spent writing and testing code that would optimize some other thing. Something that people would actually find useful.

The question isn’t whether it could be faster. The question is why they should bother.

The drivers are highly optimized, and I really don’t see why deletion should cause a pipeline stall.

It doesn’t matter whether you see why it could happen or not. What matters is that it does happen, and you have to deal with that fact.

One thought:

NV_shader_buffer_store

It requires GL4-class NVIDIA hardware… also, I highly suspect that transform feedback is likely to perform better, but it is not as straightforward to use as OpenCL or the above.

Thanks kRogue!

I’m already using bindless graphics in combination with transform feedback, and I find that glBindBufferBaseNV/glBindBufferRangeNV would be useful to have (functions that would accept a GLuint64EXT address instead of a buffer ID). :)

I’ll try shader buffer store and see if it is fast enough. In fact, for small buffers (fewer than 4k vertices), the CPU implementation is much faster than the pure GL implementation based on transform feedback. I’ll compare the TF, shader buffer store, and CUDA implementations and pick the most suitable.

The problem with CUDA was in fact that I registered all the VBOs (thousands of them) for CUDA use, which takes a lot of time. I’ll try to combine TF with confining CUDA use to only a few buffers…

Thank you for the useful advice!

P.S. In fact, “shader buffer load” is all I need. Shader buffer store adds writing and some atomic operations, which are not required in my use case. I think the problem can be solved by combining random reads from multiple buffers using GPU addresses in the vertex shader (shader buffer load and READ_ONLY resident buffers) with transform feedback to capture the transformation.
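Roughly, I expect the load side to look like this (a sketch of the NV_shader_buffer_load calls; the buffer and uniform names are placeholders):


//Make the source buffer resident for read-only shader access and
//query its GPU address (GL_NV_shader_buffer_load).
GLuint64EXT gpuAddress;
glBindBuffer(GL_ARRAY_BUFFER, srcBuffer);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &gpuAddress);

//The vertex shader declares "uniform float* src;" and can then read
//src[i] at random while transform feedback captures the results.
glUniformui64NV(glGetUniformLocation(program, "src"), gpuAddress);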

I need various transformations of VBO content, like:

  • downsampling,
  • upsampling,
  • splitting, and
  • merging.

Downsampling is making an (n+1)x(n+1) buffer out of a (2n+1)x(2n+1) buffer.

Upsampling is making a (2n+1)x(2n+1) buffer out of an (n+1)x(n+1) buffer using bilinear filtering. Take a look at this.

Splitting is making four (n+1)x(n+1) buffers out of a (2n+1)x(2n+1) buffer (splitting a single tile into four tiles).

Merging is making a single (2n+1)x(2n+1) buffer out of four (n+1)x(n+1) buffers (merging four tiles into one tile).