DrawElementsInstanced, DrawArraysInstanced, batch

Joachim.Laguarda · October 26, 2010, 3:42am

Hi folks!
English is not my native language, so please forgive my approximate use of English…

I need to draw a lot of instances of the same geometry.
For now my geometry is a “cross billboard”: two billboards facing in perpendicular directions. I use them to draw “cheap” far-away trees…

I’ve tried two approaches, batching and instancing.
-By batching I mean I have 1 VBO with multiple instances in it, each one transformed to a different position. (This should give the optimum drawing speed, but it is not very dynamic and requires a lot of CPU time to create the batches. I want to avoid that…)
-By instancing I mean I have 1 VBO with the geometry and one VBO with a lot of 4x3 transformation matrices. I use “glVertexAttribDivisorARB” to send one matrix to each instance.
Something like this:

glBindBufferARB(GL_ARRAY_BUFFER_ARB, positionBuffer );
glEnableVertexAttribArrayARB( attribLocationX );
glVertexAttribPointerARB ( attribLocationX, 4, GL_FLOAT, GL_FALSE, sizeof(Vector4D)*3, 0 );
glVertexAttribDivisorARB( attribLocationX,1 );
glEnableVertexAttribArrayARB( attribLocationY );
glVertexAttribPointerARB ( attribLocationY, 4, GL_FLOAT, GL_FALSE, sizeof(Vector4D)*3, (void*)(sizeof(Vector4D)));
glVertexAttribDivisorARB ( attribLocationY,1 );
glEnableVertexAttribArrayARB( attribLocationZ );
glVertexAttribPointerARB ( attribLocationZ, 4, GL_FLOAT, GL_FALSE, sizeof(Vector4D)*3, (void*)(sizeof(Vector4D) * 2));
glVertexAttribDivisorARB ( attribLocationZ,1 );
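
For anyone reading along, the stride and byte offsets used in the calls above can be derived from a struct instead of being written by hand. This is only a sketch: `InstanceMatrix` and its field names are made up for illustration, and `Vector4D` is assumed to be four tightly packed floats, which the post does not actually show.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: Vector4D is four packed floats (its definition is not in the post).
struct Vector4D { float x, y, z, w; };

// Hypothetical name for one per-instance 4x3 matrix stored as three rows.
struct InstanceMatrix { Vector4D rowX, rowY, rowZ; };

// Byte distance between two consecutive instances in the matrix VBO.
constexpr std::size_t kInstanceStride = sizeof(InstanceMatrix);

// Byte offset of each row inside one instance; these are the values that
// would be cast to a pointer for the last argument of glVertexAttribPointerARB.
constexpr std::size_t kRowOffset[3] = {
    offsetof(InstanceMatrix, rowX),
    offsetof(InstanceMatrix, rowY),
    offsetof(InstanceMatrix, rowZ),
};
```

On a platform with 4-byte floats this gives a stride of 48 bytes and row offsets of 0, 16 and 32, which is what sizeof(Vector4D)*3, sizeof(Vector4D) and sizeof(Vector4D)*2 evaluate to.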

I’ve got pretty weird results here, and I would be happy if someone has comments or wants to share their knowledge about efficient instancing.

On a Quadro 5600 with the latest drivers, in a tiny viewport so the limiting factor should be vertex throughput:

TEST1 - 100 batches of 1000 cross-billboards drawn with indexed primitives take 4 milliseconds (250 FPS) to draw:

glDrawElements ( GL_TRIANGLES, 1000 * 6 * 2, GL_UNSIGNED_INT, 0 );
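
The 1000 * 6 * 2 index count above comes from two quads per cross-billboard and six indices (two triangles) per quad. Here is a minimal sketch of how such an index list could be generated. It assumes the four vertices of each quad are stored consecutively in winding order, and `buildQuadIndices` is a hypothetical helper, not code taken from the test itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: emits two triangles (6 indices) per quad, assuming the
// four vertices of quad q sit at positions 4*q .. 4*q+3 in the vertex buffer.
std::vector<unsigned int> buildQuadIndices(std::size_t quadCount)
{
    std::vector<unsigned int> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        const unsigned int base = static_cast<unsigned int>(q * 4);
        // Triangles (0,1,2) and (0,2,3) share the diagonal 0-2 of the quad.
        const unsigned int corners[6] = {0, 1, 2, 0, 2, 3};
        for (unsigned int c : corners) {
            indices.push_back(base + c);
        }
    }
    return indices;
}
```

Calling buildQuadIndices(1000 * 2) would produce the indices for one batch of 1000 cross-billboards, referencing 8 unique vertices per cross-billboard.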

TEST2 - 100 batches of 1000 instanced cross-billboards drawn with non-indexed QUADS take around 5 ms (200 FPS) to draw:

glDrawArraysInstancedEXT(GL_QUADS, 0, 4 * 2, 1000);

TEST3 - 100 batches of 1000 instanced cross-billboards drawn with non-indexed TRIANGLES take 8 ms (125 FPS) to draw, which is also acceptable:

glDrawArraysInstancedEXT(GL_TRIANGLES, 0, 6 * 2, 1000);

TEST4 - now 100 batches of 1000 instanced cross-billboards drawn with indexed TRIANGLES take 25 ms (40 FPS) to draw, which is not good at all:

glDrawElementsInstancedEXT(GL_TRIANGLES, 6*2,GL_UNSIGNED_INT, 0, 1000);

What the #$@% is going on!
It seems that glDrawArraysInstancedEXT can be as fast as batched geometry (the extra cost in TEST3 comes from sending 6 vertices per billboard where TEST1 and TEST2 use 4).
BUT glDrawElementsInstancedEXT is very slow…

This is bad because I also need to draw more generic geometry (houses, …), and since vertex reuse is a must, indexed geometry should be used.

These results are what I got on my NVIDIA QUADRO card…

Is there a known caveat to using indexed geometry with instancing?
Do the figures look the same on other graphics boards?
Any advice on efficient instancing?

Thank you for your time!

aqnuep · October 26, 2010, 6:42am

Hi,

Your findings are interesting. I myself did not have any issues when using DrawElementsInstanced; in fact, I got pretty good results with thousands of instances (on Radeon cards).
I think this is a driver issue, so maybe they can tell you more on the NVIDIA forums.

ViolentHamster · October 26, 2010, 10:28am

How are you uploading the index list?

Rather than sending entire matrices, you might be able to send only orientation and position and save some bandwidth to the card. I’ve found that instancing performs better when the instanced geometry contains more vertices. I did a simple test with a billboard tree model like yours and found that instancing was of little benefit. Once I switched to a more complicated model, instancing was a clear winner. I think I did that testing on an 8800 or 285.

Joachim.Laguarda · October 27, 2010, 6:47am

Fine, I also thought it might be the case.
I’ve mailed my findings to an NVIDIA engineer; I will post his reply when I get one.

Sorry, I’m not sure I get the question.
All my VBOs are uploaded once and for all at VBO initialisation with a glBufferDataARB (…,mydata, GL_STATIC_DRAW_ARB).

In fact, I thought about it already. The problem is that I need non-uniform scale…
So at best I can send position (3 floats), orientation (3 floats in a normalised quaternion) and scale (3 floats).
This is 9 floats vs. 12 floats for a 4x3 matrix.
I’m not sure it’s worth trying, considering the extra cost of the quaternion transform…
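
To make that trade-off concrete, here is a rough sketch of expanding the 9 packed floats back into three matrix rows. In practice this would run in the vertex shader; it is written as plain C++ here so it can be checked on the CPU. Everything in it is an assumption for illustration: the `PackedInstance` struct, the `unpackInstance` name, the `Vector4D` layout, the row convention (row i holds the rotation row scaled per column, with the translation in w) and the rule that the quaternion is stored with w >= 0 so that w can be rebuilt from x, y, z.

```cpp
#include <cassert>
#include <cmath>

// Assumption: Vector4D is four packed floats.
struct Vector4D { float x, y, z, w; };

// Hypothetical 9-float per-instance record.
struct PackedInstance {
    float px, py, pz;   // position
    float qx, qy, qz;   // x, y, z of a unit quaternion stored with w >= 0
    float sx, sy, sz;   // non-uniform scale
};

// Expands the packed record into the rows of M = T * R * S, so that a
// transformed coordinate is dot(rows[i].xyz, p) + rows[i].w.
void unpackInstance(const PackedInstance& in, Vector4D rows[3])
{
    const float x = in.qx, y = in.qy, z = in.qz;
    // Rebuild w from the unit-length constraint; clamp against rounding error.
    const float ww = 1.0f - (x * x + y * y + z * z);
    const float w = ww > 0.0f ? std::sqrt(ww) : 0.0f;

    rows[0] = { (1.0f - 2.0f * (y * y + z * z)) * in.sx,
                (2.0f * (x * y - w * z)) * in.sy,
                (2.0f * (x * z + w * y)) * in.sz,
                in.px };
    rows[1] = { (2.0f * (x * y + w * z)) * in.sx,
                (1.0f - 2.0f * (x * x + z * z)) * in.sy,
                (2.0f * (y * z - w * x)) * in.sz,
                in.py };
    rows[2] = { (2.0f * (x * z - w * y)) * in.sx,
                (2.0f * (y * z + w * x)) * in.sy,
                (1.0f - 2.0f * (x * x + y * y)) * in.sz,
                in.pz };
}
```

With 4-byte floats the packed record is 36 bytes per instance against 48 for the 4x3 matrix, paid for with one square root and a few dozen multiplies and adds per vertex if done in the shader. Whether that is a net win would have to be measured.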

OK, that’s encouraging. Do you notice any performance difference between indexed and non-indexed instancing?