Vertex Arrays: Interleaved or separated arrays?

Hi there

Currently I use VBOs with interleaved arrays. Now I am at a point where I need to dynamically update the content of my arrays. It would be quite complicated to work on those interleaved arrays, so I would like to change my vertex-array manager to use separate arrays for every type of data. Also, I think it might be more efficient to only activate those arrays which are actually needed, which makes no sense with interleaved arrays, since the same amount of memory would still be required.

However, my question now is whether it makes sense to change my stuff to non-interleaved arrays. How is performance? Would it be worse, better, the same?

Or would it make sense to do both: interleaved for static data and non-interleaved for dynamic stuff?

Thanks,
Jan.

Could be just a bit worse, but the difference will be minimal

Depends on bottleneck…

Originally posted by Jan:
Or would it make sense to do both: interleaved for static data and non-interleaved for dynamic stuff?
This would be the best in most cases, but it depends on what’s the bottleneck and so on. Interleaved arrays are fastest to access in general, but if large portions of each vertex are unused, or a particular attribute needs to be updated often, then it may be faster to rip some stuff out and provide that as additional streams.

Well, in general I don’t update my vertex-arrays, because most geometry is static and the rest only translates and rotates, which can be done with the modelview matrix.
And everything really animated can be done with skinning in a shader, I think.

However, at the moment I am implementing a particle system. Obviously, I need to upload the positions of the particles every frame, sometimes even texture coordinates or colors. And I also thought that it might not be useful to have one vertex array per particle system, but better maybe 5 vertex arrays which get used to upload ALL particle systems (more than one, so that I can render from one and upload the next, etc).

And here it might be mighty useful to have a non-interleaved array, because then I can do a nice memcpy and have a whole bunch of particles copied.

BTW: The particle-data is non-interleaved (struct-of-arrays instead of array-of-structs), so that I am able to use SSE or such in the future, so I wouldn’t want to change that.

However, the question is HOW BIG the penalty would be if I use non-interleaved arrays for everything, or whether it would be worth the additional work.

Thanks,
Jan.

@Humus: You might know that:

Does it make a difference in which order I put my data into an interleaved array?

At the moment I use
position | color | normal | fog | texcoords0…7

I mean, if the hardware can benefit from an order which fits its internal organization, then I’d like to use that, of course.

Jan.

Try it and measure it in a simple test program, then let us know your conclusions (giving hardware and driver version).

Originally posted by Jan:
[b]@Humus: You might know that:

Does it make a difference in which order I put my data into an interleaved array?

At the moment I use
position | color | normal | fog | texcoords0…7

I mean, if the hardware can benefit from an order which fits its internal organization, then I’d like to use that, of course.

Jan.[/b]
I can’t see any reason why order should matter as such, but alignment and vertex size can matter. So if your vertex ends up as 68 bytes, you may consider shaving something off or using a smaller type somewhere to cut it down to 64. Or if your vertex is 60 bytes it may be faster to pad it with 4 bytes. And place all floats on a 4-byte aligned address and so on.

This question is very interesting, I guess. So I’d like to participate, in order to find an answer or at least to shed some light on it.

I think whether to use interleaved or non-interleaved arrays might depend on just one thing, which I’m going to try to explain.

We should know how the card manages the data internally. Given some vertex (and normal…) arrays, does the card compute data per vertex, per triangle, or can it do several triangles at a time? I mean, if it does per-vertex computation, then it will need to switch from one array to another (first the vertex array, then the normal array…). So using non-interleaved arrays would surely be a bottleneck. But if the card can process several triangles at a time (I think that’s the way on pipelined cards, but I might be wrong), it won’t need to switch from one array to another so often. So interleaved or non-interleaved won’t make much difference.

Well, I ran a few tests.

All this was done on an ASUS laptop with an Intel Centrino 1.8 GHz, 512 MB DDR RAM and an ATI Radeon 9700 Mobility (64 MB) using Catalyst 5.6. And WinXP, of course.

To test the throughput I used a particle-system. Blending off, alpha-test off, depth-writes on, no shaders, no lighting, etc, plain old textured quads.

The particle-system itself was pretty simple. At one point a big bunch of particles was emitted and simply flew up. No complex math behind it. Also the particles are screen-aligned billboards, and the billboarding was done on the CPU.

Every vertex consisted of position (3 floats), color (4 bytes) and a texcoord (3 floats) = 28 bytes = 112 bytes per particle.

I used VBOs. I set the usage to STREAM and to DYNAMIC. I didn’t see a difference. Even STATIC seemed to make no difference. To upload the data I mapped the VBOs and stored the data directly in the buffer, without temporarily storing it in RAM. Every particle-system had its own vertex-buffer; I didn’t share them.

I used glDrawArrays to render the particles, so no indexing, but that wouldn’t make a difference anyway.

Now to the interesting results:
First I used half a million particles. That worked with interleaved arrays. When no particles were rendered I got 31 fps, when particles were rendered I got 15 fps. With non-interleaved arrays (3 arrays) I got ALWAYS 3 to 6 fps, even if no particles were updated/rendered. I tracked it down to something around 480000 particles. If I used 48xxxx particles I got 15 fps, with 48xxxx+1 particles performance broke down. Seems to be a memory issue.

So for further tests I used 4 particle-systems with 100000 particles each = 1600000 vertices.

The result: interleaved and non-interleaved were about equally fast (31 fps with no particles, down to 9-11 fps with particles).

Surprisingly, immediate mode achieved 11 to 15 fps.

I also checked CPU usage. When updating the particles it rose from 50% to 75%-85%. There were no big differences, but immediate mode always consumed a few % more.

Well, my conclusion is this: I don’t think that the gfx-card was the limiting factor here. Filling the vertex-buffer was quite an easy task, not many operations per particle, but still the CPU seemed to be the limiting piece. So, in general, it seems not to make a difference what you use.

However, I experienced a few other issues. One big disadvantage of non-interleaved arrays is that one cannot map more than one buffer at a time. This means I am not able to update all vertex attributes in one loop, but have to do one loop per attribute. Since I needed to calculate some per-particle temporary results, I needed to do this 3 times as often as with interleaved arrays.
That makes non-interleaved arrays very cumbersome to work with, and for more complex particle-systems it might be very inefficient.

So, interleaved arrays are my preferred choice.

Now, one advantage of vertex-arrays over immediate mode is that you can update the vertex-arrays only every few frames, whereas with immediate mode you have to resubmit everything every frame. So vertex-arrays can be more efficient.

I THOUGHT!

My code only updates the vertex-arrays if 40 milliseconds have passed and some particles are active.
Now something strange happened. If I updated the vertex-arrays every frame, no matter whether particles were active or not,
then I got 31 fps with no rendered particles and 20 fps with particles rendered. When I updated the vertex-arrays only
if at least one particle was active, then I got 31 fps with no particles rendered and 15 fps when particles were rendered.

I tried it both with STREAM and DYNAMIC usage, no difference.

So I changed my code to map the vertex-buffer every frame, even if no particles are active or no change was necessary.
That means I mapped it and immediately unmapped it. This brought the fps back to 20 when particles were rendered.

I am pretty sure this is a driver bug! I cannot organize my engine to map and unmap all unused buffers every frame to
get a good framerate!

However, my conclusion is that interleaved arrays are the best choice. They are easier to work with, give good performance
and (if it weren’t for that bug) are a bit more efficient than immediate mode. However, the speed of immediate mode really
surprised me. It seems to be well optimized internally.

Phew, what a long post. Hope it is interesting to read.

Jan.

Are you drawing the 480,000 things in one call?
I’ve found you can get much better performance (2-3x from memory)
if you split the drawing up into smaller batches. Start with 500 and increase the batch size by 500 at a time until you hit the sweet spot.

Originally posted by Jan:
One big disadvantage of non-interleaved arrays is that one cannot map more than one buffer at a time.
Uhm, I don’t see any such restriction in the spec.

Originally posted by Humus:
[quote]Originally posted by Jan:
One big disadvantage of non-interleaved arrays is that one cannot map more than one buffer at a time.
Uhm, I don’t see any such restriction in the spec.
[/QUOTE]Well, I get a NULL-pointer if I try to map more than one buffer at a time. And I am not talking about mapping the same buffer more than once, but different buffers.

Did you try glBuffer[Sub]Data calls instead of glMapBuffer?

From what I read, glBuffer[Sub]Data should be preferred for performance; see:
GDC Performance Tuning Document

I don’t have any direct experience about this question though…

Originally posted by Jan:
[quote]Originally posted by Humus:
[quote]Originally posted by Jan:
One big disadvantage of non-interleaved arrays is that one cannot map more than one buffer at a time.
Uhm, I don’t see any such restriction in the spec.
[/QUOTE]Well, I get a NULL-pointer if I try to map more than one buffer at a time. And I am not talking about mapping the same buffer more than once, but different buffers.
[/QUOTE]I routinely map different VBOs at the same time, and write to them simultaneously. Works fine here (on both ATI and nVidia HW). There could be some other problem there, and it might also be the cause for the strange slowdowns you are experiencing.

Originally posted by Jan:
Well, I get a NULL-pointer if I try to map more than one buffer at a time. And I am not talking about mapping the same buffer more than once, but different buffers.
I quickly tested this, but couldn’t reproduce a problem. Do you have a repro case?

I’ll check it out again. But time is short at the moment, so it could take a bit.

Jan.