Is there no common way of putting arrays in AGP or video memory?

I have one model that I want to draw several times in the same frame. Think of the same tree placed at several spots in one landscape.
You can use one display list if your model is static, but if it is dynamic I can’t find a common way (across GeForces and Radeons) to do it quickly in OGL.

I ran one test: one model with about 64000 faces and 45000 vertices, drawn 16 times each frame, with one infinite light, an infinite viewer and one texture.

With VAR (NV_vertex_array_range) on a GF2GTS I got about 18 FPS (12 Mtri/s).
With ATI_vertex_array_object the system freezes (I have to reset) after a few frames. This extension only works on the Radeon 8500 and there is nothing similar for other Radeons.
With CVA (EXT_compiled_vertex_array) I got about 2.5 FPS (1.6 Mtri/s).
I thought that CVA on T&L cards copies the vertices to AGP so that you can then draw them several times, but that does not seem to be true. The program does:
- SetArraysPointers
- LockArrays
- for each instance
  - Set modelview matrix
  - Draw model (tried DrawElements and DrawRangeElements)
- UnlockArrays
And it runs at more or less the same speed as without LockArrays and UnlockArrays.

In D3D8 I got more or less the same speed as with VAR (on the GF2GTS), and it was faster on the Radeon 8500.

So my question is:
Is there a common (ARB) way for GeForces and Radeons to do this at reasonable speed?
I can’t believe they are discussing vertex programmability while there is no fast way to send those vertices.

Thank you.

CVAs do not copy vertices over to AGP. CVAs are just used to cache transformed vertices, so that for multipass effects you hopefully don’t have to transform everything again.

This will not work if you draw the same object in different places.

There is currently no common way to upload vertices to AGP memory across a variety of boards in OpenGL. OpenGL 2.0 will rectify this, but it may be a while before you see implementations of that coming out: first quarter of 2003 at the absolute earliest, I would say.

Nutty

This will not work if you draw the same object in different places.

I think that’s wrong. I wrote an app to play with transfer modes. I draw the same object 4 times, using glTranslate/Rotate to move it, and I get a good perf boost using CVA.
(good meaning +90% for a 3900-vertex object, compared to regular vertex arrays)

But this boost vanishes if the number of locked elements exceeds some value (around 5000 or 6000 on my card). In that case, LockArrays is probably ignored by the driver and rendering perf is identical to that of regular vertex arrays. Look at the CVA spec and you’ll see that the issue of ‘how many elements can I lock at once?’ is not resolved. It just says that you can ask to lock any number of elements, but it’s up to the driver whether to ‘optimize’ or not. And you can’t query the threshold value.
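
One way to act on the threshold observation is to lock in batches that stay under a chosen cap. The sketch below is only an illustration: the cap is an assumed value (the spec gives no way to query the real one), `LockRange`, `lock_batch_count` and `lock_batch_range` are hypothetical helper names, and splitting by vertex range only works if the indices drawn for each batch reference vertices inside that batch's range.

```c
/* A contiguous range of array elements, in the (first, count) form
   that a lock call takes. */
typedef struct {
    int first;
    int count;
} LockRange;

/* Number of batches needed to cover vertex_count elements when each
   lock may cover at most max_per_lock elements (an assumed cap). */
int lock_batch_count(int vertex_count, int max_per_lock)
{
    if (vertex_count <= 0 || max_per_lock <= 0)
        return 0;
    return (vertex_count + max_per_lock - 1) / max_per_lock;
}

/* Range covered by batch number `batch` (0-based). Returns an empty
   range if `batch` is out of bounds. */
LockRange lock_batch_range(int vertex_count, int max_per_lock, int batch)
{
    LockRange r = {0, 0};
    int batches = lock_batch_count(vertex_count, max_per_lock);

    if (batch < 0 || batch >= batches)
        return r;
    r.first = batch * max_per_lock;
    r.count = vertex_count - r.first;
    if (r.count > max_per_lock)
        r.count = max_per_lock;
    return r;
}
```

For the 45000-vertex model from the first post and an assumed cap of 4096 elements, this gives 11 ranges, the last one holding 4040 vertices. Each range would be passed as the (first, count) pair of a lock call before drawing that batch.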

I remember reading something like ‘use CVA the way Quake does, or don’t use it’ on this board. Maybe Quake locks 5000-element batches.

Hope this helps.


I remember reading something like ‘use CVA the way Quake does, or don’t use it’ on this board. Maybe Quake locks 5000-element batches.

Hmm… That makes sense… I’ve noticed that when you stray from the Quake3 path, driver bugs and limitations begin to appear.

Does it make any sense, in these T&L times, not to have an ARB extension for submitting vertices properly? (GeForce was released in 1999)
Does it make any sense that the first part of the pipeline, vertex submission, is still not solved these days?
Reading this forum, I found that this is one of the most discussed issues (using CVA, the best way to send vertices to OGL, using VAR, allocating AGP memory, …)
Didn’t hardware vendors notice that?

Thanks…

Hmmmm… how can it cache the transformed vertex if it needs transforming to another position?

Are you sure the speedup wasn’t due to vertex re-use within each object render, rather than across all the object renders?

Nutty

You’re right :wink:

In fact my object is a sphere which is generated at the beginning of the app and stored in a vertex array. Obviously there are a lot of shared vertices.

Some observations:
If I render the object only once, I get no perf gain using CVA. If I render it twice or more (moving it), I get a great speedup. The driver clearly does something clever to take advantage of vertex reuse. Performance is close to what I get when I render the stripified version of the sphere.

So in my case, CVA is good because it allows detection of shared vertices. Rereading the spec, this advantage is clearly stated in the overview as a potential gain. So CVA is not just for post-transform caching.
The spec says CVA allows the driver to:
1/ possibly transfer data to higher-bandwidth memory
2/ possibly cache transformed results
3/ possibly detect shared vertices (quote: ‘static vertex array data to be cached or pre-compiled for more efficient rendering’)
(note that ‘detect shared vertices’ is my interpretation of ‘pre-compiled’)

Now of course for something as simple as a sphere, it’s better to handle shared vertices myself and stripify it.
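
To judge how much a mesh could gain from shared-vertex detection, one can compare the number of indices submitted with the number of distinct vertices they reference. A minimal sketch in C, assuming 16-bit indices; `count_unique_vertices` and `reuse_ratio` are hypothetical helper names, not part of OpenGL:

```c
#include <stdlib.h>

/* Counts how many distinct vertices the index list references.
   Indices outside [0, vertex_count) are ignored.
   Returns -1 if the scratch buffer cannot be allocated. */
int count_unique_vertices(const unsigned short *indices, int index_count,
                          int vertex_count)
{
    unsigned char *seen;
    int i, unique = 0;

    if (indices == NULL || index_count <= 0 || vertex_count <= 0)
        return 0;
    seen = calloc((size_t)vertex_count, 1);
    if (seen == NULL)
        return -1;
    for (i = 0; i < index_count; ++i) {
        unsigned short v = indices[i];
        if (v < vertex_count && !seen[v]) {
            seen[v] = 1;
            ++unique;
        }
    }
    free(seen);
    return unique;
}

/* Average number of times each referenced vertex is used.
   Returns 0.0 when nothing is referenced or on allocation failure. */
double reuse_ratio(const unsigned short *indices, int index_count,
                   int vertex_count)
{
    int unique = count_unique_vertices(indices, index_count, vertex_count);
    return unique > 0 ? (double)index_count / unique : 0.0;
}
```

Two triangles sharing an edge (indices 0,1,2 and 2,1,3) give a ratio of 1.5. A ratio well above 1 means many vertices are referenced repeatedly, which is where shared-vertex detection or stripification can pay off.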

BTW, 1/ could mean that the driver has the opportunity to transfer to AGP… I don’t know how CVA is implemented in the drivers…

I just reread a presentation by John Spitzer (who works at NVIDIA). I don’t have the exact link, but I got it from NVIDIA’s site; the file is named GDC01_Performance.pdf

Page 10 is about CVA:

* Shared vertices can be detected, allowing driver to eliminate superfluous operations.
* Locked data can be copied to higher bandwidth memory for more efficient transfer to the GPU.

Hope this helps.

You say you get performance increase using CVA’s if you draw the object twice, but no speedup if you draw it once. Is that correct?

I still can’t believe the driver is using transformed vertices from the 1st render in the 2nd render.

Perhaps the buffered calls are analysed, and if the same vertex array is referenced more than once, the driver automatically copies it to AGP and keeps it there for the 2nd render.

With ordinary vertex arrays, the vertex array would be copied to AGP memory twice.

This saving of one transfer to AGP is probably the speed increase you are experiencing.
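
A rough way to size that saving is to count the bytes copied per frame with and without keeping the array resident. This is a back-of-the-envelope sketch: the 32-byte vertex used in the note below (position, normal and one texture coordinate as floats) is an assumption, and `bytes_copied_per_frame` is a hypothetical helper.

```c
/* Bytes copied per frame for one vertex array drawn `instances` times.
   If kept_resident is non-zero, the array is copied once and reused;
   otherwise it is assumed to be copied again for every draw. */
unsigned long bytes_copied_per_frame(unsigned long vertex_count,
                                     unsigned long bytes_per_vertex,
                                     unsigned long instances,
                                     int kept_resident)
{
    unsigned long per_copy = vertex_count * bytes_per_vertex;

    if (instances == 0)
        return 0;
    return kept_resident ? per_copy : per_copy * instances;
}
```

With the numbers from the first post (45000 vertices, 16 instances) and the assumed 32-byte vertex, that is about 23 MB per frame when the array is recopied for every draw, versus about 1.4 MB when it is kept resident.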

Nutty

The win with LockArrays is that the driver can copy to a fixed space in AGP memory, and then tell the card to pull vertices from there as long as you haven’t changed the configured array pointers or unlocked the arrays.

Because you specify how much data to copy, this could even be a win compared to a regular DrawElements for a single instance, but it should be the same as DrawRangeElements when you’re only drawing one instance. If you’re drawing more than one instance (even when changing the modelview matrix) it should always be a win unless you’re using VAR.

Of course, if the driver only allocates a buffer of size X to use for LockArrays data, and the amount of data you lock is greater than X, the driver will probably skip this optimization, and you’ll fall back to the regular path.

When you run GLTrace on games that use the Quake III engine, you’ll see about 10 k tris per frame in the original Quake III levels, and about 20 k tris per frame in games like Alice. These are split across many individual lock calls, so the pre-allocated buffers in some drivers may be fairly small.