Occlusion queries inside shaders and "group object queries" for HW frustum culling

Let me explain a little extension I have in mind:

MOTIVATION

Real transparency sorting is very slow. Methods like depth peeling or triangle sorting are too expensive.

Traditional shadow methods ( shadowmaps, shadow volumes, etc… ) have tons of problems and are too tricky ( bias, infinite extrusion caps, perspective aliasing, filtering… )

Physics is coming to the world of GPUs. These days we are witnessing a war between the GPUs and the AGEIA card.

New “parallax” bump methods are coming. The latest is “parallax occlusion mapping” ( I’m sure you saw the wonderful ATI presentation about it ).

There are tons of applications that could use HW acceleration for normal map generation, ambient occlusion or lightmap/PRT computation.

We need a method to cast “rays” from the shaders so we can perform:

  1. Real and high-accurate penumbra shadows
  2. Transparency sorting
  3. Collisions and physics in the GPU
  4. More accurate “occlusion parallax” bump mapping using a heightmap.
  5. Medical 3D voxel applications
  6. Volumetric fog ( distance between closest ray hit and furthest )
  7. Sub-surface scattering

PROPOSAL

We can create an “object batch”. This is really like an occlusion query batch. We tell the driver that some geometric objects are “grouped”. We could use something like:

   GLint query = glCreateObjectQuery();
   glStartObjectQuery(query);
   glAddToObjectQuery(query,obj.vb,obj.ib,obj.id);//vb is a vbo, ib is an index buffer and id is an identifier for the object (like in the geom shaders)
   glAddToObjectQuery(query,obj2.vb,obj2.ib,obj2.id); 
   glAddToObjectQuery(query,obj3.vb,obj3.ib,obj3.id);
   //Add more objects to the query forming a "group set" 
   ...

   glEndObjectQuery(query);

The driver then calculates the AABB/OBB of the set. It could also build an octree or another spatial structure for future use.
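
Just to illustrate the kind of per-group data the driver could precompute ( this is only a sketch; the real layout is up to the implementation ), in GLSL-style syntax:

// Illustrative node of a flattened bounding-box hierarchy the driver could
// build for each "object query" group at glEndObjectQuery() time.
struct ObjectQueryNode
{
   vec3 aabbMin;        // bounds of this node
   vec3 aabbMax;
   int  firstChild;     // index of the first child node, or -1 for a leaf
   int  firstTriangle;  // for leaves: first triangle index in the group's index buffer
   int  triangleCount;  // for leaves: how many triangles to test
};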

Of course it would also be good to allow a “queries inside queries” feature to make this hierarchical.

Ok, now this extension will define a GLSL function called “castRay”, like:

   bool castRay ( in vec3 origin,    in vec3 endPoint,
                  out vec3 hitPoint, out vec3 hitNormal,
                  out int triangleId, out int primitiveID );

We pass the function the ray origin and the end point ( which is origin + (rayDir*distance) ).
The function returns TRUE if there is a hit, and then fills the out params: hitPoint, the interpolated normal, and the geometry-shader triangleId and primitiveID.
The function returns FALSE if there is NO hit ( and then the out params are not filled ).

This function iterates over all the previously created “object queries”, trying to find a ray-triangle intersection. The driver must internally accelerate the “object queries” using AABBs and some kind of hierarchical structure ( like AGEIA does ).

“castRay” should be available in BOTH vertex and fragment shaders ( and perhaps also in the upcoming geometry shaders? )

With this, you could do collisions, shadows, etc… inside GLSL.

For example, you could perform raytraced shadows in the fragment shader in a very simple way:

uniform vec3 lightPos;
uniform float lightRange;
uniform sampler2D baseTex;

varying vec3 vPos;

void main ()
{
   vec3 base = texture2D(baseTex,gl_TexCoord[0].st).rgb;

   vec3 hitPos, hitNormal;
   int hitTriangle, hitPrimitive;
   bool inShadow = castRay(lightPos,vPos,hitPos,hitNormal,hitTriangle,hitPrimitive);

   if ( inShadow )
   {
      base *= 0.4;
   }

   gl_FragColor = vec4(base,1.0);
}

When the OGL driver finds the “castRay” GLSL instruction, it will iterate over all the “group object batches” previously created with the glCreateObjectQuery() function.
If the ray doesn’t touch a group’s AABB, no test is performed and that triangle set can be skipped SUPERFAST.
If the ray touches the group AABB, the driver then tests the AABBs inside it ( an iterative process ). Once a “node limit” is reached it performs ray-triangle hit tests ( yes, this is slow, but it only happens a few times, and this is where the NVIDIA/ATI engineers’ brains should work to optimize it ).
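
To make the rejection step concrete, it is just a ray-vs-AABB slab test. A minimal GLSL-style sketch of what the driver could evaluate per group ( the function name and details are only illustrative ):

// Returns true if the segment from 'origin' to 'endPoint' touches the
// axis-aligned box [boxMin, boxMax] ( classic slab test ).
bool rayHitsAABB ( in vec3 origin, in vec3 endPoint, in vec3 boxMin, in vec3 boxMax )
{
   vec3 dir    = endPoint - origin;
   vec3 invDir = vec3(1.0) / dir;   // assumes no component of dir is exactly zero
   vec3 t0 = (boxMin - origin) * invDir;
   vec3 t1 = (boxMax - origin) * invDir;
   vec3 tNear = min(t0, t1);
   vec3 tFar  = max(t0, t1);
   float tEnter = max(max(tNear.x, tNear.y), tNear.z);
   float tExit  = min(min(tFar.x,  tFar.y),  tFar.z);

   // Hit if the slabs overlap somewhere inside the [0,1] segment range
   return ( tEnter <= tExit ) && ( tExit >= 0.0 ) && ( tEnter <= 1.0 );
}

If this test fails, the whole group ( and everything below it in the hierarchy ) is skipped without touching a single triangle.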

CONCLUSION

What do you think about this? With this we can perform raytracing on the GPU very easily, and it would be HW/driver optimized and very fast. This feature is very good for achieving TONS of effects like the ones mentioned.

Also, “object batch queries” can be used by the driver to perform simple HW-accelerated frustum culling, because the driver knows the AABB or OBB of the group sets and so can skip these objects FAST if they are not visible…

However, the performance could be bad… That is why the graphics cards should implement some kind of spatial structure to accelerate the ray-triangle collision test ( like a kd-tree or axis-aligned bounding boxes ). Perhaps dynamic objects shouldn’t be available for this, so we could limit it to static objects to start??? Or perhaps the driver should forget octrees/kd-trees and just fire a HW occlusion query, drawing all the “object queries”, to check if the pixel is visible from a camera placed at the ray origin??? Well, “castRay” could be implemented by the graphics engineers using different methods…

“Object batch queries” combined with the “castRay” GLSL instruction ( and with the geometry shaders ) can open a new world for the GPU: raytracing, the next step we are all waiting for to achieve real and amazing new effects!

We need a method to cast “rays” from the shaders
http://graphics.cs.uni-sb.de/SaarCOR/
http://graphics.cs.uni-sb.de/~woop/rpu/rpu.html

http://www.openrt.de/

I second ZbuffeR’s implicit sentiments.

I don’t think that this is something that should be handled explicitly by OpenGL. It may be more flexibly handled using GLSL and general purpose extensions.

http://graphics.stanford.edu/~tpurcell/
http://graphics.cs.uiuc.edu/geomrt/

Another problem with this suggestion is how to handle dynamic meshes (possibly animated in the application or vertex shader).

Originally posted by gamefreedom:
Another problem with this suggestion is how to handle dynamic meshes (possibly animated in the application or vertex shader).

Yep yep, that is a problem, like I mentioned… But the problem is not the “dynamic” meshes. Ray-triangle tests against non-deformable-but-mobile meshes can be done just by transforming the ray into local space. The problem is the “deformable” meshes, like skinned or morphed ones… How does AGEIA solve this, any idea?
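
For example, something like this ( just a GLSL-style sketch; invModelMatrix is an assumed per-object uniform holding the inverse of the object’s world matrix ):

uniform mat4 invModelMatrix;   // assumed: inverse of the object's world matrix, supplied by the app

// Bring a world-space ray segment into the object's local space, so the
// precomputed (static) structure of a rigid object can be reused unchanged.
void rayToLocalSpace ( in vec3 origin, in vec3 endPoint,
                       out vec3 localOrigin, out vec3 localEnd )
{
   localOrigin = ( invModelMatrix * vec4(origin,   1.0) ).xyz;
   localEnd    = ( invModelMatrix * vec4(endPoint, 1.0) ).xyz;
}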

However, I would be very happy if I could raycast only static objects ( or mobile rigid ones ) using a dynamic light… That would be enough for me!

About the links… Some are good things ( and look nice! ) but a little vaporware and too abstract. This is much more concrete, possible to do with a GPU, and much simpler. It also allows parallelizing the “castRay” GLSL function using an instruction cache ( or we could modify its params to better fit SIMD, passing an array of vec3 and a count; see the sketch below ).
I think this proposal can be better than that pure-RPU raycaster because you can decide whether or not to raytrace a given pixel. The links show a “we use raycast for everything” approach, and I only want to raycast a few pixels ( for example, not the shadow pixels when the light diffuse is zero ) and not all.
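
For example, the batched variant could look something like this ( only a sketch of the idea; MAX_RAYS and the parameter names are illustrative ):

const int MAX_RAYS = 8;   // purely illustrative limit

// Cast up to 'count' segments in one call so the HW can process them in parallel.
// Each results[i] is true if ray i hit something; hit data is filled only for hits.
// Returns true if at least one ray hit.
bool castRays ( in vec3 origins[MAX_RAYS], in vec3 endPoints[MAX_RAYS], in int count,
                out bool results[MAX_RAYS],
                out vec3 hitPoints[MAX_RAYS], out vec3 hitNormals[MAX_RAYS] );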

Originally posted by santyhammer:

Yep yep, that is a problem, like I mentioned… But the problem is not the “dynamic” meshes. Ray-triangle tests against non-deformable-but-mobile meshes can be done just by transforming the ray into local space. The problem is the “deformable” meshes, like skinned or morphed ones… How does AGEIA solve this, any idea?

Sorry, I missed that line; I did mean deformable/morphing meshes (particularly those shaped by the vertex shader).

Originally posted by santyhammer:

I only want to raycast a few pixels ( for example, not the shadow pixels when the light diffuse is zero ) and not all.

Have you tried implementing this using GLSL? It would be good to have a (at least speculative) comparison (in efficiency and difficulty) if you want to promote this suggestion.

Originally posted by gamefreedom:
Have you tried implementing this using GLSL? It would be good to have a (at least speculative) comparison (in efficiency and difficulty) if you want to promote this suggestion.
Yep, the problem with current raytracing on the GPU is that there is no efficient way to discard AABBs with triangles inside them from the fragment shader. You need some spatial structure to hierarchically discard triangles, and I doubt this can be done at the moment in a fragment shader the way I want. So far I found:

[http://www.clockworkcoders.com/oglsl/rt/gpurt1.htm](http://www.clockworkcoders.com/oglsl/rt/gpurt1.htm)
[http://gpurt.sourceforge.net/DA07_0405_Ray_Tracing_on_GPU-1.0.5.pdf](http://gpurt.sourceforge.net/DA07_0405_Ray_Tracing_on_GPU-1.0.5.pdf)
[http://graphics.stanford.edu/papers/gpu_kdtree/kdtree.pdf](http://graphics.stanford.edu/papers/gpu_kdtree/kdtree.pdf)
[http://inferno.hildebrand.cz/index.php?f...p=0&r=&m=&dir=0](http://inferno.hildebrand.cz/index.php?fs=1&s=14&p1=0&p2=0&p3=0&p4=0&p5=0&p6=0&p7=&p8=&c1=0&c2=0&c3=0&c4=0&c5=0&c6=0&p=0&r=&m=&dir=0)

These implementations are a nightmare and far from efficient ( the tables show it; the last 128x128 case drops to near 0 fps with only 1024 triangles ).

Perhaps with the upcoming geometry shaders you could hack it, but it will be hard 8( And I finish with a reflection… If the AGEIA PPU can do this in HW very easily ( see the closestHit function in the PhysX SDK )… why not a graphics card? And Renderman can use the gatherPoints() function to achieve this too…

So we need the “castRay” GLSL function and an extension to batch triangle groups so they can be discarded quickly using their AABBs or something.

It is not the same hardware, your “castRay” function would need new specific hardware, maybe integration of an AGEIA card :smiley:

Originally posted by ZbuffeR:
It is not the same hardware, your “castRay” function would need new specific hardware, maybe integration of an AGEIA card :smiley:
I’ll second that! Two for the price of one pls! The problem is that we will get two for the price of three! hahah

So we need the “castRay” GLSL function and an extension to batch triangle groups so they can be discarded quickly using their AABBs or something.
Or if you’re wanting to do serious raytracing in the fragment shader, maybe you should get hardware that is designed for raytracing. And an API designed for raytracing. OpenGL isn’t that API, and OpenGL hardware isn’t that hardware.

OpenGL defines a scan converter. Oh, you can use glslang to do some raytracing, but you can’t cast rays into a scene defined by VBOs. OpenGL, being a scan converter, only comprehends the current triangle. Indeed, in a fragment shader, it only understands the current fragment.

In order to do what you’re suggesting, you need to have all of the triangles go through T&L, and store them in some kind of data structure. That isn’t what OpenGL is, and that isn’t what OpenGL does. And it certainly isn’t even remotely similar to modern scan converter graphics hardware.

There are limitations that you are going to have to accept. One of them is a lack of any kind of “true” raytracing. You might be able to fake it to a degree in a fragment shader, but that’s as far as it will ever get under OpenGL.

Originally posted by Korval:
In order to do what you’re suggesting, you need to have all of the triangles go through T&L, and store them in some kind of data structure.

Yep yep, you’re right, but that’s exactly the definition of the upcoming SM4.0 Geometry Shaders. You could access ANY triangle/strip in the mesh from shaders, even getting the triangleID and primitiveID ( and even outputting multiple vertices from the vertex shader, but that is offtopic here )… You will have access to all the “topology” of the mesh. That’s why I said you could use the geometry shaders to do the ray-triangle hit test, but it would be very slow because you need to iterate over the ENTIRE triangle list.

For more info you can go to:

  [http://www.beyond3d.com/forum/showthread.php?t=25760](http://www.beyond3d.com/forum/showthread.php?t=25760)      

I don’t think it would be very complicated to replace the “vertex and triangle array list iteration” with a more advanced spatial structure like AGEIA does ( for example, an axis-aligned bounding box hierarchy )… It’s not as complicated as you think, and in fact AGEIA does that in hardware at millions of ray tests per second ( see the cube-terrain shadows example in its SDK running at amazing speed! )…

What I am trying to tell you is that the required structure for this already exists, is currently implemented in silicon, is perfectly possible and speed-proven, and would be very easy to use for shadows and other things in OpenGL like I suggested.

The OpenGL driver implementation just needs to store the vertices and indices like now, PLUS in a hierarchical way ( kd-tree, AABB groups or whatever ) like AGEIA does. If you can get one of these cards you can see that the mesh loading times and triangle compilation are perfectly acceptable and the raycast speed is great ( and usually they have only 128MB of RAM to store these structures ).

HW vendors should be interested in this too… This way they could win the battle for physics and collisions on the GPU. Without a raycast method in the shaders the battle cannot even start… And not only physics… raytracing will be used for tons and tons of graphics applications.

Have you guys seen that cell-technology ray-traced terrain demo, running on an IBM Blade server? How I hope and pray PC architecture could advance a little quicker. The biggest battle for PC games in the future is going to be figuring out how to get around archaictecture limitations, hence the recent rash of GPU “abuse” and square peg in round hole syndrome. :frowning:

By the way, I don’t see how triangle subdivision in a geometry shader is going to really help with ray-tracing. It’s not as though you’re going to have access to the entirety of the scene’s geometry database within the shader–not unless they plan to install vertex/auxiliary caches the size of Texas, and even then the memory/bandwidth impact would be appalling.

For sure, there are ways to exploit the hardware to accelerate some of the tasks in ray-tracing, but that shouldn’t be confused with ray-tracing itself (or should it?).

Originally posted by Leghorn:
Have you guys seen that cell-technology ray-traced terrain demo, running on an IBM Blade server?
Nope, but the AGEIA terrain shadows demo is still good! Btw, do you think I could plug a few ClearSpeed CSX600s into the new Opteron to perform raytracing with this?

[http://www.dailytech.com/article.aspx?newsid=2642](http://www.dailytech.com/article.aspx?newsid=2642)

:D

Originally posted by Leghorn:
For sure, there are ways to exploit the hardware to accelerate some of the tasks in ray-tracing, but that shouldn’t be confused with ray-tracing itself (or should it?)

Exactly… I am NOT asking to change the current OpenGL scan-line approach like Korval mentioned. I am NOT asking to perform reverse Monte Carlo raytracing. I DON’T want to cast rays for each pixel in the final image as a traditional raytracer like POV-Ray or Mental Ray does.
What I want is ONLY a way to cast a few rays ( for shadows and other rare situations ) from the fragment shader.

Originally posted by Leghorn:
I don’t see how triangle subdivision in a geometry shader is going to really help with ray-tracing

Nope, tessellation won’t help… What helps raytracing is the ability to iterate through all the mesh triangles. I could perform a very basic shadow test inside a geometry shader with ( sorry, DX10 notation ):

bool TestRayTriCollision ( const Triangle t,
   const float3 rayOrigin, const float3 rayDir )
{
   //This function tests if a ray hits a triangle.
   //I erased the implementation to clarify the example ( see the sketch below )
}

void
GS_main ( const TriangleStream<Triangle> inTris,
          TriangleStream<Triangle> outTris,
          const float3 rayOrigin, const float3 rayDir )
{
   //Brute force: test the ray against every triangle of the mesh
   for ( int t = 0; t < inTris.Count; ++t )
   {
      if ( TestRayTriCollision(inTris[t], rayOrigin, rayDir) )
      {
         //blah blah mark the triangle as hit by the ray, calculate barycentric
         //coordinates of the hit, the hit point, and blah blah...
         outTris.Append(inTris[t], bari, hit, hitNormal);
      }
   }
}
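
For reference, the erased TestRayTriCollision body would just be a standard ray/triangle test. A minimal Möller-Trumbore style sketch ( GLSL syntax, names only illustrative ):

// Möller-Trumbore ray/triangle test: ray starts at rayOrigin with direction rayDir.
// v0, v1, v2 are the triangle vertices. Returns true on a hit in front of the origin
// and outputs the hit distance and barycentric (u,v).
bool rayTriangleHit ( in vec3 rayOrigin, in vec3 rayDir,
                      in vec3 v0, in vec3 v1, in vec3 v2,
                      out float tHit, out vec2 baryUV )
{
   vec3 e1 = v1 - v0;
   vec3 e2 = v2 - v0;
   vec3 p  = cross(rayDir, e2);
   float det = dot(e1, p);
   if ( abs(det) < 1e-6 ) return false;        // ray is parallel to the triangle plane

   float invDet = 1.0 / det;
   vec3  s = rayOrigin - v0;
   float u = dot(s, p) * invDet;
   if ( u < 0.0 || u > 1.0 ) return false;

   vec3  q = cross(s, e1);
   float v = dot(rayDir, q) * invDet;
   if ( v < 0.0 || u + v > 1.0 ) return false;

   tHit   = dot(e2, q) * invDet;
   baryUV = vec2(u, v);
   return ( tHit >= 0.0 );
}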

But it won’t be efficient because I need to iterate over ALL the triangles in the mesh! I need some kind of spatial structure to discard triangle batches fast!

Yeah, I see what you’re getting at (I think).

Here’s a great rundown on IBM’s cell-tech (page 36 describes a ray-caster architecture and the terrain demo):
http://www.graphicshardware.org/previous/www_2005/presentations/damora-cell4graphicsandviz-gh05.pdf

It is to weep.

Originally posted by Leghorn:
http://www.graphicshardware.org/previous/www_2005/presentations/damora-cell4graphicsandviz-gh05.pdf

Omg, it says THAT terrain renders in REALTIME from a heightmap

30+ frames per second with only one Cell processor

hahah! Amazing :smiley:

Exactly… I am NOT asking to change the current OpenGL scan-line approach like Korval mentioned. I am NOT asking to perform reverse Monte Carlo raytracing. I DON’T want to cast rays for each pixel in the final image as a traditional raytracer like POV-Ray or Mental Ray does.
What I want is ONLY a way to cast a few rays ( for shadows and other rare situations ) from the fragment shader.
You seem to believe that there is some distinction. There isn’t.

The very instant you want to cast a ray against the actual rendered scene (which hasn’t even been fed to a scan-conversion-based GPU yet, so it’s impossible), you have 95% of what you would need for a full-on ray tracer. Indeed, if you can ever trace a ray into a scene, you may as well render the whole scene by raytracing.

But it won’t be efficient because I need to iterate over ALL the triangles in the mesh! I need some kind of spatial structure to discard triangle batches fast!
And it won’t work either because it only works for that one batch. What if there’s some other object that hasn’t been rendered yet that would cast a shadow?

The method you describe would only function as a self-shadowing method. And even then, that presumes that the entire mesh is rendered in one batch.

You can’t get raytraced shadows without making a real raytracer.

hahah! Amazing
Why does this amaze you?

A cell processor contains upwards of 8 functional units, each of which has good FP power. If it couldn’t render a raytraced scene in real time (raytracing is very amenable to multiprocessed solutions), I’d say that there would be something wrong with it.

Even so, it still can’t achieve the scene complexity of modern scan converters.

Yep yep, the current in-development geometry shader model is not good for achieving this because it can only see the 3 adjacent triangles. We would need to change that, but it was only a small example.

And yes, you’re right… We need to see MORE than the current triangle batch being drawn; that’s why we need the glCreateObjectQuery function or something similar, to group more than one object and allow more than self-shadowing.

We need to see MORE than the current triangle batch being drawn
And that is what turns it from being a scan converter into being a ray tracer. That is one of the principal, fundamental differences between scan conversion and ray tracing.

Which is why you’re never going to see it as long as we’re doing scan conversion.

Originally posted by Korval:
Which is why you’re never going to see it as long as we’re doing scan conversion.

8((((