An important element of Geometry Shaders people struggle to emulate in Compute Shaders is Streaming Output Buffers - i.e. the ability to incrementally spit out geometry for rendering without accumulating all the output geometry data into a single large buffer object.
When we use Geometry Shaders, we create buffers of input geometry (direct or indexed), and issue a draw call. The geometry shader operates on each input primitive, and produces one or more output triangles, with potentially unique transformed coordinates. Those output primitives are streamed into the next pipeline stage.
If our draw call has 50,000 triangles, we hope Geometry Shaders don’t create a wasteful (50,000 * GL_MAX_GEOMETRY_OUTPUT_VERTICES) output buffer from the geometry shader. We hope they instead accumulate primitives in a buffer sized to the GPU’s thread-dispatch width, flushing it onward each time it fills (though, strictly speaking, we don’t know what drivers actually do).
However, a simple Compute Shader implementation allocates an output buffer big enough to hold the results of the entire compute batch, because the compute shader runs to completion before any draw call is issued against the output buffer.
If our drawing batch has 50,000 triangles, and each Compute Shader instance creates one triangle, our output buffer has to have space for 50,000 triangles. If each instance creates multiple triangles, the output buffer grows to (50,000 * max_geometry_per_cs_instance)! And the problem is amplified as command-buffer drawing lets us put more and more work into a single draw call.
If we want to avoid allocating this output buffer on the fly, we pre-allocate the largest one we need, and then, if it’s in use, we either have to wait for a batch to finish or we need more than one of them. This is a real inconvenience that Geometry Shaders free us from.
Two examples of this are the Geometry Shaders in Parallel Split Shadow Maps and Voxel Global Illumination. In PSSM, a geometry shader is used to clone geometry into the appropriate shadow-map splits (normally 3). In VXGI, a geometry shader is used to decide which projection axis gives each triangle the largest coverage in the 3D clipmap texture, and to project the triangle along that axis.
Whether using simple geometry buffers or Metal Indirect Command Buffers, all examples I can find still allocate and store a buffer for the whole compute call.
But what do you do when your Compute Shaders are transforming the input data and producing completely unique output data, as in PSSM or VXGI, that you don’t want to pre-allocate and store in full?
Is there some way in Metal 2 to have compute shaders directly issue streaming drawing commands (without CPU involvement), using temporary buffers proportional to the GPU thread-width rather than to the full drawing batch size of each object? If so, how?
One way I can see to do something remotely similar with Compute Shaders is to manually split the draw batch into sub-batches. For example, we could take a 50,000-primitive object and subdivide it into multiple 2,000-primitive compute batches. A single compute batch then only needs a (2,000 * generated_primitives) output buffer. Each time a compute call finishes, we send its sub-batch output to drawing and simultaneously issue the next compute batch. This requires only two output buffers, which we ping-pong. I do worry about CPU involvement and GPU stalls, but it might be better than allocating and consuming huge temporary buffers.