Shader Emulation Function Switching Mode

I’m suggesting a feature that may not be very efficient, at least for now, but that could be very helpful when the shader version does not support certain functionality.

What about giving the option to specify that a certain function in a shader should be run by the CPU rather than the GPU?

For example,


target(CPU[, options])
vec3 myCPUFunc(...)
{
   ...
}

Here I’m telling the shader that this function is to be executed by the CPU because it contains functionality that is not supported by the current shader version. Hence, it’s run in software emulation mode.

To simplify things, the function may call other functions only if they are:

1 - CPU targeted AND
2 - Defined within the same shader.

So it cannot call, for instance, the system-defined math library functions or anything else; only the core C-like language is available.
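To make that restriction concrete (still using the made-up target() syntax from above, plus a hypothetical helper name; none of this is real GLSL), the idea would be something like:

target(CPU)
vec3 cpuHelper(vec3 v)
{
   return v * 2.0;                              // fine: CPU-targeted and defined in this shader
}

target(CPU)
vec3 myCPUFunc(vec3 v)
{
   // vec3 t = texture(someSampler, v.xy).rgb;  // not allowed: system-defined function
   // float s = sin(v.x);                       // not allowed: math library function
   return cpuHelper(v);                         // allowed: rules 1 and 2 both hold
}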

I’m not sure about the communication path between the GPU and CPU, the jump mechanism (far calls) and the return-value mechanism, but I guess it should be no problem with current hardware.

This is impossible on current hardware (or at least not possible in any efficient way). The CPU and the GPU are completely separate computing units and work completely asynchronously (as far as I understand, even Sandy Bridge or Fusion architecture CPUs cannot work synchronously with the GPU), and switching between the CPU and GPU is not a matter of a call or a jump; it would require some other form of communication, which would be rather inefficient, at least on current hardware.

I see, but I still think it’s very possible even if it’s not going to be efficient; after all, it’s run on the CPU.

The trick is to implement it at compile time, meaning that when the shader compiler hits the CPU-targeted function, it generates the code using a JIT-like compiler and uploads that code segment into the process’s address space instead.

The shader is divided into two sections and then combined, replacing the CPU function call with a temporary variable/register into which the CPU will write the return value of the function.
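Just to illustrate what the driver side of that could conceptually look like (every helper below is hypothetical and exists in no driver or GL version; only glGetUniformLocation is a real GL call), here is a rough C sketch:

#include <GL/glew.h>

/* Hypothetical driver-internal helpers -- they do not exist anywhere,
   they only name the steps described above. */
GLuint compileEverythingBeforeCall(const char *src);
GLuint compileEverythingAfterCall(const char *src);
typedef void (*CPUShaderFunc)(float out[3]);            /* JIT-compiled native code */
CPUShaderFunc jitCompileToNative(const char *src, const char *funcName);

typedef struct {
    GLuint        section1, section2;   /* the two generated sub-programs          */
    GLint         regLocation;          /* uniform that receives the CPU result    */
    CPUShaderFunc cpuFunc;              /* native version of the CPU-targeted fn   */
} SplitShader;

SplitShader compileWithCPUFunction(const char *src)
{
    SplitShader s;
    s.section1    = compileEverythingBeforeCall(src);      /* shader up to the call     */
    s.section2    = compileEverythingAfterCall(src);        /* shader after the call     */
    s.cpuFunc     = jitCompileToNative(src, "myCPUFunc");   /* upload into process space */
    s.regLocation = glGetUniformLocation(s.section2, "reg");
    return s;
}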

Of course, it will require extra functionality from the GL core, such as a JIT compiler and a shader-splitting mode:

The shader is executed in two stages, and in between a core function is implicitly injected that calls the CPU shader function and writes the result value to that register.

Behind the scenes, it’s something like this:

glUseProgram(MyShaderSection1);

reg = CPUFunction(); <-- normal function call, executed on the CPU

glUseProgram(MyShaderSection2);

glUniform3fv(regLocation, 1, reg); <-- write the CPU result into the second section

Ehhh, no.

The notion of running OpenGL (or portions of it) in software comes up quite frequently, and it’s just not any kind of viable solution. Won’t work in the Real World.

You’ll be constantly switching back and forth between the CPU and GPU for rendering, which will cause horrible pipeline stalls and latency. Modern GPUs have extremely deep pipelines and are allowed to process data out of sync with the CPU. Having to jump back to the CPU at any arbitrary moment would wreck that: the entire pipeline would need to drain, the GPU would need to wait for the CPU to catch up, transfer control, and then do the same all over again when the CPU is done with its thing. Yuck. Aside from this, in general any kind of per-fragment operation on the CPU will typically degrade performance to less than 1 frame per second.

Think of this from the end-user’s point of view. They’re running your program fine and then suddenly - SPLAT - everything grinds down to less than 1 FPS. Or performance seems to suddenly and quite randomly become all hitchy and jerky. What is the end user going to think? “Oh something must have used a shader feature that’s not supported by my GPU, I guess I better upgrade”? I don’t think so.

The only part of the pipeline that could be reasonably emulated on the CPU is per-vertex (this would include geometry shaders). Older (pre original GeForce) GPUs (they weren’t called “GPUs” in them days) did exactly that (they didn’t have shaders in them days either, but it was the same part of the pipeline) and even then it was a good deal slower than hardware T&L.

A much better option would be to allow the driver to arbitrarily decompose any pass using unsupported features into a multipass algorithm, and arbitrarily rewrite shaders behind your back to deal gracefully with unsupported calls. Even that’s not good, just “better”, and it’s a driver feature not an OpenGL feature. The best option is to just crash (hopefully with an informative error message) - at least that way the end user has a clearer notion of what the problem is and what they have to do to resolve it.

But if it isn’t going to be fast, then why offer it as a “product”? Why not just execute the entire pipeline in software?

Or better yet, instead of executing the pipeline in software, use OpenCL, where I’m assuming you can do all sorts of gymnastics.