Need a way to calculate theoretical FLOPS of a device

We can query some information about a device, such as the number of compute units and the clock speed. The number of compute units on its own says little about the computational power of a device. A CPU reports each core as one compute unit, whereas something like an NVIDIA card reports each streaming multiprocessor as one compute unit, and each of those contains either 8 or 32 streaming processors (cores).
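To make the gap concrete, here is a rough sketch of the kind of estimate I have in mind. The function name, the `lanes_per_unit` parameter, and the example device figures are all hypothetical. The lane count is exactly the piece the device query does not give us, so here it has to be supplied by hand, and the default of 2 floating-point operations per lane per cycle assumes a multiply-add, which may not hold for every device.

```python
def peak_gflops(compute_units, clock_mhz, lanes_per_unit, flops_per_lane_per_cycle=2):
    """Rough theoretical peak in GFLOPS.

    compute_units and clock_mhz are the values a device query reports.
    lanes_per_unit is NOT reported and must come from elsewhere
    (assumption: e.g. 1 for a plain CPU core, 8 or 32 for a streaming
    multiprocessor). flops_per_lane_per_cycle=2 assumes one multiply-add
    per cycle, which is also an assumption.
    """
    # units * lanes * MHz * flops/cycle gives MFLOPS; divide by 1000 for GFLOPS.
    mflops = compute_units * lanes_per_unit * clock_mhz * flops_per_lane_per_cycle
    return mflops / 1000.0


if __name__ == "__main__":
    # Made-up devices: the CPU reports fewer compute units at a higher clock,
    # yet the estimate only becomes comparable once the lane count is known.
    print("cpu:", peak_gflops(compute_units=4, clock_mhz=3000, lanes_per_unit=1))
    print("gpu:", peak_gflops(compute_units=30, clock_mhz=1300, lanes_per_unit=8))
```

Without a reliable value for `lanes_per_unit`, the two results cannot be compared, which is the whole problem.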

When we start a program and multiple devices are present, we want to run the program on the fastest device available. The number of compute units alone cannot tell us which one that is.

Doing this in a meaningful way is very complex and depends heavily on the exact nature of your algorithm(s). The best way to evaluate it turns out to be for an application to simply run its algorithms, or a sufficiently representative one, and measure the execution time on the machine it is running on. The measurement overhead is troublesome for an app that only runs once, but it should be fine for ongoing execution, where the cost is paid once and the result is reused.
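A minimal sketch of that measure-and-pick approach, assuming each candidate device is wrapped in a callable that runs the representative workload to completion on that device. The function `pick_fastest` and its parameters are hypothetical names for illustration, not part of any existing API.

```python
import time


def pick_fastest(runners, repeats=3, timer=time.perf_counter):
    """Time each runner and return (best_name, timings).

    runners maps a device name to a zero-argument callable that runs the
    representative workload on that device and blocks until it finishes.
    timings maps each name to its best (smallest) wall-clock time.
    """
    timings = {}
    for name, run in runners.items():
        run()  # warm-up run, so one-time setup cost is not measured
        best = float("inf")
        for _ in range(repeats):
            start = timer()
            run()
            best = min(best, timer() - start)
        timings[name] = best
    # The device with the smallest best-of-N time wins.
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    # Stand-in workloads; real code would enqueue the kernel on each device.
    runners = {
        "device_a": lambda: sum(i * i for i in range(200_000)),
        "device_b": lambda: sum(i * i for i in range(20_000)),
    }
    name, timings = pick_fastest(runners)
    print(name, timings)
```

Taking the best of several runs after a warm-up is one reasonable choice for reducing noise. A long-running app could store the result and skip the measurement on later launches.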