My GPU has 384 cores across 8 compute units (streaming multiprocessors), so there are 384/8 = 48 streaming processors (SPs) on each compute unit. Given that the NVIDIA warp size is 32, meaning 32 threads execute in lockstep, doesn't that leave 48 − 32 = 16 SPs idle on each cycle? That doesn't seem to make sense to me. Can someone help clarify?
I’m guessing you have a 2nd-gen Fermi (cc 2.1). The scheduling on those is a little weird and I don’t entirely have my head around it myself, but if you read the CUDA C Programming Guide appendix on Fermi it explains it all.
On Fermi, each warp is physically executed as two half-warps; the cc 2.1 devices can effectively run three half-warps at once. (The full picture is more complex, because of the device's ability to issue more than one independent instruction per cycle, but that's the gist of it.)