warp size vs # of SPs per SM

My GPU has 384 cores and 8 compute units (streaming multiprocessors), so there are 384/8 = 48 streaming processors (SPs) on each compute unit. Given that the NVIDIA warp size is 32, meaning 32 threads execute in lockstep, doesn’t that mean 48 - 32 = 16 SPs sit idle on each cycle? That doesn’t make sense to me. Can someone help clarify?

Thanks,
J

I’m guessing you have a 2nd-gen Fermi (compute capability 2.1). The scheduling on those is a little weird and I don’t entirely have my head around it myself, but the appendix of the CUDA C Programming Guide that covers Fermi (compute capability 2.x) explains it all.

On Fermi, each warp is physically executed as two half-warps of 16 threads; the 2.1 devices can effectively run 3 half-warps at once, and 3 × 16 = 48, so the extra 16 SPs are not left idle by design. (The full picture is more complex, because the device can issue more than one independent instruction per cycle, but that’s the gist of it.)

Thanks guys!