See Volkov's paper on matrix multiplication in CUDA.