Does anyone know how to implement the dot product on the GPU efficiently, so that it is no slower, or at least not much slower, than on the CPU? I am implementing the conjugate gradient method with a sparse matrix. The matrix-vector multiplication is faster on the GPU, but the whole method is slower, and I suspect the dot product is the reason.
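
For context, here is a minimal CPU-side sketch of the conjugate gradient loop in plain Python, to show where the dot products sit. It assumes the sparse matrix is symmetric positive definite and stored in CSR form; the names `csr_matvec`, `dot`, and `conjugate_gradient` are illustrative and not taken from any library. Each iteration has one matrix-vector product and two dot products, and each dot product reduces a whole vector to a single scalar that the next update depends on, which is probably why it maps less well to the GPU than the matrix-vector product.

```python
def csr_matvec(data, indices, indptr, x):
    """Sparse matrix-vector product y = A x, with A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def dot(u, v):
    """Dot product: reduces two vectors to a single scalar."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(data, indices, indptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b, assuming A is symmetric positive definite (CSR)."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, starting from x = 0
    p = list(r)              # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:   # residual norm small enough: stop
            break
        Ap = csr_matvec(data, indices, indptr, p)   # the one matvec
        alpha = rs_old / dot(p, Ap)                 # dot product 1
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)                          # dot product 2
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Example: A = [[4, 1], [1, 3]] in CSR form, b = [1, 2]
x = conjugate_gradient([4.0, 1.0, 1.0, 3.0], [0, 1, 0, 1], [0, 2, 4], [1.0, 2.0])
```

For this 2x2 system the exact solution is x = [1/11, 7/11]. The scalars `alpha` and `beta` cannot be computed until the corresponding reduction has finished, so in this sketch the vector updates have to wait on every dot product.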