Hi,
I would like to know if someone knows how to implement the dot product in a way that is efficient, or at least not much slower than doing it on a CPU. My idea is to implement the conjugate gradient method with a sparse matrix. The matrix-vector multiplication is faster on the GPU, but the whole method is slower, and I guess the reason is the dot product!!
Thanks!!
You probably want a parallel prefix sum, or some sort of parallel reduction step. It’s a very common OpenCL operation so you should be able to find plenty of papers and some code for it (ALL the SDKs will have examples of it).
A search for ‘parallel reduction’ shows a lot of relevant stuff, or try ‘parallel prefix sum’.
Thanks!! I will search for it. The idea would be a kernel that multiplies element by element and then sums everything, am I right? The problem is that a while ago I tried doing just the first part, and that alone took longer than the whole dot product on the CPU; maybe I made a mistake that time.
Thanks again!!
It’s only worth it if the data is resident on the GPU and stays there within a loop.
From your other posts it looks like you’re moving data to/from the CPU a lot within your main loop: it will be pointless if you’re doing this, and no matter what you do you’ll be massively underutilising any discrete GPU.
I changed the program and now I do all the operations in kernels, even the scalar operations. Something I’m not doing currently, but which a better version would need, is to check the values so the loop ends once they stop varying by more than a tolerance (right now I just run the loop N times, so I don’t need to send data between the GPU and CPU, except at the beginning and end).
I don’t know the best answer, but some of the things I’ve tried:
just hard-code the iteration count; that works for some problems …
a) perform a reduction of termination-state checking on the GPU, and put it in a small buffer, which can then be read by the CPU quickly.
b) batch up a bunch of loops at a go, so this check isn’t done too often.
c) copy the small state buffer to another on the GPU, then read it synchronously on the CPU, but read it using a separate queue and wait for it on another thread, thus avoiding a synchronous device-stalling round-trip to check the state.