Hi,
I would like to know if someone knows how to implement the dot product in a way that is efficient, or at least not much slower than doing it on a CPU. My idea is to implement the conjugate gradient method with a sparse matrix. The matrix-vector multiplication is faster on the GPU, but the whole method is slower, and I guess the reason is the dot product!!
Thanks!!
You probably want a parallel prefix sum, or some sort of parallel reduction step. It’s a very common OpenCL operation so you should be able to find plenty of papers and some code for it (ALL the SDKs will have examples of it).
A search for ‘parallel reduction’ shows a lot of relevant stuff, or try ‘parallel prefix sum’.
Thanks!! I will search for it. The idea would be a kernel that multiplies element by element and then sums everything, am I right? The problem is that a while ago I tried doing just the first part, and that alone took longer than the whole dot product on the CPU; maybe I made a mistake that time.
Thanks again!!
It’s only worth it if the data is resident on the GPU and stays there within a loop.
From your other posts it looks like you’re moving data to/from the CPU a lot within your main loop: it will be pointless if you’re doing this, and no matter what you do you’ll be massively underutilising any discrete GPU.
I changed the program and now I do all the operations in kernels, even the scalar operations. Something I’m not doing currently, but which a better version would need, is to check the values so the loop ends once they stop varying by more than a tolerance (right now I just run the loop N times, so I don’t need to send data between the GPU and CPU, except at the beginning and end).
I don’t know the best answer, but some of the things I’ve tried:
just hard-code the iteration count; that works for some problems …
a) perform a reduction of termination-state checking on the GPU, and put it in a small buffer, which can then be read by the CPU quickly.
b) batch up a bunch of loops at a go, so this check isn’t done too often.
c) copy the small state buffer to another on the GPU, then read it synchronously on the CPU, but read it using a separate queue and wait for it on another thread, thus avoiding a synchronous device-stalling round-trip to check the state.