dot product

Hi,
I would like to know if someone knows how to implement the dot product in a way that is efficient, or at least not much slower than doing it on the CPU. My idea is to implement the conjugate gradient method with a sparse matrix. The matrix-vector multiplication is faster on the GPU, but the whole method is slower, and I guess the reason is the dot product!!
Thanks!!

Pablo

You probably want a parallel prefix sum, or some sort of parallel reduction step. It’s a very common OpenCL operation, so you should be able to find plenty of papers and some code for it (all the SDKs will have examples of it).

A search for ‘parallel reduction’ shows a lot of relevant stuff, or try ‘parallel prefix sum’.
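
For a dot product specifically, the usual pattern is a two-stage reduction: each work-group multiplies its chunk of the two vectors element by element, reduces the products in local memory, and writes one partial sum; a tiny second pass then sums the partials. A minimal sketch, assuming a power-of-two work-group size (kernel and argument names are just illustrative):

```c
/* Illustrative stage 1 of a two-stage dot product:
 * one partial sum per work-group. */
__kernel void dot_partial(__global const float *a,
                          __global const float *b,
                          __global float *partial,   /* one float per work-group */
                          __local  float *scratch,   /* local_size floats */
                          const uint n)
{
    uint gid = get_global_id(0);
    uint lid = get_local_id(0);

    /* Multiply element by element, striding over the whole vector. */
    float sum = 0.0f;
    for (uint i = gid; i < n; i += get_global_size(0))
        sum += a[i] * b[i];
    scratch[lid] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction in local memory (local size assumed a power of two). */
    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}
```

The buffer of partial sums is small (one float per work-group), so it can be reduced by a second tiny kernel on the device, or read back and summed on the host if you don’t mind the transfer.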

Thanks!! I will search for it. The idea would be a kernel to multiply the vectors element by element and then sum everything, am I right? The problem is that a while ago I tried just doing the first part, and that alone took longer than the whole dot product on the CPU. Maybe I made a mistake that time.
Thanks again!!

Pablo

It’s only worth it if the data is resident on the GPU and stays there within a loop.

From your other posts it looks like you’re moving data to/from the CPU a lot within your main loop: it will be pointless if you’re doing this, and no matter what you do you’ll be massively underutilising any discrete GPU.

I changed the program and I do all the operations in kernels now, even the scalar operations. Something I’m not doing yet, but which a better version should do, is check the values so that if they don’t change by more than a tolerance, the loop ends. For now I’m just running the loop N times, so I don’t need to send data from the GPU to the CPU and vice versa, except at the beginning and the end.
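
Roughly, the host side of my loop is now something like the fragment below (the kernel and buffer names are just placeholders and all the setup is omitted): the kernels are enqueued N times and the only read-back is the solution vector at the end.

```c
/* Placeholder fragment: spmv_kernel, dot_kernel, axpy_kernel, the buffers
 * and the work sizes are assumed to be created and set up elsewhere. */
for (int iter = 0; iter < N; ++iter) {
    clEnqueueNDRangeKernel(queue, spmv_kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, dot_kernel,  1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, axpy_kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    /* ...the rest of the CG update kernels... */
}
/* Single read-back of the solution vector after the whole loop. */
clEnqueueReadBuffer(queue, x_buf, CL_TRUE, 0, n * sizeof(float), x_host, 0, NULL, NULL);
```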

Pablo

Ahaah, good. Yeah dynamic termination is tricky.

I don’t know the best answer, but some of the things I’ve tried:

  1. just hard-code it; that works for some problems …
    a) perform a reduction of the termination state on the GPU and put it in a small buffer, which can then be read by the CPU quickly.
    b) batch up a bunch of loops at a go, so this check isn’t done too often.
    c) copy the small state buffer to another one on the GPU, then read it synchronously on the CPU, but read it using a separate queue and wait for it on another thread, thus avoiding a synchronous device-stalling round-trip to check the state (rough fragment below).
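
To give an idea of what (a) and (c) can look like on the host, here’s a rough fragment; the names (check_queue, res_buf, res_copy, BATCH, tolerance) are made up, and a reduction kernel is assumed to have left the current residual norm in the one-float buffer res_buf:

```c
/* In the main loop, every BATCH iterations (idea (b)): */
if (iter % BATCH == 0) {
    /* (c): copy the tiny state buffer on the compute queue, so the copy
     * is ordered after the iteration that produced it. */
    clEnqueueCopyBuffer(queue, res_buf, res_copy, 0, 0, sizeof(float),
                        0, NULL, &copy_done);
}

/* On a separate checker thread, using a second queue so the compute queue
 * is never stalled by the round-trip: */
float residual;
clEnqueueReadBuffer(check_queue, res_copy, CL_TRUE, 0, sizeof(float),
                    &residual, 1, &copy_done, NULL);
if (residual < tolerance)
    stop_requested = 1;   /* the main loop polls this flag and breaks */
```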

Thanks! Anyway, the idea for the moment is to try to make the method run faster on the GPU, and maybe then do the dynamic termination!!

Pablo