Summing up all elements of a buffer

I can’t quite figure out the best way to sum up all the elements of a buffer.

I have tried the following two kernels, but neither of them sums the buffer correctly.


__kernel void vector_sum(__global int *A,__global int *sum) {
    // Get the index of the current element to be processed
    int i = get_global_id(0);

    // Do the operation
    sum[0] = A[i] + sum[0];
}



__kernel void vector_sum(__global int *A,int sum) {
    // Get the index of the current element to be processed
    int i = get_global_id(0);

    // Do the operation
    sum = A[i] + sum;
}


Does anyone have an idea for the best way to do this?

This is, not surprisingly, a very common problem on parallel architectures, one that is well studied and basically ‘solved’.

Do a search on ‘parallel reduction’ or ‘parallel prefix sum’: you will find dozens of hits for code and algorithms which should point you in the right direction. Every vendor’s SDK includes examples as well.

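Your proposed solutions also suggest that you have not yet understood the basic parallel execution model. In the first kernel every work-item reads and writes sum[0] at the same time with no synchronization, so updates get lost. In the second kernel sum is passed by value, so each work-item only changes its own private copy and nothing is ever written back. You need to get familiar with that model if you want to get anywhere with OpenCL. The OpenCL specification and the vendors’ introductory material are good guides.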
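To make the ‘parallel reduction’ idea concrete, here is a rough sketch of the usual tree-shaped reduction. It is a host-side model in plain C, not a kernel and not taken from any vendor SDK; the names group_tree_sum, reduce_sum, scratch and local_size are made up for illustration, and it assumes the work-group size is a power of two. The comments say where the OpenCL built-ins would appear in a real kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of one work-group's tree reduction.
 * scratch plays the role of __local memory. Each pass of the inner loop
 * stands for one work-item (lid would be get_local_id(0)), and each pass
 * of the outer loop ends where a kernel would call
 * barrier(CLK_LOCAL_MEM_FENCE). */
int group_tree_sum(int *scratch, size_t local_size)
{
    /* Assumption: local_size is a non-zero power of two. */
    assert(local_size > 0 && (local_size & (local_size - 1)) == 0);

    for (size_t stride = local_size / 2; stride > 0; stride /= 2) {
        /* Work-items 0..stride-1 each fold in one element from the
         * upper half; reads and writes never touch the same slot. */
        for (size_t lid = 0; lid < stride; ++lid)
            scratch[lid] += scratch[lid + stride];
        /* A kernel would synchronize the work-group here. */
    }
    return scratch[0];
}

/* Sums n ints by cutting them into groups of local_size elements,
 * reducing each group, and adding up the per-group partial sums.
 * scratch must have room for local_size ints. */
int reduce_sum(const int *a, size_t n, int *scratch, size_t local_size)
{
    int total = 0;
    for (size_t base = 0; base < n; base += local_size) {
        /* Load step: pad the last group with zeros. */
        for (size_t lid = 0; lid < local_size; ++lid)
            scratch[lid] = (base + lid < n) ? a[base + lid] : 0;
        total += group_tree_sum(scratch, local_size);
    }
    return total;
}
```

On a device the groups run in parallel and each one writes its partial sum to a small output buffer; those partial sums are then added in a second pass or on the host. Each group needs log2(local_size) steps instead of local_size sequential additions.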

thanks

I found this resource on the AMD site:
http://developer.amd.com/documentation/articles/pages/opencl-optimization-case-study-simple-reductions.aspx

If they’re global ints, you can use atomics, right?

http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/atomic_add.html

Is that supported on all OpenCL devices?

Global atomics in general are really, really slow.

And even if they weren’t really slow, they would still be slow.

They introduce a serial bottleneck which may span 20+ CPU cores, so it doesn’t matter how fast each atomic is: they will always be a bottleneck.
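A rough way to see the bottleneck is to count how many updates hit the single shared address. The sketch below is again a host-side model in plain C, using C11 atomic_fetch_add as a stand-in for OpenCL’s atomic_add; the function names and the updates counter are made up for illustration. It compares one atomic per element with one atomic per group after a private partial sum, which is a common mitigation and not something proposed earlier in this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* One atomic add per element: all n updates queue up on the same
 * address. Returns the number of serialized updates. */
size_t sum_atomic_per_element(const int *a, size_t n, atomic_int *sum)
{
    size_t updates = 0;
    for (size_t i = 0; i < n; ++i) {   /* i stands for get_global_id(0) */
        atomic_fetch_add(sum, a[i]);   /* kernel: atomic_add(sum, A[i]) */
        ++updates;
    }
    return updates;
}

/* Each group of local_size elements is summed privately first and then
 * contributes a single atomic add. Returns the number of serialized
 * updates, which is the number of groups. */
size_t sum_atomic_per_group(const int *a, size_t n, size_t local_size,
                            atomic_int *sum)
{
    assert(local_size > 0);
    size_t updates = 0;
    for (size_t base = 0; base < n; base += local_size) {
        int partial = 0;
        for (size_t i = base; i < base + local_size && i < n; ++i)
            partial += a[i];
        atomic_fetch_add(sum, partial);
        ++updates;
    }
    return updates;
}
```

With n elements and groups of local_size, the serialized updates drop from n to roughly n / local_size. The atomics are still a serial section, just a much shorter one.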