Writing to shared global memory

Hi there, I’m wondering how to do what seems like a fairly simple task. I want each thread to increment an integer, so that if 30 threads are run on the GPU device, the counter will be 30.

I’m assigning the integer counter as global memory ( __global int* _nsamp) and using a simple
*nsamp += 1;
operation to increment it. How do I ensure that the threads write to it in order so there are no threading issues? At the moment the nsamp value is not the total number of threads that have run, as I’d like it to be.
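
To make it concrete, the kernel boils down to something like this (heavily simplified; the kernel name and arguments are just illustrative):

__kernel void count_samples(__global int* nsamp)
{
    // Plain read-modify-write on a shared global counter: several threads
    // can read the same old value, so increments get lost.
    *nsamp += 1;
}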

thanks in advance.

Writes are in no way ordered, so you will not get the right result here. The only way to do this is to use atomic operations, but they are really slow. However, you know exactly how many threads are run, since you set it with the global work size, so I may not fully understand what you are trying to do.
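
If you do need the counter on the device, the increment has to be atomic, roughly like this (atomic_inc on a __global int is core from OpenCL 1.1; on 1.0 you would need atom_inc and the cl_khr_global_int32_base_atomics extension):

__kernel void count_samples(__global int* nsamp)
{
    // Atomic read-modify-write: every increment is counted, but they are serialized.
    atomic_inc(nsamp);
}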

Sorry, I should explain better; this was a simplification of my problem.

I’m writing an algorithm which compares two images’ similarity by creating a histogram.

What this essentially entails is, for each pixel of the image, performing a computation and then incrementing the histogram “bin” selected by that computation.

There are only 64 bins for all the pixels, so multiple threads will want to increment the same bin, causing the issue I’m up against.

I can use a 1D array to get around this, but that slows it down quite a lot, as these images are actually 3D and can be relatively massive (300 x 300 x 300).

So at the moment I use this 1D array to store the results, copy it back out to the host process after completion, and then traverse the entire array to get my results. This is taking a fairly large amount of time. Do you think atomic operations would be significantly slower?
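
In case it helps, the current workaround looks roughly like this (very much simplified; the bin computation is just a stand-in for the real one, and the names are illustrative):

__kernel void classify_pixels(__global const float* image,
                              __global int* bin_per_pixel)
{
    // One output entry per pixel, so there is no contention,
    // but the host has to walk the whole 300x300x300 buffer afterwards.
    size_t i = get_global_id(0);
    int bin = (int)(image[i] * 64.0f);   // stand-in for the real computation
    if (bin > 63) bin = 63;
    if (bin < 0)  bin = 0;
    bin_per_pixel[i] = bin;
}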

thanks for taking time to reply.

Parallel histograms are tricky. You should have each work-item do its own histogram and then each work-group merge them together. After that you can do a reduction across work-groups to get the combined histogram. (Note that reductions across all work-groups require either using atomic_add operations (slow) or executing a second kernel since you can’t do synchronization across work-groups.)
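
A rough sketch of one way to do the first two stages, using a shared local-memory histogram per work-group (local atomics) rather than strictly private per-work-item histograms; the bin computation and names are placeholders, it assumes the 64 bins you mentioned, and it requires int32 atomics (core from OpenCL 1.1):

#define NUM_BINS 64

__kernel void partial_histogram(__global const float* image,
                                const int num_pixels,
                                __global int* global_hist)
{
    // One histogram per work-group in fast local memory.
    __local int local_hist[NUM_BINS];

    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    size_t lsz = get_local_size(0);

    // Zero the work-group's local histogram.
    for (size_t b = lid; b < NUM_BINS; b += lsz)
        local_hist[b] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each work-item bins its pixel; contention stays in local memory.
    if (gid < (size_t)num_pixels) {
        int bin = (int)(image[gid] * NUM_BINS);   // stand-in for the real computation
        if (bin > NUM_BINS - 1) bin = NUM_BINS - 1;
        if (bin < 0) bin = 0;
        atomic_inc(&local_hist[bin]);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // One global atomic per bin per work-group, instead of one per pixel.
    for (size_t b = lid; b < NUM_BINS; b += lsz)
        atomic_add(&global_hist[b], local_hist[b]);
}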

If you try to have all work-items update the histogram directly in global memory, you will find your performance is terrible, as you’re effectively serializing large parts of your computation through memory.