Kernels and mutexes

The goal of a mutex is to ensure that only one work-item can access critical data at a time. I’ve tried to implement a mutex with the atomic_cmpxchg function, which atomically performs old = *p; *p = (old == cmp) ? val : old; and returns old.

To test the mutex, I tried the following code:

__local int mutex;
mutex = 0;

while(atomic_cmpxchg(&mutex, 0, 1) == 1);
...
atomic_xchg(&mutex, 0);

The while loop checks the old value returned by atomic_cmpxchg. If it is 1 (the locked state), the work-item keeps waiting. If it is 0 (the unlocked state), the exchange has already set the mutex to 1, and the work-item proceeds to the following code. The last line then resets the mutex to 0 (the unlocked state).

But when I run this code, my graphics card hangs. Has anyone else tried to implement a mutex in OpenCL?

Does the following help you? Creating OpenCL semaphors

That link is very informative, but I’m looking for a good way to implement a spinlock, where one work-item waits repeatedly for the lock to be available.

I’ve tried implementing this with atomic_cmpxchg, but it’s not working. The work-items wait, but they never acquire the lock.

I don’t see anything particularly wrong with your code. It looks ok.

That said, @daa’s link points to an article that does the exact same thing with atomic_xchg(). It’s a spinlock too, even though the author calls it a semaphore, and that solution seems a little more straightforward.

Try running the code from the link @daa gave you. If it hangs too, you probably have an issue with your OpenCL implementation. Otherwise, just use the atomic_xchg() solution.

I have run into a similar problem, even with code very similar to the one daa posted (the Linux machine where I run the code has no Mono, so I can’t run the C# example).
After some time I realized what the problem could be: the first work-item successfully locks the spinlock, and the second finds it locked (so far everything is OK). However, because instructions execute in SIMD fashion, the loop executed by the second work-item also blocks the first (there is no scheduling as on a CPU; all work-items in the same SIMD group execute the same instructions). The first work-item therefore never unlocks the spinlock, and the kernel hangs.
Can anybody confirm this? I experience the hangs on an NVIDIA GeForce GTX 580; my own ATI Radeon Mobile 4500 does not support atomic instructions, so I can’t test it there. :frowning:
I have posted the problem on NVIDIA’s forums, but judging from other topics, NVIDIA’s OpenCL implementors do not reply there very often… The Official NVIDIA Forums | NVIDIA

I think I’ve fixed my problems. I’ve coded a global memory barrier that synchronizes work-items in different work-groups. The important functions are atom_cmpxchg and atom_xchg, and these are available if the device supports cl_khr_int64_base_atomics or cl_khr_global_int32_base_atomics.

Here’s the theory. Software barriers generally have two stages. First, each thread atomically increments a count. Then, each thread atomically checks the count to see if it equals the total number of threads. If it does, each of the threads can proceed. If not, each thread continues reading the count.

Here’s my kernel code:

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

#define LOCK(a) atom_cmpxchg(a, 0, 1)
#define UNLOCK(a) atom_xchg(a, 0)

__kernel void mutex_test(__global int *count, __global int *mutex) {

   if(get_local_id(0) == 0) {
   
      /* Increment the count */
      while(LOCK(mutex));
      *count += 1;
      UNLOCK(mutex);

      /* Wait for everyone else to increment the count */
      int waiting = 1;
      while(waiting) {
         while(LOCK(mutex));
         if(*count == get_num_groups(0)) {
            waiting = 0;
         }
         UNLOCK(mutex);
      }
   }
}

I’ve tested this with an AMD 5850 on Linux, and it works. I’d really like to know if it works on other systems too. If so, we can use this barrier instead of redeploying kernels every time we want to synchronize multiple work-groups.

One last thing. If the moderator SteveBaker is the same person who writes the sjbaker.org site, I’d like to extend my gratitude. That’s a great site.

How can you be sure that the other work-groups will be run? You could be looping forever, waiting for work-groups that will never be scheduled. Try your code with a lot of work-groups (e.g. > 16000); I bet it will hang…

I think that if OpenCL explicitly forbids work-group synchronization, you should not attempt such non-portable, occasionally working hacks.

By the way, for the discussion above, I have solved the problem with inter-warp locking. See
http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=149425&enterthread=y.

flavius is completely right. The code you have posted is non-portable, and it is certain to deadlock for large enough NDRanges.