OpenCL semaphores again and again...

Hello,
I couldn’t find any way to implement something like semaphores in OpenCL, because GPU resources are limited and the GPU has nothing like OS thread scheduling.
So what is the alternative to semaphores on the GPU? There must be a way to perform multiple operations on the same piece of data without interference from other threads; many applications need this. If semaphores are not allowed, what are the alternatives? And how are atomics implemented, then? As I understand it, they are the same idea as semaphores, correct?
If not, does anyone have resources about how atomics work?
Thanks…

Check this: OpenCL C99 Atomics

Thanks daa, but unfortunately this code does not work, as discussed in this thread: http://www.khronos.org/message_boards/viewtopic.php?f=28&t=3378 and in many similar threads :(

Semaphores are not needed; you can always write the code another way.

e.g. reading a queue?
Use an atomic counter as the index, and pass the limit in as a kernel argument.
e.g. writing a list?
Do the same.
e.g. reading and writing the same queue?
You need at least two separate kernel invocations, one to generate and one to consume. You could use multiple atomics to arbitrate this in a single kernel, but the global memory model prevents you from doing anything useful with it: writes made by one work-group are not guaranteed to be visible to other work-groups until the kernel has finished.
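To make the "atomic counter as an index" idea concrete, here is a rough sketch rather than tested production code. It assumes OpenCL 1.1 or later, where `atomic_inc` on 32-bit global integers is core. The kernel name `append_even`, the "keep even values" filter, and the argument names are made up for illustration. The block at the top is only a shim so the same file compiles as plain C for a quick serial check; its `atomic_inc` stand-in is not atomic.

```c
/* Host-side shims, assumed only for trying the logic in plain C.
   An OpenCL compiler defines __OPENCL_VERSION__ and skips this part. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t fake_gid;                       /* emulated work-item id */
static size_t get_global_id(unsigned int dim) { (void)dim; return fake_gid; }
/* Serial stand-in: returns the old value, like the real atomic_inc. */
static unsigned int atomic_inc(volatile unsigned int *p) { return (*p)++; }
#endif

/* Every work-item whose input passes the filter reserves one output slot.
   count must be zeroed by the host before the kernel is enqueued. */
__kernel void append_even(__global const int *in, unsigned int n,
                          __global int *out, unsigned int limit,
                          volatile __global unsigned int *count)
{
    size_t gid = get_global_id(0);
    if (gid >= n || (in[gid] & 1))
        return;                               /* out of range or odd: no output */

    unsigned int slot = atomic_inc(count);    /* old value = unique index */
    if (slot < limit)                         /* limit is passed as an argument */
        out[slot] = in[gid];
}
```

After the kernel finishes, `count` can be larger than `limit` if the list overflowed, so the host should treat min(count, limit) as the number of valid entries. The order of the entries in `out` is not deterministic on a real device.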

You could also dedicate an I/O location to each work-item or work-group so that they cannot conflict, and then collect the results afterwards (needed if not every result location will be set).
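A minimal sketch of the one-slot-per-work-item approach, under the same assumptions as any untested forum snippet. The names `find_matches` and `collect`, and the use of -1 as the "not set" marker, are invented for illustration. The kernel needs no atomics because no two work-items touch the same location. The collect step is shown as plain host C; it could equally be a second kernel.

```c
/* Host-side shims, assumed only for trying the logic in plain C.
   An OpenCL compiler defines __OPENCL_VERSION__ and skips this part. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t fake_gid;                       /* emulated work-item id */
static size_t get_global_id(unsigned int dim) { (void)dim; return fake_gid; }
#endif

/* Each work-item writes only result[gid]: its own index on a match,
   or -1 to mark the slot as empty. The global size must equal the
   length of in and result. */
__kernel void find_matches(__global const int *in, int needle,
                           __global int *result)
{
    size_t gid = get_global_id(0);
    result[gid] = (in[gid] == needle) ? (int)gid : -1;
}

#ifndef __OPENCL_VERSION__
/* Host side, after reading result back: keep only the slots that were set.
   Returns the number of hits copied into hits[]. */
static int collect(const int *result, int n, int *hits)
{
    int found = 0;
    for (int i = 0; i < n; ++i)
        if (result[i] >= 0)
            hits[found++] = result[i];
    return found;
}
#endif
```

The price is an output buffer as large as the input plus an extra pass, but the results come back in a deterministic order and nothing has to be serialised.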

Because of the extreme parallelism, any form of true code serialisation such as semaphores is extremely undesirable, even if you could implement it. This is a fundamental issue that comes with implementing high levels of parallelism, not a shortcoming of the OpenCL APIs. Rather than trying to build primitives that conflict with the hardware, try to solve the problem with the mechanisms available.

In short, scheduling and serialisation are done by the host code, which invokes different kernels in the right order. Kernels can only make limited decisions based on external data, such as how much of a list or array to process or write.