Work Item synchronization

Hello,
I would like to synchronize work-items within one work-group. I tried using events, but there is a problem with a pointer-to-pointer variable in local memory.
So I know I have to use a barrier, but I don't understand how a barrier works. I have read about the local memory fence and the global memory fence.
Can anyone explain to me how to use a barrier for local memory? Here is an example of what I need to synchronize:


__kernel __attribute__((reqd_work_group_size(250, 1, 1)))
void func(__global short *inData, __local short *data)
{
    uint localID = get_local_id(0);

    if (localID == 0) {
        // write to local memory
    } else {
        // wait until work-item 0 has finished writing
    }

    for (...) {
        if (last character in local data && localID == 0) {
            // wait for all work-items in the work-group,
            // then rewrite the local data
        } else if (last character in local data) {
            // wait for the new data to be written
        }
    }
}

This is just an example. Work-item 0 writes data to local memory, and all work-items in the work-group need that data.
There is not much space in local memory, so I cannot write everything I need into local memory at once; I have to overwrite that memory region in several passes.

I would like help with this example, and with how to use a barrier (or some other method) to synchronize the work-items in a work-group.

Thanks

So, first things first: events are not what you want (they are not designed for fine-grained synchronization across work-items), and fences are not enough (they are about flushing caches and making sure that your threads have a consistent view of memory, but they do not guarantee that some of your threads are not running ahead of others).

The OpenCL C function you are looking for is barrier(). It combines thread synchronization with a memory fence, making sure that all threads in the work-group have reached the same code location and have the same view of local memory, global memory, or both (your choice; larger-scale memory synchronization is usually more expensive).
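Here is a minimal sketch of how that maps onto your case. This is not your real kernel: the chunk size of 64 and the copy loop are placeholders I made up, but the barrier placement is the point.

__kernel __attribute__((reqd_work_group_size(250, 1, 1)))
void func(__global const short *inData, __local short *data)
{
    uint localID = get_local_id(0);

    if (localID == 0) {
        // work-item 0 fills a chunk of local memory
        for (int i = 0; i < 64; ++i)
            data[i] = inData[i];
    }

    // every work-item in the group reaches this point; after the barrier
    // they all see the values that work-item 0 wrote to local memory
    barrier(CLK_LOCAL_MEM_FENCE);

    // ... all work-items can now read data[] safely ...
}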

The rules of barrier() are as follows:

1/ All threads in a work group must reach the barrier. If you have a conditional instruction (e.g. if … else), you must either put the barrier outside the conditional (after the else in your first example) or duplicate it across both branches of the conditional instruction (see the sketch just after this list).
2/ There are many subtle ways to break rule 1, so be careful. In particular, be wary of any construct where some threads might exit the code faster than others, such as loops with thread-dependent bounds, and function early exit with “return”.
3/ If you break rule 1, the result is undefined and hardware-dependent. For example, recent NVidia GPUs have special hardware to handle function early exit, but on other hardware it causes crashes or deadlocks.
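To make rule 1 concrete, here is a small sketch (reusing the localID variable from your kernel) of a broken placement and a correct one:

// BROKEN: only work-item 0 reaches the barrier, the others never do
if (localID == 0) {
    // write to local memory
    barrier(CLK_LOCAL_MEM_FENCE);
}

// CORRECT: the write stays conditional, but the barrier is outside,
// so all 250 work-items in the group execute it
if (localID == 0) {
    // write to local memory
}
barrier(CLK_LOCAL_MEM_FENCE);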

For the rest, I'll let you have a look at the OpenCL C documentation for your target OpenCL version.

Ok, thanks for the help.
I understand barriers now, but when I test the synchronization on my ATI graphics card, the barrier does not work, while the same test on another GPU (NVidia) works fine.
So I want to ask how that is possible. Is there any parameter in OpenCL to identify this difference?
My second question is about work-items in a work-group. The parameter CL_DEVICE_MAX_WORK_ITEM_SIZES gives me sizes of 256, 256, 256, but when I launch more than 64 work-items in one group in the first dimension, the program freezes while waiting. I do not understand how that is possible, because a different GPU reports the same sizes yet runs 256 work-items per work-group without problems. So how do I get reliable information about the number of work-items per work-group? And how can I get the size defined by the kernel, since CL_KERNEL_COMPILE_WORK_GROUP_SIZE returns the same result even when I change __attribute__((reqd_work_group_size(X, Y, Z)))?

Thanks for answering this question

I won't be able to answer your question about nonportable barriers without more code, but to give you just one example, you might have fallen "victim" to the nonstandard support for function early exit on NVidia devices that I alluded to above. Incorrect barrier usage is undefined behaviour in OpenCL, which means the result is unpredictable and hardware-dependent.

Regarding local work size, I think you might have been tricked by the subtle difference between CL_DEVICE_MAX_WORK_ITEM_SIZES and CL_DEVICE_MAX_WORK_GROUP_SIZE (truly a dreadful naming convention). The first one tells you how many work-items you can have along each dimension of a work-group (e.g. if it is {256, 256, 256}, you cannot use a local work size of {257, 1, 1}, because the first dimension is too large). The second one tells you how many work-items you can have in a work-group overall (e.g. if it is 256, you cannot use a local work size of {256, 2, 1}, because that is 512 work-items overall).
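For reference, here is a small host-side fragment showing how you could query both limits. It assumes you already have a valid cl_device_id in "device", the usual OpenCL and stdio headers included, and a device that reports three work-item dimensions (which yours does):

size_t item_sizes[3];  /* per-dimension limit (CL_DEVICE_MAX_WORK_ITEM_SIZES) */
size_t group_size;     /* total work-items per work-group (CL_DEVICE_MAX_WORK_GROUP_SIZE) */

clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                sizeof(item_sizes), item_sizes, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(group_size), &group_size, NULL);

printf("max per dimension: %zu x %zu x %zu, max per work-group: %zu\n",
       item_sizes[0], item_sizes[1], item_sizes[2], group_size);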

Regarding your last question about CL_KERNEL_COMPILE_WORK_GROUP_SIZE, I have no idea what is going on here. Just as a sanity check, have you looked at the return value of clGetKernelWorkGroupInfo?
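In case it helps, this is roughly what I mean (assuming you have the built cl_kernel in "kernel" and the cl_device_id in "device"). CL_KERNEL_COMPILE_WORK_GROUP_SIZE should give back exactly the values from reqd_work_group_size, or {0, 0, 0} if the kernel has no such attribute:

size_t compile_size[3];
cl_int err = clGetKernelWorkGroupInfo(kernel, device,
                                      CL_KERNEL_COMPILE_WORK_GROUP_SIZE,
                                      sizeof(compile_size), compile_size, NULL);
printf("err = %d, reqd_work_group_size = %zu, %zu, %zu\n",
       err, compile_size[0], compile_size[1], compile_size[2]);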