different local variables share memory?

Hello guys, i am starting in OpenCL and i have stumbled upon wierd behaviour. It is probably some basic thing but i can’t figure it out. Any help appreciated.
I have following kernel function:

__kernel void test()
{
    __local float var1;
    __local float var2;
    
    int x = get_local_id(0);
    var2 = 0;
    var1 += x;
    
    barrier(CLK_LOCAL_MEM_FENCE);
    printf("x: %d var1: %f var2: %f 
", x, var1, var2);
}

which i am invoking in following way:

 
        cl::Kernel kernel_test(program, "test");
        queue.enqueueNDRangeKernel(
            kernel_test, 
            cl::NullRange,
            cl::NDRange(10),
            cl::NDRange(10),
            NULL,
            &event);
 
        event.wait();

and i get following result:

 
x: 0 var1: 9.000000 var2: 9.000000
x: 1 var1: 9.000000 var2: 9.000000
x: 2 var1: 9.000000 var2: 9.000000
x: 3 var1: 9.000000 var2: 9.000000
x: 4 var1: 9.000000 var2: 9.000000
x: 5 var1: 9.000000 var2: 9.000000
x: 6 var1: 9.000000 var2: 9.000000
x: 7 var1: 9.000000 var2: 9.000000
x: 8 var1: 9.000000 var2: 9.000000
x: 9 var1: 9.000000 var2: 9.000000
 

which suprised me because i was under impression that local variables should be shared between work-items in work-group and i am invoking one workgroup that contains 10 items. Threfore i should have a total sum<1-10> in var1. What suprised me even more is the fact, that content of var2 is identical with var1, which i don’t see any reason for.

I have ATI stream 2.2 installed, running this code on CPU (obviously) which is Phenom II X4 925

Thx for help!

You are correct that __local describes variables that are allocated in local memory and shared by all work-items of a work-group.

I think your first dilema comes from the CLK_LOCAL_MEM_FENCE barrier you are seeing. From the Spec, a barrier essentially means that:

“All work-items in a work-group executing the kernel
on a processor must execute this function before any
are allowed to continue execution beyond the barrier.
This function must be encountered by all work-items in
a work-group executing the kernel.”

The more I look at this, though, the more confused I personally become. What I would do is try removing that barrier and seeing what happens. The next step would be reduce the local work-group size to 1, and switch the

int x = get_local_id(0);

to

int x = get_global_id(0);

I have a small theory as to why it is operating strangely, but the theory breaks down when looking at the x value in the output.

I’m not sure why var2 is not 0 though.

As HolyGeneralK said, __local variables are shared across all work-items in a work-group. That means that you have to be careful when you update their values, or otherwise you will see the kind of problem you have today.

The first problem is that var1 was never initialized. You increment its value, but the starting value is undefined, so the final result will also be undefined.

In addition to that, think about this line of code from your kernel:


    var1 += x;

What is happening is that multiple work-items are trying to read and write into ‘var1’ simultaneously, stomping on each other’s feet and producing incorrect results.

One way to avoid this problem is using atomic operations (section 6.11.11 of the OpenCL 1.1. spec). Atomics guarantee that the output will be correct even if multiple work-items attempt to update the same variable at the same time. Unfortunately they are rather slow.

The local barrier you place at the end would only be useful if different work-items were writing into different locations in __local memory and after the barrier you want them to read the most up-to-date values stored in it.