I'm trying to solve the following problem: I need to create a tiled version of an image where, per tile, I find the min/max luminance value.
To start with, I'm using a 1024x512 image with tiles of 16x16 each.

Based on what I've been reading, this is the kernel I came up with, using barriers to sync the local threads in a work-group. I assumed that by synchronising the threads and using a __local variable each for the min and max, they would end up holding the correct values, but that's not the case. The result changes every frame, and I can't really understand what values it comes up with:

On the host side, here's what I'm doing:

1024x512 work-items in work-groups of 16x16.

Code :
    clProgram->Set( 0, clTexMem );  // 1024x512 res input
    clProgram->Set( 1, clMinMaxLumMem ); // 64x32 res output
    clProgram->Set( 2, 0.0f ); // default min value
    clProgram->Set( 3, 1.0f ); // default max value
    clProgram->Set( 4, tileSize ); // tile size (default: 16)
    clProgram->Set( 5, AppParams::showTileMaxZ?1:0 ); // a condition in the kernel that renders the min OR max just for debugging
    int workItemsX = Math::Min( (int)clDevice->info.maxWorkGroupSize, tex->GetWidth()/tileSize );
    int workItemsY = Math::Min( (int)clDevice->info.maxWorkGroupSize, tex->GetHeight()/tileSize );
    clDevice->Run2D( workItemsX*tileSize, workItemsY*tileSize, tileSize, tileSize );

Here's the kernel code:

Code :
__kernel void Image_GetMinMaxLum( read_only image2d_t srcImage, write_only image2d_t dstImage, float nearClipPlane, float farClipPlane, int tileSize, int showTileMaxZ )
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    const float3 lumColor = (float3)( 0.30f, 0.59f, 0.11f );
    int uStart = get_global_id(0);
    int vStart = get_global_id(1);
    int lx = get_local_id(0);
    int ly = get_local_id(1);
    float4 color;
    // input image coords
    int2 coords = (int2)( uStart, vStart );
    // output image coords (one texel per tile)
    int2 coordsDst = (int2)( uStart / tileSize, vStart / tileSize );
    // local work-group vars that store min/max values
    __local float minZ;
    __local float maxZ;
    if( showTileMaxZ > 0 )
    {
        maxZ = nearClipPlane;
        color = read_imagef( srcImage, sampler, coords );
        float Y = dot( lumColor, color.rgb );  // luminance
        maxZ = max( maxZ, Y );
        color.rgb = (float3)( maxZ );
    }
//    barrier( CLK_LOCAL_MEM_FENCE );
    if( !showTileMaxZ )
    {
        // attempt to reset minZ to its max value when the first thread runs (assuming it's the first one that runs??)
//        if( (lx+ly) == 0 )
            minZ = farClipPlane;
        color = read_imagef( srcImage, sampler, coords );
        float Y = dot( lumColor, color.rgb );  // luminance
        minZ = min( minZ, Y );
        color.rgb = (float3)( minZ );
    }
    barrier( CLK_LOCAL_MEM_FENCE );
    write_imagef( dstImage, coordsDst, color );
}

I'm pretty new to OpenCL, but the tests above are all based on what I've read online and found in sources around the web.

I should say that I did a test with FOR loops and that seems to work, but it's probably not the right way to do it.
Basically one work-item runs a 16x16 loop to compute the min/max of all pixels in a tile and saves it. With this method I create far fewer work-items, but each one also does 16x16 times as many image reads.
I'm looking for advice on this; hopefully someone can help?