Array dimension on Kernel - AMD Histogram OpenCl example

First of all I want to present myself, I am from Chile and this is my first post on the forum. I am Ph.D. in Physics and some time ago I started this adventure on OpenCL.

The question is the following. I am studying an example of a Histogram from AMD webpage ( http://developer.amd.com/tools/heteroge … les-demos/ ) and I have an “inner conflict” with the dimensions of the data array on the kernel, the code is the following:

#define LINEAR_MEM_ACCESS
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

#define BIN_SIZE 256

/**

  • @brief Calculates block-histogram bin whose bin size is 256

  • @param data input data pointer

  • @param sharedArray shared array for thread-histogram bins

  • @param binResult block-histogram array
    /
    __kernel
    void histogram256(__global const uint
    data,
    __local uchar* sharedArray,
    __global uint* binResult)
    {
    size_t localId = get_local_id(0);
    size_t globalId = get_global_id(0);
    size_t groupId = get_group_id(0);
    size_t groupSize = get_local_size(0);

    /* initialize shared array to zero */
    for(int i = 0; i < BIN_SIZE; ++i)
    sharedArray[localId * BIN_SIZE + i] = 0;

    barrier(CLK_LOCAL_MEM_FENCE);

    /* calculate thread-histograms */
    for(int i = 0; i < BIN_SIZE; ++i)
    {
    #ifdef LINEAR_MEM_ACCESS
    uint value = data[groupId * groupSize * BIN_SIZE + i * groupSize + localId];
    #else
    uint value = data[globalId * BIN_SIZE + i];
    #endif // LINEAR_MEM_ACCESS
    sharedArray[localId * BIN_SIZE + value]++;
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    /* merge all thread-histograms into block-histogram */
    for(int i = 0; i < BIN_SIZE / groupSize; ++i)
    {
    uint binCount = 0;
    for(int j = 0; j < groupSize; ++j)
    binCount += sharedArray[j * BIN_SIZE + i * groupSize + localId];

     binResult[groupId * BIN_SIZE + i * groupSize + localId] = binCount;
    

    }
    }

In the “/calculate thread-histograms/” part each element of the data array is called in the following way:

data[groupId * groupSize * BIN_SIZE + i * groupSize + localId]

Now comes my conflict. The dimensions of the data array is 10241024=1048576, the number of workgroups is 4096 and the local size is 128 (as defined by the host program). If one calculates the index for the “last thread” of groupId=4095 and localId=127 one would have an index of 4095128256+i128+127=134185087+i*128, which clearly exceeds the dimension of the data array. So how does the program handle this issue?

Well, a last question, what does LINEAR_MEM_ACCESS mean?, I have been trying to find the answer without luck… thanks for your help!

OK, I realized that I had a conceptual error and now I understand the dimensions in the problem. The global number of Threads is 4096 and the group size is 128, therefore there are 32 work groups… now everything fits…