Inconsistent results from radix sort kernel

Hi, now I’m implementing radix sorting kernel.
The implementation works almost well, but inconsistently returns incorrect results.
So, I think the cause is related to synchronization with high probability.

The kernel is below.


#define SortSize 128
ushort scan(local ushort* prefixSumNumOne, const uint lid) {
    for (uint i = 2; i <= SortSize; i <<= 1) {
        uint idx = (lid + 1) * i - 1;
        if (idx < SortSize)
            prefixSumNumOne[idx] += prefixSumNumOne[idx - (i >> 1)];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    
    if (lid == 0)
        prefixSumNumOne[SortSize - 1] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (uint i = SortSize; i >= 2; i >>= 1) {
        uint idx1 = (lid + 1) * i - 1;
        if (idx1 < SortSize) {
            uint idx0 = idx1 - (i >> 1);
            ushort temp = prefixSumNumOne[idx0];
            prefixSumNumOne[idx0] = prefixSumNumOne[idx1];
            prefixSumNumOne[idx1] += temp;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    
    return prefixSumNumOne[lid];
}

uint split(local ushort* prefixSumNumOne, local ushort* falseTotal, const uint lid, ushort pred) {
    ushort trueBefore = scan(prefixSumNumOne, lid);
    
    if (lid == SortSize - 1)
        *falseTotal = SortSize - (trueBefore + pred);
    barrier(CLK_LOCAL_MEM_FENCE);
    
    if (pred)
        return trueBefore + *falseTotal;
    else
        return lid - trueBefore;
}

kernel void blockwiseSort(const global uint3* mortonCodes, uint numPrimitives, uchar bitID,
                          global uint* indices) {
    const uint lid0 = get_local_id(0);
    const uint gid0 = get_global_id(0);
    
    local uchar mortonCodesInBlock[SortSize];
    local ushort indicesInBlock[SortSize];
    local ushort prefixSumNumOne[SortSize];
    local ushort falseTotal;
    
    // 1. initialize local variables
    if (gid0 < numPrimitives) {
        uint idx = indices[gid0];
        uchar3 extracted = convert_uchar3(mortonCodes[idx] >> bitID) & (uchar)(0x01);
        mortonCodesInBlock[lid0] = extracted.x + (extracted.y << 1) + (extracted.z << 2);
    }
    else {
        mortonCodesInBlock[lid0] = 0x07;
    }
    indicesInBlock[lid0] = lid0;
    barrier(CLK_LOCAL_MEM_FENCE);
    
    // 2. one pass of radix sort
    uint curIdx = lid0;
    for (uint axis = 0; axis < 3; ++axis) {
        prefixSumNumOne[lid0] = (mortonCodesInBlock[curIdx] >> axis) & 0x01;
        barrier(CLK_LOCAL_MEM_FENCE);
        
        // computes permuted index by using common operation split and scan.
        uint dstIdx = split(prefixSumNumOne, &falseTotal, lid0, prefixSumNumOne[lid0]);
        
        // swap local indices
        indicesInBlock[dstIdx] = curIdx;
        barrier(CLK_LOCAL_MEM_FENCE);
        curIdx = indicesInBlock[lid0];
    }

    // 3. swap global indices
    uint idx = (gid0 < numPrimitives) ? indices[(gid0 - lid0) + curIdx] : UINT_MAX;
    barrier(CLK_GLOBAL_MEM_FENCE);
    if (idx != UINT_MAX)
        indices[gid0] = idx;
}

This kernel receives a list of Morton codes (a.k.a. Z-order curve) and indices not sorted (0, 1, 2, … numPrimitives - 1).
And it computes the sorted indices according to Morton codes order by using radix sort.
In radix sort procedure, it uses common scan and split operations. Designing Efficient Sorting Algorithms for Manycore GPUs
In this case, one radix digit is represented as a 3bits integer by incorporating extracted bit from the morton code.

  1. initialize default local indices and bitID-th radix digit numbers.
  2. loops over 3bits of a radix digit and computes sorted local indices.
  3. permutes global indices using sorted local indices.

I know this code is maybe too complicated to paste here.
But I couldn’t find the problem although I have been thinking several days.

Could anyone find a clear problem at a glance?
I think swapping on local or global is maybe doubtful.

Thanks.

Environment:
OS X 10.9.5 Mavericks
MacBook Pro Retina Late 2013
Intel Core i7 4850HQ 2.3GHz
GT 750M 2GB * used device in this problem
OpenCL 1.2

Finally, I’ve found out the problem.
In the above code, a local variable prefixSumNumOne[lid0] is passed to a function split().
I’ve thought such variable is normally copied into the argument variable that is in private memory, so it seems not to require barrier().
However, in case where the function call is inlined, this is not true.

I changed a code fragment as below. Now I can get consistent results.


for (uint axis = 0; axis < 3; ++axis) {
    uchar pred = (mortonCodesInBlock[curIdx] >> axis) & 0x01;
    prefixSumNumOne[lid0] = pred;
    barrier(CLK_LOCAL_MEM_FENCE);
    
    uint dstIdx = split(prefixSumNumOne, &falseTotal, lid0, pred);

Thank you.