Quicksort at program scope: -36 error

Hi,

I’m working on image processing and I’m trying to implement a median filter, which needs a sort.

My median.cl file is structured as follows:



#define TYPE int

#define FILTERSIZE 1000
#define  MAX  FILTERSIZE

void quickSort(int *array, int length) // non-recursive algorithm
{
    int piv, begin[MAX], end[MAX];
    int i = 0;
    ....
    // quick sort
}

void Sort(TYPE *tab, int tailleTableau){ //bubble sort

    int i, k;
    TYPE temp;

    for(k = 0; k < (tailleTableau/2)+1; k++) {
        for(i = 0; i < tailleTableau-1; i++) {

            if(tab[i] > tab[i+1])
            {
                temp=tab[i];
                tab[i]=tab[i+1];
                tab[i+1]=temp;
            } 

        }
    }
}

__kernel void median(__global const TYPE *inputBuffer,
                     __global TYPE *outputBuffer,
                     __global const int *filterIndex,
                     const int filterIndexSize,
                     const int InputBufferSize)
{
    unsigned int x = get_global_id(0);
    TYPE pixel;
    unsigned int i;
    long offsetIndex;
    TYPE neighborhood[FILTERSIZE];

    pixel = 0;

    for(i=0; i< filterIndexSize; i++)
    {
        offsetIndex  = x + filterIndex[i];
       
        if(offsetIndex >= 0 && offsetIndex < InputBufferSize )  // in image buffer bound
        {
            neighborhood[i] = inputBuffer[offsetIndex];
        }
    }

//    Sort(neighborhood, filterIndexSize);
    quickSort(neighborhood, filterIndexSize);

    pixel = neighborhood[(int)(filterIndexSize/2)];

    outputBuffer[x] = pixel;   
}

Some small median filter sizes work, others don’t: I get a clEnqueueReadBuffer : CL_INVALID_COMMAND_QUEUE (-36) error.
How can I fix this issue? Is it an allocation failure on the GPU? Too big an allocation in the sort function?

Note that both sort algorithms are tested and work well. The kernel also works well without the call to the sort function.

Thanks.

If you are completely certain that the quicksort algorithm is correctly implemented, then it looks like your system fails to allocate that much private memory, due to the large value of FILTERSIZE.

The good news is that you don’t need to sort the whole array to find the median value. That’s overkill :slight_smile:

See Selection algorithm - Wikipedia. Notice how quickselect in particular can be implemented without recursion very easily and runs in O(n) average time and O(1) space. Plus, since it’s a variation on quicksort, it should be easy to understand.

Thanks! I had never heard of this algorithm.

I did it with the quickselect algorithm and it’s better:

  • before, with a 3x3x2 filter it was OK, but a 3x3x3 filter crashed with clEnqueueReadBuffer : -36, with no error log
  • now, 5x5x5 :: works,
    7x7x7 :: clEnqueueReadBuffer : -36, with no error log
    10x10x10 :: clearly reports clEnqueueNDRangeKernel : -48, log: ptxas error : Entry function ‘median’ uses too much local data (0x485a bytes, 0x4000 max)

-> I ask myself (and you :slight_smile: ) the following questions:

  • Should I find a way to decompose the neighborhood data?
  • Should I pass a global neighborhood variable via clSetKernelArg?
  • Is there another and/or better way to resolve this kind of issue?
  • Is 5x5x5 enough? :lol:

Should I pass a global neighborhood variable via clSetKernelArg?

That’s the best way I can see to make it work. You could try to store ‘neighbourhood’ in local memory, but since each work-item needs its own copy of the data so that it can modify (sort) it to find the median, the amount of local memory available on your system will limit your work-group sizes.

For example, if you have 16KB of local memory and you want to compute a 10x10x10 convolution in floating point you can only have 4 work-items per work-group. That’s clearly too little, so this way of using local memory is not a good idea.

What you could do is load an NxNxN block of data into local memory, then have each work-item copy a 10x10x10 piece from it into a global neighbourhood variable and sort it over there. That should reduce the amount of bandwidth you need quite a bit.

What you could do is load an NxNxN block of data into local memory, then have each work-item copy a 10x10x10 piece from it into a global neighbourhood variable and sort it over there. That should reduce the amount of bandwidth you need quite a bit.

Let me ask some questions, and correct me if I am wrong:
-> If I have a buffer of size (NxNxN)xM + 2 (for the image), how do I manage the 2 bytes left over? How do I determine N? Is it the largest value that fits within the ‘local memory maximum size’?
-> Is NxNxN a multiple of 10x10x10?
-> If I copy the neighborhood values of a work-item from local memory to global memory to compute (sort, in my case), and each work-item does the same, will there be conflicts?
TY!

P.S. : I have never worked with local memory, and maybe I should start.
P.P.S. : My local memory is 16KB, if needed.

Is it the largest value that fits within the ‘local memory maximum size’?

Right.

Is NxNxN a multiple of 10x10x10?

Not necessarily. Notice that two neighbouring work-items read almost the same NxNxN block of pixels (the only difference is N pixels on one side).

If I copy the neighborhood values of a work-item from local memory to global memory to compute (sort, in my case), and each work-item does the same, will there be conflicts?

You have to be careful to make sure that no two work-items attempt to write to the same location in local memory. You can use the built-in function async_work_group_copy() to move data from global to local memory easily and quite efficiently.
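A hedged OpenCL C sketch of that pattern (the kernel name `median_tiled` and the names `tile` and `tileSize` are hypothetical; a real kernel would still contain the per-work-item neighbourhood gathering and quickselect logic):

```c
// Sketch: the whole work-group cooperatively copies one tile of the input
// into local memory, waits for the copy to finish, then each work-item
// reads its data from the fast local tile instead of global memory.
__kernel void median_tiled(__global const int *inputBuffer,
                           __global int *outputBuffer,
                           // Size set on the host side with
                           // clSetKernelArg(kernel, 2, tileSize * sizeof(int), NULL)
                           __local int *tile,
                           const int tileSize)
{
    size_t groupStart = (size_t)get_group_id(0) * tileSize;

    // Every work-item in the group must execute the same copy call.
    event_t evt = async_work_group_copy(tile, inputBuffer + groupStart,
                                        tileSize, 0);
    wait_group_events(1, &evt);

    // ... gather this work-item's neighbourhood from 'tile' and run
    //     quickselect on it here ...
    outputBuffer[groupStart + get_local_id(0)] = tile[get_local_id(0)];
}
```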