Writing a variable to the output array is very slow.

Hello,

I have a problem with the performance of writing the results of my kernel function to the output array. The whole kernel execution time increases from 3 ms to about 80 ms because of a single write operation.
Surprisingly, the performance does not decrease if I write a constant value into the output array.

The kernel code is the following:

__kernel void linearMatching(
    __global float* list,
    __global float* list2,
    __global float* resultList,
    __global float2* tempList,
    const unsigned int count,
    const unsigned int count2,
    const unsigned int nDim,
    const unsigned int nBufferSize
    )
{
    const int nWorkgrps = get_num_groups(0);
    const int global_id = get_group_id(0);
    const int local_id = get_local_id(0);
    const int LOCAL_WORKSIZE = min(get_local_size(0), ARRAY_SIZE);

    //work groups needed for list2
    int nWGs_L2;
    if(count2 % LOCAL_WORKSIZE == 0)
        nWGs_L2 = (count2 / LOCAL_WORKSIZE);
    else
        nWGs_L2 = (count2 / LOCAL_WORKSIZE) + 1;

    //current list1 index
    int L1_index = global_id / nWGs_L2;
    //current list2 index
    int L2_index = (global_id % nWGs_L2) * LOCAL_WORKSIZE + local_id;
    //l2 items left to check
    int leftL2 = min(LOCAL_WORKSIZE, count2 - (global_id % nWGs_L2) * LOCAL_WORKSIZE);

    if(L1_index >= count || L2_index >= count2)
        return;
    if(local_id >= ARRAY_SIZE)
        return;

    //builds sums for LOCAL_WORKSIZE items of l2 with one item of l1
    float diff;
    __local float2 sums[ARRAY_SIZE];

    sums[local_id].x = 0;
    sums[local_id].y = (float)L2_index;
    for(int k = 0; k < nDim; k++) {
        diff = (list[L1_index*nDim + k] - list2[L2_index*nDim + k]);
        sums[local_id].x += diff * diff;
    }

    //get minimum
    const int nSearchThreads = (8 < LOCAL_WORKSIZE) ? 8 : LOCAL_WORKSIZE;
    float cur_min_error = sums[0].x;
    int best_index = sums[0].y;
    if(local_id % (nDim/nSearchThreads) == 0){
        const int nItemstosearch = nDim/nSearchThreads;
        const int offset = (local_id / nSearchThreads) * nItemstosearch;
        const int end = min(nItemstosearch + offset, count2 % LOCAL_WORKSIZE);
        for(int k = offset; k < end; k++){
            if(sums[k].x < cur_min_error){
                cur_min_error = sums[k].x;
                best_index = sums[k].y;
            }
        }
        sums[(local_id / nSearchThreads)].x = cur_min_error;
        sums[(local_id / nSearchThreads)].y = best_index;
    }

    if(local_id == 0){
        for(int k = 0; k < nSearchThreads; k++)
            if(sums[k].x < cur_min_error){
                cur_min_error = sums[k].x;
                best_index = sums[k].y;
            }

        int index = (L1_index*nWGs_L2 + L2_index/nWGs_L2) * 2;
        resultList[index] = 11.3f;        //<---- with this line, no performance decrease
        resultList[index+1] = best_index; //<---- with this line it is like 30x slower
    }
}

The last lines with “resultList” are the lines I’m talking about. With the first one, it is fast. With the second one, it is slow.

The work-group size is 256, with a total of 256*2048 work items.

Does anybody have an idea why it is so slow, and how to fix it?

Another question on the side: is it in any way faster if I split the work items into two dimensions instead of one?

kind regards,
veio

When you write a constant value, the compiler is going to optimize away the best_index variable and some of the computations that go into producing it.

In particular, any reasonable compiler will remove the code between the two comments:


if(local_id == 0){
    // Start of code that will be removed
    for(int k = 0; k < nSearchThreads; k++)
        if(sums[k].x < cur_min_error){
            cur_min_error = sums[k].x;
            best_index = sums[k].y;
        }
    // End of code that will be removed
    int index = (L1_index*nWGs_L2 + L2_index/nWGs_L2) * 2;
    resultList[index] = 11.3f;        //<---- with this line, no performance decrease
    resultList[index+1] = best_index; //<---- with this line it is like 30x slower
}

Also notice that the code between those two comments is only executed by one work-item in each work-group. That in turn slows down the execution of the whole work-group, since the work-group can’t finish until that last work-item has also finished. This means that the hardware will be heavily under-utilized.

I hope this explanation makes sense to you :)

Thanks for the answer.
It seems that you are right:
if I delete the whole function body and just assign the best_index variable to the output, it is fast.

Also notice that the code between those two comments is only executed by one work-item in each work-group. That in turn slows down the execution of the whole work-group, since the work-group can’t finish until that last work-item has also finished. This means that the hardware will be heavily under-utilized.

Well, that loop is only repeated 8 times… that shouldn’t take very long.
The one before it runs 16 times, and the first one 128 times.
The whole function is only about 500 operations per work-item. That shouldn’t take 80 ms (copying the data from host to device and back takes about 3 ms).

And since I don’t use a barrier or any other synchronization, all 256*2048 threads should be independent, shouldn’t they? How many threads really run in parallel? Only one work-group? All of them?

My CPU needs about 300 ms to do this sequentially. And this part:

        diff = (list[L1_index*nDim + k] - list2[L2_index*nDim + k]);
        sums[local_id].x += diff * diff;

is repeated 67 million times.
So I don’t understand why it takes so long to execute so few lines.

My hardware is an ATI Radeon HD 5850 and an AMD Phenom II X4 965 (3.4 GHz).

kind regards,
veio

Another thing: is a blocking clEnqueueWriteBuffer call really complete by the time the function returns, or does the compiler “optimize away” things there as well?

How many threads really run in parallel? Only one work-group? All of them?

That’s going to depend completely on the device where you are running the app. I suggest reading the vendor’s developer guide.

Another thing: is a blocking clEnqueueWriteBuffer call really complete by the time the function returns, or does the compiler “optimize away” things there as well?

clEnqueueWriteBuffer with blocking enabled really is complete before the call returns. This is mandated by the spec.

Thank you.

I just found out about the image2d_t type and used it to store my data, and now it’s 38 times faster than the CPU.
That’s a result I can live with, but I still don’t get why the other way is so slow :(
But image2d_t for large images is not supported by too many devices, if I understood the literature correctly.