searching through an array

amos · July 28, 2011, 1:28pm

Well, this seems like the simplest task ever, and I feel quite stupid that I get not a speedup but a definite slowdown.

All I need is to go through an array and perform some operation on each element, an increment for example.

Here is the code:

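kernel void Test(global char *array)
{
    int i = get_global_id(0);
    /* number of elements each work-item handles */
    int myPart = (8192 * 4096 * 4) / get_global_size(0);
    int start = i * myPart;
    int finish = myPart * i + myPart;
    /* each work-item walks its own contiguous slice */
    for (int j = start; j < finish; j++)
    {
        array[j] = array[j] + 1;
    }
}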

Array size is 8192 * 4096 * 4, so I divide it into parts and each working thread gets its own.
If I run it with 4096 working threads (each gets an 8192 * 4-element part of the array), the computation takes about 2500 ms (and that is the best result). Running analogous code on the CPU, however, takes only 800 ms.
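
To make the partitioning concrete, here is a minimal plain-C sketch that models it on the host. It is only an illustration, not the code that was timed: `chunk_length`, `chunked_work_item` and `run_chunked` are hypothetical names, the loop over `gid` stands in for the device running the work-items, and it assumes the array length divides evenly by the number of work-items.

```c
#include <stddef.h>

/* Elements handled by one work-item (assumes total % global_size == 0). */
size_t chunk_length(size_t total, size_t global_size)
{
    return total / global_size;
}

/* Hypothetical model of one work-item of the chunked kernel:
   it increments its own contiguous slice of the array. */
void chunked_work_item(char *array, size_t total,
                       size_t global_size, size_t gid)
{
    size_t my_part = chunk_length(total, global_size);
    size_t start = gid * my_part;
    size_t finish = start + my_part;
    for (size_t j = start; j < finish; j++)
        array[j] = array[j] + 1;
}

/* Runs every work-item one after another, standing in for the device. */
void run_chunked(char *array, size_t total, size_t global_size)
{
    for (size_t gid = 0; gid < global_size; gid++)
        chunked_work_item(array, total, global_size, gid);
}
```

With the sizes from this post, `chunk_length((size_t)8192 * 4096 * 4, 4096)` gives 32768, i.e. the 8192 * 4 elements per thread mentioned above.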

What am I doing wrong? I tried storing my array as an image and it works faster, but I still don't understand why it behaves so strangely this way.

Thanks.

david.garcia · July 28, 2011, 3:24pm

You are only running 4096 work-items each time. That’s very few. Increase the number of work-items to expose greater parallelism. For example, you can execute all 8192 * 4096 * 4 work-items at the same time, and then you don’t even need a for loop in the code. The new code would look like this:


kernel void Test(global char *array)
{
    size_t i = get_global_id(0);
    array[i] = array[i] + 1;
}

In addition to that, there’s almost no computation in the kernel. With so little computation, the cost of transferring data from the host to the device is going to be significant compared to the cost of the actual computation.
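
As a rough back-of-the-envelope sketch of that transfer cost, the helper below estimates the time to copy a buffer to the device and back. The function name `round_trip_ms` is hypothetical, and the bandwidth is an input you would have to measure for your own system; nothing in this thread states what it actually is.

```c
#include <stddef.h>

/* Estimated time in milliseconds to copy `bytes` to the device and back,
   at an assumed sustained bandwidth given in bytes per second. */
double round_trip_ms(size_t bytes, double bytes_per_second)
{
    return 2.0 * (double)bytes / bytes_per_second * 1000.0;
}
```

For the 8192 * 4096 * 4-byte array in this thread and a purely illustrative 4 GB/s, `round_trip_ms((size_t)8192 * 4096 * 4, 4e9)` comes to roughly 67 ms, which the kernel's single increment per element has to amortize.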

amos · July 29, 2011, 12:59am

Thanks for the advice. Using the maximum number of threads makes it faster than the CPU (though not by much). With three more increments, the CPU completely loses the race.