This seems like the simplest task ever, and I feel quite stupid getting a slowdown instead of a speedup.
All I need is to go through an array and perform some operation on each element, for example an increment.
Here is the code:
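    kernel void Test(global char *array)
    {
        int i = get_global_id(0);
        // Number of elements each work-item processes
        int myPart = (8192 * 4096 * 4) / get_global_size(0);
        // Each work-item owns the contiguous range [start, finish)
        int start = i * myPart;
        int finish = start + myPart;
        for (int j = start; j < finish; j++)
        {
            array[j] = array[j] + 1;
        }
    }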
The array has 8192*4096*4 elements, so I divide it into equal parts and each work-item processes its own part.
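To make the partitioning concrete, here is a minimal sketch in plain C of the same index arithmetic, plus a CPU-side loop over one part. The names `Range`, `part_of` and `increment_range` are hypothetical and exist only for this illustration; this is not the actual CPU code that was timed.

```c
#include <stddef.h>

/* Hypothetical helper type: the half-open range [start, finish)
   that one work-item processes. */
typedef struct {
    size_t start;
    size_t finish;
} Range;

/* Mirrors the kernel's arithmetic: myPart = total / global_size,
   start = i * myPart, finish = start + myPart. */
Range part_of(size_t i, size_t global_size, size_t total)
{
    size_t my_part = total / global_size;
    Range r = { i * my_part, i * my_part + my_part };
    return r;
}

/* CPU analogue of the kernel body: increment every element in one range. */
void increment_range(char *array, Range r)
{
    for (size_t j = r.start; j < r.finish; j++)
        array[j] = array[j] + 1;
}
```

With 4096 work-items and 8192*4096*4 elements, `part_of(1, 4096, 8192*4096*4)` gives the range starting at 32768 and ending before 65536, so each work-item walks its own contiguous block of 32768 elements.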
If I run it with 4096 work-items (each gets a part 8192*4 elements long), the computation takes about 2500 ms, and that is the best result I get. Analogous code on the CPU, however, takes only 800 ms.
What am I doing wrong? I tried storing the array as an image, and that runs faster, but I still don't understand why the buffer version behaves so strangely.
Thanks.