About writing to Global Memory

Here are two kernel functions:


__kernel void f1(__global int *tag, ...)
{
    uint tid = get_global_id(0);
    if (tag[tid] == 0)
    {
        ...;
        int t = 0;
        if (condition)
            t = 1;
        tag[tid] = t;
    }
}

This gives result1.
__kernel void f2(__global int *tag, ...)
{
    uint tid = get_global_id(0);
    if (tag[tid] == 0)
    {
        ...;
        int t = 0;
        if (condition)
            t = 1;
        tag[tid] = 1;
    }
}

This gives result2.

The host code looks like this:


void RunCL()
{
  for(int n = 0; n < 30; ++n)
     RunKernel();
}

I got two different results: result1 is much slower than result2, roughly 40 to 50 times slower. My question is: is the compiler optimizing something away somewhere? I also tried moving the loop (for(int n = 0; n < 30; ++n)) into the kernel and using a local array as tag (local int tag[256], where 256 is the group size), so that nothing is written to the global memory __global int *tag until the last iteration, but that is also very slow.
Can someone tell me what is happening?
Many thanks.

PS: In the actual code, the global memory array (int *gtag) and the local memory array (ltag) have different names.

It sounds like this is due to the time it takes to do a JIT, just-in-time, compile of your kernel. Does the 3rd run execute as fast as the 2nd?
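
One way to check this is to time every launch separately instead of the whole loop. Below is a minimal sketch in plain C, not the poster's actual host code: `time_launches`, `launch_fn` and `stub_launch` are hypothetical names, and the stub only stands in for RunKernel(). In real host code the launch would also have to wait for the device (for example with clFinish) before the timer stops, otherwise only the enqueue is measured.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical launch signature: receives the loop index n. */
typedef void (*launch_fn)(int n);

/* Hypothetical stand-in for RunKernel(). It only records that it was called.
   A real version would enqueue the kernel and block until it has finished. */
int stub_calls = 0;
int stub_last_index = -1;
void stub_launch(int n)
{
    ++stub_calls;
    stub_last_index = n;
}

/* Times each of `count` launches on its own and stores the elapsed
   wall-clock seconds of launch n in seconds[n]. */
void time_launches(launch_fn launch, int count, double *seconds)
{
    assert(launch != NULL && seconds != NULL && count >= 0);
    for (int n = 0; n < count; ++n) {
        struct timespec t0, t1;
        timespec_get(&t0, TIME_UTC);
        launch(n);
        timespec_get(&t1, TIME_UTC);
        seconds[n] = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }
}

/* Prints one line per launch so slow and fast launches can be compared. */
void print_launch_times(const double *seconds, int count)
{
    for (int n = 0; n < count; ++n)
        printf("launch %d: %.6f s\n", n, seconds[n]);
}
```

If only the first launch is slow, a one-time cost such as compilation is the likely cause. If the time changes from launch to launch, the work done inside the kernel is changing.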

I mean... the host code invokes the kernel code (f1 or f2). The RunKernel() function calls
clEnqueueNDRangeKernel(), and on the last iteration it reads the tag data back from the device. The 3rd run is faster if

 tag[tid] = t;

is replaced by

 tag[tid] = 1;

and I guess the rest is the time the host takes to send commands, plus the function call overhead.
PS: In

    if (tag[tid] == 0)
    {
        ...;
        int t = 0;
        if (condition)
            t = 1;
        tag[tid] = t;
    }

of the kernel code, each thread (work-item) executes another loop. Maybe I should unroll these loops or change the data structure for better task parallelism?

Oh, I see the difference now. I’m not very sure now that I see what you’re asking.

Of these kernels, one writes a literal to your global memory space, the other uses an initialized local variable to write to the global memory space. If RunKernel() calls k1, does it do so for all 30 calls in your loop? (same question re: k2) Does every call to k1 take an equal amount of time? How about k2?

To Alex:
Yes, on each iteration RunKernel() calls k1 or k2 once. For k2 (where tag[tid] = 1) every call takes the same amount of time, but for k1 it does not: the time depends on condition, and condition in turn depends on the previous iteration, which can change the values in the tag[] array.
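
To make that dependence concrete, here is a small CPU-side sketch in plain C that replays both kernels over 30 launches and counts how many items enter the guarded body each time. It is only an illustration under stated assumptions: `condition` here is a made-up function (the real one is elided in the post), and `launch_f1`, `launch_f2` and `run_passes` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define N_ITEMS 1024

/* Made-up condition, NOT the one from the original kernels:
   true for one sixteenth of the items on any given pass. */
int condition(size_t tid, int pass)
{
    return (tid + (size_t)pass) % 16 == 0;
}

/* f1-style launch: writes t, so an item stays at 0 until condition holds.
   Returns how many items executed the guarded body. */
size_t launch_f1(int *tag, size_t n, int pass)
{
    size_t busy = 0;
    for (size_t tid = 0; tid < n; ++tid) {
        if (tag[tid] == 0) {
            ++busy;
            int t = 0;
            if (condition(tid, pass))
                t = 1;
            tag[tid] = t;
        }
    }
    return busy;
}

/* f2-style launch: always writes 1, so every item is finished after one pass.
   The value of t is never read, so a compiler is free to drop the
   condition test entirely. */
size_t launch_f2(int *tag, size_t n, int pass)
{
    size_t busy = 0;
    (void)pass; /* unused: the result no longer depends on condition */
    for (size_t tid = 0; tid < n; ++tid) {
        if (tag[tid] == 0) {
            ++busy;
            tag[tid] = 1;
        }
    }
    return busy;
}

/* Runs `passes` launches in a row, like the host loop, and returns the
   total number of guarded-body executions. */
size_t run_passes(size_t (*launch)(int *, size_t, int),
                  int *tag, size_t n, int passes)
{
    size_t total = 0;
    for (int pass = 0; pass < passes; ++pass)
        total += launch(tag, n, pass);
    return total;
}
```

With this made-up condition, the f2-style version does all of its work in the first launch and the remaining 29 launches skip the body, while the f1-style version keeps revisiting every item whose tag is still 0. That would explain why k2 takes a steady time and k1 does not, on top of any dead-code removal the compiler may apply to the unused t in f2.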