How to implement a serial calculation in kernel code?

The following code is part of my kernel. The rest of the kernel is independent, parallel work that each work item can execute on its own (no data synchronization needed), but this part is serial: the i-th output needs the updated value from step i-1. So I thought I could have one work item do it while the other work items do nothing at this step. I wrote the following, using work item 0 for the computation:

// tid is the work item's local ID; tB and m both point to local memory.
// I need to derive array m from array tB; one element of m is derived on each iteration of the outer loop.

The values of m are correct when I run the kernel on the CPU, but wrong on the GPU. Is it because the synchronization goes wrong on the GPU? Or do you have suggestions for making it work correctly on the GPU? Thank you so much!

barrier(CLK_LOCAL_MEM_FENCE);
if (tid == 0)
{
    for (i = 0; i < 34; i++)
    {
        m[i] = tB[i];

        for (j = i + 1; j < 34; j++)
        {
            tB[j] = mod_subtract(tB[j], tB[i], baseB[j]);
            // Bm is a packed triangular table indexed by the pair (i, j), i < j
            tB[j] = mod_mul(tB[j], Bm[33*i + j - 1 - i*(i+1)/2], baseB[j]);
        }
    }
}
barrier(CLK_LOCAL_MEM_FENCE);
// I then read m back to the host and check the values.
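
To narrow down where the GPU result diverges, it can help to have a plain C reference of the same loop on the host and compare `m` element by element. The sketch below is only an assumption-laden illustration: the thread does not show `mod_subtract` or `mod_mul`, so the versions here are hypothetical stand-ins that assume 32-bit unsigned values already reduced modulo the base, and `bm_index` and `serial_reference` are names introduced just for this example.

```c
#include <stdint.h>

#define N 34  /* array length used in the kernel above */

/* Hypothetical stand-in: (a - b) mod m, assuming a < m and b < m. */
static uint32_t mod_subtract(uint32_t a, uint32_t b, uint32_t m)
{
    return a >= b ? a - b : a + (m - b);
}

/* Hypothetical stand-in: (a * b) mod m, using a 64-bit intermediate
   so the product cannot overflow. */
static uint32_t mod_mul(uint32_t a, uint32_t b, uint32_t m)
{
    return (uint32_t)(((uint64_t)a * b) % m);
}

/* Offset of the pair (i, j), with i < j, in the packed triangular
   table Bm. Same formula as in the kernel, with 33 written as N - 1. */
static int bm_index(int i, int j)
{
    return (N - 1) * i + j - 1 - i * (i + 1) / 2;
}

/* Serial reference: fills m and overwrites tB, step by step,
   exactly like the single-work-item section of the kernel. */
void serial_reference(uint32_t *tB, uint32_t *m,
                      const uint32_t *Bm, const uint32_t *baseB)
{
    for (int i = 0; i < N; i++) {
        m[i] = tB[i];
        for (int j = i + 1; j < N; j++) {
            tB[j] = mod_subtract(tB[j], tB[i], baseB[j]);
            tB[j] = mod_mul(tB[j], Bm[bm_index(i, j)], baseB[j]);
        }
    }
}
```

With this indexing, `Bm` needs N*(N-1)/2 = 561 entries, and the last pair (32, 33) maps to offset 560. Finding the first index at which the GPU's `m` differs from the reference shows whether the serial section itself or an earlier step produced the wrong value.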

Then why not just use the CPU to do the work?

Hi, thanks for your reply. I have two such pieces of code in my kernel, so if I did them on the CPU I would need to break the kernel into 3 kernels and pass the values back and forth between the CPU and GPU five times. That doesn't sound efficient. Do you know whether it is valid to have one thread do this work, or is there another way to write this part of the code? What is strange is that this code outputs the correct result when running on the CPU but the wrong result on the GPU, and I don't understand why.

When you do this serial calculation, does every thread have to wait for the serial result before proceeding?

Of course you can use one thread for the calculation. If the result is different, it just means your GPU code is not correct.

Yes, the following steps in each thread need to wait for the serial result before proceeding. Is putting barrier(CLK_GLOBAL_MEM_FENCE); before and after this piece of code enough to synchronize all the other threads with this thread?

That only gives you synchronization within a work-group (block).

If you need to synchronize across multiple work-groups, a barrier is not enough. For synchronization among work-groups I return control to the CPU, i.e. wait for the calculation kernel to finish for all work-groups.

I think I have set all the work items to be in the same work-group...

If you're only using one work-group, you will get only a tiny fraction (1/4 to 1/48) of the total GPU performance.

If you need to do this sort of synchronization across all work-items you have to wait for the kernel to finish. If the cost of doing the data transfer to the CPU is too high to do it that way, then you have two options:

  1. wait for the first kernel to finish and then run a second kernel which just does the serial part using a global size of 1
    or
  2. figure out another algorithm.

#2 is almost certainly faster, but may be difficult or impossible.

You also have to watch out that your workload is not too big and that a thread doesn't "hang" for too long. In my experience, if a kernel hangs for too long, two things happen:

  1. The OS stops drawing.
  2. It may blue-screen.

I was running a for loop in which each subsequent cell ran longer than the last, and it glitched out before the execution was done.
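
One common way to keep any single launch short is to split the work into several smaller launches instead of one long one. The following is a minimal sketch in C of the host-side bookkeeping only, under the assumption that the work can be cut into independent index ranges; `launch_range` and `split_launches` are hypothetical names introduced for illustration, not part of any OpenCL API.

```c
#include <stddef.h>

/* Hypothetical descriptor for one launch covering the items
   [offset, offset + count). */
typedef struct {
    size_t offset;
    size_t count;
} launch_range;

/* Cover `total` items with ranges of at most `chunk` items each.
   Writes at most `max_out` ranges into `out` and returns how many
   were written. Returns 0 if `chunk` is 0. */
size_t split_launches(size_t total, size_t chunk,
                      launch_range *out, size_t max_out)
{
    size_t n = 0;
    size_t offset = 0;

    if (chunk == 0)
        return 0;

    while (offset < total && n < max_out) {
        size_t remaining = total - offset;
        out[n].offset = offset;
        out[n].count = remaining < chunk ? remaining : chunk;
        offset += out[n].count;
        n++;
    }
    return n;
}
```

Each range could then be passed to `clEnqueueNDRangeKernel` as the `global_work_offset` and `global_work_size` of one launch (offsets require OpenCL 1.1 or later), so that no single launch runs long enough to freeze the display. A suitable chunk size depends on the device and has to be found by experiment.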