SIMD execution

I’m trying to grasp how OpenCL programming for a GPU works. Are GPUs strictly SIMD? Meaning, must each thread be executing the exact same instruction at any given time, or does each thread just need to be executing the same code (like multiple stack frames in CPU threads)? I imagine it’s somewhere in between (SIMD only within a workgroup, for example).

Here’s an example to help illustrate my confusion:

Let’s say I want to write a kernel that simulates coin-flipping. Each thread represents a person; a thread finishes when its person has seen 10 heads, and it saves the total number of flips in some variable. Assuming that the threads are all seeded differently, not every thread will finish at the same time. Is it still possible to run this on a GPU?

In the real program, would it be sufficient to call barrier() after this part of the code to make sure everything is where it should be?

Thanks for the help and I apologize for any confusion in my explanation. I realize that I probably just don’t understand this very well and any comments would be much appreciated.

It depends a bit on the hardware, but AFAICT most GPUs are essentially implemented as very wide SIMD engines, as you suggest. The divergence tracking is per-workgroup I think, although perhaps some of the earlier hardware models do it per-compute-unit. The distinction isn’t very important: it doesn’t affect algorithm choice so much as represent a hardware optimisation.
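For illustration, here’s roughly what that means for a divergent branch (my own sketch, not tied to any particular vendor’s hardware):

 __kernel void divergent(__global const float *a, __global float *out)
 {
  size_t gid = get_global_id(0);

  /* On a SIMD implementation the wavefront walks BOTH paths: the
     'then' side runs with the odd lanes masked off, then the 'else'
     side runs with the even lanes masked off, and the lanes
     reconverge after the if.  No lane makes independent progress. */
  if (gid % 2 == 0)
   out[gid] = a[gid] * 2.0f;
  else
   out[gid] = a[gid] + 1.0f;
 }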

As to your coin-flipping example: yes, it will work, although it may not be the best solution. The SIMD aspect of the implementation is hidden from the programmer, and you have no explicit ability to access it anyway. (The VLIW stuff in earlier AMD designs is separate, and at the thread level.)

Given that each thread in your example is independent, no barriers are needed: they are only needed to synchronise cooperating threads, e.g. if you used 10 threads to simulate 10 flips for 1 person concurrently, and they all had to talk to each other after each round to find out whether a solution had been reached (see the sketch below).
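That cooperative version might look something like this. Purely hypothetical, and I’m assuming a simple xorshift-style flip() helper, since OpenCL has no built-in RNG:

 int flip(uint *state)        /* assumed helper: xorshift32, 1 = heads */
 {
  *state ^= *state << 13;
  *state ^= *state >> 17;
  *state ^= *state << 5;
  return (int)(*state & 1);
 }

 /* Hypothetical cooperative variant: one work-group of 10 work-items
    flips 10 coins per round for a single person.  Every work-item
    tallies the same shared array, so they all leave the loop together
    and the barriers never diverge. */
 __kernel void flips_cooperative(__global const uint *seeds,
                                 __global int *result)
 {
  __local int coin[10];
  int lid = get_local_id(0);
  uint state = seeds[get_global_id(0)];  /* distinct non-zero seeds */
  int heads = 0, flips = 0;

  while (heads < 10) {
   coin[lid] = flip(&state);

   /* all 10 flips must be written before anyone reads them */
   barrier(CLK_LOCAL_MEM_FENCE);

   for (int i = 0; i < 10 && heads < 10; i++) {
    heads += coin[i];
    flips++;
   }

   /* ...and fully read before the next round overwrites them */
   barrier(CLK_LOCAL_MEM_FENCE);
  }

  if (lid == 0)
   result[get_group_id(0)] = flips;
 }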

For your original, independent version, on the other hand, you could just code

 int heads = 0, flips = 0;    /* per-thread (private) counters */
 while (heads < 10) {
  heads += (flip() == HEAD);  /* flip() is an assumed RNG helper */
  flips++;
 }
 result[person] = flips;

And when the hardware executes 64 instances of this (say), all 64 threads will loop for as long as it takes the unluckiest person to throw 10 heads. But as each thread ‘wins’ and exits the loop, either the hardware or the compiler sets a mask to indicate that any results calculated and any loads/stores from that lane should be ignored, until the code converges again and they all do the result[] write at the same time.
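Put together, the whole kernel might look like this (again just a sketch, reusing the assumed flip() helper from the cooperative example above, with the RNG state made explicit):

 int flip(uint *state);   /* the assumed xorshift helper from above */

 __kernel void flips_until_ten_heads(__global const uint *seeds,
                                     __global int *result)
 {
  size_t person = get_global_id(0);
  uint state = seeds[person];   /* distinct non-zero seed per person */
  int heads = 0, flips = 0;

  /* Divergent loop: the hardware masks each lane off as its
     work-item reaches 10 heads, and all lanes reconverge for the
     final write. */
  while (heads < 10) {
   heads += flip(&state);
   flips++;
  }
  result[person] = flips;
 }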

This stuff is pretty well documented in each vendor’s programming guide, and it is also covered widely in magazine articles (AnandTech and Tom’s Hardware usually have good high-level technical overviews of new architectures).

Got it, thank you very much for the thorough explanation. It really helped my understanding a lot and was exactly the answer I was looking for. There is a lot of documentation, but to be honest it’s a bit overwhelming and I was starting to get confused.

Thanks again! :smiley: