I suppose the traditional answer is that the first style is for GPUs (as in C for CUDA) and the second one is for SSE, right?
GPUs might very well be vector-based as well, given that they often deal with colours and vertices that have multiple components (RGBA / XYZW). I believe IBM Cell processors have multiple vector-based cores, and I know of other GPUs/processors which are as well, although not the recent NVIDIA ones from what I gather. I’ve even seen designs with multiple clusters of multiple vector cores on a single chip.
My question is: how do we know which style best fits a given piece of hardware?
Difficult, because every architecture is different (as above), so the balance of cores vs vector width can be tricky to judge.
I tend to split things depending on the type of problem. For example, I’m working on some particle system code. Each work item is a single particle, but the particle’s position, velocity, and colour are all vector types within the code. That makes a lot of the code vector-based, but it’s a natural fit for the problem and actually makes the code easier to understand.
Granted, you can write code that processes multiple scalar values using vectors, but I’d only do that if I were sure I had enough work to keep a good number of work-items busy. I also wouldn’t go above 128-bit vectors (e.g. float4), as I don’t know of any architecture with wider vector words.
But why couldn’t a compiler optimize for SSE with the first style? If this is already the case, why does the second style exist at all?
From what I’ve seen, one instance of a kernel = one thread / work-item, so a vector kernel wouldn’t get run across multiple scalar cores, nor would a scalar kernel be run in parallel within a vector core. That’s not to say people couldn’t do this, just that they don’t seem to at the moment.
Compiling vector code to a scalar processor is much easier than compiling scalar code to a vector processor and getting full parallelism out of it. After all, we’ve had SSE and friends for a long time, but not much success exploiting them unless you hand-code routines. Writing the code in vector form, if the problem suits it, makes the compiler’s job easier when targeting a vector processor and won’t really harm a scalar processor: all a scalar compiler will do is repeat the code / loop n times.
Part of the problem is there’s a lot of information out there based on CUDA, but not much from anyone else, so it’s difficult to say what’s good in general. We just don’t know what maps well onto ATI GPUs, Larrabee, Cell, etc.
Well… I don’t.
That got a bit waffly, but I hope it answers some things.