You must be wary of stale information: everyone's compilers are still a bit raw and being developed constantly, so off-hand comments about a compiler's capabilities may be completely wrong 6 months later.
Also, the definition of 'auto vectorise' is more involved with OpenCL. e.g. I believe (though I haven't personally used it) Intel's compiler can convert multiple work-items into individual vector slices within the same CPU thread (which is more how the GPU compilers work) - in this case simpler code is probably better.
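To make that concrete, here's a minimal sketch (the saxpy kernel name and arguments are just mine for illustration) of the kind of plain scalar kernel that style of compiler handles well: one element per work-item, no explicit vector types, so the compiler is free to pack adjacent work-items into SSE/AVX lanes itself.

    /* Illustrative sketch only: one element per work-item, scalar types.
     * An implicitly-vectorising CPU compiler can fuse several adjacent
     * work-items into the vector lanes of a single thread. */
    __kernel void saxpy(__global const float *x,
                        __global const float *y,
                        __global float *out,
                        const float a)
    {
        size_t i = get_global_id(0);
        out[i] = a * x[i] + y[i];
    }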
I haven't targeted CPUs because the performance just doesn't warrant it beyond a bit of debugging, but for GPUs, often float4 is the best choice, or sometimes even just scalar code. Remember they have been optimised heavily for graphics-related tasks, and those are all float4 - so 128 bits is basically the 'native word size'. Using float8, for example, will add register pressure, which will almost always reduce the potential for parallelism at best, and lead to performance-killing register spills at worst. For complex algorithms, plain float often ends up better than float4 for the same reason: float4 code is effectively just a 4-way loop unroll embedded in the source, with the same potential side-effects.
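As a hypothetical illustration, the same saxpy operation in explicit float4 form looks like this; it maps one work-item onto a 128-bit register, and a float8 version would be identical apart from the type but would need twice the registers per work-item.

    /* Illustrative sketch only: explicit float4 version of the same op.
     * float4 matches the 128-bit 'native word size' of graphics-tuned GPUs;
     * each statement is effectively a 4-way unroll of the scalar kernel,
     * with the matching increase in register pressure. */
    __kernel void saxpy4(__global const float4 *x,
                         __global const float4 *y,
                         __global float4 *out,
                         const float a)
    {
        size_t i = get_global_id(0);
        out[i] = a * x[i] + y[i];
    }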
There are other techniques, such as using images or local memory, which don't directly translate to CPU code either (and vice versa for GPUs with non-parallelisable or branchy code), so it's very hard to create generic code that runs really well on every architecture. (But I'm sure that's what you're getting paid the big bucks for.)
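For example, here's a sketch (the kernel name and the power-of-two work-group assumption are mine) of a work-group sum staged through __local memory - a classic GPU idiom that buys nothing on a CPU, where __local is just ordinary cached RAM and the barriers are pure overhead.

    /* Illustrative sketch only: work-group reduction via __local memory.
     * Assumes the local work size is a power of two.
     * On a GPU, scratch lives in fast on-chip memory shared by the
     * work-group; on a CPU the extra copy and barriers gain nothing. */
    __kernel void group_sum(__global const float *in,
                            __global float *partial,
                            __local float *scratch)
    {
        size_t lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Pairwise halving reduction within the work-group. */
        for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }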
My own OpenCL/CPU experiments have been pretty disappointing: the best it seems to give you is an arguably "simpler" front-end to multi-core CPUs, with only a few percentage points of boost over natively multi-threaded code (or, I presume, OpenMP-type things), and on x86 I've never found the vector unit all that impressive - if only because the scalar unit is so fast and the competing compilers are so mature.
However … looking into my crystal ball here …
Given the converging technological direction - looking at AMD's GCN instruction set, one sees a general-purpose CPU with a very wide vector co-processor, not a parallel machine with a scalar helper - and given the historical progress of compiler technology, where simpler code helps the compiler, I would suggest that worrying about vectors is probably just not that useful.
For certain algorithms they will be a natural fit - and save some not-insignificant typing - but trying to force all algorithms into vectorised forms for potential performance gains will be a dying art, mostly because it's just a pain in the rear for programmers, and it makes the compiler's job harder too. e.g. I usually found vectorised code slower on NVIDIA, even when it was a natural fit for the algorithm (although we're talking pretty small differences, under 10%).
Obviously this doesn't help if you've got to deliver stuff that runs on current computers, but don't expect OpenCL/CPU to deliver miracles on those either.
Oh, and ATI doesn't exist anymore; they're AMD now.