AMD vs Intel: Auto-vectorization

I’m looking at writing code that runs cross platform, at least in regards to desktop-class machines, so I need to support Intel, AMD, Nvidia, and ATI.

I understand from this forum post that the AMD OpenCL compiler does not perform auto-vectorisation into wider types, and that one should instead explicitly use wide vector types like float8.

Question 1: Is an Intel CPU the only OpenCL device on which one can assume auto-vectorisation will take place? In other words, is this true for none of AMD, ATI or NVidia?

Question 2: Further, if vectorisation is not handled automatically, is it safe to assume I should make use of wide vector types (e.g. float8) in my platform-agnostic code? Or would there potentially be a shortfall in performance on Intel platforms in that case (and if so, major or minor)?

EDIT: I'm a bit confused; I found here, in Q.15, that AMD's compiler (which ATI uses) does auto-vectorise for CPUs. However, it doesn't say whether that means Intel, AMD, or both; I can only assume AMD, which contradicts what I've seen elsewhere.

(Sorry, I’m going to rewrite my question above since I cannot edit it…)

I’m looking at writing code that runs cross platform, at least in regards to desktop-class machines, so I’d need to support Intel, AMD, Nvidia, and ATI.

I understand from this forum post that the AMD OpenCL compiler does not perform auto-vectorisation into wider types, and that one should instead explicitly use wide vector types like float8. EDIT: I'm a bit confused; I found here, in Q.15, that AMD's compiler (which ATI uses) does auto-vectorise for CPUs. However, it doesn't say whether that means Intel, AMD, or both; I can only assume AMD, which contradicts what I've seen elsewhere.
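To make the premise concrete, here is a minimal sketch of the two styles in question: a plain scalar kernel that leaves vectorisation to the compiler, and an explicitly vectorised float8 version. The kernel names and arguments are purely illustrative, not taken from any particular SDK sample.

```c
// Scalar style: one element per work-item; relies on the compiler
// (e.g. Intel's implicit vectorisation module) to pack work-items into SIMD lanes.
__kernel void scale_scalar(__global const float *in,
                           __global float *out,
                           const float k)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * k;
}

// Explicitly vectorised style: eight elements per work-item via float8,
// which is what the forum post suggests for AMD's CPU compiler.
__kernel void scale_float8(__global const float8 *in,
                           __global float8 *out,
                           const float k)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * k;
}
```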

Question 1: Currently I’m aware of 3 compilers: Intel’s, AMD’s, and NVidia’s. I understand ATI uses AMD’s, as included in their Stream SDK. Is all of this correct?

Question 2: Is an Intel CPU the only OpenCL device on which one can assume auto-vectorisation will take place? In other words, is this true for none of AMD, ATI or NVidia?

Question 3: Further, if vectorisation is not handled automatically, is it safe to assume I should make use of wide vector types (e.g. float8) in my platform-agnostic code? Or would there potentially be a shortfall in performance on Intel platforms in that case (and if so, major or minor)?

Question 4: Given my circumstances, is there one single compiler that would work for all? – I assume not, since I understand that OpenCL kernels are compiled just before runtime, much like shader programs.
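Regarding Question 3: rather than hard-coding one vector width, the host can ask each device what it prefers at run time using the standard clGetDeviceInfo queries. This is only a hedged sketch: error handling is omitted and the first platform/device found is used for brevity.

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_uint preferred = 0, native = 0;

    /* Grab the first platform and device purely for illustration. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    /* A preferred width of 1 generally means "write scalar code and let the
       compiler deal with it"; larger values suggest explicit vector types. */
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(preferred), &preferred, NULL);
    clGetDeviceInfo(device, CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT,
                    sizeof(native), &native, NULL);

    printf("preferred float vector width: %u, native: %u\n", preferred, native);
    return 0;
}
```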

At 36:05 into this video, Mr Catanzaro discusses AMD CPU, AMD GPU, and Nvidia GPU vectorisation under OpenCL.

Basically,

  • Intel OpenCL compiler for Intel CPUs does vectorise.
  • Nvidia OpenCL compiler for Nvidia GPUs does not vectorise.
  • AMD OpenCL compiler for AMD CPUs does not vectorise.
  • AMD OpenCL compiler for ATI GPUs does not vectorise.

You must be wary of stale information: everyone's compilers are still a bit raw and under constant development, so off-hand comments about a compiler's capability may be completely wrong six months later.

Also, the definition of 'auto-vectorise' is more involved with OpenCL. For example, I believe (but have not personally verified) that Intel's compiler can convert multiple work-items into individual vector lanes within the same CPU thread (which is closer to how the GPU compilers work); in that case simpler code is probably better.
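On that note, standard OpenCL C also lets a kernel declare how it has already been vectorised, via the vec_type_hint attribute, so an auto-vectorising compiler knows not to pack further work-items on top of it. A minimal sketch with a made-up kernel name:

```c
// vec_type_hint is part of standard OpenCL C; it tells the compiler this
// kernel already does its work in float4 units, which is relevant to
// implicit vectorisers like the one described above.
__kernel __attribute__((vec_type_hint(float4)))
void saxpy4(__global const float4 *x,
            __global float4 *y,
            const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
```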

I haven’t targeted CPUs because the performance just doesn’t warrant it beyond a bit of debugging, but for GPUs float4 is often the best choice, or sometimes even plain scalar code. Remember they have been optimised heavily for graphics-related tasks, and those are all float4 - so 128 bits is basically the ‘native word size’. Using float8, for example, adds register pressure, which will almost always reduce the potential for parallelism at best and lead to performance-killing register spills at worst. For complex algorithms, plain float often ends up better than float4 for the same reason: float4 is effectively just a 4-way loop unroll embedded in the code, with the same potential side effects.
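A sketch of that 'embedded loop unroll' equivalence, with hypothetical kernels: both versions do the same work per work-item, one spelled out as a 4-iteration loop and one as a float4 operation, and both carry the same register-pressure cost relative to the plain scalar kernel.

```c
// Scalar kernel manually unrolled 4x: each work-item handles four elements.
__kernel void scale4_unrolled(__global const float *in,
                              __global float *out,
                              const float k)
{
    size_t base = get_global_id(0) * 4;
    for (int j = 0; j < 4; ++j)
        out[base + j] = in[base + j] * k;
}

// The same thing written with float4: effectively the same 4-way unroll,
// just expressed as a vector operation.
__kernel void scale4_vector(__global const float4 *in,
                            __global float4 *out,
                            const float k)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * k;
}
```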

There are other techniques, such as using images or local memory, which don’t directly translate to CPU code either (and vice versa for GPUs with non-parallelisable or branchy code), so it’s very hard to create generic code that runs really well on every architecture. (But I’m sure that’s what you’re getting paid the big bucks for. :wink:)
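One concrete example of a construct that doesn't translate: staging data in __local memory behind a barrier, a typical GPU pattern that buys little on a CPU, where 'local memory' is just ordinary cached RAM. A hedged sketch - the kernel name and tile size are made up, and the work-group size is assumed to equal the tile size:

```c
#define TILE 64  // assumed to match the work-group size

__kernel void average_tiled(__global const float *in,
                            __global float *out)
{
    __local float tile[TILE];
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tile[lid] = in[gid];              // cooperative load into local memory
    barrier(CLK_LOCAL_MEM_FENCE);     // wait until the whole work-group has loaded

    // trivial neighbour average, clamped at the edges of the tile
    float left  = (lid > 0)        ? tile[lid - 1] : tile[lid];
    float right = (lid < TILE - 1) ? tile[lid + 1] : tile[lid];
    out[gid] = (left + tile[lid] + right) / 3.0f;
}
```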

My own OpenCL/CPU experiments have been pretty disappointing: the best it seems to give you is an arguably “simpler” front-end to multi-core CPUs, with only a few percentage points’ boost over natively multi-threaded code (or, I presume, OpenMP-type things), and on x86 I’ve never found the vector unit all that impressive - if only because the scalar unit is so fast and the competing compilers are so mature.

However … looking into my crystal ball here …

Given the converging technological direction - looking at AMD’s GCN instruction set, one sees a general-purpose CPU with a very wide vector co-processor, not a parallel machine with a scalar helper - not to mention the historical progress of compiler technology over the years, where simpler code helps the compiler … I would suggest that worrying about vectors is probably just not that useful.

For certain algorithms they will be a natural fit - and save some not-insignificant typing - but trying to force all algorithms into vectorised forms for potential performance gains will be a dying art, mostly because it’s just a pain in the rear for programmers, and it makes the compiler’s job harder too. For example, I usually found vectorised code slower on Nvidia, even when it was a natural fit for the algorithm (although we’re talking pretty small differences, under 10%).

Obviously this doesn’t help if you’ve got to deliver stuff that runs on current computers, but don’t expect OpenCL/CPU to deliver miracles on those either.

Oh and ATI doesn’t exist anymore, they are AMD now.

Hi Nick!

Since I know a bit about vectorization on the Intel and AMD compilers, let me try to clear a few things up.

notzed was right about the Intel vectorization module: it scalarizes your kernel and bundles work-items into groups of 4 on an AVX CPU to achieve 100% ALU usage. This of course breaks down when you have loops of unequal length or thread divergence, where the same issue arises as on GPUs (one thread doing no work while others are in the divergent branch), only this time inside a single AVX vector ALU unit.
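A tiny sketch of that divergence problem, with a made-up kernel: once neighbouring work-items are packed into the lanes of one AVX register, a data-dependent branch means some lanes sit idle while the others execute.

```c
__kernel void divergent(__global const float *in,
                        __global float *out)
{
    size_t i = get_global_id(0);
    if (in[i] > 0.0f)
        out[i] = sqrt(in[i]);   // lanes taking this path run...
    else
        out[i] = 0.0f;          // ...while lanes taking this one wait, and vice versa
}
```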

Forgive me for not listening to the 1-hour+ talk that you linked (I ain’t got that much time right now), but I’m absolutely positive that AMD compilers auto-vectorize on GPUs. Since ‘legacy’ AMD hardware uses 4-5-way VLIW, running vectorized code is essential to reach near-optimal performance on those architectures. The compiler for VLIW devices uses an auto-vectorization module based on inner data dependency. When running your kernels, you can check how well vectorization was done with the AMD APP Profiler under “ALU packing”, which tells you to what percentage the ALU instructions are packed for the VLIW processor (80+% can be considered good).
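To illustrate the packing idea (purely a sketch, not AMD sample code): on a 4/5-way VLIW unit the four independent lanes of a float4 operation can be issued in one bundle, whereas a chain of dependent scalar operations leaves most slots empty, which is what the ALU-packing percentage reflects.

```c
// Vector code: four independent multiply-adds per instruction -> packs well.
__kernel void vliw_friendly(__global const float4 *a,
                            __global const float4 *b,
                            __global float4 *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] * b[i] + c[i];
}

// Dependent scalar chain: each step needs the previous result -> packs poorly.
__kernel void vliw_unfriendly(__global const float *a,
                              __global float *out)
{
    size_t i = get_global_id(0);
    float x = a[i];
    x = x * x + 1.0f;
    x = x * x + 1.0f;
    x = x * x + 1.0f;
    out[i] = x;
}
```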

I have not found any recent detail about auto-vectorization for the CPU part on AMD. The latest I found was more than a year old, and it stated that SSE code will only be generated if you explicitly use vector types (e.g. float4) and operations between vector types. The VLIW compiler cannot be reused here, since VLIW allows different operations to be executed down each lane, whereas SSE (SIMD) does not.

It is no wonder Nvidia doesn’t have auto-vectorization, since their architectures are all scalar and don’t benefit from vector code (some memory operations are faster if done on well-aligned vectors, but other than that, there’s no reason).

About the Stream FAQ document you linked: I tend not to trust documents that start with “ATI Stream”, since the SDK was renamed to the AMD APP SDK over a year ago, so at least 2-3 SDK releases have come out since any document talking about ATI Stream architectures/SDKs/tools was written. Unfortunately, the AMD website still hosts such documents, which are highly out of date.

Hope my comments are useful.

Hello, all.

I wanted to add my empirical observations regarding vectorization speedup. I have a kernel that does a lot of floating point calculations, with plenty of transcendentals. I decided to split the job between the GPU and the CPU cores; the same code runs on both. I didn’t know how much vectorization would help, but decided to try it for the 5870. Of course I then ran it on the CPU.

Throughput on both Intel CPUs (2-core MBP i7 and 4-core MP Xeon) was tripled (I re-tested just now on both to be sure). That is, a kernel that runs in 120 ms without it runs in 40 ms with it.

Throughput on the 5870 doubled, so overall well worth the effort. I could have hoped for quadrupling on both devices, but I’ll take what I got.* (I forget what the speedup is on the nVidia GPU in my laptop.)

…Just what I’ve observed; don’t know what’s going on under the hood; perhaps my code is too complex for their algorithm to auto-vectorize; it wasn’t trivial for me to do it…

  • very much looking forward to GCN in the 7980…