How does the OpenCL Platform model map to actual hardware?

I have a few questions as to how the OpenCL Platform model relates to the actual hardware:

  1. Does the OpenCL specification dictate what a compute unit or processing element has to correspond to on the actual hardware, or is that left to the device vendor?

  2. What do a compute unit and a processing element usually/always map to on a CPU, GPU and accelerator device respectively?

  3. Are there situations where it would be useful to know the number of processing elements a device has?

  4. Can I use the OpenCL API to query a device for its number of processing elements?

  1. The OpenCL specification dictates the capabilities of the hardware, e.g., minimum memory sizes, IEEE floating-point compliance, scheduling control (memory fences and barriers), etc. It's then up to the vendor to design compute units and processing elements with those capabilities. For example, the vendor could give each processing element its own independent flow-control logic, but that takes up more die space. So they may save die space by putting that control in the compute unit instead and making all processing elements run in SIMD. Vendors have a wide range of options, building with either scalar or vector processors.

  2. A compute unit is some kind of core that can schedule and synchronize processing elements. A processing element is some kind of SIMD or SPMD unit with ALU(s) and SFU(s) (special function units).

  3. If you know the frequency, number of compute units, and processing elements per compute unit, you could then calculate the theoretical peak FLOPS and try to load balance based on that info.

  4. You can query a kernel for its PREFERRED_WORK_GROUP_SIZE_MULTIPLE. This should tell you how many processing elements there are per compute unit. This is generally true for GPUs, but I recently found that for CPUs Intel's platform says 16 while AMD's platform says 1, even for the same Intel i5 CPU device.
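To make point 3 concrete, here is a rough sketch of the peak-FLOPS arithmetic in C. It assumes you already have the inputs: the compute unit count and clock can come from `clGetDeviceInfo` (`CL_DEVICE_MAX_COMPUTE_UNITS`, and `CL_DEVICE_MAX_CLOCK_FREQUENCY` in MHz), but the processing elements per compute unit and the FLOPs per element per cycle are assumptions you have to supply from vendor documentation. The function names `peak_gflops` and `share_for_first` are made up for this illustration.

```c
#include <assert.h>

/* Theoretical peak throughput in GFLOPS.
 *   compute_units          e.g. the value of CL_DEVICE_MAX_COMPUTE_UNITS
 *   pes_per_cu             processing elements per compute unit (assumed known)
 *   clock_mhz              e.g. the value of CL_DEVICE_MAX_CLOCK_FREQUENCY (MHz)
 *   flops_per_pe_per_cycle assumption: 2 if each element retires one
 *                          fused multiply-add per cycle, 1 otherwise
 */
double peak_gflops(unsigned compute_units, unsigned pes_per_cu,
                   unsigned clock_mhz, unsigned flops_per_pe_per_cycle)
{
    double flops_per_cycle =
        (double)compute_units * pes_per_cu * flops_per_pe_per_cycle;
    /* cycles/s = clock_mhz * 1e6, and GFLOPS = FLOPS / 1e9 */
    return flops_per_cycle * clock_mhz / 1000.0;
}

/* Naive load balancing: fraction of the work to hand to device A
 * when splitting between two devices in proportion to their peaks. */
double share_for_first(double gflops_a, double gflops_b)
{
    assert(gflops_a > 0.0 && gflops_b > 0.0);
    return gflops_a / (gflops_a + gflops_b);
}
```

For example, a hypothetical device with 16 compute units, 32 processing elements each, a 1000 MHz clock and one FMA per cycle gives `peak_gflops(16, 32, 1000, 2)`, i.e. 1024 GFLOPS. Real kernels rarely reach this number, so treat the resulting split only as a starting point.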

Those are my quick answers but someone else may correct my understanding or fill in what I may have overlooked.

Thank you for the reply.

Hmm, I can’t seem to find any such constant in the documentation.

I sort of understand what the processing elements are on a GPU but what are they on a CPU?

PREFERRED_WORK_GROUP_SIZE_MULTIPLE was introduced in OpenCL 1.1. It is best to make the work-group size a multiple of this value to avoid wasting the device's resources. On a GPU this value is the wavefront size. On a CPU, the AMD APP SDK is not able to auto-vectorize kernels, so it reports 1. The Intel OpenCL SDK is able to auto-vectorize kernels, so its driver reports 16 (or something like that).
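As a sketch of how that advice might be applied: the full constant name is `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE`, and it is queried per kernel with `clGetKernelWorkGroupInfo`, alongside `CL_KERNEL_WORK_GROUP_SIZE` for the maximum work-group size. The helpers below are plain C with no OpenCL calls. They assume you have already fetched those two values, and the names `pick_local_size` and `round_up_global` are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Largest work-group size that is a multiple of the preferred multiple
 * and does not exceed the kernel's maximum work-group size. Falls back
 * to the maximum if the multiple itself does not fit. */
size_t pick_local_size(size_t preferred_multiple, size_t max_wg_size)
{
    assert(preferred_multiple > 0 && max_wg_size > 0);
    if (preferred_multiple >= max_wg_size)
        return max_wg_size;
    return (max_wg_size / preferred_multiple) * preferred_multiple;
}

/* Round the global work size up to a whole number of work-groups,
 * since OpenCL 1.x requires it to be divisible by the local size. */
size_t round_up_global(size_t n_items, size_t local_size)
{
    assert(local_size > 0);
    size_t rem = n_items % local_size;
    return rem == 0 ? n_items : n_items + (local_size - rem);
}
```

With a preferred multiple of 16 and a maximum of 100, `pick_local_size` gives 96, and 1000 items then pad out to a global size of 1056. Because of the padding, the kernel needs a guard such as `if (get_global_id(0) < n)` so the extra work-items do nothing.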

I see. Thank you.

(Also found the constant in the docs now.)