Predefined Macros: device type

Khronos should add the following predefined macros (or equivalents) to the official OpenCL specification: CPU, GPU, ACCELERATOR, X86, and X86_64. This would allow a single kernel source to be maintained for all device types and the two most common ISAs, independent of the build options. Currently the AMD APP SDK supports all of the requested predefined macros except ACCELERATOR.
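To illustrate the idea, here is a minimal sketch of how one source file could branch on such macros. It is written as plain C so that it compiles on its own; the function name selected_code_path is hypothetical, and the unadorned names CPU, GPU and ACCELERATOR are taken from the post above (an actual SDK may spell them differently, so check its documentation).

```c
/* Sketch only: CPU, GPU and ACCELERATOR are the macro names proposed in the
   post. They are assumed to be defined by the compiler or passed with -D. */
const char *selected_code_path(void)
{
#if defined(GPU)
    return "gpu";          /* branch that would hold GPU-tuned code */
#elif defined(ACCELERATOR)
    return "accelerator";  /* branch that would hold accelerator-tuned code */
#elif defined(CPU)
    return "cpu";          /* branch that would hold CPU-tuned code */
#else
    return "generic";      /* portable fallback when no macro is defined */
#endif
}
```

Compiled with none of the macros defined, the function returns "generic"; compiled with -D GPU it returns "gpu".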

As I already mentioned in another forum post, I would also be happy to see this included in the next standard.

Herbert

All for it. Would be a really useful feature.

Unfortunately, I didn’t see this suggestion adopted in the OpenCL 1.2 specification. Please consider it for the next version, though (18 months is an extremely long time to wait for something seemingly so simple).

Can you elaborate a little on how these work? For each device that the code is being built for, the OpenCL implementation would look at the device type and enable each of the corresponding macros?

Given that a device can be simultaneously a GPU, a CPU, and an accelerator (note that cl_device_type is a bitfield), could several of these macros be enabled at once?
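A small sketch may make the bitfield point concrete. The constants below mirror the CL_DEVICE_TYPE_* bit values from the OpenCL headers and are redefined only so the snippet compiles without an SDK; the names device_type_t, DEV_TYPE_* and device_kind_count are hypothetical.

```c
/* cl_device_type is a 64-bit bitfield; these values mirror
   CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU and CL_DEVICE_TYPE_ACCELERATOR. */
typedef unsigned long long device_type_t;
#define DEV_TYPE_CPU         (1ULL << 1)
#define DEV_TYPE_GPU         (1ULL << 2)
#define DEV_TYPE_ACCELERATOR (1ULL << 3)
#define DEV_TYPE_ALL_KINDS   (DEV_TYPE_CPU | DEV_TYPE_GPU | DEV_TYPE_ACCELERATOR)

/* Returns how many of the three type bits are set, i.e. how many of the
   proposed macros would apply to one device. */
int device_kind_count(device_type_t type)
{
    device_type_t kinds = type & DEV_TYPE_ALL_KINDS;
    int n = 0;
    while (kinds) {
        kinds &= kinds - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

A conventional GPU yields 1, while a hypothetical device reporting all three bits yields 3, in which case all three macros would be candidates for being defined at once.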

As for X86 and X86_64, that sounds thornier. Why stop with x86? Most CPUs on the planet today are probably ARM, and there are different variants of the ARM instruction set as well. Then the good folks at IBM will demand that Cell be included too. Etc.

I see, I hadn’t given much thought to cl_device_type being a bitfield as you described. Actually, that confuses me somewhat. In your example of a device being simultaneously a GPU, CPU, and accelerator, how do you control exactly where the kernel is run? For some reason I was under the impression that you would get back three distinct devices (each with only one bit turned on in its cl_device_type bitfield).

In your example of a device being simultaneously a GPU, CPU, and accelerator, how do you control exactly where the kernel is run?

It’s a single device. The kernel runs on it. It’s not a GPU and a CPU working separately anymore. It may sound confusing because such devices are not common yet.

Can’t you do this by defining your own macro which is passed in the options argument to clBuildProgram? It does mean that you now need to call clBuildProgram separately for devices of different types in your CL context which makes it not very elegant.
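As a rough sketch of that workaround, the helper below turns a device type bitfield into a "-D ..." option string. The function name device_type_build_options and the DEV_TYPE_* constants are hypothetical (the constants mirror the CL_DEVICE_TYPE_* values so the snippet compiles without an SDK), and the macro names are the ones proposed in this thread.

```c
#include <stdio.h>
#include <stddef.h>

/* Mirror CL_DEVICE_TYPE_CPU / _GPU / _ACCELERATOR from the OpenCL headers. */
#define DEV_TYPE_CPU         (1ULL << 1)
#define DEV_TYPE_GPU         (1ULL << 2)
#define DEV_TYPE_ACCELERATOR (1ULL << 3)

/* Writes e.g. "-D CPU -D GPU" into buf.
   Returns 0 on success, -1 if buf is too small. */
int device_type_build_options(unsigned long long type, char *buf, size_t size)
{
    static const struct { unsigned long long bit; const char *opt; } table[] = {
        { DEV_TYPE_CPU,         "-D CPU" },
        { DEV_TYPE_GPU,         "-D GPU" },
        { DEV_TYPE_ACCELERATOR, "-D ACCELERATOR" },
    };
    size_t used = 0;

    if (size == 0)
        return -1;
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (!(type & table[i].bit))
            continue;
        /* Separate options with a space once something has been written. */
        int n = snprintf(buf + used, size - used, "%s%s",
                         used ? " " : "", table[i].opt);
        if (n < 0 || (size_t)n >= size - used)
            return -1;  /* output would have been truncated */
        used += (size_t)n;
    }
    return 0;
}
```

In host code the type would come from clGetDeviceInfo with CL_DEVICE_TYPE, and the resulting string would be passed as the options argument of clBuildProgram, once per device.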

How do you optimize for a device if you don’t really know which particular hardware the software will run on? I guess you just program a simple scalar version and the hardware or runtime/driver will manage it all for you?

Do you happen to know how current AMD APUs appear to developers: as two distinct devices, or as one device with two cl_device_type bits set? Are any devices like the one you described available to buy now? BTW, I have an AMD Radeon HD 5970 (two identical GPU dies on one board) and it appears as two distinct GPU devices, which was contrary to my hopes and expectations. I’m starting to feel like there isn’t much I can assume about CPU and GPU devices anymore. :frowning:

Having to call clBuildProgram separately, as you mentioned, was my main motivation for suggesting this feature. I naively assumed it wouldn’t be too difficult to implement since it’s already available in the AMD OpenCL SDK.

I also think that this is a useful feature to add.
It just means that these macros would be added automatically, instead of passing them manually.

I’ve been thinking about this issue, and it dawned on me that predefined macros would not be as good as a get_device_type() function whose result can change dynamically as the kernel is shifted between basic hardware architectures during execution.

I’ve also been wondering how one would determine the preferred_work_group_multiple for a device like the one you describe. The least common multiple of the preferred_work_group_multiple of each subdevice? Better hope they’re not all distinct prime numbers :slight_smile:
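For what it’s worth, the least-common-multiple idea is easy to sketch. The function name lcm_of_multiples and the sample values are hypothetical; on real hardware each value would come from querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE through clGetKernelWorkGroupInfo, and whether a combined device would actually report the LCM is an open question.

```c
#include <stddef.h>

/* Greatest common divisor by Euclid's algorithm. */
static size_t gcd_size(size_t a, size_t b)
{
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Least common multiple of count values; returns 1 for an empty list and
   0 if any value is 0. Overflow is not checked in this sketch. */
size_t lcm_of_multiples(const size_t *m, size_t count)
{
    size_t result = 1;
    for (size_t i = 0; i < count; i++) {
        if (m[i] == 0)
            return 0;
        /* Divide before multiplying to keep intermediate values small. */
        result = result / gcd_size(result, m[i]) * m[i];
    }
    return result;
}
```

With multiples of 64, 32 and 16 the result is 64, but with the distinct primes 3, 5 and 7 it is already 105, which is the worry raised above.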

If a kernel is allowed to be shifted between different basic hardware architectures then it would be helpful to provide a function that hints/dictates which parts of the kernel code should be executed where, for example via a set_device_type() function.