Required atomic built-in functions

I know a research group that built an abstraction layer on top of OpenCL to treat multiple identical GPU devices as a single device (assuming they all also have the same PCIe bandwidth). However, I can’t imagine how they could make it work if any atomic functions are required for compliance. Are some atomics required? If they were all optional, I can think of how to do it without too much pain, and it could be very useful.

I think I just found part of my answer. In OpenCL 1.0 the basic atomic functions were optional (exposed as extensions), but as of OpenCL 1.1 the 32-bit integer atomics are required. If that is wrong please correct me.
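For anyone wanting to check this on their own devices, here is a minimal sketch of the test, assuming you have already fetched the CL_DEVICE_VERSION and CL_DEVICE_EXTENSIONS strings via clGetDeviceInfo. The helper name `has_int32_atomics` is made up for illustration; the four extension names are the 1.0-era extensions for 32-bit integer atomics.

```python
import re

# The four 32-bit integer atomic extensions that became core in OpenCL 1.1.
INT32_ATOMIC_EXTENSIONS = (
    "cl_khr_global_int32_base_atomics",
    "cl_khr_global_int32_extended_atomics",
    "cl_khr_local_int32_base_atomics",
    "cl_khr_local_int32_extended_atomics",
)

def has_int32_atomics(device_version, device_extensions):
    """Hypothetical helper: are 32-bit integer atomics guaranteed on this device?

    device_version:    CL_DEVICE_VERSION, e.g. "OpenCL 1.1 <vendor info>"
    device_extensions: CL_DEVICE_EXTENSIONS, a space-separated list of names
    """
    match = re.match(r"OpenCL (\d+)\.(\d+)", device_version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % device_version)
    version = (int(match.group(1)), int(match.group(2)))
    if version >= (1, 1):
        return True  # part of the core specification from 1.1 onwards
    # On a 1.0 device every one of the optional extensions must be reported.
    reported = set(device_extensions.split())
    return all(name in reported for name in INT32_ATOMIC_EXTENSIONS)
```

So a 1.0 device that reports only some of the extensions would come back False, while any 1.1 device comes back True regardless of its extension string.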

Well, atomic functions pretty much ruin my hope of making an abstract device that integrates several similar devices. I could only think of how to do it without any atomic functions.

Is there any reasoning behind making them obligatory in OpenCL 1.1?

OpenCL is a software abstraction: you can implement atomics however you want, they just have to honour the contract.

For example, across many devices you could break each kernel up into sections bounded by the atomic operations, run those sections separately on each device, and then synchronise on the host.
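To make that concrete, here is a rough host-side simulation in Python, not real OpenCL. It assumes the simplest case: every work-item does an atomic add on one shared counter and ignores the returned old value, so the adds can be reordered freely. The names `run_section` and `run_on_virtual_device` are invented for the sketch.

```python
def run_section(device_items):
    """Simulate one device running a kernel section up to the atomic boundary.

    Instead of atomically adding to the shared counter, the device
    accumulates everything it would have added into a private delta.
    """
    delta = 0
    for item in device_items:
        delta += item  # stands in for atomic_add(&counter, item)
    return delta

def run_on_virtual_device(work, num_devices, counter=0):
    """Split the work over the devices, then merge their deltas on the host."""
    # Assumption: work-items are dealt out round-robin to the devices.
    chunks = [work[d::num_devices] for d in range(num_devices)]
    deltas = [run_section(chunk) for chunk in chunks]
    # Host synchronisation point: apply every device's delta to the counter.
    return counter + sum(deltas)
```

This only honours the contract when nobody looks at the value the atomic returns. If the kernel uses the old value (e.g. to claim a unique slot in a buffer), the host would have to hand out ranges to each device instead, which is more work.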

I’m not saying it would be efficient, but what do you really expect to be able to do anyway? Atomic operations require very specific, specialised hardware in order to run fast, and without that you will have no choice but to resort to host-based software.

Global atomics are so slow on AMD hardware, for example, that I wouldn’t use them except in very rarely executed code (i.e. code that is possibly calling the host already), so a high overhead is already expected. But AMD does have global counters implemented in hardware to get around that …

Re your earlier query: there’s nothing to say the research project implemented the full specification in the first place. It is possible to do without atomics entirely, at a cost of memory and extra processing steps.
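As an illustration of that trade-off, here is a small Python simulation of a histogram built without atomics, as a sketch only. Each simulated work-item gets a private row of bins (the memory cost), and a second pass sums the rows (the extra processing step). The function name and the modulo binning are assumptions made up for the example.

```python
def histogram_without_atomics(values, num_bins, num_work_items):
    """Histogram with no shared writes: private rows, then a reduction."""
    # Step 1: one private row per work-item, so no two work-items ever
    # write the same location. Memory cost: num_work_items * num_bins.
    partial = [[0] * num_bins for _ in range(num_work_items)]
    for wi in range(num_work_items):
        # Assumption: work-item wi handles every num_work_items-th value.
        for v in values[wi::num_work_items]:
            partial[wi][v % num_bins] += 1
    # Step 2: the extra pass that replaces the atomics, summing each bin
    # across all the private rows.
    return [sum(row[b] for row in partial) for b in range(num_bins)]
```

In a real kernel the second step would be another kernel launch (or a reduction on the host), which is exactly where the extra processing comes from.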

Amalgamating different hardware with different performance characteristics will be a challenge! Different hardware often requires a different coding approach, runs at a different speed, and so on; managing all that scheduling and keeping the memory close to the right kernels will be difficult.