Atomic add for floats (atomicAdd) has been supported in CUDA since Fermi (circa 2010), but this feature still does not seem to be supported in the latest OpenCL specification. Is it on the roadmap?

Right now I am using the only known hack to get around this problem (via atomic_xchg).
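For readers who have not seen it, the exchange-based workaround usually has roughly the shape sketched below. This is a minimal sketch, not necessarily the exact code in my kernel: it is written as host-side C11 so it compiles without an OpenCL runtime, the helper name atomic_add_float_xchg is hypothetical, and atomic_exchange on an _Atomic float stands in for atomic_xchg on a volatile __global float* in a kernel.

```c
#include <stdatomic.h>

/* Exchange-only float add (hypothetical helper name).
 * Assumption: in an OpenCL kernel, atomic_exchange() on an _Atomic float
 * would be atomic_xchg() on a volatile __global float*. */
static void atomic_add_float_xchg(_Atomic float *addr, float value)
{
    float carry = value;

    /* Inner exchange: take the current sum out of *addr, leaving 0 behind.
     * Outer exchange: store (sum + carry) back and read what was there.
     * If nobody interfered we read back the 0 we left and are done.
     * Otherwise we read another thread's deposit, which becomes the new
     * carry, and we loop to fold it back into the sum. */
    while ((carry = atomic_exchange(addr,
                                    atomic_exchange(addr, 0.0f) + carry)) != 0.0f) {
        /* retry until no foreign deposit is picked up */
    }
}
```

Each call costs at least two atomic exchanges, plus two more for every retry under contention, and a concurrent reader can briefly observe 0 at the target address while an update is in flight.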

However, this workaround adds too much overhead on some processors. For example, on an Intel CPU, the while(atomic_xchg()) line costs me about 25% of the run time. In comparison, the CUDA equivalent of this kernel shows less than 1% overhead for atomicAdd.

If there is no known timetable for this feature, are there any improved solutions that do this more efficiently?