OpenCL 1.1 has global atomics that work well on NVIDIA GPUs but are slower on AMD GPUs. On AMD GPUs, counter32_t counters perform better. Here are the two versions:
//OpenCL 1.1 has atomic_inc built in (no extension needed)
//see http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/atomic_inc.html
__kernel void globalAtomicKernelOpenCL1_1( volatile __global int* counter)
{
    atomic_inc(counter);
}
//OpenCL 1.1 atomic device counters extension, usually faster on current AMD hardware
//http://www.khronos.org/registry/cl/extensions/ext/cl_ext_atomic_counters_32.txt
#pragma OPENCL EXTENSION cl_ext_atomic_counters_32 : enable
__kernel void counterAtomicKernelExt( counter32_t counter)
{
    atomic_inc(counter);
}
Unfortunately, kernels using counter32_t don’t compile on platforms that don’t support this extension (such as current NVIDIA OpenCL implementations).
I’d rather not duplicate kernels just for this extension. Currently, a find-and-replace on the kernel source code at runtime, depending on whether ‘cl_ext_atomic_counters_32’ is available, works:
//use the kernel source as-is when the extension is available
const char* globalAtomicsKernelStringPatched = globalAtomicsKernelString;
if (!strstr(clInfo.m_deviceExtensions,"cl_ext_atomic_counters_32"))
{
    //no counter32_t support: fall back to a plain global int pointer
    globalAtomicsKernelStringPatched = findAndReplace(globalAtomicsKernelString,"counter32_t", "volatile __global int*");
}
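The findAndReplace helper is not shown above. As a rough illustration only, here is a minimal sketch of what such a helper could look like in plain C. The name findAndReplaceSketch, its signature, and the choice to return a freshly allocated string are all assumptions, not the actual implementation:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (an assumption, not the original findAndReplace):
   returns a newly malloc'ed copy of src in which every non-overlapping
   occurrence of pattern is replaced by replacement.
   Returns NULL if pattern is empty or allocation fails.
   The caller owns the result and must free() it. */
static char* findAndReplaceSketch(const char* src, const char* pattern, const char* replacement)
{
    assert(src && pattern && replacement);

    size_t srcLen = strlen(src);
    size_t patLen = strlen(pattern);
    size_t repLen = strlen(replacement);
    if (patLen == 0)
        return NULL;

    /* First pass: count occurrences so the output size is known up front. */
    size_t count = 0;
    for (const char* p = strstr(src, pattern); p != NULL; p = strstr(p + patLen, pattern))
        ++count;

    char* out = (char*)malloc(srcLen - count * patLen + count * repLen + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: copy the text between matches, then the replacement. */
    char* dst = out;
    const char* cur = src;
    const char* hit;
    while ((hit = strstr(cur, pattern)) != NULL)
    {
        size_t n = (size_t)(hit - cur);
        memcpy(dst, cur, n);
        dst += n;
        memcpy(dst, replacement, repLen);
        dst += repLen;
        cur = hit + patLen;
    }
    strcpy(dst, cur); /* copy the tail, including the terminating '\0' */
    return out;
}
```

With a helper like this, the patched string is heap-allocated while the unpatched one is not, so the calling code has to remember which case it is in before freeing.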
Does anyone have a better suggestion?
Thanks,
Erwin