I've just migrated my program from CUDA to OpenCL. It took a bit of work to change all the host code: device initialization, memory allocation, kernel execution, etc.

For the device code (kernels), however, the changes were very small:
- replacing __syncthreads() with barrier(CLK_LOCAL_MEM_FENCE)
- changing from sqrtf(x) to sqrt((float)x)
- constant memory is not allocated statically as in CUDA, but dynamically (with a __constant kernel argument and an appropriate call to clSetKernelArg() on the host side).

Those were the only changes made.

Unfortunately, the OpenCL version consumes more registers than the native CUDA version.
Even specifying -cl-nv-maxrregcount (in clBuildProgram) and setting the maximum number of registers to the number achieved in the native CUDA compilation didn't help much - the limit is honoured (judging by the BUILD_LOG from clGetProgramBuildInfo), but there are spills to private memory ("local" in CUDA nomenclature) and the overall performance of the kernel is lower.
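
For reference, a minimal sketch in C of how such a build-options string might be assembled before it is handed to clBuildProgram. The helper name make_build_options and the value 20 are hypothetical. The -cl-nv-* flags come from NVIDIA's cl_nv_compiler_options extension, and the "=N" spelling is an assumption that should be checked against the driver's documentation.

```c
#include <stdio.h>

/* Hypothetical helper: formats a build-options string for clBuildProgram.
   -cl-fast-relaxed-math is the standard OpenCL fast-math switch;
   -cl-nv-verbose asks the NVIDIA compiler to report register usage and
   spills in the build log; -cl-nv-maxrregcount caps registers per work-item.
   Returns the number of characters written, or -1 if buf is too small. */
int make_build_options(char *buf, size_t size, int max_regs)
{
    int n = snprintf(buf, size,
                     "-cl-fast-relaxed-math -cl-nv-verbose -cl-nv-maxrregcount=%d",
                     max_regs);
    if (n < 0 || (size_t)n >= size)
        return -1;   /* formatting error or truncated output */
    return n;
}
```

The resulting string would be passed as the options argument of clBuildProgram, and the register and spill report read back with clGetProgramBuildInfo using CL_PROGRAM_BUILD_LOG.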

I've been experimenting with the -cl-nv-opt-level build option but achieved nothing.
With both CUDA and OpenCL I'm using the fast math option - I also checked without it, but it made no difference. Every time, the OpenCL register usage is higher or there are spills. The difference can be as high as 7 registers and it ruins the performance (fewer blocks/workgroups can be processed simultaneously on a multiprocessor).
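
To illustrate why a handful of extra registers matters, here is a rough sketch in C of the register-limited occupancy arithmetic. The function name groups_per_sm and all numbers in the note below are assumptions chosen for illustration, not measurements from the program above. The model is deliberately simplified: it ignores register allocation granularity, shared memory limits and the hardware cap on resident blocks.

```c
/* Hypothetical helper: how many work-groups (CUDA blocks) can be resident
   on one multiprocessor when the register file is the only limit.
   regs_per_sm        - size of the multiprocessor's register file (assumed)
   regs_per_thread    - registers used by one work-item / thread
   threads_per_group  - work-group (block) size */
int groups_per_sm(int regs_per_sm, int regs_per_thread, int threads_per_group)
{
    int regs_per_group = regs_per_thread * threads_per_group;
    if (regs_per_group <= 0)
        return 0;                       /* guard against invalid input */
    return regs_per_sm / regs_per_group; /* integer division rounds down */
}
```

With an assumed register file of 16384 registers and work-groups of 256 work-items, groups_per_sm(16384, 16, 256) gives 4 resident work-groups, while groups_per_sm(16384, 23, 256) gives only 2, so in this simplified model 7 extra registers halve the number of work-groups a multiprocessor can hold.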

What is the cause of this behaviour? Is it simply a poor OpenCL compiler?