Line-by-line time profiling for an OpenCL kernel

Hi, I am working on a project to optimize an OpenCL kernel. The kernel is computationally dense, and I’d like to see where the bottleneck is.

We have profiled the code with CodeXL on an AMD GPU, but the profiler only reports abstract metrics, which are not very helpful in pinpointing the hotspots.

Are there any profilers that allow line-by-line profiling of OpenCL code? Commercial ones are fine too.

Also, if I can run the code with a CPU backend (such as Intel’s CL library), can I use cachegrind/kcachegrind to obtain such info? Are there any other workarounds (like using the NVIDIA Visual Profiler, etc.)?

Thanks