Profiling of kernel code

I have a kernel that takes 8 ms to run. The kernel is large, and I want to know which line or part of it is causing the bottleneck.

What is the best way to identify a bottleneck inside a kernel?

Note: I am using an AMD machine.

I have yet to find a good OpenCL profiler to measure performance within a kernel. My usual approach is to break the kernel into smaller pieces and profile each of those. Not ideal, but it may work for you.
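To make the split-and-time idea concrete, here is a minimal sketch of the bookkeeping, assuming you build truncated variants of the kernel (variant i runs stages 0..i and then returns early) and time each one from the host, for example with OpenCL event profiling (a queue created with CL_QUEUE_PROFILING_ENABLE and clGetEventProfilingInfo). The function names, stage names, and timings below are hypothetical and only illustrate the arithmetic; they are not measurements.

```python
def stage_costs(cumulative_ms):
    """Attribute time to each stage from timings of truncated kernel variants.

    cumulative_ms[i] is the measured time (in ms) of a variant that runs
    stages 0..i and then returns early.
    """
    costs = []
    previous = 0.0
    for total in cumulative_ms:
        costs.append(total - previous)  # time added by this stage alone
        previous = total
    return costs


def slowest_stage(stage_names, cumulative_ms):
    """Return (name, cost_ms) of the stage that adds the most time."""
    costs = stage_costs(cumulative_ms)
    index = max(range(len(costs)), key=costs.__getitem__)
    return stage_names[index], costs[index]


# Hypothetical stage names and timings, for illustration only.
names = ["load", "compute", "reduce", "store"]
timings = [1.5, 2.0, 7.0, 8.0]
print(slowest_stage(names, timings))
```

Two caveats: each truncated variant still has to write out a result that depends on the work done so far, otherwise the compiler may remove that work as dead code; and cutting a kernel short can change register usage and occupancy, so the per-stage numbers are estimates rather than exact costs.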

Same here.

I haven't found anything that can profile inside my kernels either. A careful read of the kernel source can stand in for a profiler: think about register usage, global memory access, coalescing, branching, and so on. Remove anything you don't need. Try spreading your kernel over more or fewer work-items.
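For the last suggestion, a small sweep over work-group sizes is easy to script. This is only a sketch: `measure_ms` is a hypothetical callback that you would write yourself to launch the kernel with a given local size and return the elapsed time in milliseconds, and the upper limit would come from your device (for example CL_KERNEL_WORK_GROUP_SIZE). It assumes a 1D range and only tries powers of two that divide the global size evenly, since OpenCL 1.x requires the global size to be a multiple of the local size.

```python
def candidate_local_sizes(global_size, max_local_size):
    """Powers of two up to max_local_size that evenly divide global_size."""
    sizes = []
    size = 1
    while size <= max_local_size:
        if global_size % size == 0:
            sizes.append(size)
        size *= 2
    return sizes


def best_local_size(global_size, max_local_size, measure_ms):
    """Time every candidate local size and return (best_size, all_timings).

    measure_ms is a user-supplied (hypothetical) callback: it takes a local
    size, runs the kernel, and returns the elapsed time in milliseconds.
    """
    timings = {}
    for size in candidate_local_sizes(global_size, max_local_size):
        timings[size] = measure_ms(size)
    best = min(timings, key=timings.get)
    return best, timings
```

In practice you would run each configuration several times and keep the minimum or median, because single measurements on a GPU tend to be noisy.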

Thanks kylelutz and clint3112.