OpenCL profiling tools for Linux

I’ve been searching for a profiling tool that will let me profile and optimize my OpenCL kernels.
I’m using Ubuntu 12.04 64-bit, with an Intel i7 3930K.
I have access to both an AMD GPU (HD6870) and NVidia GPU (GTX 580).

NVidia:
I’ve tried using NVidia’s Visual Profiler (nvvp), but when I try to profile my OpenCL application
I just get “Warning: No CUDA application was profiled, exiting”.
Although I can’t find any mention of this in NVidia’s documentation, it seems as though
nvvp does not support profiling of OpenCL kernels, only CUDA. Is this correct?

AMD:
AMD’s APP Profiler is limited to Visual Studio.
While I could use Windows (at least for profiling), I don’t have access to a full version of Visual Studio, and the Express versions don’t seem to be supported, so this is not an option.

Intel:
Supposedly only supports OpenCL profiling on Linux through their “VTune™ Amplifier”, which costs $899 (not an option).
Still, I tried out the trial version, but was unsuccessful: I only got profiling information for the host code. I don’t see why I should pay 900 bucks for that if I can do it for free with gprof or oprofile, so again: not really an option.

So, what other tools are available that I could use for profiling my OpenCL device code?

I’ve used NVIDIA’s successfully before, but it was a long time ago (nearing 2 years?). It is very cantankerous with respect to whether it works or not! It buffers the profiling data in memory, and if that doesn’t get flushed to disk before the application exits, or if a different subset makes it into a different run (it wants to run the application a pile of times), profiling will just fail. I really couldn’t work out how to make it flush reliably; closing all contexts and devices wasn’t reliable (but it helped). It works better if you limit the counters being profiled to a minimum, e.g. just turn everything off except for CPU time (sorry, I can’t remember the exact names of the counters).

I think at one point it became so useless that I had to manually copy the intermediate CSV files it generated on each pass, before all passes were complete. Otherwise it would finish, scan the files, spit out an error and delete them, even though there was some useful info in some of the files.

It’s not a stretch to say it wasn’t my favourite bit of software ever …

I thought the AMD one was only for ms-vs too, but then I found sprofile in the tools directory (from memory).

As far as I’m aware it’s the profiler used by the ms-vs plugin. I think the plugin is just a way to run it and display its output graphically (I have not used it though, I really can’t stand ms-vs). It’s a bit of a pain to work with the textual output, and its format doesn’t appear to be documented anywhere … but you can get pretty exhaustive output from it. The basic output already has the important stuff for individual kernel profiling in a human-readable format, e.g. LDS conflicts, ALU load, wavefronts, kernel execution time, etc., so that’s all I’ve been using.

I’ve never had trouble with it not recording stuff either, although I don’t look too closely.

The event profiling stuff is such a clumsy horrible way to time anything that I try to avoid it as much as possible, and it’s easier (and almost just as expressive) just using gettimeofday(). You really need a profiler to get into the nitty gritty though.
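
For what it’s worth, the gettimeofday() approach can be sketched roughly like this. The helper names (wall_seconds, time_mean, busy_loop) are made up for illustration and busy_loop is only a stand-in workload; in a real OpenCL program the work function would enqueue the kernel and then call clFinish(), because the enqueue calls return before the device has finished.

```c
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, read with gettimeofday(). */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Call work(arg) `runs` times and return the mean wall-clock seconds per call.
 * Assumption: for OpenCL, work() enqueues the kernel and then blocks
 * (e.g. with clFinish()) so that the device time is included. */
static double time_mean(void (*work)(void *), void *arg, int runs)
{
    if (runs < 1)
        return 0.0;

    double start = wall_seconds();
    for (int i = 0; i < runs; i++)
        work(arg);
    return (wall_seconds() - start) / runs;
}

/* Hypothetical stand-in workload: spins for *(int *)arg iterations. */
static void busy_loop(void *arg)
{
    volatile long sink = 0;
    int n = *(int *)arg;
    for (int i = 0; i < n; i++)
        sink += i;
}
```

Something like time_mean(busy_loop, &n, 10) then gives an average over ten runs. This only measures total wall-clock time as seen from the host, so it includes launch and synchronisation overhead and says nothing about what happens inside the kernel.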

OpenCL support in the Visual Profiler seems to be broken with the latest NVIDIA driver/toolkit. It used to work well with the CUDA 4.2 toolkit and the 295.41 driver.

AMD APP includes a command-line profiler that works on Linux as well. It produces CSV files that you can then open and analyze by hand.