OpenCL on headless system: memory profiling

Hello everyone,

I am using a headless Linux system (no Xorg) on a 32x Opteron 6300 with 2 R7 200 cards.
I managed to develop my application so it works on all 3 devices with the fglrx driver (15.12).
I ported my OpenCL kernel from working ANSI C code. As the code does no memory management of its own, the whole program uses variables declared inside functions (which is private memory in OpenCL, I believe). Subfunctions throughout the kernel mainly take arguments that are pointers to private memory.
Now, here’s my problem:
The kernel runs 64x as fast on the Opteron devices as it does on the Radeons.
I suspect the GPU is pushing variables back and forth to global memory, though I am not sure about this.
According to valgrind, the C code uses roughly 1000 bytes of memory during execution.
It does not use any big global arrays; the biggest and most used array is an array of 128 uchars in constant memory.
How can I profile the kernel’s memory management on a headless system? I know I should use the AMD APP SDK, but I can’t find anything about profiling on a headless system.

I don’t believe the OpenCL driver even attempts to optimize memory transactions, unless register spilling is required. You’ll probably have better luck asking your specific question on AMD’s forum rather than here. I can suggest using the Analyze mode of CodeXL. It only requires a Radeon driver to be installed, I believe. Perhaps this is indeed an occupancy problem, and you should use local memory for temporary storage and/or break the computation into multiple kernels. Or perhaps the CPU is simply better suited to your use case than a low-end GPU.
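
To put a rough number on the occupancy concern, here is a minimal C sketch. It assumes a GCN-class GPU where each SIMD lane has 256 vector registers and at most 10 wavefronts can be resident per SIMD. Those limits, and the helper names `waves_per_simd` and `occupancy_percent`, are my assumptions and not something from the original post. The sketch also ignores register allocation granularity and the local memory and scalar register limits, so treat the result as an upper bound.

```c
/* Assumed GCN limits per SIMD (assumptions, not taken from the post). */
#define MAX_WAVES_PER_SIMD 10   /* cap on resident wavefronts per SIMD    */
#define VGPRS_PER_LANE     256  /* vector registers available to one lane */

/* Hypothetical helper: estimate how many wavefronts can be resident on one
 * SIMD, given the number of 32-bit vector registers one work-item needs. */
int waves_per_simd(int vgprs_per_work_item)
{
    int waves;

    if (vgprs_per_work_item <= 0)
        return MAX_WAVES_PER_SIMD;      /* registers are not the limit */
    if (vgprs_per_work_item > VGPRS_PER_LANE)
        return 0;                       /* does not fit; would have to spill */

    waves = VGPRS_PER_LANE / vgprs_per_work_item;
    return waves > MAX_WAVES_PER_SIMD ? MAX_WAVES_PER_SIMD : waves;
}

/* Hypothetical helper: the same estimate as a percentage of the cap. */
double occupancy_percent(int vgprs_per_work_item)
{
    return 100.0 * waves_per_simd(vgprs_per_work_item) / MAX_WAVES_PER_SIMD;
}
```

For example, if the whole 1000 bytes or so of state were kept in 32-bit registers, that would be about 250 registers per work-item, and `waves_per_simd(250)` returns 1, i.e. 10% occupancy under these assumptions. The real register count is chosen by the compiler, so check it against what the analysis tools report before drawing conclusions.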