[QUOTE=Peccable;30489]Thanks a lot for your input. I’ve just tried the CodeXL profiler from AMD. Here is what it has to say about the kernel
I haven't profiled a GPU kernel before, so I'm reading a bit to see what to make of these numbers. But right away I notice the VALUBusy at 56.12% (fraction of time spent on vector instructions) and SALUBusy at 1.09% (fraction of time spent on scalar instructions), which are supposedly bad. But I guess SALUBusy being low is just a consequence of most instructions being vector type.[/QUOTE]
I'm tempted to say that 50% isn't terrible, all things considered, but in theory your code could be twice as fast (assuming you're aiming to be compute-bound?). I tried running a test version of your kernel with similar parameters here:
[code=c++]
#define REAL_TYPE double
#define NUM_SAMPLES 6144
#define NUM_CELL_VAR 216
__kernel void assembleMatrix(__global const REAL_TYPE *glo_A, __global const REAL_TYPE *glo_B, __global REAL_TYPE *glo_M)
{
    int row = get_global_id(0),
        col = get_global_id(1);
    if(row < NUM_CELL_VAR && col < NUM_CELL_VAR)
    {
        __global const REAL_TYPE *gA = &glo_A[row*NUM_SAMPLES];
        __global const REAL_TYPE *gB = &glo_B[col*NUM_SAMPLES];
        REAL_TYPE dM = 0.0;
        for(int s = 0; s < NUM_SAMPLES; ++s)
            dM += gA[s] * gB[s];
        // // mad version
        // for(int s = 0; s < NUM_SAMPLES; ++s)
        //     dM = mad(gA[s], gB[s], dM);
        glo_M[row * NUM_CELL_VAR + col] = dM;
    }
}
[/code]
For the non-[mad()](http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/mad.html) version I got the following (abbreviations [here](http://developer.amd.com/tools-and-sdks/archive/amd-app-profiler/user-guide/app-profiler-settings/), which I think you already found):
[code]
Method            assembleMatrix__k1_Tahiti1
ExecutionOrder    11
ThreadID          9907
CallIndex         61
GlobalWorkSize    {216 216 1}
WorkGroupSize     {16 16 1}
Time              45.23496
LocalMemSize      0
VGPRs             34
SGPRs             16
ScratchRegs       0
FCStacks          NA
Wavefronts        729
VALUInsts         12686
SALUInsts         205
VFetchInsts       6144
SFetchInsts       8
VWriteInsts       1
LDSInsts          0
GDSInsts          0
VALUUtilization   100
VALUBusy          1.6
SALUBusy          0.02
FetchSize         295928.5
WriteSize         401.06
CacheHit          54.29
MemUnitBusy       98.77
MemUnitStalled    0.05
WriteUnitStalled  0
LDSBankConflict   0
[/code]
Interestingly, the mad() version performs slightly worse, seemingly because of more cache misses? Honestly, I'm not really sure what's going on there. The only values that differed between the two profiles are as follows:
[code]
Time       48.70119
VALUInsts  6542
VALUBusy   1.12
CacheHit   47
[/code]
Anyway, that's all beside the point. The fact that MemUnitBusy is above 90% while VALUBusy is small (and on my device, extremely so) means that, as far as I can tell, your program is memory bound. We can tell the reads are not as efficient as they could be by looking at the FetchSize:
[quote="AMD"]
FetchSize: The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
[/quote]
For my parameters glo_A and glo_B each had a size of 216 * 6144 * (8 bytes), which comes out to about [b]21.23 MB[/b] total, but as you can see from my profile results the total fetch size is [b]295.93 MB[/b], or roughly 14x more. That refetching makes sense: each row of glo_A is read by all 216 work-items in its output row, and likewise each row of glo_B by all 216 in its output column. Luckily the cache does help us some, so it's not as bad as it could be, but I think it could be improved by coalescing the global reads and/or explicitly using local memory to store reused data. Obviously we won't be able to get it down to just 21.23 MB, but I think it could be cut down some, which should raise VALUBusy.
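To make the local-memory idea concrete, here's an untested sketch of a tiled variant (the kernel name, the TILE constant, and the load layout are all mine, not anything from your code; it assumes the same 16x16 work-groups and that NUM_SAMPLES stays divisible by TILE). Each work-group stages a 16-sample slice of the 16 A rows and 16 B rows it needs in local memory, so each value is fetched from global memory once per work-group instead of once per work-item:

```c
#define REAL_TYPE double
#define NUM_SAMPLES 6144
#define NUM_CELL_VAR 216
#define TILE 16   // must divide NUM_SAMPLES; matches the 16x16 work-group

__kernel void assembleMatrixTiled(__global const REAL_TYPE *glo_A,
                                  __global const REAL_TYPE *glo_B,
                                  __global REAL_TYPE *glo_M)
{
    const int lr = get_local_id(0), lc = get_local_id(1);
    const int row0 = get_group_id(0)*TILE, col0 = get_group_id(1)*TILE;
    const int row = row0 + lr, col = col0 + lc;

    // tileA[i][t] = glo_A[(row0+i)*NUM_SAMPLES + s0 + t], likewise tileB
    __local REAL_TYPE tileA[TILE][TILE];
    __local REAL_TYPE tileB[TILE][TILE];

    REAL_TYPE dM = 0.0;
    for(int s0 = 0; s0 < NUM_SAMPLES; s0 += TILE)
    {
        // lr indexes the sample dimension here, so consecutive work-items
        // read consecutive addresses; lc picks which row to stage
        tileA[lc][lr] = (row0 + lc < NUM_CELL_VAR)
                      ? glo_A[(row0 + lc)*NUM_SAMPLES + s0 + lr] : 0.0;
        tileB[lc][lr] = (col0 + lc < NUM_CELL_VAR)
                      ? glo_B[(col0 + lc)*NUM_SAMPLES + s0 + lr] : 0.0;
        barrier(CLK_LOCAL_MEM_FENCE);

        for(int t = 0; t < TILE; ++t)
            dM += tileA[lr][t] * tileB[lc][t];
        barrier(CLK_LOCAL_MEM_FENCE);  // don't overwrite tiles while others still read
    }

    if(row < NUM_CELL_VAR && col < NUM_CELL_VAR)
        glo_M[row*NUM_CELL_VAR + col] = dM;
}
```

Two details worth keeping if you try something like this: the loads index the sample dimension with get_local_id(0), so consecutive work-items read consecutive addresses (coalesced), and nobody returns early, since every work-item in the group has to reach the barriers. The two tiles use 2 * 16 * 16 * 8 = 4 KB of LDS per group, well within what the hardware offers.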