A little optimization help anyone?

Here is the kernel code:


__kernel void assembleMatrix(const int R, const int r0, const int c0, __global const REAL_TYPE *glo_A, __global const REAL_TYPE *glo_B, __global REAL_TYPE *glo_M)
{
    int row = get_global_id(0);
    int col = get_global_id(1);

    if(row < NUM_CELL_VAR && col < NUM_CELL_VAR)
    {
        __global const REAL_TYPE *gA = &glo_A[row*NUM_SAMPLES];
        __global const REAL_TYPE *gB = &glo_B[col*NUM_SAMPLES];

        REAL_TYPE dM = 0.0;

        for(int s = 0; s < NUM_SAMPLES; s++)
            dM += gA[s]*gB[s];

        glo_M[(r0 + row)*R + c0 + col] += dM;
    }
}

The identifiers in caps are defined in a separate source (generated before compilation) as


#define REAL_TYPE double
#define NUM_SAMPLES 4096  // may be defined as anything from 4096 to 65536
#define NUM_CELL_VAR 64    // may be defined as 64, 216 or 512

It seems to me that accessing both global memory arrays in the loop may cause a bottleneck, so I've been experimenting with pre-copying parts of the global data into local and private memory. However, the current code still runs the fastest, perhaps because sizeof(REAL_TYPE)*NUM_SAMPLES (8 bytes x 4096 samples = 32 KB per row, and up to 512 KB at the upper end) is too large to fit into local memory all at once on my device.

As you may see, this is part of a numerical integration and the use of double precision is necessary. Anyway, clever optimization tips would be greatly appreciated.

Peccable

To me it looks like this is essentially normal matrix multiplication, no? I hate to not actually give you any explicit help, but googling (or searching these forums) for “OpenCL local memory matrix multiplication” gives a number of results; unfortunately I'm not sure which are the best. You could also take a look at Nvidia's CUDA documentation, since your approach will be very similar to theirs.

Thanks, yes it's a sum of outer products, essentially equivalent to multiplication of two rectangular matrices. I've found a few examples (maybe my google-fu is no good), but they all seem to suggest copying a buffer into local or private memory, which in my case causes a significant slowdown.

I've also tried casting *gA and *gB into double4 and summing up dot(…, …) products, which gave a slight improvement.
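Roughly, the vectorized loop looked something like this (just an illustrative sketch using vload4, assuming NUM_SAMPLES is a multiple of 4; not the exact code):

[code=c++]
// Illustrative sketch of the vectorized inner loop (assumes NUM_SAMPLES is a
// multiple of 4); vload4 reads four consecutive doubles from global memory
// and dot() sums the four products in one call.
REAL_TYPE dM = 0.0;

for(int s = 0; s < NUM_SAMPLES; s += 4)
{
    double4 a = vload4(0, gA + s);
    double4 b = vload4(0, gB + s);
    dM += dot(a, b);
}
[/code]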

The only thing I've found that makes a significant difference is precomputing the starting memory addresses in the source data (the pointers &glo_A[row*NUM_SAMPLES] and &glo_B[col*NUM_SAMPLES]). That gave about a 15 % improvement.

Currently this kernel performs about 51 Gflops. Compared to the rated 152 Gflops of my device (for double precision), this seems a bit low, no?
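(The Gflops figure counts one multiply and one add per sample per matrix element; roughly as below, where elapsed_seconds is a hypothetical host-side timing of the kernel.)

[code=c++]
/* Flop accounting sketch: NUM_CELL_VAR^2 output elements, each doing
   NUM_SAMPLES multiply-adds (2 flops each); elapsed_seconds is hypothetical. */
double flops  = (double)NUM_CELL_VAR * NUM_CELL_VAR * NUM_SAMPLES * 2.0;
double gflops = flops / (elapsed_seconds * 1.0e9);
[/code]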

I haven’t had a chance to try it out yet, but you might try transposing gA and gB so that the reads from global memory are coalesced. That is to say, if NUM_SAMPLES was 8 and you had 8 threads, have the memory layout be something like:

[A00][A08][A16]…[A56]
[A01][A09][A17]…[A57]
…
[A07][A15][A23]…[A63]

As it is now, you read from A[0], A[NUM_SAMPLES], …, A[(get_global_size(0) - 1) * NUM_SAMPLES], etc., all at the same time, which I doubt is very efficient in terms of memory bandwidth. You could also rewrite the kernel to process each line of A/B together instead of transposing the data, but either approach would probably work. I forgot to mention: if you have a profiler for your device, make sure you use it, since I'm partly guessing at what the bottleneck is here.
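To make the transposed idea concrete, here is a rough sketch (untested; it assumes the host stores A and B sample-major, i.e. element (variable, sample) at index sample*NUM_CELL_VAR + variable, so consecutive work-items along dimension 0 read consecutive addresses):

[code=c++]
// Sketch only: assumes glo_A and glo_B are stored sample-major. With this
// layout, work-items with consecutive 'row' read consecutive addresses in
// every iteration, and all work-items in a row share the same B address.
__kernel void assembleMatrixT(const int R, const int r0, const int c0,
                              __global const REAL_TYPE *glo_A,
                              __global const REAL_TYPE *glo_B,
                              __global REAL_TYPE *glo_M)
{
    int row = get_global_id(0);
    int col = get_global_id(1);

    if(row < NUM_CELL_VAR && col < NUM_CELL_VAR)
    {
        REAL_TYPE dM = 0.0;

        for(int s = 0; s < NUM_SAMPLES; s++)
            dM += glo_A[s*NUM_CELL_VAR + row] * glo_B[s*NUM_CELL_VAR + col];

        glo_M[(r0 + row)*R + c0 + col] += dM;
    }
}
[/code]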

Thanks a lot for your input. I’ve just tried the CodeXL profiler from AMD. Here is what it has to say about the kernel


Method				assembleMatrix__k3_Pitcairn1
ExecutionOrder			543		
ThreadID			11548	
CallIndex			2949	
GlobalWorkSize			{    216     216       1}		
WorkGroupSize			{   16    16     1}		
Time				13.14148
LocalMemSize			0		
VGPRs				48
SGPRs				26
ScratchRegs			0	
FCStacks			NA	
KernelOccupancy			50		
Wavefronts			729	
VALUInsts			27676	
SALUInsts			1742	
VFetchInsts			6913	
SFetchInsts			8	
VWriteInsts			1	
LDSInsts			0	
VALUUtilization (%)		100			
VALUBusy (%)			56.12		
SALUBusy (%)			1.09		
FetchSize			440963.38	
CacheHit (%)			39.30		
MemUnitBusy (%)			91.60		
MemUnitStalled (%)		0.03			
WriteUnitStalled (%)		0				
LDSBankConflict (%)		0			
GDSInsts			0	
WriteSize			389.38

I haven't profiled a GPU kernel before, so I'm reading a bit to see what to make of these numbers. But right away I notice the VALUBusy at 56.12% (time spent on vector instructions) and SALUBusy at 1.09% (time spent on scalar instructions), which are supposedly bad. But I guess SALUBusy being low is just because most of the instructions are vector type.

[QUOTE=Peccable;30489]Thanks a lot for your input. I've just tried the CodeXL profiler from AMD. Here is what it has to say about the kernel

I haven't profiled a GPU kernel before, so I'm reading a bit to see what to make of these numbers. But right away I notice the VALUBusy at 56.12% (time spent on vector instructions) and SALUBusy at 1.09% (time spent on scalar instructions), which are supposedly bad. But I guess SALUBusy being low is just because most of the instructions are vector type.[/QUOTE]

I'm tempted to say that ~50% isn't terrible, all things considered, but theoretically your code could be almost twice as fast (if you were aiming to be compute-bound). I tried running a test version of your kernel with similar parameters here:

[code=c++]
#define REAL_TYPE double
#define NUM_SAMPLES 6144
#define NUM_CELL_VAR 216

__kernel void assembleMatrix(__global const REAL_TYPE *glo_A, __global const REAL_TYPE *glo_B, __global REAL_TYPE *glo_M)
{
    int row = get_global_id(0),
        col = get_global_id(1);

    if(row < NUM_CELL_VAR && col < NUM_CELL_VAR)
    {
        __global const REAL_TYPE *gA = &glo_A[row*NUM_SAMPLES];
        __global const REAL_TYPE *gB = &glo_B[col*NUM_SAMPLES];

        REAL_TYPE dM = 0.0;

        for(int s = 0; s < NUM_SAMPLES; ++s)
            dM += gA[s] * gB[s];

        // // mad version
        // for(int s = 0; s < NUM_SAMPLES; ++s)
        //     dM = mad(gA[s], gB[s], dM);

        glo_M[row * NUM_CELL_VAR + col] = dM;
    }
}
[/code]

For the non [mad()](http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/mad.html) version I got the following (abbreviations [here](http://developer.amd.com/tools-and-sdks/archive/amd-app-profiler/user-guide/app-profiler-settings/), as I think you already found):

Method assembleMatrix__k1_Tahiti1
ExecutionOrder 11
ThreadID 9907
CallIndex 61
GlobalWorkSize {216 216 1}
WorkGroupSize {16 16 1}
Time 45.23496
LocalMemSize 0
VGPRs 34
SGPRs 16
ScratchRegs 0
FCStacks NA
Wavefronts 729
VALUInsts 12686
SALUInsts 205
VFetchInsts 6144
SFetchInsts 8
VWriteInsts 1
LDSInsts 0
GDSInsts 0
VALUUtilization 100
VALUBusy 1.6
SALUBusy 0.02
FetchSize 295928.5
WriteSize 401.06
CacheHit 54.29
MemUnitBusy 98.77
MemUnitStalled 0.05
WriteUnitStalled 0
LDSBankConflict 0



Interestingly, the mad() version performs slightly worse, seemingly because of more cache misses? Honestly, I'm not really sure what's going on there. The only values that differed in the profile are as follows:

Time 48.70119
VALUInsts 6542
VALUBusy 1.12
CacheHit 47



Anyway, that's all beside the point. The fact that MemUnitBusy is 90%+ while VALUBusy is small (and on my device, extremely so) means that, as far as I can tell, your program is memory-bound. We can tell the reads are not as good as they could be by looking at the FetchSize:

[quote="AMD"]

FetchSize: The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.

[/quote]


For my parameters glo_A and glo_B each had a size of 216 * 6144 * (8 bytes), which comes out to about [b]21.23 MB[/b] total, but as you can see from my profile results the total fetch size is [b]295.93 MB[/b], or over 10x more. Luckily the cache does help us some, so it's not as big as it could be, but I think it could be improved by coalescing the global reads and/or explicitly using local memory to store reused data. Obviously we won't be able to get it down to just 21.23 MB, but I think it could be cut down some, which should raise VALUBusy.
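For what it's worth, an explicit local-memory (tiled) variant might look something like the sketch below. This is untested and assumes 16x16 work groups, NUM_SAMPLES a multiple of 16, and the sample-major (transposed) layout from my earlier post; each work group then fetches every needed element of A and B once per tile instead of sixteen times.

[code=c++]
#define TILE 16

__kernel void assembleMatrixTiled(const int R, const int r0, const int c0,
                                  __global const REAL_TYPE *glo_A,
                                  __global const REAL_TYPE *glo_B,
                                  __global REAL_TYPE *glo_M)
{
    int row = get_global_id(0);                 // = rowBase + lr
    int col = get_global_id(1);                 // = colBase + lc
    int lr  = get_local_id(0);
    int lc  = get_local_id(1);
    int rowBase = get_group_id(0)*TILE;
    int colBase = get_group_id(1)*TILE;

    __local REAL_TYPE tileA[TILE][TILE];        // tileA[sample offset][row offset]
    __local REAL_TYPE tileB[TILE][TILE];        // tileB[sample offset][col offset]

    REAL_TYPE dM = 0.0;

    for(int s0 = 0; s0 < NUM_SAMPLES; s0 += TILE)
    {
        // Each work-item loads one element of each tile. Varying lr gives
        // consecutive global addresses, so the loads should coalesce; min()
        // clamps the index so padded work-items never read out of bounds.
        tileA[lc][lr] = glo_A[(s0 + lc)*NUM_CELL_VAR + min(rowBase + lr, NUM_CELL_VAR - 1)];
        tileB[lc][lr] = glo_B[(s0 + lc)*NUM_CELL_VAR + min(colBase + lr, NUM_CELL_VAR - 1)];
        barrier(CLK_LOCAL_MEM_FENCE);

        for(int s = 0; s < TILE; s++)
            dM += tileA[s][lr] * tileB[s][lc];

        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if(row < NUM_CELL_VAR && col < NUM_CELL_VAR)
        glo_M[(r0 + row)*R + c0 + col] += dM;
}
[/code]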

Just an observation: the GlobalWorkSize is not an integer multiple of the WorkGroupSize (216 is not evenly divisible by 16). In OpenCL 1.x, if you specify the work-group size then the global size must be a multiple of it. You can round the global size up, pass the real size as a parameter, and have your kernel check that the global id is within the valid size before doing work, as in the sketch below.
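Something like this host-side sketch (hypothetical helper and variable names) is what I mean:

[code=c++]
#include <CL/cl.h>

/* Host-side sketch: round the global size up to a multiple of the 16x16
   work-group size; the kernel's bounds check then skips the padded items. */
static cl_int enqueueAssemble(cl_command_queue queue, cl_kernel kernel, size_t numCellVar)
{
    size_t local[2]  = { 16, 16 };
    size_t padded    = ((numCellVar + 15) / 16) * 16;   /* e.g. 216 -> 224 */
    size_t global[2] = { padded, padded };

    return clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local,
                                  0, NULL, NULL);
}
[/code]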

So I transposed the global memory buffers, and this gave some improvement. The VALUBusy is now typically 70-80% (SALUBusy is about 10%), and overall the kernel performs about 60 Gflops.

The fetch size still ranges from 100 MB to 200 MB. The fact that it varies this much from one run to another is a bit strange. Maybe it's because the GPU is also connected to a screen and renders content for other applications?

Hmm, you're right. It hasn't been a problem, but I've changed the work-group size to 8x8 to be sure.

BTW, I've been using dot(A, B) instead of A*B. Even when A and B are not vectors, dot() seems to be slightly faster than *. I guess this is highly system-dependent, though.