Help required understanding some results (of matrix mult.)

Hi,
I have recently started learning OpenCL. I started with a simple matrix multiplication example to see how much a GPU can reduce computation time and also to learn how to optimize data movement. I tried the following:
Matrices A, B and C are all 1024x1024.
GPU: C(i, j) per work-item, all global memory [1D Work Space] (1024*1024 work-items)
GPU: C(i, j) per work-item, all global memory [2D Work Space]
GPU: C row per work-item, all global memory [1D Work Space]
GPU: C row per work-item, A private, B in global memory [1D Work Space]
GPU: C row per work-item, A private, B in local memory [1D Work Space]

The results are as follows: 0.4308 s, 3.9784 s, 2.3082 s, 1.6315 s and 1.6561 s respectively.
I have already checked, and all versions give the correct results.
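For reference, the work spaces are set up with plain clEnqueueNDRangeKernel calls along these lines (just a sketch, not my exact host code; the queue and kernel names are placeholders, and the local work size is left to the runtime here):

    // 1D work space: one work-item per element of C
    size_t global1d[1] = {1024 * 1024};
    clEnqueueNDRangeKernel(queue, kernel_1d, 1, NULL, global1d, NULL, 0, NULL, NULL);

    // 2D work space: one work-item per C(i, j)
    size_t global2d[2] = {1024, 1024};
    clEnqueueNDRangeKernel(queue, kernel_2d, 2, NULL, global2d, NULL, 0, NULL, NULL);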
I am using OpenCL 1.2 with Catalyst 12.4 drivers on an AMD 3400M APU. The following is the kernel for the first case:
"__kernel
"
"void matrixmultiply(__global float *A,
"
" __global float *B,
"
" __global float C,int WidthA,int WidthB)
"
"{
"
"
"
" // Get the work-item’s unique ID
"
" int idx = get_global_id(0);
"
" float sum=0;
"
" int row;
"
" int column;
"
" row=idx/WidthB;
"
" column=idx%WidthB;
"
" // Add the corresponding locations of
"
" // ‘A’ and ‘B’, and store the result in ‘C’.
"
" for(int i=0;i<WidthA;i++)
"
" {
"
" sum+= A[row
WidthA+i]*B[i*WidthB+column];
"
" }
"
" C[idx]=sum;
"
"}
"
Kernel for the second case:
"__kernel
"
"void matrixmultiply(__global float *A,
"
" __global float *B,
"
" __global float C,int WidthA,int WidthB)
"
"{
"
"
"
" // Get the work-item’s unique ID
"
" float sum=0;
"
" int row = get_global_id(0);
"
" int column = get_global_id(1);
"
" // Add the corresponding locations of
"
" // ‘A’ and ‘B’, and store the result in ‘C’.
"
" for(int i=0;i<WidthA;i++)
"
" {
"
" sum+= A[row
WidthA+i]B[i*WidthB+column];
"
" }
"
" C[row
WidthB+column]=sum;
"
"}
"\

My question is: why is the first version the fastest when it accesses all data from global memory?

It'd be a lot easier to read if you de-c-stringed it.

It's down to the fact that 2D work-item IDs are assigned in dimension order starting from dimension 0, i.e. you're using different addressing in the two cases.

The first case:

  • reads A with a large stride - not very fast
    But:
  • reads B coalesced (each work-item reads adjacent values)
  • writes the result coalesced.

In the second case all reads and writes follow an almost worst-case access pattern: a largish power-of-two stride, which causes bank conflicts, and nothing is coalesced.
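To make that concrete, look at what adjacent work-items do on a single iteration of the inner loop (a sketch, assuming adjacent IDs in dimension 0 end up in the same hardware wavefront):

    // 1D kernel: idx = get_global_id(0); row = idx / WidthB; column = idx % WidthB;
    // Work-items idx and idx+1 normally share a row and get columns c and c+1:
    //   B[i * WidthB + column]    -> consecutive floats -> coalesced read
    //   C[idx]                    -> consecutive floats -> coalesced write

    // 2D kernel: row = get_global_id(0); column = get_global_id(1);
    // Adjacent work-items in dimension 0 get rows r and r+1 and the same column:
    //   A[row * WidthA + i]       -> addresses 1024 floats (4 KB) apart
    //   C[row * WidthB + column]  -> addresses 1024 floats (4 KB) apart
    // i.e. a large power-of-two stride, nothing coalesced.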

So the code is different. If you set row=get_global_id(1), col=get_global_id(0) instead, the performance should be the same.
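The only change needed in the 2D kernel is the pair of ID lines, something like:

    // Swap the dimensions: adjacent work-items in dimension 0 now handle
    // adjacent columns of C, matching the addressing of the 1D kernel
    int row    = get_global_id(1);
    int column = get_global_id(0);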

Thank you for the quick reply. I tried the changes you mentioned and the times are now nearly equal. So the primary reason here is how the data is read/written.