Hi,
I am writing some test code to compare various matrix-vector multiplication routines.
So far my code works, but the transposed multiplication is far slower than the normal one. I only changed the indexing, which might be the problem. How should I loop through the matrix instead to make the routine faster? The matrix is not stored in transposed form.
Matrix A (m rows, n columns) is stored in column-major order, so element (i, j) is at a[i + m*j].
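For checking the kernels below, a host-side reference can be handy. This is only a minimal sketch in plain C: the function names `gemv_ref` and `gemvt_ref` and the helper `idx` are made up for illustration, and `scalar_t` is assumed to be `float` (the post never defines it, but the kernels initialise with `0.0f`).

```c
#include <stddef.h>

typedef float scalar_t; /* assumption: scalar_t is not defined in the post */

/* Element (i, j) of an m-row column-major matrix lives at a[i + m*j]. */
static size_t idx(int i, int j, int m)
{
    return (size_t)i + (size_t)m * (size_t)j;
}

/* Reference for y = A * x: x has n entries, y has m entries. */
void gemv_ref(const scalar_t *a, const scalar_t *x, scalar_t *y, int m, int n)
{
    for (int i = 0; i < m; i++) {
        scalar_t sum = 0.0f;
        for (int j = 0; j < n; j++)
            sum += a[idx(i, j, m)] * x[j];
        y[i] = sum;
    }
}

/* Reference for y = A^T * x: x has m entries, y has n entries. */
void gemvt_ref(const scalar_t *a, const scalar_t *x, scalar_t *y, int m, int n)
{
    for (int j = 0; j < n; j++) {
        scalar_t sum = 0.0f;
        for (int i = 0; i < m; i++)
            sum += a[idx(i, j, m)] * x[i];
        y[j] = sum;
    }
}
```

Note that the transposed product swaps the vector lengths: the input has m entries and the output has n, so the transposed kernel would need n work-items rather than m.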
This is the normal routine:
__kernel void gemv1(__global const scalar_t * a, __global const scalar_t * x,
                    __global scalar_t * y, int m, int n)
{
    scalar_t sum = 0.0f;
    int i = get_global_id(0); // row index
    for (int j = 0; j < n; j++)
    {
        sum += a[i + m*j] * x[j];
    }
    y[i] = sum;
}
This is the slow transpose:
__kernel void gemvt1(__global const scalar_t * a, __global const scalar_t * x,
                     __global scalar_t * y, int m, int n)
{
    scalar_t sum = 0.0f;
    int i = get_global_id(0); // column index of A (row index of A^T)
    for (int j = 0; j < m; j++)
    {
        sum += a[j + m*i] * x[j];
    }
    y[i] = sum;
}
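A likely explanation for the gap is the memory access pattern. In the normal kernel, neighbouring work-items read neighbouring addresses (a[i + m*j] and a[i+1 + m*j]), which GPUs can usually coalesce. In the transposed kernel, neighbouring work-items read addresses m elements apart (a[j + m*i] and a[j + m*(i+1)]). One common way around this is to give each column to a whole work-group, let the work-items stride down the column together, and then reduce their partial sums. The sketch below emulates that scheme in plain C so the indexing can be checked on the host. It is not a tuned kernel, and `GROUP_SIZE`, `column_dot_group` and `gemvt_grouped` are hypothetical names; `scalar_t` is again assumed to be `float`.

```c
#include <stddef.h>

typedef float scalar_t;   /* assumption: scalar_t is not defined in the post */
#define GROUP_SIZE 64     /* hypothetical work-group size, must be a power of two */

/* Emulates one work-group computing the dot product of column col with x. */
static scalar_t column_dot_group(const scalar_t *a, const scalar_t *x,
                                 int m, int col)
{
    scalar_t partial[GROUP_SIZE]; /* would be __local memory in a kernel */

    /* Emulated work-item lid handles rows lid, lid + GROUP_SIZE, ...
       so at each step the group reads GROUP_SIZE adjacent elements. */
    for (int lid = 0; lid < GROUP_SIZE; lid++) {
        scalar_t sum = 0.0f;
        for (int j = lid; j < m; j += GROUP_SIZE)
            sum += a[(size_t)j + (size_t)m * (size_t)col] * x[j];
        partial[lid] = sum;
    }

    /* Tree reduction of the partial sums. In a real kernel each step
       would need a barrier(CLK_LOCAL_MEM_FENCE) before the next one. */
    for (int stride = GROUP_SIZE / 2; stride > 0; stride /= 2)
        for (int lid = 0; lid < stride; lid++)
            partial[lid] += partial[lid + stride];

    return partial[0];
}

/* y = A^T * x with one emulated work-group per column of A. */
void gemvt_grouped(const scalar_t *a, const scalar_t *x, scalar_t *y,
                   int m, int n)
{
    for (int col = 0; col < n; col++)
        y[col] = column_dot_group(a, x, m, col);
}
```

In an actual kernel, the loop over `lid` disappears (it becomes get_local_id(0)), `col` becomes get_group_id(0), and only the work-item with local id 0 writes y[col]. Whether this beats the simple version will depend on the device and on the sizes of m and n.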
I have more complex versions that block the matrices and vectors, but I would like to get the simple code running well first.
Thanks in advance!