Transposed matrix-vector product in OpenCL

Hi,

I am writing some test code to compare various matrix-vector multiplication routines.

So far my code works, but the transposed multiplication is far slower than the normal one. I only changed the indexing, which may be the problem. How else could I loop through the matrix to make the routine faster? The matrix is not stored in transposed form.

Matrix A is stored in column-major order.

This is the normal routine:

__kernel void gemv1(__global const scalar_t * a,__global const scalar_t * x,
		    __global scalar_t * y,int m,int n)
{
  scalar_t sum = 0.0f;
  int i = get_global_id(0); // row index
  for (int j=0;j<n;j++)
    {
      sum += a[i + m*j] * x[j];
    }
  y[i] = sum;
}

This is the slow transpose:

__kernel void gemvt1(__global const scalar_t * a,__global const scalar_t * x,
		    __global scalar_t * y,int m,int n)
{
  scalar_t sum = 0.0f;
  int i = get_global_id(0); // column index of A (row of the transpose)
  for (int j=0;j<m;j++)
    {
      sum += a[j + m*i] * x[j];
    }
  y[i] = sum;
}

I have more complex versions that block the matrices and vectors, but I would like to get the simple code running well first.

Thanks in advance!

The accesses to memory from the transposed version are not coalesced. That means that work-items with consecutive IDs are not reading from consecutive locations in memory.
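
To make that concrete, here is a small host-side C sketch (the helper names are hypothetical, made up for illustration) that computes which element of `a` each kernel reads, and how far apart the reads of two neighbouring work-items are at the same loop step:

```c
#include <stddef.h>

/* Flat index into column-major A (m rows) read by work-item i at loop step j. */
size_t gemv1_index(size_t i, size_t j, size_t m)  { return i + m * j; }
size_t gemvt1_index(size_t i, size_t j, size_t m) { return j + m * i; }

/* Distance, in elements, between the reads of work-items 0 and 1 at step 0.
   The distance is the same for any pair of neighbours and any step. */
size_t gemv1_stride(size_t m)  { return gemv1_index(1, 0, m)  - gemv1_index(0, 0, m); }
size_t gemvt1_stride(size_t m) { return gemvt1_index(1, 0, m) - gemvt1_index(0, 0, m); }
```

In gemv1 the stride is 1 element, so neighbouring work-items read neighbouring addresses. In gemvt1 the stride is m elements, so with m = 1024 and 4-byte floats each work-item reads 4096 bytes away from its neighbour.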

I recommend reading your hardware vendor’s programming guide, which discusses this topic.