Matrix Multiplication

Hi,

I want to test un matrix multiplication, because,i think, it’s a good way to compare the GPU perf.(OpenCL) and CPU perf(OpenMP, in my case).

So, first, I began with un simple vector addition :


__kernel void IntAdd(__global const int* a, __global const int* b, __global int* c)

{

    int iGID = get_global_id(0);

    c[iGID] = a[iGID] + b[iGID];

}

A a can deduce that :
c[0] = a[0] + b[0]
c[1] = a[1] + b[1]

Now, I’m trying the following kernel code :


#define N 32  // matrice carre


__kernel void MatMult(__global const int* a, __global const int* b, __global int* c)

{

    int row = get_global_id(0);
    int col = get_global_id(1);

    int Cres = 0;
    for(int i = 0;i< N; i++)

    {Cres += a[row*N + i ] * b[i*N + col];}
    
    c[row*N + col]= Cres;

}

I see that row={0,1,…15} and col is always 0. If I need col =1 , i should do col+1.
, and I have this result on my terminal :


##  A MATRIX  ##
    0    1    2    3
    4    5    6    7
    8    9   10   11
   12   13   14   15

##  B MATRIX  ##
    0    1    2    3
    4    5    6    7
    8    9   10   11
   12   13   14   15

##  C MATRIX  ##
    0    1    2    3
    4    5    6    7
    8    9   10   11
   12   13   14   15



I can put the full code (single .cpp file) if necessary. compilation : “g++ DemoMatMult.cpp -o MatMult -lOpenCL”.

My questions are :

  • Do “get_global_id” works like said.
  • Of course, why the code doesn’t work.

Thank you in advance.

The values returned by get_global_id() are determined by the arguments passed to clEnqueueNDRangeKernel().

In your case you want something like this:


size_t work_size[2] = {N, N};

errcode = clEnqueueNDRangeKernel(queue, kernel, 2 /*two-dimensional ndrange */,
        NULL, &work_size[0], NULL, 0, NULL, NULL); 

With that, get_global_id(0) will return values from 0 to N-1 and get_global_id(1) will do the same.

Thank you Mr Garcia.

I do like you said, but I think my “main” problem was something else.

Let me explain : matrix multiplication work with :


const char* program_source[] =

{
    "__kernel void MatMult(__global const int* a, __global const int* b, __global int* c)",

    "{",

        "int row = get_global_id(0);",
        "int col = get_global_id(1);",
        "int Cres = 3;",
        "for(int i = 0;i< 4; i++)",

        "{Cres += a[row*4 + i ] * b[i*4 + col];}",

        "c[row*4+col]= Cres;",


    "}",

};

but, it doesn’t work with a constant value (#define N 4 ):


...
#define N 4
...
const char* program_source[] =

{
    "__kernel void MatMult(__global const int* a, __global const int* b, __global int* c)",

    "{",

        "int row = get_global_id(0);",
        "int col = get_global_id(1);",
        "int Cres = 3;",
        "for(int i = 0;i< N; i++)",

        "{Cres += a[row*N + i ] * b[i*N + col];}",
        
        "c[row*N+col]= Cres;",



    "}",

};

I think " " " are not take account the value of N ?!
I don’t use the .cl file for now. Should I use it on this kind of situation.
Thanks.

You need to #define N inside the kernel source.


const char* program_source =
    "#define N 4
"
    "__kernel void MatMult(__global const int* a, __global const int* b, __global int* c)
"
    ...

So simple. I had not thought of that.
Thank you once again.

(This) Problem solved.
Best Regards!

I think I can continue here for asking my question. If inappropriate, please correct.

I did a comparison between my “optimized” matrix multiplication with OpenMP, and my “simple” mat. mult. with OpenCL.

#define N 2048 // for both -> N*N matrices
OpenMP :


...
  #pragma omp parallel for private(i,j,k), shared(a,b,c), schedule(dynamic)
  for(i=0; i<N; i++)
  {
    for(k=0; k<N; k++)
    {
      for(j=0; j<N; j++)
      {
	    c[i][j]+=a[i][k]*b[k][j];
      }
    }
  }
...

compiled with gcc -O3

, OpenCL


...
	const size_t global_work_size[2] = {N,N};
	const size_t local_work_size[2] = {16,16};

	result = clEnqueueNDRangeKernel (command_queue,
					kernel,
					2,
					NULL,
					&global_work_size[0],
					&local_work_size[0],
                    0, NULL, NULL);

OpenMP time = 1.125 s.
OpenCL time = 5.532 s.

i think the calculating is not so big, and it take time to transfer data to GPU, and we don’t need to use it on this case.
I don’t know yet use the shared memory. Maybe I should continue by learning how to use shared memory.

sorry, local memory for openCL

You could also try using the CPU device to see how that performs.

OpenMP code use CPU only:

$ time ./mult_matrix
c[2047][2047] = 1449828352
real 0m1.108s
user 0m8.550s
sys 0m0.030s

First of all the code written is in the wrong syntax .

int iGID = get_global_id(0);In the get _global_id(),there should be no parameters passed.The error is thrown as now the function takes the 0 as a parameter and thus the entire logic changes

moreover c[row*N + col]= Cres;

is not the proper way to write again,as the variable should always be present in the LHS.and the constants or the working formula should be in the RHS

In the get _global_id(),there should be no parameters passed

That is untrue. get_global_id() takes a parameter to indicate which dimension of the NDRange are you querying. Please refer to section 6.11.1 of the OpenCL specification.

c[row*N + col]= Cres;

-> I think it’s ok with this too. C program always use this kind of writing.

I hope i progress in my program.
I trying to improve the global memory code with the use of local memory like in Nvidia Best Practice guide , section “3.2.2.2 Shared Memory in Matrix Multiplication (C = AB)”

My problem is that they use (Mx16) A matrix, (16xN) B matrix, and I use square matrix, 8x8 or 16x16 for testing, from 1024x1024 to 4096x4096 maximum for real use. The problem isn’t the tile dim, but the matrix dimension :


__kernel void coalescedMultiply(__global float* a,
                                               __global float* b,
                                               __global float* c,
                                               int N,
                                               __local float aTile[TILE_DIM][TILE_DIM])
{
  int row = get_global_id(1)
  int col = get_global_id(0)

  float sum = 0.0f;

  int x = get_local_id(0);
  int y = get_local_id(1);
  aTile[y][x] = a[row*TILE_DIM+x];

  for (int i = 0; i < TILE_DIM i++) {
   sum += aTile[y][i]* b[i*N+col];
  }
  c[row*N+col] = sum;
}


In the Nvidia Practice Guide they have 16x16 TILE_DIM. It’s the column dimension for A (or row dimension for B ), so it’s ok when they do 16x16 tile dimension. But how to do for square matrix, or more precisely, if we have ,for example , 16x16 tile and 32x32 A,B matrix.

I have this code :


const char* program_source[] =
{
//    "#define N 4
",
    "#define TILE_DIM 64 
",
    "__kernel void MatMult(__global const int* a, __global const int* b, __global int* c, const unsigned int n, __local int aTile[TILE_DIM][TILE_DIM])",
    "{",
        "int row = get_global_id(1);",
        "int col = get_global_id(0);",

        "int Cres = 0;",
        "int x = get_local_id(0);",
        "int y = get_local_id(1);",
        "aTile[y][x] = a[row*n + x];",     // or a[row*TILE_DIM+x]??
        "for(int i = 0;i< n ; i++)",
        "{Cres += a[row*n + i ] * b[i*n + col];}",
        "{Cres += aTile[y][i] * b[i*n + col];}",
//        "Cres = a[row] + b[row];",
        
//        "c[row*N+col] = Cres;",
        "c[row*n+col]= Cres;",

    "}",

I hope that I’m clear.
Thanks.

Please put a barrier after the initialization of aTile. Otherwise the code won’t run properly. Something like this:


aTile[y][x] = a[row*TILE_DIM+x];

barrier(CLK_LOCAL_MEM_FENCE);

  for (int i = 0; i < TILE_DIM i++) {

So, I’m desperate. I’m trying to do working local memory since two days, but it’s work partially.

The problem is that i have right multiplication depending the dimension of matrices. I put some part of the code, if something important or for full code, please say it.


...
#define N 64          // matrix dimension
#define TILE_DIM 4

	cl_int a[N*N];
	cl_int b[N*N];
	cl_int c[N*N];
...
const char* program_source[] =
{
//    "#define N 4
",
    "#define TILE_DIM 4 
",
    "__kernel void MatMult(__global const int* a, __global const int* b, __global int* c, const unsigned int n)", // __local int aTile[TILE_DIM][TILE_DIM]
    "{",
        "int row = get_global_id(1);",
        "int col = get_global_id(0);",

        "int Cres = 0;",
//        "int block = col/TILE_DIM;",
        "int x = get_local_id(0);",
        "int y = get_local_id(1);",
        "__local int aTile[TILE_DIM][TILE_DIM];",
        "__local int bTile[TILE_DIM][TILE_DIM];",

        "aTile[y][x] = a[row*n + x];",
        "bTile[y][x] = b[y*n + col];",
        "barrier(CLK_LOCAL_MEM_FENCE);",
        "for(int i = 0;i< n ; i++)",
//        "{Cres += a[row*n + i ] * b[i*n + col];}",
//        "{Cres += aTile[y][i] * b[i*n + col];}",
        "{Cres += aTile[y][i] * bTile[y][x];}",
        
//        "c[row*N+col] = Cres;",
        "c[row*n+col]= Cres;",

    "}",
};
...
...
    int NN = N*N;

    for(int i=0;i<NN; i++)
    {
        a[i] = 2;
        b[i] = 2;
        c[i] = -1;
    }
...
	result = clEnqueueNDRangeKernel (command_queue,
					kernel,
					2,
					NULL,
					&global_work_size[0],
                    NULL,
                    0, NULL, NULL);
....


Now, if i work only on global memory, a can work with 4096x4096 matrices (#define N 4096)("{Cres += a[row*n + i ] * b[i*n + col];}"):
-> 4096x4096 --> c[16777215] =16384 //right
-> 8192 x 8192 --> c[67108863] = -1 //wrong but not important, 4096 is OK

And if i try with local memory ({Cres += aTile[y][i] * bTile[y][x];}):
with barrier and :
–> 16x16 matrices -> c[i][j]=64 //OK
–> 32x32 matrices -> c[i][0->23]=128 and c[i][24->31] = 0

If you can help me, i"ll be very grateful.
Thanks in advance.

When you call clEnqueueNDRangeKernel() you don’t pass a work-group size (you are passing NULL instead). That means that the OpenCL driver will pick a work-group size for you. You should choose explicitly a work-group size and make that equal to your tile size.

What i’ve done it’s :




	const size_t global_work_size[2] = {N,N};
	const size_t local_work_size[2] = {TILE_DIM,TILE_DIM};
.....
.....
	result = clEnqueueNDRangeKernel (command_queue,
					kernel,
					2,
					NULL,
					&global_work_size[0],
					&local_work_size[0],
                    0, NULL, NULL);

When I increase N, I should increase TILE_DIM , otherwise I have almost all c[i] wrong, so if I have :
#define N 256
#define TILE_DIM 16

, I obtain c[65535] = 1024, where it’s right everywhere (c[i]=1024)

For N equal to 512, TILE_DIM =16 -> wrong results, some random, some 1088, 1152,1184,…1280, 1344,…
For N equal to 512, TILE_DIM =32 -> wrong results, c[i]=2 everywhere.

–> Is it the limit for tile dimension / local memory?
clGetDeviceInfo(devices[d],CL_DEVICE_LOCAL_MEM_SIZE,sizeof(cl_ulong),&long_entries,NULL); --> Local Memory (KB): 16

–> I don’t think it works like this,maybe yes ? I mean increase tile dimension when we increase matrices dimensions.

Is it the limit for tile dimension / local memory?

If the problem was that the device doesn’t have enough local memory, it would return an error code when you call clEnqueueNDRangeKernel().

I don’t think it works like this,maybe yes ? I mean increase tile dimension when we increase matrices dimensions.

That should not be necessary. The tile size is mostly independent of the global size. As long as the tile size divides the global size evenly it should work.

Is it possible that you are passing the wrong value to the last kernel argument?

There must be something pretty trivial going wrong. Have you compared your code with the examples of matrix multiplication from Apple, NVidia or AMD?

If the problem was that the device doesn’t have enough local memory, it would return an error code when you call clEnqueueNDRangeKernel().

OK.

That should not be necessary. The tile size is mostly independent of the global size. As long as the tile size divides the global size evenly it should work.

That’s what I also read

Have you compared your code with the examples of matrix multiplication from Apple, NVidia or AMD?

Yes, I do. I trying to do like in Nvidia Best Practice Guide, section 3.2.2.2 where there is kernel only. And like I said, it’s not square matrix, it’s row/column matrix with tile size row/column.

There must be something pretty trivial going wrong.

I think so. But can’t find with my beginner knowledge. (please, cf. code at the end of the post.

Is it possible that you are passing the wrong value to the last kernel argument?

I don’t think so, but it’s possible. It works well with global memory. So I would not put the full code for doing by my self, but now I think it’s necessary :


#include <iostream>
#include <iomanip>  //for setw()
#include <stdio.h>

using namespace std;

#include "opencl.h"

#define N 8          // matrix dimension
#define TILE_DIM 4 

	cl_int a[N*N];
	cl_int b[N*N];
	cl_int c[N*N]; 
    unsigned int n = N;


const char* program_source[] =
{
//    "#define N 4
",
    "#define TILE_DIM 4 
",
    "__kernel void MatMult(__global const int* a, __global const int* b, __global int* c, const unsigned int n)", // __local int aTile[TILE_DIM][TILE_DIM]
    "{",
        "int row = get_global_id(1);",
        "int col = get_global_id(0);",

        "int Cres = 0;",
        "int x = get_local_id(0);",
        "int y = get_local_id(1);",
        "__local int aTile[TILE_DIM][TILE_DIM];",
        "__local int bTile[TILE_DIM][TILE_DIM];",

        "aTile[y][x] = a[row*n+ x];",
        "bTile[y][x] = b[y*n + col];",
        "barrier(CLK_LOCAL_MEM_FENCE);",
        "for(int i = 0;i< n ; i++)",
//        "{Cres += a[row*n + i ] * b[i*n + col];}",
//        "{Cres += aTile[y][i] * b[i*n + col];}",
//        "{Cres += a[row*n + i ] * bTile[i][x];}",
        "{Cres += aTile[y][i] * bTile[i][x];}",
        
        "c[row*n+col]= Cres;",

    "}",
};


int main(int argc, char **argv)
{
    cl_int errcode;

    //FOR PLATFORM AND DEVICES INFORMATIONS
    char dname[500];
    cl_uint entries;
    cl_ulong long_entries;
    int d;
    size_t p_size;


  

	const cl_uint num_entries = 10;
	cl_platform_id platforms[num_entries];
	cl_uint num_platforms;
	errcode = clGetPlatformIDs(num_entries, platforms, &num_platforms);
//    cout << "Error Code: " << errcode << endl;

	

	cl_device_id devices[num_entries];
	cl_uint num_devices;
	errcode = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, num_entries, devices, &num_devices);
//    cout << "Error Code: " << errcode << endl;

    // PLATFORM AND DEVICES INFORMATIONS
    /* obtain information about platform */
/*
       clGetPlatformInfo(platforms[0],CL_PLATFORM_NAME,500,dname,NULL);
       printf("CL_PLATFORM_NAME = %s
", dname);
       clGetPlatformInfo(platforms[0],CL_PLATFORM_VERSION,500,dname,NULL);
       printf("CL_PLATFORM_VERSION = %s
", dname);


     for (d = 0; d < num_devices; ++d) {
           clGetDeviceInfo(devices[d], CL_DEVICE_NAME, 500, dname,NULL);
           printf("Device #%d name = %s
", d, dname);
           clGetDeviceInfo(devices[d],CL_DRIVER_VERSION, 500, dname,NULL);
           printf("	Driver version = %s
", dname);
           clGetDeviceInfo(devices[d],CL_DEVICE_GLOBAL_MEM_SIZE,sizeof(cl_ulong),&long_entries,NULL);
           printf("	Global Memory (MB):	%llu
",long_entries/1024/1024);
           clGetDeviceInfo(devices[d],CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,sizeof(cl_ulong),&long_entries,NULL);
           printf("	Global Memory Cache (MB):	%llu
",long_entries/1024/1024);
           clGetDeviceInfo(devices[d],CL_DEVICE_LOCAL_MEM_SIZE,sizeof(cl_ulong),&long_entries,NULL);
           printf("	Local Memory (KB):	%llu
",long_entries/1024);
           clGetDeviceInfo(devices[d],CL_DEVICE_MAX_CLOCK_FREQUENCY,sizeof(cl_ulong),&long_entries,NULL);
           printf("	Max clock (MHz) :	%llu
",long_entries);
           clGetDeviceInfo(devices[d],CL_DEVICE_MAX_WORK_GROUP_SIZE,sizeof(size_t),&p_size,NULL);
           printf("	Max Work Group Size:	%d
",p_size);
           clGetDeviceInfo(devices[d],CL_DEVICE_MAX_COMPUTE_UNITS,sizeof(cl_uint),&entries,NULL);
           printf("	Number of parallel compute cores:	%d
",entries);
       }
//*/



	cl_context_properties properties[] =
	{ 
		CL_CONTEXT_PLATFORM,
		(cl_context_properties)platforms[0], 
		0 
	};

	
	cl_int error;
	cl_context context = clCreateContext(properties,
							num_devices, devices,
							NULL, NULL, &error);

	if(error != CL_SUCCESS)
	{
		cerr << "Erreur clCreateContext" << endl;
		return 1;
	}

	cl_command_queue command_queue = clCreateCommandQueue (context,
										devices[0],
										NULL, &error);
	if(error != CL_SUCCESS)
	{
		cerr << "Erreur clCreateCommandQueue" << endl;
		clReleaseContext(context);
		return 1;
	}

    int NN = N*N;

    for(int i=0;i<NN; i++)
    {
        a[i] = i%4;
        b[i] = i%4;
        c[i] = -1;
    }

	cl_mem buffera = clCreateBuffer (context,
		CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
		sizeof(cl_int)*NN,
		&a,
		&error);
	if(error != CL_SUCCESS)
	{
		cerr << "Erreur clCreateBuffer pour a" << endl;
		return 1;
	}

	cl_mem bufferb = clCreateBuffer (context,
		CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
		sizeof(cl_int)*NN,
		&b,
		&error);
	if(error != CL_SUCCESS)
	{
		cerr << "Erreur clCreateBuffer pour b" << endl;
		return 1;
	}

	cl_mem bufferc = clCreateBuffer (context,
		CL_MEM_READ_WRITE,
		sizeof(cl_int)*NN,
		NULL,
		&error);

	// Buffer pour la donnée aTile
//	cl_mem bufferaTile = clCreateBuffer (context,
//		CL_MEM_READ_WRITE,
//		sizeof(cl_int)*TILE_DIM*TILE_DIM,
//		NULL,
//		&error);

	cl_program program = clCreateProgramWithSource(
		context,
		sizeof(program_source)/sizeof(char*),
		program_source,
		NULL,
		&error);
	if(error != CL_SUCCESS)
	{
		cerr << "Erreur clCreateBuffer pour c" << endl;
		return 1;
	}

	
	cl_int result = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
	cl_kernel kernel = clCreateKernel(program, "MatMult", NULL);

	result = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffera);
	result = clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufferb);
	result = clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufferc);
	result = clSetKernelArg(kernel, 3, sizeof(unsigned int), &n);
//	result = clSetKernelArg(kernel, 4, sizeof(cl_int)*TILE_DIM*N, NULL);

	const size_t global_work_size[2] = {N,N};
	const size_t local_work_size[2] = {TILE_DIM,TILE_DIM};

	result = clEnqueueNDRangeKernel (command_queue,
					kernel,
					2,
					NULL,
					&global_work_size[0],
					&local_work_size[0],
//                    NULL,
                    0, NULL, NULL);
    clFinish(command_queue);

	result = clEnqueueReadBuffer(command_queue,
		bufferc,
		CL_TRUE,
		0,
		sizeof(cl_int)*NN,
		&c,
		0, NULL, NULL);
    clFinish(command_queue);
//*
    cout<< "##  A MATRIX  ##"<<endl;    
    for(int j=0;j<NN;j++)
    {
    	cout <<setw(5)<< a[j];
        if((j+1)%N==0){cout<<endl;}
    }
    cout<<endl;

    cout<< "##  B MATRIX  ##"<<endl;    
    for(int j=0;j<NN;j++)
    {
    	cout <<setw(5)<< b[j];
        if((j+1)%N==0){cout<<endl;}
    }
    cout<<endl;
//*/
    cout<< "##  C MATRIX  ##"<<endl;    
    for(int j=0;j<NN;j++)
    {
    	cout <<setw(5)<< c[j];
        if((j+1)%N==0){cout<<endl;}
    }
    cout<<endl;
//*/

        cout<<"c["<<NN-1<<"] ="<< setw(5)<< c[NN-1]<<endl;
//    cout<< "resultat = " << result <<endl;

	clReleaseKernel(kernel);
	clReleaseProgram(program);
	clReleaseMemObject(bufferc);
	clReleaseMemObject(bufferb);
	clReleaseMemObject(buffera);
	clReleaseCommandQueue(command_queue);
	clReleaseContext(context);

}



compiling with : g++ DemoMatMult.cpp -o DemoMatMult -lOpenCL

Thanks for helping me.

So, for now, I’m trying to test by changing index or do blocks , but I think my problem is that I’m not really understanding how local memory works and how to use it.


const char* program_source[] =
{
//    "#define N 4
",
    "#define TILE_DIM 2 
",
    "__kernel void MatMult(__global const int* a, __global const int* b, __global int* c, const unsigned int n)", // __local int aTile[TILE_DIM][TILE_DIM]
    "{",
        "int row = get_global_id(1);",
        "int col = get_global_id(0);",

        "int Cres = 0;",
//        "int block = col/TILE_DIM;",
        "int x = get_local_id(0);",
        "int y = get_local_id(1);",
        "__local int aTile[TILE_DIM][TILE_DIM];",
        "__local int bTile[TILE_DIM][TILE_DIM];",

		"int bloc = n/TILE_DIM;",
//		"int offset_row = row*bloc + col;",
//		"int offset_col = row*bloc + col;",
//		"for(int k=0; k<bloc; k++)",
//		"{",
		    "aTile[y][x] = a[row*n+ bloc+x];",
		    "bTile[y][x] = b[(y+ bloc)*n+ + col];",
		    "barrier(CLK_LOCAL_MEM_FENCE);",
		    "for(int i = 0;i< n ; i++)",
	//        "{Cres += a[row*n + i ] * b[i*n + col];}",
	//        "{Cres += aTile[y][i] * b[i*n + col];}",
	//        "{Cres += a[row*n + i ] * bTile[i][x];}",
		    "{Cres += aTile[y][i] * bTile[i][x];}",
//        "}",
//        "c[row*N+col] = Cres;",
        "c[row*n+col]= Cres;",

    "}",
};

I’ll working on! But if somebody can explain me, it’ll help me.
Thanks.