Starter: matrix mul

This is not a real problem; it is a simple starter question about the OpenCL memory model.
(I know the basics of running a simple kernel with OpenCL.)

I want to multiply two really big matrices:
30000×30000 × 30000×30000

ONE-THREAD CPU:
They don’t fit in physical RAM, but the multiplication can still be transparent to C++ thanks to a big swap file. The speed is very low, of course.

OPENCL:
What approach is used? Is it transparent, or must I slice the matrices myself?
I am thinking of this:

  • Move the first row-vector of matrix A to the GPU.
  • Move N column-vectors of matrix B to the GPU.
  • Compute the first N elements of the first row of matrix C.
  • Move the next N column-vectors of matrix B to the GPU.
  • Compute the next N elements of the first row.
  • Move the next row-vector of matrix A to the GPU, and repeat.

Are all of these steps needed, or am I missing something that happens transparently?
The problem with the above scheme: you don’t know on which device it will be executed, so you don’t know whether that device has enough available memory.
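The streaming scheme in the steps above can be sketched on the CPU first. This is a hypothetical sketch, not real OpenCL code: in an actual OpenCL version each “move” would become a clEnqueueWriteBuffer and each “compute” a kernel launch; here plain loops stand in for both.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of the streaming scheme: stream one row of A at a
// time and cols_per_chunk columns of B at a time, producing that many
// elements of the corresponding row of C per step.
void streamed_multiply(const std::vector<float>& A,
                       const std::vector<float>& B,
                       std::vector<float>& C,
                       std::size_t n,               // matrices are n x n
                       std::size_t cols_per_chunk)  // "N column-vectors of B"
{
    for (std::size_t i = 0; i < n; ++i) {                    // next row of A
        for (std::size_t j0 = 0; j0 < n; j0 += cols_per_chunk) {
            const std::size_t j1 = std::min(n, j0 + cols_per_chunk);
            for (std::size_t j = j0; j < j1; ++j) {          // next chunk of C's row
                float acc = 0.0f;
                for (std::size_t k = 0; k < n; ++k)
                    acc += A[i * n + k] * B[k * n + j];
                C[i * n + j] = acc;
            }
        }
    }
}
```

Note that this scheme re-reads every column chunk of B once per row of A, so the transfer cost is O(n) passes over B; that is exactly why the block-wise decompositions mentioned below are preferred.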

I am in a mess!

That’s (un)fortunately the good and the bad part about OpenCL. There is no built-in matrix multiplication in the OpenCL standard. Additionally, there are no restrictions on the amount of memory that you can use. It is up to the software and hardware developers to manage memory, performance, and so on.

In your hypothetical situation, you’d have to develop your own algorithm for the matrix multiplication. You’d want to slice the matrices, load the slices into local memory, and perform the multiplication there. How you partition them depends on the device(s) in your system.
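A minimal CPU sketch of that slicing idea (the tile size here is a made-up stand-in for what a work-group’s local memory could hold; in a real kernel each pair of sub-blocks of A and B would be staged in __local memory and shared by a work-group):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 16; // stand-in for the local-memory tile size

// Hypothetical blocked multiplication: partition A, B, and C into
// TILE x TILE sub-blocks and multiply block by block, accumulating
// partial products into C. C must be zero-initialized by the caller.
void tiled_multiply(const std::vector<float>& A,
                    const std::vector<float>& B,
                    std::vector<float>& C,
                    std::size_t n) // matrices are n x n
{
    for (std::size_t i0 = 0; i0 < n; i0 += TILE)
        for (std::size_t k0 = 0; k0 < n; k0 += TILE)
            for (std::size_t j0 = 0; j0 < n; j0 += TILE)
                // (i0,k0) block of A times (k0,j0) block of B,
                // accumulated into the (i0,j0) block of C
                for (std::size_t i = i0; i < std::min(n, i0 + TILE); ++i)
                    for (std::size_t k = k0; k < std::min(n, k0 + TILE); ++k)
                        for (std::size_t j = j0; j < std::min(n, j0 + TILE); ++j)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

The payoff of the blocked loop order is data reuse: each TILE×TILE sub-block is touched many times while it is “hot” (in local memory on a GPU, in cache on a CPU), instead of streaming whole rows and columns repeatedly.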

I’d check out NVIDIA’s and AMD’s OpenCL SDKs. I believe both of them include fairly involved matrix-multiplication examples. Matrix multiplication is kind of the “hello world” of OpenCL writers :)

Hmmmm…
The question is simpler than that.

For the matrix multiplication, I have two big matrices which do not fit in physical GPU memory (the matrices are 10 GB while the GPU memory is 1 GB).

Is this handled by the vendor’s implementation of OpenCL, or must I handle it in my code?

Thanks pal!

PS: They don’t fit in physical CPU RAM either, but the swap file helps there.
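For concreteness, the footprint at those sizes can be checked with a few lines (assuming 4-byte floats; doubles would double every figure):

```cpp
#include <cstdint>

// Back-of-the-envelope footprint for 30000 x 30000 matrices of floats.
constexpr std::uint64_t N    = 30000;
constexpr std::uint64_t ELEM = 4;                   // sizeof(float)
constexpr std::uint64_t ONE_MATRIX = N * N * ELEM;  // bytes in A (or B, or C)
constexpr std::uint64_t ALL_THREE  = 3 * ONE_MATRIX; // A + B + C together
// ONE_MATRIX is 3.6e9 bytes (~3.4 GiB); ALL_THREE is 1.08e10 bytes
// (~10 GiB) -- consistent with the "10 GB" figure and far beyond 1 GB.
```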

I have a small doubt here. If a swap file is used for 10 GB of data, won’t you lose the performance you gained from the GPU? That is, won’t saving and retrieving data to and from the swap file cost you a lot?

Hi,

try decomposing the big matrices into smaller sub-blocks. A good presentation of this technique is given in the CUDA C Programming Guide.
The NVIDIA SDK also has a matrix multiplication example in OpenCL.

I.

Yes, I saw this approach.
So the OpenCL implementation can handle arrays of any size?
There is no GPU hardware limit on the global array size?
The ‘global’ GPU memory block can also reside in system RAM, or even in the hard-disk swap file, whenever the OpenCL implementation decides that is the best place for data that doesn’t fit in GPU memory.
Am I correct?

The ‘global’ GPU memory block can also reside in system RAM, or even in the hard-disk swap file, whenever the OpenCL implementation decides that is the best place for data that doesn’t fit in GPU memory.

That is rather unlikely if your device is a GPU. What is going to happen is that when you attempt to allocate memory for a huge matrix, the allocation will fail with CL_OUT_OF_RESOURCES.
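So the slicing has to happen in host code. As a hypothetical sizing helper: in real code the per-buffer limit would come from clGetDeviceInfo with CL_DEVICE_MAX_MEM_ALLOC_SIZE, but the function below takes it as a plain parameter so the arithmetic can be shown on its own.

```cpp
#include <cstdint>

// Hypothetical helper: the largest number of full rows of an n_cols-column
// matrix that fits in one device buffer of at most max_alloc bytes
// (0 means not even one row fits and the rows themselves must be split).
std::uint64_t rows_per_buffer(std::uint64_t n_cols,
                              std::uint64_t elem_size,
                              std::uint64_t max_alloc)
{
    const std::uint64_t row_bytes = n_cols * elem_size;
    return row_bytes == 0 ? 0 : max_alloc / row_bytes;
}
```

For example, with 30000 float columns (120000 bytes per row) and a typical 256 MiB max allocation, a bit over two thousand rows of A fit in one buffer, so the host must loop over roughly fifteen such slices for the full matrix.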