How are matrix dimensions passed to the kernel?

Hello,

The following kernel is used to multiply a matrix by a vector. It is taken from the book “OpenCL in Action”:

__kernel void matvec_mult(__global float4* matrix,
                          __global float4* vector,
                          __global float* result) {
    int i = get_global_id(0);
    result[i] = dot(matrix[i], vector[0]);
}

How does the GPU know that ‘i’ returned from “get_global_id” means a row and not a column?
How does it know that each row has 4 elements?

The host calls:

/* Enqueue the command queue to the device */
work_units_per_kernel = 4; /* 4 work-units per kernel */
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &work_units_per_kernel,
                             NULL, 0, NULL, NULL);

Does it mean that 4 cores (4 work-units) are used to compute the result?

mat_buff = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          sizeof(float)*16, mat, &err);

This is how the host created the input matrix buffer. The total size of the matrix is 16 elements, but who tells the GPU the number of rows and columns?

Thanks,
Zvika

The GPU doesn’t “know” that “i” is a row and not a column; it is just a work-item index, and it is up to your kernel to interpret it as one or the other.

The matrix and vector are arrays of float4s. Each float4 (obviously) has 4 elements. The size of the array is determined by the host when it allocates the buffer (you didn’t include that code so I don’t know).

The host needs to tell the kernel how many elements to process. It can do this via the global work size, which is the “work_units_per_kernel”.
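For a 4x4 example like this one, the host side typically looks something like the following (a sketch with assumed names, not the book’s exact code). Note that the buffer sizes and the global work size are the only places the dimensions appear:

/* Sketch: mat is float[16] (4 rows x 4 columns), vec is float[4] on the host */
mat_buff = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          sizeof(float) * 16, mat, &err);
vec_buff = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          sizeof(float) * 4, vec, &err);
res_buff = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                          sizeof(float) * 4, NULL, &err);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &mat_buff);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &vec_buff);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &res_buff);

/* One work item per row: the global work size tells the runtime how many
   times to run the kernel body, i.e. how many rows to process. */
size_t work_units_per_kernel = 4;
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &work_units_per_kernel,
                             NULL, 0, NULL, NULL);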

I hope that helps.

Dear Dithermaster,

In my project, the input to the kernel is an N (rows) x M (columns) float matrix.
N and M will be passed as arguments to the kernel.
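This is roughly how I plan to pass them (a sketch; the argument indices assume the matrix, vector and result buffers occupy indices 0-2):

cl_int n_rows = N;   /* number of rows    */
cl_int n_cols = M;   /* number of columns */
err  = clSetKernelArg(kernel, 3, sizeof(cl_int), &n_rows);
err |= clSetKernelArg(kernel, 4, sizeof(cl_int), &n_cols);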

The input also contains a vector with N elements.
Now I have to compute the dot product of a specific column of the matrix with the vector.
In memory, rows are contiguous and columns obviously are not.

I can extract a column from the matrix with a loop.
Is there another, faster way?

Thanks,
Zvika

Sorry for the very basic question. clEnqueueNDRangeKernel will do the job.

But I still wonder whether the kernel can read data from a column as fast as it can from a row.

Zvika

Zvika:

If you pass your matrix as a buffer of floats (not float4s), then you can access any element easily. With rows contiguous in memory (row-major, row_size = M elements per row):

float element = buffer[row * row_size + column];

The example you looked at is probably specifically for 4x4 matrices, and there using a float4 makes good sense since it helps leverage vector architectures.
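Putting that together for your column case, a rough kernel sketch (the name, parameter layout, and one-column-per-work-item split are my assumptions, not code from the book):

__kernel void col_dot_vec(__global const float* matrix,  /* N x M, row-major */
                          __global const float* vector,  /* N elements */
                          __global float* result,        /* M elements */
                          int n_rows,
                          int n_cols) {
    int col = get_global_id(0);          /* one work item per column */
    float sum = 0.0f;
    for (int row = 0; row < n_rows; row++) {
        /* column elements are n_cols floats apart in memory (the stride) */
        sum += matrix[row * n_cols + col] * vector[row];
    }
    result[col] = sum;
}

With one work item per column, adjacent work items read adjacent matrix elements on every loop iteration, which already gives the coalesced access pattern described in point 1 below.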

In terms of “the fastest way”, there are two things to keep in mind:

  1. Coalesced memory access: it is very important that work items executing in parallel access nearby memory, preferably adjacent, for best memory bandwidth. For example, work item 0 accessing buffer[0] and work item 1 accessing buffer[1]. If you have it set up instead so that work item 0 accesses buffer[0] but work item 1 accesses buffer[row_size], you will not get good performance (since the GPU may read 128 bits at a time and will discard much of it).
  2. Shared local memory caching of data that will be used by many work items. Matrix multiply is often used as an example of this, since each data element is needed by every row or every column. A naive algorithm reads each element from global memory many times, but an algorithm that caches it in fast shared local memory can reduce that significantly. It makes the algorithm a bit more complicated, though. Study the numerous matrix-multiply examples to get a good understanding of how to leverage the limited (sometimes only 48 KB) shared local memory to best advantage; a minimal sketch follows this list.
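To illustrate point 2 for this matrix-times-vector case (again a sketch under assumptions; the kernel name, tiling, and padding handling are mine), the vector can be staged through local memory in tiles so that every work item in a work-group reuses it from fast local memory rather than re-reading it from global memory:

__kernel void col_dot_vec_lmem(__global const float* matrix,  /* N x M, row-major */
                               __global const float* vector,  /* N elements */
                               __global float* result,        /* M elements */
                               int n_rows,
                               int n_cols,
                               __local float* vec_tile)       /* one float per work item in the group */
{
    int col = get_global_id(0);
    int lid = get_local_id(0);
    int tile_size = get_local_size(0);
    float sum = 0.0f;

    for (int base = 0; base < n_rows; base += tile_size) {
        /* Stage one tile of the vector in local memory; the whole group
           then reuses it from there instead of re-reading global memory. */
        int r = base + lid;
        vec_tile[lid] = (r < n_rows) ? vector[r] : 0.0f;
        barrier(CLK_LOCAL_MEM_FENCE);

        if (col < n_cols) {
            int limit = min(tile_size, n_rows - base);
            for (int k = 0; k < limit; k++)
                sum += matrix[(base + k) * n_cols + col] * vec_tile[k];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (col < n_cols)
        result[col] = sum;
}

The __local scratch buffer is sized by the host, e.g. clSetKernelArg(kernel, 5, local_size * sizeof(float), NULL), where local_size is the work-group size used for the launch.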