Dynamically creating 2-dimensional local memory arrays

In OpenCL you can specify, from host code, the amount of local memory to allocate for a __local kernel argument with a call like

clSetKernelArg(myKernel, 3, localHeight * localWidth * sizeof(float), NULL);

where the kernel looks like

__kernel void matrixMul_gpu(
	__global float* A, __global float* B, __global float* C,
	__local float * As, __local float * Bs,
	unsigned int HeightA, unsigned int WidthB, unsigned int WidthAHeightB
	)

This works well if I want to create 1D arrays in local memory, but what if I want to make As and Bs 2D arrays where their sizes are dynamically specified from the host code?

You can either declare the local variable at kernel scope instead of as a kernel argument, or manually index into the array as if it were 2D. This is no different from how it works in C99, is it?

Those are two workarounds that I have used.

Declaring a local memory array within the scope of the kernel is one way to make a 2D array, but then the allocation size is static. I couldn’t write code that dynamically sizes the array based on how much local memory is available on the device’s compute units. That would be useful when writing code for both NVIDIA’s G80 and Fermi architectures: G80 only has 16 KB of local memory per compute unit, whereas Fermi has 48 KB. For matrix dot product operations it is ideal to make the block size as large as possible; on G80 the maximum would be 16x16, but on Fermi it would be 32x32. If I create the array within the scope of the kernel, I have to write two kernels, one for a 16x16 tile of local memory and another for a 32x32 tile, and that’s before even taking AMD cards or other architectures into account.
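For illustration, the kernel-scope version I mean looks roughly like this (BLOCK_SIZE here just stands for whatever compile-time constant gets baked in; the body is omitted):

#define BLOCK_SIZE 16	// fixed when the kernel is compiled

__kernel void matrixMul_gpu(
	__global float* A, __global float* B, __global float* C,
	unsigned int HeightA, unsigned int WidthB, unsigned int WidthAHeightB
	)
{
	// tile size is baked into the binary, so G80 and Fermi need separate builds
	__local float As[BLOCK_SIZE][BLOCK_SIZE];
	__local float Bs[BLOCK_SIZE][BLOCK_SIZE];
	/* ... tiled multiply using As and Bs ... */
}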

If I manually index into the array instead, the code is more dynamic: I can specify the local memory array sizes from outside the kernel, so I only have to write one kernel. Unfortunately I then have to index the array manually with something like

As[i * MatrixWidth + j]

instead of

As[i][j]

This would work, but it takes a few more instructions (which can add up over many iterations), it can get confusing, and it makes for much less elegant code.

I’m trying to get the best of both worlds: dynamic code that is also elegant. Is dynamically allocating 2D arrays from the kernel call impossible in OpenCL 1.0? Or is there some way to do it?

I couldn’t write code that dynamically sizes the array based on how much local memory is available on the device’s compute units.

You can change the size dynamically. All you have to do is change the variable declaration in the source code before you call clCreateProgramWithSource().
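For example, here is a rough host-side sketch (the includes, device, context, and kernelSource variables are assumed to already exist, and the size heuristic is just a placeholder): query the device limits, pick a block size, and prepend a #define before creating the program.

	cl_int err;
	cl_ulong localMemSize;
	size_t maxWorkGroup;
	clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
	                sizeof(localMemSize), &localMemSize, NULL);
	clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
	                sizeof(maxWorkGroup), &maxWorkGroup, NULL);

	/* placeholder heuristic: use 32x32 tiles only if two float tiles fit in
	   local memory and the device allows 1024 work-items per work-group */
	unsigned int blockSize =
	    (localMemSize >= 2 * 32 * 32 * sizeof(float) && maxWorkGroup >= 32 * 32) ? 32 : 16;

	char defines[64];
	sprintf(defines, "#define BLOCK_SIZE %u\n", blockSize);

	/* compile the #define line ahead of the unchanged kernel source */
	const char* sources[2] = { defines, kernelSource };
	cl_program program = clCreateProgramWithSource(context, 2, sources, NULL, &err);
	err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

Alternatively, passing something like "-D BLOCK_SIZE=16" as the options string to clBuildProgram injects the constant without touching the source string at all.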

This would work, but it takes a few more instructions

Why would it take more instructions? A[j][i] is identical to A[j*width+i]. As for elegance, readability, etc., you can use a macro if you want.
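For example (the macro names and loop variables here are only placeholders), a pair of macros at the top of the kernel keeps the 2D look while expanding to the same flat index:

#define AS(i, j) As[(i) * BLOCK_SIZE + (j)]
#define BS(i, j) Bs[(i) * BLOCK_SIZE + (j)]

// the inner loop of the tiled multiply then reads like the 2D version:
sum += AS(ty, k) * BS(k, tx);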


Is dynamically allocating 2D arrays from the kernel call impossible in OpenCL 1.0?

It’s not possible. See section 6.8: “Arguments to __kernel functions in a program cannot be declared as a pointer to a pointer(s).”
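In other words, a signature like this one (a made-up example) is rejected by the OpenCL C compiler:

// illegal in OpenCL 1.x: kernel arguments may not be pointer-to-pointer
__kernel void matrixMul_gpu(__local float** As, __local float** Bs)
{
}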