Memory allocation inside kernel

Hello

Is it possible to allocate memory inside a kernel using malloc() (so that later I can use free())?
I wrote a kernel, but OpenCLEditor (http://www.cmsoft.com.br/download/OpenCLTemplate.zip) displays some errors when I try to use malloc() / free(): http://imageshack.us/photo/my-images/802/opencl.png/

Is it possible to allocate memory inside a kernel using malloc() (so that later I can use free())?

malloc/free are not supported in OpenCL C.

So to have access to independent arrays (let's say of size 15) for each work item, I have to create an array of arrays (for example) of size ‘computeSizeBlock’ before calling the kernel and pass a pointer to it?

So if I want to do 10 computations (run a work item 10 times) and my ‘computeSizeBlock’ is 2, I have to create a two-dimensional array ([2][15]) and pass it to the kernel, which will be executed 5 times (computeSizeBlock = 2, passes = 5, so computations = 10).

Or maybe there is a better way to allocate an independent array for every work item so it can do its computations?

So to have access to independent arrays (let's say of size 15) for each work item, I have to create an array of arrays (for example) of size ‘computeSizeBlock’ before calling the kernel and pass a pointer to it?

You can’t pass pointers to pointers as kernel arguments either. If you want each work-item to have its own little space, you can either pass a pointer to global memory and internally assign some portion of it to each work-item, or you can do the same with local memory, or you can declare a private array inside the kernel like this:


__kernel void foo()
{
    // The following variable is in private memory by default
    float myArray[15];
}
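
For the global memory option, a minimal sketch could look like this (the kernel name, the element type and the per-item size of 15 are just placeholders):

__kernel void bar(__global float *scratch)
{
    // Each work-item uses its own 15-element slice of the buffer.
    // The host must allocate at least 15 * total-number-of-work-items floats.
    __global float *mine = scratch + get_global_id(0) * 15;
    for (int i = 0; i < 15; ++i)
        mine[i] = 0.0f;
}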

Allocating an array of size computeSizeBlock*sizeOfOneArray in global memory and passing a pointer to it seems like a good solution. I will try it and compute the offset for each work-item inside the kernel.

Can you point me to some article describing how stack memory is maintained in OpenCL kernels?
OK, let's say some of the arrays I need in the kernel are small (uchar[4]). Is it safe to use private memory even if computeSizeBlock is a big number?

Can you point me to some article describing how stack memory is maintained in OpenCL kernels?

It will depend a lot on the implementation. In fact, it’s possible that some implementations don’t even have a stack.

OK, let's say some of the arrays I need in the kernel are small (uchar[4]). Is it safe to use private memory even if computeSizeBlock is a big number?

What is computeSizeBlock? Is it the work-group size? I'm not sure what you mean by “safe” either. It should certainly work in any reasonable implementation.

If you use a lot of private memory you will face two problems. The first one is that performance will generally decrease as you increase the amount of private memory required to run a kernel. The second problem is that at some point you may hit an implementation limit on how much private memory you can use and either compilation will fail or clEnqueueNDRangeKernel() will fail.
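
If you want to see how much private memory the compiler actually ended up using for a given kernel, one option (assuming an OpenCL 1.1 implementation; kernel and device stand for your own handles) is to query CL_KERNEL_PRIVATE_MEM_SIZE. A rough sketch:

#include <stdio.h>
#include <CL/cl.h>

/* Print the private memory used per work-item by a compiled kernel. */
void printPrivateMemUsage(cl_kernel kernel, cl_device_id device)
{
    cl_ulong privateMem = 0;
    cl_int err = clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                                          sizeof(privateMem), &privateMem, NULL);
    if (err == CL_SUCCESS)
        printf("Private memory per work-item: %llu bytes\n",
               (unsigned long long)privateMem);
}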

Thanks for the reply. So I will just try different settings of work-group size / passes in case of any problems (smaller work-group size -> more passes I need to perform in order to do all the needed computations).

And…

What is computeSizeBlock? Is it the work-group size?

Yes, I was thinking about work-group size.

I have some problems with my kernel, or maybe the way I'm allocating and using memory is wrong.
I'm trying to allocate one big array of workSize*sizeOfArray bytes, and then in the kernel every work item computes an offset into it, so every work item has its own independent storage.

But I'm experiencing some problems and I would like you to explain to me what I'm doing wrong.

OK, so I need 3 bigger arrays for every work item. I'm allocating the space with clCreateBuffer(). Let's say I want to run clEnqueueNDRangeKernel() with the global_work_size parameter set to 512.

Every work item needs one of the arrays to be 256 bytes long, so in advance I need to allocate a 256 * workSize = 256 * 512 = 131,072-byte array. In the kernel I do some computations using only part of this array. To compute the offset I simply use get_local_id(0)*256.

I use these commands:
int workSize = 512;
int N = 256;

// One N-byte slice per work item: workSize * N bytes in total.
cl_mem SBuffer = clCreateBuffer(GPUContext, CL_MEM_READ_WRITE, sizeof(cl_uchar)*workSize*N, NULL, &errcode);
assert(errcode == CL_SUCCESS);
clSetKernelArg(OpenCLVectorAdd, 6, sizeof(cl_mem), (void*)&SBuffer);

After executing the kernel I read the array back with:


cl_uchar *s = new cl_uchar[N*SIZE];
clEnqueueReadBuffer(GPUCommandQueue, SBuffer, CL_TRUE, 0, SIZE*N*sizeof(cl_uchar), s, 0, NULL, NULL);

I expected the whole array to be filled with some values, but it seems only 65536 bytes were used. So now it's clear why my computations were wrong - the space I thought would be used by only one work item was probably used by many work items.
So, 65536 bytes used - that's half of what should have been used.

Is my implementation correct? I mean, is it valid to assume that I can compute the offset with get_local_id(0) and it will work?

Maybe I'm just using my GPU wrong (it's an Nvidia Quadro NVS 140)?
These are the values displayed by the Cloo framework:

LocalMemorySize = 16384
MaxComputeUnits = 2
MaxConstantArguments = 9
MaxConstantBufferSize = 65536
MaxMemoryAllocationSize = 134217728
MaxSamplers = 16
MaxWorkGroupSize = 512
MaxWorkItemDimensions = 3
MaxWorkItemSizes = 512 / 512 / 64

How should I understand these values?

I don’t see the necessary information to understand what is happening.

What are the arguments you pass to clEnqueueNDRangeKernel? When you call clCreateBuffer() you allocate “workSize*N” bytes, but when you read back the data, you pass “SIZE*N” to clEnqueueReadBuffer. Is SIZE the same as workSize?

Something that seems wrong with the code is that you create the buffer with a size of “workSize*N”, which implicitly assumes that there is only one work-group running. You need to allocate “workSize*N*numWorkGroups” instead, where numWorkGroups is the number of work-groups that you launch when you call clEnqueueNDRangeKernel.
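
As a rough host-side sketch of that sizing (the local size of 256 is only an assumption for illustration; the other names mirror your code):

size_t globalSize = 512;                       /* total work-items in the NDRange */
size_t localSize  = 256;                       /* work-items per work-group       */
size_t numWorkGroups = globalSize / localSize; /* 2 work-groups in this example   */
/* One N-byte slice per work-item, for every work-group that will run
   (this is the same as N * globalSize). */
size_t bytes = (size_t)N * localSize * numWorkGroups;
cl_mem SBuffer = clCreateBuffer(GPUContext, CL_MEM_READ_WRITE, bytes, NULL, &errcode);
assert(errcode == CL_SUCCESS);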

What are the arguments you pass to clEnqueueNDRangeKernel? When you call clCreateBuffer() you allocate “workSize*N” bytes, but when you read back the data, you pass “SIZE*N” to clEnqueueReadBuffer. Is SIZE the same as workSize?

Yes, sorry for the inconsistency. workSize is the same as SIZE. (I just wanted to correct it but I no longer have the “edit” button at the bottom of my post :/).

Something that seems wrong with the code is that you create the buffer with a size of “workSize*N”, which implicitly assumes that there is only one work-group running. You need to allocate “workSize*N*numWorkGroups” instead, where numWorkGroups is the number of work-groups that you launch when you call clEnqueueNDRangeKernel.

Good point! When I don't specify local_work_size, it's up to the driver/OpenCL implementation how to set up the work-groups. The OpenCL specification for clEnqueueNDRangeKernel says:

local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to break the global work-items into appropriate work-group instances.

source: clEnqueueNDRangeKernel

So it was my mistake that I didn't specify it, and that can explain the problem - OpenCL set up 2 work-groups of 256 work-items each (for a global workSize of 512), and that's why only half of my array(s) was used and values simply got overwritten by work items from different work-groups. To be sure I will check it using get_num_groups().
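
Something like this minimal check should do (the debug buffer is just an extra argument I would add for this test):

__kernel void checkGroups(__global uint *dbg)
{
    // Let a single work-item record how many work-groups the launch was split into.
    if (get_global_id(0) == 0)
        dbg[0] = get_num_groups(0);
}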

So to set it explicitly I need to set local_work_size to the same value as global_work_size, and then I will have only one work-group.

Right, that was the problem for sure:

With workSize set to 512 my driver/OpenCL implementation creates two work-groups - get_num_groups(0) returns “2” (because, to handle a global work size of 512, 2 work-groups have to be created --> my GPU supports a maximum of 256 work items per work-group). And my blind assumption of 1 work-group was totally wrong.

So it’s clear now.

Thanks for your support, David.

Last thing to clarify:

Something that seems wrong with the code is that you create the buffer with a size of “workSize*N”, which implicitly assumes that there is only one work-group running. You need to allocate “workSize*N*numWorkGroups” instead, where numWorkGroups is the number of work-groups that you launch when you call clEnqueueNDRangeKernel.

There are two approaches here, I think.
First:
“workSize*N*numWorkGroups” - workSize here should stand for the size of one work-group.

Second:
My “workSize” is supposed to be a global work size, which can be divided up by the driver/OpenCL implementation. Anyway, it should still be a good size. But if I want to compute the offset for every array in the kernel, I really should write it as “(get_local_id(0) + get_local_size(0)*get_group_id(0))*S” (S stands for the size of the particular array). Then I don't care how many work-groups were created by the driver/OpenCL.

My “workSize” is supposed to be a global work size, which can be divided up by the driver/OpenCL implementation. Anyway, it should still be a good size. But if I want to compute the offset for every array in the kernel, I really should write it as “(get_local_id(0) + get_local_size(0)*get_group_id(0))*S” (S stands for the size of the particular array). Then I don't care how many work-groups were created by the driver/OpenCL.

That’s not how it works. If you pass a work size when you call clEnqueueNDRangeKernel(), the driver will use that work size unchanged. The driver is not allowed to use a different work size.

Using “get_global_id(0)*S” to produce an offset is fine. And even if the driver was choosing the work size, this computation would still work correctly.

I think you didn’t understand my point.

Yes, the global work size will stay unchanged, but the work-group size can, for example, be global_work_size/2, and that can mess things up a little when you rely on get_local_id(0) to compute the offset, because there will be two (or more) work items for which get_local_id(0) returns the same value.

And I don't think get_global_id(0)*S is fine for computing the offset, because on every pass there are only global_work_size work items running and I allocate memory only for those work items. So on every execution of the kernel, after reading the results (with blocking set to CL_TRUE), it's safe to reuse the array for the next set of work items.

Just now I came up with an idea for the next thread, so I hope to see you there :wink:

And I don't think get_global_id(0)*S is fine for computing the offset, because on every pass there are only global_work_size work items running and I allocate memory only for those work items.

That's what I'm trying to explain. It's not true that at any given time only one work-group is being executed. More than one work-group may be running concurrently and there's no way for you to know how many of them there are [*]. That's why I'm recommending allocating enough memory for all the work-items that are executed in the NDRange.

[*] I guess that you could query CL_DEVICE_MAX_COMPUTE_UNITS and hope that the marketing guys did not inflate that number. I’m not sure I would trust that value TBH.

Actually, even if you knew how many work-groups may be running in parallel you still wouldn’t know which portion of the buffer you can use at any given time. I can think of workarounds using atomics, but frankly, it starts to look too complex for my taste.

Hi

I just started working on my code to apply the hints you gave me in the other thread, but here's how I compute the offset in my current kernel:

int offset = get_local_id(0)+(get_local_size(0)*get_group_id(0));

int offset = get_local_id(0)+(get_local_size(0)*get_group_id(0));

Isn’t that identical to get_global_id(0)?

I don’t think so.

get_global_id(0) will take into account the offset I provide when running the kernel.
The line I showed will not.

Allocated space: array[global_work_size=20].

So if I want to run 40 work items (that's the size of the problem I need to solve) I can divide it into two runs, each run with global_work_size=20 and an appropriate offset. Since it's divided into two runs I need to allocate memory in the arrays only for one run (for the second run the arrays will be reused). In every run the expression get_local_id(0)+(get_local_size(0)*get_group_id(0)) will return values from 0 to 19 (the part in brackets is the “offset for the offset”, handling the work-groups within a run).

If I used get_global_id(0), it would return values in the range 0 to 19 in the first run and 0+offset to 19+offset in the second run. The offset would then be set to 20, so in the second run get_global_id(0) would return values in the range 20 to 39. And I haven't allocated arrays that big.
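
To make the scheme concrete, here is a rough host-side sketch of the two runs (the handles and sizes are just the ones from my earlier examples, and a non-NULL global_work_offset needs an OpenCL 1.1 implementation):

size_t globalSize = 20;                  /* work-items per run; scratch buffer sized for 20 */
size_t offsetRun1 = 0, offsetRun2 = 20;  /* second run covers global ids 20..39             */

clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, &offsetRun1, &globalSize, NULL, 0, NULL, NULL);
/* ...blocking clEnqueueReadBuffer() here, then the scratch buffer can be reused... */
clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, &offsetRun2, &globalSize, NULL, 0, NULL, NULL);

Inside the kernel, get_local_id(0)+(get_local_size(0)*get_group_id(0)) stays in the range 0 to 19 for both runs, while get_global_id(0) would return 20 to 39 in the second run and overrun the 20-element allocation.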