I have some problems with my kernel, or maybe the way I’m using and allocating memory is wrong.
I’m trying to allocate one “big” array of workSize*sizeOfArray bytes and then compute an offset in the kernel for every work item, so that every work item has its own independent storage space.
But I’m experiencing some problems and I would like someone to explain what I’m doing wrong.
OK, so I need 3 larger arrays for every work item. I allocate the space with clCreateBuffer(). Let’s say I want to run clEnqueueNDRangeKernel() with the global_work_size parameter set to 512.
Every work item needs one of the arrays to be 256 bytes long, so in advance I need to allocate an array of 256 * workSize = 256 * 512 = 131 072 bytes. In the kernel, each work item does its computations using only its own part of this array. To compute the offset I simply use: get_local_id(0)*256.
I use these commands:
int workSize=512;
int N=256;
cl_mem SBuffer = clCreateBuffer(GPUContext, CL_MEM_READ_WRITE, sizeof(uchar)*workSize*N, NULL, &errcode);
assert(errcode==CL_SUCCESS);
clSetKernelArg(OpenCLVectorAdd, 6, sizeof(cl_mem), (void*)&SBuffer);
After executing the kernel I read the array back with:
uchar *s = new uchar[N*SIZE];
clEnqueueReadBuffer(GPUCommandQueue, SBuffer, CL_TRUE,0,SIZE*N*sizeof(uchar),s,0, NULL, NULL);
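One way to measure how much of the read-back array was actually written is to fill the buffer with a known sentinel value before launching the kernel and count the bytes that changed afterwards. The helper below is a minimal sketch of that idea. countTouchedBytes is a hypothetical name, not part of OpenCL, and the sketch assumes the buffer was pre-filled with the sentinel (a buffer created with a NULL host pointer has undefined initial contents).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not an OpenCL call): counts the bytes that differ from
// the sentinel value the buffer is assumed to have been filled with before
// the kernel ran.
std::size_t countTouchedBytes(const std::vector<unsigned char>& data,
                              unsigned char sentinel) {
    std::size_t touched = 0;
    for (unsigned char b : data) {
        if (b != sentinel) {
            ++touched;
        }
    }
    return touched;
}
```

A byte that the kernel overwrites with the sentinel value itself is not counted, so the result is a lower bound.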
I expected the whole array to be filled with values, but it seems only 65536 bytes were used. So now it’s clear why my computations were wrong - the space I thought would be used by only one work item was probably used by many work items.
So, 65536 bytes used - that is half of what should have been used.
Is my implementation correct? I mean, can I assume that an offset computed from get_local_id(0) will work?
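The offset arithmetic can be modeled on the host. get_local_id(0) returns the index of a work item inside its work-group, while get_global_id(0) is unique across the whole NDRange. The sketch below assumes a 1D range with a zero global offset and a work-group size of 256. That local size is an assumption, since the post does not state what is passed as local_work_size, and coveredBytes is a hypothetical helper name.

```cpp
#include <cstddef>
#include <set>

// Hypothetical host-side model: how many bytes of the buffer receive a
// distinct per-item region when the base offset is id * bytesPerItem.
// Assumes a 1D NDRange and a zero global work offset.
std::size_t coveredBytes(std::size_t globalSize, std::size_t localSize,
                         std::size_t bytesPerItem, bool useGlobalId) {
    std::set<std::size_t> bases;
    for (std::size_t gid = 0; gid < globalSize; ++gid) {
        // Under the assumptions above, this is what get_local_id(0) returns.
        std::size_t lid = gid % localSize;
        bases.insert((useGlobalId ? gid : lid) * bytesPerItem);
    }
    return bases.size() * bytesPerItem;
}
```

Under these assumptions coveredBytes(512, 256, 256, false) gives 65536, because the local ids of the two work-groups map onto the same regions, while coveredBytes(512, 256, 256, true) gives 131072. If the work-group size really is 256, this would be consistent with only half of the buffer being used.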
Maybe I’m just using my GPU wrong (it’s an Nvidia Quadro NVS140)?
These are the values displayed by the Cloo framework:
LocalMemorySize = 16384
MaxComputeUnits = 2
MaxConstantArguments = 9
MaxConstantBufferSize = 65536
MaxMemoryAllocationSize = 134217728
MaxSamplers = 16
MaxWorkGroupSize = 512
MaxWorkItemDimensions = 3
MaxWorkItemSizes = 512 / 512 / 64
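As a rough guide to how MaxWorkGroupSize and MaxWorkItemSizes constrain a launch: the local size in each dimension must not exceed the matching MaxWorkItemSizes entry, the product of the local sizes must not exceed MaxWorkGroupSize, and in OpenCL 1.x the global size must be divisible by the local size in every dimension. The check below is a minimal sketch of those rules. isValidLocalSize is a hypothetical helper, and the limit for a specific kernel (CL_KERNEL_WORK_GROUP_SIZE) can be lower than the device limit, which this sketch does not model.

```cpp
#include <cstddef>

// Hypothetical helper: checks a 3D local work size against the device limits
// listed above (MaxWorkItemSizes per dimension, MaxWorkGroupSize for the
// product) and the OpenCL 1.x rule that global must be a multiple of local.
bool isValidLocalSize(const std::size_t local[3], const std::size_t global[3],
                      const std::size_t maxItemSizes[3],
                      std::size_t maxWorkGroupSize) {
    std::size_t product = 1;
    for (int d = 0; d < 3; ++d) {
        if (local[d] == 0 || local[d] > maxItemSizes[d]) {
            return false;  // exceeds the per-dimension limit
        }
        if (global[d] % local[d] != 0) {
            return false;  // global size not evenly divided by local size
        }
        product *= local[d];
    }
    return product <= maxWorkGroupSize;  // total items per work-group
}
```

With the limits listed above (512 / 512 / 64 and 512), a local size of 256 x 1 x 1 for a global size of 512 x 1 x 1 passes, while 512 x 2 x 1 fails because the work-group would hold 1024 items.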
How should I understand these values?