using local memory

I’ve got a simple test kernel that writes into local memory, and then copies the data to an output buffer in global memory:


__kernel void foo( __global float *debug_data, __local float *shared_segment)
{
   // works
   int tid = get_local_id(1);
   *(shared_segment + tid) = tid;
   *(debug_data + tid) = *(shared_segment + tid);
}

This runs on a two-dimensional grid of 16K by 256, with a local work group size of 1x256. For the shared memory size, I’ve got 16K/256 = 64 bytes per thread in a work group. This behaves the way I think it should: I get back 256 floats in debug_data, with values starting at 0 and going to 255.
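For reference, this is roughly how I’m launching the kernel on the host side (paraphrased from my real code; queue and kernel are just placeholder names):


// 2D global range of 16K x 256, with 1 x 256 work groups
size_t global_work_size[2] = { 16 * 1024, 256 };
size_t local_work_size[2]  = { 1, 256 };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                       global_work_size, local_work_size,
                       0, NULL, NULL);
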

But now if I add a line to my test kernel:


__kernel void foo( __global float *debug_data, __local float *shared_segment)
{
   // doesn't work
   int tid = get_local_id(1);
   *(shared_segment + tid) = tid;
   *(shared_segment + tid + 256) = tid;       // new line
   *(debug_data + tid) = *(shared_segment + tid);
}

I get back 24 floats of valid values in debug_data, followed by zeroes. I’m completely stumped – as far as I’m concerned, shared_segment should have room for 4096 4-byte values, and the index of my last shared_segment write is 511. I’ve double-checked my kernel arguments, and I think they’re correct: a size_t set to ((16*1024)/256) for the size parameter, followed by a NULL.
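Concretely, the call for the local argument looks roughly like this (again paraphrased from my host code):


// argument 1: local memory is passed as a size plus a NULL pointer
size_t shared_size = (16 * 1024) / 256;   // = 64 bytes
clSetKernelArg(kernel, 1, shared_size, NULL);
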

Can anyone point out to me what I’m misunderstanding about allocating and using local memory?

What did you specify as the size of the local memory when calling clSetKernelArg for argument 1, which is your shared_segment variable? This is the size that will be allocated from local memory for shared_segment.

To be able to write at address shared_segment + 511, you must have allocated at least sizeof(cl_float) * 512 bytes.

So (16*1024)/256 (a calculation I don’t understand) is not enough…
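Something like this should be enough (a sketch, assuming the same kernel handle and argument index as above):


// 512 floats of local memory for shared_segment (2 KB)
clSetKernelArg(kernel, 1, 512 * sizeof(cl_float), NULL);
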

Yes, I see how I goofed …

I was setting the kernel argument to ((16*1024)/256) because there are 256 threads per work group, and I thought the argument should be the amount of local memory per thread. If I change it to the size that I want for the entire work group – on my board, 16K less some small amount of memory (<32B) for system use – then it works fine.

BTW, at one point I did try allocating the entire 16K; but when it failed with an out of resources error, I didn’t realize it was because of the system byte allotment, and figured (incorrectly) that it needed to be a per-thread size. D’oh!

In the Nvidia bitonic sample, I see that local memory isn’t being passed through kernel arguments at all, but rather allocated within the kernel itself. When would you allocate local memory via a kernel argument, and when would you allocate it inside the kernel, as in the bitonic sample? Is there any functional difference between the two methods?
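For concreteness, the two styles I have in mind look roughly like this (my own paraphrase, not the actual bitonic code):


// style 1: local memory sized by the host via clSetKernelArg
__kernel void foo_arg( __global float *debug_data, __local float *shared_segment )
{
   int tid = get_local_id(1);
   shared_segment[tid] = tid;
   debug_data[tid] = shared_segment[tid];
}

// style 2: local memory declared inside the kernel, size fixed at compile time
__kernel void foo_static( __global float *debug_data )
{
   __local float shared_segment[256];
   int tid = get_local_id(1);
   shared_segment[tid] = tid;
   debug_data[tid] = shared_segment[tid];
}
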

Thanks to both of you for your responses. I appreciate your willingness to help out someone on the steep end of the GPU learning curve!

Using defines to set up the array size requires kernel recompilation whenever you want to change that size.
Perhaps it permits better optimization…
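For example, roughly (a sketch; LOCAL_SIZE, program, and device are just placeholder names):


// host: the size is baked in when the program is built,
// so changing it means rebuilding
clBuildProgram(program, 1, &device, "-D LOCAL_SIZE=256", NULL, NULL);

// kernel: compile-time sized local array
__local float shared_segment[LOCAL_SIZE];
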