Kernel uses too much local data

Hi,
I have converted a recursive algorithm into an iterative form using the stack implementation found in this forum thread:

Stack Implementation

Now I am trying to use local memory to store the node evaluations. My GPU supports 16384 bytes of local memory, so I set my local work size to 16x16x4, which gives a local memory requirement of 1024 floats (4096 bytes).

The program runs with one __private stack of 820 nodes (each node holds one char and two ints). If I declare two __private stacks of that size and call the push function on both, the kernel build fails with this error:

Build Log:
ptxas error : Entry function ‘clMyKernel’ uses too much local data (0x4cf0 bytes, 0x4000 max)

The only __local buffer I declare is:


//	Support for 16x16x4 floats
#define MAX_LOCAL_MEM	1024

	__local float local_fLocalEvaluation[MAX_LOCAL_MEM];

Here is the stack implementation:


//    We need to store at most 4 depths of recursion with 9 children per node
#define STACK_SIZE				(1 + 9 + 81 + 729)


//	Code reference : taken on khronos forum
//	http://www.khronos.org/message_boards/viewtopic.php?f=28&t=3942&p=11331&hilit=binary+tree#p11331

typedef struct _node
{
   char s8Depth;
   int iYPos;
   int iZPos;
}node;

typedef struct stack
{
   node n[STACK_SIZE];
   int top;
}stack_class;

void init_stack(stack_class *s)
{
   s->top = -1;
}

void push( stack_class *s, node _n )
{
   s->top = s->top + 1;
   s->n[s->top] = _n;
}

node pop( stack_class *s )
{
   node nn = s->n[ s->top ];
   s->top = s->top - 1;
   return nn;
}

I declare the stack as __private, and the nodes are __private by default.


	__private stack_class stack;
	init_stack( &stack );

	node nodeRoot;
	nodeRoot.s8Depth = 0;
	nodeRoot.iYPos = y;
	nodeRoot.iZPos = z;

	push( &stack, nodeRoot );

Why do the stack functions generate local memory usage?

Thanks

This kernel code compiles:


	__private stack_class stack;
	init_stack( &stack );

	//__private stack_class stackEval;
	//init_stack( &stackEval );

	node nodeRoot;
	nodeRoot.s8Depth = 0;
	nodeRoot.iYPos = 0;
	nodeRoot.iZPos = 0;

	for( int i = 0; i < 5000; ++i )
	{
		push( &stack, nodeRoot );
//		push( &stackEval, nodeRoot );
	}


But uncommenting the second stack usage produces the compilation error:


	__private stack_class stack;
	init_stack( &stack );

	__private stack_class stackEval;
	init_stack( &stackEval );

	node nodeRoot;
	nodeRoot.s8Depth = 0;
	nodeRoot.iYPos = 0;
	nodeRoot.iZPos = 0;

	for( int i = 0; i < 5000; ++i )
	{
		push( &stack, nodeRoot );
		push( &stackEval, nodeRoot );
	}
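
Note that both loops above push 5000 nodes onto a stack with room for only 820, so they are only useful as a compile test; at run time push would write past the end of the array. A bounds-checked variant could look like the sketch below. It is plain C rather than the original code: the names try_push and try_pop are hypothetical, and returning 0 on failure is just one possible convention.

```c
#include <assert.h>

#define STACK_SIZE (1 + 9 + 81 + 729)   /* 820 entries, as in the post */

typedef struct _node { char s8Depth; int iYPos; int iZPos; } node;
typedef struct stack { node n[STACK_SIZE]; int top; } stack_class;

void init_stack(stack_class *s)
{
    s->top = -1;
}

/* Hypothetical variant of push: returns 1 on success, 0 if the stack is full. */
int try_push(stack_class *s, node _n)
{
    if (s->top + 1 >= STACK_SIZE)
        return 0;                 /* no room left, nothing is written */
    s->top = s->top + 1;
    s->n[s->top] = _n;
    return 1;
}

/* Hypothetical variant of pop: returns 1 and fills *out, or 0 if the stack is empty. */
int try_pop(stack_class *s, node *out)
{
    if (s->top < 0)
        return 0;                 /* nothing to pop, *out is left untouched */
    *out = s->n[s->top];
    s->top = s->top - 1;
    return 1;
}
```

With this version the 5000-iteration loop stores the first 820 nodes and rejects the rest instead of corrupting memory.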


Thinking about it, it must be the inlined, compiled kernel code that uses the local memory.
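
A quick size check seems to support that. As far as I know, the "local data" in a ptxas message is NVIDIA's per-thread local memory, which is where __private arrays that do not fit in registers end up; it is not the OpenCL __local address space. The sketch below is plain host C that reuses the struct definitions from the post. It assumes 4-byte ints with natural alignment, which the device compiler may or may not match exactly, and the helper names stacks_bytes and stacks_fit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_SIZE (1 + 9 + 81 + 729)   /* 820 entries, as in the post */

typedef struct _node {
    char s8Depth;
    int  iYPos;
    int  iZPos;
} node;

typedef struct stack {
    node n[STACK_SIZE];
    int  top;
} stack_class;

/* Limit quoted in the ptxas message: 0x4000 = 16384 bytes. */
#define LOCAL_DATA_MAX ((size_t)0x4000)

/* Bytes needed to hold `count` stacks in per-work-item memory. */
size_t stacks_bytes(size_t count)
{
    assert(sizeof(int) == 4);     /* assumption stated in the lead-in */
    return count * sizeof(stack_class);
}

/* Returns 1 if `count` stacks fit under the quoted limit, 0 otherwise. */
int stacks_fit(size_t count)
{
    return stacks_bytes(count) <= LOCAL_DATA_MAX;
}
```

Under those assumptions a node takes 12 bytes (the char is padded to 4), one stack takes 9844 bytes, and two stacks take 19688 bytes (0x4ce8). That is over the 0x4000 limit and within 8 bytes of the 0x4cf0 in the build log, so the error most likely comes from the two __private stacks and not from the 4096-byte __local buffer.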

You can disable optimization with the "-cl-opt-disable" build option if you think that is the problem.

You can also use the implementers' offline compiler tools, such as those from AMD and NVIDIA, to see what the OpenCL code compiles into.
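
Another thing that may help is shrinking the stacks. The STACK_SIZE earlier in the thread reserves one slot per node of the whole tree (1 + 9 + 81 + 729 = 820), but a depth-first walk only keeps the pending siblings of the current path on the stack. The sketch below is plain C and rests on assumptions that the posted code does not confirm: every popped node pushes all 9 of its children at once, and depth-3 nodes are leaves. The names walk_tree, walk_stats and SMALL_STACK_SIZE are hypothetical.

```c
#include <assert.h>

#define MAX_DEPTH 3                     /* depths 0..3: 1 + 9 + 81 + 729 nodes */
#define CHILDREN  9
/* Each level replaces one popped node with 9 children: a net growth of 8. */
#define SMALL_STACK_SIZE (1 + (CHILDREN - 1) * MAX_DEPTH)   /* 25 */

typedef struct {
    int visited;      /* total nodes popped */
    int max_in_use;   /* highest number of entries on the stack at once */
} walk_stats;

/* Walks the full 9-ary tree depth-first, storing only each node's depth. */
walk_stats walk_tree(void)
{
    char depth_stack[SMALL_STACK_SIZE];
    int top = -1;
    walk_stats st = {0, 0};

    depth_stack[++top] = 0;             /* push the root */
    st.max_in_use = 1;

    while (top >= 0) {
        char depth = depth_stack[top--];        /* pop */
        st.visited++;
        if (depth < MAX_DEPTH) {
            for (int c = 0; c < CHILDREN; ++c) {
                assert(top + 1 < SMALL_STACK_SIZE);   /* never overflows */
                depth_stack[++top] = (char)(depth + 1);
            }
            if (top + 1 > st.max_in_use)
                st.max_in_use = top + 1;
        }
    }
    return st;
}
```

Under those assumptions the walk still visits all 820 nodes but never holds more than 25 entries, so a 25-entry stack of 12-byte nodes would need about 300 bytes instead of 9840. Whether the second stack (stackEval) can be bounded the same way depends on how the evaluations are consumed, which the thread does not show.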