Hi,
I have converted a recursive algorithm into an iterative form using the stack implementation from the Khronos forum thread linked in the code comment below.
Now I am trying to use local memory to store the node evaluations. My GPU supports 16384 bytes of local memory, so I set my local work size to 16x16x4, which gives a local memory requirement of 1024 floats (4096 bytes).
The program runs with one __private stack of 820 x 5 bytes. If I declare two __private stacks of that size and call the push method on them, the kernel build fails with this error:
Build Log:
ptxas error : Entry function ‘clMyKernel’ uses too much local data (0x4cf0 bytes, 0x4000 max)
The only local buffer I declare is:
// Support for 16x16x4 floats
#define MAX_LOCAL_MEM 1024
__local float local_fLocalEvaluation[MAX_LOCAL_MEM];
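As a quick check of the budget described above, the arithmetic can be sketched in plain host-side C. The macro and function names here are hypothetical and chosen only for illustration; the sizes are the ones quoted in this post, and a 4-byte float is assumed, as in OpenCL C.

```c
#include <assert.h>
#include <stddef.h>

/* Work-group dimensions and local memory capacity as quoted in this post. */
#define LOCAL_SIZE_X 16
#define LOCAL_SIZE_Y 16
#define LOCAL_SIZE_Z 4
#define DEVICE_LOCAL_MEM_BYTES 16384

/* Work-items per work-group: one float slot is reserved for each. */
static size_t work_items_per_group(void)
{
    return (size_t)LOCAL_SIZE_X * LOCAL_SIZE_Y * LOCAL_SIZE_Z;
}

/* Bytes of __local storage needed for one float per work-item. */
static size_t local_buffer_bytes(void)
{
    /* Assumption: float is 4 bytes, as it is in OpenCL C. */
    assert(sizeof(float) == 4);
    return work_items_per_group() * sizeof(float);
}

/* Returns 1 when the buffer fits in the device's local memory, else 0. */
static int local_buffer_fits(void)
{
    return local_buffer_bytes() <= DEVICE_LOCAL_MEM_BYTES;
}
```

Under these assumptions the __local buffer takes a quarter of the 16384 bytes available, so on its own it should not exceed the limit.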
Here is the stack implementation:
// We need to store at most 4 depths of recursion with 9 children per node
#define STACK_SIZE (1 + 9 + 81 + 729)
// Code reference: taken from the Khronos forum
// http://www.khronos.org/message_boards/viewtopic.php?f=28&t=3942&p=11331&hilit=binary+tree#p11331
typedef struct _node
{
    char s8Depth;
    int iYPos;
    int iZPos;
} node;

typedef struct stack
{
    node n[STACK_SIZE];
    int top;
} stack_class;

void init_stack( stack_class *s )
{
    s->top = -1;
}

void push( stack_class *s, node _n )
{
    s->top = s->top + 1;
    s->n[s->top] = _n;
}

node pop( stack_class *s )
{
    node nn = s->n[ s->top ];
    s->top = s->top - 1;
    return nn;
}
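As a rough check on the numbers in the build log, the size of the stack can be computed on the host with the same struct layout. This is only a sketch: it assumes a 4-byte int with 4-byte alignment (so the char member is padded and each node takes 12 bytes rather than 5), and the helper names and the PTXAS_LOCAL_DATA_MAX macro are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_SIZE (1 + 9 + 81 + 729)

/* Same layout as the node and stack_class structs above. */
typedef struct _node
{
    char s8Depth;
    int  iYPos;
    int  iZPos;
} node;

typedef struct stack
{
    node n[STACK_SIZE];
    int  top;
} stack_class;

/* Limit reported by ptxas in the build log (0x4000 = 16384 bytes). */
#define PTXAS_LOCAL_DATA_MAX 0x4000

/* Bytes occupied by `count` stacks with this compiler's struct layout. */
static size_t stacks_bytes(size_t count)
{
    /* Assumption: int is 4 bytes, as it is in OpenCL C. */
    assert(sizeof(int) == 4);
    return count * sizeof(stack_class);
}

/* Returns 1 when `count` stacks stay within the ptxas limit, else 0. */
static int stacks_fit(size_t count)
{
    return stacks_bytes(count) <= PTXAS_LOCAL_DATA_MAX;
}
```

With this layout one stack is 9844 bytes and two are 19688 bytes (0x4ce8), which is within a few bytes of the 0x4cf0 reported by ptxas and above the 0x4000 maximum. That suggests, without proving it, that the "local data" in the ptxas message counts the per-work-item __private stacks rather than the 4096-byte __local buffer.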
I declare the stack as __private; the nodes are __private by default.
__private stack_class stack;
init_stack( &stack );
node nodeRoot;
nodeRoot.s8Depth = 0;
nodeRoot.iYPos = y;
nodeRoot.iZPos = z;
push( &stack, nodeRoot );
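The push and pop functions above do not check the stack bounds, so a traversal deeper than expected would write past the array. A minimal sketch of a checked variant follows, written as plain C so it can be tried on the host; try_push and try_pop are hypothetical names, not part of the referenced forum code.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_SIZE (1 + 9 + 81 + 729)

typedef struct _node
{
    char s8Depth;
    int  iYPos;
    int  iZPos;
} node;

typedef struct stack
{
    node n[STACK_SIZE];
    int  top;
} stack_class;

static void init_stack(stack_class *s)
{
    s->top = -1;
}

/* Hypothetical variant of push: returns 1 on success, 0 when the stack is full. */
static int try_push(stack_class *s, node n)
{
    if (s->top + 1 >= STACK_SIZE)
        return 0;
    s->top = s->top + 1;
    s->n[s->top] = n;
    return 1;
}

/* Hypothetical variant of pop: stores the top node in *out and returns 1,
   or returns 0 and leaves *out untouched when the stack is empty. */
static int try_pop(stack_class *s, node *out)
{
    assert(out != NULL);
    if (s->top < 0)
        return 0;
    *out = s->n[s->top];
    s->top = s->top - 1;
    return 1;
}
```

The return codes let the calling loop stop cleanly instead of corrupting memory; the storage footprint is the same as the unchecked version.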
Why do the stack functions, which are declared at global scope, generate local memory usage?
Thanks