problem understanding the behaviour of my kernel!

Hello,

I am having problems understanding what a simple 1D kernel
is doing with local memory.
The kernel does nothing useful at the moment, I simplified it for
this post, but nevertheless exhibits what seems to me strange
behaviour.

Kernel code:


#define STENCILSIZE(2)
#define x11_size_x (20)

__kernel void fluxes_x11_fast3(
			       __global double* rhoe_in,
			       __global double* rhoe_out  ) 
{   
  int ind, num_point;

  // 1D version
  int main_ref = get_global_id(0); 

  //Identification of workgroup
  int i = get_group_id(0); 

  //Identification of work item inside workgroup
  int idX = get_local_id(0); 
  int sizeX = get_local_size(0); 

  __local double lrhoe[x11_size_x+2*STENCILSIZE];

  //Copy submatrixes to local memory. One element copied per work item

    lrhoe[idX+STENCILSIZE]   = rhoe_in[main_ref];  
	
  // "Edge" effects on the left 
  if(idX == 0 && i >= 1) {

    for (ind = 0; ind <= STENCILSIZE-1; ind++ ) { 
      lrhoe[idX+ind]   = rhoe_in[main_ref-STENCILSIZE+ind];
    }
  }

		
  // "Edge" effects on the right 
  if(idX == x11_size_x-1 && i <= (int)get_num_groups(0)-2 ) {

    for (ind=1; ind<=STENCILSIZE; ind++){
      lrhoe[idX+STENCILSIZE+ind]   = rhoe_in[main_ref+ind];
    }
  }

  // synchronise all the local memory
  barrier(CLK_LOCAL_MEM_FENCE);
	       
  // main computation bit
		
  if ( main_ref >= STENCILSIZE && main_ref <=DATA_SIZE_X-1-STENCILSIZE ) {
      
 rhoe_out[main_ref] = lrhoe[idX+STENCILSIZE];
      	  
  }       
	
}       


The kernel has one input array, and one output array.
All I’m trying to do for the moment is pull the input array into local
memory, while taking “edge” effects into account (i.e. for each point
in the input array, the local array would have neighbours of this point on
each side).
In the output array, I insert a value from the local array.

What I don’t understand is the values I get from the local array, see last line of
actual code:

  • if I take lrhoe[idX+STENCILSIZE], I get the desired result, i.e. the input value.
  • if I take lrhoe[idX+STENCILSIZE-1], I expect the input array shifted by one to the
    right, but instead I get the input array shifted by 6.

Can anyone explain this behaviour, or suggest what I might be doing wrong?
I can also post host code if this is deemed useful.

Many thanks!

Olivier

That sounds odd. Also, shouldn’t the last line read like this?

rhoe_out[main_ref-STENCILSIZE] = lrhoe[idX+STENCILSIZE];

Hi David, and thanks alot for answering.
The kernel here is a much simplified finite-difference kernel,
which I have got working properly in 1 and 2D using only global memory.
The input and output arrays represent my entire domain, and thus “main_ref”
is the index into the entire domain. The line you asked about,
rhoe_out[main_ref] = lrhoe[idX+STENCILSIZE];
represents the computation of a new value rhoe from its neighbouring
values; at the moment it is simply a duplication (i.e. lrhoe[idX+STENCILSIZE] should equal
rhoe_in[main_ref]) but the idea is be to be able to do something like


double rhoe_out_private = lrhoe[idX+STENCILSIZE];
for (index=1;index<=STENCILSIZE;index++)  {
     rhoe_out_private += 
               coef[index] * (lrhoe[idX+STENCILSIZE+index] - lrhoe[+STENCILSIZE-index]);
}
rhoe_out[main_ref] = rhoe_out_private;

So as I see it, the index I want into the out array is main_ref.
I only update if stencilsize<main_ref<global_size-stencilsize due to a different treatment applied at boundaries.
Does that make sense?
Thanks again, I am at a loss!
Olivier

To add to my confusion, the kernel does exactly what I expect it to do, when run
on the cpu with the amd sdk, that’s to say it duplicates the input array
with the following line


rhoe_out[main_ref] = lrhoe[idX+STENCILSIZE]; 

and shifts it by one to the right with the following line


rhoe_out[main_ref] = lrhoe[idX+STENCILSIZE-1]; 

Does this point to a bug in the nvidia implementation, or am I perhaps misusing
some aspect of local memory to which the amd cpu implementation is less
sensitive?
Many thanks,
Olivier

It looks like a possible bug in NVidia’s compiler. I suggest sending it to their customer support.

OK, I’ll do that and update this thread when I get a response from NVIDIA.
Thanks alot for the help,
Olivier