OpenCL general optimization for neighbors accessing

Hi,

I’m starting to program with OpenCL, and I have created a naive implementation of my problem.

The theory is: I have a 3D grid of elements, where each elements has a bunch of information (around 200 bytes). Every step, every element access its neighbors information and accumulates this information to prepare to update itself. After that there is a step where each element updates itself with the information gathered before. This process is executed iteratively.

My OpenCL implementation is: I create an OpenCL buffer of 1 dimension, fill it with structs representing the elements, wich have an "int neighbors[6] where I store the index of the neighbors in the Buffer. I launch a kernel that consults the neighbors and accumulate their information into element variables not consulted in this step, and then I launch another kernel that uses this variables to update the elements. These kernels use __global variables only.

Sample code:

typedef struct{
  float4 var1;
  float4 var2;
  float4 nextStepVar1;
  int neighbors[8];
  int var3;
  int nextStepVar2;
  bool var4;
} Element;

__kernel void step1(__global Element *elements, int nelements){
  int id = get_global_id(0);
  if (id >= nelements){
    return;
  }
  Element elem = elements[id];

  for (int i=0; i < 6; ++i){
    if (elem.neighbors[i] != -1){
      //Gather information of the neighbor and accumulate it in elem.nextStepVars
    }
  }
  elements[id] = elem;
}

__kernel void step2(__global Element *elements, int nelements){
  int id = get_global_id(0);
  if (id >= nelements){
    return;
  }
  Element elem = elements[id];

  //update elem variables by using elem.nextStepVariables
  //restart elem.nextStepVariables
}

Right now, my OpenCL implementation takes basically the same time than my C++ implementation.

So, the question is: How would you (the experts :P) address this problem? I have read about 3D images, to store the information and change the neighborhood accessing pattern by changing the NDRange to a 3D one. Also, I have read about __local memory, to first load all the neighborhood in a workgroup, synchronize with a barrier and then use them, so that accesses to memory are reduced.

Could you give me some tips to optimize a process like the one I described, and if possible, give me some snippets?

Thanx.