Hi,
I’m starting to program with OpenCL, and I have created a naive implementation of my problem.
The theory is: I have a 3D grid of elements, where each elements has a bunch of information (around 200 bytes). Every step, every element access its neighbors information and accumulates this information to prepare to update itself. After that there is a step where each element updates itself with the information gathered before. This process is executed iteratively.
My OpenCL implementation is: I create an OpenCL buffer of 1 dimension, fill it with structs representing the elements, wich have an "int neighbors[6] where I store the index of the neighbors in the Buffer. I launch a kernel that consults the neighbors and accumulate their information into element variables not consulted in this step, and then I launch another kernel that uses this variables to update the elements. These kernels use __global variables only.
Sample code:
typedef struct{
float4 var1;
float4 var2;
float4 nextStepVar1;
int neighbors[8];
int var3;
int nextStepVar2;
bool var4;
} Element;
__kernel void step1(__global Element *elements, int nelements){
int id = get_global_id(0);
if (id >= nelements){
return;
}
Element elem = elements[id];
for (int i=0; i < 6; ++i){
if (elem.neighbors[i] != -1){
//Gather information of the neighbor and accumulate it in elem.nextStepVars
}
}
elements[id] = elem;
}
__kernel void step2(__global Element *elements, int nelements){
int id = get_global_id(0);
if (id >= nelements){
return;
}
Element elem = elements[id];
//update elem variables by using elem.nextStepVariables
//restart elem.nextStepVariables
}
Right now, my OpenCL implementation takes basically the same time than my C++ implementation.
So, the question is: How would you (the experts :P) address this problem? I have read about 3D images, to store the information and change the neighborhood accessing pattern by changing the NDRange to a 3D one. Also, I have read about __local memory, to first load all the neighborhood in a workgroup, synchronize with a barrier and then use them, so that accesses to memory are reduced.
Could you give me some tips to optimize a process like the one I described, and if possible, give me some snippets?
Thanx.