I am trying to overlay a grid (grid1) onto another grid (grid2) in parallel and edit grid2 based on the values in grid1. Here is a simplified version of my kernel:
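__kernel void simple(
    __global const float* grid2,   // input grid, row-major, row stride = grid2_width
    __global float* output,        // one result per work-item
    __constant float* grid1,       // overlay grid, row-major
    const int grid1_width,         // sizes are passed explicitly: a raw pointer carries no dimensions
    const int grid1_height,
    const int grid2_width)
{
    // 2D position of this work-item; global ids are used so every work-group covers its own part of the grid
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    float value = 0.0f;
    for (int i = 0; i < grid1_width; i++)
    {
        for (int j = 0; j < grid1_height; j++)
        {
            // assumes grid2 is padded so that (x + i, y + j) is always in bounds
            value += grid1[j * grid1_width + i] * grid2[(y + j) * grid2_width + (x + i)];
        }
    }
    output[y * get_global_size(0) + x] = value;
}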
The algorithm works, but is it possible to improve its performance by first copying the portion of grid2 that each work-item reads into private memory? If so, how would I do that copy without writing a for loop that makes the same number of reads from global memory?
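To pin down what I expect the kernel to compute, here is a minimal CPU reference in plain C. It is only a sketch: the function name overlay_reference, the explicit width and height parameters, and the row-major layout with a grid2 large enough for every window are my assumptions for illustration, not part of the actual kernel.

```c
/* CPU reference for the overlay sum computed by each work-item.
 * Assumptions (for illustration only): all grids are row-major,
 * grid2 has row stride g2w and is at least
 * (out_w + g1w - 1) x (out_h + g1h - 1), so no window reads out of bounds. */
void overlay_reference(const float *grid1, int g1w, int g1h,
                       const float *grid2, int g2w,
                       float *output, int out_w, int out_h)
{
    for (int y = 0; y < out_h; y++) {
        for (int x = 0; x < out_w; x++) {
            /* Same accumulation as one work-item at position (x, y). */
            float value = 0.0f;
            for (int j = 0; j < g1h; j++) {
                for (int i = 0; i < g1w; i++) {
                    value += grid1[j * g1w + i]
                           * grid2[(y + j) * g2w + (x + i)];
                }
            }
            output[y * out_w + x] = value;
        }
    }
}
```

Any batched version of the kernel should produce the same output as this loop, up to floating-point rounding if the summation order changes.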
Thanks,
Gaming