06-17-2011, 12:24 PM
I am trying to overlay a grid (grid1) onto another grid (grid2) in parallel, and edit grid2 based on the values in grid1. Here is my kernel in pseudocode:

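__kernel void simple(
    __global const float* grid2,   // input grid, row-major, grid2_width columns
    __global float* output,        // one result per work-item
    __constant float* grid1,       // overlay grid, row-major
    const int grid1_width,         // size parameters added here: a raw
    const int grid1_height,        // pointer has no .width or .height
    const int grid2_width)
{
    // Global id, not local id: the local id restarts at 0 in every work-group.
    const int index = get_global_id(0);
    const int x = index % grid2_width;
    const int y = index / grid2_width;
    float value = 0.0f;

    // Assumes grid2 is padded so that (x + i, y + j) never runs past its edges.
    for (int j = 0; j < grid1_height; j++)
        for (int i = 0; i < grid1_width; i++)
            value += grid1[j * grid1_width + i] * grid2[(y + j) * grid2_width + (x + i)];

    output[index] = value;
}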

The algorithm I have works, but is it possible to increase its performance by copying the portion of grid2 that I am using into private memory first? If so, how would I copy it without writing a for loop that makes the same number of reads from global memory?


06-17-2011, 01:44 PM
Or maybe my problem is here. This is the call that I am executing in C++:

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &sizeIn, NULL, 0, NULL, NULL);

My problem is that I have a Quadro 135M GPU and an integrated dual-core CPU, but the GPU takes twice as long to compute the function as the CPU, and if I am right this shouldn't be possible because the GPU has 8 cores.

07-06-2011, 03:12 PM
and if I am right this shouldn't be possible because the GPU has 8 cores

In spite of what the marketers try to sell us, performance cannot be measured in "cores".

If you are computing a convolution, as it seems you are, you can benefit from using local memory, because neighbouring work-items read overlapping parts of grid2. See the examples in AMD's or NVIDIA's SDK for details on how to do it.
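
As a rough illustration of the tiling pattern, here is a plain C sketch that emulates it on the CPU so the indexing is easy to check. It is not the SDK code. TILE, MAX_K, the function names, and the "valid" output size (grid2 size minus grid1 size plus one) are all assumptions of mine. In a real kernel the tile array would be declared __local, the loops over ly and lx would be the work-items of one work-group, and a barrier(CLK_LOCAL_MEM_FENCE) would separate the load phase from the compute phase.

```c
#include <assert.h>

#define TILE  4   /* hypothetical work-group edge, in work-items */
#define MAX_K 8   /* hypothetical upper bound on the grid1 edge  */

/* Direct version: each output value reads grid2 kw*kh times. */
void convolve_direct(const float *grid2, int w, int h,
                     const float *grid1, int kw, int kh, float *out)
{
    int ow = w - kw + 1, oh = h - kh + 1;   /* "valid" output size */
    for (int y = 0; y < oh; y++)
        for (int x = 0; x < ow; x++) {
            float v = 0.0f;
            for (int j = 0; j < kh; j++)
                for (int i = 0; i < kw; i++)
                    v += grid1[j * kw + i] * grid2[(y + j) * w + (x + i)];
            out[y * ow + x] = v;
        }
}

/* Tiled version: each emulated work-group first copies its block of
 * grid2, plus a halo of kw-1 columns and kh-1 rows, into a small buffer,
 * then every emulated work-item reads only from that buffer. */
void convolve_tiled(const float *grid2, int w, int h,
                    const float *grid1, int kw, int kh, float *out)
{
    assert(kw >= 1 && kh >= 1 && kw <= MAX_K && kh <= MAX_K);
    int ow = w - kw + 1, oh = h - kh + 1;
    int tw = TILE + kw - 1, th = TILE + kh - 1;   /* tile size with halo */
    float tile[(TILE + MAX_K - 1) * (TILE + MAX_K - 1)]; /* __local in OpenCL */

    for (int gy = 0; gy < oh; gy += TILE)         /* one pass = one work-group */
        for (int gx = 0; gx < ow; gx += TILE) {
            /* Phase 1: load the tile; positions outside grid2 become 0
             * and are only ever read by work-items that are skipped below. */
            for (int ty = 0; ty < th; ty++)
                for (int tx = 0; tx < tw; tx++) {
                    int sx = gx + tx, sy = gy + ty;
                    tile[ty * tw + tx] =
                        (sx < w && sy < h) ? grid2[sy * w + sx] : 0.0f;
                }

            /* barrier(CLK_LOCAL_MEM_FENCE) would go here in a kernel. */

            /* Phase 2: each work-item (lx, ly) computes one output value. */
            for (int ly = 0; ly < TILE; ly++)
                for (int lx = 0; lx < TILE; lx++) {
                    int x = gx + lx, y = gy + ly;
                    if (x >= ow || y >= oh)
                        continue;                 /* outside the output grid */
                    float v = 0.0f;
                    for (int j = 0; j < kh; j++)
                        for (int i = 0; i < kw; i++)
                            v += grid1[j * kw + i] * tile[(ly + j) * tw + (lx + i)];
                    out[y * ow + x] = v;
                }
        }
}
```

Both functions add the terms in the same order, so they should produce identical results for the same inputs. The point of the tiled form is that each grid2 value is fetched from global memory roughly once per work-group instead of once per work-item that touches it. Whether that pays off on a particular GPU is something you would have to measure.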