My name is Marcus Große and I work in the field of 3D measurement using structured light. Our group
is evaluating the use of GPUs for image processing tasks. To get to know OpenCL, I have written a kernel that averages twenty-one gray-value images and writes the results to global device memory for later use (see the kernel code below).

The runtime of the kernel (measured using clGetEventProfilingInfo) is about 113 ms on the GPU.
To get it that fast I am using loop unrolling as described here: "http://developer.amd.com/gpu/ATIStreamSDK/ImageConvolutionOpenCL/Pages/ImageConvolutionUsingOpenCL.aspx" (about 10 ms faster than the non-unrolled version).
An implementation on the CPU takes about 140 ms (no loop unrolling, only one core used). So
for this problem there seems to be no big performance gain from using the GPU. I have a few questions related to that result.

1) The problem may be that only one addition is performed per memory access, so that memory bandwidth limits the execution speed. Is this plausible?
2) As the image dimensions (global buffer dimensions) are multiples of 16, memory accesses should be coalesced in my implementation. Is there a way to check this, or can someone point me to problems in my kernel code that prevent coalesced memory access?
3) Are there other options to decrease execution time?
4) We use an NVIDIA GeForce 9500GT. When switching to a more recent model (perhaps the upcoming Fermi cards), what speed-up could be achieved for this problem (a factor of >10?)?
5) I addressed the same problem using 2D and 3D image objects (image2d_t and image3d_t) instead of the one-dimensional buffers "l" and "r". The runtime is about the same as with the two one-dimensional buffers. I had expected a speed-up due to cached memory reads. Why is there none?
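
As a rough check for question 1, the amount of data the kernel has to move can be estimated from the numbers in this post (21 float images of 640x480 per camera, two cameras, 113 ms). This comes out at about 54 MB of traffic and an effective bandwidth of about 0.48 GB/s, which can then be compared with the bandwidth specified for the card. The helper names below are hypothetical and the estimate assumes every input value is read exactly once and every output value is written exactly once.

```c
#include <stddef.h>

/* Bytes the averaging kernel must move, assuming each input pixel is read
   once and each output pixel is written once (hypothetical helper). */
static double kernel_traffic_bytes(int width, int height, int images, int cameras)
{
    double pixels = (double)width * (double)height;
    double reads  = pixels * images * cameras * sizeof(float);
    double writes = pixels * cameras * sizeof(float);
    return reads + writes;
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes) for moving `bytes`
   within `ms` milliseconds (hypothetical helper). */
static double effective_gb_per_s(double bytes, double ms)
{
    return bytes / (ms * 1e-3) / 1e9;
}
```

Calling kernel_traffic_bytes(640, 480, 21, 2) and passing the result together with 113.0 to effective_gb_per_s reproduces the figures quoted above.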

Questions not related to the results:

6) I am also eager to see more examples written in OpenCL, which handle image processing. Perhaps someone can point me to a link or book?
7) If I do not assign the local variables avgl,...,avgl3 to the global buffer avgL, the GPU seems to skip the entire
calculation of these values, which makes it difficult to separate the time spent on memory reads/writes from the time spent on computation. Is there a workaround?

Thanks in advance,
Marcus Große

I am using CUDA-Toolkit 3.0 + NVIDIA 9500GT

//'l' contains image data of one camera, 'r' contains image data of a second camera, average values are computed for both cameras (stored into 'avgL' and 'avgR')
__kernel void AverageKernel(__global float* avgL, __global float* avgR, __global float* l, __global float* r)
{
//get position of work-item in image
unsigned int nx = get_global_id(0);
unsigned int ny = get_global_id(1);
float inv_pics=1.0f/21.0f;
//variables used for loop unrolling
float avgl=0.0f;
float avgr=0.0f;
float avgl2=0.0f;
float avgr2=0.0f;
float avgl3=0.0f;
float avgr3=0.0f;
//average calculation of 21 images of size 640x480
for(int c=0;c<7;c++)//loop-unrolling
{
//... (loop body: three images per camera are accumulated per iteration)
}
//writing results to global device memory
//... (stores to avgL and avgR)
}
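
For cross-checking the GPU output, a minimal single-core C sketch of the same averaging could look like the following. It is not the CPU implementation mentioned above. The function name average_stack and the memory layout (all 21 images of one camera stored back to back, so that pixel i of image k sits at stack[k * pixels + i]) are assumptions made for illustration. The three accumulators mirror the unrolling by three used in the kernel.

```c
#include <stddef.h>

#define NUM_IMAGES 21  /* 21 images per camera, as stated in the post */

/* Averages NUM_IMAGES images stored back to back in `stack`.
   Assumed layout: pixel i of image k is stack[k * pixels + i]. */
static void average_stack(float *avg, const float *stack, size_t pixels)
{
    const float inv_pics = 1.0f / NUM_IMAGES;
    for (size_t i = 0; i < pixels; ++i) {
        /* three independent accumulators, one per unrolled step */
        float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f;
        for (size_t c = 0; c < NUM_IMAGES / 3; ++c) {
            a0 += stack[(3 * c + 0) * pixels + i];
            a1 += stack[(3 * c + 1) * pixels + i];
            a2 += stack[(3 * c + 2) * pixels + i];
        }
        avg[i] = (a0 + a1 + a2) * inv_pics;
    }
}
```

With this layout, neighbouring pixels of the same image are adjacent in memory, so work-items with consecutive nx would read consecutive addresses if the kernel indexes its buffers the same way.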