Calculation of average values of an image sequence

Hello,

my name is Marcus Große and I am working in the field of 3D measurement using structured light. Our group
is evaluating the use of GPUs for image processing tasks. In order to get to know OpenCL I have written a kernel that averages twenty-one gray-value images and writes the results into global device memory for later use (see the kernel code below).

The runtime of the kernel, measured with clGetEventProfilingInfo, is about 113 ms on the GPU.
To reach that time I use loop unrolling as described at http://developer.amd.com/gpu/ATIStreamSDK/ImageConvolutionOpenCL/Pages/ImageConvolutionUsingOpenCL.aspx (about 10 ms faster than the non-unrolled case).
A CPU implementation takes about 140 ms (no loop unrolling, only one core used). So for this problem there seems to be no big performance gain from using the GPU. I have a few questions related to that result.

  1. The problem may be that only one addition is performed per memory access, so that memory bandwidth limits execution speed. Is this plausible?
  2. As the image dimensions (global buffer dimensions) are multiples of 16, memory accesses should be coalesced in my implementation. Is there a way to check this, or can someone point me to problems in my kernel code that suppress coalesced memory access?
  3. Are there other options to decrease the execution time?
  4. We use an NVIDIA GeForce 9500 GT. When switching to a more recent model (perhaps the upcoming Fermi cards), what speed-up could be achieved for this problem (a factor > 10?)?
  5. I addressed the same problem using image2d_t and image3d_t objects instead of the one-dimensional buffers "l" and "r" (see the sketch after this list). The runtime is about the same as with the two one-dimensional buffers. I had expected a speed-up due to caching of memory reads?
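
For reference, the image-object variant looks roughly like the sketch below (reconstructed from memory, not verbatim; the kernel name is made up, and each camera's 21 images are assumed to be stored as one 640x480x21 image3d_t with a single CL_FLOAT channel):

//sampler for unnormalized integer coordinates, no filtering
const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

__kernel void AverageKernelImages(__global float* avgL, __global float* avgR, __read_only image3d_t l, __read_only image3d_t r)
{
unsigned int nx = get_global_id(0);
unsigned int ny = get_global_id(1);
float inv_pics=1.0f/21.0f;
float suml=0.0f;
float sumr=0.0f;
//average over the 21 slices of each 3D image
for(int c=0;c<21;c++)
{
suml += read_imagef(l, smp, (int4)(nx, ny, c, 0)).x;
sumr += read_imagef(r, smp, (int4)(nx, ny, c, 0)).x;
}
avgL[nx+640*ny] = suml*inv_pics;
avgR[nx+640*ny] = sumr*inv_pics;
}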

Questions not related to these results:

  1. I am also eager to see more OpenCL examples that handle image processing. Perhaps someone can point me to a link or a book?
  2. If I do not assign the local variables avgl, …, avgl3 to the global buffer avgL, the GPU seems to skip the entire calculation of these values, which makes it difficult to separate memory read/write time from computation time. Is there a workaround?

thanks in advance,
Marcus Große

I am using CUDA Toolkit 3.0 and an NVIDIA GeForce 9500 GT.

Kernel code:
//‘l’ contains image data of one camera, ‘r’ contains image data of a second camera, average values are computed for both cameras (stored into ‘avgL’ and ‘avgR’)
__kernel void AverageKernel(__global float* avgL,__global float* avgR, __global float* l, __global float* r)
{
//get position of workitem in image
unsigned int nx = get_global_id(0);
unsigned int ny = get_global_id(1);
float inv_pics=1.0f/21.0f;
//variables used for loop unrolling
float avgl=0.0f;
float avgr=0.0f;
float avgl2=0.0f;
float avgr2=0.0f;
float avgl3=0.0f;
float avgr3=0.0f;
//average calculation of 21 images of size 640x480 (loop unrolled by a factor of 3)
for(int c=0;c<7;c++)
{
avgl +=l[nx+640*ny+(c*3  )*640*480];
avgr +=r[nx+640*ny+(c*3  )*640*480];
avgl2+=l[nx+640*ny+(c*3+1)*640*480];
avgr2+=r[nx+640*ny+(c*3+1)*640*480];
avgl3+=l[nx+640*ny+(c*3+2)*640*480];
avgr3+=r[nx+640*ny+(c*3+2)*640*480];
}
//writing results to global device memory
avgL[nx+640*ny]=(avgl+avgl2+avgl3)*inv_pics;
avgR[nx+640*ny]=(avgr+avgr2+avgr3)*inv_pics;
}

My guess is that most of your time is just getting the data over to the card. You are doing very little work on it, so the difference in time between a good CPU implementation (which needs to access all the data once) and a good GPU implementation (which needs to access it once to copy it to the GPU and then again to do the averaging) is probably going to be in favor of the CPU. To get good performance from a GPU you need to do a lot of math and/or access the data a lot so you can take advantage of the large on-chip bandwidth and amortize the cost of getting it to the GPU in the first place.

The time given is only the kernel execution time (113 ms), as reported by clGetEventProfilingInfo. Since the data transfer is done before the kernel executes, it should not be part of the measured time (I attached the event to the kernel launch, so it should reflect only kernel execution time). I agree that the problem may not be perfectly suited to GPUs.
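
For reference, the timing is read out roughly like this (a sketch, error checking omitted; it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, and "queue", "kernel" and "global" are placeholders for my actual handles):

cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong t_start, t_end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &t_end, NULL);
double kernel_ms = (t_end - t_start) * 1.0e-6; /* profiling values are in nanoseconds */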

I missed that part. My guess is that you are just memory bound on the card, but you should have far higher memory bandwidth than on the CPU, so I'm not sure why it's not a lot faster. You should be able to get an upper bound for the performance here by looking at the memory bandwidth of your card. What is your local work-group size? Fermi will undoubtedly be faster, but probably just because everything on the card is faster. I wouldn't expect a benefit from images here because your reuse is very limited. You are going along in x across the work-items, so you already get full bandwidth there, and the y-dimension reuse is probably too far apart for the cache sizes since you're doing 21 images at a time.
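
If you aren't specifying a local size explicitly, it may be worth forcing something like 16x16 and comparing (a sketch; "queue", "kernel" and "evt" are placeholders):

/* 16x16 divides 640x480 evenly, and 16 consecutive work-items along x
   give the access pattern needed for coalesced float loads on pre-Fermi hardware */
size_t global[2] = {640, 480};
size_t local[2]  = {16, 16};
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);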

If you don’t write back avgl, …, avgl3, the compiler will (correctly) determine that your code doesn’t have any outputs and therefore eliminate it as dead code. One way around this is to write to a dummy location in local memory, as in the sketch below.
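
Something along these lines should keep the arithmetic alive for timing experiments while keeping the store traffic negligible (just a sketch; the kernel name and the small global "sink" buffer, one float per work-group, are made up for the example):

__kernel void AverageKernelTimingOnly(__global float* sink, __global float* l, __global float* r)
{
unsigned int nx = get_global_id(0);
unsigned int ny = get_global_id(1);
__local float dummy;
float suml=0.0f;
float sumr=0.0f;
for(int c=0;c<21;c++)
{
suml += l[nx+640*ny+c*640*480];
sumr += r[nx+640*ny+c*640*480];
}
//dummy write to local memory so the loop is not removed as dead code
//(the value itself is irrelevant, so the write race between work-items is harmless)
dummy = suml + sumr;
//one tiny global write per work-group, in case the compiler also treats the local write as dead
if(get_local_id(0)==0 && get_local_id(1)==0)
    sink[get_group_id(1)*get_num_groups(0)+get_group_id(0)] = dummy;
}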

I posted the same thread here "The Official NVIDIA Forums | NVIDIA" to increase the chance of getting feedback. One user assumed (like you) that bandwidth may be the problem. The clBandwidthTest sample tells me that device-to-device transfer is about 13 GB/s, which according to that user is in the same range as CPU-to-host-memory bandwidth.
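
A quick back-of-envelope estimate with that figure (assuming 4-byte floats; the numbers below are approximate) suggests that the pure data movement for this kernel should only take a few milliseconds, so I am starting to wonder whether my accesses are really coalesced after all:

/* rough lower bound from the ~13 GB/s device-to-device figure */
double bytes_read    = 2.0 * 21 * 640 * 480 * 4;   /* two cameras, 21 images each: ~51.6 MB */
double bytes_written = 2.0 *      640 * 480 * 4;   /* two result images:           ~2.5 MB  */
double seconds = (bytes_read + bytes_written) / 13.0e9;   /* roughly 4 ms */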

The dummy variable is a good tip, I will try that. Thanks for your replies.