how do work-groups affect performance?

Hi,

I have been developing with OpenCL for a week now and am facing my first performance issues. I am processing a video with a resolution of 320x240 (floating-point data), and I do several computations on each image.
My GPU is an NVIDIA Quadro FX 370, which has only 16 CUDA cores and supports OpenCL 1.0.
My basic steps are as follows:

  1. Read an image from host memory to the device
  2. Enqueue a single kernel with a work-group size of 64x8 (the maximum work-group size is 512, so this fits pretty nicely) and a global size of 320x240
  3. Wait until the kernel is finished
  4. Read the results back from device memory
  5. Go to 1.
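In code, the loop looks roughly like this (a simplified sketch; error checking is omitted and the buffer names are illustrative):

size_t global[2] = { 320, 240 };
size_t local[2]  = { 64, 8 };
size_t size      = 320 * 240 * sizeof(float);

for (int frame = 0; frame < 1000; ++frame) {
    clEnqueueWriteBuffer(queue, input, CL_TRUE, 0, size, host_in, 0, NULL, NULL);   // 1.
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);   // 2.
    clFinish(queue);                                                                // 3.
    clEnqueueReadBuffer(queue, output, CL_TRUE, 0, size, host_out, 0, NULL, NULL);  // 4.
}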

I benchmarked this by processing 1000 frames (I only measured the clEnqueueNDRangeKernel + clFinish calls with a performance counter).
Result on the GPU: 16 s
I also have a hand-written SSE version running on the CPU (only a single thread):
Result on the CPU (SSE): 30 s

I am a bit disappointed, as I had hoped that moving from the CPU to the GPU would be at least 4 times faster (SSE = 4 operations in parallel, 16 CUDA cores = 16 operations in parallel).

I wonder how I can improve the performance. Do I have to enqueue the same kernel multiple times to fully utilize the GPU? If so, which global size should I choose? Do I have to split the global size by hand?

All pixels within one image are independent, but each pixel must be processed in order from the first frame to the last.

Thanks for any advice.

  1. Read an image from host memory to the device
  2. Enqueue a single kernel with a work-group size of 64x8 (the maximum work-group size is 512, so this fits pretty nicely) and a global size of 320x240
  3. Wait until the kernel is finished
  4. Read the results back from device memory
  5. Go to 1.

The first rule of improving the performance of any code is to measure where the time is being spent. All OpenCL vendors provide tools to measure how long each part of the code takes.
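For example, OpenCL can time individual commands itself through event profiling. A minimal sketch, assuming the command queue was created with the CL_QUEUE_PROFILING_ENABLE property and that global and local hold your NDRange sizes:

cl_event evt;
cl_ulong start, end;

// Attach an event to the command we want to time.
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
clWaitForEvents(1, &evt);

// Query the device-side timestamps (in nanoseconds) and print the difference.
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
printf("kernel time: %.3f ms\n", (end - start) * 1e-6);
clReleaseEvent(evt);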

I can give some general advice based on what you described above, although general advice is not as useful as actually measuring performance.

Step 3 is not necessary; you can remove it safely. Commands in an in-order queue already execute in submission order, so a read enqueued after the kernel will not start until the kernel has finished.

Steps 1 and 4 are not ideal either. The way this code works, you submit a very small piece of work to the device, then wait until the device is finished, then read it back. That means that at any point in time either the device is waiting for the application or the application is waiting for the device. Instead, what you want is both the application and the device doing work at the same time.

One way to do this is to have more than one image. For example, you can have two images and make the application load one of them while the device is computing on the other. The code would look like this:


cl_mem image[2], swap;
size_t global[2] = { 320, 240 };
size_t local[2]  = { 64, 8 };

// Initialize image[0] and image[1] here.

// Write data into image[0] here.

for (int i = 0; i < number_of_iterations; ++i)
{
    // Run the kernel on image[0]. Note that clSetKernelArg takes a pointer
    // to the cl_mem, and clEnqueueNDRangeKernel returns as soon as the
    // command is queued:
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &image[0]);
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);

    // Write data for image[1] here (this overlaps with the kernel running
    // on image[0]).

    // Read image[0] here.

    // Swap image[0] and image[1]:
    swap = image[0];
    image[0] = image[1];
    image[1] = swap;
}

Notice what is happening when the code is running:


1. Write Image0 (outside of the loop)
// First iteration of the loop:
2. NDRange Image0
3. Write Image1. // This happens while the device is running the NDRange.
4. Read Image0.
// Second iteration of the loop:
5. NDRange Image1.
6. Write Image0.
7. Read Image1.

You may get better results with 3 or 4 images instead. This technique is called multiple buffering. It works because it allows the application and the device to do work at the same time instead of waiting for each other.
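Generalizing the two-image loop above to more buffers mostly means cycling an index instead of swapping. A sketch, reusing the names from the example above (NUM_BUFFERS and the cur/next indices are illustrative):

#define NUM_BUFFERS 3

cl_mem image[NUM_BUFFERS];

// Initialize all NUM_BUFFERS images and pre-fill image[0] here.

for (int i = 0; i < number_of_iterations; ++i)
{
    int cur  = i % NUM_BUFFERS;
    int next = (i + 1) % NUM_BUFFERS;

    // Run the kernel on the current image:
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &image[cur]);
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);

    // Write data for image[next] and read back image[cur] here, while the
    // device is busy.
}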

One last thing: you didn’t say whether you are using blocking or non-blocking writes. In this case you should be using non-blocking writes. Again, the reason is to improve concurrency between the application and the device.
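A minimal sketch of a non-blocking write, reusing the names from the example above (size and host_ptr are placeholders for your transfer size and source pointer):

cl_event write_done;

// CL_FALSE makes the write non-blocking: the call returns immediately and
// the transfer proceeds while the host keeps working. The host buffer must
// not be modified until the write has completed.
clEnqueueWriteBuffer(queue, image[1], CL_FALSE, 0, size, host_ptr, 0, NULL, &write_done);

// ... do other host work here ...

clWaitForEvents(1, &write_done);  // wait only once the data must be on the device
clReleaseEvent(write_done);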

Hello there, sorry for asking this in your topic, David, but I found this multiple buffering technique very nice. I have a question about it, though: in David’s example he writes the image to the device and then reads the result back. In my case I create an OpenGL texture, manipulate it directly in the kernel, and then simply display the texture on screen.
So the difference from David’s method is that I copy the image to an OpenGL texture but don’t read it back to host memory.
With this scenario I’m not certain whether multiple buffering can be applied.

Any ideas?

Thanks!

in my case I create an OpenGL texture, manipulate it directly in the kernel, and then simply display the texture on screen.

If the texture is only being accessed from the device then I don’t see how multiple buffering is going to improve performance.

Step 2: Your work-group size of 64x8 = 512 is correct, but:

  • One of the SM (Streaming Multiprocessor) resource limits is the number of work-items that can be simultaneously tracked and scheduled. For most devices this is 1024 work-items per SM. With a work-group size of 512, only 2 work-groups (1024/512 = 2) can be tracked and scheduled at a time, even though normally up to 8 work-groups can be assigned to each SM. Your configuration therefore leads to underutilization of the execution resources. A better configuration would be a 16x16 (= 256) work-group size, because then 1024/256 = 4 work-groups run per SM.
  • If possible (this depends on your algorithm), copy the pixels into local memory to reduce global memory accesses. This will increase your performance; see the sketch after this list.
  • See the OpenCL-OpenGL interoperability documentation for using a Pixel Buffer Object.
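As an illustration of the local-memory point above, here is a minimal kernel sketch (not the actual kernel from this thread; the tile size matches the suggested 16x16 work-group):

// Each work-item loads one pixel of a 16x16 tile into local memory, so
// neighbouring pixels can then be read from fast local memory instead of
// global memory.
__kernel void process(__global const float *in, __global float *out, int width)
{
    __local float tile[16][16];

    int gx = get_global_id(0);
    int gy = get_global_id(1);
    int lx = get_local_id(0);
    int ly = get_local_id(1);

    tile[ly][lx] = in[gy * width + gx];   // one global read per work-item
    barrier(CLK_LOCAL_MEM_FENCE);         // make the tile visible to the whole work-group

    // ... compute using tile[ly][lx] and its neighbours here ...

    out[gy * width + gx] = tile[ly][lx];
}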