OpenCL on Windows much slower than on Mac?

Hi,

I’m new to OpenCL development. I have a simple convolution kernel for 24-bit RGB images:


__kernel void ConvolveRGB(__global const uchar *in,
                          __global uchar *out,
                          int width, int height)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int row = width * 3;          // bytes per image row (3 channels)

    // Skip the one-pixel border so the 3x3 stencil never reads out of bounds.
    if (x > 0 && x < width - 1 && y > 0 && y < height - 1)
    {
        for (int b = 0; b < 3; ++b)     // process R, G, B separately
        {
            int t = (y * width + x) * 3 + b;
            // 3x3 weights (sum = 16).
            int v = 1 * in[t - row - 3] + 2 * in[t - row] + 1 * in[t - row + 3]
                  + 2 * in[t - 3]       + 4 * in[t]       + 2 * in[t + 3]
                  + 1 * in[t + row - 3] + 2 * in[t + row] + 1 * in[t + row + 3];
            v /= 16;
            if (v > 255) v = 255;
            out[t] = v;
        }
    }
}

And the host-side code that calls it looks like this:


void ProcessRGBOnGPU()
{
    size_t g_work_size[] = {IMAGE_WIDTH, IMAGE_HEIGHT};
    size_t l_work_size[] = {16, 16};
    cl_event events[1];

    // Non-blocking upload of the source image.
    clEnqueueWriteBuffer(CLQueue, CLRGBInBuf, CL_FALSE, 0,
                         IMAGE_WIDTH * IMAGE_HEIGHT * 3, RGBIn, 0, 0, 0);
    clEnqueueNDRangeKernel(CLQueue, CLKnlRGB, 2, 0,
                           g_work_size, l_work_size, 0, 0, events);
    // Blocking read of the result, so the call returns once the GPU is done.
    clEnqueueReadBuffer(CLQueue, CLRGBOutBuf, CL_TRUE, 0,
                        IMAGE_WIDTH * IMAGE_HEIGHT * 3, RGBOutGPU, 0, 0, 0);
}

The problem is that the same program, running on the same MacBook Pro (9400M/9600M GT) with a 2048*2048 image, takes about 5 ms on Mac OS X but about 250 ms on Windows XP! Why is the performance so different?

Any help is really appreciated. Thanks in advance.

ZhaoYu

Are these runs from the same laptop (e.g. a virtual machine or Boot Camp install of XP)? If yes, whose OpenCL implementation are you invoking on the XP side, NVIDIA’s? Or are they run on different machines? If so, which graphics card is the XP machine using, and is its OpenCL implementation NVIDIA’s?
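If you’re not sure, something along these lines (a rough, untested sketch using only standard OpenCL 1.0 calls, assuming CL/cl.h and stdio.h are already included) will print which platforms are installed and whose implementation your program actually picks up:


// Sketch: list the available OpenCL platforms and their vendors/versions.
cl_platform_id platforms[8];
cl_uint num_platforms = 0;
char name[256], vendor[256], version[256];

clGetPlatformIDs(8, platforms, &num_platforms);
for (cl_uint i = 0; i < num_platforms && i < 8; ++i)
{
    clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,    sizeof(name),    name,    0);
    clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR,  sizeof(vendor),  vendor,  0);
    clGetPlatformInfo(platforms[i], CL_PLATFORM_VERSION, sizeof(version), version, 0);
    printf("platform %u: %s | %s | %s\n", i, name, vendor, version);
}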

I’m fairly new too, so take this with a grain of salt. If I had this problem, my primary guess would be that it’s a difference in the implementations of the API. Each kernel should take roughly the same amount of time if it isn’t relying on behavioral defaults.

Try this on Windows:


// Force the command queue to finish after each step
// (or just make your WriteBuffer a blocking call).
clEnqueueWriteBuffer(CLQueue, CLRGBInBuf, CL_FALSE, 0,
                     IMAGE_WIDTH * IMAGE_HEIGHT * 3, RGBIn, 0, 0, 0);
clFinish(CLQueue);
clEnqueueNDRangeKernel(CLQueue, CLKnlRGB, 2, 0,
                       g_work_size, l_work_size, 0, 0, events);
clFinish(CLQueue);
clEnqueueReadBuffer(CLQueue, CLRGBOutBuf, CL_TRUE, 0,
                    IMAGE_WIDTH * IMAGE_HEIGHT * 3, RGBOutGPU, 0, 0, 0);

If that doesn’t work, it could be a hardware difference that is butting up against the global size of your problem. I see you’ve tiled it based on l_work_size[], but you’re not doing anything special with the memory (for example, copying global memory into local (shared) memory across a work-group); see the sketch below. Also try a local size of 512x512 instead of 16x16 and see how that behaves.
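By “copying into local memory” I mean something like the following, a minimal untested sketch written for a single channel and a fixed 16x16 work-group just to show the shape of it (the kernel name, the TILE constant, and the edge clamping are my own additions, not from your code):


#define TILE 16

__kernel void ConvolveGrayTiled(__global const uchar *in,
                                __global uchar *out,
                                int width, int height)
{
    __local uchar tile[TILE + 2][TILE + 2];   // 16x16 block plus a 1-pixel halo

    const int gx = get_global_id(0);
    const int gy = get_global_id(1);
    const int lx = get_local_id(0);
    const int ly = get_local_id(1);

    // Cooperatively stage the tile: each work-item loads its own pixel,
    // items near the edges also load the halo. Coordinates are clamped
    // so out-of-range reads stay inside the image.
    for (int dy = ly; dy < TILE + 2; dy += TILE)
        for (int dx = lx; dx < TILE + 2; dx += TILE)
        {
            int sx = clamp(gx - lx + dx - 1, 0, width  - 1);
            int sy = clamp(gy - ly + dy - 1, 0, height - 1);
            tile[dy][dx] = in[sy * width + sx];
        }
    barrier(CLK_LOCAL_MEM_FENCE);             // wait until the tile is complete

    if (gx >= width || gy >= height)
        return;

    // Same 3x3 weights as the original kernel (sum = 16).
    int v = 1 * tile[ly][lx]     + 2 * tile[ly][lx + 1]     + 1 * tile[ly][lx + 2]
          + 2 * tile[ly + 1][lx] + 4 * tile[ly + 1][lx + 1] + 2 * tile[ly + 1][lx + 2]
          + 1 * tile[ly + 2][lx] + 2 * tile[ly + 2][lx + 1] + 1 * tile[ly + 2][lx + 2];
    out[gy * width + gx] = (uchar)(v / 16);
}


The idea is that each 16x16 group reads its 18x18 tile from global memory once, and the nine taps per pixel are then served from local memory.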

Yes, the tests are on the same laptop (Boot Camp with XP). I also tried it on a Dell desktop with a GTX 310, but it takes about 80 ms there too.

The test program only invokes the OpenCL API, not CUDA. I just compile it with Xcode on the Mac and Visual C++ 2008 on the PC.

I also suspect it’s an ICD implementation problem. OpenCL is a really new standard; maybe not every vendor has reached the same level of maturity yet.

Sorry, your suggested code doesn’t help. And you can’t set 512x512 as the local size: the maximum work-group size is just 512, so 22x22 is the largest square setting.
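The limit can be queried like this (a quick sketch; CLDevice here stands in for whatever cl_device_id handle the program already holds, and it assumes stdio.h is included):


// Sketch: query the device-wide and per-kernel work-group limits.
size_t max_wg = 0, kernel_wg = 0, max_item_sizes[3] = {0};

clGetDeviceInfo(CLDevice, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_wg), &max_wg, 0);
clGetDeviceInfo(CLDevice, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                sizeof(max_item_sizes), max_item_sizes, 0);
clGetKernelWorkGroupInfo(CLKnlRGB, CLDevice, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(kernel_wg), &kernel_wg, 0);

printf("device max work-group size: %lu (per dim: %lu x %lu x %lu)\n",
       (unsigned long)max_wg, (unsigned long)max_item_sizes[0],
       (unsigned long)max_item_sizes[1], (unsigned long)max_item_sizes[2]);
printf("kernel max work-group size: %lu\n", (unsigned long)kernel_wg);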

I know there are improvements that could be made to the memory access. But my question is: why does Apple’s implementation do so much better than NVIDIA’s (50x)?

I can think of no good reason. It could be any number of things, from memory management at the OS level to queue management inside the library, and several other things besides.

You could probably run a code profiler to check where your code is spending most of its time. Make your EnqueueWriteBuffer/Read calls blocking, then check whether that’s what is taking so long, or whether it’s the kernel execution.
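OpenCL’s built-in event profiling is enough for a first pass. A sketch, assuming the queue was created with CL_QUEUE_PROFILING_ENABLE and using the same variable names as your snippet:


// Sketch: time the kernel with OpenCL event profiling (times are in nanoseconds).
cl_event evt;
cl_ulong t_start = 0, t_end = 0;

clEnqueueNDRangeKernel(CLQueue, CLKnlRGB, 2, 0,
                       g_work_size, l_work_size, 0, 0, &evt);
clWaitForEvents(1, &evt);

clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(t_start), &t_start, 0);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, 0);
printf("kernel time: %.3f ms\n", (t_end - t_start) * 1e-6);
clReleaseEvent(evt);


Grab events from the write/read buffer calls in the same way to see whether the transfers or the kernel dominate the 250 ms.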

Thanks. I will try more.

I don’t know whether software that claims to be optimized with OpenCL really does a better job on all platforms. I don’t doubt that an NVIDIA GPU can do a great job if I optimize for it. But consistent performance is exactly what an “industry standard” is supposed to deliver, isn’t it? I’ll have to give up optimizing our software with OpenCL until it gets better support from the different vendors.

The standard is young. It only guarantees that implementations will execute the requirements correctly; standards generally have few, if any, performance mandates.

That said, Apple knows very well how to optimize OS memory management for its Mac series of products. NVIDIA has to be a bit more generic.

The reason I recommend you profile the code is to determine if it is memory management, or the kernel. If it is kernel, then you may be able to figure out how to optimize the kernel. If it is memory management, you may have to wait for a more mature technology.

You may also want to check that the drivers you are using on the Windows install are up to date (just a thought).
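For example, the version strings the runtime reports can be dumped like this (a sketch; CLDevice stands in for whatever cl_device_id you already have, and stdio.h is assumed to be included):


// Sketch: print the device name plus the driver and OpenCL versions in use.
char devname[256], drv[256], devver[256];
clGetDeviceInfo(CLDevice, CL_DEVICE_NAME,    sizeof(devname), devname, 0);
clGetDeviceInfo(CLDevice, CL_DRIVER_VERSION, sizeof(drv),     drv,     0);
clGetDeviceInfo(CLDevice, CL_DEVICE_VERSION, sizeof(devver),  devver,  0);
printf("%s | driver %s | %s\n", devname, drv, devver);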

Thanks, Alex. NVIDIA does have some documents about how to optimize OpenCL memory access. I also tested the NVIDIA CUDA filter example; it gets more than 60 fps (i.e. less than 2 ms) on my Mac.

BTW, I did install the latest driver and CUDA SDK 3.1.