GPU/CPU Thread allocation

I have implemented an algorithm using OpenCL and have several versions: CPU-only, multi-GPU, and multi-GPU + CPU. I am working on Mac OS X 10.6.1 with two NVIDIA GT 120s and two quad-core Intel Xeon CPUs. I am seeing some performance behavior I do not understand and am hoping someone can clarify it. My main question is how the OpenCL implementation allocates work to the CPU.

I have profiled the CPU path and the multi-GPU path independently on this machine and established a rough ratio of their performance. I use this ratio to decide up front how much work to send down each path.
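A ratio-based split like the one described above could look roughly like the following. This is only a minimal sketch: the `work_split` type, the `split_work` function, and the idea of expressing each path's speed as a throughput number are illustrative assumptions, not the code actually used in the application.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the three shares (not from the original code). */
typedef struct {
    size_t cpu;
    size_t gpu0;
    size_t gpu1;
} work_split;

/*
 * Split `total` work items in proportion to measured throughputs
 * (for example, items per second from earlier profiling runs).
 * The GPU shares are rounded down; the CPU takes whatever is left,
 * so the three shares always add up to `total`.
 */
work_split split_work(size_t total, double cpu_rate,
                      double gpu0_rate, double gpu1_rate)
{
    assert(cpu_rate >= 0.0 && gpu0_rate >= 0.0 && gpu1_rate >= 0.0);
    double sum = cpu_rate + gpu0_rate + gpu1_rate;
    assert(sum > 0.0);

    work_split s;
    s.gpu0 = (size_t)((double)total * (gpu0_rate / sum));
    s.gpu1 = (size_t)((double)total * (gpu1_rate / sum));

    /* Guard against floating-point rounding pushing the GPU shares past total. */
    if (s.gpu0 > total) s.gpu0 = total;
    if (s.gpu1 > total - s.gpu0) s.gpu1 = total - s.gpu0;

    s.cpu = total - s.gpu0 - s.gpu1;
    return s;
}
```

For example, with 1000 items and relative rates of 6 : 1 : 1 (CPU : GPU 0 : GPU 1), the CPU would receive 750 items and each GPU 125. A static split like this assumes the rates measured in isolation still hold when all devices run together.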

The basic usage is:

  • Generate a workload for each GPU and for the CPU (I have one GPU context with two devices and one CPU context with one device).
  • Create a thread for each workload (three in total). In each thread:
      • Copy the data to the device using OpenCL.
      • Invoke the kernel repeatedly in a loop; every iteration requires a read-back to the host.

Now, what puzzles me is this: if I run the application with a single thread and one OpenCL context on the CPU, I get the best performance. Watching in Activity Monitor, I see the application use approximately 18 threads and consume 1200%+ of the CPU. However, when I run my three-threaded version, where each thread sends work to one device (thread 1: CPU, thread 2: GPU 0, thread 3: GPU 1), the application creates approximately 24 threads and initially uses only about 300% of the CPU. As soon as the GPU threads retire because they have finished their work, the thread driving the CPU context immediately starts consuming 1200%+ again. In other words, having the GPU threads massively slows down the CPU path, so overall performance is worse than running on the CPU context alone. I have tried setting thread priorities, but that did not seem to have any impact.

Could someone help me understand this behavior?

Thanks in advance for any help you can provide.

It sounds like something (possibly a problem in Apple’s OpenCL framework) is blocking execution of the CPU task until the GPU tasks finish. I’ve had performance problems with 2 GPU + 1 CPU configurations on my machine (MacBook Pro), but they were related to buffer reads taking a very long time when multiple devices were active together. This may or may not be related to your problem, but one way to investigate is to turn on profiling, get an event for each enqueued command, and compare the time spent enqueued with the time spent executing. If your problem is similar to mine, you will see that your GPU read operations wait a hugely variable amount of time before they execute, which slows things down significantly if you need to wait for a read-back before you can proceed.
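To make the enqueued-versus-executing comparison concrete, here is a minimal sketch of the arithmetic. In a real program the four timestamps would come from `clGetEventProfilingInfo` (with `CL_PROFILING_COMMAND_QUEUED`, `CL_PROFILING_COMMAND_SUBMIT`, `CL_PROFILING_COMMAND_START`, and `CL_PROFILING_COMMAND_END`) on an event from a queue created with `CL_QUEUE_PROFILING_ENABLE`. To stay self-contained, the sketch takes plain integers instead; the `cmd_times` and `cmd_breakdown` types and the `breakdown` function are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Per-command timestamps in nanoseconds, in the order OpenCL reports them.
 * Hypothetical helper type; each field would normally be filled by one
 * clGetEventProfilingInfo call on the command's event. */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} cmd_times;

typedef struct {
    uint64_t wait_ns;  /* time from enqueue until execution began */
    uint64_t exec_ns;  /* time spent actually executing */
} cmd_breakdown;

/* Split a command's lifetime into waiting time and execution time. */
cmd_breakdown breakdown(cmd_times t)
{
    /* Timestamps are expected to be monotonically ordered. */
    assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);

    cmd_breakdown b;
    b.wait_ns = t.start - t.queued;
    b.exec_ns = t.end - t.start;
    return b;
}
```

If `wait_ns` for the read commands is large and varies a lot from one iteration to the next while `exec_ns` stays small, that would match the symptom described above.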

The right thing to do is to file a bug with Apple:
http://bugreporter.apple.com

I promise to do so for my issues as soon as I get around to writing a simplified test case. :)

dbs2,

Thanks for your reply. I think you were on the right track. I also tested my code with the ATI Stream SDK v2.0 beta on Linux and saw similar behavior on different hardware. I think the issue is that the threads talking to the two GPUs consume so much CPU time on all of the readbacks I am doing that they starve the CPU context of resources. I still don’t entirely understand how the underlying implementation allocates threads to the OpenCL CPU context, but I would guess the behavior will be much more sane for an algorithm that does fewer device-to-host readbacks from the GPU.

Either way, the overall performance of my algorithm on the OpenCL CPU-only version is far better than what I had initially hoped for. I don’t need it to run any faster than it already does, so I think I’ll leave it for now :)

Thanks.