Multi-GPU System, multiple contexts or command queues?

I’ve been working with OpenCL for the past year now, and I’m about to start my first multi-GPU design. I’m going to have some data that will be split up among 9 GPUs. I’m trying to decide whether the easiest way to do this is to have 1 context with 9 command queues (1 for each GPU) or 1 context for each GPU. All of the input data for each GPU will be unique to that GPU, with the exception of a few constants. If anyone has any thoughts, I would love to hear them.
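For concreteness, the single-context option I have in mind would look roughly like this (untested sketch; NUM_GPUS and the lack of error checking are just placeholders):

#include <stdio.h>
#include <CL/cl.h>

#define NUM_GPUS 9   /* placeholder: however many GPUs the platform actually reports */

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[NUM_GPUS];
    cl_uint num_devices = 0;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, NUM_GPUS, devices, &num_devices);

    /* One context spanning every GPU */
    cl_context ctx = clCreateContext(NULL, num_devices, devices, NULL, NULL, &err);

    /* One command queue per device, all sharing that context */
    cl_command_queue queues[NUM_GPUS];
    for (cl_uint i = 0; i < num_devices; ++i)
        queues[i] = clCreateCommandQueue(ctx, devices[i], 0, &err);

    printf("Created %u queues in one context\n", num_devices);
    return 0;
}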

Just following up on my own post in case anyone else is searching for this information. I found this thread, which sums it up pretty well (in my opinion). I’d still be interested in hearing other views on this topic.

http://forums.nvidia.com/index.php?showtopic=176628

I’m trying to decide if the easiest way to do this is by having 1 context with 9 command queues (1 for each GPU) or have 1 context for each GPU.

If you have a single context with multiple command queues, you can make commands in one queue depend on commands in another queue. In other words, it allows you to synchronize between devices.

Synchronization between two or more contexts is not possible.
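A rough sketch of what that cross-queue synchronization looks like, assuming queues[0] and queues[1] were created from the same context and producer_kernel / consumer_kernel are placeholder kernels you’ve already built:

cl_event produced;
size_t global = 1024;                      /* placeholder work size */

/* Device 0 produces a result and hands back an event */
clEnqueueNDRangeKernel(queues[0], producer_kernel, 1, NULL,
                       &global, NULL, 0, NULL, &produced);
clFlush(queues[0]);   /* make sure the producer command is actually submitted */

/* Device 1's command waits on device 0's event -- this only works because
   both queues were created from the same context */
clEnqueueNDRangeKernel(queues[1], consumer_kernel, 1, NULL,
                       &global, NULL, 1, &produced, NULL);

Note the clFlush on the producing queue: when one queue waits on an event from another, you have to make sure the producing command has been submitted, or the consumer can wait forever.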

I am very surprised at the results from the thread you linked. If the data is accurate and I’m understanding it correctly, it looks like there’s no concurrency between the two devices, which is very odd. There are several possible reasons for that, including the application introducing incorrect dependencies between commands.

I too would expect a single context with multiple devices to be preferable. In addition to being able to synchronize between them, they could also share buffers. My wild-assed guess about the test result in the linked thread is that the author may have a buffer that is inadvertently shared between the devices and is therefore being implicitly synchronized. I would be seriously disappointed in the NVIDIA implementation if it were incapable of having multiple devices operating concurrently!
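To make that sharing point concrete, here’s an untested sketch of keeping each GPU’s input in its own buffer within the shared context, so the runtime never has a reason to migrate or synchronize data between devices (CHUNK_BYTES, ITEMS_PER_GPU, host_chunk and my_kernel are placeholder names):

cl_mem inputs[NUM_GPUS];
size_t global = ITEMS_PER_GPU;   /* placeholder work size per device */
cl_int err;

for (cl_uint i = 0; i < num_devices; ++i) {
    /* A separate buffer for each GPU, holding only that GPU's slice of the data */
    inputs[i] = clCreateBuffer(ctx, CL_MEM_READ_ONLY, CHUNK_BYTES, NULL, &err);
    clEnqueueWriteBuffer(queues[i], inputs[i], CL_FALSE, 0,
                         CHUNK_BYTES, host_chunk[i], 0, NULL, NULL);

    /* Each enqueue only ever touches its own device's buffer, so there is
       nothing for the runtime to share or implicitly synchronize between GPUs */
    clSetKernelArg(my_kernel, 0, sizeof(cl_mem), &inputs[i]);
    clEnqueueNDRangeKernel(queues[i], my_kernel, 1, NULL,
                           &global, NULL, 0, NULL, NULL);
}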