Some examples I've seen simply use the first device returned by clGetContextInfo with CL_CONTEXT_DEVICES. This is obviously fine for single-GPU systems, but what happens on multi-GPU systems? Will all but one GPU sit idle, or does OpenCL spread the load to all devices even if there aren't any command queues created for them? What is the right way to make sure a program scales well from single to multiple GPUs (devices)?
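For context on what "scaling" would involve: a command queue is created for one specific device, so as far as I understand, a device with no queue receives no work, and the host program has to divide the work itself. Below is a minimal sketch of that division step only, assuming a 1-D global range and devices of roughly equal speed. The names work_slice and slice_for_device are hypothetical helpers for illustration, not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type (not part of OpenCL): the part of a 1-D
   global range that one device should process. */
typedef struct {
    size_t offset; /* first global work-item index for this device */
    size_t count;  /* number of work-items for this device */
} work_slice;

/* Split global_size work-items as evenly as possible across num_devices.
   The first (global_size % num_devices) devices get one extra item, so
   the slices are contiguous and cover the whole range exactly once. */
static work_slice slice_for_device(size_t global_size,
                                   size_t num_devices,
                                   size_t index)
{
    assert(num_devices > 0 && index < num_devices);

    size_t base  = global_size / num_devices;
    size_t extra = global_size % num_devices;

    work_slice s;
    s.count  = base + (index < extra ? 1 : 0);
    s.offset = index * base + (index < extra ? index : extra);
    return s;
}
```

With one device this returns the whole range, so the same code path covers the single-GPU case. On a multi-GPU system, each slice would presumably be enqueued on that device's own queue, for example by passing the slice's offset and count as the global_work_offset and global_work_size arguments of clEnqueueNDRangeKernel (non-NULL offsets assume OpenCL 1.1 or later). An even split is only an assumption; devices of different speeds would need weighted slices.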