clBuildProgram() for multiple AMD GPUs

Hi,

Yesterday I fixed a nasty bug that only shows up on AMD platforms with multiple GPUs. To sum up my findings: if you build one program separately for each device, enqueueing one of the program’s kernels will fail with CL_INVALID_PROGRAM_EXECUTABLE. Not cool if you want to include different code paths depending on the device architecture.
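
To make the "different code paths" part concrete: one common way to get them is to hand clBuildProgram() a different set of preprocessor defines for each device, which is exactly what leads to one build per device. Below is a minimal sketch of picking those defines. The device names in the table and the macro names (USE_SCALAR_PATH, USE_VEC4_PATH) are made up for illustration, and the helper build_options_for() is hypothetical, not part of OpenCL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping from a device name (as CL_DEVICE_NAME might report it)
   to the define that selects a code path in the kernel source.
   Both the names and the macros are assumptions for this sketch. */
struct arch_option {
    const char *device_name;
    const char *define;
};

static const struct arch_option ARCH_OPTIONS[] = {
    { "Tahiti",  "-DUSE_SCALAR_PATH" },
    { "Cayman",  "-DUSE_VEC4_PATH"   },
    { "Cypress", "-DUSE_VEC4_PATH"   },
};

/* Returns the build options for one device.
   Unknown devices get an empty option string (default code path). */
const char *build_options_for(const char *device_name)
{
    size_t i;
    for (i = 0; i < sizeof ARCH_OPTIONS / sizeof ARCH_OPTIONS[0]; ++i) {
        if (strcmp(device_name, ARCH_OPTIONS[i].device_name) == 0)
            return ARCH_OPTIONS[i].define;
    }
    return "";
}
```

Each returned string would then go into the options argument of a call like clBuildProgram(program, 1, &device, options, NULL, NULL), once per device. As far as I can tell, that per-device call pattern is what triggers the error described above.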