Since the input data has been split into different buffer objects for different devices, you will need a separate kernel object for each device as well. For example, if buffers B1 and B2 are assigned to device D1, create a kernel object K1, set its kernel arguments to B1 and B2, then enqueue an NDRange using K1 on D1. Do the same for buffers B3 and B4 assigned to device D2: create a kernel K2, set its kernel arguments to B3 and B4, then enqueue an NDRange using K2 on D2. Repeat this for any remaining devices.
Thank you again for your quick reply, but it is still not quite clear to me.

Let me ask it this way: in your pseudo-code, the context and program are created for all devices, but a kernel is created for each device. However, in clCreateKernel(), I cannot find an argument that specifies which device the kernel is associated with. I can only pass "program" as the first argument, and it is already associated with all devices.

Could you explain a little more?