Comparing the time required to add two arrays of integers on the available OpenCL devices

I have also posted this question on StackOverflow and on Reddit.

I’m very new to the whole OpenCL world, so I’m following some beginner tutorials. I’m trying to combine this and this to compare the time required to add two arrays together on different devices. However, I’m getting confusing results. Since the code is too long to post inline, I put it in this GitHub Gist.

On my Mac I have one platform with three devices. When I manually set the j in

cl_command_queue command_queue = clCreateCommandQueue(context, device_id[j], 0, &ret);

to 0, the calculation seems to run on the CPU (about 5.75 seconds). When I set it to 1 or 2, the calculation time drops drastically (about 0.01076 seconds), which I assume is because the calculation is being run on my Intel or AMD GPU. But then there are some issues:

1. I can set j to any higher number and the calculation still seems to run on a GPU (see the sketch I’ve added after this list).
2. When I put the whole calculation in a loop over the devices, the measured times for all devices are the same as the CPU time (or so I presume).
3. The times required for all j > 0 are suspiciously close; I wonder whether the calculations are really being run on different devices.
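From reading the clGetDeviceIDs documentation, I suspect issue 1 happens because I never check how many devices the call actually returned, so an out-of-range j just indexes into entries of the array that were never written. Is something like this the check I’m missing? (A minimal sketch with my own variable names, not the exact code from the Gist.)

#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main(void) {
    cl_platform_id platform;
    cl_device_id device_id[8];
    cl_uint num_devices = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, device_id, &num_devices);

    /* Print every device so each index j can be mapped to real hardware. */
    for (cl_uint j = 0; j < num_devices; j++) {
        char name[256];
        clGetDeviceInfo(device_id[j], CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("j = %u: %s\n", j, name);
    }

    /* Entries at index >= num_devices were never filled in by clGetDeviceIDs,
       so passing one of them to clCreateCommandQueue uses a garbage handle. */
    cl_uint j = 5; /* hypothetical out-of-range index */
    if (j >= num_devices) {
        fprintf(stderr, "j = %u is out of range (only %u devices)\n", j, num_devices);
        return 1;
    }
    return 0;
}

I also guess I should be checking ret after clCreateCommandQueue, since an invalid device handle should produce CL_INVALID_DEVICE rather than silently running somewhere.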
I clearly have no clue about OpenCL, so I would appreciate it if you could take a look at my code and let me know what my mistake(s) are and how to solve them. Or maybe point me towards a good example that runs a calculation on different devices and compares the times.
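Regarding the timing itself: is event profiling the right way to compare devices? My current understanding is that timing on the host can easily measure enqueue overhead or include setup cost, while CL_QUEUE_PROFILING_ENABLE gives device-side timestamps for just the kernel. Here is my best attempt at a self-contained comparison loop (again a sketch under my own assumptions, not the code from the Gist):

/* Build on macOS with: clang compare.c -framework OpenCL */
#include <stdio.h>
#include <stdlib.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

#define N (1 << 20)

static const char *src =
    "__kernel void add(__global const int *a, __global const int *b,"
    "                  __global int *c) {"
    "    size_t i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

int main(void) {
    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint num_devices = 0;
    cl_int ret;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

    int *a = malloc(N * sizeof(int));
    int *b = malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    for (cl_uint j = 0; j < num_devices; j++) {
        char name[256];
        clGetDeviceInfo(devices[j], CL_DEVICE_NAME, sizeof(name), name, NULL);

        /* Fresh context/queue per device; profiling must be enabled at creation. */
        cl_context ctx = clCreateContext(NULL, 1, &devices[j], NULL, NULL, &ret);
        cl_command_queue q = clCreateCommandQueue(ctx, devices[j],
                                                  CL_QUEUE_PROFILING_ENABLE, &ret);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &ret);
        clBuildProgram(prog, 1, &devices[j], NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "add", &ret);

        cl_mem ma = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   N * sizeof(int), a, &ret);
        cl_mem mb = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   N * sizeof(int), b, &ret);
        cl_mem mc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, N * sizeof(int), NULL, &ret);
        clSetKernelArg(k, 0, sizeof(cl_mem), &ma);
        clSetKernelArg(k, 1, sizeof(cl_mem), &mb);
        clSetKernelArg(k, 2, sizeof(cl_mem), &mc);

        size_t global = N;
        cl_event ev;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, &ev);
        clWaitForEvents(1, &ev); /* block until the kernel has actually finished */

        /* Device-side timestamps in nanoseconds, covering only kernel execution. */
        cl_ulong t0, t1;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
        printf("%-40s %.6f s\n", name, (double)(t1 - t0) * 1e-9);

        clReleaseEvent(ev);
        clReleaseMemObject(ma); clReleaseMemObject(mb); clReleaseMemObject(mc);
        clReleaseKernel(k); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
    }
    free(a); free(b);
    return 0;
}

If this is roughly right, I would still like to understand why my original wall-clock measurements behave the way they do.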