clEnqueueNDRangeKernel for two different devices not performing as expected

This is the code snippet where the OpenCL kernels are enqueued:

/* Round the column count up to a multiple of the local work-group size.
   The CPU takes the first ceil(splittingPoint) work-groups of rows. */
globalWorkSizeCPU[0] = (size_t)ceil(((float)ncols) / ((float)DIM_LOCAL_WORK_GROUP_X)) * DIM_LOCAL_WORK_GROUP_X;
globalWorkSizeCPU[1] = (size_t)ceil(splittingPoint) * DIM_LOCAL_WORK_GROUP_Y;

offsetCPU[0] = 0;
offsetCPU[1] = 0;

/* The GPU covers the remaining rows, starting where the CPU portion ends. */
globalWorkSizeGPU[0] = (size_t)ceil(((float)ncols) / ((float)DIM_LOCAL_WORK_GROUP_X)) * DIM_LOCAL_WORK_GROUP_X;
globalWorkSizeGPU[1] = (size_t)ceil(((float)nrows) / ((float)DIM_LOCAL_WORK_GROUP_Y) - splittingPoint) * DIM_LOCAL_WORK_GROUP_Y;

offsetGPU[0] = 0;
offsetGPU[1] = (size_t)ceil(splittingPoint) * DIM_LOCAL_WORK_GROUP_Y;

/* One kernel per device, each on its own command queue. */
errcode = clEnqueueNDRangeKernel(clGPUCommandQue, clGPUKernel, 2, offsetGPU, globalWorkSizeGPU, localWorkSize, 0, NULL, NULL);
errcode = clEnqueueNDRangeKernel(clCPUCommandQue, clCPUKernel, 2, offsetCPU, globalWorkSizeCPU, localWorkSize, 0, NULL, NULL);

The problem is that although I enqueue both kernels, the timing results suggest they are not executing in parallel.
The two command queues appear to be scheduled one after the other.

I am using the Odroid XU-3 board, which has an ARM CPU and a Mali GPU. The two devices are exposed through different OpenCL platforms.

Could anyone help me solve this issue? It is fairly urgent.

I also tried reversing the order of the two clEnqueueNDRangeKernel calls, but it made no difference.
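For reference, here is a sketch of the same enqueue with explicit flushes before any blocking wait. Whether the missing flush is actually the cause here is only my assumption: clEnqueueNDRangeKernel merely queues the work, so the runtime is free to defer submission until a blocking call, which could serialize the two devices.

errcode = clEnqueueNDRangeKernel(clGPUCommandQue, clGPUKernel, 2, offsetGPU, globalWorkSizeGPU, localWorkSize, 0, NULL, NULL);
errcode = clEnqueueNDRangeKernel(clCPUCommandQue, clCPUKernel, 2, offsetCPU, globalWorkSizeCPU, localWorkSize, 0, NULL, NULL);

/* Push the queued commands to their devices before waiting, so that
   neither device sits idle while the other runs. */
clFlush(clGPUCommandQue);
clFlush(clCPUCommandQue);

/* Now block until both devices have finished. */
clFinish(clGPUCommandQue);
clFinish(clCPUCommandQue);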

Aha, you have run into the same problem I faced.
It should not be this way, but it seems to be the case.
In my case, my laptop has an Intel CPU and an NVIDIA GPU, and the CPU is more powerful than the GPU, roughly 3:1.
I let them work together in your way (I think), and it took the same time as running on the CPU alone.
The work is distributed to the two devices in a 3:1 ratio, but it did not save any time.
I am still looking for the reason, but I am not very optimistic.
Good luck!
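In case it helps to compare notes, here is a sketch of what I mean by splitting the work in a 3:1 ratio. The names cpuShare, groupsY, cpuGroupsY and gpuGroupsY are only illustrative, and I am assuming splittingPoint counts whole work-groups along Y, as in your snippet.

float cpuShare = 3.0f / (3.0f + 1.0f);                                        /* measured CPU:GPU ratio of 3:1 */
size_t groupsY = (size_t)ceil((float)nrows / (float)DIM_LOCAL_WORK_GROUP_Y);  /* total work-groups along Y */
size_t cpuGroupsY = (size_t)(cpuShare * groupsY);                             /* whole work-groups given to the CPU */
size_t gpuGroupsY = groupsY - cpuGroupsY;                                     /* the remainder goes to the GPU */

globalWorkSizeCPU[1] = cpuGroupsY * DIM_LOCAL_WORK_GROUP_Y;
globalWorkSizeGPU[1] = gpuGroupsY * DIM_LOCAL_WORK_GROUP_Y;
offsetGPU[1]         = cpuGroupsY * DIM_LOCAL_WORK_GROUP_Y;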

Did you find anything useful for this issue?
I tried all kinds of things, but in vain.

Could someone else please help me with this?

I have tested my program on another computer, and the result suggests that perhaps there is no problem with our approach.
That computer is equipped with four powerful GPUs (GeForce GTX 970), and its CPU has no OpenCL driver (it is not my computer, so…).
When I used one GPU, the task took 3604 ms, and it took 1404 ms when all four GPUs were used.
I created a separate context for every GPU, even though they are all identical devices.
The scaling is not ideal, but it is acceptable.
My guess at what puzzles us is that we have a CPU that is too powerful and a GPU that is too weak.
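For what it is worth, this is roughly what I mean by "a separate context for every GPU" (a sketch with error handling abbreviated; it assumes up to four GPUs on a single platform):

#include <CL/cl.h>

/* One context and one command queue per GPU, so each device can be driven independently. */
cl_platform_id platform;
cl_device_id devices[4];
cl_uint numDevices;
cl_int err;

clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 4, devices, &numDevices);

cl_context contexts[4];
cl_command_queue queues[4];

for (cl_uint i = 0; i < numDevices; ++i) {
    contexts[i] = clCreateContext(NULL, 1, &devices[i], NULL, NULL, &err);
    queues[i] = clCreateCommandQueue(contexts[i], devices[i], 0, &err);
}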

I ran the program only on the CPU and only on the GPU. Based on those runs, the GPU is more powerful than the CPU, but not by a large margin.
I am still very puzzled by this.
Could someone please help me with this?