My code needs to run multiple kernels repeatedly, in order. Here is what I did:
when calling clCreateCommandQueue, I set 'cl_command_queue_properties properties' to '0', or to 'CL_QUEUE_PROFILING_ENABLE' when I need timing.
Then, between each 'clEnqueueNDRangeKernel' or 'clEnqueueReadBuffer' call, I call 'clEnqueueBarrier(commandqueue)' as a barrier.

But I have a strange problem that I think is related to the kernels not being executed in order.

Is there anything I am missing here? Thank you very much.

By the way, I create one context, one program, one command queue, and many different kernels from the same program, and run everything on one GPU.