How to know if the kernels are executing concurrently?

I have an NVIDIA GPU with Compute Capability 3.0, so it should support 16 concurrent kernels. I am launching 10 kernels by calling clEnqueueNDRangeKernel 10 times in a loop. Each kernel is tied to a different command queue. How can I tell whether the kernels are executing concurrently?

One approach I have thought of is to record the time before and after the clEnqueueNDRangeKernel call. I would probably have to use events to make sure each kernel has actually finished executing. But I still suspect that the loop will start the kernels sequentially. Can someone tell me whether this is the right way to launch concurrent kernels?

Also, what if I start more than 16 kernels (say 20)? Will they be executed in batches of 16, i.e. the first 16 in the first batch and the remaining 4 in the next?

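The start and end times from each kernel’s event profiling information should show this, since they are all measured against the same reference clock: if the intervals overlap, the kernels ran concurrently. The profiler should also show concurrent execution, and that’s easier than adding manual timing code. And if nothing else, the total execution time should be shorter than it would be with the kernels running one after another.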
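To make the overlap check concrete, here is a minimal sketch in plain C. It assumes you have already collected each kernel's start and end timestamps, for example by calling clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on the event of each enqueue, with the queues created with CL_QUEUE_PROFILING_ENABLE. The names kernel_span and max_concurrent are made up for this illustration and are not part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One kernel's execution window in device nanoseconds, i.e. the values
   read for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
   (Hypothetical helper type, not an OpenCL type.) */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} kernel_span;

/* A start (+1) or end (-1) of some kernel at time t. */
typedef struct {
    uint64_t t;
    int delta;
} edge;

static int cmp_edge(const void *a, const void *b)
{
    const edge *x = a;
    const edge *y = b;
    if (x->t != y->t)
        return x->t < y->t ? -1 : 1;
    /* At equal times, process ends before starts so that two kernels
       that merely touch are not counted as overlapping. */
    return x->delta - y->delta;
}

/* Returns the largest number of spans active at the same instant,
   0 for an empty input, or -1 if memory allocation fails. */
int max_concurrent(const kernel_span *spans, size_t n)
{
    if (n == 0)
        return 0;

    edge *edges = malloc(2 * n * sizeof *edges);
    if (edges == NULL)
        return -1;

    for (size_t i = 0; i < n; i++) {
        assert(spans[i].end_ns >= spans[i].start_ns);
        edges[2 * i]     = (edge){ spans[i].start_ns, +1 };
        edges[2 * i + 1] = (edge){ spans[i].end_ns,   -1 };
    }
    qsort(edges, 2 * n, sizeof *edges, cmp_edge);

    /* Sweep through time, tracking how many kernels are running. */
    int running = 0;
    int best = 0;
    for (size_t i = 0; i < 2 * n; i++) {
        running += edges[i].delta;
        if (running > best)
            best = running;
    }

    free(edges);
    return best;
}
```

A result of 1 means the kernels ran strictly one after another; anything higher means at least that many were in flight at once. This only inspects timestamps, so it says nothing about why the driver did or did not overlap the work.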

You’d have to refer to the NVIDIA docs for how it manages lots of jobs (if they deem that important enough to include). The obvious choice would be to just run them in FIFO order, but that is only a guess.