Concurrency between a CPU kernel execution and a GPU data transfer

Hi. I am testing a simple heterogeneous computing program with OpenCL using one CPU and one GPU.

The CPU queue has one NDRangeKernel().
The GPU queue has two WriteBuffer() calls, one NDRangeKernel(), and one ReadBuffer().

And say:
CPU_time = NDRangeKernel()
GPU_time = 2*WriteBuffer() + NDRangeKernel() + ReadBuffer()
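For reference, here is roughly what my host code looks like (simplified sketch; all the names like cpu_queue, buf_a, etc. are illustrative, not from my real code):

```c
#include <CL/cl.h>

/* Simplified sketch of the enqueue pattern described above; all names
 * (cpu_queue, gpu_queue, buf_a, ...) are illustrative placeholders. */
void run_both(cl_command_queue cpu_queue, cl_command_queue gpu_queue,
              cl_kernel cpu_kernel, cl_kernel gpu_kernel,
              cl_mem buf_a, cl_mem buf_b, cl_mem buf_c,
              void *host_a, void *host_b, void *host_c,
              size_t size, size_t global_gpu, size_t global_cpu)
{
    /* GPU job: two uploads, one kernel, one readback (all non-blocking) */
    clEnqueueWriteBuffer(gpu_queue, buf_a, CL_FALSE, 0, size, host_a, 0, NULL, NULL);
    clEnqueueWriteBuffer(gpu_queue, buf_b, CL_FALSE, 0, size, host_b, 0, NULL, NULL);
    clEnqueueNDRangeKernel(gpu_queue, gpu_kernel, 1, NULL, &global_gpu, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(gpu_queue, buf_c, CL_FALSE, 0, size, host_c, 0, NULL, NULL);

    /* CPU job: one independent kernel on its own queue */
    clEnqueueNDRangeKernel(cpu_queue, cpu_kernel, 1, NULL, &global_cpu, NULL, 0, NULL, NULL);

    /* Wait for both devices */
    clFinish(gpu_queue);
    clFinish(cpu_queue);
}
```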

The CPU and GPU jobs are completely independent.
I expected that if the CPU and the GPU run concurrently, the total elapsed time would be max(CPU_time, GPU_time).

But the actual result was close to (CPU_time + GPU_time), which suggests the CPU and the GPU are not executing in parallel.
So I analyzed the run with a profiler to find out what was wrong.
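Here is roughly how I read the per-command timestamps from profiling events (a sketch; it assumes the queues were created with CL_QUEUE_PROFILING_ENABLE and that each enqueue call returned an event):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: read per-command timestamps from an OpenCL profiling event.
 * Requires the queue to be created with CL_QUEUE_PROFILING_ENABLE. */
static void print_event_times(cl_event ev, const char *label)
{
    cl_ulong queued, start, end;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
    /* A large queued->start gap on the first WriteBuffer() is what points
     * at the transfer sitting idle while the CPU kernel runs. */
    printf("%s: waited %.3f ms, ran %.3f ms\n", label,
           (start - queued) * 1e-6, (end - start) * 1e-6);
}
```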

One strange thing showed up when the CPU has a heavy job while the GPU's job is small (real data: the CPU takes 0.05 sec and the GPU takes 0.01 sec).
It looks like the first WriteBuffer() operation was delayed until the CPU kernel completed, because the CPU was busy with its own computation.

Has anyone seen this problem before?

Totally correct. If you stress your CPU before you have uploaded your data to the GPU, the read/write operations stall because the CPU is busy working on your CPU OpenCL tasks. If you want to compare execution times, don't let the CPU work while you initialize the GPU, or reduce the CPU workload so that there is still some headroom on the CPU to start the GPU jobs.
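For example, something like this ordering (variable names are hypothetical) gives the runtime a chance to issue the GPU commands before the heavy CPU kernel ties up the host:

```c
/* Enqueue and flush the GPU work first, while the CPU is still idle. */
clEnqueueWriteBuffer(gpu_queue, buf_a, CL_FALSE, 0, size, host_a, 0, NULL, NULL);
clEnqueueWriteBuffer(gpu_queue, buf_b, CL_FALSE, 0, size, host_b, 0, NULL, NULL);
clEnqueueNDRangeKernel(gpu_queue, gpu_kernel, 1, NULL, &global_gpu, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(gpu_queue, buf_c, CL_FALSE, 0, size, host_c, 0, NULL, NULL);
clFlush(gpu_queue);   /* start issuing GPU commands now, don't wait */

/* Only then start the heavy CPU kernel. */
clEnqueueNDRangeKernel(cpu_queue, cpu_kernel, 1, NULL, &global_cpu, NULL, 0, NULL, NULL);

clFinish(gpu_queue);
clFinish(cpu_queue);
```

Note that clFlush() only guarantees the commands get issued to the device; whether the transfer actually overlaps with the CPU kernel still depends on the driver.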

It will also depend on your host system. Some systems can do the data transfer via DMA without going through the CPU. Pinned memory might let you bypass the CPU-side copy, but obviously something still has to run on the CPU to coordinate all this.
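As a rough sketch (the exact behavior is vendor-specific), CL_MEM_ALLOC_HOST_PTR plus map/unmap is the usual way to get pinned memory in OpenCL; ctx, gpu_queue, and size are assumed from the earlier setup:

```c
/* Rough sketch: CL_MEM_ALLOC_HOST_PTR asks the driver for host memory it can
 * pin, so the DMA engine can move the data without an extra staging copy.
 * Whether this actually avoids the copy is up to the vendor's driver. */
cl_int err;
cl_mem pinned = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                               size, NULL, &err);

/* Map to fill it from the host, then unmap to hand it back to the device. */
float *p = (float *)clEnqueueMapBuffer(gpu_queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                       0, size, 0, NULL, NULL, &err);
/* ... fill p with input data ... */
clEnqueueUnmapMemObject(gpu_queue, pinned, p, 0, NULL, NULL);
```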