Sometimes I get huge kernel launch overhead.
I measure the kernel execution time in two ways:
Code :
double start_time_total = get_time();
cl::Event event;
queue.enqueueNDRangeKernel(kernel,
                           cl::NullRange, // offset
                           global,        // global work size
                           cl::NullRange, // local
                           NULL,          // pre-requisite events
                           &event);
event.wait(); // make sure the command has finished before reading the timestamps
double gpu_profiling_time =
        event.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
        event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
gpu_profiling_time *= 1.0e-9; // Convert nanoseconds to seconds
double end_time_total = get_time();
gpu_total_time = end_time_total - start_time_total;
Here, get_time() uses gettimeofday() to return the current wall-clock time in seconds as a double.

When the CPU is used as the OpenCL device, the difference between gpu_total_time and gpu_profiling_time makes sense.
However, when I use my GPU (AMD 6750M, in a MacBook Pro), the overhead is sometimes huge: gpu_profiling_time reports 0.000619 s while gpu_total_time is 0.032589 s (~50x slower when measured from the host side).
The problem occurs consistently with specific kernels.

Here is the prototype of the kernel if it helps:
Code :
kernel void resize(
                   __read_only image2d_t src, 
                   __write_only image2d_t dst, 
                   int width, 
                   int height,
                   float scale_x, 
                   float scale_y)

Note that the problem does not occur on Windows with NVIDIA hardware (at least on the specific device that I tried).

Any ideas for a solution?

Thanks in advance!