Huge kernel overhead on Mac

Hello,
Sometimes I get huge kernel overhead.
I measure the kernel's execution time in two ways:


double start_time_total = get_time();

cl::Event event;
queue.enqueueNDRangeKernel(kernel,
                           cl::NullRange, // offset
                           global,        // global work size
                           cl::NullRange, // local work size
                           NULL,          // pre-requisite events
                           &event);

double gpu_profiling_time =
    event.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
    event.getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>();
gpu_profiling_time *= 1.0e-9; // convert nanoseconds to seconds

double end_time_total = get_time();
gpu_total_time = end_time_total - start_time_total;

Here, get_time() uses gettimeofday() to return the current time in seconds as a double.
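For reference, it is essentially a thin gettimeofday() wrapper along these lines (simplified):


#include <sys/time.h>

// Return the current wall-clock time in seconds as a double.
double get_time()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}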

When the CPU is used as the OpenCL device, the difference between gpu_total_time and gpu_profiling_time makes sense.
However, when I use my GPU (AMD 6750M, on a MacBook Pro) the overhead is sometimes huge: 0.000619 s of profiled kernel time versus 0.032589 s measured from the host side (roughly 50x slower).
The problem occurs consistently with specific kernels.

Here is the prototype of the kernel if it helps:


kernel void resize(
                   __read_only image2d_t src, 
                   __write_only image2d_t dst, 
                   int width, 
                   int height,
                   float scale_x, 
                   float scale_y)

Note that the problem does not exist on Windows with NVidia hardware (at least for the specific device that I tried).

Any ideas for a solution?

Thanks in advance!
Yoav

It looks like the way you are measuring time on the host side is incorrect. Starting the clock before calling clEnqueueNDRangeKernel() and stopping it once it returns doesn’t measure the same thing as gpu_profiling_time.

clEnqueueNDRangeKernel() is analogous to ordering a pizza. It takes very little time. However, what you want to know is how long it takes to bake the pizza in the oven. That’s what gpu_profiling_time is giving you.

This has been discussed a few times in the past. I suggest using the search feature to find more information.
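In other words, block on the kernel's event (or call finish() on the queue) before taking the second host timestamp, roughly like this:


double start_time_total = get_time();

cl::Event event;
queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, cl::NullRange,
                           NULL, &event);
event.wait(); // or queue.finish(); either blocks until the kernel has completed

double end_time_total = get_time(); // now this includes the kernel's execution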

Sorry, I forgot to mention that I do call queue.finish() before the second get_time() on the host side; I'm aware that clEnqueueNDRangeKernel() is a non-blocking operation.
However, the problem persists despite the queue.finish(). Something really strange is going on here.

Here is the fixed code snippet:


double start_time_total = get_time();

cl::Event event;
queue.enqueueNDRangeKernel(kernel,
                           cl::NullRange, // offset
                           global,        // global work size
                           cl::NullRange, // local work size
                           NULL,          // pre-requisite events
                           &event);
queue.finish(); // block until the kernel has completed

double gpu_profiling_time =
    event.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
    event.getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>();
gpu_profiling_time *= 1.0e-9; // convert nanoseconds to seconds

double end_time_total = get_time();
gpu_total_time = end_time_total - start_time_total;

I searched for “measuring time” but couldn’t find anything about this problem.

Thanks in advance,
Yoav

I’ve just noticed that the huge overhead happens only the first time I run the kernel.

I do build it in advance and cache the program, but are there additional operations that Apple’s implementation performs when running a kernel for the first time?
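By “cache the program” I mean the usual binary-caching approach, roughly along these lines (raw C API, single device, error handling omitted):


#include <OpenCL/opencl.h>
#include <vector>

// After the first successful clBuildProgram(): grab the device binary so it
// can be written to a cache file.
std::vector<unsigned char> get_program_binary(cl_program program)
{
    size_t binary_size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizeof(binary_size), &binary_size, NULL);

    std::vector<unsigned char> binary(binary_size);
    unsigned char *ptr = &binary[0];
    clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                     sizeof(ptr), &ptr, NULL);
    return binary;
}

// On later runs: rebuild the program from the cached binary instead of source.
cl_program program_from_binary(cl_context context, cl_device_id device,
                               const std::vector<unsigned char> &binary)
{
    const unsigned char *ptr = &binary[0];
    size_t size = binary.size();
    cl_int status, err;
    cl_program program = clCreateProgramWithBinary(context, 1, &device,
                                                   &size, &ptr, &status, &err);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    return program;
}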

I see the same thing, and I believe this first-time delay is normal and predominantly related to lazy buffer allocation on the compute device.

You didn’t post your kernel code / arguments, but I can speculate.
If you have an output buffer declared on the device, the implementation has no reason to actually allocate it until you run the kernel the first time.

Even the kernel code might not be moved over to the device until its first use, and then cached afterward.

That makes sense, thanks.
I think the specification should allow such lazy operations to be prevented, by adding flags to the kernel and buffer constructors.
The application may be able to perform the allocation asynchronously ahead of time, even when it cannot run the kernel itself early. In such cases, the lazy approach is a waste of time.

How is the lazy approach a waste of time? It only happens once, and it has to happen once no matter what. If you’re doing any micro-benchmarking, it is a given that you cannot get reliable results unless you let the system ‘warm up’ first, i.e. do a couple of dummy runs.
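Something like this, assuming the same queue / kernel / global objects as in your snippet:


// A couple of untimed warm-up runs so lazy allocation, code upload, etc.
// are paid for before the timed run starts.
for (int i = 0; i < 2; ++i)
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, cl::NullRange);
queue.finish(); // make sure the warm-up work has actually completed

// Only now start the real measurement.
double start_time_total = get_time();

cl::Event event;
queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, cl::NullRange,
                           NULL, &event);
queue.finish();

double end_time_total = get_time();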

BTW, the ‘lazy allocation’ is at the operating-system level and beyond such a specification’s scope. Although a Unix process can be given a large virtual address space, those pages do not exist until they are first accessed. This is not something a driver could change.