How expensive is calling clEnqueueNDRangeKernel?

I have a function Run() that calls execution of two kernels:

QUOTE
void Run()
{
//I’am using C++ bindings
queue->enqueueNDRangeKernel(*kernelRow, cl::NullRange, *globalRangeRow, *localRangeRow, NULL, eventRow);
queue->enqueueNDRangeKernel(*kernelColumn, cl::NullRange, *globalRangeCol, *localRangeCol, NULL, eventCol);
queue.finish()
}

// As you see, I’m using events (eventRow, eventCol) because of profiling.

How expensive (time performance) is calling enqueueNDRangeKernel (or clEnqueueNDRangeKernel ).

With Nvidia OpenCL Profiler, I got total time of execution (on GPU) 351 ms, but when I measured time of running of method Run()
I got 622 ms.

Why this difference is so large?

I tested on NVIDIA GT240.
I also tested on ATI HD 5670 and difference is much smaller.

When is data transfered to GPU, on calling clEnqueueNDRangeKernel or when buffer is created (clCreateBuffer)?

The overhead of calling clEnqueueNDRangeKernel should be fairly small.
I guess the problem is the data transfer. If you use clCreateBuffer with CL_MEM_COPY_HOST_PTR, the data only gets copied to the device when you call clEnqueueNDRangeKernel, because only then does the runtime know which device is using the data. Try using clEnqueueWriteBuffer to copy data to your device and see if it makes a difference.

There may also be some extra overhead associated with the first launch of a kernel. You should measure several kernel launches and then average the results.

There is a noticeable overhead that scales with the size of the buffers pointed to in the kernel arguments associated with invoking enqueueNDRangeKernel() for the first time even if they’ve already been written to the device, at least when using Apple’s implementation in Snow Leopard. From what I’ve discovered this can be alleviated by invoking a dummy kernel (i.e one with no instructions) with the same arguments before running the actual kernel. Or alternatively just invoking the original kernel repeatedly (of course this will take longer).

EDIT: minor grammatical changes.

Thanks Barneybear, you’re right.
I solved my problem, simply invoking a dummy kernel.