Kernel execution time is 0

Hello, I’m currently working on my first project with the OpenCL C++ wrapper and the Nvidia SDK. The program runs, and the kernel does simple math: it multiplies two arrays element by element and takes the square root of each product. Each array has 10 thousand numbers. I had some trouble recording the time, but some people on the internet helped me find the correct syntax and methods. But even with profiling enabled and an event attached to the kernel launch, I get a time of 0. My kernel code:

std::string kernel_code =
		"   void kernel simple_add(global const float* A, global const float* B, global float* C){ "
		"       int id = get_global_id(0); "
		"       C[id] = sqrt(A[id] * B[id]); "
		"   } ";

And the code that launches the kernel and does the profiling:

cl::CommandQueue queue(context, default_device, CL_QUEUE_PROFILING_ENABLE);
	queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(float) * 10000, A);
	queue.enqueueWriteBuffer(buffer_B, CL_TRUE, 0, sizeof(float) * 10000, B);
	cl::Kernel kernel_add = cl::Kernel(program, "simple_add");
	kernel_add.setArg(0, buffer_A);
	cl::Event event;
	queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, cl::NDRange(10000), cl::NullRange, NULL,&event);
	queue.finish();
	float C[10000];
	queue.enqueueReadBuffer(buffer_C, CL_TRUE, 0, sizeof(float) * 10000, C);
	cl_ulong time_start, time_end;
	time_start = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
	time_end = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
	double time = time_end - time_start;
	std::cout << "START: " << time_start << "\n";
	std::cout << "END: " << time_end << "\n";
	std::cout << "TIME: " << time << "\n";

Any ideas why it is giving 0?

Does this kernel even run? Check the return value of enqueueNDRangeKernel. NVIDIA limits the number of work-items to the size of a block, the CUDA way.

How is that done? I’m quite new to OpenCL.

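cl_int status = queue.enqueueReadBuffer(buffer_C, CL_TRUE, 0, sizeof(float) * 10000, C);
if (status != CL_SUCCESS) {
	// Something went wrong; status holds the OpenCL error code.
	// enqueueNDRangeKernel returns a cl_int that can be checked the same way.
}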

If this is what happens in your code, I can only advise installing a CPU-based driver from AMD or Intel; I failed to find a newbie-friendly solution. Or restrict yourself to safe numbers like 512 or 256 work-items.
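For what it is worth, the status check above could be wrapped in a small helper so every enqueue call gets checked the same way. This is only a rough sketch: `describe_status` and `check` are hypothetical names, not part of the OpenCL wrapper, and the integer constants are stand-ins for `CL_SUCCESS`, `CL_INVALID_KERNEL_ARGS` and `CL_INVALID_WORK_GROUP_SIZE` from the OpenCL headers so the snippet compiles without them. Per the OpenCL specification, `CL_INVALID_KERNEL_ARGS` is what an NDRange enqueue returns when some kernel argument has not been set.

```cpp
#include <stdexcept>
#include <string>

// Stand-ins for values defined in the OpenCL headers (CL/cl.h).
constexpr int kClSuccess = 0;                 // CL_SUCCESS
constexpr int kClInvalidKernelArgs = -52;     // CL_INVALID_KERNEL_ARGS
constexpr int kClInvalidWorkGroupSize = -54;  // CL_INVALID_WORK_GROUP_SIZE

// Hypothetical helper: turn a status code into a readable message.
std::string describe_status(int status) {
    switch (status) {
        case kClSuccess:              return "success";
        case kClInvalidKernelArgs:    return "kernel arguments not all set";
        case kClInvalidWorkGroupSize: return "invalid work-group size";
        default:                      return "OpenCL error " + std::to_string(status);
    }
}

// Hypothetical helper: throw if an enqueue call did not succeed.
void check(int status, const std::string& what) {
    if (status != kClSuccess) {
        throw std::runtime_error(what + " failed: " + describe_status(status));
    }
}
```

With the real headers you would pass the wrapper's return value straight in, for example `check(queue.enqueueNDRangeKernel(...), "enqueueNDRangeKernel");`, and a failed launch would then stop the program instead of silently reporting a time of 0.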

I checked and the queue was created, so I looked at my kernel setup again. Well, I found the reason for the 0 time, and it’s quite embarrassing. It turns out I didn’t include the C buffer in the argument list (REALLY STUPID), probably a casualty of all the editing and messing with the code. The kernel always ran, but as it didn’t have anything to read, it didn’t do anything at all. Hence the 0 time. I just fixed it, and now it gives 9344 nanoseconds. I’m really sorry that I wasted your time, but thanks for the help anyway.