A weird problem with "clEnqueueNDRangeKernel"!

Hello guys,

I’m new to OpenCL and I’m seeing a strange issue. I have a reduction kernel that I run repeatedly. When I profile the kernel with OpenCL events, the elapsed time (queued->end) stays almost constant and rises only slightly. But when I measure the elapsed time in the C++ code around the “clEnqueueNDRangeKernel” call (the timer also covers the clFinish that follows), the time keeps growing from one iteration to the next. I have attached both the code and the profiling output.

	// execute the kernel
	globalWorkSize[0] = this->reduction_NumBlocks * this->reduction_NumThreads;
	localWorkSize[0] = this->reduction_NumThreads;

	//Start Time
	ttt.start();

	clErrNum = clEnqueueNDRangeKernel(clCommandQueue, kernelReduction, 1, 0,
			globalWorkSize, localWorkSize, 0, NULL, &timing_event);
	// check if kernel execution generated an error
	oclCheckError(clErrNum, CL_SUCCESS);

	clFinish(clCommandQueue);
	ttt.stop();

	//Check Elapsed Time
	clGetEventProfilingInfo(timing_event, CL_PROFILING_COMMAND_QUEUED,
			sizeof(time_start), &time_start, NULL);
	clGetEventProfilingInfo(timing_event, CL_PROFILING_COMMAND_END,
			sizeof(time_end), &time_end, NULL);
	cout<<"ElapseTime(Execute):"<<(time_end - time_start)/1000<<"us	TTT:"<<ttt.getElapsedTimeInMicroSec()<<endl;

Output:
[ul]


GeForce GTX 550 Ti
Device Timer Resolution:1000ns
GpuExecutionTime:160us	C++ElapsedTime:177
GpuExecutionTime:156us	C++ElapsedTime:167
GpuExecutionTime:156us	C++ElapsedTime:166
GpuExecutionTime:189us	C++ElapsedTime:242
GpuExecutionTime:158us	C++ElapsedTime:215
...
GpuExecutionTime:156us	C++ElapsedTime:253
GpuExecutionTime:162us	C++ElapsedTime:261
GpuExecutionTime:157us	C++ElapsedTime:262
GpuExecutionTime:156us	C++ElapsedTime:254
GpuExecutionTime:157us	C++ElapsedTime:254
GpuExecutionTime:160us	C++ElapsedTime:261
GpuExecutionTime:167us	C++ElapsedTime:279
GpuExecutionTime:157us	C++ElapsedTime:264
...
GpuExecutionTime:159us	C++ElapsedTime:263
GpuExecutionTime:157us	C++ElapsedTime:261
GpuExecutionTime:157us	C++ElapsedTime:260
GpuExecutionTime:157us	C++ElapsedTime:263
GpuExecutionTime:183us	C++ElapsedTime:287
GpuExecutionTime:159us	C++ElapsedTime:275
GpuExecutionTime:158us	C++ElapsedTime:285
GpuExecutionTime:184us	C++ElapsedTime:289
GpuExecutionTime:163us	C++ElapsedTime:271
GpuExecutionTime:264us	C++ElapsedTime:384
..
GpuExecutionTime:156us	C++ElapsedTime:304
GpuExecutionTime:161us	C++ElapsedTime:314
GpuExecutionTime:157us	C++ElapsedTime:308
GpuExecutionTime:160us	C++ElapsedTime:305
GpuExecutionTime:158us	C++ElapsedTime:311
GpuExecutionTime:156us	C++ElapsedTime:308
GpuExecutionTime:157us	C++ElapsedTime:307
GpuExecutionTime:164us	C++ElapsedTime:320
GpuExecutionTime:159us	C++ElapsedTime:328
GpuExecutionTime:157us	C++ElapsedTime:306
GpuExecutionTime:157us	C++ElapsedTime:309
GpuExecutionTime:157us	C++ElapsedTime:312
...
GpuExecutionTime:157us	C++ElapsedTime:326
GpuExecutionTime:158us	C++ElapsedTime:326
GpuExecutionTime:159us	C++ElapsedTime:330
GpuExecutionTime:158us	C++ElapsedTime:328
GpuExecutionTime:158us	C++ElapsedTime:335

[/ul]
Any kind of help is appreciated.

P.S. The input size and the other related variables are fixed.

Really, no ideas?
I’m completely confused.

Does that event leak every iteration?
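
If the event is the culprit, one way to rule it out is to release it at the end of every iteration. The sketch below is not the original code: `FakeEvent`, `fakeEnqueue`, and `fakeRelease` are hypothetical stand-ins so that it runs without an OpenCL device, and `ScopedHandle` and `runIterations` are illustrative helpers. In the real loop the stand-ins would presumably be `cl_event`, the event returned through `clEnqueueNDRangeKernel`, and `clReleaseEvent`.

```cpp
// Hypothetical stand-ins so the sketch runs without an OpenCL device.
using FakeEvent = int;
static int g_liveEvents = 0;  // events created but not yet released

FakeEvent fakeEnqueue() { return ++g_liveEvents; }        // plays the role of the enqueue call
int fakeRelease(FakeEvent) { --g_liveEvents; return 0; }  // plays the role of clReleaseEvent

// Illustrative RAII guard: calls release(handle) when it goes out of scope.
template <typename Handle, typename Ret>
class ScopedHandle {
public:
    using ReleaseFn = Ret (*)(Handle);
    ScopedHandle(Handle handle, ReleaseFn release)
        : handle_(handle), release_(release) {}
    ~ScopedHandle() { release_(handle_); }
    ScopedHandle(const ScopedHandle&) = delete;
    ScopedHandle& operator=(const ScopedHandle&) = delete;
    Handle get() const { return handle_; }

private:
    Handle handle_;
    ReleaseFn release_;
};

// Runs n iterations and returns how many events are still alive afterwards.
int runIterations(int n, bool releaseEachIteration) {
    g_liveEvents = 0;
    for (int i = 0; i < n; ++i) {
        FakeEvent ev = fakeEnqueue();
        (void)ev;  // silence the unused-variable warning in the leaky path
        if (releaseEachIteration) {
            ScopedHandle<FakeEvent, int> guard(ev, fakeRelease);
            // ... wait for the kernel and read the profiling info here ...
        }  // guard releases the event at this brace
    }
    return g_liveEvents;
}
```

Calling `runIterations(100, false)` leaves one live event per iteration, while `runIterations(100, true)` leaves none. Whether leaked events explain the growing host-side time here is only a guess until it is tested.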

Try finishing the queue before recording the begin and end times.
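
A rough sketch of what I mean, assuming the timer can be swapped for `std::chrono`. The name `timeSynchronizedMicros` and the two callbacks are hypothetical: `drain` stands for whatever blocks until the queue is empty (with OpenCL, a lambda that calls `clFinish(clCommandQueue)`), and `submit` stands for the `clEnqueueNDRangeKernel` call.

```cpp
#include <chrono>
#include <functional>

// Times `submit` with the queue drained before the clock starts and again
// before it stops, so work left over from earlier iterations is not billed
// to this measurement. Returns the elapsed wall-clock time in microseconds.
long long timeSynchronizedMicros(const std::function<void()>& drain,
                                 const std::function<void()>& submit) {
    drain();  // flush anything still pending from previous iterations
    const auto begin = std::chrono::steady_clock::now();
    submit();  // enqueue the kernel
    drain();   // wait until the kernel has actually finished
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - begin)
        .count();
}
```

If the host-side time still grows with the queue drained on both sides of the timer, the extra cost is probably not pending work from earlier iterations.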

It might be interesting to run the test in a profiler and see where the main thread spends its time between the start and stop timer calls.

Actually, I found the reason!
It was a small memory leak that grew every iteration and caused the delay.
Anyway, thanks a lot for your help.