Looping kernels do not produce constant timings

Hi OpenCL community,

I would appreciate it if any of you could help me with the following issue. I have a program in which I use the same kernels over and over inside a “for” loop. The pseudo-code of my program is as follows:


Initialize OpenCL (devices, queue, kernels, create buffers, set arguments, etc.)

for (parameters) {

   read data

   tic()
   rewrite buffers with CL_TRUE enabled (blocking writes)
   toc()

   tic()
   run kernel 1
   clFinish()
   toc()

   ...

   tic()
   run kernel n
   clFinish()
   toc()

   tic()
   read output buffer
   toc()

   tic()
   C functions using the output
   toc()

}

where tic() and toc() are time-measurement functions similar to MATLAB's, which I use to profile the performance of my code. I am not using the OpenCL profiling functions because I am working with the Nexus 10 and they do not work properly there.

My question is the following:
When I plot the times for all the running kernels, I see that in some iterations the timings are not constant, as they should be: they start at some value, randomly jump to a higher value for a few iterations, and then drop back to a time between the minimum (the expected one) and the maximum. Does anyone have a hint about what may be causing this?

I tried replacing clFinish with clFlush, using both, and using neither. Also, when I run only one iteration of the process with the same input that produces the maximum value, it works fine and produces the minimum expected time. Finally, if I add a sleep(100 ms) at the end of the loop, the times stay constant (at the minimum value) for all the kernels, as they should.

Thanks for your time and advice.

LC

It could be that other GPU operations are getting “caught” by your clFinish and you are timing those as well, such as OpenGL drawing your screen. Try creating an OpenCL event for each kernel and reading the profiling stats from those events to measure the execution time of the kernels themselves. Are those more consistent? You could also use vendor tools to measure kernel performance (e.g., NVIDIA Parallel Nsight, AMD APP Profiler). Note: make sure to call clReleaseEvent on each event after you have the stats you need. Also note: to get high-performance OpenCL code, you shouldn't be calling clFinish at all. Just queue up work and read back results (with blocking reads).
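The event-based timing suggested above could be sketched like this. It assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, and the names `queue`, `kernel`, and `gws` stand in for the poster's existing objects:

```c
#include <CL/cl.h>

/* Sketch: time one kernel launch via OpenCL event profiling instead of
 * host-side tic()/toc(). Requires a queue created with
 * CL_QUEUE_PROFILING_ENABLE; error checking omitted for brevity. */
static double kernel_time_ms(cl_command_queue queue, cl_kernel kernel,
                             size_t gws)
{
    cl_event evt;
    cl_ulong start = 0, end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL,
                           0, NULL, &evt);
    clWaitForEvents(1, &evt); /* wait for this kernel only */

    /* Device-side timestamps, in nanoseconds. */
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof start, &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof end, &end, NULL);

    clReleaseEvent(evt); /* avoid leaking one event per iteration */

    return (double)(end - start) / 1e6;
}
```

Because these timestamps come from the device, they exclude anything else the host-side clFinish might be waiting on, which should make it easier to tell whether the kernels themselves or other queued GPU work cause the spikes.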

Another source of this kind of behavior could be dynamic frequency scaling in the GPU. As it heats up, the driver may reduce the GPU frequency; as it cools, it may raise the frequency again. You could look for ways to query the current GPU frequency to confirm this.