OpenCL Kernel thread execution time

Hello!
I am working on a homework assignment and have run into a problem.
The task is to measure the total execution time of an OpenCL kernel and also the execution time of each kernel thread.

I read this http://software.intel.com/sites/landingpage/opencl/optimization-guide/Profiling_Operations_Using_OpenCL_Profiling_Events.htm and learned how to measure the total execution time, but I can’t find anything about measuring the time of each kernel thread.

I tried using standard functions inside the kernel, such as gettimeofday (I work on FC17), but without success: when I include <sys/time.h> in the kernel, I get errors when running the program, something like this: catastrophic error, type long long is undefined.

Is it possible to measure the execution time of each kernel thread?
Thanks.

There is no direct way of measuring the time taken by each thread.

Estimating the time per thread might be possible for really simple kernels, but it requires assumptions about how the hardware works.
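To make the kind of assumption involved concrete, here is a rough sketch in C. It assumes a simplified model that is not from this thread and will not match real hardware exactly: every work-item does the same amount of work, and the device runs a fixed number of work-items at once in back-to-back "waves". The function name and the parallel_items parameter are hypothetical.

```c
#include <stddef.h>

/* Rough per-work-item time estimate under a simplified model (an assumption):
   the device runs `parallel_items` work-items at the same time, every
   work-item takes equally long, and waves run back to back.
   Returns a negative value if the inputs make no sense. */
double estimate_per_item_ns(double total_ns, size_t global_size,
                            size_t parallel_items)
{
    if (global_size == 0 || parallel_items == 0 || total_ns < 0.0)
        return -1.0;

    /* Number of waves needed to cover all work-items (rounded up). */
    size_t waves = (global_size + parallel_items - 1) / parallel_items;

    /* Each wave, and so each work-item in it, takes an equal share. */
    return total_ns / (double)waves;
}
```

With a 2000 ms kernel, 1024 work-items and an assumed 256 work-items in flight, this model gives 4 waves and about 500 ms per work-item. The result is only as good as the guess for parallel_items, which is exactly the hardware assumption mentioned above.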

I would suggest you ask your lecturer to explain what it is he/she wants.

As for using functions like gettimeofday, they are not available in OpenCL. Only functions that are part of the OpenCL specification are available (chapter 6 of the OpenCL specification describes them in detail, the reference card lists all of them).

Right, you can’t measure each thread. But you can measure the time for the entire kernel.

Set CL_QUEUE_PROFILING_ENABLE on your command queue. Then, when you call clEnqueueNDRangeKernel, pass in a cl_event object, and after you call clFinish (or otherwise wait for the kernel to complete), call clGetEventProfilingInfo and subtract the value you get for CL_PROFILING_COMMAND_START from the value you get for CL_PROFILING_COMMAND_END. This will be the kernel execution time in nanoseconds.

In the OpenCL 1.2 specification, see “5.12 Profiling Operations on Memory Objects and Kernels”.

Thanks, guys.
I understand your answers, but what about measuring ticks?
I wrote a kernel:


#include <sys/times.h>
__kernel void square(
   __global float* input,
   __global float* output,
   __global float* timesS,
   __global float* timesE,
   const unsigned int count) 
{
   struct tms time_buf;
   clock_t start_time = times(&time_buf);
   int i = get_global_id(0); 
   if (i < count)
       for (int j = 0; j < 1000000; ++j)
           output[i] = input[i] * input[i];
   if (i < count) {
       timesS[i] = start_time;
       clock_t end_time = times(&time_buf);
       timesE[i] = end_time;
   }
}

The total time for this kernel is around 2000 ms.
times() works in the kernel, but the tick counts at the start and end of the kernel are usually the same. Sometimes the difference is 32 (one step), but 32 ticks = 320 ms (1 tick = 10 ms).
What is the reason for this behaviour?
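One thing that may contribute to the 32-tick step is that the kernel stores the clock_t values in float buffers. A 32-bit float keeps only 24 significant bits, so integers between 2^28 and 2^29 are rounded to multiples of 32. Whether the tick values returned by times() were actually in that range is an assumption, not something confirmed in this thread. The plain C sketch below (hypothetical helper names) shows the rounding and the tick arithmetic from the post:

```c
/* Round-trip an integer tick count through a 32-bit float, which is what
   happens when a clock_t value is written into a float buffer.
   A float keeps 24 significant bits, so large counts get rounded. */
long through_float(long ticks)
{
    float stored = (float)ticks;   /* rounded to the nearest representable float */
    return (long)stored;
}

/* Convert a tick count to milliseconds for a given tick rate.
   100 ticks per second corresponds to 1 tick = 10 ms, as in the post. */
double ticks_to_ms(long ticks, long ticks_per_second)
{
    return 1000.0 * (double)ticks / (double)ticks_per_second;
}
```

For example, through_float(300000010) returns 300000000 and through_float(300000020) returns 300000032, so two different tick counts less than 32 apart can come back identical or exactly 32 apart. ticks_to_ms(32, 100) returns 320.0, matching the 320 ms figure above. Under this assumption, writing the timestamps to an integer buffer instead of a float one would avoid the rounding.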