Measuring time inside a kernel on an Intel iGPU

duttasankha · November 4, 2018, 4:54pm

Hello Everyone

I am very new to OpenCL, but I have a fair amount of experience with GPU programming in CUDA. In CUDA I used the clock function inside kernels to measure the ticks taken by certain operations. I wrote a simple OpenCL vector addition kernel and ran it on the Intel integrated GPU. The program ran fine and gave the correct output. But then I tried to use the clock function inside the kernel function and there is a JIT compilation error while executing clBuildProgram. The vector addition kernel that I wanted to execute is provided below:



__kernel void testVecAdd(__global const int *a,__global const int *b,__global int *c,
		         __global float *t){

clock_t start = clock();

 int gid = get_global_id(0);
 c[gid] = a[gid] + b[gid];
 
 t[gid] = (float)(clock()-start)/CLOCKS_PER_SEC;

}

The errors are as follows:



/home/duttasankha/Desktop/SANKHA_ALL/IGPU_RESEARCH_RELATED/OCL_PRAC_DIR/test_OCL_1.cl:6:2: error: use of undeclared identifier 'clock_t'
 clock_t start = clock();
 ^
/home/duttasankha/Desktop/SANKHA_ALL/IGPU_RESEARCH_RELATED/OCL_PRAC_DIR/test_OCL_1.cl:11:19: error: implicit declaration of function 'clock' is invalid in OpenCL
 t[gid] = (float)(clock()-start)/CLOCKS_PER_SEC;
                  ^
/home/duttasankha/Desktop/SANKHA_ALL/IGPU_RESEARCH_RELATED/OCL_PRAC_DIR/test_OCL_1.cl:11:27: error: use of undeclared identifier 'start'; did you mean 'sqrt'?
 t[gid] = (float)(clock()-start)/CLOCKS_PER_SEC;
                          ^~~~~
                          sqrt
CTHeader.h:5277:40: note: 'sqrt' declared here
double16 __attribute__((overloadable)) sqrt(double16);
                                       ^
/home/duttasankha/Desktop/SANKHA_ALL/IGPU_RESEARCH_RELATED/OCL_PRAC_DIR/test_OCL_1.cl:11:27: error: taking address of function is not allowed
 t[gid] = (float)(clock()-start)/CLOCKS_PER_SEC;
                          ^
/home/duttasankha/Desktop/SANKHA_ALL/IGPU_RESEARCH_RELATED/OCL_PRAC_DIR/test_OCL_1.cl:11:34: error: use of undeclared identifier 'CLOCKS_PER_SEC'
 t[gid] = (float)(clock()-start)/CLOCKS_PER_SEC;
                                 ^

Failed to build program...: -11 (CL_BUILD_PROGRAM_FAILURE)
Build failed!

I was able to do this in CUDA because it supports the clock function, but I could not achieve the same thing on the Intel iGPU. I also tried other functions to measure the ticks, but none of them worked either. I also tried offline compilation using ioc64 and got the same errors. I was wondering whether I am doing something wrong here, or whether getting ticks using clock (or similar) functions is simply not possible on the Intel integrated GPU. It is absolutely necessary for me to get these execution traces. So if using the clock function is not a viable option, what would be the alternative option to achieve the same goal, and how can I use it? Thank you.

bashbaug · November 5, 2018, 7:47am

But then I tried to use the clock function inside the kernel function and there is a JIT compilation error while executing clBuildProgram.

The short answer is that standard OpenCL C does not have a clock() function… but this probably isn’t what you want to do anyhow. A clock() function would operate per work-item, but you probably have thousands if not millions of work-items, and you probably don’t care how long any one work-item takes to execute so long as the entire NDRange of work-items executes quickly. As an analogy, in most cases you don’t want to measure the performance of each individual inner loop iteration; rather, you want to measure the performance of the entire loop.

what would be the alternative option to achieve the same goal, and how can I use it?

OpenCL “event profiling” is one alternate option that allows you to query when an OpenCL event transitions between states using clGetEventProfilingInfo():

https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clGetEventProfilingInfo.html

If you’re OK measuring the performance of the entire NDRange command then you can use the START and END queries to figure out how long your kernel was executing on the device. You can do this by instrumenting your code yourself, or tools can manage the queries for you, for example:

https://github.com/intel/opencl-intercept-layer/blob/main/docs/controls.md#deviceperformancetiming-bool

Hope this helps!