Test kernel, Error:E010:Irreducible ControlFlow Detected

I’m running some extremely simple kernels (single instruction looped many times) to test the speed of different instructions. The following code gives me the error:

Error:E010:Irreducible ControlFlow Detected


__kernel void acosh_test( const unsigned int N ) 
{ 
	float2 i = (float2)(2,2); 
	for( int x=0; x<N; x++ ) 
	{ 
		acosh(i); 
	} 
}

It works fine with “acos” instead of “acosh”.

I’m using an ATI 5870. I tested both in my program and with the ATI Stream Kernel Analyzer, with the same result.

I’ll keep posting problems with specific instruction/type errors.


__kernel void atan2_test( const unsigned int N ) 
{ 
   float2 i,j; 
   for( int x=0; x<N; x++ ) 
   { 
      atan2(i,j); 
   } 
}

This one makes the compiler hang for a long time and sometimes gives a “build failed” error, but it never actually compiles. Does it matter that i and j are not initialized to any value? For other data types it works fine. Side question: if variables are declared but uninitialized, what values will they take? Memory garbage, or zeroes?

I got “Error:E010:Irreducible ControlFlow Detected” again for float2 with atan2pi(i,j), using the same loop as before.

Garbage: uninitialized variables hold indeterminate values, not zeroes. This behavior is inherited from C99.

As for the irreducible control flow error message, I would report it to ATI instead of here.

My best semi-random guess is that, since you are not using the values returned by acosh(), the compiler is optimizing the code so aggressively that at some point mid-compilation the code has irreducible control flow, at which point the compiler gives up.

As a side note, it is very unlikely that the sort of kernels that you are compiling will give you any useful information about the cost of evaluating those trigonometric functions. Since you are throwing away the values returned by these functions, the compiler will optimize the code away and not call the functions even once. Even the loop will not be evaluated. A good compiler will produce an empty kernel.

Thanks David! Everything now outputs to a location in a buffer, and the tests compile/run fine.

Follow-ups:

How meaningful would stats like this be? The time is taken from the elapsed time for the clEnqueueNDRangeKernel event to finish, and time/instruction is simply the total elapsed time divided by the number of iterations.

Is there any way to generate ISA code the way that the OpenCL Profiler does, using API calls? I get the IR code using clGetProgramInfo( …, CL_PROGRAM_BINARIES, … ), but can I get more info on what is being sent to hardware?

How meaningful would stats like this be? The time is taken from the elapsed time for the clEnqueueNDRangeKernel event to finish, and time/instruction is simply the total elapsed time divided by the number of iterations.

Before we spend time measuring the performance of functions like acosh(), it may be worth asking ourselves how frequently we expect to call them. Chances are that for our kernels it doesn’t matter how slow they are.

Leaving that aside, I would recommend creating your command queue with the property CL_QUEUE_PROFILING_ENABLE and then using clGetEventProfilingInfo() (in particular, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) to estimate the cost of executing the kernel.
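A minimal host-side sketch of that measurement follows. It assumes the context, device, and kernel (with its arguments already set) exist elsewhere; the helper name time_kernel_ns is hypothetical, and error handling is reduced to returning 0.

```c
#include <stddef.h>
#include <CL/cl.h>

/* Hypothetical helper: runs `kernel` once over `global_size` work-items and
   returns the device-side execution time in nanoseconds, or 0 on any error. */
cl_ulong time_kernel_ns(cl_context ctx, cl_device_id dev,
                        cl_kernel kernel, size_t global_size)
{
    cl_int err;
    cl_ulong start = 0, end = 0;
    cl_event evt = NULL;

    /* Profiling has to be requested when the queue is created. */
    cl_command_queue queue =
        clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);
    if (err != CL_SUCCESS) return 0;

    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                                 NULL, 0, NULL, &evt);

    /* The timestamps are only available once the command has completed. */
    if (err == CL_SUCCESS)
        err = clWaitForEvents(1, &evt);
    if (err == CL_SUCCESS)
        err = clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                      sizeof(start), &start, NULL);
    if (err == CL_SUCCESS)
        err = clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                      sizeof(end), &end, NULL);

    if (evt) clReleaseEvent(evt);
    clReleaseCommandQueue(queue);

    return (err == CL_SUCCESS) ? (end - start) : 0;
}
```

Dividing the returned value by the loop count gives a rough per-iteration figure, but it also includes loop overhead and any memory traffic, so treat it as an upper bound rather than the cost of the instruction itself.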

You also have to ensure that all the iterations in the loop have an effect on the final output value of the kernel. If you do something like the following, the compiler may replace your loop with a single function call:


__kernel void foo(__global float* in, __global float* out, const uint N)
{
    uint i;
    uint id = get_global_id(0);

    for(i = 0; i < N; ++i)
    {
        // Don't do this :)
        // The compiler may recognize that since all iterations write into the same
        // memory location, only the last one needs to be evaluated.
        // ComplexFunction is a placeholder for the function being measured.
        out[id] = ComplexFunction(in[id] + i);
    }
}

Is there any way to generate ISA code the way that the OpenCL Profiler does, using API calls?

Unfortunately there isn’t.