Reduce the number of registers

Hello All,

I wrote a kernel that evaluates an equation. When I ran it through the Visual Profiler, I found that I can't run all of the available threads in parallel: the GPU can run 1563 threads in parallel, but I only reach 1024 because of the register limit. The profiler says I use 26 registers per thread, which is high. Can anyone give me a hint on how to reduce the number of registers per thread?

The kernel:


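/* double needs this extension on OpenCL 1.0/1.1 devices */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void relCalculation(const __global int* a,
                             const __global double* fProb,
                             const __global float* T,
                             __global int* Output,
                             const int max,
                             const int idx,
                             const int col,
                             const double rel,
                             const double t)
{
    const int i = get_global_id(0);

    /* inclusive bound: max is treated as the last valid row index */
    if (i <= max)
    {
        /* product over the col entries of row i */
        double GP = 1;
        for (int j = 0; j < col; ++j)
        {
            GP *= pow((1 - pow(fProb[j + idx], a[i*col + j])), t/T[j]);
        }

        /* flag the row if the product reaches the threshold rel */
        Output[i] = (GP >= rel) ? 1 : 0;
    }
}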


See this thread: http://forums.nvidia.com/index.php?showtopic=171148

But maximum parallelism doesn’t mean maximum performance: the system dynamics are much more complex than that.
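To make the register limit from the question concrete, here is a rough sketch in C of the arithmetic behind it. It uses a simplified model: registers are handed out per whole work-group, and allocation granularity is ignored. The register file size (32768 per multiprocessor) and the work-group size (256) are assumptions and do not come from the post. The function name is hypothetical.

```c
#include <assert.h>

/* Upper bound on resident threads per multiprocessor when registers are
 * the limiting factor. Simplified model: only whole work-groups fit and
 * register allocation granularity is ignored. */
static int reg_limited_threads(int regs_per_sm, int regs_per_thread,
                               int group_size, int max_threads_per_sm)
{
    assert(regs_per_sm > 0 && regs_per_thread > 0 && group_size > 0);
    int regs_per_group = regs_per_thread * group_size;
    int groups = regs_per_sm / regs_per_group;   /* whole groups only */
    int threads = groups * group_size;
    /* never report more than the hardware thread limit */
    return threads < max_threads_per_sm ? threads : max_threads_per_sm;
}
```

Under those assumed values, `reg_limited_threads(32768, 26, 256, 1563)` gives 1024, which matches the figure the profiler reported, and the kernel would need 21 registers or fewer before registers stop being the limit in this model. Even then, higher occupancy only helps if the kernel is actually limited by latency hiding, so it is worth measuring run time rather than occupancy alone.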