I wrote a kernel that evaluates an equation. When I ran it under the Visual Profiler, I found that I can't run all of the available threads in parallel: the GPU can run 1563 threads in parallel, but I only reach 1024 because of the register limit. The profiler reports 26 registers per thread, which is high. Can anyone give me a hint on how to reduce the number of registers per thread?

// double needs the cl_khr_fp64 extension on OpenCL 1.0/1.1 devices
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void relCalculation(const __global int* a,
                             const __global double* fProb,
                             const __global float* T,
                             __global int* Output,
                             const int max,
                             const int idx,
                             const int col,
                             const double rel,
                             const double t)
{
    const int i = get_global_id(0);
    if (i <= max)
    {
        // Product over the col entries of row i
        double GP = 1;
        for (int j = 0; j < col; ++j)
        {
            GP *= pow((1 - pow(fProb[j + idx], a[i*col + j])), t/T[j]);
        }
        // Flag the row when the product reaches the threshold rel
        if (GP >= rel)
            Output[i] = 1;
        else
            Output[i] = 0;
    }
}

See this thread on the NVIDIA forums: http://forums.nvidia.com/index.php?showtopic=171148

Keep in mind, though, that maximum parallelism doesn't guarantee maximum performance: the system dynamics are much more complex than that.
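
To see where a figure like 1024 can come from, here is a rough sketch in plain C of how register usage might cap the number of resident threads. It is a simplified model, not the profiler's actual calculation. Only the 26 registers per thread comes from the question. The helper name resident_threads is hypothetical, and the register file size (32768), the work-group size (256) and the hardware thread cap (1536) used in the note below are assumptions about the device.

```c
#include <assert.h>

/* Simplified occupancy model (hypothetical helper, not a vendor API):
 * how many threads can be resident on one multiprocessor when the
 * register file is the limiting resource. All hardware figures are
 * supplied by the caller and are assumptions, not measured values. */
static int resident_threads(int regs_per_sm, int regs_per_thread,
                            int block_size, int max_threads_per_sm)
{
    assert(regs_per_sm > 0 && regs_per_thread > 0);
    assert(block_size > 0 && max_threads_per_sm > 0);

    /* Registers consumed by one whole work-group. */
    int regs_per_block = regs_per_thread * block_size;

    /* Only whole work-groups can be scheduled, so round down. */
    int blocks = regs_per_sm / regs_per_block;
    int threads = blocks * block_size;

    /* The hardware limit on resident threads still applies. */
    return threads < max_threads_per_sm ? threads : max_threads_per_sm;
}
```

Under those assumed figures, resident_threads(32768, 26, 256, 1536) gives 1024, while the same call with 20 registers per thread gives 1536. In this model, dropping a few registers only helps once one more whole work-group fits, and even then the extra resident threads may not translate into a faster kernel.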