Better pipeline utilization causes lower performance in GPUs

I’m developing a simple OpenCL kernel that only does computation, with no memory access at all apart from a single write of the final result. Here is the kind of kernel I’m executing on the GPU:

__kernel void WGS512MAPI8LLXOPS64(const __global float *GIn, __global float *GOut, const int M, const int N, const int P) {
         const int XGL = get_global_id(0);
         const int XGRid = get_group_id(0);
         const int XGRnum = get_num_groups(0);
         const int XLSize = get_local_size(0);
         const int XLid = get_local_id(0);
         // Private accumulators
         float temp1 = 1.0;
         float temp2 = 1.0;
         float temp3 = 1.0;
         float temp4 = 1.0;
         float tempOut;
         float MF = (float) M;
         float NF = (float) N;
         float PF = (float) P;


         // Start of a new level of for loop
         long baseIndex1 = XGRid*XLSize*8+XLid;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         temp1 += temp1 * MF;
         ...  // same line repeated, about 1024 FMA operations in total
         temp1 += temp1 * MF;
         tempOut = temp1 + temp2 + temp3 + temp4;
         GOut[XGRid*XLSize*8+XLid] = tempOut;
}

The total number of “FMA operations” is about 1024. In this kernel, every instruction requires the previous one to finish first, because of the data dependency on temp1. I tried to optimize the execution of the above kernel by using more temp variables to increase the gap between data-dependent operations, like below:

__kernel void WGS512MAPI8LLXOPS64(const __global float *GIn, __global float *GOut, const int M, const int N, const int P) {
         const int XGL = get_global_id(0);
         const int XGRid = get_group_id(0);
         const int XGRnum = get_num_groups(0);
         const int XLSize = get_local_size(0);
         const int XLid = get_local_id(0);
         // Private accumulators
         float temp1 = 1.0;
         float temp2 = 1.0;
         float temp3 = 1.0;
         float temp4 = 1.0;
         float tempOut;
         float MF = (float) M;
         float NF = (float) N;
         float PF = (float) P; 

         // Start of a new level of for loop
         long baseIndex1 = XGRid*XLSize*8+XLid;
         temp1 += temp1 * MF; temp2 += temp2 * NF;
         temp1 += temp1 * MF; temp2 += temp2 * NF;
         temp1 += temp1 * MF; temp2 += temp2 * NF;
         temp1 += temp1 * MF; temp2 += temp2 * NF;
         temp1 += temp1 * MF; temp2 += temp2 * NF;
         ...  // same pattern repeated, with the same ~1024 FMA operations in total split across temp1 and temp2
         temp1 += temp1 * MF; temp2 += temp2 * NF;
         tempOut = temp1 + temp2 + temp3 + temp4;
         GOut[XGRid*XLSize*8+XLid] = tempOut;
}

Executing both kernels and calculating their total GFLOPS (the total number of operations is the same for both), the first kernel gives around 1186.17 GFLOPS and the second kernel gives around 600.58 GFLOPS, which is completely against my initial assumption. Since I have ruled out memory access completely, I cannot come up with another explanation for these numbers.
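
For reference, the kind of host-side measurement behind a GFLOPS number like this looks roughly as follows. This is only a simplified sketch with illustrative sizes, not the exact benchmark code: it assumes queue and kernel already exist, that the queue was created with CL_QUEUE_PROFILING_ENABLE, and that each work-item performs ~1024 FMAs counted as two floating-point operations each.

#include <CL/cl.h>
#include <stdio.h>

/* Illustrative only: launch the kernel once and report GFLOPS, assuming the
   command queue was created with CL_QUEUE_PROFILING_ENABLE and the kernel
   arguments have already been set. */
static void report_gflops(cl_command_queue queue, cl_kernel kernel)
{
    size_t global = 1u << 20;  /* illustrative total number of work-items     */
    size_t local  = 512;       /* work-group size 512, as in the kernel name  */
    cl_event evt;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong t_start = 0, t_end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t_end),   &t_end,   NULL);

    double seconds = (double)(t_end - t_start) * 1e-9;  /* event timestamps are in ns        */
    double flops   = (double)global * 1024.0 * 2.0;     /* assumes ~1024 FMAs x 2 ops per item */
    printf("%.2f GFLOPS\n", flops / (seconds * 1e9));

    clReleaseEvent(evt);
}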

So I’m wondering if anyone knows what is going on at the device level while these kernels are being executed.

Step zero when optimizing: use a profiler, such as NVIDIA’s Nsight or AMD’s CodeXL. If you have a Radeon GPU, you can use CodeXL to look at the resulting assembly. My guess is overzealous loop unrolling, but again, that’s only a guess.
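
Even without a full profiler, you can dump the binary that the OpenCL compiler produced and inspect it offline; what you get back is driver-dependent (often PTX text on NVIDIA, an ELF on AMD), but it is usually enough to see how the compiler scheduled and unrolled the code. A rough sketch, assuming the program was built for exactly one device:

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Rough sketch: write out the binary that clBuildProgram produced, assuming
   "program" was built for a single device. The contents are driver-dependent,
   but they can be inspected offline (e.g. with disassembly tools). */
static void dump_program_binary(cl_program program, const char *path)
{
    size_t size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);

    unsigned char *binary = malloc(size);
    unsigned char *binaries[1] = { binary };
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binaries), binaries, NULL);

    FILE *f = fopen(path, "wb");
    fwrite(binary, 1, size, f);
    fclose(f);
    free(binary);
}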

Hi Salabar, I basically tried to figure out how to profile OpenCL code on NVIDIA, but it seems like they have completely eliminated OpenCL profiling support. Specifically, I’m developing on a remote Linux server, so I cannot use Nsight Eclipse on it.

Try this workaround. They were going to deprecate this tool ages ago, though someone on this message board claimed it still works.