Effect of OpenCL kernel size on performance

I have two almost identical OpenCL kernels with minor differences: one performs 256 FMA operations in its loop body, and the other performs 512.

Here is the general structure of the kernel:

__kernel void WGSXMAPIXLLXOPS8(const __global float * restrict GIn, __global float * restrict GOut, const float M, const float N, const float P) {
        const int XGL = get_global_id(0);
        const int XGRid = get_group_id(0);
        const int XGRnum = get_num_groups(0);
        const int XLSize = get_local_size(0);
        const int XLid = get_local_id(0);
        // Just a private variable
        float MF = (float) XGL;
        float NF = (float) N;
        float PF = (float) P;
        float tempOutTotal = 0;

        // Start of a new level of for loop
        for (int lcdd = 0; lcdd < 2; lcdd++) {
                float temp1 = 1.0;
                temp1 = temp1 * MF + temp1;
                temp1 = temp1 * MF + temp1;
                ...
                temp1 = temp1 * MF + temp1;
                temp1 = temp1 * MF + temp1;
                GOut[XGL] = temp1;
        }

}

Now, I calculate the GFLOPS of these kernels by dividing the total number of floating-point operations by the time it takes to finish. I deploy 141076 work-items onto the GPU. For the kernel with 256 FMA operations I get around 1696.5 GFLOPS, and for the kernel with 512 FMA operations I get around 2043.74 GFLOPS.
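For reference, here is a minimal sketch of how that GFLOPS figure can be computed, assuming each FMA counts as two floating-point operations (one multiply plus one add) and that both iterations of the outer loop are included; the helper function and its parameter names are illustrative, not from the original code:

#include <stddef.h>

/* Hypothetical helper, not part of the original host code.
 * work_items   - number of work-items launched (e.g. 141076)
 * fma_per_iter - FMA lines in the loop body (256 or 512)
 * loop_iters   - the outer loop above runs 2 times
 * seconds      - measured kernel execution time
 */
double gflops(size_t work_items, int fma_per_iter, int loop_iters, double seconds) {
        double flops = (double) work_items * fma_per_iter * loop_iters * 2.0;
        return flops / seconds / 1e9;
}

Under that counting convention, 141076 work-items with 256 FMAs and two loop iterations come to roughly 144.5 MFLOP per launch, so 1696.5 GFLOPS would correspond to a kernel time of about 85 µs.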

From my point of view, I have enough parallelism and enough operations in each kernel, unless my assumption is wrong.

Now the question is: why does having more operations in the kernel improve performance? I understand that parallelism matters and that each kernel should do enough work. So what exactly causes this specific gap in performance?

To be more specific, is there any relation between occupancy and the number and type of operations a kernel performs?

You have a very compute-heavy kernel, and parallelism is meant to hide memory latency. In your case the GPU has to juggle threads for no benefit. Your larger kernel has a bigger ratio of useful work to thread-scheduling overhead and therefore a bigger FLOP/s value. Granted, 15% seems like quite a lot for this small number of threads, so you should feed this code into the CodeXL (AMD) or NSight (NVIDIA) kernel analyzer to find out what the problem really is.

Isn't the thread-scheduling overhead zero on a GPU? My assumption was that warps are assigned to an SM and that on each clock cycle there is one available warp ready to be scheduled on the SM. Is there scheduling happening at some other level for the above code?

On AMD hardware it takes 4 cycles to switch to another wavefront. It's not much, but it's not zero either. It also takes time to set up work-groups before starting the kernel, but that time is probably not included when measuring the running time. In this case you're probably right, but speculating without profiling info won't take us very far.
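For what it's worth, a minimal sketch of timing the kernel with OpenCL event profiling, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE; error handling and the rest of the host setup are omitted. The timestamps cover only the device-side execution of the command, so most of the launch and workgroup setup on the host side falls outside this window:

#include <CL/cl.h>
#include <stdio.h>

/* Sketch only: queue must have been created with CL_QUEUE_PROFILING_ENABLE. */
void time_kernel(cl_command_queue queue, cl_kernel kernel, size_t global_size) {
        cl_event evt;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, &evt);
        clWaitForEvents(1, &evt);

        cl_ulong start, end;   /* device timestamps in nanoseconds */
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);

        printf("kernel time: %f ms\n", (end - start) * 1e-6);
        clReleaseEvent(evt);
}

Comparing CL_PROFILING_COMMAND_QUEUED and CL_PROFILING_COMMAND_SUBMIT against CL_PROFILING_COMMAND_START would also show how much launch overhead sits outside the kernel itself, which is one way to check the speculation above against real numbers.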