Results 1 to 4 of 4

Thread: Effect of OpenCL kernel size in performance

  1. #1
    Junior Member
    Join Date
    Sep 2017
    Posts
    4

    Question Effect of OpenCL kernel size in performance

    I have two almost identical OpenCL kernels, with minor differences. One kernel performs 256 FMA operations, and the other one performs 512 operations, in the loop body.

    Here is the general structure of the kernel:

    Code :
    __kernel void WGSXMAPIXLLXOPS8(const __global float * restrict GIn, __global float * restrict GOut, const float M, const float N, const float P) {
            const int XGL = get_global_id(0);
            const int XGRid = get_group_id(0);
            const int XGRnum = get_num_groups(0);
            const int XLSize = get_local_size(0);
            const int XLid = get_local_id(0);
            // Just a private variable
            float MF = (float) XGL;
            float NF = (float) N;
            float PF = (float) P;
            float tempOutTotal = 0;
     
            // Start of a new level of for loop
            for (int lcdd = 0; lcdd < 2; lcdd++) {
                    float temp1 = 1.0;
                    temp1 = temp1 * MF + temp1;
                    temp1 = temp1 * MF + temp1;
                    ...
                    temp1 = temp1 * MF + temp1;
                    temp1 = temp1 * MF + temp1;
                    GOut[XGL] = temp1;
            }
     
    }

    Now, I calculate the GFlops of these kernels by dividing the total number of floating point operations by the time it takes to finish it. I deploy 141076 number of work-item onto the GPU. For kernel with 256 FMA operations, I get around 1696.5 GFlops and for the kernel with 512 FMA operations, I get around 2043.74GFlops.

    From my point of view, I have enough parallelism and I have enough operations in the kernel. Unless my assumption is wrong.

    Now the question is: Why having more operations in the kernel improves the performance? I understand having parallelism does matter and also each kernel should do enough operations. So, what exactly cause this specific gap in the performance?

    To be more specific, is there any relation between occupancy and the number and the type of operations a kernel does?

  2. #2
    Senior Member
    Join Date
    Apr 2015
    Posts
    321
    You have a very compute heavy kernel and parallelism is meant to deal with memory latency. In your case GPU has to juggle threads for no benefit. Your larger kernel has bigger ratio of useful work and thread scheduling overhead and therefore bigger flop/s value. Granted, 15% seems quite a lot for this small number of threads, so you should feed this code into CodeXL (AMD) or NSight (NVIDIA) kernel analyzer to find out what the problem really is.

  3. #3
    Junior Member
    Join Date
    Sep 2017
    Posts
    4
    Isn't the thread scheduling overhead zero in the GPU? my assumption was "warps" will be assigned to SM and each clock cycle there is one available warp to be scheduled on the SM. Is there any other scheduling on another level happening for the above code?

  4. #4
    Senior Member
    Join Date
    Apr 2015
    Posts
    321
    On AMD hardware it takes 4 cycles to switch to another wave\thread. It's not much, but is not zero either. It also takes time to configure workgroups before starting the kernel, but this time is probably not taken into account when measuring running time. In this case you're probably right, but speculating without profiling info won't take us too far.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Proudly hosted by Digital Ocean