Thread: Reduce / remove loop dependency in HLS

  1. #1
    Newbie publiczne
    Join Date
    Nov 2017
    Location
    Warsaw, Poland
    Posts
    1

    Reduce / remove loop dependency in HLS

    In OpenCL / RTL design, there is a way to reduce a loop-carried dependency by turning the accumulator into a shift register, so that the loop pipelines better, as in the code below:

    Code :
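    float shift_reg[DEPTH];
    // Clear every slot so the partial sums start from zero.
    for(int i = 0; i < DEPTH; i++) {
        shift_reg[i] = 0;
    }
    for(int i = 0; i < loop_bound; i++) {
        // Add to the oldest partial sum and write the result to the tail.
        shift_reg[DEPTH - 1] = shift_reg[0] + arr[i];
        // Shift every slot one position towards the head.
        #pragma unroll
        for(int j = 0; j < DEPTH - 1; ++j) {
            shift_reg[j] = shift_reg[j + 1];
        }
    }
    // After the last shift, slots 0 .. DEPTH - 2 hold the distinct partial sums
    // (the tail slot duplicates slot DEPTH - 2), so only those are added up.
    float temp_sum = 0;
    #pragma unroll
    for(int i = 0; i < DEPTH - 1; ++i) {
        temp_sum += shift_reg[i];
    }
    result = temp_sum;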

    I don't quite understand this method. Can I use a normal register array instead of a shift register to implement this?

  2. #2
    Senior Member
    Join Date
    Apr 2015
    Posts
    292
    This is a manual implementation of the CPU technique called register renaming. The purpose is the same. https://en.wikipedia.org/wiki/Register_renaming

    Can I use a normal register array instead of a shift register to implement this?
    A shift register is a normal register array.
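
    To illustrate that point, here is a minimal sketch in plain C (not OpenCL kernel code). It assumes DEPTH is 4, and the function names sum_shift and sum_rotating are made up for this example. The first function is the shift-register loop from the question. The second keeps the same DEPTH - 1 partial sums in an ordinary array and picks the slot with a rotating index instead of moving data. Whether an FPGA compiler treats the two forms the same way is not something this sketch can show.

```c
#define DEPTH 4  /* assumed value, for illustration only */

/* Shift-register form from the first post, written as plain C. */
float sum_shift(const float *arr, int loop_bound) {
    float shift_reg[DEPTH] = {0};
    for (int i = 0; i < loop_bound; i++) {
        /* Oldest partial sum plus the new element goes to the tail. */
        shift_reg[DEPTH - 1] = shift_reg[0] + arr[i];
        for (int j = 0; j < DEPTH - 1; ++j) {
            shift_reg[j] = shift_reg[j + 1];
        }
    }
    float temp_sum = 0;
    for (int i = 0; i < DEPTH - 1; ++i) {
        temp_sum += shift_reg[i];
    }
    return temp_sum;
}

/* Same DEPTH - 1 interleaved partial sums in an ordinary array:
   element i is added to slot i % (DEPTH - 1), nothing is shifted. */
float sum_rotating(const float *arr, int loop_bound) {
    float acc[DEPTH - 1] = {0};
    for (int i = 0; i < loop_bound; i++) {
        acc[i % (DEPTH - 1)] += arr[i];
    }
    float temp_sum = 0;
    for (int i = 0; i < DEPTH - 1; ++i) {
        temp_sum += acc[i];
    }
    return temp_sum;
}
```

    Both functions add each element to one of DEPTH - 1 interleaved partial sums, so for inputs where float addition is exact they return the same value. In general the final additions may happen in a different order, so results can differ by rounding.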

  3. #3
    Senior Member
    Join Date
    Apr 2015
    Posts
    292
    Purpose is the same
    When looking at the code more closely, though, I'm kinda not sure what the hell is actually going on in there. That rotation in the inner loop creates a false dependency, does it not?

    If you do something like this:
    Code :
    float shift_reg[DEPTH] = {0}; // zero every partial sum before accumulating
    for (int i = 0; i < loop_bound / DEPTH; ++i) { // assume they divide exactly
        for (int j = 0; j < DEPTH; ++j) {
            shift_reg[j] += arr[i * DEPTH + j]; // Will probably be replaced by a compiler with "load loop" and "compute sum loop"
        }
    }

    float temp_sum = 0;
    #pragma unroll
    for (int i = 0; i < DEPTH; ++i) { // all DEPTH partial sums are used in this variant
        temp_sum += shift_reg[i];
    }
    result = temp_sum;

    It is obvious that the sum held in each register can be computed independently, which lets a CPU or GPU make better use of its pipelining capabilities. I don't really understand your code either.
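
    The blocked variant assumes loop_bound divides exactly by DEPTH. As a rough sketch of how the leftover elements could be handled, here is a plain C version (not OpenCL kernel code). It assumes DEPTH is 4, and the function name sum_blocked is invented for this example.

```c
#define DEPTH 4  /* assumed value, for illustration only */

/* Blocked accumulation: partial[j] only ever receives arr[j], arr[j + DEPTH], ...
   so the DEPTH running sums do not depend on each other. */
float sum_blocked(const float *arr, int loop_bound) {
    float partial[DEPTH] = {0};
    int full_blocks = loop_bound / DEPTH;

    for (int i = 0; i < full_blocks; ++i) {
        for (int j = 0; j < DEPTH; ++j) {
            partial[j] += arr[i * DEPTH + j];
        }
    }

    /* Leftover elements when loop_bound is not a multiple of DEPTH. */
    for (int k = full_blocks * DEPTH; k < loop_bound; ++k) {
        partial[k - full_blocks * DEPTH] += arr[k];
    }

    /* Combine all DEPTH partial sums. */
    float total = 0;
    for (int j = 0; j < DEPTH; ++j) {
        total += partial[j];
    }
    return total;
}
```

    Because the partial sums are combined in a different order than a single sequential accumulator would use, the result can differ from a plain loop by floating-point rounding.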
    Last edited by Salabar; 11-14-2017 at 03:59 AM.

  4. #4
    Senior Member
    Join Date
    Apr 2015
    Posts
    292
    I've just realized you were talking about FPGA programming. In this case the inner loop can probably be performed in one cycle, but I don't quite have enough knowledge on the topic to tell the difference between my variant and yours. It's probably some FPGA compiler magic that recognizes your code, but not mine, for whatever reason.
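
    One thing that can be checked without any FPGA tooling is how far apart the write and the read-back of a partial sum are in the shift-register code. The sketch below is plain C, assumes DEPTH is 5, and uses a made-up function name, dependency_distance. It runs the same shift pattern but stores the iteration number that wrote each slot instead of a sum.

```c
#define DEPTH 5  /* assumed value, for illustration only */

/* Returns how many iterations pass between writing a value into the
   shift register and reading it back from slot 0, or -1 if the loop
   is too short for any written value to be read back. */
int dependency_distance(int loop_bound) {
    int producer[DEPTH];  /* iteration that wrote each slot, -1 = initial zero */
    int distance = -1;

    for (int i = 0; i < DEPTH; i++) {
        producer[i] = -1;
    }
    for (int i = 0; i < loop_bound; i++) {
        /* The value read from slot 0 here was written by producer[0]. */
        if (producer[0] >= 0) {
            distance = i - producer[0];
        }
        /* Same write-then-shift pattern as the original loop. */
        producer[DEPTH - 1] = i;
        for (int j = 0; j < DEPTH - 1; ++j) {
            producer[j] = producer[j + 1];
        }
    }
    return distance;
}
```

    With these assumptions, each iteration reads a value written DEPTH - 1 iterations earlier rather than in the previous iteration. That is presumably the slack a pipelining compiler can use to finish the floating-point add, though how a particular FPGA toolchain exploits it is not shown here.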
