HLS Fundamentals / Part 2

In my last two posts, I introduced the question that proved the most challenging in the HLS Bluebook quiz (here), then presented some fundamental concepts of loop unrolling and loop pipelining and explained why answer 2 was not the right one (here).

Let’s now see what happens in the case of answer 1, when we unroll LOOP0 by 4 and pipeline the design with II=1.

Partially unrolling by 4 means that we transform the loop into a new one which now has only 8/4=2 iterations, and where each iteration of the new loop implements 4 iterations of the original loop. The corresponding C code would look like:

void acc(int din[8], int &dout)
{
  int tmp = 0;  // the accumulator must start at zero
  LOOP0: for(int i=0; i<8; i+=4) {
    tmp+=din[i+0];
    tmp+=din[i+1];
    tmp+=din[i+2];
    tmp+=din[i+3];
  }
  dout = tmp;
}
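
As a sanity check, the manual unrolling can be compared against the original rolled loop. The sketch below is only an illustration: the function names acc_rolled and acc_unrolled4 are hypothetical, the const qualifier on din is my addition, and the accumulator is assumed to start at zero.

```cpp
// Rolled reference: 8 iterations, one addition per iteration.
void acc_rolled(const int din[8], int &dout)
{
  int tmp = 0;
  for (int i = 0; i < 8; i++) {
    tmp += din[i];
  }
  dout = tmp;
}

// Manually unrolled by 4: 2 iterations, four additions per iteration.
void acc_unrolled4(const int din[8], int &dout)
{
  int tmp = 0;
  for (int i = 0; i < 8; i += 4) {
    tmp += din[i + 0];
    tmp += din[i + 1];
    tmp += din[i + 2];
    tmp += din[i + 3];
  }
  dout = tmp;
}
```

Both functions visit the same 8 inputs in the same order, so they should return the same sum for any input that does not overflow an int.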

The schedule for one loop iteration would look as follows:

|RD0|ADD|ADD|
|RD1|   |   |
|RD2|ADD|   |
|RD3|   |   |

In the first cycle, 4 inputs are read. In the following cycles, these 4 values are summed together, possibly using a balanced adder tree.
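
To make the "balanced adder tree" idea concrete, here is a minimal sketch contrasting a chain of dependent additions with a two-level tree. The function names sum4_chain and sum4_tree are hypothetical, and whether a given HLS tool performs this rebalancing on its own is tool-dependent; this only shows the two equivalent ways of writing the arithmetic.

```cpp
// Chain: each addition depends on the previous one (3 adders in series).
int sum4_chain(int d0, int d1, int d2, int d3)
{
  int tmp = 0;
  tmp += d0;
  tmp += d1;
  tmp += d2;
  tmp += d3;
  return tmp;
}

// Balanced tree: two independent additions, then one to combine them.
// The longest dependency path is 2 adders deep instead of 3.
int sum4_tree(int d0, int d1, int d2, int d3)
{
  int s01 = d0 + d1;  // first level, can run in parallel with s23
  int s23 = d2 + d3;  // first level
  return s01 + s23;   // second level
}
```

The two ADD columns of the schedule above correspond to the two levels of the tree.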

If the design were not pipelined, the second iteration of the partially unrolled LOOP0 would start after the end of the first iteration and the schedule would look like:

|RD0|ADD|ADD|RD4|ADD|ADD|
|RD1|   |   |RD5|   |   |
|RD2|ADD|   |RD6|ADD|   |
|RD3|   |   |RD7|   |OUT|

Instead, the design is pipelined with II=1, meaning that the second iteration of LOOP0 should start 1 cycle after the start of the first iteration. Similarly, the next design iteration (to process a new set of inputs) would start 1 cycle after the start of the last LOOP0 iteration. The schedule drawn below shows the two iterations of the partially unrolled LOOP0 corresponding to one design iteration, followed by two more iterations of LOOP0 corresponding to a second design iteration.

|RD0|ADD|ADD|
|RD1|   |   |
|RD2|ADD|   |
|RD3|   |   |
    |RD4|ADD|ADD|
    |RD5|   |   |
    |RD6|ADD|   |
    |RD7|   |OUT|
        |RD0|ADD|ADD|   
        |RD1|   |   |   
        |RD2|ADD|   |   
        |RD3|   |   |   
            |RD4|ADD|ADD|
            |RD5|   |   |
            |RD6|ADD|   |
            |RD7|   |OUT|

There are 2 iterations of LOOP0, a new iteration starts every clock cycle, and the output is produced at the end of the last iteration. This implies that the RTL generated with these constraints will produce a new result every 2 cycles.
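
The arithmetic behind this can be written down as a tiny timing model. This is a hypothetical sketch of the schedules drawn above, not the output of any tool: the LoopTiming struct and the function names are my own, and cycles are counted from 0.

```cpp
// Hypothetical timing model of the schedules drawn above.
struct LoopTiming {
  int iterations;  // iterations of the (unrolled) loop per design iteration
  int ii;          // initiation interval, in clock cycles
  int latency;     // clock cycles taken by one loop iteration
};

// Cycle (counted from 0) in which a given loop iteration starts when pipelined.
int start_cycle(const LoopTiming &t, int design_iter, int loop_iter)
{
  return (design_iter * t.iterations + loop_iter) * t.ii;
}

// Pipelined: a new result every iterations * II cycles.
int pipelined_output_interval(const LoopTiming &t)
{
  return t.iterations * t.ii;
}

// Not pipelined: each iteration waits for the previous one to finish.
int sequential_output_interval(const LoopTiming &t)
{
  return t.iterations * t.latency;
}
```

With 2 iterations, II=1 and a 3-cycle iteration, this model gives one result every 2 cycles when pipelined, against one every 6 cycles for the non-pipelined schedule.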


Another important point is that the throughput of the design is independent of its latency. The examples above assume that each addition takes a full clock cycle, so one loop iteration takes 3 clock cycles. Faster adders could produce a shorter schedule. This would mean a shorter ramp-up time (time to first output), but the data rate would stay the same (1 output every 2 clock cycles).
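
A short sketch of this independence, under the same simplified assumptions as the schedules in this post (the function names are hypothetical): the time to first output depends on the iteration latency, while the spacing between outputs does not take latency as an input at all.

```cpp
// Clock cycles elapsed until the first output is available:
// the last loop iteration starts (iterations - 1) * II cycles in,
// then needs `latency` cycles to complete.
int first_output_cycles(int iterations, int ii, int latency)
{
  return (iterations - 1) * ii + latency;
}

// Clock cycles between consecutive outputs once the pipeline is full.
// Latency does not appear here.
int cycles_between_outputs(int iterations, int ii)
{
  return iterations * ii;
}
```

For the design above (2 iterations, II=1), a 3-cycle iteration gives a first output after 4 cycles and a 2-cycle iteration gives it after 3 cycles. In both cases the outputs are 2 cycles apart.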

So answer 1 is not the correct one. There are only two possible choices left, and I hope that with all these recent explanations, finding the solution should now be easy…

About Thomas Bollaert

My first encounter with HLS, then called behavioural synthesis, dates back more than 15 years. Since then my ventures have led me to explore many aspects of the ESL design flow, including HW/SW co-design, architecture exploration and of course, C synthesis. Five years ago, I joined Mentor to develop the Catapult C product line in Europe. Recently, my little family followed me all the way from Paris to Oregon, where I now serve as product marketing manager for Mentor Graphics' high-level synthesis product line.
