HLS Fundamentals / Part 2
Posted Apr 25, 2011
by Thomas Bollaert
In my last two posts, I introduced the question that proved the most challenging in the HLS Bluebook quiz (here), presented some fundamental concepts about loop unrolling and loop pipelining, and explained why answer 2 was not the right one (here).
Let’s now see what happens in the case of answer 1, when we unroll LOOP0 by 4 and pipeline the design with II=1.
Partially unrolling by 4 means that we transform the loop into a new one which now has only 8/4=2 iterations, and where each iteration of the new loop implements 4 iterations of the original loop. The corresponding C code would look like:
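void acc(int din[8], int &dout)
{
  int tmp = 0;  // initialize the accumulator before summing
  LOOP0: for(int i=0; i<8; i+=4) {
    // one iteration of the new loop covers 4 iterations of the original loop
    tmp+=din[i+0];
    tmp+=din[i+1];
    tmp+=din[i+2];
    tmp+=din[i+3];
  }
  dout = tmp;
}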
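As a quick sanity check, the partially unrolled loop can be compared against the rolled one. This is only a sketch: `acc_rolled` is my assumed reconstruction of the original accumulator from the quiz, and `acc_unrolled4` and `check_equivalence` are illustrative names, not part of any tool flow.

```cpp
#include <cassert>

// Assumed reconstruction of the original (rolled) accumulator.
void acc_rolled(const int din[8], int &dout) {
  int tmp = 0;
  for (int i = 0; i < 8; i++) {  // LOOP0, 8 iterations
    tmp += din[i];
  }
  dout = tmp;
}

// Same loop, partially unrolled by 4: 8/4 = 2 iterations.
void acc_unrolled4(const int din[8], int &dout) {
  int tmp = 0;
  for (int i = 0; i < 8; i += 4) {  // LOOP0, 2 iterations
    tmp += din[i + 0];
    tmp += din[i + 1];
    tmp += din[i + 2];
    tmp += din[i + 3];
  }
  dout = tmp;
}

// Hypothetical helper: both versions must produce the same sum.
void check_equivalence(const int din[8]) {
  int a = 0;
  int b = 0;
  acc_rolled(din, a);
  acc_unrolled4(din, b);
  assert(a == b);
}
```

Unrolling changes how the work is grouped into iterations, not the result, which is why the two functions agree for any input array.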
The schedule for one loop iteration would look as follows:
|RD0|ADD|ADD|
|RD1| | |
|RD2|ADD| |
|RD3| | |
In the first cycle, 4 inputs are read. In the following cycles, these 4 values are summed together, possibly using a balanced adder tree.
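To make the adder tree concrete, here is a minimal sketch of the two ways four values can be summed. The function names are mine, and the depth count assumes, as the schedule does, that each addition takes one full cycle.

```cpp
#include <cassert>

// Chained sum: each addition waits for the previous one (3 dependent adds).
int sum4_chain(int a, int b, int c, int d) {
  return ((a + b) + c) + d;
}

// Balanced tree: a+b and c+d are independent and can run in the same cycle,
// then one more addition combines them (2 levels of adds).
int sum4_tree(int a, int b, int c, int d) {
  int lo = a + b;
  int hi = c + d;
  return lo + hi;
}

// Number of adder levels in a balanced tree over n inputs: ceil(log2(n)).
int tree_depth(int n) {
  assert(n >= 1);
  int depth = 0;
  int width = 1;
  while (width < n) {
    width *= 2;
    ++depth;
  }
  return depth;
}
```

With 4 inputs the tree has 2 levels, which matches the two ADD columns that follow the read cycle in the schedule. Barring overflow, both orderings give the same integer sum.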
If the design was not pipelined, the second iteration of the partially unrolled LOOP0 would start after the end of the first iteration and the schedule would look like:
|RD0|ADD|ADD|RD4|ADD|ADD|
|RD1| | |RD5| | |
|RD2|ADD| |RD6|ADD| |
|RD3| | |RD7| |OUT|
Instead, the design is pipelined with II=1, meaning that the second iteration of LOOP0 should start 1 cycle after the start of the first iteration. Similarly, the next design iteration (to process a new set of inputs) would start 1 cycle after the start of the last LOOP0 iteration. The schedule drawn below shows the two iterations of the partially unrolled LOOP0 corresponding to one design iteration, followed by two more iterations of LOOP0 corresponding to a second design iteration.
|RD0|ADD|ADD|   |   |   |
|RD1|   |   |   |   |   |
|RD2|ADD|   |   |   |   |
|RD3|   |   |   |   |   |
|   |RD4|ADD|ADD|   |   |
|   |RD5|   |   |   |   |
|   |RD6|ADD|   |   |   |
|   |RD7|   |OUT|   |   |
|   |   |RD0|ADD|ADD|   |
|   |   |RD1|   |   |   |
|   |   |RD2|ADD|   |   |
|   |   |RD3|   |   |   |
|   |   |   |RD4|ADD|ADD|
|   |   |   |RD5|   |   |
|   |   |   |RD6|ADD|   |
|   |   |   |RD7|   |OUT|
There are 2 iterations of LOOP0 per design iteration, a new iteration starts every clock cycle, and the output is produced at the end of the last iteration. This implies that the RTL generated with these constraints will produce a new result every 2 cycles.
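The arithmetic behind that conclusion can be written down as a small cycle model. This is a back-of-the-envelope sketch, not what an HLS tool computes internally; the function and parameter names are my own, and cycles are counted from 0.

```cpp
#include <cassert>

// Cycle (0-based) in which result k becomes available, assuming:
//   loop_iters   = LOOP0 iterations per design iteration (2 after unrolling by 4)
//   ii           = cycles between the starts of consecutive LOOP0 iterations
//   iter_latency = cycles taken by one LOOP0 iteration (3 in the schedules above)
int output_cycle(int k, int loop_iters, int ii, int iter_latency) {
  assert(k >= 0 && loop_iters >= 1 && ii >= 1 && iter_latency >= 1);
  int last_iter = (k + 1) * loop_iters - 1;  // iteration that completes result k
  int start = last_iter * ii;                // cycle in which that iteration starts
  return start + iter_latency - 1;           // cycle in which it finishes
}

// Cycles between two consecutive results.
int output_interval(int loop_iters, int ii, int iter_latency) {
  return output_cycle(1, loop_iters, ii, iter_latency) -
         output_cycle(0, loop_iters, ii, iter_latency);
}
```

For example, `output_interval(2, 1, 3)` gives 2, the pipelined case discussed here. Setting `ii` equal to the iteration latency, as in `output_interval(2, 3, 3)`, models the non-pipelined schedule and gives 6.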
It is also important to notice that the throughput of the design is independent of its latency. The above examples were drawn with the assumption that each addition took a full clock cycle, so one loop iteration is shown to take 3 clock cycles. Faster adders could have produced a shorter schedule. This would have meant a shorter ramp-up time (time to first output), but the data rate would stay the same (1 output every 2 clock cycles).
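A small simulation illustrates the point. It is a sketch under the same assumptions as the schedules above (2 LOOP0 iterations per result, II=1), with the iteration latency left as a parameter; `output_cycles` is a hypothetical helper, and cycles are counted from 0.

```cpp
#include <cassert>
#include <vector>

// Returns the cycles (0-based) at which the first `results` outputs appear,
// for 2 LOOP0 iterations per result and II=1, given the latency of one
// LOOP0 iteration in cycles.
std::vector<int> output_cycles(int results, int iter_latency) {
  assert(results >= 0 && iter_latency >= 1);
  const int loop_iters = 2;  // 8/4 iterations after unrolling by 4
  const int ii = 1;          // a new LOOP0 iteration starts every cycle
  std::vector<int> cycles;
  int start = 0;             // start cycle of the current LOOP0 iteration
  for (int r = 0; r < results; ++r) {
    for (int it = 0; it < loop_iters; ++it) {
      if (it == loop_iters - 1) {
        // the last iteration of each design iteration produces the output
        cycles.push_back(start + iter_latency - 1);
      }
      start += ii;
    }
  }
  return cycles;
}
```

With a 3-cycle iteration the outputs land on cycles 3, 5, 7; with a 2-cycle iteration they land on cycles 2, 4, 6. The first output arrives earlier, but the spacing between outputs stays at 2 cycles.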
So answer 1 is not the correct one. There are only two possible choices left, and I hope that with all these recent explanations, finding the solution should now be easy…