Synthesizing parallel designs from sequential C++ - part 3
Blog Post
Posted Aug 9, 2009
by Thomas Bollaert
Follow on Twitter
Go URL
What is a Go URL?A couple of weeks ago, I started explaining how sequential C++ could be parallelized to produce high-performance concurrent hardware. In the first two posts (part 1 and part2) I developed the notions of resource bottlenecks and spatial parallelism; in this third post we will explain the notion temporal parallelism. This concept is essential as it allows parallelizing data dependant operations - a case found in a majority of designs.
As explained previously, two or more operations which do not dependent on each other can be parallelized, provided that sufficient resources are available to process these operations in parallel:
x = a * b;
y = c * d;
High-level synthesis can parallelize these two multiplications as 1) there are no variable dependencies and 2) dedicated hardware will be created to match the desired performance, so there are no resource limits.
But what happens when dependencies exist? How can parallel hardware be synthesized in this case? Let’s modify our simple example above by adding a dependency between the two multiplications:
x = a * b;
y = c * x;
One intuitively understands that spatial parallelization techniques cannot apply in this case. Indeed, “x”, the result of the first operation must be available in order to compute the second multiplication. In other words, we have cascaded logic and a dependency chain between the two operations.
Preparing RecommendationsFor the sake of this example, let’s assume that each multiplication takes 1 clock cycle. In the first cycle, we’ll compute x = a * b and in the second cycle, we’ll compute y = c * x, as shown in the example below.
| step 1 | step 2 |
|-----------|-----------|
| x = a * b | y = c * x |
At this point, there is no parallelism in our design. To obtain parallelism, we need to introduce a new dimension to our equation; and this is dimension is time.
Spatial parallelism explained in my previous post allows creating parallelism within one iteration of a function. By adding the notion of time, we can now create parallelism across iterations.
Up to now, we have always considered our simple example from the perspective of a single run, producing one result. But by going to hardware, we build a structure which will iterate again and again. Our simple design will not compute just one value of “y”, but many of them, over time.
So let’s add time to our example. Let’s denote a0..y0 the values corresponding to the first iteration, a1…y1 those corresponding to the second iteration, etc…
| step 1 | step 2 | step 3 | step 4 |
|--------------|--------------|--------------|--------------|
| x0 = a0 * b0 | y0 = c0 * x0 |
| x1 = a1 * b1 | y1 = c1 * x1 |
| x2 = a2 * b2 | y2 = c2 * x2 |
As we see from the diagram above, as we compute “y0″, it is possible to begin the computation of “x1″. So the “x0″ and “y0″ multiplications can’t happen in parallel, but the “x1″ and “y0″ can. Our two multipliers will indeed run concurrently, but as they do, they will process values corresponding to different data sets.
In the hardware designer’s vocabulary, what I just described is called pipelining. Pipelining is the bread and butter of RTL design.
As we did when looking at spatial parallelism, let’s now generalize our example. Let’s replace our simple multiplications by more complicated functionality and let’s encapsulate this functionality in separate functions “foo” and “bar”.
The pseudo-code for our modified C program would look like:
x = foo(a);
y = bar(x);
The two functions can be as complex as desired, the same pipelining technique applies.
| step 1 | step 2 | step 3 | step 4 |
|--------------|--------------|--------------|--------------|
| x0 = foo(a0) | y0 = bar(x0) |
| x1 = foo(a1) | y1 = bar(x1) |
| x2 = foo(a2) | y2 = bar(x2) |
As shown in these two examples, it is perfectly possible to build parallel and pipelined implementations from sequential C++ by leveraging parallelism across design iterations.
This fundamental notion makes the problem of parallelizing C++ in hardware implementations quite different than in software implementations. In software, instructions are executed one by one. In hardware, processes run continuously. Temporal parallelism is about taking advantage of this specific hardware property.
A C synthesis tool like Catapult C is perfectly capable of leveraging this parallelization potential and creating highly efficient parallel multi-block designs. The net benefit for the user is a very simple input source which does not require explicit description of concurrency, thereby making models more abstract, more reusable and easier to create and verify.
This ability to parallelize pure sequential C++ to efficient concurrent hardware is precisely what allowed Hitachi to create 57% of a recent ASIC in 82% less time.
More Blog Posts
Preparing RecommendationsRecent Posts
- Mentor ESL in TSMC Reference Flow 12
- 48th DAC - Gary’s Magic Formula
- DAC: 9th ESL Symposium
- HLS Fundamentals / Part 2
- HLS Fundamentals: Loop Unrolling and Loop Pipelining
- HLS Contest: And the winner is...
- A Designer’s Perspective on ESL Methodologies for an OFDM Modem Design
- Catapult C and the 7 Samuraïs
- The Why, What and How of HLS @ DATE 2011
- DVCon: Wally Rhine's Keynote
Comments (↓ Add Your Own)
1 Comment on this Post
Commented on 9:50 PM, Jan 5, 2010
By Top High-Level Synthesis Stories of 2009 « Thomas Bollaert’s Blog
Add Your Comment
Please complete the following information to comment or sign in.