Synthesizing parallel designs from sequential C++ - part 3

A couple of weeks ago, I started explaining how sequential C++ could be parallelized to produce high-performance concurrent hardware. In the first two posts (part 1 and part2) I developed the notions of resource bottlenecks and spatial parallelism; in this third post we will explain the notion temporal parallelism. This concept is essential as it allows parallelizing data dependant operations - a case found in a majority of designs.

As explained previously, two or more operations which do not dependent on each other can be parallelized, provided that sufficient resources are available to process these operations in parallel:

  x = a * b;
  y = c * d;

High-level synthesis can parallelize these two multiplications as 1) there are no variable dependencies and 2) dedicated hardware will be created to match the desired performance, so there are no resource limits.

But what happens when dependencies exist? How can parallel hardware be synthesized in this case? Let’s modify our simple example above by adding a dependency between the two multiplications:

  x = a * b;
  y = c * x;

One intuitively understands that spatial parallelization techniques cannot apply in this case. Indeed, “x”, the result of the first operation must be available in order to compute the second multiplication. In other words, we have cascaded logic and a dependency chain between the two operations.

Preparing Recommendations

For the sake of this example, let’s assume that each multiplication takes 1 clock cycle. In the first cycle, we’ll compute x = a * b and in the second cycle, we’ll compute y = c * x, as shown in the example below.

 |  step  1  |  step  2  |
 |-----------|-----------|
 | x = a * b | y = c * x |

At this point, there is no parallelism in our design. To obtain parallelism, we need to introduce a new dimension to our equation; and this is dimension is time.

Spatial parallelism explained in my previous post allows creating parallelism within one iteration of a function. By adding the notion of time, we can now create parallelism across iterations.

Up to now, we have always considered our simple example from the perspective of a single run, producing one result. But by going to hardware, we build a structure which will iterate again and again. Our simple design will not compute just one value of “y”, but many of them, over time.

So let’s add time to our example. Let’s denote a0..y0 the values corresponding to the first iteration, a1…y1 those corresponding to the second iteration, etc…

 |    step 1    |    step 2    |    step 3    |    step 4    |
 |--------------|--------------|--------------|--------------|
 | x0 = a0 * b0 | y0 = c0 * x0 |
                | x1 = a1 * b1 | y1 = c1 * x1 |
                               | x2 = a2 * b2 | y2 = c2 * x2 |

As we see from the diagram above, as we compute “y0″, it is possible to begin the computation of “x1″. So the “x0″ and “y0″ multiplications can’t happen in parallel, but the “x1″ and “y0″ can. Our two multipliers will indeed run concurrently, but as they do, they will process values corresponding to different data sets.

In the hardware designer’s vocabulary, what I just described is called pipelining. Pipelining is the bread and butter of RTL design.

As we did when looking at spatial parallelism, let’s now generalize our example. Let’s replace our simple multiplications by more complicated functionality and let’s encapsulate this functionality in separate functions “foo” and “bar”.

The pseudo-code for our modified C program would look like:

  x = foo(a);
  y = bar(x);

The two functions can be as complex as desired, the same pipelining technique applies.

 |    step 1    |    step 2    |    step 3    |    step 4    |
 |--------------|--------------|--------------|--------------|
 | x0 = foo(a0) | y0 = bar(x0) |
                | x1 = foo(a1) | y1 = bar(x1) |
                               | x2 = foo(a2) | y2 = bar(x2) |

As shown in these two examples, it is perfectly possible to build parallel and pipelined implementations from sequential C++ by leveraging parallelism across design iterations.

This fundamental notion makes the problem of parallelizing C++ in hardware implementations quite different than in software implementations. In software, instructions are executed one by one. In hardware, processes run continuously. Temporal parallelism is about taking advantage of this specific hardware property.

A C synthesis tool like Catapult C is perfectly capable of leveraging this parallelization potential and creating highly efficient parallel multi-block designs. The net benefit for the user is a very simple input source which does not require explicit description of concurrency, thereby making models more abstract, more reusable and easier to create and verify.

This ability to parallelize pure sequential C++ to efficient concurrent hardware is precisely what allowed Hitachi to create 57% of a recent ASIC in 82% less time.

About Thomas Bollaert

imageMy first encounter with HLS, back then behavioural synthesis, dates more than 15 years. Since then my ventures have led me to explore many aspects of the ESL design flow, including HW/SW co-design, architecture exploration and of course, C synthesis. Five years ago, I joined Mentor to develop the Catapult C product line in Europe. Recently, my little family followed me all the way from Paris to Oregon, where I now serve as product marketing manager for Mentor Graphics' high-level synthesis product line. Visit Thomas Bollaert’s Blog

More Posts by Thomas Bollaert

More Blog Posts

Preparing Recommendations

Comments (↓ Add Your Own)

1 Comment on this Post

[...] #3 Synthesizing parallel designs from sequential C++ [...]

Add Your Comment

Please complete the following information to comment or sign in.

(Your email will not be published)