Synthesizing parallel designs from sequential C++ - part 2

In my previous post, I introduced the topic of synthesizing parallel hardware from sequential C++ code, a task that some would like to think is impossible to achieve. While countless tape-outs incorporating blocks synthesized from pure C++ provide repeated and unchallengeable proof that such concurrency is possible, doubt still persists.

At the source of this false debate is the confusion between the programming language and the machine executing the code.

C++ is not inherently sequential, but the processor it runs on usually is. Processors and microcontrollers have fixed architectures with a finite set of resources. During code execution, if, one observes A, then B, then C, it usually is because there are not enough resources to perform A, B and C together, in parallel.

Preparing Recommendations

Let’s take a very simple example: 

   r1 = (a * b) + c;
   r2 = (x * y) + z;

 If the CPU running this code has only one ALU, then the 2 multiply-add operations will have to run sequentially. But that doesn’t mean that the language itself is inherently sequential.

Why? Take the case of DSPs: they typically come with multiple parallel ALUs so that multiple operations can indeed be executed concurrently. The Texas Instruments C6000 family provides quad 16-bit and octal 8-bit multiply-accumulate performance. A TigerSHARC from Analog Devices can run up to four 32-bit instructions per cycle. The architecture of these devices permits parallelism and high-throughput.

What languages are used to program DSPs? C, C++ or assembly…  Does anyone in the DSP community challenge the fact that C or C++ can be parallelized when sufficient resources are available? Software compilers will parallelize the program based on the resources available on the target processor.

So the first factor to take into  consideration when parallelizing C++ code is the quantity of available execution resources.

While this will constitute the bottleneck when compiling software on a fixed architecture, the quantity of resources will never be a problem for high-level synthesis (HLS). The synthesis tool has complete freedom – under user control – to create all the resources needed for optimal performance.

If we extended our simple example to 64 multiply-add operations, a C synthesis tool like Catapult C would be able to create a design with 64 parallel multiply-add operators. This is effectively building parallel hardware from sequential C++ code.

Such a design can be built because, in our example, there are no dependencies between the various operations. The value of “r2” does not depend on the value of “r1”. In other words, the execution order implied by the code sequence is not part of the functionality. The code looks sequential, but the functionality isn’t. A different sequence, including a fully parallel one, would also provide valid results.

Is my example too simple? Let’s take another one to show that the very same concepts can be applied to much more complex designs.

  r1 = foo(x);
  r2 = bar(y);

Still too simple? Not so fast! The important thing to consider in this example is that the internal complexity of functions “foo” and “bar” is not relevant. These two functions can be as complex as needed; they can still be parallelized since there are no expressed dependencies between their respective input and outputs.

Although the C code would clearly execute sequentially, hardware can be built where “foo” and “bar” run in parallel. Replace “foo” and “bar” by “Reed-Solomon” and “FFT” and now you’ve got something which starts to make sense.

As you may have concluded,  the second factor which will influence the ability to parallelize is the existence and nature of potential data dependencies.

In my two examples we can parallelize the code because there are no dependencies between specific clusters of operations. This allows us to leverage what is often referred to as “spatial parallelism”.

My third post in the series will cover the case of data dependencies, explaining how a sequential C++ program with dependencies can also be parallelized through pipelining, leveraging another dimension of parallelism known as “temporal parallelism”. 

About Thomas Bollaert

imageMy first encounter with HLS, back then behavioural synthesis, dates more than 15 years. Since then my ventures have led me to explore many aspects of the ESL design flow, including HW/SW co-design, architecture exploration and of course, C synthesis. Five years ago, I joined Mentor to develop the Catapult C product line in Europe. Recently, my little family followed me all the way from Paris to Oregon, where I now serve as product marketing manager for Mentor Graphics' high-level synthesis product line. Visit Thomas Bollaert’s Blog

Preparing Recommendations

Comments (↓ Add Your Own)

1 Comment on this Post

[...] be parallelized to produce high-performance concurrent hardware. In the first two posts (part 1 and part2) I developed the notions of resource bottlenecks and spatial parallelism; in this third post we [...]

Add Your Comment

Please complete the following information to comment or sign in.

(Your email will not be published)