Synthesizing parallel designs from sequential C++ - part 2
Blog Post
Posted Jul 16, 2009
by Thomas Bollaert
Follow on Twitter
Go URL
What is a Go URL?In my previous post, I introduced the topic of synthesizing parallel hardware from sequential C++ code, a task that some would like to think is impossible to achieve. While countless tape-outs incorporating blocks synthesized from pure C++ provide repeated and unchallengeable proof that such concurrency is possible, doubt still persists.
At the source of this false debate is the confusion between the programming language and the machine executing the code.
C++ is not inherently sequential, but the processor it runs on usually is. Processors and microcontrollers have fixed architectures with a finite set of resources. During code execution, if, one observes A, then B, then C, it usually is because there are not enough resources to perform A, B and C together, in parallel.
Preparing RecommendationsLet’s take a very simple example:
r1 = (a * b) + c;
r2 = (x * y) + z;
If the CPU running this code has only one ALU, then the 2 multiply-add operations will have to run sequentially. But that doesn’t mean that the language itself is inherently sequential.
Why? Take the case of DSPs: they typically come with multiple parallel ALUs so that multiple operations can indeed be executed concurrently. The Texas Instruments C6000 family provides quad 16-bit and octal 8-bit multiply-accumulate performance. A TigerSHARC from Analog Devices can run up to four 32-bit instructions per cycle. The architecture of these devices permits parallelism and high-throughput.
What languages are used to program DSPs? C, C++ or assembly… Does anyone in the DSP community challenge the fact that C or C++ can be parallelized when sufficient resources are available? Software compilers will parallelize the program based on the resources available on the target processor.
So the first factor to take into consideration when parallelizing C++ code is the quantity of available execution resources.
While this will constitute the bottleneck when compiling software on a fixed architecture, the quantity of resources will never be a problem for high-level synthesis (HLS). The synthesis tool has complete freedom – under user control – to create all the resources needed for optimal performance.
If we extended our simple example to 64 multiply-add operations, a C synthesis tool like Catapult C would be able to create a design with 64 parallel multiply-add operators. This is effectively building parallel hardware from sequential C++ code.
Such a design can be built because, in our example, there are no dependencies between the various operations. The value of “r2” does not depend on the value of “r1”. In other words, the execution order implied by the code sequence is not part of the functionality. The code looks sequential, but the functionality isn’t. A different sequence, including a fully parallel one, would also provide valid results.
Is my example too simple? Let’s take another one to show that the very same concepts can be applied to much more complex designs.
r1 = foo(x);
r2 = bar(y);
Still too simple? Not so fast! The important thing to consider in this example is that the internal complexity of functions “foo” and “bar” is not relevant. These two functions can be as complex as needed; they can still be parallelized since there are no expressed dependencies between their respective input and outputs.
Although the C code would clearly execute sequentially, hardware can be built where “foo” and “bar” run in parallel. Replace “foo” and “bar” by “Reed-Solomon” and “FFT” and now you’ve got something which starts to make sense.
As you may have concluded, the second factor which will influence the ability to parallelize is the existence and nature of potential data dependencies.
In my two examples we can parallelize the code because there are no dependencies between specific clusters of operations. This allows us to leverage what is often referred to as “spatial parallelism”.
My third post in the series will cover the case of data dependencies, explaining how a sequential C++ program with dependencies can also be parallelized through pipelining, leveraging another dimension of parallelism known as “temporal parallelism”.
Preparing RecommendationsRecent Posts
- GTC - mission accomplished
- Measuring RTOS performance - a Web seminar
- The value of MISRA-compliant software
- Preparing for GTC
- Who needs a Web server?
- More on Yocto Terminology - recipes and packages
- PowerPoint hints and tips #2
- New book
- Video
- In an open-source world, it’s all about integration
Comments (↓ Add Your Own)
1 Comment on this Post
Commented on 6:50 AM, Aug 10, 2009
By Synthesizing parallel designs from sequential C++ - part 3 « Thomas Bollaert’s Blog
Add Your Comment
Please complete the following information to comment or sign in.