Andrew Stubbs, an esteemed member of our Embedded Software Tools team, recently got a chance to show off his debugging and performance tuning skills on an innovative multi-process, multi-language, network distributed application.
Andrew built a three-component client-server build-system enhancement called cs that provides a performance-boosting cache while capturing usage data about the build. The cs architecture consisted of:
- A local compiler-wrapper client, written in C, that caches build artifacts and sends usage data to the daemon.
- A local daemon, written in C, that collects usage data from all local clients and uploads that data to the server.
- A remote server written in Python.
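To make the three-component data flow concrete, here is a minimal sketch of how usage records might move from client to daemon to server. All names and record shapes here are illustrative assumptions, not the actual cs implementation.

```python
# Sketch of the cs data flow: client -> daemon queue -> server.
# Every identifier below is hypothetical.

def client_compile(source, daemon_queue):
    """Compiler wrapper: perform the build step, then report usage locally."""
    artifact = f"compiled({source})"          # stand-in for the real compilation
    daemon_queue.append({"source": source})   # usage record handed to the daemon
    return artifact

def daemon_upload(daemon_queue, server):
    """Local daemon: forward queued usage records to the remote server."""
    while daemon_queue:
        server.append(daemon_queue.pop(0))

server_store = []   # stands in for the remote Python server's data store
queue = []          # stands in for the client->daemon channel
client_compile("main.c", queue)
client_compile("util.c", queue)
daemon_upload(queue, server_store)
```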
The performance goals were roughly:
- No user-visible slowdown in build performance with the cs feature enabled.
- Transmit as little data as possible.
Translating qualitative performance goals like these into quantitative ones is not an exact science, so an arbitrary but small 3% overhead threshold was chosen. In other words, the build should take no more than 3% longer with the cs feature enabled.
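The threshold arithmetic is straightforward; with hypothetical timings (not the original measurements), it looks like:

```python
# Hypothetical build timings to illustrate the 3% overhead threshold.
baseline_seconds = 600.0       # build time with cs disabled (assumed)
cs_enabled_seconds = 642.0     # build time with usage collection on (assumed)

overhead = (cs_enabled_seconds - baseline_seconds) / baseline_seconds
print(f"overhead: {overhead:.1%}")      # 7.0%
print("meets goal:", overhead <= 0.03)  # False
```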
When both caching and usage collection were enabled, the performance boost from the caching was more than enough to offset the small losses incurred by the usage collection and transmission. This configuration met the performance goals.
But despite careful design and coding for efficiency, the cs feature failed to meet the performance goal in the case where usage data was collected but the performance-boosting cache was disabled. In this case, testing showed up to 7% overhead.
The source of the performance problem could lie in the local client, the local daemon, the remote server, or the network. To find and fix it, Andrew considered a number of traditional profiling and analysis tools, including gprof and callgrind. Neither gprof nor callgrind was useful for analyzing long-running processes or for the Python-based server. Gprof in particular was not useful for very short-lived processes like the client, or for processes that don't exit, like the daemon. Callgrind was great for identifying "hot" functions, but had I/O measurement issues. And neither tool provided trace capabilities; both produce only statistical reports.
Debug and Analysis Methods with Sourcery Analyzer
Andrew conducted his debugging and performance tuning with Sourcery Analyzer. The local cs daemon was instrumented with LTTng User Space Tracepoints. A cs-enabled build of a medium-sized application (specifically GNU binutils) on a Linux host generated trace data, which was then imported into Sourcery Analyzer. Remote server logs were translated into a compatible trace format and also imported. Sourcery Analyzer was then used to generate waveforms like the one below. Andrew used these waveforms to analyze and debug the performance problem.
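The article doesn't show the log-translation step, but the idea is simple: parse each server log line and emit a flat trace record. A minimal sketch, assuming a hypothetical log format (the actual cs log format is not documented here):

```python
import re

# Hypothetical server log line (not the actual cs format):
#   2013-04-02 10:15:01.123 REQUEST id=42 bytes=1024
LOG_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<event>\w+) id=(?P<id>\d+) bytes=(?P<bytes>\d+)"
)

def log_line_to_trace(line):
    """Translate one server log line into a flat trace record, or None."""
    m = LOG_RE.match(line)
    if not m:
        return None
    return {
        "timestamp": m.group("ts"),
        "event": m.group("event").lower(),
        "request_id": int(m.group("id")),
        "payload_bytes": int(m.group("bytes")),
    }

record = log_line_to_trace("2013-04-02 10:15:01.123 REQUEST id=42 bytes=1024")
```

Records in this shape can then be serialized into whatever trace format the analysis tool accepts.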
Andrew created a domain-specific custom analysis agent to better understand the behavior of the cs components. Specifically, the custom analysis agent did two things:
- Automatically generated the right type of waveform for each of the User Space Tracepoints in the daemon.
- Generated an advanced data-flow visualization of the data moving among the client, the daemon, and the server.
The client-server interaction waveforms were intended to give an intuitive view of where the time in a round trip was being spent. Synchronized clocks permitted a single transaction to be traced from start to finish, and both best- and worst-case examples to be viewed. The net effect of this exploration was to show that there was no single hotspot or bottleneck. Overall, the data was flowing smoothly between the client, the daemon, and the server.
- The transmit queue (number of waiting "jobs") rose to a maximum of 149 jobs, and the average over the whole run was 82.
- The average round-trip time was 1.1 seconds.
- The 8 connections managed to move approximately 10 HTTPS requests per second, on average.
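These figures can be sanity-checked against one another. Assuming (as a rough model, not a claim from the original analysis) at most one request in flight per connection, throughput would be bounded by connections divided by round-trip time:

```python
connections = 8
round_trip_seconds = 1.1
observed_requests_per_second = 10   # approximate figure from the trace

# Upper bound assuming one request in flight per connection.
serial_bound = connections / round_trip_seconds
print(f"serial bound: {serial_bound:.1f} req/s")  # 7.3 req/s
```

The bound is the same order of magnitude as the observed throughput; the observed figure being somewhat higher suggests requests overlapped on the connections.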
The biggest part of the non-network transmission overhead was SSL encryption, but encrypted transmission was a hard requirement and could not be removed.
The performance problem was not in the data collection (which had already been optimized), and not in transmission per se, but simply that the transmission process consumes CPU time that is then not available to the rest of the build.
To hit the previously established performance target, data transmission needed to be removed from the critical path, particularly on resource-starved uniprocessor systems, where the daemon most noticeably takes time from the compiler and the rest of the build process.
As a result of this analysis, the daemon now collects data as the build progresses and stores it in RAM until the build has completed or the transmission queue reaches capacity. Once the build becomes idle and generates no additional data for 30 seconds, the daemon transmits the data to the server. If a new build begins, the daemon pauses transmission until the build becomes idle again. The data-collection overhead varies from machine to machine, but is now closer to 1%.
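The deferred-transmission logic can be sketched as a small buffering state machine. The 30-second idle window comes from the description above; the class name, queue capacity, and injectable clock are assumptions for illustration (the real daemon is written in C).

```python
import time

IDLE_SECONDS = 30        # flush after this much build inactivity (per the article)
QUEUE_CAPACITY = 1000    # assumed capacity; flush early when reached

class UsageBuffer:
    """Collects usage records in RAM and decides when to transmit them."""

    def __init__(self, now=time.monotonic):
        self._now = now              # injectable clock, for testability
        self._records = []
        self._last_activity = now()

    def record(self, item):
        """Called for each usage event while the build runs."""
        self._records.append(item)
        self._last_activity = self._now()

    def should_flush(self):
        """Transmit once the build is idle, or when the buffer fills up."""
        if len(self._records) >= QUEUE_CAPACITY:
            return True
        idle = self._now() - self._last_activity
        return bool(self._records) and idle >= IDLE_SECONDS

    def drain(self):
        """Hand the buffered records to the transmitter and reset."""
        records, self._records = self._records, []
        return records
```

Because any new `record()` call resets the idle timer, a build that resumes naturally pauses transmission until it goes quiet again.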
The cs build enhancement became part of our Sourcery CodeBench embedded software development tools in the 2013.05 release. More information about the cs feature is located here:
About Sourcery Analyzer
Find, analyze and fix performance bugs in your application with Sourcery Analyzer, a powerful debug and performance analysis tool available with our Sourcery CodeBench embedded software development tools.
Learn more about Sourcery Analyzer here:
To learn more about Mentor Graphics’ embedded software tools, platforms and services visit http://mentor.com/embedded.