I never cared for really long blog entries, so I’m splitting this up into a couple of posts. If you missed my last post you’ll want to go back and read it – or you’ll probably be hopelessly lost as we dive into my multi-core Sudoku solving dilemma.
As you’ll recall, last time I took my clever Sudoku solver and ran it on an 8 core PPC design, and found I got a 3 X speed up. When I looked into why, it turned out that the process which was creating the worker threads was being preempted. This limited the number of active worker threads to between 3 and 5, instead of the expected 25.
I solved the problem by adding a pthread barrier to the created threads, this had the effect of getting all 25 threads launched before diving into the solving phase of the program. And it worked, instead of only a few active threads, I have 25. They are all runnable. And I have lots of cores – 8 to be exact. But things ran more slowly. Using the System Analyzer I could easily see why:
All of the threads are created and quickly reach a state of blocking on I/O – waiting for the pthread barrier to be released by Linux. But if you look at the cpu utilization (at the top of the graph) you’ll see that we have lots of idle CPUs (recall grey is idle and blue is running a user thread). Each thread goes through each open square in the puzzle and determines if its digit goes in that square. Then it returns a list of solved squares to the main process and exits. The main process collates all the results and then launches a new set of threads if the puzzle is not solved. For each group of 25 threads Linux appears to be clustering those threads on only 2 or 3 cores. I assume that Linux does this presuming that the threads will be communicating with each other, and that communication across different cores is expensive and therefore is clumping them together on just a few cores. This is just a guess – I have some friends who are experts in this and I’m going to consult them to try to get to the bottom of this.
In the meantime, I’m still trying to get to that elusive multi-core performance and have my Sudoku puzzles solved faster. So I decide to try something – processor affinity. As I create each thread, I can force the thread to run on a particular core. This is done using the function call sched_setaffinity(). As a quick start to this I decide to force the threads in a round robin fashion to the cores. The thread solving the 1st digit (‘A’) is forced to core 0, the thread solving for ‘B’ to core 1, and so on. The 9th thread goes on core 0, and the 10th on core 1. I take the bottom 3 bits of the thread number and set the processor affinity of that thread to that core. Here’s how I coded it:
Note: this only works if “CORES” is an even power of 2.
When tracing it and viewing it with the System Analyzer, here’s what I see:
Much, much better. You’ll notice that we get all 8 cores running full steam at the start of a round, but as they run out of work to do, the cores start to idle. This happens about halfway through the round.
What’s happening is that the time to solve each digit differs based on the configuration of the puzzle. Some digits get solved quickly, and some take more time. Inevitably, one of the cores will have a couple of digits which take a long time to solve, while other cores will have a couple of digits which can be solved quickly. Which leaves some cores idle at the end of a round This implementation is more than 5X faster than a single core solution, so this is pretty good, but if I can evenly spread the work across the cores better, clearly, I can get closer to 8X.
The next thing I tried was to create a “work queue”. Instead of each core solving a predetermined set of digits, each worker thread will take a digit from a queue. Using affinity I fixed each worker thread to a core. As long as there are unsolved digits in the queue the thread will keep on working. If one core gets several digits which can be solved quickly, it will ask for more work. This approach should effectively “level” the load across the cores. Once the queue is empty, the worker thread returns its solved squares to the main thread and expires.
Now the trace looks like this:
This multi-core version of the solver runs over 7 times faster than the original single threaded version. As the work gets completed there are still some idle CPU cycles at the end of a round of solving, I can think of a couple of ways these could be put to good use, but I think I am at a point of diminishing returns – I am running the algorithm as a multi-threaded SMP program in about 13.8% of the time that it took to run the single threaded version. The theoretical upper limit is 12.5% (1/8th) – so if I were to squeeze out every last CPU cycle, I only have about 1% of the original run time left to optimize. At this point, it’s not worth the effort (and my boss wants me to get back onto “real” work ). You might notice that I am launching a new set of threads on each “round” of solving – it was easy to code that way. I was originally concerned that the overhead of launching and tearing down so many threads would impact performance – some urban legend I remember from my youth. It turns out that the time taken to launch the threads is less than 0.1% of the run time of the program. Again, this is so small that there is no point in convoluting the code to optimize this out.
I ended up running 7.2 times faster on 8 cores – not a perfect utilization of the additional cores, but pretty good. Without the visibility of seeing how the threads were being handled I don’t think I could have done it. In the end, the code changes were really pretty minimal, and mostly related to thread launching and the management of a work queue. The main solving algorithm was completely untouched. Doing a diff between the original solver and the final implementation shows 123 lines added and 17 lines changed and 8 deleted. Just for the record, that is out of 1557 total lines of code.
There is one thing left to investigate. All 8 worker threads are making lots of read references to the puzzle array, which is in a common memory area shared by all 8 threads. Only the main thread updates this area once all the worker threads have completed, so there are no race conditions. I can’t help but wonder if the cache coherency logic in the processor is getting thrashed by all these threads accessing a common area. My thought is to make a local copy of the puzzle for each thread and have it reference only the copy as it goes about finding the solution. It will be interesting to see if that has any impact on performance. Given the small amount of performance left to reach the theoretical limit, I’m guessing I’ll find the shared caches working pretty efficiently in this case.