They say one of the first steps to fixing a problem is to acknowledge that you have one. Hi, I’m Russ, and I have a porting problem. If you work on software in the embedded world, you probably have a porting problem, too. Your code may be an embedded application, drivers, or diagnostics, but at some point the code probably started out on a desktop machine of some sort. When you moved it to run on the real hardware, you needed to port it.
I find that most folks underestimate the effort involved in porting code… especially C code. C and C++ are particularly pernicious in this area because, unlike many other languages, they leave quite a few details up to the compiler writer. This is probably one of the reasons C is so prevalent today – it is easy to move the compiler from one platform to another. And that makes folks think they can just as easily move a working C program from one platform to another.
Last week I spent a lot of time helping one of my customers figure out why a piece of code that was running perfectly on his Dell was now failing when running on his ARM platform. The funny thing is that this code was not touching the hardware at all – this was simply some C code that was running in memory. Usually when code breaks as it is moved from a Dell to an ARM, the cause is the boundaries where hardware gets involved. So, this had us stumped.
Ultimately, we tracked it down to this snippet of code:
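Reconstructed from the description that follows – the names count, chan, and UNUSED come from the story, while the array size and function wrappers are my assumptions – Listing #1 probably looked something like this:

```c
#define NUM_CHANNELS 8            /* size is an assumption */
#define UNUSED       (-1)         /* "never connected" marker */

char count[NUM_CHANNELS];         /* plain char: signedness is up to the compiler */

void init_channels(void)
{
    for (int i = 0; i < NUM_CHANNELS; i++)
        count[i] = UNUSED;        /* stores the byte 0xFF in each element */
}

int channel_is_unused(int chan)
{
    return count[chan] == UNUSED; /* the comparison that behaved differently */
}
```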
When the program was originally written, the array “count” was counting up the number of connections made to a channel. Later, they needed to distinguish between a connection count of 0 and a channel that had never had a connection. Rather than create another variable, the programmer added a constant called “UNUSED” with the value -1 to denote that a channel had never been connected. He thought this was safe, since a count can never take on a negative value. “UNUSED” was assigned to the “count” array entries on initialization and compared against later.
Just a quick aside on programming style – you should NEVER overload the meaning of a variable like this. It is an invitation for bugs to crawl into your code. However, I see this done all over the place. Consider the C library function fgetc – it returns data, unless it returns -1 (note the similarity here), which is a status denoting an end-of-file condition. Good coding practice would dictate that the status of a function should always be returned through one path and the data always returned through another path.
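That fgetc overloading, by the way, is exactly why its result must be kept in an int and never in a char. A sketch (the function here is mine, for illustration):

```c
#include <stdio.h>

/* Count the bytes in a stream.  fgetc() returns either a byte of data
 * (0..255) or the status EOF (-1), so the result must live in an int.
 * Storing it in a char would make a 0xFF data byte indistinguishable
 * from EOF on some platforms: the same char-signedness trap again. */
static long count_bytes(FILE *fp)
{
    long n = 0;
    int c;                          /* int, NOT char */
    while ((c = fgetc(fp)) != EOF)
        n++;
    return n;
}
```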
I have had developers defend this type of variable overloading, citing Unix or Linux as “really good code” that uses this technique. Yes, Linux does it. But that does not make it a good example. Don’t even get me started about their use of a global variable called “errno”. My freshman computer science professor would have thrown me out of the degree program, with prejudice, had I ever proposed such a monstrosity.
Anyway, back to the debugging. The code in Listing #1, when running on the ARM, took the else path. But when running on the Dell, it took the if clause. We double-checked this, and it was indeed happening. Next we checked to see if “count[chan]” had changed, or if “chan” had changed between the assignment and the comparison – neither had changed. For folks who like puzzles and consider themselves competent in C coding, think about how this could be the case before reading on.
This just looked broken to me. I wrote a small snippet of code that replicated the problem:
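Assuming the original used a plain char, the reproduction would be roughly this small:

```c
/* Minimal reproduction of the problem.  On the Dell this reports
 * "sane"; on the ARM it reports "insane", because what a plain char
 * promotes to is left up to the compiler. */
static const char *check_sanity(void)
{
    char c = -1;                      /* plain char, signedness unspecified */
    return (c == -1) ? "sane" : "insane";
}
```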
On the Dell machine, as you would expect, we see that the world is sane. However, when we are on the ARM system, the code takes the else clause. The world has gone insane! How is it possible for this simple code to run differently between the ARM and the Pentium?
Next I looked at the assembly code that was produced by the ARM compiler:
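Based on the instructions named below (LDRB, CMN, and R0 are from the session; the addressing and labels are illustrative), the relevant lines would have looked something like this:

```
    LDRB    R0, [R4]        ; load count[chan]: unsigned byte load, no sign extension
    CMN     R0, #1          ; compare R0 against the negation of 1, i.e. against -1
    BNE     else_path       ; 0x000000FF != 0xFFFFFFFF, so the branch is taken
```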
I looked at the constant being loaded into R0, and it was in fact a -1. The CMN instruction is a compare negative. It will negate the second argument and then compare it with the first. By my math, a -1 and a -1 (or a 1 negated) are equal. I started to suspect a compiler bug.
Now, let me say that I have never in my life found a compiler bug, despite seeing a lot of unbelievable things happen in code – and often wanting to blame the compiler for not implementing “what I meant” (rather than “what I coded”). Compiler bugs are really quite rare, and the cause usually turns out to be an incorrect understanding of the language or a simple programmer error. But in this case, it really looked to me like the load instruction was incorrect. It was doing an unsigned load (LDRB) of the -1 (or 0xFF) in memory, so the value was not sign-extended as it was placed in R0. The CMN instruction, which did perform a sign extension, would therefore compare 0xFF with 0xFFFFFFFF – which are not equal. Doing an unsigned load here seemed incorrect; as every high school student knows, the default type for integer variables in C is “signed.”
To test this theory, I punched the LDRSB (load register signed byte) opcode in place of the LDRB and re-ran the program. The code executed as I expected. The world was once again sane.
So why was the ARM compiler broken in this case? Had I found the first compiler bug in my career?
Some careful reading of the K and R C book was in order. Before I claimed a compiler bug, I wanted to review the rules for type promotion and sign extension.
It turns out that both the ARM compiler and the GCC we used on the Dell machine were compiling the code correctly. Differently, yes, but both are correct as per the C language definition.
If the variable is defined as a “signed char” the code will run the same on both compilers. Likewise, if the variable is declared as “unsigned char” the code will run the same. And, it is true that the default type for integers in C is “signed.” However, for “chars” and “bit fields,” it is left to the compiler writer to decide if sign extension will take place when the variable is promoted for assignment or comparison when the signedness (yeah, I know that’s not a word) is not declared.
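Once the signedness is explicit, the promotion behavior is fully specified and you can demonstrate both outcomes portably on any machine. A small sketch:

```c
#include <assert.h>

void promotion_demo(void)
{
    signed char   sc = -1;  /* byte 0xFF; always sign-extends when promoted to int */
    unsigned char uc = -1;  /* wraps to 255; always zero-extends when promoted */

    assert(sc == -1);       /* holds on every conforming compiler */
    assert(uc == 255);      /* 0xFF stays 0xFF, just like the ARM's LDRB */
    assert(uc != -1);       /* the "insane" comparison, reproduced portably */
}
```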
So GCC treats a plain char as signed, and the ARM compiler treats it as unsigned. Because of this, you will get different results. This seems broken to me, but greater minds than mine have pondered the question. It turns out that Dennis Ritchie himself has weighed in on this one, and he feels that the language specification is correct. You can read his justification here (also, here is the GCC view on this problem). The solution, Dennis claims, is that if you want portable code you must declare your chars as either signed or unsigned. This ensures that the code will be portable – well, at least you won’t run into this problem. C has a couple of dozen other gotchas.
The moral of today’s post is that you need to validate your software on the target system where it will run. No matter how much validation you do on your laptop, you cannot be sure that there are no porting or platform dependencies in your code. The only way to find these is to test on the actual design where the code will finally run.

Another example of this happened when I was validating code that was working fine on a desktop machine and on a development card using the actual processor. But when put on the real hardware, it failed miserably. It turned out that there was a region of memory on the design which was only accessible 8 bits at a time (a narrow bus, for some hardware-centric reason); all the other platforms happily executed 32-bit accesses to all of memory, but the real hardware would bus fault. This was a really hard problem to find. Yeah, the hardware team did mention this limitation – once, on one line on page 386 (of a 487-page technical specification). Somehow we missed it.
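For a region like that, the fix is to force every access to be byte-wide. A sketch of the kind of routine it takes (the name and types are mine, not from that project):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy into a byte-wide region one byte at a time.  memcpy() is free
 * to use 32-bit loads and stores, which would bus-fault on a narrow
 * bus; the volatile byte pointer forces one 8-bit transfer per byte. */
static void copy_to_narrow_region(volatile uint8_t *dst,
                                  const uint8_t *src, size_t len)
{
    while (len--)
        *dst++ = *src++;
}
```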
Bottom line: there is no substitute for running on the actual design.