Things go wrong. Electronic components die. Systems fail. This is almost inevitable and, the more complex systems become, the more likely failure is to occur. In complex systems, however, that failure might be subtle; simple systems tend to just work or not work.
As an embedded system is “smart”, it seems only reasonable that this intelligence can be directed at identifying and mitigating the effects of failure …
Self-testing is the broad term for what embedded systems do to look for failure situations. I suspect that many conference papers have been presented on this topic [or could be - now, there is a thought ...]. It would even make the subject of a book or, at least, an extended series of articles. But I just want to identify some of the key issues.
Broadly, an embedded system can be broken down into four components, each of which can fail: the CPU, software, peripherals and memory.
CPU failure is not too common, but is far from unknown. Unfortunately, there is very little that a CPU can do to predict its own demise. Of course, in a multicore system, there is the possibility of the CPUs monitoring one another.
Software failure is obviously a possibility and defensive code may be written to avoid some possible failure modes. This is quite a large subject that I will return to another day, but I did discuss some aspects in a post last year. Of course, a bug in the software might lead to a totally unpredictable failure.
Peripherals can fail in many and varied ways. Each device has its own possible failure modes. To start with, the self-test software can check that each peripheral is responding at its assigned address and has not failed totally. Thereafter, any further self-test is very device dependent. For example, a communications port may have a “loop back” mode, which enables the self-test to verify transmission and reception of data.
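As a rough illustration of the loop-back idea, here is a minimal sketch in C. Everything named here is hypothetical: the `port_ops` structure, its three functions and the simulated ports are my own stand-ins, not any real driver's API. A real driver would replace them with register accesses for the device in question and would need to wait, with a timeout, for the echoed byte to arrive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical port interface; a real driver would implement these
   with register accesses for the specific device. */
typedef struct {
    void (*enable_loopback)(bool on);
    void (*send)(uint8_t byte);
    bool (*receive)(uint8_t *byte);   /* false if no data is waiting */
} port_ops;

/* Send a few patterns in loop-back mode and check each comes back intact. */
bool port_loopback_test(const port_ops *port)
{
    static const uint8_t patterns[] = { 0x00, 0xFF, 0xAA, 0x55 };
    bool ok = true;

    assert(port != NULL);
    port->enable_loopback(true);
    for (size_t i = 0; i < sizeof patterns && ok; i++) {
        uint8_t got = 0;
        port->send(patterns[i]);
        ok = port->receive(&got) && got == patterns[i];
    }
    port->enable_loopback(false);     /* always leave loop-back mode */
    return ok;
}

/* Host-side stand-in for hardware: a one-byte "wire" that only echoes
   while loop-back is enabled. */
static bool    sim_loop, sim_full;
static uint8_t sim_data;

static void sim_enable(bool on) { sim_loop = on; sim_full = false; }
static void sim_send(uint8_t b)
{
    if (sim_loop) { sim_data = b; sim_full = true; }
}
/* Simulated fault: bit 0 of every transmitted byte is stuck at 1. */
static void sim_send_stuck(uint8_t b) { sim_send((uint8_t)(b | 0x01)); }
static bool sim_receive(uint8_t *b)
{
    if (!sim_full) return false;
    *b = sim_data;
    sim_full = false;
    return true;
}

const port_ops sim_port        = { sim_enable, sim_send,       sim_receive };
const port_ops sim_port_faulty = { sim_enable, sim_send_stuck, sim_receive };
```

With these stand-ins, `port_loopback_test(&sim_port)` succeeds, while the port with the stuck bit fails on the very first pattern.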
Memory is a critical part of an embedded system, of course, and is certainly subject to failure from time to time. Considering how much memory is installed in modern systems, it is surprising that catastrophic failure is not more common. As with all electronic components, memory chips are most likely to fail on power up, so it is wise to perform a comprehensive test then, before vital data is committed to faulty memory.
If a memory chip is responding to being addressed, there are broadly two possible failure modes: stuck bits [i.e. bits that are set to 0 or 1 and will not change] and cross-talk [i.e. setting/clearing one bit has an effect on one or more other bits]. If either of these failures occurs while software is running, it is very hard to trace. The simplest test to look for these failures on start-up is a “moving ones” [and “moving zeros”] test. The logic for moving ones is simple:
set every bit of memory to 0
for each bit of memory
    verify that all bits are 0
    set the bit under test to 1
    verify that it is 1 and that all other bits are 0
    set the bit under test to 0
A moving zeros test is the same, except that 0 and 1 are swapped in this code.
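For anyone who would like to see the moving ones logic as real code, here is a minimal sketch in C. The function names are my own, and this version is written to be readable and to run on a host: it uses the stack and an `assert`, so it is not the register-only start-up version. It also writes and checks a byte at a time, which is equivalent to the bit-level description because each byte under test holds exactly one set bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Check that every byte is 0x00, except byte `special`, which must equal
   `expect`. Pass special == len to require that all bytes are zero. */
static bool region_matches(const volatile uint8_t *mem, size_t len,
                           size_t special, uint8_t expect)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t want = (i == special) ? expect : 0x00;
        if (mem[i] != want) {
            return false;
        }
    }
    return true;
}

/* Destructive moving ones test over mem[0..len). Returns true on success. */
bool moving_ones_test(volatile uint8_t *mem, size_t len)
{
    assert(mem != NULL);

    /* set every bit of memory to 0 */
    for (size_t i = 0; i < len; i++) {
        mem[i] = 0x00;
    }

    for (size_t byte = 0; byte < len; byte++) {
        for (unsigned bit = 0; bit < 8; bit++) {
            uint8_t mask = (uint8_t)(1u << bit);

            /* verify that all bits are 0 */
            if (!region_matches(mem, len, len, 0x00)) return false;

            /* set the bit under test to 1 */
            mem[byte] = mask;

            /* verify that it is 1 and that all other bits are 0 */
            if (!region_matches(mem, len, byte, mask)) return false;

            /* set the bit under test to 0 */
            mem[byte] = 0x00;
        }
    }
    return true;
}
```

Note that every bit under test triggers two full scans of the region, so the run time grows with the square of the memory size. That is fine for a small block, but worth bearing in mind before pointing it at all of RAM.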
Coding this test such that it does not use any RAM to execute [assuming start up code is running out of flash] is an interesting challenge, but most CPUs have enough registers to do the job.
Of course, such comprehensive testing cannot be performed on a running system, as it overwrites the entire memory contents. A background task of some type can carry out more rudimentary testing using this kind of logic:
for each byte of memory
    turn off interrupts
    save memory byte contents
    for values 0x00, 0xff, 0xaa, 0x55
        write value to byte under test
        verify value of byte
    restore byte data
    turn on interrupts
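The background test logic might be sketched in C along these lines. The `DISABLE_INTERRUPTS()` and `ENABLE_INTERRUPTS()` macros are hypothetical placeholders, defined as no-ops here so that the code builds on a host; on a real target they would map to whatever interrupt control the CPU or toolchain provides. The function names are also my own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks: replace with the target's real interrupt control.
   They are no-ops here so the sketch builds and runs on a host. */
#ifndef DISABLE_INTERRUPTS
#define DISABLE_INTERRUPTS() ((void)0)
#endif
#ifndef ENABLE_INTERRUPTS
#define ENABLE_INTERRUPTS() ((void)0)
#endif

/* Non-destructive test of a single byte. Returns true if it passed. */
bool test_byte_in_place(volatile uint8_t *p)
{
    static const uint8_t patterns[] = { 0x00, 0xFF, 0xAA, 0x55 };
    bool ok = true;

    assert(p != NULL);

    DISABLE_INTERRUPTS();              /* nothing else may touch the byte */
    uint8_t saved = *p;                /* save memory byte contents */
    for (size_t i = 0; i < sizeof patterns; i++) {
        *p = patterns[i];              /* write value to byte under test */
        if (*p != patterns[i]) {       /* verify value of byte */
            ok = false;
            break;
        }
    }
    *p = saved;                        /* restore byte data */
    ENABLE_INTERRUPTS();

    return ok;
}

/* Scan a region; returns the index of the first failing byte,
   or len if every byte passed. */
size_t background_memory_scan(volatile uint8_t *mem, size_t len)
{
    assert(mem != NULL);
    for (size_t i = 0; i < len; i++) {
        if (!test_byte_in_place(&mem[i])) {
            return i;
        }
    }
    return len;
}
```

Interrupts are only off for the duration of one byte's test, which keeps the effect on interrupt latency small. In a real background task, the scan would probably be broken up to test a few bytes per invocation instead of the whole region in one go. On a CPU with a data cache, the read-back may be satisfied by the cache and not the memory chip, so the region under test would need to be uncached for the result to mean anything.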
These testing algorithms, as described, assume that all you know about the memory architecture is that it spans a series of [normally contiguous] addresses. However, if you have more detailed knowledge – which memory areas share chips or how rows and columns are organized – more optimized tests may be devised. This is desirable, as a slow start up will impact user satisfaction with a device.
A final question is what to do if a failure is detected. Of course, this will be different for every system. Broadly, the system should be put in a safe state [shut down?] and the user advised.