[Beowulf] Curious about ECC vs non-ECC in practice
Lux, Jim (337C)
james.p.lux at jpl.nasa.gov
Tue May 24 12:07:10 PDT 2011
> -----Original Message-----
> From: David Mathog [mailto:mathog at caltech.edu]
> Sent: Tuesday, May 24, 2011 11:38 AM
> To: Lux, Jim (337C); beowulf at beowulf.org
> Subject: RE: [Beowulf] Curious about ECC vs non-ECC in practice
> Jim Lux posted:
> > "The Therac-25 Accidents" (Postscript ) or (PDF). This paper is an
> updated version of the original IEEE Computer (July 1993) article. It
> also appears in the appendix of my book.
> Well that was really horrible.
> Are car computers ECC? When all they did was engine management a memory
> glitch wouldn't have been too terrible, but now that some of them
> control automatic parking and other "higher" functions, and with around
> 100M units in circulation just in the USA, if they aren't ECC then
> memory glitches in running vehicles would have to be happening every day.
Car controllers tend to keep their software in mask ROM, which is pretty much upset-immune. The "PROM" (which today might be flash or EEPROM) holds all the coefficients for things like the fuel injection/timing, but doesn't hold the code for controlling, say, the ABS.
I would imagine (but do not know) that they do things similar to what we do in spacecraft controllers: store critical data multiple times, lots of self checks on algorithm operation, etc. The report on the Toyota Throttle controller said this:
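To make "store critical data multiple times" concrete, here is a minimal C sketch of a triple-stored value with a bitwise majority vote on read. It is only an illustration of the general technique, not code from any car or spacecraft controller; the names tmr_u32, tmr_write and tmr_read are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container: three copies of one critical 32-bit word. */
typedef struct {
    uint32_t copy[3];
} tmr_u32;

/* Write the same value into all three copies. */
void tmr_write(tmr_u32 *t, uint32_t value)
{
    t->copy[0] = value;
    t->copy[1] = value;
    t->copy[2] = value;
}

/*
 * Read with a bitwise 2-of-3 majority vote. If the copies disagree,
 * the voted value is written back to all three ("scrubbing").
 * Returns true if all copies agreed, false if a repair was needed.
 */
bool tmr_read(tmr_u32 *t, uint32_t *out)
{
    uint32_t a = t->copy[0], b = t->copy[1], c = t->copy[2];
    uint32_t voted = (a & b) | (a & c) | (b & c);
    bool clean = (a == b) && (b == c);

    if (!clean)
        tmr_write(t, voted);
    *out = voted;
    return clean;
}
```

The vote masks an upset in any one copy; it cannot help if the same bit is hit in two copies before the next read, which is why the sketch repairs the copies as soon as it sees a disagreement.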
"The Main and Sub-CPUs use two types of memory: non-volatile ROM for software code and volatile Static Ram (SRAM). The SRAM is protected by a single error detect and correct and a double error detect hardware function performed by error detection and correction (EDAC) logic."
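For anyone who hasn't looked inside an EDAC scheme, the "single error correct, double error detect" behavior can be sketched in software with an extended Hamming (8,4) code. This is an assumption chosen for illustration: the report excerpt doesn't say which code the hardware uses, real EDAC logic works on wider words in hardware, and the names edac_encode and edac_decode are invented here.

```c
#include <stdint.h>

typedef enum { EDAC_OK, EDAC_CORRECTED, EDAC_UNCORRECTABLE } edac_status;

/* Parity of an 8-bit value: 1 if an odd number of bits are set. */
static unsigned parity8(uint8_t x)
{
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/*
 * Encode 4 data bits into an 8-bit codeword.
 * Bits 1..7 are Hamming positions 1..7 (parity at 1, 2, 4; data at
 * 3, 5, 6, 7). Bit 0 is an overall parity bit over the whole word.
 */
uint8_t edac_encode(uint8_t nibble)
{
    unsigned d0 = nibble & 1u, d1 = (nibble >> 1) & 1u;
    unsigned d2 = (nibble >> 2) & 1u, d3 = (nibble >> 3) & 1u;
    unsigned p1 = d0 ^ d1 ^ d3;   /* covers positions 1, 3, 5, 7 */
    unsigned p2 = d0 ^ d2 ^ d3;   /* covers positions 2, 3, 6, 7 */
    unsigned p4 = d1 ^ d2 ^ d3;   /* covers positions 4, 5, 6, 7 */
    uint8_t cw = (uint8_t)((p1 << 1) | (p2 << 2) | (d0 << 3) | (p4 << 4) |
                           (d1 << 5) | (d2 << 6) | (d3 << 7));

    cw |= (uint8_t)parity8(cw);   /* make total parity even */
    return cw;
}

/*
 * Decode a codeword. A single flipped bit is corrected; two flipped
 * bits are reported as uncorrectable and *nibble is left untouched.
 */
edac_status edac_decode(uint8_t cw, uint8_t *nibble)
{
    unsigned syndrome = 0;
    edac_status st = EDAC_OK;

    /* XOR of the positions of all set bits is 0 for a valid word. */
    for (unsigned pos = 1; pos <= 7; pos++)
        if (cw & (1u << pos))
            syndrome ^= pos;

    if (parity8(cw)) {
        /* Odd overall parity: one bit flipped. The syndrome names it;
           syndrome 0 means the overall parity bit (bit 0) itself. */
        cw ^= (uint8_t)(1u << syndrome);
        st = EDAC_CORRECTED;
    } else if (syndrome != 0) {
        /* Even parity but nonzero syndrome: two bits flipped. */
        return EDAC_UNCORRECTABLE;
    }

    *nibble = (uint8_t)(((cw >> 3) & 1u) | (((cw >> 5) & 1u) << 1) |
                        (((cw >> 6) & 1u) << 2) | (((cw >> 7) & 1u) << 3));
    return st;
}
```

The overall parity bit is what separates the two cases: a single upset changes the word's parity, a double upset doesn't, so a nonzero syndrome with even parity can only be flagged, not fixed.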
There's a whole reliability of software community out there with everything from certifiable processes to coding standards (MISRA) designed to make it easy to inspect and verify that the code is doing what you think, and that it handles off-nominal cases.
I haven't read the whole report, but there was an analysis of the software in the Toyota controllers recently.
"The NESC team examined the software code (more than 280,000 lines) for paths that might initiate such a UA, but none were identified." (UA = Unintended Acceleration)
The team examined the VOQ vehicles for signs of electrical faults, and subjected these vehicles to electro-magnetic interference (EMI) radiated and conducted test levels significantly above certification levels. The EMI testing did not produce any UAs, but in some cases caused the engine to slow and/or stall. (That's probably closest to what you'd see from a memory upset)
Section 6.5, page 64 of the report, is "System Fail-Safe Architecture"
It's pretty sophisticated, with multiple parallel schemes to prevent runaway or failure. I'm impressed at the level of thought they gave not just to shutting down the engine, but to leaving an adequate limp-home capability when one or more parts in the chain fail (e.g. if the throttle plate actually sticks, it can control the engine by turning the fuel injectors on and off). There's also an independent mechanism that detects if the pedal isn't pressed (or the redundant pedal position sensors have failed), in which case the engine cannot exceed 2500 RPM: if it does, the fuel turns off, and then turns back on when the speed drops below 1100 RPM.
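The pedal-released fuel cut described above is a hysteresis loop, and a rough C sketch shows the shape of it. Only the 2500 RPM and 1100 RPM thresholds come from the description; the type and function names are hypothetical, and the behavior when the pedal is pressed with healthy sensors (limiter inactive, any cut cleared) is my assumption rather than something stated in the report.

```c
#include <stdbool.h>

#define FUEL_CUT_RPM     2500   /* cut fuel above this with pedal released */
#define FUEL_RESTORE_RPM 1100   /* restore fuel once speed drops below this */

/* Hypothetical state for the limiter: is the fuel currently cut? */
typedef struct {
    bool fuel_cut;
} failsafe_state;

/*
 * One control-loop step. pedal_released_or_failed is true when the
 * pedal isn't pressed or the redundant pedal sensors have failed.
 * Returns true if the injectors may fire on this step.
 */
bool failsafe_step(failsafe_state *s, bool pedal_released_or_failed, int rpm)
{
    if (!pedal_released_or_failed) {
        /* Assumption: with a valid pressed pedal the limiter is inactive. */
        s->fuel_cut = false;
        return true;
    }

    if (!s->fuel_cut && rpm > FUEL_CUT_RPM)
        s->fuel_cut = true;     /* overspeed with no pedal: cut fuel */
    else if (s->fuel_cut && rpm < FUEL_RESTORE_RPM)
        s->fuel_cut = false;    /* engine has slowed enough: restore fuel */

    return !s->fuel_cut;
}
```

The gap between the two thresholds is what keeps the engine from chattering on and off around a single limit, and restoring fuel at 1100 RPM instead of leaving it off is what preserves the limp-home capability.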
And, since we Beowulfers are for the most part software weenies..
The ECM for the 2005 Camry uses a NEC V850 E1 processor. The software is in ANSI C, and compiled with the Green Hills compiler.
There are 256kSLOC of non-comments (along with 241kSLOC of comments) in .c files and another 40kSLOC (noncomment) in various .h files.
They ran it through Coverity and CodeSonar (both of which we use at JPL), as well as SPIN (using SWARM to run it on a cluster.. now how about that)