Frequency of upsets was Re: [Beowulf] ECC support on motherboards?

Jim Lux James.P.Lux at jpl.nasa.gov
Tue May 13 15:27:11 PDT 2008


At 02:16 PM 5/13/2008, Håkon Bugge wrote:
>At 19:17 13.05.2008, Perry E. Metzger wrote:
>>So another question is, how can you reliably test any of this stuff?
>>It isn't like you can reliably induce single bit errors and see if the
>>hardware catches them. (A special memory module that let you test
>>would be a wonderful thing, but I've never even heard of such a thing.)


More on upsets..

Here's an interesting paper from Boeing in the late 90s that asserts
that a leading cause of these upsets is atmospheric neutrons. It gives
rates, too (see also the link below to the presentation, which uses
some of this data):

http://www.boeing.com/assocproducts/radiationlab/publications/SEU_at_Ground_Level.pdf

It looks like, for 4M DRAMs, 1E-12 upsets/bit-hour is a nice round number (Table 4).
Some data from Fermilab with 160 Gbit of DRAM showed 2.5 upsets/day,
which works out to about 6.5E-13 upsets/bit-hour. Extrapolating (always
dangerous with these kinds of radiation effects data, but I'll plunge
in regardless), that means a workstation with 4-8 Gbyte (32-64 Gbit)
of DRAM might see somewhere between 0.5 and 1.5 upsets per day, i.e.
roughly an upset per day.
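
As a rough check of that arithmetic, here is a small Python sketch of the
extrapolation. The only inputs are the figures quoted above (1E-12
upsets/bit-hour, and 160 Gbit seeing 2.5 upsets/day). The function names
are made up for illustration, and it assumes decimal gigabits (1E9 bits)
and a rate that scales linearly with the number of bits, which is exactly
the "dangerous" part of the extrapolation.

```python
# Back-of-the-envelope upset rates from the figures quoted in the post.
# Assumptions (not from the papers): 1 Gbit = 1e9 bits, linear scaling.
BITS_PER_GBIT = 1e9
HOURS_PER_DAY = 24


def per_bit_hour(upsets_per_day_observed, gbits):
    """Convert an observed upsets/day over `gbits` of DRAM to upsets/bit-hour."""
    return upsets_per_day_observed / (gbits * BITS_PER_GBIT * HOURS_PER_DAY)


def upsets_per_day(rate_per_bit_hour, gbytes):
    """Expected upsets/day for `gbytes` of DRAM at a given per-bit-hour rate."""
    bits = gbytes * 8 * BITS_PER_GBIT
    return rate_per_bit_hour * bits * HOURS_PER_DAY


fermilab_rate = per_bit_hour(2.5, 160)  # from 160 Gbit, 2.5 upsets/day
round_rate = 1e-12                      # the "nice round number"

for gbytes in (4, 8):
    print(f"{gbytes} Gbyte: "
          f"{upsets_per_day(fermilab_rate, gbytes):.2f} upsets/day (Fermilab rate), "
          f"{upsets_per_day(round_rate, gbytes):.2f} upsets/day (1E-12 rate)")
```

At the Fermilab-derived rate this gives 0.5 and 1.0 upsets/day for 4 and
8 Gbyte; at the round 1E-12 figure it gives about 0.77 and 1.5.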

Any sort of ECC would catch this and correct it, of course.

There is a paper from Gary Swift, here at JPL, that discusses how some
radiation-induced upsets will be multiple-bit errors by their nature
(imagine a bullet tearing through a bunch of memory cells: more than
one gets hit). But this is for Cassini-era Solid State Recorders (i.e.
late-80s, early-90s components), and it's in space, where the radiation
environment is quite different from the terrestrial one. Swift & Guertin,
"In-Flight Observations of Multiple-Bit Upset in DRAMs", IEEE Trans. on
Nuc. Sci., V47, #6, Dec 2000, pp. 2386-2391.
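
To make the single-bit versus multiple-bit distinction concrete, here is
a toy Python sketch that injects bit flips in software. It uses a textbook
Hamming(7,4) code purely for illustration; that choice is an assumption of
this sketch, not what any particular memory controller implements (real
ECC memory typically uses wider SECDED codes over 64-bit words). The
helper names are hypothetical.

```python
# Toy Hamming(7,4) model: corrects any single flipped bit per codeword,
# but "corrects" the wrong bit when two bits are flipped.
# Codewords are lists indexed 1..7 (index 0 unused); parity at 1, 2, 4.
from itertools import combinations


def encode(nibble):
    """Encode 4 data bits (0..15) into a 7-bit Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code


def flip(code, pos):
    """Return a copy of the codeword with the bit at `pos` (1..7) inverted."""
    out = list(code)
    out[pos] ^= 1
    return out


def decode(code):
    """Return (data, syndrome); a nonzero syndrome is the position it flipped back."""
    code = list(code)
    s1 = code[1] ^ code[3] ^ code[5] ^ code[7]
    s2 = code[2] ^ code[3] ^ code[6] ^ code[7]
    s4 = code[4] ^ code[5] ^ code[6] ^ code[7]
    pos = s1 + 2 * s2 + 4 * s4
    if pos:
        code[pos] ^= 1
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return data, pos


single_ok = all(decode(flip(encode(n), p))[0] == n
                for n in range(16) for p in range(1, 8))
double_ok = any(decode(flip(flip(encode(n), p), q))[0] == n
                for n in range(16) for p, q in combinations(range(1, 8), 2))

print("every single-bit flip corrected:", single_ok)
print("any double-bit flip decoded correctly:", double_ok)
```

In this toy model every single-bit flip is repaired, while every
double-bit flip comes back as wrong data with no indication of trouble,
which is why an extra overall parity bit (SECDED) is normally added so
that double errors are at least detected.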

The Ladbury presentation from MAPLD 2002 that I posted the link to
yesterday talks about the mechanics of the upset.

A fascinating presentation about upsets in 
avionics (for planes, not spacecraft) from Boeing is here:
http://www.solarstorms.org/SEUavionics.pdf

Look at slide 11, and you'll see that the upset rate is 30 times higher
at 30,000 ft than at sea level. Those of you building clusters for
observatories in Atacama might want to pay more attention to upsets
than those of us close to sea level.

Likewise, the upset rate is higher at high latitudes. (Why yes, it's
essential that we build that cluster on a tropical island. Otherwise it
will cost more for ECC RAM.)

An interesting post on a mailing list:
http://www.cs.york.ac.uk/hise/safety-critical-archive/2001/0140.html

Ladkin discusses some of the potential issues with the Boeing (and other) data.


So there's more about SEUs in memory than anyone on this list ever
wanted to know. There's lots more stuff available, although you pretty
quickly get into export-controlled territory if you are poking at the
limits of the technology.

Jim

--------
James Lux, P.E.
Task Manager, SOMD Software Defined Radios
Flight Communications Systems Section
Jet Propulsion Laboratory
4800 Oak Grove Drive, M/S 161-213
Pasadena CA 91109
USA

+1(818)354-2075 phone
+1(818)393-6875 fax





