[Beowulf] Not quite Walmart, or, living without ECC?

Jim Lux James.P.Lux at jpl.nasa.gov
Fri Nov 16 15:16:37 PST 2007


At 01:56 PM 11/16/2007, Mark Hahn wrote:
>>I just asked the local NT goon, "do you use ECC for the servers?" and
>>he answered, "you have to". What he considers a server-class mobo
>>requires ECC
>
>whether you need ECC depends on many things.  first, how much memory
>your machine has - my experience is that most generic servers (web, file,
>mail, etc.) don't have much - maybe a few GB.  the chance of needing ECC
>also depends on how "hard" you use the RAM (again, mundane servers are
>pretty lightly utilized), as well as factors like altitude, RAM quality,
>and the ever-popular "how important is your data".
>
>for clusters, I would say that ECC is basically a necessity, unless all
>the jobs can be run in a "checking" mode (i.e., perform a search or
>optimization, then verify the results in case the hit was due to a bit flip).
>
>that said, ECC events are not all that common.  I have a 768-node cluster
>here, each node dual-socket Opteron with 8 GB PC3200 DDR.  I just checked
>all nodes with mcelog, and 35 have reported corrected events over roughly
>the last 20 days.  one may have hit an uncorrectable event (but in our
>clusters, the corrected ECC rate is not a good predictor of uncorrectable
>ones...)


So the detected upset rate is:

35/(768*20) = 0.0023 detected errors per node per day, or about 3.3E-14
errors/bit/day (taking 8 GB of DRAM per node).
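
In Python, for anyone who wants to check the arithmetic (this assumes 8
GiB per node, and treats "35 nodes reported events" as roughly 35 events;
some nodes may of course have logged more than one):

nodes, days, events = 768, 20, 35
bits_per_node = 8 * 2**30 * 8               # 8 GiB per node, in bits

per_node_per_day = events / (nodes * days)  # ~0.0023 detected errors/node/day
per_bit_per_day = per_node_per_day / bits_per_node
print(per_node_per_day, per_bit_per_day)    # ~2.3e-3, ~3.3e-14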

Wikipedia claims 1 error/month/GB (about 4E-12 errors/bit/day), but their
references are all pretty ancient (a JPL paper from 2001 is probably
reporting on devices that would have been used in consumer electronics in
the early 90s).  They may also have been talking about "upset rates",
while what you observe is a "detected bit error rate" (that is, you don't
see all the upsets that have occurred, because you don't read all memory,
all the time... your accesses may be concentrated in, say, 1 GB of your
overall 8 GB DRAM space).
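
Converting Wikipedia's figure the same way, and correcting for partial
coverage (the 1-in-8 coverage fraction below is just an illustration, not
a measured number):

wiki_per_bit_per_day = 1 / (30 * 8e9)  # 1 error/month/GB -> ~4.2E-12/bit/day

# if accesses only touch 1 GB of the 8 GB, the detected rate understates
# the true upset rate by roughly the coverage fraction:
coverage = 1 / 8
upset_estimate = 3.3e-14 / coverage    # ~2.6E-13, still far below 4.2E-12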

http://parts.jpl.nasa.gov/docs/CassDRAM-00.pdf discusses some possible
reasons why multibit and single-bit error rates don't scale the way you'd
expect (a heavy ion can zap multiple bits at once, so the bit errors are
not uncorrelated).  In spacecraft systems, they often implement a
scrubbing algorithm that systematically reads and checks each location in
turn, rather than waiting for the processor to happen to fetch that
location.  That way, you have a chance to scrub an error in a word before
it takes a second hit.  On Cassini, the scrubbing in the 2.5 Gbit
solid-state recorders is arranged so that every word gets scrubbed about
every 9 minutes, and they see about 200-300 single-bit errors/day.  But
this is, truly, ancient technology...
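
Here's a toy Monte Carlo of that effect, in Python (the parameters are
purely illustrative, not Cassini's actual geometry; it just shows how
periodic scrubbing keeps single-bit errors from accumulating into
uncorrectable double hits):

import random

def run(words=1_000_000, ticks=100_000, upsets_per_tick=1, scrub_every=None):
    # count double-bit hits, which single-error-correcting ECC can't fix
    single = set()                 # words currently holding one flipped bit
    uncorrectable = 0
    for t in range(ticks):
        for _ in range(upsets_per_tick):
            w = random.randrange(words)
            if w in single:        # second hit before the word was scrubbed
                uncorrectable += 1
                single.discard(w)  # toy simplification: the word starts over
            else:
                single.add(w)
        if scrub_every and t % scrub_every == 0:
            single.clear()         # scrub pass rewrites every correctable word
    return uncorrectable

print("no scrubbing:   ", run())                # on the order of 5000
print("scrub every 100:", run(scrub_every=100)) # on the order of 5

The exact counts don't matter; the point is that the double-hit rate
scales with how long a single-bit error is allowed to sit, which is why
Cassini bothers to touch every word every 9 minutes or so.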




