[Beowulf] Not quite Walmart, or, living without ECC?

Scott Atchley atchley at myri.com
Mon Nov 26 12:56:57 PST 2007

On Nov 26, 2007, at 3:27 PM, David Mathog wrote:

> I ran a little test over the Thanksgiving holiday to see how common
> random errors in nonECC memory are.  I used the memtest86+ bit fade test
> mode, which writes all 1s, waits 90 minutes, checks the result, then
> does the same thing for all 0s.  Anyway, this was the best test I could
> find for detecting the occasional gamma ray type data loss event.  The
> result: no errors logged in 5 solid days of testing.  So this class of
> error (the type ECC would detect and probably fix) apparently occurs
> on these machines at a rate of less than 1 per 840 Gigabyte-hours.
> Possibly the upper limit is half that if data can only be lost
> on 1 -> 0 transition, or vice versa.  This assumes the bit fade test
> works, which cannot be independently verified from these results.
> On the web there are references to an IBM study which found 1 bit
> error/256Mb/Month, which would have been (.25 *30 * 24) =
> 1 per 180 Gigabyte-hours.  If IBM's numbers held for my hardware
> I should have seen 4 or 5 errors in total.  Mine are in a basement
> in a concrete building; perhaps that provided some shielding relative to
> what IBM used for their test conditions.
> The memory was Corsair Twinx1024-3200C2.  When first installed all
> of this memory had run for 24 hours with no errors in normal
> memtest86+ testing.
> Regards,
> David Mathog

Or maybe you got lucky. Five days may not be long enough.
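
To put a rough number on "lucky", here is a minimal sketch that takes the
two figures quoted above (840 Gigabyte-hours of exposure, and the IBM rate
of 1 error per 180 Gigabyte-hours) and asks how likely a clean run would be.
It assumes bit flips arrive as independent Poisson events, which is my
assumption and not something either set of numbers establishes; correlated
bursts of errors would break it. The function and variable names are made
up for illustration.

```python
import math


def expected_errors(exposure_gb_hours, gb_hours_per_error):
    """Mean number of errors expected over the given exposure."""
    return exposure_gb_hours / gb_hours_per_error


def prob_zero_errors(mean_errors):
    """Poisson probability of observing no events when the mean is mean_errors."""
    return math.exp(-mean_errors)


# Figures taken from the quoted message.
ibm_gb_hours_per_error = 0.25 * 30 * 24   # 1 error / 256 MB / month -> 180 GB-hours
exposure_gb_hours = 840.0                 # total exposure of the 5-day test

mean = expected_errors(exposure_gb_hours, ibm_gb_hours_per_error)
p_clean = prob_zero_errors(mean)

print(f"expected errors at the IBM rate: {mean:.2f}")
print(f"chance of a clean run (Poisson assumption): {p_clean:.4f}")
```

Under those assumptions the expected count is about 4.7 and the chance of
seeing zero errors comes out a little under 1%, so a clean five days is
unlikely at the IBM rate but not impossible.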

We have had customers report events that included parity errors on  
hundreds of nodes simultaneously on large clusters. Higher altitude  
makes things worse. Being in a DOE lab near lots of interesting  
materials does not help either. :-)

