[Beowulf] Not quite Walmart, or, living without ECC?

David Mathog mathog at caltech.edu
Mon Nov 26 12:27:03 PST 2007

I ran a little test over the Thanksgiving holiday to see how common
random errors in nonECC memory are.  I used the memtest86+ bit fade test
mode, which writes all 1s, waits 90 minutes, checks the result, then
does the same thing for all 0s.   Anyway, this was the best test I could
find for detecting the occasional gamma ray type data loss event.  The
result: no errors logged in 5 solid days of testing.  So this class of
error (the type ECC would detect and probably fix) apparently occurs
on these machines at a rate of less than 1 per 840 Gigabyte-hours.
If data can only be lost on a 1 -> 0 transition, or vice versa, then only
half of the test time counts and the bound weakens to less than 1 per
420 Gigabyte-hours.  All of this assumes the bit fade test works, which
cannot be independently verified from these results.

On the web there are references to an IBM study which found 1 bit
error/256MB/month, which works out to 1 per (.25 * 30 * 24) =
180 Gigabyte-hours.  If IBM's numbers held for my hardware
I should have seen 4 or 5 errors in total.  My machines are in a basement
in a concrete building; perhaps that provided some shielding relative to
IBM's test conditions.
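
For anyone who wants to check the arithmetic, here is a rough sketch in Python. The 840 Gigabyte-hour exposure and the IBM rate are taken from the figures above. Treating the errors as a Poisson process (independent events at a constant rate) is my assumption, not something the test establishes, and the function names are made up for illustration.

```python
import math

HOURS_PER_MONTH = 30 * 24            # 30-day month, as in the estimate above

# Exposure with zero errors logged, in Gigabyte-hours (figure from the post).
exposure_gb_hours = 840.0

# IBM figure: 1 error per 256 MB per month, with 256 MB = 0.25 GB.
ibm_gb_hours_per_error = 0.25 * HOURS_PER_MONTH


def expected_errors(exposure, gb_hours_per_error):
    """Mean number of errors expected over the given exposure."""
    return exposure / gb_hours_per_error


def prob_zero_errors(expected):
    """Chance of logging no errors at all, ASSUMING Poisson statistics."""
    return math.exp(-expected)


# Case 1: every bit is at risk for the whole run.
full = expected_errors(exposure_gb_hours, ibm_gb_hours_per_error)

# Case 2: only one transition direction can fail, so half the exposure counts.
half = expected_errors(exposure_gb_hours / 2, ibm_gb_hours_per_error)

print(f"IBM rate: 1 error per {ibm_gb_hours_per_error:.0f} GB-hours")
print(f"full exposure: {full:.2f} expected, P(none) = {prob_zero_errors(full):.4f}")
print(f"half exposure: {half:.2f} expected, P(none) = {prob_zero_errors(half):.4f}")
```

Under those assumptions the IBM rate predicts about 4.7 errors over the full exposure, with roughly a 1% chance of seeing none.  If only half the exposure counts, the prediction drops to about 2.3 errors and the chance of a clean run rises to roughly 10%.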

The memory was Corsair Twinx1024-3200C2.  When first installed all
of this memory had run for 24 hours with no errors in normal
memtest86+ testing.


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

