[Beowulf] Not quite Walmart, or, living without ECC?
David Mathog
mathog at caltech.edu
Mon Nov 26 12:27:03 PST 2007
I ran a little test over the Thanksgiving holiday to see how common
random errors in non-ECC memory are. I used the memtest86+ bit fade test
mode, which writes all 1s, waits 90 minutes, checks the result, then
does the same thing for all 0s. Anyway, this was the best test I could
find for detecting the occasional gamma ray type data loss event. The
result: no errors logged in 5 solid days of testing. So this class of
error (the type ECC would detect and probably fix) apparently occurs
on these machines at a rate of less than 1 per 840 Gigabyte-hours.
If data can only be lost on a 1 -> 0 transition, or vice versa, only half
of the test time counts and the bound weakens to less than 1 per 420
Gigabyte-hours. This assumes the bit fade test
works, which cannot be independently verified from these results.
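The exposure arithmetic above can be sketched as follows. This is only an illustration: the 5 days and the 840 Gigabyte-hours come from the text, while the total memory under test is not stated and is inferred here from those two numbers.

```python
# Figures stated in the post.
TEST_HOURS = 5 * 24             # 5 solid days of testing = 120 hours
EXPOSURE_GB_HOURS = 840.0       # quoted bound: < 1 error per 840 GB-hours

# Inferred, NOT stated in the post: total memory under test.
implied_memory_gb = EXPOSURE_GB_HOURS / TEST_HOURS

# If a bit can only be lost in one direction (1 -> 0 or 0 -> 1), only the
# passes holding the vulnerable pattern count, i.e. half of the exposure.
one_direction_exposure = EXPOSURE_GB_HOURS / 2

print(f"implied memory under test: {implied_memory_gb:.1f} GB")
print(f"bound, either direction:   < 1 error per {EXPOSURE_GB_HOURS:.0f} GB-hours")
print(f"bound, one direction only: < 1 error per {one_direction_exposure:.0f} GB-hours")
```

With zero errors observed, these figures are only upper limits on the error rate, not estimates of it.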
On the web there are references to an IBM study which found 1 bit
error per 256 MB per month, which would be (0.25 * 30 * 24) =
1 per 180 Gigabyte-hours. If IBM's numbers held for my hardware
I should have seen 4 or 5 errors in total. My machines are in a basement
in a concrete building; perhaps that provided some shielding relative to
what IBM used for their test conditions.
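The comparison with the IBM figure can be checked with a few lines of Python. The rate and exposure are the numbers quoted in the text. The final step is an added assumption, not part of the original test: it treats errors as independent events following a Poisson distribution to ask how likely a clean run would be at IBM's rate.

```python
import math

# Figures quoted in the post.
IBM_MEMORY_GB = 0.25            # 256 MB
HOURS_PER_MONTH = 30 * 24       # 720 hours, as in the post's arithmetic
EXPOSURE_GB_HOURS = 840.0       # exposure of the bit fade test

# IBM rate expressed as GB-hours per single bit error.
ibm_gb_hours_per_error = IBM_MEMORY_GB * HOURS_PER_MONTH

# Errors expected over this test if IBM's rate applied.
expected_errors = EXPOSURE_GB_HOURS / ibm_gb_hours_per_error

# ASSUMPTION (not in the post): independent errors, Poisson distributed.
# Probability of observing zero errors given that expectation.
p_zero = math.exp(-expected_errors)

print(f"IBM rate: 1 error per {ibm_gb_hours_per_error:.0f} GB-hours")
print(f"expected errors in this test: {expected_errors:.2f}")
print(f"chance of seeing none (Poisson assumption): {p_zero:.4f}")
```

Under that assumption a clean run would be unlikely at IBM's rate (roughly a 1% chance), which fits the suggestion that these machines see a lower error rate than IBM's test conditions did.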
The memory was Corsair Twinx1024-3200C2. When first installed, all
of this memory ran for 24 hours with no errors in normal
memtest86+ testing.
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech