[Beowulf] Not quite Walmart, or, living without ECC?

Michael Will mwill at penguincomputing.com
Tue Nov 27 10:35:06 PST 2007


We have found that linpack is a far better memory tester than
Memtest86+. Memtest86+ does not find all the bad RAM that linpack
triggers, whose errors are visible through mcelog and through the
IPMI BMC logs. The nice thing about the BMC log entries is that they
tell you which DIMM in which CPU bank caused the ECC error, so you
don't need to troubleshoot with a lengthy divide-and-conquer approach.

Michael
-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]
On Behalf Of David Mathog
Sent: Tuesday, November 27, 2007 9:54 AM
To: Tony Travis
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] Not quite Walmart, or, living without ECC?

Tony Travis wrote:
> Memtest86+ is fine for 'burn-in' tests, but it does not do a realistic
> memory stress test under the conditions that normal applications run. 

Wow, deja vu.  I just remembered we had almost exactly this same
discussion 2 years ago; in fact, I apparently sent you my hacked-up
version of memtester, which has delays in it between the write and
read cycles to allow it to catch bit fade (due to radiation or
whatever).
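
For readers who have not seen the write/delay/read idea, here is a
minimal sketch of it in Python. It is not the patched memtester
mentioned above, and it is not a real memory tester: the buffer lives
in ordinary virtual memory and goes through the CPU caches, so it only
illustrates the control flow. The names fade_test, delay_s and corrupt
are assumptions made up for this example; corrupt is a hypothetical
hook that simulates a flipped bit so the detection path can be
exercised.

```python
import time


def fade_test(size_bytes=1 << 20, delay_s=0.0,
              patterns=(0x00, 0xFF, 0xAA, 0x55), corrupt=None):
    """Write each pattern, wait delay_s seconds, then read it back.

    Returns a list of (pattern, offset, value_read) tuples, one per
    byte that no longer matches what was written.
    """
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected                # write phase
        if corrupt is not None:
            corrupt(buf)                 # hypothetical fault injection
        time.sleep(delay_s)              # idle phase: give bits time to fade
        if buf != expected:              # read phase: cheap whole-buffer check
            for offset, value in enumerate(buf):
                if value != pattern:
                    errors.append((pattern, offset, value))
    return errors
```

A call such as fade_test(size_bytes=256 << 20, delay_s=60) would hold
each pattern for a minute before verifying it. An empty list means no
mismatch was seen, which says nothing about memory the buffer did not
happen to occupy.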

One thing I still don't get, though: if memtester is catching memory
errors which only appear when _other parts of the system are active_,
does replacing the "bad" memory actually cure these problems?  That is,
if memtest86+ runs cleanly and memtester finds problems, is it really
the memory which is the issue?

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech



