[Beowulf] Memory errors poll

Mark Hahn hahn at mcmaster.ca
Mon Mar 30 21:14:06 PDT 2009

>> we replace dimms which show > 1000 corrected ECCs per day
>> (or any overflows, for which counts are inaccurate, or any uncorrectable
>> errors.)
> These systems are a couple of generations old, right?

waaait a minute - I think I gave the wrong impression.  we have about
13 TB of this gen hardware (yes, from 3 years ago).  our observed rate
is that at any given moment, a fraction of 1% of the nodes show any
corrected ECCs at all.  our vendor is happy to replace dimms that have a
nontrivial rate, and there aren't a lot of nodes that have had this done.

one interesting thing is that during a 3-year period, seems like about 1%
of nodes developed higher corrected-ECC rates that disappeared when the
dimms were reseated.  I wonder whether this was the result of thermal
cycling...

> I think I have Linux set up to record single-bit errors, and the rate

using edac?  I toyed with mcelog before that, but never really got much
traction until edac came with an updated kernel.
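
for anyone wanting to check counts by hand, here is a minimal sketch (not
our actual tooling) of reading the per-controller counters that edac exposes
in sysfs and applying the replacement rule quoted at the top.  the function
names read_counts and needs_replacement are made up for illustration, the
caller is assumed to supply the count deltas and the elapsed time between
two samples, and the overflow case is not handled.

```python
import glob
import os

# Standard EDAC sysfs location; present only when an EDAC driver is loaded.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrectable)} for each mc* directory."""
    counts = {}
    for mc in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        try:
            with open(os.path.join(mc, "ce_count")) as f:
                ce = int(f.read().strip())
            with open(os.path.join(mc, "ue_count")) as f:
                ue = int(f.read().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[os.path.basename(mc)] = (ce, ue)
    return counts


def needs_replacement(ce_delta, ue_delta, days, ce_per_day_limit=1000):
    """Hypothetical helper: more than ce_per_day_limit corrected errors
    per day, or any uncorrectable error, flags the memory for replacement."""
    if ue_delta > 0:
        return True
    return days > 0 and ce_delta / days > ce_per_day_limit


if __name__ == "__main__":
    for mc, (ce, ue) in read_counts().items():
        print(mc, "corrected:", ce, "uncorrectable:", ue)
```

the counters are cumulative since the driver was loaded, so a rate needs
two samples taken some known time apart.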

