[Beowulf] Memory errors poll
Mark Hahn
hahn at mcmaster.ca
Mon Mar 30 21:14:06 PDT 2009
>> we replace dimms which show > 1000 corrected ECCs per day
>> (or any overflows, for which counts are inaccurate, or any uncorrectable
>> errors.)
>
> These systems are a couple of generations old, right?
waaait a minute - I think I gave the wrong impression. we have about
13 TB of memory in this generation of hardware (yes, from 3 years ago).
what we observe is that at any given moment, a fraction of 1% of the nodes
show any corrected errors at all. our vendor is happy to replace dimms that
have a nontrivial rate, and not many nodes have needed this.
one interesting thing: over the 3-year period, it seems that about 1% of
nodes developed elevated corrected-error rates that disappeared when the
dimms were reseated. I wonder whether this was the result of thermal cycling...
> I think I have Linux set up to record single-bit errors, and the rate
using edac? I toyed with mcelog before that, but never really got much
traction until edac came with an updated kernel.
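
for anyone who wants to try the same thing, here is a minimal sketch of how the
edac counters could be polled and checked against the rule quoted at the top
(more than 1000 corrected errors per day, any overflow, or any uncorrectable
error). it is not the script we actually run. it assumes the usual edac sysfs
layout (/sys/devices/system/edac/mc/mcN/ce_count and ue_count); the function
names are made up for illustration, and treating a counter that goes backwards
as an overflow is my assumption, since a reboot or driver reload resets the
counters too.

```python
from pathlib import Path

# Usual Linux EDAC sysfs location; pass a different root when testing.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

# Rule quoted in this thread: more than 1000 corrected errors per day.
CE_PER_DAY_LIMIT = 1000


def read_counts(root=EDAC_ROOT):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[mc.name] = (ce, ue)
    return counts


def should_replace(ce_before, ce_after, seconds, ue_after,
                   limit=CE_PER_DAY_LIMIT):
    """Check two corrected-error samples taken `seconds` apart (hypothetical helper)."""
    if ue_after > 0:
        return True  # any uncorrectable error
    if ce_after < ce_before:
        # Assumption: a counter that went backwards means overflow or reset,
        # so the count is inaccurate and the dimm is flagged for a look.
        return True
    per_day = (ce_after - ce_before) * 86400 / seconds
    return per_day > limit
```

the idea would be to call read_counts() from cron, keep the previous sample with
its timestamp, and feed both samples to should_replace(). the per-controller
totals only say which node to look at; finding the actual dimm needs the
per-csrow or per-dimm counters, which vary with kernel version.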