[Beowulf] Memory errors poll

Mark Hahn hahn at mcmaster.ca
Sun Mar 29 22:11:20 PDT 2009

> Could those of you running ECC memory give me an updated figure on the
> number of errors detected/corrected per day per system?

we replace DIMMs which show more than 1000 corrected ECC errors per day,
any counter overflows (after which the counts are inaccurate), or any
uncorrectable errors.
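
As a rough illustration only, the replacement rule above could be written
as a small check. The 1000-per-day threshold comes from the post; the
DimmStats record and its field names are hypothetical, and how the counts
are actually collected on each node is not specified here.

```python
from dataclasses import dataclass

# Threshold taken from the post: more than 1000 corrected errors per day.
CE_PER_DAY_LIMIT = 1000


@dataclass
class DimmStats:
    """Hypothetical per-DIMM summary; field names are assumptions."""
    corrected_per_day: int   # corrected ECC errors seen in the last day
    overflowed: bool         # error counter overflowed, so counts are unreliable
    uncorrectable: int       # uncorrectable errors seen


def should_replace(stats: DimmStats) -> bool:
    """Return True if the DIMM meets any of the three replacement criteria."""
    return (
        stats.corrected_per_day > CE_PER_DAY_LIMIT
        or stats.overflowed
        or stats.uncorrectable > 0
    )
```

Note that a DIMM with exactly 1000 corrected errors per day is not flagged
by this sketch, since the stated rule is "more than 1000".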

> I have an old figure of about 1 error-bit per day per system at sea
> level, but I would like to know if it is getting worse or better.

we have several thousand nodes (probably all within 200 m of sea level),
and most of them go for months without any corrected ECC errors.

