[Beowulf] Memory errors poll

Greg Lindahl lindahl at pbm.com
Mon Mar 30 17:48:26 PDT 2009


On Mon, Mar 30, 2009 at 01:11:20AM -0400, Mark Hahn wrote:
>> Could those of you running ECC memory give me an updated figure on the
>> number of errors detected/corrected per day per system?
>
> we replace dimms which show > 1000 corrected ECCs per day
> (or any overflows, for which counts are inaccurate, or any uncorrectable 
> errors.)

These systems are a couple of generations old, right?

I think I have Linux set up to record single-bit errors, and the rate
I get is basically zero on, uh, 5 terabytes of modern RAM, at sea
level.
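
For anyone who wants to check their own nodes, here is a minimal sketch
of reading the corrected/uncorrected counters. It assumes the kernel's
EDAC driver is loaded and exposes per-controller ce_count and ue_count
files under /sys/devices/system/edac/mc; that is an assumption, since
the setup used above is not described. The function name
read_edac_counts is made up for illustration.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller counters in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} for each mcN found."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Counter file missing or unreadable; skip this controller.
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

The counters are cumulative since the driver was loaded, so a per-day
rate needs two readings taken a known interval apart.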

When I installed some new memory I had a few systems with modest
numbers of single-bit upsets, and the vendor was happy to swap dimms
until the problem went away. I think he also does that during his
factory burn-in.

-- greg




