[Beowulf] Memory errors poll

Mon Mar 30 21:12:43 PDT 2009

>>> Could those of you running ECC memory give me an updated figure on
>>> the number of errors detected/corrected per day per system?
>>
>> we replace dimms which show > 1000 corrected ECCs per day (or
>> any overflows, for which counts are inaccurate, or any
>> uncorrectable errors.)
>
>
> That seems a remarkably high rate, for the raw memory errors. Micron quotes
> something like 100 soft errors per 1E9 device hours. (That's a
> FIT:failure in time of 100)

1000 per day seems high?  it doesn't worry me much, since it's low enough
that there will be very few double errors by coincidence, and almost
certainly no measurable overhead.  (overhead of polling and logging CEs
_is_ measurable on machines with bad dimms, btw.)
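
for concreteness, here is a minimal sketch of that replacement rule in
python.  it assumes the counters come from the linux EDAC sysfs tree
(per-csrow ce_count and ue_count files), which is an assumption on my part
and not something stated above.  the names read_counts and should_replace
are hypothetical, and overflow detection is left to the caller because it
depends on the hardware.

```python
from pathlib import Path

CE_PER_DAY_LIMIT = 1000  # threshold quoted in this thread


def read_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {csrow_path: (ce_count, ue_count)}.

    Assumes the Linux EDAC sysfs layout (mc*/csrow*/ce_count, ue_count).
    """
    counts = {}
    for csrow in sorted(Path(edac_root).glob("mc*/csrow*")):
        ce = int((csrow / "ce_count").read_text())
        ue = int((csrow / "ue_count").read_text())
        counts[str(csrow)] = (ce, ue)
    return counts


def should_replace(ce_delta, ue_delta, elapsed_days, overflowed=False):
    """Apply the rule above: > 1000 CEs/day, any overflow, or any UE.

    ce_delta and ue_delta are counter increases over elapsed_days.
    overflowed is a caller-supplied flag (how to detect it is hardware-specific).
    """
    if ue_delta > 0 or overflowed:
        return True
    return ce_delta / elapsed_days > CE_PER_DAY_LIMIT
```

polling read_counts() at a fixed interval and feeding the differences to
should_replace() gives a per-day rate; the polling interval itself is a
tradeoff, since the overhead mentioned above grows with how often you look.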

these dimms have 16 chips.  also, these are observed CEs, which include
problems due to other dimms, sockets, the csrow bus and the (opteron)
memory controller.  I'm also not claiming that there are a significant
number of dimms showing > 0 but < 1000 CEs/day.
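
to put the quoted FIT figure next to the threshold, here is the arithmetic
as a small python sketch.  it assumes "device" in the Micron figure means
one dram chip, and that all 16 chips on a dimm contribute independently;
both are assumptions, not something the quote spells out.

```python
FIT_PER_DEVICE = 100   # soft errors per 1e9 device-hours (figure quoted above)
CHIPS_PER_DIMM = 16    # from this thread
HOURS_PER_DAY = 24
CE_THRESHOLD = 1000    # observed CEs per day that trigger replacement


def expected_soft_errors_per_day(fit=FIT_PER_DEVICE, chips=CHIPS_PER_DIMM):
    """Expected raw soft errors per dimm per day, assuming one FIT 'device' = one chip."""
    return fit * chips * HOURS_PER_DAY / 1e9


rate = expected_soft_errors_per_day()
print(f"{rate:.2e} expected soft errors per dimm per day")
print(f"threshold is about {CE_THRESHOLD / rate:.1e} times that rate")
```

under those assumptions the raw soft-error rate is about 3.84e-5 per dimm
per day (roughly one every 70 years), so a dimm at the threshold is around
seven orders of magnitude above it.  that gap is consistent with the
observed CEs being dominated by something other than random bit flips.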

> If I saw that rate, I'd assume that there's something seriously wrong with the part.

perhaps.  one problem is that I don't have a good load-generator.
when idle, or loaded with light-footprint jobs, even nodes with a real
problem can wind up reporting few CEs.

initially, my attempt at a load-generator was simply a multithreaded 
stream-like thing that kept blasting bit-patterns into big arrays.
as far as I know, a faulty node is as likely to write bad ECC as to read
it, so you have to alternate r/w cycles.  but sequential access is probably
less than optimal (and is perhaps why memtest86 sometimes gives false negatives).
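
a rough python sketch of the idea, not the actual load-generator: each
thread alternates a write pass and a read pass over its own buffer, and
walks the buffer with a stride instead of sequentially.  the patterns, the
stride choice and all function names are hypothetical.  python is far too
slow to load a memory system properly, so this only illustrates the access
pattern; a real version would be in C.  note also that corrected errors are
invisible to the program, so the mismatch count only catches uncorrected
corruption, and CEs still have to be read from the ECC hardware counters.

```python
import threading
from array import array
from math import gcd

# Illustrative bit patterns: all zeros, all ones, and two alternating-bit words.
PATTERNS = (0x0000000000000000, 0xFFFFFFFFFFFFFFFF,
            0xAAAAAAAAAAAAAAAA, 0x5555555555555555)


def pick_stride(n, start=4099):
    """Smallest stride >= start that is coprime to n, so the walk visits every index."""
    stride = start
    while gcd(stride, n) != 1:
        stride += 1
    return stride


def stress(n_words, passes, mismatches, lock):
    """Alternate write and read passes over one buffer, in strided order."""
    buf = array("Q", [0]) * n_words      # 64-bit unsigned words
    stride = pick_stride(n_words)
    bad = 0
    for p in range(passes):
        pattern = PATTERNS[p % len(PATTERNS)]
        i = 0
        for _ in range(n_words):         # write pass
            buf[i] = pattern
            i = (i + stride) % n_words
        i = 0
        for _ in range(n_words):         # read pass, verify what was written
            if buf[i] != pattern:
                bad += 1
            i = (i + stride) % n_words
    with lock:
        mismatches.append(bad)


def run(n_threads=2, n_words=1 << 16, passes=4):
    """Run the stress loop in several threads and return the total mismatch count."""
    mismatches, lock = [], threading.Lock()
    threads = [threading.Thread(target=stress,
                                args=(n_words, passes, mismatches, lock))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mismatches)
```

on healthy memory run() should return 0.  the stride is picked coprime to
the buffer length so every word is still touched exactly once per pass.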

> I suspect that most "memory errors" reported for PCs (whether in clusters
> or not) are manifestations of bus timing problems, perhaps over temperature,
> rather than actual bit flips in memory.  The actual measured rate of single
> event upsets is so low

sure.  I'm just talking about observed events reported by ECC hardware.
interestingly, it's easy to imagine a scenario where the MC trains its 
dram parameters at one temperature, but winds up operating at another.
and possibly operating poorly - things like skew are set by the bios 
and afaik never recalibrated.