[Beowulf] Memory errors poll
Lux, James P
james.p.lux at jpl.nasa.gov
Mon Mar 30 08:42:03 PDT 2009
> -----Original Message-----
> From: beowulf-bounces at beowulf.org
> [mailto:beowulf-bounces at beowulf.org] On Behalf Of Mark Hahn
> Sent: Sunday, March 29, 2009 10:11 PM
> To: ariel sabiguero yawelak
> Cc: Beowulf at beowulf.org
> Subject: Re: [Beowulf] Memory errors poll
> > Could those of you running ECC memory give me an updated figure on
> > the number of errors detected/corrected per day per system?
> we replace dimms which show > 1000 corrected ECCs per day (or
> any overflows, for which counts are inaccurate, or any
> uncorrectable errors.)
That seems a remarkably high rate for raw memory errors. Micron quotes something like 100 soft errors per 1E9 device hours. (That's a FIT, or failures in time, of 100.)
If I saw that rate, I'd assume that there's something seriously wrong with the part.
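To put the two numbers on the same scale, here is a rough Python sketch that converts the 1000-corrected-errors-per-day replacement threshold into FIT and compares it with the roughly 100 FIT figure quoted above. The function and variable names are made up for illustration. Note that the threshold is per DIMM while the quoted figure is per device, so the ratio is only an order-of-magnitude indication.

```python
# FIT = failures per 1e9 device hours.
HOURS_PER_FIT_WINDOW = 1e9


def errors_per_day_to_fit(errors_per_day):
    """Convert an observed rate in errors/day to FIT (errors per 1e9 hours)."""
    errors_per_hour = errors_per_day / 24.0
    return errors_per_hour * HOURS_PER_FIT_WINDOW


# Replacement threshold from the quoted message: 1000 corrected ECCs per day.
threshold_fit = errors_per_day_to_fit(1000)

# Soft error rate quoted above (approximate, per device).
quoted_fit = 100.0

ratio = threshold_fit / quoted_fit
print(f"threshold = {threshold_fit:.3g} FIT, "
      f"about {ratio:.1g} times the quoted device rate")
```

Even allowing for a couple of dozen devices on a DIMM, the threshold sits many orders of magnitude above the quoted soft error rate, which is why a part showing that rate looks broken rather than unlucky.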
> > I have an old figure of about 1 error-bit per day per system at sea
> > level, but I would like to know if it is getting worse or better.
This is something readily available from the memory manufacturers, at the device level.
Beware of random stuff you read on the web. That is, check the date of the data being used in the article. Technologies change pretty substantially over time, so observations about DRAM error rates in 1998 probably aren't applicable to DRAM error rates in 2008 (unless you happen to be using 10-year-old memory!)
A recent paper by Borucki, Schindlbeck and Slayman (IEEE CFP 08 RPS-CDR 46th ann. Intl. Rel. Physics Symp. 2008, pp482ff) comments that for modern parts, high energy cosmic rays are more important than alpha particles, and reports on measurements made on DIMMs. They blasted modern mobos in a neutron test facility, and then scaled for New York. It looks like about 100-200 FIT/Gb, which corresponds with Micron's numbers, above. They also looked at multibit and logic errors as well as simple memory cell errors. As expected, the SEU rate (per bit) is going down as features get smaller, but logic error rates stay roughly the same.
OK.. So you've got a box with, say, 4 Gbyte of RAM. That's 32 Gb, so at roughly 150 FIT/Gb you'd expect something like 5000 errors per 1E9 hours, or 5 errors per 1E6 hours. That's an error every 200,000 hours, or about 22 years (if my before-coffee math in my head is right).
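For anyone who wants to redo that arithmetic after coffee, a minimal Python sketch follows. It assumes the 100-200 FIT/Gb range from the paper and a 4 Gbyte box; the helper names are invented for this example, and a year is taken as 365.25 days.

```python
HOURS_PER_YEAR = 24 * 365.25


def mean_hours_between_errors(capacity_gbit, fit_per_gbit):
    """Mean time between soft errors, in hours, for a given memory size.

    FIT is errors per 1e9 hours, so total FIT scales linearly with capacity.
    """
    total_fit = capacity_gbit * fit_per_gbit
    return 1e9 / total_fit


def hours_to_years(hours):
    return hours / HOURS_PER_YEAR


capacity_gbit = 4 * 8  # 4 Gbyte of RAM is 32 Gbit

for fit_per_gbit in (100, 150, 200):
    hours = mean_hours_between_errors(capacity_gbit, fit_per_gbit)
    print(f"{fit_per_gbit} FIT/Gb: one error every {hours:,.0f} hours "
          f"(~{hours_to_years(hours):.0f} years)")
```

Across the whole 100-200 FIT/Gb range the answer stays at one error every couple of decades per box, so the 22 year estimate is not sensitive to the exact figure chosen.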
I suspect that most "memory errors" reported for PCs (whether in clusters or not) are manifestations of bus timing problems, perhaps over temperature, rather than actual bit flips in memory. The actual measured rate of single event upsets is so low
> we have several thousand nodes, and most of them go for
> months without any corrected ECCs (probably all within 200M
> of sea level).