ecc-memory.

Jim Lux James.P.Lux at jpl.nasa.gov
Tue Jan 16 15:50:55 PST 2001


Many years ago, I worked on a Multibus-II system with ECC memory.  Bad
hardware bus drivers showed up as corrected bit error interrupts.  We
discovered the problem because we were getting double bit errors (DBEs) as
well, and even at a single bit error rate of several a day, we shouldn't
have been seeing any DBEs.
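
To put a rough number on "shouldn't have been seeing DBEs", here is a
back-of-envelope sketch in Python.  It assumes upsets are independent and
land uniformly across the ECC words, and that a flipped bit stays in memory
until a scrub or rewrite clears it.  The function name, the word count, and
the exposure time are made up for illustration; they are not figures from
the Multibus-II system.

```python
# Back-of-envelope estimate of how often two independent single-bit upsets
# should land in the same ECC word. All input numbers are hypothetical.

def expected_dbe_per_day(sbe_per_day, ecc_words, exposure_days):
    """Approximate double-bit error rate from independent random upsets.

    sbe_per_day   : single-bit error rate for the whole memory
    ecc_words     : number of independently protected ECC words
    exposure_days : how long a flipped bit stays in memory before a
                    scrub or rewrite clears it
    """
    # Average number of words currently holding an uncleared single-bit error.
    words_with_pending_error = sbe_per_day * exposure_days
    # Chance that the next upset lands in one of those words.
    p_hit_same_word = words_with_pending_error / ecc_words
    return sbe_per_day * p_hit_same_word


# Hypothetical inputs: 5 SBEs per day, one million ECC words, 1 day exposure.
rate = expected_dbe_per_day(sbe_per_day=5.0, ecc_words=1_000_000, exposure_days=1.0)
print(f"expected DBEs per day: {rate:.2e}")
print(f"mean years between DBEs: {1.0 / rate / 365.0:.0f}")
```

With inputs anywhere near these, random upsets alone give a DBE only very
rarely, so DBEs turning up alongside a few SBEs a day point at correlated
faults (such as a bad driver) rather than at radiation.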

So, Serguei's comment that the difference might be due to parts quality
and system design is quite relevant.  Offhand, I wouldn't expect 1 km
of air to provide enough shielding to change the upset rate by a factor of
50.  The estimates of how much the rate increases with altitude are all over
the map (some articles about radiation and airline safety talk about
extremely high ratios, but they assume polar flights, during a solar flare,
etc.).
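
For reference, the "50 times" figure can be checked by normalizing both
reports to errors per GB per day.  The sketch below uses only the numbers
Josip and Serguei give in the quoted message; the variable names are just
for illustration.

```python
# Normalize both reported error counts to errors per GB per day.
josip_per_gb_day = (1.0 / 7.0) / 37.5   # 1 SBE per week in 37.5 GB
serguei_per_gb_day = 50.0 / 220.0       # 50 corrected errors per day in 220 GB

ratio = serguei_per_gb_day / josip_per_gb_day

print(f"Josip:   {josip_per_gb_day:.5f} errors/GB/day")
print(f"Serguei: {serguei_per_gb_day:.5f} errors/GB/day")
print(f"ratio:   {ratio:.1f}")
```

The ratio comes out a little under 60, so "about fifty times" is the right
order of magnitude.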

One source, http://www.prioritiesforhealth.com/1102/rad.html, which at
least appears superficially authoritative (they use the right units, etc.),
gives the following annual doses from cosmic rays:

Sea level background       -  30 mRem/yr
Denver (ca. 1600 m)        -  50 mRem/yr
Mexico City (2260 m)       -  70 mRem/yr
La Paz, Bolivia (3660 m)   - 180 mRem/yr

Also of some interest: the background radiation in Calgary from uranium
and thorium in the rocks might increase the upset rate, but again, not by a
factor of 50.
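
As a quick check on the cosmic ray figures listed above, the sketch below
computes each location's dose relative to sea level.  It assumes, loosely,
that the upset rate scales with cosmic ray dose, which is a simplification
and not something the source claims.

```python
# Cosmic ray dose in mRem/yr, as listed above.
cosmic_mrem_per_year = {
    "sea level": 30,
    "Denver (~1600 m)": 50,
    "Mexico City (2260 m)": 70,
    "La Paz (3660 m)": 180,
}

baseline = cosmic_mrem_per_year["sea level"]

# Dose at each location as a multiple of the sea level dose.
ratios = {place: dose / baseline for place, dose in cosmic_mrem_per_year.items()}

for place, ratio in ratios.items():
    print(f"{place:22s} {ratio:4.1f} x sea level")
```

Even La Paz, at more than three times Calgary's altitude, comes out at
only 6 times the sea level dose.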

If you are interested, there are a number of sites which give solar
weather statistics, which directly affect the number of solar-originating
particles that might cause upsets.  You could correlate solar particle flux
against your observed bit error rates (or intervals) and determine whether
the errors are radiation induced or something else.  (If nothing else, you
should see a 24 hour periodicity if it is solar related.)
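
A minimal sketch of that check in Python, assuming you have already binned
both the published particle flux and your logged ECC corrections into
hourly series of equal length.  The data below is synthetic and exists only
to make the example run; the function and variable names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def lag_autocorrelation(series, lag):
    """Correlation of a series with itself shifted by `lag` samples."""
    return pearson(series[:-lag], series[lag:])


# Synthetic two-week example: flux with a daily cycle, and error counts
# that follow it. Replace both lists with real hourly data.
hours = range(24 * 14)
flux = [10.0 + 4.0 * math.sin(2.0 * math.pi * h / 24.0) for h in hours]
errors = [round(f / 2.0) for f in flux]

flux_vs_errors = pearson(flux, errors)
daily_cycle = lag_autocorrelation(errors, 24)

print(f"correlation of errors with flux: {flux_vs_errors:.2f}")
print(f"error autocorrelation at 24 h:   {daily_cycle:.2f}")
```

With real logs at a few tens of errors a day, the hourly counts will be
noisy, so it will probably take many weeks of data before either number
means much.  Values near zero for both would point away from a solar cause.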


-----Original Message-----
From: Serguei Patchkovskii <patchkov at ucalgary.ca>
To: Greg Lindahl <glindahl at hpti.com>
Cc: josip at icase.edu <josip at icase.edu>; beowulf at beowulf.org
<beowulf at beowulf.org>
Date: Tuesday, January 16, 2001 2:59 PM
Subject: RE: D-Link switch and ecc-memory.


>On Tue, 16 Jan 2001, Greg Lindahl wrote:
>> > My best estimate is that our system corrects one single bit error (SBE)
>> > per week in 37.5 GB of ECC memory.  This translates into SBE event
>> > intervals of about 9 months per GB of RAM.  Your mileage may vary...
>>
>> Josip neglected to mention that he is at sea level. If you are at a higher
>> altitude, you will see more errors.
>
>Indeed. Here in Calgary (1 kilometer above the sea level), I count an average
>of 50 corrected memory errors _per_day_ for 220 Gbytes of memory over the
>last three months - or about fifty times the Josip's rate. This average
>excludes three systems with failing memory - which we hadn't got around to
>replace yet. (These three have the error rate of about 30 times the median).
>
>How much of the difference is due to an increase in cosmic radiation, and
>how much is due to the differences in parts quality and system design,
>I am not qualified to assess.
>
>Regards,
>
>/Serge.P
>
>---
>Home page: http://www.cobalt.chem.ucalgary.ca/ps/
>
>




