[Beowulf] ECC Scrub, which setting?
Jim Lux
james.p.lux at jpl.nasa.gov
Mon May 19 21:51:22 PDT 2008
Quoting Mark Hahn <hahn at mcmaster.ca>, on Mon 19 May 2008 08:47:46 PM PDT:
>>> It is currently set to
>>> Basic, which scrubs every 5.24 ms.
>>
>> You'll have to look in the manual to find out what that means -- it's
>> probably "do a small amount of scrubbing every 5.24 ms". And you have
>
> I expect it's the interval between cacheline-sized (64B) scrubs. as
> such, I think it's much too low (4G ram in 98 hours!)
Too low, based on what assumption for upset rate?
If the rate is, say, 1E-13 upset/bit/day, and you've got 1 Gbyte
(roughly 1E10 bits), you're looking at 1E-3 upsets/day. Since the ECC
will correct the error, what you're really fighting with the scrubbing
is the probability of a *double* error in the same word. How likely that
is depends on the error statistics, i.e. whether you get multiple bit
errors in the same word (unlikely with most memory layout schemes, which
spread words across the geometry, but you never know...).
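For anyone who wants to redo the arithmetic, here is a rough sketch in Python. It assumes, as Mark does, that each 5.24 ms tick scrubs one 64-byte cacheline, and it treats a Gbyte as 2**30 bytes; the function names are made up for illustration.

```python
# Back-of-envelope numbers for the scrub interval and upset rates.
# Assumptions (not from any manual): one 64-byte cacheline is scrubbed
# per 5.24 ms tick, and upsets are spread uniformly over all bits.

GIB = 2**30  # bytes


def full_scrub_hours(ram_bytes, interval_s=5.24e-3, line_bytes=64):
    """Hours needed to scrub all of RAM once, one cacheline per tick."""
    lines = ram_bytes / line_bytes
    return lines * interval_s / 3600.0


def upsets_per_day(ram_bytes, rate_per_bit_per_day):
    """Expected single-bit upsets per day over the whole memory."""
    return ram_bytes * 8 * rate_per_bit_per_day


if __name__ == "__main__":
    # 4 Gbyte at one cacheline per 5.24 ms: a bit under 100 hours
    print(full_scrub_hours(4 * GIB))
    # 1 Gbyte at 1E-13 upset/bit/day: a bit under 1E-3 per day
    print(upsets_per_day(1 * GIB, 1e-13))
    # 4 Gbyte at 1E-12 upset/bit/hour: somewhat under one per day
    print(upsets_per_day(4 * GIB, 1e-12 * 24))
```

The rates plugged in are only the illustrative ones from this message, not measured values for any particular DIMM.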
And if you DO get a double error, the ECC code will detect it, and you
can halt or take corrective measures (i.e. throw away that work
package's output, and restart from a checkpoint, etc.)
Even if the rate is much higher, say 1E-12 upset/bit/hour (2.4E-11
upset/bit/day, about 240 times the 1E-13 I used above), and say you've
got 4 Gbyte of RAM, you're now looking at roughly one (fully corrected)
upset per day. The probability of an undetected error is still quite low
(requiring at least 3 errors in the same word), and the probability of a double bit
error causing an abort (within the 100 or so hours you calculated for
the scrub) is probably low enough that it wouldn't materially affect
your computation rate. And this assumes that your OS doesn't
autoscrub on a detected Single Bit Error, perhaps because the hardware
doesn't support it.
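To put a rough number on "probably low enough", here is a sketch that estimates how many words would pick up a second upset before the scrubber gets back to them. It assumes independent, uniformly distributed upsets (a Poisson model), 72-bit ECC words (64 data plus 8 check bits), and no correction between scrub passes; all of these are assumptions for illustration, and the function name is hypothetical.

```python
# Expected number of words that collect two or more upsets during one
# full scrub pass, under a simple Poisson model.
# Assumptions: independent upsets, uniform over bits, 72-bit ECC words
# (64 data + 8 check), nothing corrected until the scrubber comes around.

GIB = 2**30  # bytes


def expected_double_error_words(ram_bytes, rate_per_bit_per_hour,
                                scrub_hours, word_bits=72):
    """Expected count of words with >= 2 upsets within one scrub pass."""
    words = ram_bytes / 8  # 8 data bytes per ECC word
    lam = rate_per_bit_per_hour * word_bits * scrub_hours  # mean upsets/word
    # For lam << 1, P(at least 2 upsets in a word) is about lam**2 / 2.
    p_double = lam * lam / 2.0
    return words * p_double


if __name__ == "__main__":
    # 4 Gbyte, 1E-12 upset/bit/hour, roughly 98 hours per scrub pass
    print(expected_double_error_words(4 * GIB, 1e-12, 98.0))
```

With those inputs the result is on the order of 1E-8 per pass, so under this model an abort from a double-bit error would be very rare. Correlated failures (one particle strike flipping adjacent bits, or a flaky board) break the independence assumption and would change the answer.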
OTOH, if the ECC is protecting you from a lousy mobo design with
timing glitches and crosstalk between traces manifesting as errors...
Jim