[Beowulf] Acceptable rad limits for cluster rooms?
Jim Lux
James.P.Lux at jpl.nasa.gov
Mon Jun 19 07:29:13 PDT 2006
At 06:32 AM 6/19/2006, Mark Hahn wrote:
> > OK, now to the gory details.
>
>isn't the "kind" of radiation also really important? IANA-physicist,
>but alphas are pretty inconsequential to metal-cased stuff, no?
Oh, sure... But that starts to get into a lot of tricky issues surrounding
things like secondary emission (that is, a 1 GeV alpha may not penetrate
the case, but when it stops, the odds of generating something new are pretty
good). People devote their entire careers to understanding radiation
effects on electronic parts, and they're discovering new stuff every day,
particularly as process geometries get smaller.
I'm not sure it's available at this link, but here is
"An introduction to space radiation effects on microelectronics"
http://parts.jpl.nasa.gov/docs/JPL00-62.pdf
If that link doesn't work (or it's blocked for some reason), let me know
and I'll find a way to get you the document. (It's approved for public
release, so there shouldn't be a problem.)
> > to get hit, and c) the data sits there a long time. Bear in mind, though,
> > that the processor itself is probably pretty susceptible to SEU, and
> > doesn't have ECC internally. Nor is the data and address bus
> > protected. The same applies to most of the peripheral chips.
>
>hmm, both AMD and Intel claim parity protection on internal buses;
>I don't know whether this is just a marketing checkoff...
Indeed, parity is used on internal buses, which then raises the question of
"parity" versus "EDAC" (Error Detection and Correction): what happens when
an error is detected? Do you just interrupt? Do you fix it and go on? As
soon as you start talking EDAC, you also have to worry about latency (EDAC
usually takes a couple of clocks: one to do the parity check with the
syndrome bits, a second to correct the errors).
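As a rough illustration of the check-then-correct split described above,
here is a minimal sketch of a single-error-correcting, double-error-detecting
code on a 4-bit word. The choice of an extended Hamming (8,4) code and the
names encode, check and correct are assumptions made for the example; the
message does not say which code any particular processor uses, and real
hardware does this in combinational logic rather than software.

```python
def encode(data):
    """Encode 4 data bits as an 8-bit extended Hamming word.

    Layout (positions 1..7, then an overall parity bit):
    p1, p2, d1, p4, d2, d3, d4, p_overall
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]
    overall = 0
    for bit in word:
        overall ^= bit
    return word + [overall]


def check(word):
    """Step 1: compute the syndrome and the overall parity.

    The syndrome is the XOR of the positions (1..7) of all set bits;
    it is 0 for a valid word and equals the position of a single flip.
    """
    syndrome = 0
    overall = 0
    for pos, bit in enumerate(word[:7], start=1):
        if bit:
            syndrome ^= pos
        overall ^= bit
    overall ^= word[7]
    return syndrome, overall


def correct(word):
    """Step 2: act on the result of check().

    Returns (word, status) with status "ok", "corrected" or
    "uncorrectable" (a detected double-bit error).
    """
    syndrome, overall = check(word)
    fixed = list(word)
    if syndrome == 0 and overall == 0:
        return fixed, "ok"
    if overall == 1:
        # Odd number of flips: treat as a single-bit error. A zero
        # syndrome means the overall parity bit itself was hit.
        index = syndrome - 1 if syndrome else 7
        fixed[index] ^= 1
        return fixed, "corrected"
    # Non-zero syndrome with even overall parity: two bits flipped.
    return fixed, "uncorrectable"
```

Flipping any one bit of encode([1, 0, 1, 1]) and passing the result to
correct() restores the original word; flipping two bits yields the
"uncorrectable" status, which is the case where a system has to decide
whether to interrupt or halt.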
Other interesting architectural questions... Is the Program Counter
protected, or for that matter, the stack pointer and ALU registers?
I was thinking more along the lines of the off-chip interfaces, e.g., the
PCI bus, which generally isn't protected. And, of course, you'd have to
worry about buffers in all manner of peripherals (network interfaces, disk
drives, etc.).
In practice, you're more likely to get errors from electrical noise, etc.,
than from radiation effects.
BTW, an interesting exercise is to look at the distribution of detected
errors, and see if it matches what should be happening for random
radiation-induced events. Some 25 years ago, I worked with a system that
had error-corrected RAM, and it would throw double-bit errors (DBE) every
few days, triggering a system halt. Since we also collected statistics on
detected single-bit errors (SBE, which were corrected), it was a fairly
simple matter to show that the DBE rate wasn't consistent with the SBE rate
(to first order, you'd expect Pdbe = Psbe^2), and hence, we determined that
the real problem was timing margin on the bus.
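The consistency check described above can be sketched as follows. This is
an illustration under stated assumptions, not the analysis that was actually
run on that system: it takes the first-order relation Pdbe = Psbe^2 from the
text at face value (good to an order of magnitude), assumes independent
upsets, and models the DBE count as Poisson. The function names and the
notion of "word_intervals" (number of words times number of check intervals)
are hypothetical.

```python
import math


def expected_dbe(sbe_count, word_intervals):
    """Expected number of double-bit errors if upsets are independent.

    Estimates Psbe from the observed single-bit error count, then
    applies the first-order relation Pdbe = Psbe^2.
    """
    p_sbe = sbe_count / word_intervals
    return word_intervals * p_sbe ** 2


def poisson_tail(k, mean, terms=200):
    """P(X >= k) for X ~ Poisson(mean).

    Sums the tail terms directly in log space, which stays accurate
    for the very small means of interest here; the fixed number of
    terms is only adequate when mean is well below about 100.
    """
    if k <= 0:
        return 1.0
    if mean <= 0:
        return 0.0
    total = sum(
        math.exp(-mean + i * math.log(mean) - math.lgamma(i + 1))
        for i in range(k, k + terms)
    )
    return min(1.0, total)


def dbe_consistent(sbe_count, dbe_count, word_intervals, alpha=1e-3):
    """True if the observed DBE count is plausible for random upsets."""
    mean = expected_dbe(sbe_count, word_intervals)
    return poisson_tail(dbe_count, mean) >= alpha
```

With made-up numbers such as 1000 SBEs and 10 DBEs over 1e9 word-intervals,
dbe_consistent(1000, 10, 1e9) returns False: far more double-bit errors
than independent random hits would produce, which points to a systematic
cause such as noise or timing.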
James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875