[Beowulf] Acceptable rad limits for cluster rooms?

Jim Lux James.P.Lux at jpl.nasa.gov
Mon Jun 19 07:29:13 PDT 2006


At 06:32 AM 6/19/2006, Mark Hahn wrote:
> > OK, now to the gory details.
>
>isn't the "kind" of radiation also really important?  IANA-physicist,
>but alphas are pretty inconsequential to metal-cased stuff, no?


Oh, sure... But that starts to get into a lot of tricky issues surrounding 
things like secondary emission (that is, a 1 GeV alpha may not penetrate 
the case, but when it stops, the odds of generating something new are pretty 
good).  People do devote their entire lives to understanding radiation 
effects on electronics parts, and they're discovering new stuff every day, 
particularly as process geometries get smaller.

I'm not sure it's available at this link, but here is
"An introduction to space radiation effects on microelectronics"

http://parts.jpl.nasa.gov/docs/JPL00-62.pdf


If that link doesn't find it (or it's blocked for some reason), let me know 
and I'll find a way to get the document. (It's approved for public release, 
so there shouldn't be a problem).



> > to get hit, and c) the data sits there a long time. Bear in mind, though,
> > that the processor itself is probably pretty susceptible to SEU, and
> > doesn't have ECC internally.  Nor is the data and address bus
> > protected.  The same applies to most of the peripheral chips.
>
>hmm, both AMD and Intel claim parity protection on internal buses;
>I don't know whether this is just a marketing checkoff...

Indeed, parity on internal buses is used, which then brings up the question 
of "parity" versus "EDAC" (Error Detection and Correction): what happens 
when an error is detected? Do you just interrupt? Do you fix it and go on? 
As soon as you start talking EDAC, you also have to worry about latencies 
(the EDAC usually takes a couple of clocks to deal with: one to do the 
parity check with the syndrome bits, the second to correct the errors).
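
To make the "check the syndrome, then correct" sequence concrete, here is a 
minimal sketch of a single-error-correct, double-error-detect (SECDED) code 
on a 4-bit value. This is a toy extended Hamming code chosen for 
illustration, not what any particular CPU or memory controller implements; 
the names encode and decode are made up for the example.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # Hamming positions that carry data bits
PARITY_POSITIONS = (1, 2, 4)     # power-of-two positions carry check bits


def syndrome(bits):
    """XOR of the positions (1..7) of all set bits; 0 for a valid word."""
    syn = 0
    for pos in range(1, 8):
        if bits[pos]:
            syn ^= pos
    return syn


def encode(nibble):
    """Encode 4 data bits as an 8-bit list; index 0 is the overall parity."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    syn = syndrome(bits)
    for p in PARITY_POSITIONS:   # set check bits so the syndrome becomes 0
        bits[p] = 1 if syn & p else 0
    bits[0] = sum(bits[1:]) % 2  # overall parity makes the word even
    return bits


def decode(bits):
    """Return (nibble, status); status is ok, corrected or uncorrectable."""
    bits = list(bits)
    syn = syndrome(bits)         # step 1: the check
    odd = sum(bits) % 2          # 1 means an odd number of bits flipped
    if syn == 0 and not odd:
        status = "ok"
    elif odd:
        bits[syn] ^= 1           # step 2: the fix (syn 0 = parity bit itself)
        status = "corrected"
    else:
        status = "uncorrectable" # even number of flips, nonzero syndrome
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status
```

Flipping any one bit of encode(n) gives back (n, "corrected"); flipping any 
two gives "uncorrectable", which is the point where real hardware has to 
decide whether to interrupt, halt, or carry on.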

Other interesting architectural questions... Is the Program Counter 
protected, or for that matter, the stack pointer and ALU registers?

I was thinking more along the lines of the off-chip interfaces, e.g., the 
PCI bus, which generally isn't protected.  And, of course, you'd have to 
worry about buffers in all manner of peripherals (network interfaces, disk 
drives, etc.)

In practice, you're more likely to get errors from electrical noise, etc., 
than from radiation effects.

BTW, an interesting exercise is to look at the distribution of detected 
errors, and see if it matches what should be happening for random 
radiation-induced events.  Some 25 years ago, I worked with a system that 
had error-corrected RAM and it would throw double-bit errors (DBEs) every 
few days, triggering a system halt. Since we also collected statistics on 
detected single-bit errors (SBEs, which were corrected), it was a fairly 
simple matter to show that the DBE rate wasn't consistent with the SBE rate 
(to first order, you'd expect Pdbe = Psbe^2, since a random DBE takes two 
independent hits in the same word), and hence, we determined that the real 
problem was a timing margin on the bus.
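
A rough sketch of that consistency check, using the first-order rule 
Pdbe = Psbe^2 from above. Everything here is an assumption made for 
illustration: the function names, the idea of counting per-word, 
per-interval probabilities (an "interval" being however long data sits 
before it is read or scrubbed), and the use of a Poisson tail to judge how 
surprising the observed DBE count is.

```python
import math


def expected_dbe(sbe_count, n_words, n_intervals):
    """Expected number of DBEs if errors are independent random hits.

    sbe_count   -- corrected single-bit errors logged over the whole period
    n_words     -- number of protected memory words
    n_intervals -- number of exposure intervals in the period (assumed)
    """
    trials = n_words * n_intervals
    p_sbe = sbe_count / trials   # per-word, per-interval SBE probability
    p_dbe = p_sbe ** 2           # first-order estimate: Pdbe = Psbe^2
    return p_dbe * trials


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam): chance of k or more events."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                for i in range(k))
    return max(0.0, 1.0 - below)


def dbe_is_plausible(sbe_count, dbe_count, n_words, n_intervals,
                     threshold=1e-3):
    """True if the observed DBE count is not wildly above the estimate."""
    lam = expected_dbe(sbe_count, n_words, n_intervals)
    return poisson_tail(dbe_count, lam) >= threshold
```

With made-up numbers, say 1000 logged SBEs over a million words and 1000 
intervals, the random-hit model predicts about 0.001 DBEs for the period, 
so seeing a DBE every few days would point at something systematic (a 
timing margin, say) rather than radiation.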

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875




