[Beowulf] Acceptable rad limits for cluster rooms?

Jim Lux James.P.Lux at jpl.nasa.gov
Mon Jun 19 06:02:45 PDT 2006


At 02:21 PM 6/16/2006, Brian Oborn wrote:
>The cluster for our Physics department is next to a room that, at the time 
>of installation, was an empty accelerator hall. However, a new electron 
>accelerator has been installed and the cluster room is now a mild 
>radiation area. Before we start considering shielding options, I was 
>wondering if anybody on this list could offer insight into "acceptable" 
>radiation limits for normal, rackmount cluster nodes with ECC memory, and 
>if there are energy thresholds in beta and gamma radiation that might be 
>significant.

You've got a couple issues to worry about.

First off, what is "mild"?  I would think that if you're in a 5 mrem/yr 
kind of environment (about 0.05 mSv/yr in SI units), you probably don't 
have much to worry about.  That is, if people can be in there, then the 
electronics can probably tolerate it.

OK, now to the gory details.

1) Total dose effects - Most ICs suffer some sort of change in properties 
with dose.  Optoelectronics is notorious for displacement damage 
effects.  However, the doses where this kind of thing becomes an issue are 
up in the kilorad range.  There's also a particularly annoying property 
called "Enhanced Low Dose Rate Sensitivity" (ELDRS), which essentially 
means that for some parts, the cumulative effects of a low dose rate over 
a long time are greater than those of getting the whole dose all at once.  
This, of course, makes testing a bit tricky, since you want to zap your 
parts in the test fixture all at once.
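
To put the kilorad figure in perspective, here is a minimal sketch of the 
arithmetic.  It assumes 1 rem can be treated as 1 rad, which is reasonable 
for beta and gamma but glosses over tissue-versus-silicon differences, and 
it ignores dose rate effects entirely.  The function name is made up for 
illustration.

```python
# Rough estimate of how long a steady dose rate takes to reach a total dose.
# Assumption: 1 rem is treated as 1 rad, which is reasonable for beta and
# gamma (weighting factor of 1) but ignores tissue-vs-silicon differences.

MREM_PER_RAD = 1000.0  # under the 1 rem ~ 1 rad assumption above


def years_to_total_dose(total_dose_rad, rate_mrem_per_year):
    """Years of continuous exposure needed to accumulate total_dose_rad."""
    if rate_mrem_per_year <= 0:
        raise ValueError("dose rate must be positive")
    return total_dose_rad * MREM_PER_RAD / rate_mrem_per_year


if __name__ == "__main__":
    # 1 krad at the 5 mrem/yr figure mentioned earlier in this message.
    print(years_to_total_dose(1000.0, 5.0))
```

At 5 mrem/yr a kilorad is on the order of 200,000 years away, and even at 
a thousand times that rate it is still a couple of centuries, so total dose 
is unlikely to be the limiting factor in a room people can occupy.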

2) Single Event Effects - There are a lot of flavors of this, upsets (bit 
flips) being but one.  There's also "latchup" and "single event gate 
rupture", etc.  These all happen because some charged particle hits the IC 
and deposits charge in the structure, causing some sort of trouble.  (A 
high energy photon can also ionize the silicon as it passes through.)  It 
might be just changing the state of a stored bit, but it can also be 
something as catastrophic as upsetting the relative biasing of P and N 
layers, causing large currents to flow where normally they wouldn't.  In 
this world the usual specification is whether the part is immune at a 
Linear Energy Transfer (LET) of X MeV-cm^2/mg, where X is a number greater 
than some tens.

ECC RAM is a way to mitigate just one kind of SEE, the upset (SEU) in a 
memory area, where a) it's easy to do, and b) there's lots of target area 
to get hit, and c) the data sits there a long time. Bear in mind, though, 
that the processor itself is probably pretty susceptible to SEU, and 
doesn't have ECC internally.  Nor are the data and address buses 
protected.  The same applies to most of the peripheral chips.
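
To illustrate how ECC masks a single upset, here is a toy sketch using a 
Hamming(7,4) code.  This is an assumption chosen for brevity: real ECC 
DIMMs typically use a wider code over 64-bit words that can also detect 
double errors, and the function names here are hypothetical.

```python
# Toy Hamming(7,4) code: 4 data bits protected by 3 parity bits.
# Bit positions are numbered 1..7; parity bits sit at positions 1, 2 and 4.


def hamming74_encode(data_bits):
    """Encode a list of 4 bits into a list of 7 code bits."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code_bits):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(code_bits)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome points at the flipped bit
    if position:
        c[position - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    stored = hamming74_encode([1, 0, 1, 1])
    stored[4] ^= 1  # simulate an upset in one stored bit (position 5)
    print(hamming74_decode(stored))
```

Any single flipped bit in the 7-bit word is located and corrected.  Two 
flips in the same word would be miscorrected by this toy code, which is 
one reason real memory ECC adds an extra parity bit.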

DRAM is considered particularly susceptible, because the storage mechanism 
is a tiny amount of charge, and there isn't a huge amount of margin in the 
decision about whether it's a one or zero.  Compare this to a big old 
flipflop in the processor, or a static RAM cell, where the information 
storage mechanism isn't a packet of charge, but is something like a couple 
of cross-coupled transistors, one On and the other Off, with huge energy 
margins.  There's also some historical data: there were some notorious DRAM 
packages where the plastic itself was slightly radioactive; DRAM has always 
skated near the edge of functioning because they push the density; and some 
of the early studies of radiation effects on computers focussed on DRAMs, 
because they were sensitive, and it was easy to tell if there was a 
problem. (i.e. you can read the data back and see if it's changed)
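
The read-back check can be sketched as follows.  This only shows the 
comparison step on byte buffers; the function name is made up, and a real 
test would write and read the pattern through actual memory under 
irradiation rather than flipping a bit by hand.

```python
# Count bit flips between the pattern written and the data read back.


def count_bit_flips(written, read_back):
    """Number of bit positions that differ between two byte sequences."""
    if len(written) != len(read_back):
        raise ValueError("buffers must be the same length")
    return sum(bin(a ^ b).count("1") for a, b in zip(written, read_back))


if __name__ == "__main__":
    pattern = bytes([0x55, 0xAA] * 4)  # alternating-bit test pattern
    read_back = bytearray(pattern)
    read_back[3] ^= 0x10               # simulate a single upset
    print(count_bit_flips(pattern, bytes(read_back)))
```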




James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875




