[Beowulf] Acceptable rad limits for cluster rooms?
Jim Lux
James.P.Lux at jpl.nasa.gov
Mon Jun 19 06:02:45 PDT 2006
At 02:21 PM 6/16/2006, Brian Oborn wrote:
>The cluster for our Physics department is next to a room that, at the time
>of installation, was an empty accelerator hall. However, a new electron
>accelerator has been installed and the cluster room is now a mild
>radiation area. Before we start considering shielding options, I was
>wondering if anybody on this list could offer insight into "acceptable"
>radiation limits for normal, rackmount cluster nodes with ECC memory, and
>if there are energy thresholds in beta and gamma radiation that might be
>significant.
You've got a couple issues to worry about.
First off, what is "mild"? I would think that if you're in a 5 mrem/yr
kind of environment (0.05 mSv/yr in SI units), you probably don't
have much to worry about. That is, if people can be in there, then the
electronics can probably tolerate it.
OK, now to the gory details.
1) Total dose effects - Most ICs suffer some sort of change in properties
with dose. Optoelectronics are notorious for displacement damage
effects. However, the doses where this kind of thing becomes an issue are
up in the kilorad range. There's also a particularly annoying property
called "Enhanced Low Dose Rate Sensitivity" (ELDRS), which essentially
means that for some parts, the cumulative effects of a low dose rate over
a long time are greater than those of getting the whole dose all at once.
This, of course, makes testing a bit tricky, since you want to zap your
parts in the test fixture all at once.
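To put the kilorad figure in perspective, here is a rough back-of-the-envelope
sketch in Python. It assumes beta/gamma radiation only (quality factor of 1,
so 1 rem is about 1 rad) and continuous exposure; the dose rates in the loop
are made-up illustrative values, not measurements from anybody's room, and
the function names are just for this example.

```python
# Rough total-dose arithmetic. Assumes beta/gamma radiation, for which the
# quality factor is 1, so 1 rem of dose equivalent is about 1 rad absorbed.
HOURS_PER_YEAR = 24 * 365


def mrem_to_msv(mrem):
    """Convert millirem to millisievert (1 rem = 0.01 Sv)."""
    return mrem * 0.01


def years_to_reach(threshold_rad, dose_rate_mrem_per_hr):
    """Years of continuous exposure needed to accumulate threshold_rad."""
    rad_per_year = dose_rate_mrem_per_hr * 1e-3 * HOURS_PER_YEAR
    return threshold_rad / rad_per_year


if __name__ == "__main__":
    print("5 mrem =", mrem_to_msv(5), "mSv")
    # Hypothetical dose rates in mrem/hr, chosen only for illustration.
    for rate in (0.01, 1.0, 100.0):
        years = years_to_reach(1000.0, rate)
        print(f"{rate:7.2f} mrem/hr -> {years:10.1f} years to reach 1 krad")
```

At a steady 1 mrem/hr, for instance, this works out to roughly 114 years
to accumulate 1 krad. It says nothing about low-dose-rate sensitivity, which
is a property of the specific part.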
2) Single Event Effects - There are a lot of flavors of this, upsets (bit
flips) being but one. There's also "latchup" and "single event gate
rupture", etc. These all happen because some charged particle hits the IC
and deposits charge in the structure, causing some sort of trouble. (A high
energy photon could also ionize the silicon on its way through.) It might
be just changing the state of a stored bit, but it can also
be something as catastrophic as upsetting the relative biasing of P and N
layers, causing large currents to flow where normally they wouldn't. In
this world the usual specification is whether the part is immune at a
Linear Energy Transfer (LET) of X MeV-cm^2/mg, where X is a number greater
than some tens.
ECC RAM is a way to mitigate just one kind of SEE, the upset (SEU) in a
memory area, where a) it's easy to do, and b) there's lots of target area
to get hit, and c) the data sits there a long time. Bear in mind, though,
that the processor itself is probably pretty susceptible to SEU, and
doesn't have ECC internally. Nor are the data and address buses
protected. The same applies to most of the peripheral chips.
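For anyone who hasn't looked at how the correction actually works, here is a
toy sketch in Python using a Hamming(7,4) code. This is not what a real ECC
DIMM does bit for bit (those typically protect 64 data bits with 8 check bits
and can also detect double errors), but the idea is the same: a few parity
bits let you locate and flip back a single upset bit. The function names are
just for this example.

```python
def encode(d):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the 1-based position
    # of the flipped bit (or 0 if all parity checks pass).
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1  # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = [1, 0, 1, 1]
    stored = encode(word)
    stored[4] ^= 1  # simulate a single event upset in one stored bit
    print(decode(stored))
```

Note that this plain code silently miscorrects if two bits flip in the same
word, which is why real memory adds an extra parity bit to at least detect
that case.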
DRAM is considered particularly susceptible, because the storage mechanism
is a tiny amount of charge, and there isn't a huge amount of margin in the
decision about whether it's a one or zero. Compare this to a big old
flipflop in the processor, or a static RAM cell, where the information
storage mechanism isn't a packet of charge, but is something like a couple
cross coupled transistors, one On and the other Off, with huge energy
margins. There's also some history here: there were some notorious DRAM
packages where the plastic itself was slightly radioactive; DRAM has always
skated near the edge of functioning because the makers push the density; and
some of the early studies of radiation effects on computers focused on DRAMs,
because they were sensitive, and it was easy to tell if there was a
problem (i.e., you can read the data back and see if it's changed).
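The "read it back and see" test is about as simple as it sounds. Here is a
minimal Python sketch of the idea: fill a buffer with a known pattern, wait,
then count the bits that differ. Since there is no beam handy, the upsets are
simulated by a hypothetical inject_upsets() helper; the buffer size and
pattern are arbitrary choices for illustration.

```python
import random


def fill(size, pattern=0x55):
    """Return a buffer of `size` bytes, each holding the test pattern."""
    return bytearray([pattern]) * size


def inject_upsets(buf, n, rng):
    """Simulation only: flip n distinct, randomly chosen bits in place."""
    for bit in rng.sample(range(len(buf) * 8), n):
        buf[bit // 8] ^= 1 << (bit % 8)


def count_flips(buf, pattern=0x55):
    """Count the bits that no longer match the pattern written earlier."""
    return sum(bin(b ^ pattern).count("1") for b in buf)


if __name__ == "__main__":
    rng = random.Random(0)
    memory = fill(1024)
    inject_upsets(memory, 5, rng)
    print("bit flips seen on readback:", count_flips(memory))
```

On a machine with working ECC, single-bit errors get corrected before
software sees them, so a test like this would only show what the ECC
couldn't fix.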
James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875