[Beowulf] Servers Too Hot? Intel Recommends a Luxurious Oil Bath

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Wed Sep 5 11:29:06 PDT 2012

Yes.. this is something that has been researched and tested in the laboratory.  I don't know that anyone has actually tried reconfiguring around a damaged piece of an FPGA, if for no other reason than permanent damage in a reconfigurable FPGA is extremely unusual (and probably hasn't ever occurred).  There are soft upsets in the configuration memory, and the Virtex and Virtex II have a potential failure mode where an upset in just the wrong place could cause damage (having two logic element outputs fighting each other), but it's very unlikely.

There's a fair amount of test data on radiation behavior (klabs.org or MAPLD are places to look).  I'm not sure there's a failure mechanism (with high enough probability) that causes a hard failure of just some gates. (These parts are typically latchup-immune, for instance).  I suppose some sufficiently high energy particle could damage a few gates permanently.  You'd need very high Linear Energy Transfer, though.   There's a paper by Fuller, et al, out there where they zapped a Virtex with 2068 MeV Au ions, looking to see if latchup could be observed at any LET below 125 MeV-cm^2/mg  (this is the upper bound for galactic cosmic rays).  No latchup detected.  They did see an increase in current, but it's because of the configuration upsets causing internal logic contention, and went away when the device was reconfigured.  (fluence was 1E7-1E8 ions/cm^2, which is HUGE compared to what you see in real life.  There were some changes in current that stuck around for a few hours, but gradually annealed away)

As far as upsets go, typical predicted upset rates are on the order of 2 upsets/device day in LEO up to 5.9 upsets/device day in GEO.  With flare enhancement, it's more like 21 upsets/device day for LEO and 81.5 for GEO.  (Of course, life is better than this.. in most designs, the vast majority of configuration bits are "don't care", so you wouldn't see the upset.  A typical multiplier is 4:1; that is, half an upset/device day for LEO.)  (All these are for the XQVR300.)
(another source reports a cross section for proton SEU of 5E-13 cm^2/bit.. the device has, say, 6E6 bits, so you can figure out what kind of proton flux you need to get a given upset rate)
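
To make the arithmetic in the last two paragraphs concrete, here is a rough Python sketch. The rates, the 4:1 factor, the 5E-13 cm^2/bit cross section and the 6E6 bit count are the figures quoted above; the function and variable names are made up for illustration, and treating the 4:1 multiplier as a simple divide-by-four derating is an assumption about how it is meant to be applied.

```python
# Predicted configuration upsets per device-day, as quoted in the text.
PREDICTED_UPSETS_PER_DAY = {
    ("LEO", "quiet"): 2.0,
    ("GEO", "quiet"): 5.9,
    ("LEO", "flare"): 21.0,
    ("GEO", "flare"): 81.5,
}

# The "4:1" multiplier from the text, read here (an assumption) as:
# only about 1 upset in 4 lands on a bit the design actually uses.
DERATE = 4.0

# Proton SEU cross section and approximate bit count from the text.
PROTON_XSEC_CM2_PER_BIT = 5e-13
CONFIG_BITS = 6e6


def observable_upsets_per_day(orbit, condition, derate=DERATE):
    """Upsets per device-day that would actually disturb the design."""
    return PREDICTED_UPSETS_PER_DAY[(orbit, condition)] / derate


def device_cross_section_cm2():
    """Whole-device proton cross section: per-bit value times bit count."""
    return PROTON_XSEC_CM2_PER_BIT * CONFIG_BITS


def proton_flux_for_rate(upsets_per_day):
    """Proton flux (protons/cm^2/day) needed to produce a given upset rate."""
    return upsets_per_day / device_cross_section_cm2()


if __name__ == "__main__":
    print(observable_upsets_per_day("LEO", "quiet"))  # upsets/device-day
    print(device_cross_section_cm2())                 # cm^2 per device
    print(proton_flux_for_rate(2.0))                  # protons/cm^2/day
```

With these numbers the device cross section comes out at about 3E-6 cm^2, so 2 upsets/device day corresponds to a proton flux of roughly 7E5 protons/cm^2/day.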

And, of course, now there's a rad hardened Virtex 5 available (you too can own one for about $80k/copy).. 1Mrad(Si) total dose, config mem upset rate in GEO 3.8E-10 errors/bit/day. Single Event Functional Interrupt (SEFI) of configuration control logic (this would prevent you from reconfiguring on the fly) in GEO is once every 10,000 years.
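
For a feel of what those Virtex 5 numbers mean per device, here is a small Python sketch. The 3.8E-10 errors/bit/day and the once-per-10,000-years SEFI figure are from the paragraph above. The configuration bit count is NOT given there, so it is left as a parameter and the value in the usage lines is only a placeholder. Treating SEFIs as a Poisson process is also an assumption, and the function names are hypothetical.

```python
import math

# Figures quoted in the text for the rad-hardened Virtex 5 in GEO.
CONFIG_UPSET_RATE_PER_BIT_DAY = 3.8e-10
SEFI_MEAN_INTERVAL_YEARS = 10_000.0


def device_upsets_per_day(config_bits):
    """Expected configuration upsets per device-day for a given bit count."""
    return CONFIG_UPSET_RATE_PER_BIT_DAY * config_bits


def mean_days_between_upsets(config_bits):
    """Average number of days between configuration upsets."""
    return 1.0 / device_upsets_per_day(config_bits)


def mission_sefi_probability(mission_years):
    """Chance of at least one SEFI during the mission.

    Assumes SEFIs arrive as a Poisson process with the quoted mean interval.
    """
    return 1.0 - math.exp(-mission_years / SEFI_MEAN_INTERVAL_YEARS)


if __name__ == "__main__":
    # Placeholder bit count for illustration only; substitute the real
    # figure from the device data sheet.
    assumed_bits = 3e7
    print(device_upsets_per_day(assumed_bits))
    print(mean_days_between_upsets(assumed_bits))
    print(mission_sefi_probability(15.0))
```

Whatever the real bit count is, the per-bit rate is low enough that a device of a few tens of megabits would see a configuration upset only every couple of months or so, and the SEFI probability over a 15 year mission is on the order of 0.15%.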

So it's not really clear that you NEED to be able to reconfigure around damage..

We've only been flying Xilinx Virtex parts for long durations since 2005 (Mars Reconnaissance Orbiter).  (There might be some earlier experiments.. CANDOS used a couple of Virtex II parts on Shuttle, but that was a 2-week flight in 2003 and they only operated for 10s of hours.)  We do periodic scrubbing/reloading of the configuration memory, and I'm not sure we even know if there was a transient upset (that is, we don't read it back, we just rewrite, blindly).  There are some DoD comm payloads that use Virtex parts, and their mitigation strategy for configuration upsets is to have two devices and ping pong between them.. while chip 1 is being configured, use chip 2; when done, flip, reconfigure chip 2 and use chip 1.
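
The ping pong scheme can be sketched as a toy model in Python. This is only an illustration of the scheduling idea described above, not anyone's flight software: the class and method names are invented, and a "blind rewrite" is modeled simply as clearing an upset flag without reading it back first.

```python
class PingPongPair:
    """Two FPGAs: one carries traffic while the other is being reloaded."""

    def __init__(self):
        self.active = 0               # index of the chip currently in use
        self.upset = [False, False]   # True if a chip holds a corrupted bit
        self.reloads = [0, 0]         # how many times each chip was rewritten

    def inject_upset(self, chip):
        """Model a configuration upset landing on one chip."""
        self.upset[chip] = True

    def cycle(self):
        """Rewrite the standby chip blindly, then swap roles."""
        standby = 1 - self.active
        # Blind rewrite: no readback, the configuration is simply reloaded,
        # which clears any upset whether or not one was present.
        self.upset[standby] = False
        self.reloads[standby] += 1
        # Flip: the freshly loaded chip takes over the traffic.
        self.active = standby
        return self.active


if __name__ == "__main__":
    pair = PingPongPair()
    pair.inject_upset(0)   # upset on the chip that is currently in use
    pair.cycle()           # chip 1 reloaded and takes over
    pair.cycle()           # chip 0 reloaded (upset cleared) and takes over
    print(pair.active, pair.upset, pair.reloads)
```

The point of the model is that an upset on the active chip lives for at most one cycle: the chip is swapped out at the next flip and rewritten on the cycle after that, without anyone needing to detect the upset.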

When all is said and done, reconfiguring to get around a human coding error is actually much more likely.

Jim Lux

From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Nathan Moore
Sent: Wednesday, September 05, 2012 8:24 AM
To: beowulf at beowulf.org
Subject: Re: [Beowulf] Servers Too Hot? Intel Recommends a Luxurious Oil Bath

> On Tue, 4 Sep 2012, Ellis H. Wilson III wrote:
Which is why I was suggesting that, "Maybe the whole thing is just
built, sealed for good, primed with [hydrogen/oil/he/whatever], started,
allowed to slowly degrade over time and finally tossed when the still
working equipment is sufficiently dated."

I remember an "ancient" IBM technical article about the BlueGene, here: http://researcher.watson.ibm.com/researcher/files/us-ajayr/SysJ_BlueGene.pdf

In the work (or maybe it was a closely related paper), the authors make the point that as core count increases and feature size decreases, CPU units will have to be fault tolerant, e.g. if cosmic rays have toasted 10% of your chip's cores, it should still be able to function.  Related, this is one of the great beauties of FPGAs.  Jim Lux can probably tell us if this would be real, but it would seem to make sense to program a space probe (i.e. Voyager type) with an FPGA emulated CPU for the sake of damage survivability.  In the worst case that the probe encounters something unpleasant and part of the FPGA is damaged, perhaps the rest of the LUTs in the FPGA could be reprogrammed to produce a less powerful, yet still functional, controller.  This would take the "field programmable" aspect of the device to a new height...
