[Beowulf] Acceptable rad limits for cluster rooms?
Jim Lux James.P.Lux at jpl.nasa.gov
Mon Jun 19 07:29:13 PDT 2006
At 06:32 AM 6/19/2006, Mark Hahn wrote:
> > OK, now to the gory details.
>
>isn't the "kind" of radiation also really important? IANA-physicist,
>but alphas are pretty inconsequential to metal-cased stuff, no?

Oh, sure... But that starts to get into a lot of tricky issues surrounding things like secondary emission (that is, a 1 GeV alpha may not penetrate the case, but when it stops, the odds of generating something new are pretty good). People do devote their entire lives to understanding radiation effects on electronic parts, and they're discovering new stuff every day, particularly as process geometries get smaller.

I'm not sure it's available at this link, but here is "An introduction to space radiation effects on microelectronics":
http://parts.jpl.nasa.gov/docs/JPL00-62.pdf

If that link doesn't work (or it's blocked for some reason), let me know and I'll find a way to get the document. (It's approved for public release, so there shouldn't be a problem.)

> > to get hit, and c) the data sits there a long time. Bear in mind, though,
> > that the processor itself is probably pretty susceptible to SEU, and
> > doesn't have ECC internally. Nor is the data and address bus
> > protected. The same applies to most of the peripheral chips.
>
>hmm, both AMD and Intel claim parity protection on internal buses;
>I don't know whether this is just a marketing checkoff...

Indeed, parity on internal buses is used, which then brings up the question of "parity" versus "EDAC" (Error Detection and Correction): what happens when an error is detected? Do you just interrupt? Do you fix it and go on? As soon as you start talking EDAC, you also have to worry about latencies (the EDAC usually takes a couple of clocks to deal with: one to do the parity check with the syndrome bits, the second to correct the errors).

Other interesting architectural questions... Is the Program Counter protected, or for that matter, the stack pointer and ALU registers?

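To make the parity-versus-EDAC distinction concrete, here is a minimal sketch using a textbook Hamming(7,4) code. It is only an illustration: the function names are made up for this example, and real ECC memory typically uses wider codes (for instance a 72-bit word protecting 64 data bits), not this toy code. The two steps in `hamming74_correct` loosely mirror the two clocks mentioned above: compute the syndrome, then flip the bit it points to.

```python
def parity_bit(bits):
    """Even-parity bit: detects a single flipped bit but cannot locate it."""
    return sum(bits) % 2


def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7 = p1 p2 d1 p3 d2 d3 d4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_syndrome(code):
    """Step 1: recompute the checks. 0 means no error seen; otherwise the
    value is the 1-based position of a single flipped bit."""
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s3 = code[3] ^ code[4] ^ code[5] ^ code[6]
    return s1 + 2 * s2 + 4 * s3


def hamming74_correct(code):
    """Step 2: flip the bit named by the syndrome. Returns (fixed_word, syndrome)."""
    syndrome = hamming74_syndrome(code)
    fixed = list(code)
    if syndrome:
        fixed[syndrome - 1] ^= 1
    return fixed, syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    hit = list(word)
    hit[4] ^= 1  # simulate a single-event upset in one bit
    # Plain parity only says "something changed":
    print("parity mismatch:", parity_bit(word) != parity_bit(hit))
    # The syndrome says where, so the word can be repaired:
    print("corrected:", hamming74_correct(hit))
```

Note that this toy code silently miscorrects a double-bit error; detecting (but not correcting) double errors needs an extra overall parity bit, which is the usual SECDED arrangement.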

I was thinking more along the lines of the off-chip interfaces, e.g., the PCI bus, which generally isn't protected. And, of course, you'd have to worry about buffers in all manner of peripherals (network interfaces, disk drives, etc.).

In practice, you're more likely to get errors from electrical noise, etc., than from radiation effects.

BTW, an interesting exercise is to look at the distribution of detected errors, and see if it matches what should be happening for random radiation-induced events. Some 25 years ago, I worked with a system that had error-corrected RAM, and it would throw double-bit errors every few days, triggering a system halt. Since we also collected statistics on detected single-bit errors (which were corrected), it was a fairly simple matter to show that the double-bit error (DBE) rate wasn't consistent with the single-bit error (SBE) rate (to a first order, you expected Pdbe = Psbe^2), and hence, we determined that the real problem was a timing margin on the bus.

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875
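The SBE/DBE consistency check described in the message can be sketched roughly as follows. Everything here is an assumption made for illustration: the counts are invented, the "word per scrub interval" trial model is one simple way to turn counts into probabilities, and the function names are hypothetical. The only thing taken from the message is the first-order expectation Pdbe = Psbe^2 for independent random upsets.

```python
import math


def expected_dbe_count(n_sbe, n_words, n_intervals):
    """Expected number of double-bit errors if upsets are independent.

    Assumed model: each (word, interval) pair is one trial, so
    Psbe = n_sbe / trials and, to first order, Pdbe = Psbe ** 2.
    """
    trials = n_words * n_intervals
    p_sbe = n_sbe / trials
    p_dbe = p_sbe ** 2
    return p_dbe * trials


def poisson_tail(k, mean, terms=200):
    """Approximate P(X >= k) for X ~ Poisson(mean), summing `terms` terms
    of the tail in log space (adequate when mean is small)."""
    if k <= 0:
        return 1.0
    if mean <= 0.0:
        return 0.0
    total = 0.0
    for i in range(k, k + terms):
        total += math.exp(-mean + i * math.log(mean) - math.lgamma(i + 1))
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the system in the message:
    n_sbe = 1000         # corrected single-bit errors logged in 30 days
    n_words = 2 ** 20    # protected memory words
    n_intervals = 720    # hourly windows in 30 days
    observed_dbe = 10    # one double-bit error every few days

    mean = expected_dbe_count(n_sbe, n_words, n_intervals)
    print("expected DBEs if random:", mean)
    print("chance of seeing this many or more:", poisson_tail(observed_dbe, mean))
```

If the tail probability comes out vanishingly small, the observed double-bit errors are not plausibly the square of the single-bit rate, which points toward a systematic cause such as the bus timing margin mentioned in the message.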
