[Beowulf] ECC Memory and Job Failures
Nifty Tom Mitchell niftyompi at niftyegg.com
Thu Apr 23 15:35:37 PDT 2009
On Thu, Apr 23, 2009 at 04:45:08PM +0100, Huw Lynes wrote:
>
> Thought this might be of interest to others:
>
> http://blog.revolution-computing.com/2009/04/blame-it-on-cosmic-rays.html
>
> Apparently someone ran a large cluster job with both ECC and none-ECC
> RAM. They consistently got the wrong answer when foregoing ECC.
>
> I'd love to see the original data.

Not unexpected, and yes... data please.

What if disabling ECC changes the data path timing and uncovers a hardware race condition that is unrelated to cosmic ray bit flipping? Test with STREAM benchmarks etc. If cosmic rays are the cause, bit error rates should track altitude.

While cosmic ray bit flipping is real, it is only one data integrity issue to cope with in system design. Does disabling ECC enable some other form of error detection like parity, or is the RAM running bare? Does the ECC hardware log errors even in disabled mode (it might)? In some cases disabling ECC causes the RAM to be accessed faster, causing more heat, causing timing changes...

Years ago SGI ran into this when the cache line coherency model changed on one desktop box. While today's RAM technology is very different, it is interesting to note that back then a parity error might be expected once in about 22 days on a 96MB RAM system on those old boxes (as best I can recall). The memory design made it very easy to count the errors and very hard to not count them. The last part is important, btw. The vast majority were seen only by the kernel in "bzero(), bcopy()", where they could be safely dealt with once the issue was understood. Other recovery tricks dealt with more, but not all, errors... some applications would be killed when recovery was impossible. To my knowledge that was the last system SGI designed from scratch that only had parity error detection on main memory.

I suspect the same number (one flipped bit in 22 days) could be used as an initial assumption for any block of 6 or 8 DIMMs, as the cross section of the "detector" is about the same (i.e. square mm of Si). I suspect that good data is also very HARD to come by.

IMO running on a large cluster without multiple-bit detection and a minimum of one-bit correction ECC is silly. Further, running without watching the ECC logs is also silly.

Watching the logs can be hard to do. ECC codes for wide cache lines today are very good, and a bad component may go undetected for some time. Some memory controllers will correct single-bit errors without inserting a delay or posting a machine check exception. Of interest: a hardware trainer at SGI was mystified when he cut the leg on a memory chip and it did not produce the error that he expected on an Origin.

DMA data paths, cache, and even paths internal to the processor and IO should be protected.

When I first heard of this, 64k DRAM was the new thing (c. 1984, perhaps sooner) and IBM, with the IBM PC, was in the middle of it. Then it was cosmic rays; today, Google search for neutrons and flipped bits. There was one distraction associated with uranium-contaminated ceramic packages back then too.

My guess is S(*t happens, bits flip.

Later,
mitch

--
T o m  M i t c h e l l
Found me a new hat, now what?
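To put the "one flipped bit in 22 days" starting assumption in cluster terms, here is a rough sketch of the arithmetic. It assumes (this is not in the message) that flips arrive independently at a constant rate per node, so the count over a job is roughly Poisson. The 22-day figure is the recalled number from above, not measured data, and the function names are made up for illustration.

```python
import math

# Recalled figure from the message: roughly one flipped bit per 22 days
# per node (one block of 6 or 8 DIMMs). An assumption, not measured data.
DAYS_PER_FLIP_PER_NODE = 22.0


def expected_flips(nodes, job_hours, days_per_flip=DAYS_PER_FLIP_PER_NODE):
    """Expected number of bit flips across all nodes during one job."""
    node_days = nodes * (job_hours / 24.0)
    return node_days / days_per_flip


def prob_at_least_one_flip(nodes, job_hours, days_per_flip=DAYS_PER_FLIP_PER_NODE):
    """Chance of one or more flips, assuming independent (Poisson) arrivals."""
    return 1.0 - math.exp(-expected_flips(nodes, job_hours, days_per_flip))


if __name__ == "__main__":
    # Hypothetical cluster sizes and a 24-hour job, for illustration only.
    for nodes in (1, 64, 1024):
        flips = expected_flips(nodes, 24)
        prob = prob_at_least_one_flip(nodes, 24)
        print(f"{nodes:5d} nodes: expected flips {flips:8.2f}, P(>=1) {prob:.3f}")
```

Under these assumptions the expected count grows linearly with node count and job length, which is why a rate that looks harmless on one desktop becomes a near certainty on a large cluster without correction.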
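On watching the ECC logs: one possible way to do it on Linux is to poll the kernel's EDAC sysfs counters. The sketch below assumes an EDAC driver is loaded for the memory controller and that the counters appear as `ce_count` (corrected) and `ue_count` (uncorrected) under `/sys/devices/system/edac/mc/mc*/`. None of this comes from the message itself, and the helper names are hypothetical.

```python
import glob
import os

# Assumed location of the Linux EDAC memory-controller counters.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}."""
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc_dir, name)) as f:
                    entry[name] = int(f.read().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[os.path.basename(mc_dir)] = entry
    return counts


def new_errors(previous, current):
    """Return {(controller, counter): increase} for counters that went up."""
    changes = {}
    for mc, now in current.items():
        before = previous.get(mc, {})
        for name, value in now.items():
            old = before.get(name)
            if value is not None and old is not None and value > old:
                changes[(mc, name)] = value - old
    return changes


if __name__ == "__main__":
    # An empty result means no EDAC controllers were found on this machine.
    print(read_edac_counts())
```

Calling `read_edac_counts()` periodically and passing successive snapshots to `new_errors()` gives a per-controller count of fresh corrected and uncorrected errors. A steadily climbing `ce_count` on one node is the kind of bad component that good ECC would otherwise hide.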
