[Beowulf] Re: ECC Memory and Job Failures (Huw Lynes)

Gerry Creager gerry.creager at tamu.edu
Thu Apr 23 14:16:45 PDT 2009


David Mathog wrote:
> Huw Lynes <lynesh at cardiff.ac.uk> wrote:
> 
>> http://blog.revolution-computing.com/2009/04/blame-it-on-cosmic-rays.html
>>
>> Apparently someone ran a large cluster job with both ECC and non-ECC
>> RAM. They consistently got the wrong answer when forgoing ECC.
> 
> There were not very many details given.  I would not rule out the
> possibility that the non-ECC memory was slightly faulty, and that the
> observed errors had nothing to do with gamma rays at all.  A better test
> would have been to use the same ECC memory for both tests, and to turn
> ECC memory correction on and off in the BIOS.

Where's Jim Lux?  I'm sure he has an opinion on this, too...

Cosmic ray hits are, if I recall correctly, an improbable event at the 
earth's surface on the order of 1/1e13 sec (but I'm doing this from 
memory and IT may have taken a hit).  In spaceborne applications, 
however, the potential for random high energy particle hits is 
significantly higher.  And it's not just memory, although that tends to 
be more susceptible.  CPUs are also at risk.  CMOS parts tend to 
tolerate these events better than a lot of others, such as NMOS.  A lot 
of old CPU and memory designs are still used for spaceflight even today.

I tend to buy the theory that there's something wrong with the non-ECC 
components, rather than thinking there's a cosmic ray with your name on it.

gerry
-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843


