[Beowulf] Re: ECC Memory and Job Failures (Huw Lynes)

Joe Landman landman at scalableinformatics.com
Thu Apr 23 14:44:52 PDT 2009


Gerry Creager wrote:
> David Mathog wrote:
>> Huw Lynes <lynesh at cardiff.ac.uk> wrote:
>>
>>> http://blog.revolution-computing.com/2009/04/blame-it-on-cosmic-rays.html 
>>>
>>>
>>> Apparently someone ran a large cluster job with both ECC and non-ECC
>>> RAM. They consistently got the wrong answer when forgoing ECC.
>>
>> There were not very many details given.  I would not rule out the
>> possibility that the non-ECC memory was slightly faulty, and that the
>> observed errors had nothing to do with gamma rays at all.  A better test
>> would have been to use the same ECC memory for both tests, and to turn
>> ECC memory correction on and off in the BIOS.
> 
> Where's Jim Lux?  I'm sure he has an opinion on this, too...
> 
> Cosmic ray hits are, if I recall correctly, an improbable event at the 
> earth's surface on the order of 1/1e13 sec (but I'm doing this from 

Hmmm... one of the experiments done way back in the dusty days of my 
undergrad was a cosmic-ray-generated muon lifetime measurement, using 3 
large scintillators, some PMDs, and a little luck.  No computers were 
harmed (or used!) in these measurements.  LabVIEW wasn't even a glint in 
National Instruments' eyes then.

I am pretty sure we did this experiment on the surface (inside a large 
concrete building in fact, which may have altered the signal somewhat).

The atmosphere definitely attenuates the cosmic radiation background 
(and I seem to remember reading things about notch and other weird 
filter properties of the EM spectrum traversing the atmosphere ... all 
that absorption...)

> memory and IT may have taken a hit).  In spaceborne applications, 
> however, the potential for random high energy particle hits is 
> significantly higher.  And it's not just memory, although that tends to 
> be more susceptible.  CPUs are also at risk.  CMOS parts tend to 
> tolerate these events better than a lot of others, such as NMOS.  There are 
> a lot of old CPUs and memory designs for spaceflight even today.
> 
> I tend to buy the theory that there's something wrong with the non-ECC 
> components, rather than thinking there's a cosmic ray with your name on 
> it.
> 

Allow me to second this.  If I see memory throwing a huge number of 
ECC errors, I start by checking whether the DIMMs are seated properly. 
Reseating memory on one server is usually quick.  On more than one 
... not so quick.
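
For spotting which nodes are worth opening up, here is a minimal sketch, 
assuming a Linux node with an EDAC driver loaded so that the per-controller 
counters show up under /sys/devices/system/edac/mc.  The function names and 
the threshold of 100 corrected errors are made up for illustration; pick a 
threshold that matches what "huge" means on your hardware.

```python
from pathlib import Path

# Standard Linux EDAC sysfs location; only present when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller_name: (corrected_errors, uncorrected_errors)}."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Counter missing or unreadable: skip this controller.
            continue
        counts[mc.name] = (ce, ue)
    return counts


def suspicious(counts, ce_threshold=100):
    """List controllers with any uncorrected error, or with corrected
    errors at or above ce_threshold (an arbitrary, illustrative cutoff)."""
    return [name for name, (ce, ue) in counts.items()
            if ue > 0 or ce >= ce_threshold]


if __name__ == "__main__":
    counts = read_edac_counts()
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
    print("check DIMM seating on:", suspicious(counts) or "nothing obvious")
```

Run it across the cluster with whatever parallel shell you already use; a 
node that stands out from its neighbours is the one to reseat first.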


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


