[Beowulf] Re: ECC Memory and Job Failures (Huw Lynes)
landman at scalableinformatics.com
Thu Apr 23 14:44:52 PDT 2009
Gerry Creager wrote:
> David Mathog wrote:
>> Huw Lynes <lynesh at cardiff.ac.uk> wrote:
>>> Apparently someone ran a large cluster job with both ECC and non-ECC
>>> RAM. They consistently got the wrong answer when foregoing ECC.
>> There were not very many details given. I would not rule out the
>> possibility that the non-ECC memory was slightly faulty, and that the
>> observed errors had nothing to do with gamma rays at all. A better test
>> would have been to use the same ECC memory for both tests, and to turn
>> ECC memory correction on and off in the BIOS.
> Where's Jim Lux? I'm sure he's an opinion on this, too...
> Cosmic ray hits are, if I recall correctly, an improbable event at the
> earth's surface on the order of 1/1e13 sec (but I'm doing this from
Hmmm... one of the experiments done way back in the dusty days of my
undergrad was a cosmic-ray-generated muon lifetime measurement, using 3
large scintillators, some PMTs, and a little luck. No computers were
harmed (or used!) in these measurements. LabVIEW wasn't even a glint in
National Instruments' eyes then.
I am pretty sure we did this experiment on the surface (inside a large
concrete building in fact, which may have altered the signal somewhat).
The atmosphere definitely attenuates the cosmic radiation background
(and I seem to remember reading things about notch and other weird
filter properties of the EM spectrum traversing the atmosphere ... all
> memory and IT may have taken a hit). In spaceborne applications,
> however, the potential for random high energy particle hits is
> significantly higher. And it's not just memory, although that tends to
> be more susceptible. CPUs are also at risk. CMOS parts tend to
> tolerate these events better than a lot of others than NMOS. There are
> a lot of old CPUs and memory designs for spaceflight even today.
> I tend to buy the theory that there's something wrong with the non-ECC
> components, rather than thinking there's a cosmic ray with your name on
Allow me to second this. If I see memory showing a huge number of
ECC errors, I start by checking whether the DIMMs are seated properly.
Reseating memory (on one server) is usually quick. On more than one
... not so quick.
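
For anyone who wants to spot the "huge number of ECC errors" case before pulling DIMMs, here is a minimal sketch that reads the corrected (ce_count) and uncorrected (ue_count) error counters the Linux EDAC driver exposes under /sys/devices/system/edac/mc. It assumes a Linux box with an EDAC driver loaded for the memory controller. The function names and the threshold of 100 corrected errors are made up for illustration, not a recommended value.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} per memory controller."""
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        # No EDAC driver loaded (or not Linux): nothing to report.
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


def suspicious(counts, ce_threshold=100):
    """List controllers with many corrected errors or any uncorrected error.

    ce_threshold is an arbitrary illustrative cutoff (an assumption).
    """
    return [
        name
        for name, c in counts.items()
        if (c["ce"] or 0) > ce_threshold or (c["ue"] or 0) > 0
    ]


if __name__ == "__main__":
    counts = read_edac_counts()
    for name, c in counts.items():
        print(name, c)
    print("check DIMM seating on:", suspicious(counts))
```

Run it across the nodes with whatever parallel shell you already use. A controller that stands out from its neighbours is the one whose DIMMs I would reseat first.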
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615