[Beowulf] ECC support on motherboards?

Joe Landman landman at scalableinformatics.com
Tue May 13 15:45:45 PDT 2008


Perry E. Metzger wrote:
> Håkon Bugge <Hakon.Bugge at scali.com> writes:
>> It's even worse. On one mtbd, the BIOS had a menu for enabling ECC; I
>> did. But reading the register from the chipset revealed nothing was
>> actually enabled in the hardware. You have to be paranoid in this
>> business. This was a "bleeding edge" mtbd, with a low revision BIOS of
>> course. The fu being that a car manufacturer ran a cluster of these
>> for several months doing crash worthiness simulations ...
> 
> So another question is, how can you reliably test any of this stuff?
> It isn't like you can reliably induce single bit errors and see if the
> hardware catches them. (A special memory module that let you test

.... actually ... you can.  Run your code, and have it beat on RAM.  We 
do this.
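
To make "beat on RAM" concrete, here is a toy sketch of the idea, not the
workload we actually run.  It assumes Python 3 and a hypothetical helper
name (beat_on_ram); it fills a large buffer with alternating bit patterns
and counts words that do not read back as written.  It has no control over
physical addresses or caches, so treat it as an illustration only.

```python
import array


def beat_on_ram(n_bytes, passes=2):
    """Write alternating bit patterns to a buffer and count mismatches.

    Hypothetical helper for illustration; returns the number of 64-bit
    words that did not read back with the value that was written.
    """
    n_words = max(1, n_bytes // 8)
    patterns = (0x5555555555555555, 0xAAAAAAAAAAAAAAAA)
    mismatches = 0
    for p in range(passes):
        pattern = patterns[p % 2]
        # Write phase: allocate and fill n_words 64-bit words.
        buf = array.array("Q", [pattern]) * n_words
        # Read-back phase: every word should still equal the pattern.
        mismatches += n_words - buf.count(pattern)
    return mismatches
```

Calling beat_on_ram(1 << 30) would push about 1 GiB through memory per
pass.  On a machine with working ECC you would expect a return value of 0
even when a single-bit error occurs, because the hardware corrects it,
which is why the error counters matter more than the return value.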

Some folks use memtest* and variants, and it catches some basic errors.
But it doesn't exercise things the way the application does.  So we use
a number of GAMESS runs and other large-memory jobs.  That beats the heck
out of the unit.  If it starts tossing MCE errors, that is a very good
indication that there is a real memory issue.
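
One way to watch for this from user space, sketched under assumptions: on
Linux, if an EDAC driver for the chipset is loaded, the kernel exposes
per-memory-controller corrected and uncorrected error counts as ce_count
and ue_count under /sys/devices/system/edac/mc/.  The function names below
are hypothetical, and this is not necessarily how we collect the data.
The sketch snapshots the counters so you can compare before and after a
heavy run.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs."""
    base = Path(root)
    counts = {}
    if not base.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # controller without readable counters
        counts[mc.name] = (ce, ue)
    return counts


def new_errors(before, after):
    """Return controllers whose counters grew between two snapshots."""
    grown = {}
    for name, (ce, ue) in after.items():
        ce0, ue0 = before.get(name, (0, 0))
        if ce > ce0 or ue > ue0:
            grown[name] = (ce - ce0, ue - ue0)
    return grown
```

Take one snapshot before the run and one after, then pass both to
new_errors().  An empty result from read_edac_counts() is itself worth
noting: it can mean that nothing in the kernel is reporting ECC events
for that board.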

And, for those doubters, yes, we have caught errors with this that 
memtest* did not catch.  And yes, we could reliably reproduce them.

All our systems, regardless of their function, are run through these
tests specifically to try to force MCE errors.

> would be a wonderful thing, but I've never even heard of such a thing.)
> 
> I'm doing the planning for a new cluster and the whole thing is
> remarkably bothersome. You can't easily figure out what motherboards
> will even pretend to do ECC that easily, you can't easily check once
> you have a sample motherboard in hand. It isn't even easy to get ECC
> memory for more modern standards. I'm starting to wonder if doing all
> calculations twice, once on each of two machines, isn't easier, but it
> seems utterly wrong to do that...

Hmmm.... sounds to me like you probably need to work with groups that 
have done this and do this for a living (deliver working systems to 
customers, and help them figure out what they need to do).  Bug Don 
Becker and his team (Penguin), and a bunch of others hanging around here 
(and us if you like).

It actually is not hard to build a system with ECC capability.  Most
vendors, the vast majority of them, leave the BIOS at its default
settings and assume they are "good enough".  We don't normally advise that.

Greg L or someone spoke about scrubbing.  You can enable that.  It is
generally a good idea (we recommend it).  Yeah, it does eat memory
bandwidth.  And it does slow down access to RAM.  There is a cost for
every decision.
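
If you want to check what scrubbing is actually set to rather than trust
the BIOS menu, here is a hedged sketch.  It assumes a Linux EDAC driver
that exposes the sdram_scrub_rate attribute (in bytes per second) for each
memory controller; not every driver does.  The function name is
hypothetical.

```python
from pathlib import Path


def read_scrub_rates(root="/sys/devices/system/edac/mc"):
    """Return {controller: scrub rate in bytes/sec, or None if unknown}."""
    base = Path(root)
    rates = {}
    if not base.is_dir():
        return rates  # no EDAC driver loaded
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            rates[mc.name] = int((mc / "sdram_scrub_rate").read_text())
        except (OSError, ValueError):
            rates[mc.name] = None  # attribute missing or unsupported
    return rates
```

A value of None here only says the driver did not report a rate; it does
not prove that scrubbing is off in the hardware.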

> 
> Perry


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


