[Beowulf] Strange hardware? problems

Orion Poplawski orion at cora.nwra.com
Mon Apr 30 14:50:39 PDT 2007

Mark Hahn wrote:
> they blow up consistently on multiple 244's?  might all the 244's have 
> the same potentially flawed cooling/heatsink-compound/powersupply/etc?

Well, they are identical, so yes.

> are you saying there's something _similar_ between the 244s and the 275s?

Nope, but both 244 machines behave the same as the two 275s.  So I don't 
really suspect a problem with a particular memory chip on a particular 
machine for example.

> is your ram ECC (and enabled as such in bios, preferably with scrub 
> enabled)?

Yup, w/scrubbing.

> if ECC, have you run mcelog?

Yup.  Nothing reported.

>> We've tried FC4/5 on the 244 machines.  At one point all were running 
>> identical FC5 installs with the same problems.
> why do you think the problem is software?

I'm very confused at this point.

>> Problem is not exactly reproducible unfortunately.  It will crash at 
>> different times in the simulations, but they will crash at some point 
>> with the length of runs we are doing.
> sounds like heat/power to me.

Possible, but the machines are actually fairly lightly used with our 
model tests - just one processor.  sensors reports CPU temps <= 40C.

>> Are there any cpu tests out there that would check the accuracy of
>> various calculations?
> you don't mean accuracy, do you?  like some subtle problem with 
> low-order bits?
> I tend to use HPL for this kind of test - it's not too hard to tune it to 
> use however much memory you want, and to run for as long as you want.  
> it's not the ultimate memory-grinder, but it's pretty intense.
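
For reference, a rough sketch of how one might pick the HPL problem size for a given memory budget. HPL factors a dense N x N matrix of doubles, so the matrix needs about 8*N^2 bytes. The function name, the 80% memory fraction, and the block size of 128 are illustrative assumptions here, not values from this thread or from any HPL.dat on these machines.

```python
import math


def hpl_problem_size(mem_bytes: int, fraction: float = 0.8, nb: int = 128) -> int:
    """Largest N, rounded down to a multiple of nb, such that an
    N x N matrix of 8-byte doubles fits in `fraction` of mem_bytes.

    `fraction` and `nb` are assumed defaults for illustration only.
    """
    budget = int(mem_bytes * fraction)   # bytes we allow the matrix to use
    n = math.isqrt(budget // 8)          # 8 bytes per double, N*N entries
    return (n // nb) * nb                # align N to the block size


if __name__ == "__main__":
    # Example: a node with 8 GiB of RAM
    print(hpl_problem_size(8 * 1024**3))
```

A larger N makes the run both longer and more memory-hungry, so this is one knob for both.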

I do mean accuracy, and not necessarily subtle - things blow up badly. 
I'm after something that performs some set of calculations over and over 
and errors out if it doesn't give the expected result.
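
To make the idea concrete, here is a minimal sketch of that kind of test, not an existing tool. It runs a fixed, deterministic mix of floating-point and integer operations, hashes every intermediate value, and compares each pass against the digest of the first pass. The names workload and check, the choice of operations, and the pass counts are all hypothetical. Note that the reference digest is computed on the same machine, so a fault during the first pass would show up as mismatches on the later (correct) passes; comparing the digest against one from a known-good machine would be more robust.

```python
import hashlib
import math
import struct


def workload(n: int = 20000) -> bytes:
    """Run a fixed floating-point/integer workload and return a SHA-256
    digest over every intermediate value."""
    h = hashlib.sha256()
    x = 0.5
    acc = 0
    for i in range(1, n + 1):
        # Logistic map: chaotic, so one wrong bit changes all later values.
        x = 3.9 * x * (1.0 - x)
        y = math.sqrt(i) * math.sin(x) + math.exp(-x)
        # 64-bit integer multiply-add, masked to stay in 64 bits.
        acc = (acc * 6364136223846793005 + i) & 0xFFFFFFFFFFFFFFFF
        h.update(struct.pack("<ddQ", x, y, acc))
    return h.digest()


def check(passes: int = 10, n: int = 20000) -> int:
    """Return how many passes produced a digest different from the first."""
    expected = workload(n)
    mismatches = 0
    for p in range(passes):
        if workload(n) != expected:
            mismatches += 1
            print(f"pass {p}: MISMATCH")
    return mismatches


if __name__ == "__main__":
    bad = check()
    print("mismatching passes:", bad)
```

On healthy hardware every pass should match; for a real soak test one would raise passes and n so it runs for hours, and run one copy per core.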

