[Beowulf] Strange hardware? problems


Orion Poplawski orion at cora.nwra.com
Mon Apr 30 14:50:39 PDT 2007


Mark Hahn wrote:
> 
> they blow up consistently on multiple 244's?  might all the 244's have 
> the same potentially flawed cooling/heatsink-compound/powersupply/etc?

Well, they are identical, so yes.

> are you saying there's something _similar_ between the 244s and the 275s?

Nope, but both 244 machines behave the same as the two 275s.  So I don't 
really suspect a problem with a particular memory chip on a particular 
machine for example.

> is your ram ECC (and enabled as such in bios, preferably with scrub 
> enabled)?

Yup, w/scrubbing.

> if ECC, have you run mcelog?

Yup.  Nothing.

>> We've tried FC4/5 on the 244 machines.  At one point all were running 
>> identical FC5 installs with the same problems.
> 
> why do you think the problem is software?

I'm very confused at this point.

>> Problem is not exactly reproducible unfortunately.  It will crash at 
>> different times in the simulations, but they will crash at some point 
>> with the length of runs we are doing.
> 
> sounds like heat/power to me.

Possible, but the machines are actually fairly lightly used with our 
model tests - just one processor.  sensors reports CPU temps <= 40C

>> Are there any cpu tests out there that would check the accuracy of
>> various calculations?
> 
> you don't mean accuracy, do you?  like some subtle problem with 
> low-order bits?
> I tend to use HPL for this kind of test - it's not too hard to tune it to 
> use however much memory you want, and to run for as long as you want.  
> it's not the ultimate memory-grinder, but it's pretty intense.

I do mean accuracy, and not necessarily subtle - things blow up badly. 
I'm after a test that performs some set of calculations over and over and 
reports an error if it doesn't give the expected result.
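
A minimal sketch of that kind of repeat-and-compare check, in Python. Everything here is an assumption made for illustration, not an existing tool: the function names `workload` and `check_cpu`, the logistic-map arithmetic, and the iteration counts are all hypothetical. The reference result is computed on the same machine, so this can only detect runs that disagree with each other, not a CPU that is consistently wrong.

```python
import hashlib
import math
import struct


def workload(n=20000):
    """Run a fixed floating-point calculation and return a digest of every
    intermediate value, so a single flipped bit changes the result."""
    digest = hashlib.sha256()
    x = 0.5
    acc = 0.0
    for i in range(1, n + 1):
        # Logistic map in its chaotic regime: small errors grow quickly.
        x = 3.9 * x * (1.0 - x)
        acc += math.sqrt(i) * x
        digest.update(struct.pack("<d", acc))
    return digest.hexdigest()


def check_cpu(iterations=10, n=20000):
    """Repeat the workload and compare each run with the first one.

    Returns None if all runs agree, otherwise the index of the first
    run whose digest differs from the reference.
    """
    reference = workload(n)
    for i in range(1, iterations):
        if workload(n) != reference:
            return i
    return None


if __name__ == "__main__":
    bad = check_cpu()
    if bad is None:
        print("all runs agree")
    else:
        print("mismatch on run", bad)
```

To match a crash that only shows up after long simulations, the iteration count would need to be raised so the check runs for a comparable length of time, and one copy could be started per core.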

-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA/CoRA Division                    FAX: 303-415-9702
3380 Mitchell Lane                  orion at cora.nwra.com
Boulder, CO 80301              http://www.cora.nwra.com


