[Beowulf] Strange hardware? problems
Orion Poplawski
orion at cora.nwra.com
Mon Apr 30 14:50:39 PDT 2007
Mark Hahn wrote:
>
> they blow up consistently on multiple 244's? might all the 244's have
> the same potentially flawed cooling/heatsink-compound/powersupply/etc?
Well, they are identical, so yes.
> are you saying there's something _similar_ between the 244s and the 275s?
Nope, but both 244 machines behave the same as the two 275s. So I don't
really suspect a problem with a particular memory chip on a particular
machine for example.
> is your ram ECC (and enabled as such in bios, preferably with scrub
> enabled)?
Yup, w/scrubbing.
> if ECC, have you run mcelog?
Yup. Nothing reported.
>> We've tried FC4/5 on the 244 machines. At one point all were running
>> identical FC5 installs with the same problems.
>
> why do you think the problem is software?
I'm very confused at this point.
>> Problem is not exactly reproducible unfortunately. It will crash at
>> different times in the simulations, but they will crash at some point
>> with the length of runs we are doing.
>
> sounds like heat/power to me.
Possible, but the machines are actually fairly lightly used with our
model tests - just one processor. sensors reports CPU temps <= 40C.
>> Are there any cpu tests out there that would check the accuracy of
>> various calculations?
>
> you don't mean accuracy, do you? like some subtle problem with
> low-order bits?
> I tend to use HPL for this kind of test - it's not too hard to tune it to
> use however much memory you want, and to run for as long as you want.
> it's not the ultimate memory-grinder, but it's pretty intense.
I do mean accuracy, and not necessarily subtle - things blow up bad.
I'm looking for something that performs some set of calculations over
and over and errors out if it doesn't give the expected result.
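
Something along these lines is what I have in mind. This is only a rough
sketch in Python, not an existing tool: the function names, the particular
arithmetic, and the iteration counts are all made up for illustration. It
takes the first run as the reference, so it would only catch a machine
that disagrees with itself from run to run, not one that is consistently
wrong.

```python
import math
import struct


def workload(n=10000):
    """Deterministic floating-point workload; returns the result's raw bytes."""
    acc = 0.0
    for i in range(1, n + 1):
        x = float(i)
        acc += math.sin(x) * math.sqrt(x) / (1.0 + x * x)
    # Compare bit patterns rather than values so any flipped bit is caught.
    return struct.pack("<d", acc)


def check_repeatability(iterations=20, n=10000):
    """Repeat the workload; return the index of the first mismatch, or None."""
    reference = workload(n)  # assumption: the first run is taken as "expected"
    for k in range(iterations):
        if workload(n) != reference:
            return k
    return None


if __name__ == "__main__":
    bad = check_repeatability()
    if bad is None:
        print("all iterations matched the reference")
    else:
        print(f"mismatch at iteration {bad}")
```

For a real soak test the loop would have to run for hours rather than
seconds, with one copy per core.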
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA/CoRA Division FAX: 303-415-9702
3380 Mitchell Lane orion at cora.nwra.com
Boulder, CO 80301 http://www.cora.nwra.com