[Beowulf] Strange hardware? problems
Mark Hahn
hahn at mcmaster.ca
Fri Apr 27 11:24:49 PDT 2007
> - 2 Tyan S2882 dual processor Opteron 244 stepping 10
> - 2 Tyan S2882-D dual processor dual core Opteron 275 stepping 2
OK, those are obviously fairly different in age and power.
> We have two (relatively complicated) numerical models (RAMS and a homegrown
> one) that will blow up in random locations on the 244 machines but run fine
> on the 275 machines.
they blow up consistently on multiple 244's? might all the 244's have the
same potentially flawed cooling/heatsink-compound/powersupply/etc? are you
saying there's something _similar_ between the 244s and the 275s?
> By blow up it appears the calculations get corrupted in some way and the
> numbers get un-physical in RAMS and the simulation exits. With the other
> model we get segfaults.
is your ram ECC (and enabled as such in bios, preferably with scrub enabled)?
if ECC, have you run mcelog?
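
as a complement to mcelog (not a replacement for it), corrected and uncorrected ECC error counts can also be polled from sysfs. the sketch below is a minimal example; it assumes the kernel's EDAC driver is loaded for the memory controller and exposes the usual per-controller ce_count/ue_count files. the function name and the default path argument are illustrative, not something from the original thread.

```python
import os


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    Assumes the Linux EDAC sysfs layout (mc0, mc1, ... under `base`).
    Returns an empty dict if `base` does not exist, e.g. when no EDAC
    driver is loaded.
    """
    counts = {}
    if not os.path.isdir(base):
        return counts
    for name in sorted(os.listdir(base)):
        # Skip non-controller entries such as "power" or "uevent".
        if not name.startswith("mc"):
            continue
        entry = {}
        for key in ("ce_count", "ue_count"):
            path = os.path.join(base, name, key)
            try:
                with open(path) as fh:
                    entry[key] = int(fh.read().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this kernel.
                entry[key] = None
        counts[name] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(mc, "corrected:", entry["ce_count"],
              "uncorrected:", entry["ue_count"])
```

a ce_count that climbs during a long run points at marginal memory (or heat/power affecting it) rather than at the models themselves.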
> We've tried FC4/5 on the 244 machines. At one point all were running
> identical FC5 installs with the same problems.
why do you think the problem is software?
> Problem is not exactly reproducible unfortunately. It will crash at
> different times in the simulations, but they will crash at some point with
> the length of runs we are doing.
sounds like heat/power to me.
> Are there any cpu tests out there that would check the accuracy of various
> calculations?
you don't mean accuracy, do you? like some subtle problem with low-order bits?
I tend to use HPL for this kind of test - it's not too hard to tune it to use
however much memory you want, and to run for as long as you want. it's not
the ultimate memory-grinder, but it's pretty intense.
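
to make the "use however much memory you want" part concrete: HPL factors a dense N x N matrix of doubles, so its footprint is roughly 8*N^2 bytes. a rough sketch for picking N from a memory budget follows. the 0.8 fraction and the block size of 128 are assumptions chosen for illustration, not tuned values for these machines, and the function name is made up for the example.

```python
import math


def hpl_problem_size(total_mem_bytes, fraction=0.8, block_size=128):
    """Pick an HPL problem size N for a given memory budget.

    The N x N matrix of 8-byte doubles should fit in `fraction` of
    `total_mem_bytes`. N is rounded down to a multiple of `block_size`
    (the NB value you plan to use), which keeps the last block full.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Number of doubles that fit in the budget, then the largest
    # square side that stays within it.
    n = math.isqrt(int(total_mem_bytes * fraction) // 8)
    return n - n % block_size


if __name__ == "__main__":
    # Example: total memory across the nodes in the run is 4 GiB.
    print(hpl_problem_size(4 * 2**30))
```

with 4 GiB total and the defaults this gives N = 20608. the result goes in the Ns line of HPL.dat; run time is then controlled by repeating the run or by raising the memory fraction.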