Mark Hahn hahn at mcmaster.ca
Fri Apr 27 11:24:49 PDT 2007

> - 2 Tyan S2882 dual processor Opteron 244 stepping 10
> - 2 Tyan S2882-D dual processor dual core Opteron 275 stepping 2

OK, those are obviously fairly different in age and power.

> We have two (relatively complicated) numerical models (RAMS and a homegrown 
> one) that will blow up in random locations on the 244 machines but run fine 
> on the 275 machines.

they blow up consistently on multiple 244's?  might all the 244's have the 
same potentially flawed cooling/heatsink-compound/powersupply/etc?  are you 
saying there's something _similar_ between the 244s and the 275s?

> By blow up it appears the calculations get corrupted in some way and the 
> numbers get un-physical in RAMS and the simulation exits.  With the other 
> model we get segfaults.

is your ram ECC (and enabled as such in bios, preferably with scrub enabled)?
if ECC, have you run mcelog?
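
if the kernel's EDAC driver is loaded, the corrected/uncorrected error counters 
are another place to look besides mcelog.  a minimal sketch, assuming the usual 
sysfs layout (/sys/devices/system/edac/mc/mcN/ce_count and ue_count); the 
function name read_edac_counts is made up for illustration, and an empty result 
only means no counters were found, not that the memory is fine:

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    Assumes the standard EDAC layout; controllers whose counter files are
    missing or unreadable are skipped.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found (driver not loaded, or ECC off?)")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

a corrected count that keeps climbing during a run points at memory (or its 
cooling/power) rather than at the OS install.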

> We've tried FC4/5 on the 244 machines.  At one point all were running 
> identical FC5 installs with the same problems.

why do you think the problem is software?

> Problem is not exactly reproducible unfortunately.  It will crash at 
> different times in the simulations, but they will crash at some point with 
> the length of runs we are doing.

sounds like heat/power to me.

> Are there any cpu tests out there that would check the accuracy of various 
> calculations?

you don't mean accuracy, do you?  like some subtle problem with low-order bits?
I tend to use HPL for this kind of test - it's not too hard to tune it to use 
however much memory you want, and to run for as long as you want.  it's not
the ultimate memory-grinder, but it's pretty intense.
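
if the failure is random corruption rather than a low-order-bit issue, the 
useful check is repeatability: the same deterministic job should give 
bit-identical answers every time on healthy hardware.  a rough sketch of that 
idea follows.  it is not HPL and is far lighter than HPL (it will not heat a 
machine up the same way); the names workload and count_mismatches and the 
matrix size are arbitrary choices for illustration:

```python
import hashlib
import random
import struct


def workload(n=60, seed=12345):
    """Multiply two seeded n x n matrices and hash the exact result bits."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    b_cols = list(zip(*b))  # columns of b, so each dot product is a simple zip
    digest = hashlib.sha256()
    for row in a:
        for col in b_cols:
            s = 0.0
            for x, y in zip(row, col):
                s += x * y
            # hash the raw IEEE-754 bytes so a single flipped bit is detected
            digest.update(struct.pack("<d", s))
    return digest.hexdigest()


def count_mismatches(runs=20, **kwargs):
    """Run the workload `runs` times; count results that differ from the first."""
    reference = workload(**kwargs)
    return sum(1 for _ in range(runs - 1) if workload(**kwargs) != reference)


if __name__ == "__main__":
    bad = count_mismatches(runs=20)
    print(f"{bad} mismatching runs out of 19 repeats")
```

any nonzero mismatch count on one box with the same inputs means the hardware 
(or its cooling/power) is suspect, since nothing in the calculation is random 
between repeats.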

