[Beowulf] Tyan S2882

Mark Hahn hahn at physics.mcmaster.ca
Thu Sep 28 07:17:27 PDT 2006

> * Dual AMD Opteron DP270 (2.0 GHz)

which rev?

> * Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung CM72SD1024RLP-3200/SB
>  ( 12 nodes have 8*2GB)

this dimm is 2-rank, I believe; corsair's datasheet is pretty lame.
that means that each bank of memory is 4x2=8 ranks.  that's definitely
pushing the limit; I'm sure it can be done in some cases, but it's
not supported by some revs of the opteron, and will always be pretty
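
for reference, the rank count above is just dimms times ranks per dimm. a
minimal sketch, assuming the 8 dimms are split evenly across the two cpus
(4 per memory controller) and that the dimm really is 2-rank; the function
and constant names are made up for illustration:

```python
def ranks_per_controller(dimms: int, ranks_per_dimm: int) -> int:
    """Total ranks one memory controller has to drive."""
    return dimms * ranks_per_dimm


# Assumption: 8 dimms per node, split evenly over 2 cpus.
DIMMS_PER_CPU = 8 // 2
# Assumption: the dimm is 2-rank (the datasheet does not make this clear).
RANKS_PER_DIMM = 2

total_ranks = ranks_per_controller(DIMMS_PER_CPU, RANKS_PER_DIMM)
print(total_ranks)  # 8
```

if the dimms turned out to be 1-rank, the same calculation would give 4.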

> When a node crashes, we typically see a MCE + kernel panic. We get about

try running mcelog periodically; I bet you see lots of corrected ECC errors.
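
a rough sketch of what "periodically" could look like, assuming python is
available on the nodes. poll_command is a hypothetical helper, not part of
mcelog; it runs whatever command you give it, appends any output to a log
with a timestamp, and returns how many non-empty lines it saw. how mcelog
itself should be invoked (plain, or as a client of a running daemon)
depends on the installed version, so the command is left as a parameter:

```python
import subprocess
import time


def poll_command(cmd, log_path, interval_s=3600, rounds=1):
    """Run cmd `rounds` times, `interval_s` seconds apart.

    Appends any non-empty stdout lines to log_path under a timestamp and
    returns the total number of such lines seen.
    """
    total = 0
    for i in range(rounds):
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        lines = [ln for ln in result.stdout.splitlines() if ln.strip()]
        if lines:
            with open(log_path, "a") as log:
                log.write(time.strftime("%Y-%m-%d %H:%M:%S") + "\n")
                log.write("\n".join(lines) + "\n")
        total += len(lines)
        if i + 1 < rounds:
            time.sleep(interval_s)
    return total


# Hypothetical usage (assumes mcelog is on PATH and this runs as root):
# poll_command(["mcelog"], "/var/log/mcelog-poll.log", interval_s=3600, rounds=24)
```

a cron entry that appends mcelog output to a file does the same job; the
point is just to keep a per-node record so the noisy nodes stand out.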

> once and ran stable afterwards. Crashes seem to occur mostly when the system
> is under heavy CPU (memory?) load.


> Far too many correctable ECC errors are reported (on a subset of about 10-20
> nodes). Sometimes the ECC errors disappeared after I cyclically interchanged
> the memory modules within one node. There seems to be a weak correlation
> between the instabilities and the tendency to exhibit ECC errors.

IMO, the config is the problem, not the boards, cpus, dimms, etc.

> It seems that the last BIOS upgrade has reduced the ECC error rate
> somewhat.

probably made the timing a little looser.  does the bios let you tweak
the memory timings?  it would be interesting to know whether derating the
clock (to pc2700) helps this situation more or less than derating the latency.
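
to put rough numbers on the two options: a back-of-envelope sketch, assuming
a 64-bit (8-byte) channel and a CAS latency of 3 cycles purely for
illustration (the actual timings of these dimms are not given here). the
helper names are hypothetical:

```python
def peak_mb_per_s(data_rate_mt_s: float, bus_bytes: int = 8) -> float:
    """Peak transfer rate of one channel in MB/s."""
    return data_rate_mt_s * bus_bytes


def cas_ns(cas_cycles: float, clock_mhz: float) -> float:
    """CAS latency in nanoseconds for a given memory clock."""
    return cas_cycles * 1000 / clock_mhz


# DDR400 (PC3200) runs a 200 MHz clock; DDR333 (PC2700) runs about 166.7 MHz.
PC3200_RATE, PC3200_CLOCK = 400.0, 200.0
PC2700_RATE, PC2700_CLOCK = 1000.0 / 3, 500.0 / 3

ILLUSTRATIVE_CL = 3  # assumption, not taken from the datasheet

bandwidth_loss = 1 - peak_mb_per_s(PC2700_RATE) / peak_mb_per_s(PC3200_RATE)
print(f"peak bandwidth given up by pc2700: {bandwidth_loss:.1%}")
print(f"CL{ILLUSTRATIVE_CL} at pc3200: {cas_ns(ILLUSTRATIVE_CL, PC3200_CLOCK):.1f} ns")
print(f"CL{ILLUSTRATIVE_CL} at pc2700: {cas_ns(ILLUSTRATIVE_CL, PC2700_CLOCK):.1f} ns")
```

so dropping the clock costs about a sixth of peak bandwidth and also
stretches every timing in absolute terms, while loosening the latency at
full clock keeps the bandwidth and only adds cycles.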

regards, mark hahn.

