[Beowulf] Tyan S2882

Mark Hahn hahn at physics.mcmaster.ca
Thu Sep 28 07:17:27 PDT 2006


> * Dual AMD Opteron DP270 (2.0 GHz)

which rev?

> * Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung CM72SD1024RLP-3200/SB
>  ( 12 nodes have 8*2GB)

this dimm is 2-rank, I believe; corsair's datasheet is pretty lame.
that means that each bank of memory (4 dimms x 2 ranks) is 8 ranks.  that's
definitely pushing the limit; I'm sure it can be done in some cases, but it's
definitely not supported by some revs of the opteron, and will always be
pretty bleeding-edge.

> When a node crashes, we typically see a MCE + kernel panic. We get about

try running mcelog periodically; I bet you see lots of corrected ECC errors.
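
one way to keep an eye on corrected errors between mcelog runs is to sample the kernel's EDAC counters. the following is only a sketch, and it is not mcelog itself: it assumes an EDAC driver for the memory controller is loaded and exposes ce_count files under /sys/devices/system/edac/mc. the function names read_ce_counts and ce_deltas are made up for illustration.

```python
import glob
import os


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: corrected_error_count} for each EDAC memory controller.

    Assumes the EDAC sysfs layout (mc0/ce_count, mc1/ce_count, ...).
    Returns an empty dict if no EDAC driver is loaded.
    """
    counts = {}
    pattern = os.path.join(edac_root, "mc*", "ce_count")
    for path in sorted(glob.glob(pattern)):
        controller = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                counts[controller] = int(f.read().strip())
        except (OSError, ValueError):
            # Unreadable or malformed counter: skip it rather than guess.
            continue
    return counts


def ce_deltas(previous, current):
    """Corrected errors accumulated between two samples, per controller."""
    return {mc: current[mc] - previous.get(mc, 0) for mc in current}


if __name__ == "__main__":
    for mc, count in read_ce_counts().items():
        print(mc, count)
```

run from cron on every node and compare successive samples with ce_deltas; a node whose count climbs under load is a candidate for the instability described here. the counters reset on reboot or driver reload, so a negative delta just means the baseline was lost.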

> once and ran stable afterwards. Crashes seem to occur mostly when the system
> is under heavy CPU (memory?) load.

yep.

> Far too many correctable ECC errors are reported (on a subset of about 10-20
> nodes). Sometimes the ECC errors disappeared after I cyclically interchanged
> the memory modules within one node. There seems to be a weak correlation
> between the instabilities and the tendency to exhibit ECC errors.

IMO, the config is the problem, not the boards, cpus, dimms, etc.

> It seems that the last BIOS upgrade has reduced the ECC error rate
> somewhat.

probably made the timing a little looser.  does the bios let you tweak?
it would be interesting to know whether derating the clock (->pc2700)
helps this situation more or less than derating the latency.
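
to put rough numbers on the two options, here is a small sketch of the arithmetic. the CAS values are illustrative assumptions only (the actual timings of these modules are not given in this thread), and ddr_timing is a hypothetical helper, not a tool from any package.

```python
def ddr_timing(data_rate_mts, cas_cycles, bus_bytes=8):
    """Peak bandwidth and CAS latency for one 64-bit DDR channel.

    data_rate_mts: transfers per second in millions (400 for DDR400/PC3200)
    cas_cycles:    CAS latency in memory clock cycles (assumed, not measured)
    bus_bytes:     channel width in bytes (8 for a 64-bit data bus)
    """
    clock_mhz = data_rate_mts / 2.0      # DDR moves data twice per clock
    cycle_ns = 1000.0 / clock_mhz        # one memory clock period
    return {
        "peak_mb_s": data_rate_mts * bus_bytes,
        "cycle_ns": cycle_ns,
        "cas_ns": cas_cycles * cycle_ns,
    }


if __name__ == "__main__":
    # Assumed settings for comparison; substitute what the BIOS really offers.
    options = {
        "DDR400 CL2.5 (tighter latency)": ddr_timing(400, 2.5),
        "DDR400 CL3 (looser latency)": ddr_timing(400, 3),
        "DDR333 CL2.5 (derated clock)": ddr_timing(1000 / 3, 2.5),
    }
    for name, t in options.items():
        print("%-32s %7.0f MB/s  cycle %.2f ns  CAS %.2f ns"
              % (name, t["peak_mb_s"], t["cycle_ns"], t["cas_ns"]))
```

under these assumptions, loosening CAS at DDR400 keeps the full peak bandwidth and only lengthens the CAS delay, while dropping to DDR333 cuts peak bandwidth by about a sixth but stretches every signal window on the bus. which of the two actually reduces the ECC error rate is the open question; the arithmetic only shows what each one costs.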

regards, mark hahn.


