[Beowulf] Tyan S2882
Mark Hahn hahn at physics.mcmaster.ca
Thu Sep 28 07:17:27 PDT 2006
> * Dual AMP Opteron DP270 (2.0 GHz)

which rev?

> * Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung CM72SD1024RLP-3200/SB
> ( 12 nodes have 8*2GB)

this dimm is 2-rank, I believe; corsair's datasheet is pretty lame.
that means that each bank of memory is 4x2=8 ranks. that's definitely
pushing the limit; I'm sure it can be done in some cases, but it's
definitely not supported by some rev's of the opteron, and will always
be pretty bleeding-edge.

> When a node crashes, we typically see a MCE + kernel panic. We get about

try running mcelog periodically; I bet you see lots of corrected ECC's.

> once and ran stable afterwards. Crashes seem to occur mostly when the system
> is under heavy CPU (memory?) load.

yep.

> Far too many correctable ECC errors are reported (on a subset of about 10-20
> nodes). Sometimes the ECC errors disappeared after I cyclically interchanged
> the memory modules within one node. There seems to be a weak correlation
> between the instabilities and the tendency to exhibit ECC errors.

IMO, the config is the problem, not the boards, cpus, dimms, etc.

> It seems that the last BIOS upgrade has reduced the ECC error rate
> somewhat.

probably made the timing a little looser. does the bios let you tweak?
it would be interesting to know whether derating the clock (->pc2700)
helps this situation more or less than derating the latency.

regards, mark hahn.
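
For anyone wanting to redo the "4x2=8 ranks" arithmetic for their own nodes, here is a rough Python sketch. It assumes (the post doesn't say so explicitly) that the 8 DIMMs are split evenly between the two Opterons' memory controllers and that every DIMM is dual-rank. The function names are made up for illustration. The bandwidth helper just applies the usual per-module convention (transfer rate times an 8-byte bus) to show what derating from PC3200 to PC2700 gives up on paper; it says nothing about which derating actually cures the ECC errors.

```python
# Rough sketch of the rank and bandwidth arithmetic discussed above.
# Assumptions (not confirmed by the post): DIMMs are split evenly across
# sockets, and every DIMM has the same number of ranks.

def ranks_per_controller(dimms_per_node: int, sockets: int, ranks_per_dimm: int) -> int:
    """Ranks that each CPU's on-die memory controller has to drive."""
    if dimms_per_node % sockets != 0:
        raise ValueError("DIMMs are not split evenly across sockets")
    return (dimms_per_node // sockets) * ranks_per_dimm


def module_peak_mb_per_s(mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Nominal peak bandwidth of one 64-bit DDR module, in MB/s."""
    return mega_transfers_per_s * bus_bytes


if __name__ == "__main__":
    # 8 DIMMs per node, 2 sockets, dual-rank DIMMs (assumed)
    print("ranks per controller:", ranks_per_controller(8, 2, 2))

    pc3200 = module_peak_mb_per_s(400.0)        # DDR-400
    pc2700 = module_peak_mb_per_s(1000.0 / 3)   # DDR-333, nominal 333.33 MT/s
    print("PC3200 nominal MB/s:", round(pc3200))
    print("PC2700 nominal MB/s:", round(pc2700))
    print("fraction of peak lost by derating:", round(1 - pc2700 / pc3200, 3))
```

Pulling half the DIMMs (4 per node, 2 per socket) would give `ranks_per_controller(4, 2, 2)`, i.e. 4 ranks per controller, which is one cheap way to test whether the rank loading is the culprit.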
