[Beowulf] Tyan S2882
Eric W. Biederman
ebiederm at xmission.com
Thu Sep 28 07:02:07 PDT 2006
Gebhardt Thomas <gebhardt at hrz.uni-marburg.de> writes:
> Hi,
>
>> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
>> found the system to be quite unstable. After BIOS updates and kernel
>> changes we still get random kernel panics when under load.
>
> Me too :-(
>
> We've got an 85-node dual Opteron cluster. I've documented most of the
> crashes on
> http://clust-doc.hrz.uni-marburg.de/index.php/Hardware_Bulletin .
>
> Our equipment:
>
> * Dual AMD Opteron DP270 (2.0 GHz)
> * MB: TYAN S2882G3-DNR
> * Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung CM72SD1024RLP-3200/SB
> (12 nodes have 8*2GB)
> * PS: EMACS P1 6400P
> * HD: 250 GB SATA from Western Digital
>
> Dist: Debian/Sarge amd64
> Kernel: various, currently 2.6.15.3 from kernel.org
> BIOS: (most recent, as far as I know)
>
> When a node crashes, we typically see an MCE + kernel panic. We get about
> 2 crashes per week on our 85 node cluster. Some nodes seem to be more unstable
> than others but we also see instabilities on nodes that had been stable so
> far. The instabilities are very hard to reproduce: we have nodes that crashed
> once and ran stable afterwards. Crashes seem to occur mostly when the system
> is under heavy CPU (memory?) load.
I bet if you decode the MCE it will say uncorrectable ECC memory error.
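As a rough illustration of what that decode looks at, here is a minimal
Python sketch that picks apart a 64-bit machine-check status word. The
VAL/OVER/UC/ADDRV/PCC bits follow the usual x86 MCi_STATUS layout; the
CECC/UECC positions (bits 46 and 45) are my assumption of the AMD K8
layout and should be checked against the processor documentation. The
function name decode_mce_status is hypothetical, and in practice a tool
such as mcelog does this for you.

```python
# Bit positions in a 64-bit machine-check status word (MCi_STATUS).
VAL = 1 << 63    # register holds a valid error
OVER = 1 << 62   # overflow: another error arrived before this one was cleared
UC = 1 << 61     # error was not corrected by hardware
ADDRV = 1 << 58  # the matching address register is valid
PCC = 1 << 57    # processor context may be corrupt
CECC = 1 << 46   # correctable ECC error (assumed AMD K8 layout)
UECC = 1 << 45   # uncorrectable ECC error (assumed AMD K8 layout)


def decode_mce_status(status):
    """Return human-readable flags for a machine-check status word."""
    if not status & VAL:
        return ["no valid error logged"]
    flags = ["uncorrected" if status & UC else "corrected by hardware"]
    if status & UECC:
        flags.append("uncorrectable ECC error")
    if status & CECC:
        flags.append("correctable ECC error")
    if status & PCC:
        flags.append("processor context corrupt")
    if status & OVER:
        flags.append("overflow (earlier error lost)")
    if status & ADDRV:
        flags.append("error address valid")
    return flags


if __name__ == "__main__":
    # Hypothetical status value built from the flags above, not a real log entry.
    example = VAL | UC | PCC | UECC
    print(hex(example), decode_mce_status(example))
```

If the status word from one of your panics shows the uncorrected and
uncorrectable-ECC flags together, that points at memory rather than the
kernel.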
> Far too many correctable ECC errors are reported (on a subset of about 10-20
> nodes). Sometimes the ECC errors disappeared after I cyclically interchanged
> the memory modules within one node. There seems to be a weak correlation
> between the instabilities and the tendency to exhibit ECC errors. memtest86
> runs fine on the memory modules.
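memtest86 doesn't see correctable memory errors: with ECC enabled, the memory
controller fixes them before the data reaches the CPU, so the pattern
comparison still passes.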
> It seems that the last BIOS upgrade has reduced the ECC error rate
> somewhat.
>
> We definitely have no temperature problem. As far as I can see (libsensor)
> the voltages are ok, too.
It sounds like you have a pile of correctable (soft?) memory errors that occasionally
become uncorrectable.
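One way to watch for that is to track the corrected-error counters per
node over time. The sketch below assumes a kernel with EDAC support and a
driver loaded for the memory controller, which exposes ce_count and
ue_count files under /sys/devices/system/edac/mc/mcN/. That may not be
available on the 2.6.15.3 kernel mentioned above. The function name
read_ecc_counts is hypothetical.

```python
import glob
import os


def read_ecc_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}."""
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(edac_root, "mc[0-9]*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc_dir, name)) as f:
                    entry[name] = int(f.read().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this controller.
                entry[name] = None
        counts[os.path.basename(mc_dir)] = entry
    return counts


if __name__ == "__main__":
    results = read_ecc_counts()
    if not results:
        print("no EDAC memory controllers found")
    for mc, entry in results.items():
        print(mc, "corrected:", entry["ce_count"],
              "uncorrected:", entry["ue_count"])
```

Run from cron on every node, a steadily climbing corrected count on a
node would be a reason to swap its DIMMs before it panics.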
Good Luck,
Eric