[Beowulf] Tyan S2882

Vincent Diepeveen diep at xs4all.nl
Thu Sep 28 17:16:25 PDT 2006


My dual Opteron dual-core machine is extremely stable,
except when I run one type of software, namely software that
multiplies non-stop. I run that under Ubuntu.

That really seems to be a worst-case path in the dual-core Opteron chips.

After it has been multiplying non-stop for a number of days,
I get a complete crash of the system.

With any other software, under Windows (x64) or Ubuntu Linux,
it runs extremely stable for months.

Is it possible that some of the crashes you had were caused by non-stop
multiplication of numbers?

Well-optimized software will of course limit the instruction overhead
when doing matrix calculations or whatever, and will basically be
busy multiplying.

In my case it was big-number multiplication, using only integer multiplies.
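
For anyone who wants to try a similar load, here is a minimal sketch of a
big-integer multiply loop that also cross-checks every product. It is not
the software I actually ran; the function name multiply_stress and its
parameters (iterations, bits, seed) are made up for illustration, and
Python's built-in big integers will not produce the same instruction mix
as hand-tuned code. The idea is only that a residue check lets a wrong
product show up as a counted mismatch instead of waiting for a crash.

```python
import random


def multiply_stress(iterations=1000, bits=4096, seed=0):
    """Multiply random big integers and cross-check each product.

    Returns the number of products that failed the residue check
    (0 is expected on healthy hardware).
    """
    rng = random.Random(seed)
    modulus = (1 << 61) - 1  # Mersenne prime used for the residue check
    mismatches = 0
    for _ in range(iterations):
        a = rng.getrandbits(bits)
        b = rng.getrandbits(bits)
        product = a * b
        # (a * b) mod m must equal ((a mod m) * (b mod m)) mod m.
        expected = ((a % modulus) * (b % modulus)) % modulus
        if product % modulus != expected:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatches:", multiply_stress())
```

To load every core you would start one such process per core and raise
iterations so that it runs for days rather than seconds.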

Vincent


----- Original Message ----- 
From: "Gebhardt Thomas" <gebhardt at hrz.uni-marburg.de>
To: <beowulf at beowulf.org>
Sent: Wednesday, September 27, 2006 10:20 AM
Subject: Re: [Beowulf] Tyan S2882


> Hi,
>
>> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
>> found the system to be quite unstable. After BIOS updates and kernel
>> changes we still get random kernel panics when under load.
>
> Me too :-(
>
> We've got an 85-node dual Opteron cluster. I've documented most of the
> crashes on
> http://clust-doc.hrz.uni-marburg.de/index.php/Hardware_Bulletin .
>
> Our equipment:
>
> * Dual AMD Opteron DP270 (2.0 GHz)
> * MB: TYAN S2882G3-DNR
> * Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung CM72SD1024RLP-3200/SB
>   (12 nodes have 8*2GB)
> * PS: EMACS P1 6400P
> * HD: 250 GB SATA from Western Digital
>
> Dist: Debian/Sarge amd64
> Kernel: various, currently 2.6.15.3 from kernel.org
> BIOS: (most recent, as far as I know)
>
> When a node crashes, we typically see an MCE + kernel panic. We get about
> 2 crashes per week on our 85-node cluster. Some nodes seem to be more
> unstable than others, but we also see instabilities on nodes that had been
> stable so far. The instabilities are very hard to reproduce: we have nodes
> that crashed once and ran stable afterwards. Crashes seem to occur mostly
> when the system is under heavy CPU (memory?) load.
>
> Far too many correctable ECC errors are reported (on a subset of about
> 10-20 nodes). Sometimes the ECC errors disappeared after I cyclically
> interchanged the memory modules within one node. There seems to be a weak
> correlation between the instabilities and the tendency to exhibit ECC
> errors. memtest86 runs fine on the memory modules.
>
> It seems that the last BIOS upgrade has reduced the ECC error rate
> somewhat.
>
> We definitely have no temperature problem. As far as I can see (libsensor)
> the voltages are ok, too.
>
> Cheers, Thomas



