[Beowulf] Tyan S2882

Mark Hahn hahn at physics.mcmaster.ca
Tue Sep 26 08:14:45 PDT 2006


> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have

these are older, well-known, widely installed and certainly _can_ run stably.

> found the system to be quite unstable. After BIOS updates and kernel
> changes we still get random kernel panics when under load.

have you run memtest86?  are you monitoring temperatures?
(and perhaps voltages)
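
for the monitoring part, here is a minimal sketch, assuming a Linux kernel that exposes sensors through the hwmon sysfs interface (/sys/class/hwmon, values in millidegrees C and millivolts).  whether a sensor driver for the S2882 is loaded on your nodes is an assumption, and the helper names and the 70 C limit are purely illustrative, not a vendor figure.

```python
import glob
import os


def read_sensors(root="/sys/class/hwmon"):
    """Return {relative_path: value} for temperatures (deg C) and voltages (V).

    hwmon reports temp*_input in millidegrees Celsius and in*_input in
    millivolts, so both are divided by 1000.
    """
    readings = {}
    for kind in ("temp", "in"):
        pattern = os.path.join(root, "hwmon*", kind + "*_input")
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as f:
                    raw = int(f.read().strip())
            except (OSError, ValueError):
                continue  # sensor unreadable or not a number; skip it
            readings[os.path.relpath(path, root)] = raw / 1000.0
    return readings


def too_hot(readings, limit_c=70.0):
    """Return only the temperature readings above limit_c (hypothetical limit)."""
    return {
        name: value
        for name, value in readings.items()
        if os.path.basename(name).startswith("temp") and value > limit_c
    }


if __name__ == "__main__":
    current = read_sensors()
    for name, value in sorted(current.items()):
        print(name, value)
    print("over limit:", too_hot(current))
```

run it from cron or a loop while the node is under load and log the output off-node, so the last readings before a panic survive.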

> So far we have solved the
> - broken BIOS problem with an update to the most recent BIOS.

due to a newer cpu?  the cluster I have with S2882's (mixed with 
S2881's, I think) hasn't needed any updates, but it's not using 
dual-core or anything exotic.

> - Discovered that some power supplies can produce problems
> http://www.anandtech.com/mb/showdoc.aspx?i=2608

I have a hard time believing this is specific to antec+tyan.
yes, certainly, power supplies are a sensitive point, especially if you've
got heavily-configured systems.

> - FS corruption due to a firmware problem in a RAID hardware board

therefore not related to the MB, right?

> - MCE chipkill errors (non-fatal) due to apparent bad RAM

also not related to the MB, right?  also, you really should expect
some small rate of corrected ECC errors on any system; it's only a high
rate that's a problem (or uncorrectable ones, of course...)
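
to put a number on "high rate", a rough sketch, assuming the kernel's EDAC driver is loaded and exposes ce_count/ue_count under /sys/devices/system/edac/mc.  the function names and the 1-per-hour threshold are hypothetical; pick a threshold that fits your own memory size and tolerance.

```python
import glob
import os


def read_ecc_counts(root="/sys/devices/system/edac/mc"):
    """Sum corrected (ce) and uncorrected (ue) error counts over all memory controllers."""
    totals = {"ce": 0, "ue": 0}
    for kind in ("ce", "ue"):
        for path in glob.glob(os.path.join(root, "mc*", kind + "_count")):
            try:
                with open(path) as f:
                    totals[kind] += int(f.read().strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable; skip it
    return totals


def ce_rate_per_hour(before, after, elapsed_s):
    """Corrected errors per hour between two snapshots taken elapsed_s apart."""
    return (after["ce"] - before["ce"]) * 3600.0 / elapsed_s


def classify(before, after, elapsed_s, max_ce_per_hour=1.0):
    """Any new uncorrectable error is a problem; corrected ones only at a high rate."""
    if after["ue"] > before["ue"]:
        return "uncorrectable"
    if ce_rate_per_hour(before, after, elapsed_s) > max_ce_per_hour:
        return "high-rate"
    return "ok"
```

take one snapshot with read_ecc_counts(), wait a known interval under load, take another, and pass both to classify().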

> To be solved:
> - random kernel panics that take out the logging even when all debug
> flags are set in the kernel, as it fails to sync the disc during the
> kernel panic.

but kernel panics never sync - after all, a panic is specifically
an event from which you can't continue in any way.  or am I misunderstanding
what you're saying?

it sounds like you've done a lot of debugging already, but I'd recommend 
going back to basics.  remove all the io devices, disks, etc and see 
whether the board+cpu+memory can run stably, etc.


