[Beowulf] Tyan S2882

Bill Broadley bill at cse.ucdavis.edu
Tue Sep 26 18:58:28 PDT 2006


Krugger wrote:
> Hi,
> 
> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
> found the system to be quite unstable. After BIOS updates and kernel

Unstable when?  When idle?  Under heavy cpu load?  Under heavy I/O?
During Install?  Which OS/Dist/Kernel?

> changes we still get random kernel panics when under load.

What kind of load?  How big is the power supply?  What kind of CPU?

> Does anyone have these boards and has found a solution? I have mailed
> other users of this board, who also reported random kernel panics and
> an unusual number of hardware problems.

How many are unreliable?  1 of 1? 10 of 10? 64 of 64?

> So far we have solved the
> - broken BIOS problem with an update to the most recent BIOS.
> - Discovered that some power supplies can produce problems
> http://www.anandtech.com/mb/showdoc.aspx?i=2608

Power supplies do degrade over time, especially if overloaded.

> - FS corruption due to a firmware problem in a RAID hardware board

Indeed, hardware RAID problems seem shockingly common.

> - MCE chipkill errors (non-fatal) due to apparent bad RAM

Detected how?  Did the new memory pass 24 hours of memtest86?  Are you
using RAM certified as compatible with the S2882?

> To be solved:
> - random kernel panics that take out the logging even when all debug
> flags are set in the kernel, as it fails to sync the disc during the
> kernel panic.

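You could log the console to a serial port, so the panic output is
captured by another machine even when the disk can't be synced.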
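
For the serial logging, a minimal sketch of the receiving side, assuming
the panicking node boots with something like console=ttyS0,115200 and a
null-modem cable runs to a second machine.  The function names, the
device path and the log file name are made up for illustration, and the
port speed is assumed to have been set beforehand (e.g. with stty).
Each line is timestamped and flushed at once, so the last lines before
a panic are not lost in a buffer on the logging host:

```python
import time


def log_console(stream, log, clock=time.time):
    """Copy lines from a binary console stream to a text log.

    Every line gets a local timestamp and is flushed immediately.
    Returns the number of lines written.
    """
    count = 0
    # readline() returns b"" at end of stream, which ends the loop.
    for raw in iter(stream.readline, b""):
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        log.write(f"{stamp} {text}\n")
        log.flush()
        count += 1
    return count


def capture(device, logfile):
    """Hypothetical helper: follow a serial device and append to logfile."""
    # Unbuffered binary read from the port; the log is opened in append mode.
    with open(device, "rb", buffering=0) as port, open(logfile, "a") as log:
        return log_console(port, log)
```

On the logging host this would be run as capture("/dev/ttyS0",
"console.log"), with the device path adjusted to whichever port the
cable is on.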

I've got at least 32 of these, and they seem pretty reliable.



