[Beowulf] Tyan S2882
    Bill Broadley 
    bill at cse.ucdavis.edu
       
    Tue Sep 26 18:58:28 PDT 2006
    
    
  
Krugger wrote:
> Hi,
> 
> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
> found the system to be quite unstable. After BIOS updates and kernel
Unstable when?  When idle?  Under heavy CPU load?  Under heavy I/O?
During install?  Which OS/distribution/kernel?
> changes we still get random kernel panics when under load.
What kind of load?  How big is the power supply?  What kind of CPU?
> Has anyone with these boards found a solution? I have mailed
> other users of this board, who also reported random kernel panics and
> an unusual number of hardware problems.
How many are unreliable?  1 of 1? 10 of 10? 64 of 64?
> So far we have solved the
> - broken BIOS problem with an update to the most recent BIOS.
> - Discovered that some power supplies can produce problems
> http://www.anandtech.com/mb/showdoc.aspx?i=2608
Power supplies do degrade over time, especially if overloaded.
> - FS corruption due to a firmware problem in a hardware RAID board
Indeed, hardware RAID problems seem shockingly common.
> - MCE chipkill errors (non-fatal) due to apparent bad RAM
Detected how?  Has the new memory passed 24 hours of memtest86?  Are you using
RAM certified as compatible with the S2882?
> To be solved:
> - random kernel panics that take out the logging even when all debug
> flags are set in the kernel, as it fails to sync the disc during the
> kernel panic.
You could log the kernel messages to a serial console, which doesn't depend
on the disk being synced during the panic.
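For what it's worth, a minimal sketch of a serial console setup, assuming the
board's first serial port (ttyS0), 115200 baud, and a null-modem cable to a
second machine that records the output. None of these specifics come from the
original report, so adjust them to your hardware. Append this to the kernel
line in your boot loader config:

```
console=tty0 console=ttyS0,115200n8
```

The kernel prints its messages to every console listed, so the panic text
goes out the serial port even when nothing reaches the disk. On the second
machine, a terminal program set to the same speed (minicom, for example) with
capture turned on will keep a copy of the trace.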
I've got at least 32 of these, and they seem pretty reliable.
    
    