[Beowulf] Tyan S2882
bill at cse.ucdavis.edu
Tue Sep 26 18:58:28 PDT 2006
> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
> found the system to be quite unstable. After BIOS updates and kernel
Unstable when? When idle? Under heavy CPU load? Under heavy I/O?
During install? Which OS, distribution, and kernel version?
> changes we still get random kernel panics when under load.
What kind of load? How big is the power supply? What kind of CPU?
> Does anyone have these boards and a solution? I have mailed
> other users of this board who also reported random kernel panics and
> an unusual number of hardware problems.
How many are unreliable? 1 of 1? 10 of 10? 64 of 64?
> So far we have solved the
> - broken BIOS problem with an update to the most recent BIOS.
> - Discovered that some power supplies can produce problems
Power supplies do degrade over time, especially if overloaded.
> - FS corruption due to a firmware problem in a RAID hardware board
Indeed, hardware RAID problems seem shockingly common.
> - MCE chipkill errors (non-fatal) due to apparent bad RAM
Detected how? Has the new memory passed 24 hours of memtest86? Are you using
RAM certified as compatible with the S2882?
> To be solved:
> - random kernel panics that take out the logging even when all debug
> flags are set in the kernel, as it fails to sync the disc during the
> kernel panic.
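You could log the console to a serial port, so the panic output is
captured on another machine even when the disk can't be synced.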
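For what it's worth, here is a minimal sketch of the receiving side of a serial log. It assumes the unstable node sends its kernel console to the first serial port (console=ttyS0,115200n8 on the kernel command line), that a null-modem cable runs to a second machine, and that the port on that machine has already been set to the same speed. The function names, device path, and log file name are made up for illustration; a terminal program such as minicom with capture turned on would do the same job.

```python
import datetime


def log_console(src, dst, now=datetime.datetime.now):
    """Copy console lines from src to dst, prefixing each with a timestamp.

    src: binary file-like object (for example the serial device opened "rb")
    dst: text file-like object (for example a log file opened for append)
    Returns the number of lines written.
    """
    count = 0
    for raw in src:
        # Console output is mostly ASCII; replace anything that is not.
        line = raw.decode("ascii", errors="replace").rstrip("\r\n")
        dst.write(f"{now().isoformat(timespec='seconds')} {line}\n")
        dst.flush()  # flush per line so the last lines before a hang are kept
        count += 1
    return count


def capture(device="/dev/ttyS0", logfile="console.log"):
    """Hypothetical entry point: both default paths are assumptions."""
    with open(device, "rb", buffering=0) as src, open(logfile, "a") as dst:
        return log_console(src, dst)
```

Calling capture() on the logging host blocks and appends whatever the other node prints, so the panic message and call trace end up in console.log with the time they arrived.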
I've got at least 32 of these, and they seem pretty reliable.