[Beowulf] Tyan S2882
hahn at physics.mcmaster.ca
Tue Sep 26 08:14:45 PDT 2006
> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
these are older, well-known, widely installed and certainly _can_ run stably.
> found the system to be quite unstable. After BIOS updates and kernel
> changes we still get random kernel panics when under load.
have you run memtest86? are you monitoring temperatures?
(and perhaps voltages)
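for the temperature side, here is a minimal sketch in python, assuming a linux box whose sensor driver exposes readings as temp*_input files (millidegrees C) under /sys/class/hwmon. on some kernels the files sit one level deeper, under hwmon*/device/, so the glob may need adjusting. the function names and the 70 C threshold are made up for illustration, not a recommended limit for this board.

```python
import glob
import os


def read_temps(root="/sys/class/hwmon"):
    """Return {sensor_path: degrees_C} for every temp*_input file under root."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        try:
            with open(path) as f:
                # sysfs reports millidegrees Celsius as a plain integer
                temps[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor unreadable or not reporting; skip it
    return temps


def over_limit(temps, limit_c=70.0):
    """Return only the sensors at or above limit_c (threshold is an assumption)."""
    return {p: t for p, t in temps.items() if t >= limit_c}


if __name__ == "__main__":
    readings = read_temps()
    hot = over_limit(readings)
    for path, t in readings.items():
        flag = "  <-- hot" if path in hot else ""
        print(f"{path}: {t:.1f} C{flag}")
```

run it from cron or a loop while the node is under load and log the output to another machine, so the last readings before a panic survive.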
> So far we have solved the
> - broken BIOS problem with an update to the most recent BIOS.
due to a newer cpu? the cluster I have with S2882's (mixed with
S2881's, I think) hasn't needed any updates, but it's not using
dual-core or anything exotic.
> - Discovered that some power supplies can produce problems
I have a hard time believing this is specific to antec+tyan.
yes, certainly, PS's are a sensitive point, especially if you've
got heavily-configured systems.
> - FS corruption due to a firmware problem in a RAID hardware board
therefore not related to the MB, right?
> - MCE chipkill errors (non-fatal) due to apparent bad RAM
also not related to the MB, right? also, you really should expect
some small rate of corrected ECC's on any system; it's only a high
rate that's a problem (or uncorrectable ones, of course...)
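to put a number on "high rate", here is a rough sketch in python, assuming a kernel with the EDAC driver loaded, which exposes per-controller ce_count (corrected) and ue_count (uncorrected) files under /sys/devices/system/edac/mc/mc*/. the function names and the one-corrected-error-per-hour threshold are hypothetical placeholders; pick a limit that suits your own memory and workload.

```python
import glob
import os


def read_counts(root="/sys/devices/system/edac/mc"):
    """Sum corrected (ce) and uncorrected (ue) error counts over all controllers."""
    totals = {"ce": 0, "ue": 0}
    for mc in glob.glob(os.path.join(root, "mc*")):
        for key in totals:
            try:
                with open(os.path.join(mc, key + "_count")) as f:
                    totals[key] += int(f.read().strip())
            except (OSError, ValueError):
                pass  # counter missing or unreadable on this controller
    return totals


def assess(before, after, hours, max_ce_per_hour=1.0):
    """Classify the change between two samples taken `hours` apart (hours > 0).

    max_ce_per_hour is an illustrative threshold, not a vendor figure.
    """
    if after["ue"] > before["ue"]:
        return "uncorrectable"  # any new uncorrected error is a problem
    rate = (after["ce"] - before["ce"]) / hours
    return "high" if rate > max_ce_per_hour else "ok"


if __name__ == "__main__":
    print(read_counts())
```

take one sample with read_counts(), take another some hours later, and pass both to assess(); a steady trickle of corrected errors comes back as "ok", while a burst or any uncorrected error gets flagged.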
> To be solved:
> - random kernel panics that take out the logging even when all debug
> flags are set in the kernel, as it fails to sync the disc during the
> kernel panic.
but kernel panics never sync - after all, a panic is specifically
an event from which you can't continue in any way. or am I misunderstanding
what you're saying?
it sounds like you've done a lot of debugging already, but I'd recommend
going back to basics. remove all the io devices, disks, etc and see
whether the board+cpu+memory can run stably.