[Beowulf] Tyan 2466 crashes, no obvious reason why

Tim Mattox tmattox at gmail.com
Fri Sep 3 12:02:47 PDT 2004

Hello David Mathog,
I don't know if others do this, but if I can afford the downtime, I
will redistribute the components from a crashing node amongst
some known healthy nodes.  Then when the crash re-appears
on one of the nodes, I am fairly sure which is the faulty component.
This may not be worth your time though, since it's a pain to do.
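
For what it's worth, the bookkeeping for that kind of swap can be
sketched in a few lines of Python.  This is only an illustration: the
part names, node names, and the plan_swap/blame helpers are all made up
for the example, and it assumes exactly one faulty part and one part
moved per healthy node.

```python
def plan_swap(suspect_parts, healthy_nodes):
    """Assign each part from the crashing node to a different healthy node."""
    if len(healthy_nodes) < len(suspect_parts):
        raise ValueError("need at least one healthy node per suspect part")
    # zip stops at the shorter list, so extra healthy nodes are left alone.
    return dict(zip(healthy_nodes, suspect_parts))


def blame(plan, crashed_node):
    """Return the part that was moved into the node that crashed, or None."""
    return plan.get(crashed_node)


# Hypothetical part and node names, for illustration only.
parts = ["cpu", "dimm_a", "dimm_b", "power_supply"]
donors = ["node03", "node07", "node11", "node15"]

plan = plan_swap(parts, donors)
suspect = blame(plan, "node11")  # pretend node11 is the one that crashed
print(plan)
print("suspect:", suspect)
```

If none of the recipient nodes crash but the original chassis still
does, that points at whatever was left behind, most likely the
motherboard.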

As for your specific case, I've seen memory failures that will
get past 10 minutes of memtest86, so try a much longer run.
Swapping power supplies is a pretty good option as well.
You may also want to look for any leaking or bulging capacitors
on the motherboard.  Depending on how old your node is, it may
have been built when there was a rash of bad electrolyte used
in capacitors pretty much across the industry.  That rash
peaked a few years ago, though, so it seems unlikely to me
that you would only just now be seeing a problem.

Good luck diagnosing the problem!

On Fri, 03 Sep 2004 11:31:30 -0700, David Mathog
<mathog at mendel.bio.caltech.edu> wrote:
> One of 20 identical nodes containing
> Tyan 2466
> Single Athlon MP 2200+
> 1GB ECC memory
> is starting to flake out.
> For no apparent reason it just drops dead (as far as
> linux is concerned) after a few minutes to a few days.
> At that point the network is down, the serial lines are down,
> and near as I can tell the OS just blew up.  There is zip,
> nothing, nada in the log files to indicate a problem.
> I pulled the unit and monitored it closely and it does not seem
> to be an overheating problem:  all the fans are spinning
> as they should be even after it has crashed.  The network
> port lights are still flashing.  After reboot smartctl shows
> no errors on the hard drive.  Running sensors every few seconds
> in a loop shows nothing odd happening to the voltages or temps
> or fan speeds up through the last log point before it dies.
> Running memtest86 for 10 minutes showed no memory errors.
> I'm thinking about replacing the power supply (for lack of a better
> idea.)  What else might be causing this???  There's not much
> in these systems, just the one CPU, a floppy, 1GB RAM and a cheap
> S3 graphics card (normally not used.)
> The other 19 (identical) nodes are working reliably.
> Thanks,
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech

Tim Mattox - tmattox at gmail.com - http://homepage.mac.com/tmattox/
