[Beowulf] Tyan 2466 crashes, no obvious reason why

Michael Lodico K1EG k1eg at matrix-computers.com
Sat Sep 4 07:08:03 PDT 2004


David, Tim's ideas are about the best bet for troubleshooting this problem, but I would also put my finger on the
center of each fan to see if you can slow it down.  If you can, then you have faulty fans that may be causing the
problem.  The next logical step would be to replace the power supply.  If that doesn't solve it,
then replace the memory.  If none of these steps solves the problem, then you probably have a voltage regulator on
the mainboard that is going bad.  I have fixed computers that did what yours is doing, and the cause turned out to
be the voltage regulator on the mainboard when none of these steps worked.  Please keep the list informed of what
you find, and I hope this helps you track down this annoying problem.
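
If you keep running sensors in a loop while you swap parts, it may help to force each reading to disk so that the
last sample before a hard hang is not lost in a write buffer.  The following is only a rough sketch of that idea,
not a tested tool: the function names, the log path, and the 2 second interval are all made up for illustration,
and it assumes the lm_sensors "sensors" command is on the PATH.

```python
import os
import subprocess
import time
from datetime import datetime


def log_reading(log_file, command):
    """Run `command`, append its timestamped output to log_file, force it to disk."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=10)
        output = result.stdout + result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Record the failure instead of stopping the monitor.
        output = f"command failed: {exc}\n"
    entry = f"--- {datetime.now().isoformat()} ---\n{output}"
    log_file.write(entry)
    log_file.flush()               # push Python's buffer to the OS
    os.fsync(log_file.fileno())    # ask the OS to write the data to the disk
    return entry


def monitor(path="sensors.log", command=("sensors",), interval=2.0, count=None):
    """Log a reading every `interval` seconds; count=None runs until interrupted.

    The defaults here are hypothetical; adjust them for the node being tested.
    """
    done = 0
    with open(path, "a") as log_file:
        while count is None or done < count:
            log_reading(log_file, list(command))
            done += 1
            if count is None or done < count:
                time.sleep(interval)
    return done
```

Calling monitor() with no arguments logs until it is interrupted or the node dies.  A voltage dip that comes and
goes between two samples will still not show up in the log, so a clean log does not clear the power supply or the
regulator.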

Michael Lodico

Tim Mattox wrote:

> Hello David Mathog,
> I don't know if others do this, but if I can afford the downtime, I
> will redistribute the components from a crashing node amongst
> some known healthy nodes.  Then when the crash re-appears
> on one of the nodes, I am fairly sure which is the faulty component.
> This may not be worth your time though, since it's a pain to do.
>
> As for your specific case, I've seen memory failures that will
> get past 10 minutes on memtest86.  Try a longer run.
> Swapping power supplies is a pretty good option as well.
> You may want to also look for any leaking or bulging capacitors
> on the motherboard.  Depending on how old your node is, it may
> have been built when there was a rash of bad electrolyte used
> in capacitors pretty much across the industry. The rash of
> bad capacitors peaked a few years ago, so it doesn't seem
> likely to me that you would only just now see a problem.
>
> Good luck diagnosing the problem!
>
> On Fri, 03 Sep 2004 11:31:30 -0700, David Mathog
> <mathog at mendel.bio.caltech.edu> wrote:
> > One of 20 identical nodes containing
> >
> > Tyan 2466
> > Single Athlon MP 2200+
> > 1GB ECC memory
> >
> > is starting to flake out.
> >
> > For no apparent reason it just drops dead (as far as
> > linux is concerned) after a few minutes to a few days.
> > At that point the network is down, the serial lines are down,
> > and near as I can tell the OS just blew up.  There is zip,
> > nothing, nada in the log files to indicate a problem.
> >
> > I pulled the unit and monitored it closely and it does not seem
> > to be an overheating problem:  all the fans are spinning
> > as they should be even after it has crashed.  The network
> > port lights are still flashing.  After reboot smartctl shows
> > no errors on the hard drive.  Running sensors every few seconds
> > in a loop shows nothing odd happening to the voltages or temps
> > or fan speeds up through the last log point before it dies.
> > Running memtest86 for 10 minutes showed no memory errors.
> >
> > I'm thinking about replacing the power supply (for lack of a better
> > idea.)  What else might be causing this???  There's not much
> > in these systems, just the one CPU, a floppy, 1GB RAM and a cheap
> > S3 graphics card (normally not used.)
> >
> > The other 19 (identical) nodes are working reliably.
> >
> > Thanks,
> >
> >
> > David Mathog
> > mathog at caltech.edu
> > Manager, Sequence Analysis Facility, Biology Division, Caltech
>
> --
> Tim Mattox - tmattox at gmail.com - http://homepage.mac.com/tmattox/
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf



