[Beowulf] Tips for diagnosing intermittent problems on a small cluster

Peter St. John peter.st.john at gmail.com
Mon Nov 26 11:38:38 PST 2007

Clarification understood, thanks.

I sometimes have problems with a desktop; I reboot (because of memory
leaks) and have to shut down because the mobo refuses to restart
(seemingly because of temp), but a couple of minutes of cooldown does the
trick.

On Nov 26, 2007 12:58 PM, David Mathog <mathog at caltech.edu> wrote:
> "Peter St. John" <peter.st.john at gmail.com> wrote
> > I understood that sometimes the voltage from a fatigued (?),
> > overheated (?) PS may fail the mobo's bootup requirements (which can
> > be stricter re: voltage variations than running requirements) so
> > sometimes a PS has to cool down before the PC will reboot. So
> > in particular, a PC failing to reboot promptly is sometimes a symptom
> > of the PS not being fully healthy.
>
> Subtle difference between "ignoring reset switch" and "failing to
> reboot".  I guess the symptoms might appear the same if the reset is
> actually applied but the motherboard never gets far enough into the
> startup sequence to generate beep codes or put any of the BIOS info up
> on the video card.  Even so, I still don't think what we were
> observing was power supply related.  For one thing these motherboards
> could get into that state (unstartable until unplugged) even on a normal
> shutdown followed by a lengthy off period allowing everything to cool
> down substantially.  And one or two boards would enter this state more
> or less at random on any full cluster shutdown. (So no indication of a
> particular bad node.)  The 10-20 second "unplugged reset" time is fast
> enough to drain charge from an electronic part, but probably not long
> enough to lower the temperature much on an overheated part, especially
> one within the power supply if the fans are not running.
>
> When I've seen iffy power supply problems the symptom has usually been
> "random crash for no good reason", not "won't start".  A totally blown
> supply won't start, of course, but it's easy enough to confirm that
> diagnosis with a power supply tester.
>
> Regards,
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
