[Beowulf] Tips for diagnosing intermittent problems on a small cluster

David Mathog mathog at caltech.edu
Mon Nov 26 09:58:01 PST 2007


"Peter St. John" <peter.st.john at gmail.com> wrote

> I understood that sometimes the voltage from a fatigued (?),
> overheated (?) PS may fail the mobo's bootup requirements (which can
> be stricter re: voltage variations than running requirements) so
> sometimes a PS has to cool down before the PC will reboot. So
> particularly, sometimes a PC failing to reboot promptly is a symptom
> of the PS not max healthy.

There is a subtle difference between "ignoring reset switch" and "failing
to reboot".  I guess the symptoms might appear the same if the reset is
actually applied but the motherboard never gets far enough into the
startup sequence to generate beep codes or put any of the BIOS info up
on the video card.  Even so, I still don't think what we were
observing was power supply related.  For one thing, these motherboards
could get into that state (unstartable until unplugged) even after a
normal shutdown followed by a lengthy off period that allowed everything
to cool down substantially.  Also, one or two boards would enter this
state more or less at random on any full cluster shutdown, so there was
no indication of a particular bad node.  The 10-20 second "unplugged
reset" time is long enough to drain charge from an electronic part, but
probably not long enough to lower the temperature much on an overheated
part, especially one within the power supply if the fans are not running.

When I've seen iffy power supply problems, the symptom has usually been
"random crash for no good reason", not "won't start".  A totally blown
supply won't start, of course, but it's easy enough to confirm that
diagnosis with a power supply tester.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


