[Beowulf] Tips for diagnosing intermittent problems on a small cluster
David Mathog mathog at caltech.edu
Mon Nov 26 09:58:01 PST 2007
"Peter St. John" <peter.st.john at gmail.com> wrote

> I understood that sometimes the voltage from a fatigued (?),
> overheated (?) PS may fail the mobo's bootup requirements (which can
> be stricter re: voltage variations than running requirements) so
> sometimes a PS has to cool down before the PC will reboot. So
> particularly, sometimes a PC failing to reboot promptly is a symptom
> of the PS not max healthy.

Subtle difference between "ignoring reset switch" and "failing to reboot". I guess the symptoms might appear the same if the reset is actually applied but the motherboard never gets far enough into the startup sequence to generate beep codes or put any of the BIOS info up on the video card.

Even so, I still don't think what we were observing was power supply related. For one thing these motherboards could get into that state (unstartable until unplugged) even on a normal shutdown followed by a lengthy off period allowing everything to cool down substantially. And one or two boards would enter this state more or less at random on any full cluster shutdown. (So no indication of a particular bad node.) The 10-20 second "unplugged reset" time is fast enough to drain charge from an electronic part, but probably not long enough to lower the temperature much on an overheated part, especially one within the power supply if the fans are not running.

When I've seen iffy power supply problems the symptom has usually been "random crash for no good reason", not "won't start". A totally blown supply won't start, of course, but it's easy enough to confirm that diagnosis with a power supply tester.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
