[Beowulf] RE: Approach For Diagnosing Heat Related Failure?
David Mathog mathog at caltech.edu
Tue Jul 21 14:02:41 PDT 2009
- Previous message: [Beowulf] Resolved - Approach For Diagnosing Heat Related Failure?
- Next message: [Beowulf] Approach For Diagnosing Heat Related Failure?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Jon Forrest <jlforrest at berkeley.edu> wrote:

> I have a rack full of identical compute
> nodes. One of them has become heat sensitive.
>
> When it's in the warm computer room it crashes.
> I can't even run memtest from the CentOS DVD
> for 2 seconds. However, when this node is
> in my much cooler office everything works
> fine. All the other nodes are working fine
> in the computer room.

Presumably you have already blown the dust out of it and reseated all the obvious suspect components.

If the motherboard has a "shutdown on overheat" option that may now have a value set low enough that it stops the machine in the warmer room. If you didn't explicitly set it to that value then suspect the motherboard battery - change it, reset the BIOS, and all should be well. If the machine has a hardware status monitor in the BIOS check there too for out of range temperatures.

(Odd that your machine room is much hotter than your office.)

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
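The BIOS hardware monitor is the check suggested above. If the node stays up long enough to boot Linux (for example in the cooler office), the same kind of out-of-range check could also be made from the OS and compared against a healthy node. The following is only a minimal sketch, assuming the kernel exposes the board's sensors through the standard hwmon sysfs files (/sys/class/hwmon/hwmon*/temp*_input, in millidegrees Celsius). The function names and the 70 C limit are illustrative assumptions, not values from this thread.

```python
import glob
import os


def read_temps(root="/sys/class/hwmon"):
    """Return {sensor_file: degrees_C} for every temp*_input file under root.

    Assumes the Linux hwmon sysfs layout, where each temp*_input file
    holds one integer in millidegrees Celsius.
    """
    temps = {}
    pattern = os.path.join(root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                temps[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            # Sensor unreadable or not reporting a number: skip it.
            continue
    return temps


def over_limit(temps, limit_c=70.0):
    """Return only the sensors reading above limit_c (hypothetical limit)."""
    return {path: t for path, t in temps.items() if t > limit_c}


if __name__ == "__main__":
    readings = read_temps()
    for path, t in readings.items():
        print(f"{path}: {t:.1f} C")
    for path, t in over_limit(readings).items():
        print(f"OUT OF RANGE: {path}: {t:.1f} C")
```

Running this on the suspect node and on one of the nodes that works in the computer room would show whether the suspect node reads noticeably hotter, or reports an implausible value, under the same conditions. A machine with no hwmon driver loaded simply yields no readings.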
