[Beowulf] Approach For Diagnosing Heat Related Failure?

Jon Forrest jlforrest at berkeley.edu
Tue Jul 21 11:56:23 PDT 2009


I have a rack full of identical compute
nodes. One of them has become heat sensitive.

When it's in the warm computer room, it crashes.
I can't even run memtest from the CentOS DVD
for 2 seconds. However, when this node is
in my much cooler office, everything works
fine. All the other nodes are working fine
in the computer room.

I'm not convinced the problem is actually
the memory. Other than opening the node
to spray cooling liquid when it's in the warm
room, what approach would you use to figure out which
component or components are failing?

Cordially,
-- 
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


