[Beowulf] Approach For Diagnosing Heat Related Failure?

Victor Gregorio vgregorio at penguincomputing.com
Tue Jul 21 13:56:00 PDT 2009


Hello Jon,

If your system has temperature and fan sensors, you might be able to use
lm_sensors to display component temperatures and diagnose fan failures.

[root@tesla ~]# sensors-detect                  # accept the default answers
[root@tesla ~]# /etc/init.d/lm_sensors start    # load kernel modules
[root@tesla ~]# sensors                         # show current sensor readings
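
Because the node crashes outright in the warm room, a single reading may
not catch the problem. It may help to log temperatures continuously so the
last values before a crash survive on disk. The following is only a rough
sketch, assuming the loaded drivers expose readings as millidegrees Celsius
in /sys/class/hwmon/hwmon*/temp*_input (some drivers place these files one
level deeper, under device/). The function names millideg_to_c and
log_temps are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert millidegrees Celsius (as reported by the
# hwmon sysfs files) to degrees Celsius with one decimal place.
millideg_to_c() {
    awk -v m="$1" 'BEGIN { printf "%.1f", m / 1000 }'
}

# Hypothetical helper: print one timestamped line per readable temperature
# sensor found under a hwmon root directory (default: /sys/class/hwmon).
# Assumption: sensors appear as hwmon*/temp*_input under that root.
log_temps() {
    root="${1:-/sys/class/hwmon}"
    for f in "$root"/hwmon*/temp*_input; do
        # Skip unmatched globs and unreadable sensor files.
        [ -r "$f" ] || continue
        printf '%s %s %s\n' \
            "$(date '+%Y-%m-%dT%H:%M:%S')" \
            "$f" \
            "$(millideg_to_c "$(cat "$f")")"
    done
}
```

One possible way to use it is a loop such as
"while :; do log_temps; sync; sleep 5; done >> /var/tmp/temps.log"
(the log path is arbitrary). After a crash, the tail of the log shows which
sensor was climbing, and the same log from a healthy node in the same rack
gives a baseline to compare against.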

Hope this helps.  Regards,

-- 
Victor Gregorio
Penguin Computing

On Tue, Jul 21, 2009 at 11:56:23AM -0700, Jon Forrest wrote:
> I have a rack full of identical compute
> nodes. One of them has become heat sensitive.
>
> When it's in the warm computer room it crashes.
> I can't even run memtest from the CentOS DVD
> for 2 seconds. However, when this node is
> in my much cooler office everything works
> fine. All the other nodes are working fine
> in the computer room.
>
> I'm not convinced the problem is actually
> the memory. Other than opening the node
> to spray cooling liquid when it's in the warm
> room, what approach would you use to figure out which
> component(s) is(are) failing?
>
> Cordially,
> -- 
> Jon Forrest
> Research Computing Support
> College of Chemistry
> 173 Tan Hall
> University of California Berkeley
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforrest at berkeley.edu


