[Beowulf] Approach For Diagnosing Heat Related Failure?

Victor Gregorio vgregorio at penguincomputing.com
Tue Jul 21 13:56:00 PDT 2009


Hello Jon,

If your system has temperature and fan sensors, you might be able to use
lm_sensors to display component temperatures and diagnose fan failures.

[root at tesla ~]# sensors-detect                  # accept the default answers
[root at tesla ~]# /etc/init.d/lm_sensors start    # load the kernel modules
[root at tesla ~]# sensors                         # display current sensor readings
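
Since the node crashes when it gets warm, it may help to log the readings
to disk so the last values before a crash survive the reboot. Below is a
rough, untested-on-your-hardware sketch. It assumes the node stays up long
enough under CentOS to take a few readings. The names LOG, INTERVAL,
SENSORS_CMD, log_once and log_forever are made up for this example and are
not part of lm_sensors.

```shell
#!/bin/sh
# Sketch: append timestamped sensor readings to a log file so that the
# last values recorded before a crash can be inspected after a reboot.
# LOG, INTERVAL and SENSORS_CMD are hypothetical names for this example.
LOG="${LOG:-/var/tmp/sensors.log}"
INTERVAL="${INTERVAL:-5}"
SENSORS_CMD="${SENSORS_CMD:-sensors}"

# Write one timestamped reading to the log.
log_once() {
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        $SENSORS_CMD 2>&1
    } >> "$LOG"
    sync    # flush to disk; a hard crash could otherwise lose the last entry
}

# Keep logging every INTERVAL seconds until the script is interrupted.
log_forever() {
    while true; do
        log_once
        sleep "$INTERVAL"
    done
}

# To start logging, call:  log_forever
```

Running the same thing on one of the healthy nodes in the computer room
would give a baseline to compare the failing node's temperatures and fan
speeds against.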

Hope this helps.  Regards,

-- 
Victor Gregorio
Penguin Computing

On Tue, Jul 21, 2009 at 11:56:23AM -0700, Jon Forrest wrote:
> I have a rack full of identical compute
> nodes. One of them has become heat sensitive.
>
> When it's in the warm computer room it crashes.
> I can't even run memtest from the CentOS DVD
> for 2 seconds. However, when this node is
> in my much cooler office everything works
> fine. All the other nodes are working fine
> in the computer room.
>
> I'm not convinced the problem is actually
> the memory. Other than opening the node
> to spray cooling liquid when it's in the warm
> room, what approach would you use to figure out which
> component(s) is(are) failing?
>
> Cordially,
> -- 
> Jon Forrest
> Research Computing Support
> College of Chemistry
> 173 Tan Hall
> University of California Berkeley
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforrest at berkeley.edu


