[Beowulf] Cooling vs HW replacement

Josip Loncaric josip at lanl.gov
Tue Jan 18 08:30:03 PST 2005

At my old job, we had the unfortunate experience of the AC failing on the 
hottest days of the year.  Despite providing plenty of circulating fresh 
35-40 deg. C air, we lost hardware, mainly disks.  In fact, we'd start 
losing hard drives (even high-quality SCSI drives in our servers) any 
time the ambient temperature approached 30 deg. C.

Based on this experience, I'd say that keeping the ambient temperature 
under about 25-27 deg. C is a good policy.  As Robert has pointed out, 
the cost of lost productivity while the system is down for hard drive 
replacement and reconstruction, not to mention the manpower required, 
can make an unreliable system "AWESOMELY expensive."

In fact, I'd recommend installing a temperature-activated kill switch in 
any cluster computing room.
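
To make the kill switch idea concrete, here is a minimal sketch of the 
trip logic in Python.  Everything named here is hypothetical: the class 
ThermalKillSwitch, the monitor() loop, and the read_temp_c and shutdown 
callbacks stand in for whatever sensor and power-off mechanism your room 
actually has.  The defaults (30 deg. C held for 180 seconds) are 
assumptions for illustration, not tested settings.

```python
import time
from typing import Callable, Optional


class ThermalKillSwitch:
    """Trips once readings stay at or above limit_c for hold_s seconds."""

    def __init__(self, limit_c: float = 30.0, hold_s: float = 180.0) -> None:
        self.limit_c = limit_c      # assumed trip temperature, deg. C
        self.hold_s = hold_s        # assumed hold time before tripping
        self._hot_since: Optional[float] = None

    def update(self, temp_c: float, now_s: float) -> bool:
        """Feed one reading; return True when a shutdown should fire."""
        if temp_c < self.limit_c:
            # Any reading below the limit resets the timer, so a brief
            # spike does not power the cluster off.
            self._hot_since = None
            return False
        if self._hot_since is None:
            self._hot_since = now_s
        return now_s - self._hot_since >= self.hold_s


def monitor(
    read_temp_c: Callable[[], float],
    shutdown: Callable[[], None],
    switch: ThermalKillSwitch,
    poll_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll the (hypothetical) sensor until the switch trips, then shut down."""
    while True:
        if switch.update(read_temp_c(), clock()):
            shutdown()
            return
        sleep(poll_s)
```

Requiring the temperature to stay high for a while is a design choice: 
it avoids shutting down on a single bad sensor reading, at the cost of a 
few minutes of extra exposure.  The clock and sleep parameters exist only 
so the loop can be exercised without waiting in real time.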

Remember: dissipating 5-10 kW in a small enclosed space can overheat 
your expensive cluster within minutes of AC failure, certainly faster 
than your system administrator can respond to an alarm triggered on a 
Sunday at 2am.  Even a forced shutdown (when ambient temperature exceeds 
about 30 deg. C for more than a few minutes) is cheaper to fix than 
replacing and rebuilding several failed hard drives.
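
For a rough feel of how fast a room heats up, here is a 
back-of-envelope estimate in Python.  It assumes all of the dissipated 
power goes into heating the room air and nothing else, which ignores the 
heat soaked up by walls, racks, and the hardware itself, so it is a 
worst-case bound rather than a prediction.  The room volume (30 cubic 
meters) and starting temperature (22 deg. C) are made-up illustrative 
values, and minutes_to_reach is a hypothetical helper.

```python
AIR_DENSITY_KG_M3 = 1.2     # approximate density of air near room temperature
AIR_CP_J_PER_KG_K = 1005.0  # approximate specific heat of air


def minutes_to_reach(power_w: float, room_volume_m3: float,
                     start_c: float, limit_c: float) -> float:
    """Minutes for the room air alone to warm from start_c to limit_c."""
    air_mass_kg = AIR_DENSITY_KG_M3 * room_volume_m3
    # Heating rate in kelvin per second if only the air absorbs the heat.
    rate_k_per_s = power_w / (air_mass_kg * AIR_CP_J_PER_KG_K)
    return (limit_c - start_c) / rate_k_per_s / 60.0


if __name__ == "__main__":
    for power_w in (5000.0, 10000.0):
        minutes = minutes_to_reach(power_w, 30.0, 22.0, 30.0)
        print(f"{power_w / 1000:.0f} kW: about {minutes:.1f} min to reach 30 deg. C")
```

Under these assumptions the air reaches 30 deg. C in about a minute at 
5 kW and in about half that at 10 kW.  A real room has more thermal 
mass and will take longer, but the time scale is still minutes, not 
hours.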

