another radical concept...Re: [Beowulf] Cooling vs HW replacement

Seth Bardash seth at integratedsolutions.org
Thu Jan 20 10:36:48 PST 2005


Here's another radical concept (not meant as an advertisement, just an
explanation of good engineering practice):

Purchase, build, or upgrade your systems so they can run properly at full
load in the ambient conditions you presently have.

We design all our systems with 15 degrees C of temperature headroom at FULL
LOAD at 25 degrees C ambient. We then test and burn in every system we build
to make sure that ALL OF THEM meet this spec.
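As a rough illustration of that spec, the sketch below checks one full-load reading against a 15 degree C margin. It assumes "headroom" means the gap between the measured full-load temperature and the component's rated maximum, and, when the test room is not exactly 25 degrees C, it assumes component temperatures shift roughly one-for-one with ambient (a reasonable first approximation for air cooling at fixed power and fan speed). The function name and the 70 degree C limit in the example are hypothetical, not vendor figures.

```python
def meets_headroom_spec(full_load_c, rated_max_c, test_ambient_c=25.0,
                        spec_ambient_c=25.0, headroom_c=15.0):
    """Return True if a full-load reading leaves at least `headroom_c`
    degrees below `rated_max_c` once normalized to `spec_ambient_c`.

    Assumption: component temperature tracks ambient one-for-one, so a
    reading taken in a warmer or cooler room is shifted by the difference.
    """
    normalized_c = full_load_c - (test_ambient_c - spec_ambient_c)
    return rated_max_c - normalized_c >= headroom_c


# Illustrative numbers only: 52 C at full load in a 27 C room,
# checked against a hypothetical 70 C rated maximum.
print(meets_headroom_spec(52.0, 70.0, test_ambient_c=27.0))  # True
```

Normalizing to the spec ambient lets machines burned in on different days, in rooms of slightly different temperature, be judged against the same number.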

We test every machine using cpuburn: http://pages.sbcglobal.net/redelm/ (run
2 copies simultaneously on dual-processor machines) and make sure they operate
properly and keep the processors and motherboards at reasonable temperatures.
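Running one copy per processor is what keeps every CPU in a dual fully loaded. A minimal sketch of that step is below; it assumes a cpuburn binary such as `burnK7` is on the PATH (which binary fits depends on the CPU family, so treat the name as an assumption), and `start_burn`/`stop_burn` are hypothetical helper names for illustration.

```python
import os
import subprocess


def start_burn(cmd, copies=None):
    """Start `copies` instances of a CPU-burn command (one per processor
    by default) and return the Popen handles so they can be stopped later."""
    if copies is None:
        copies = os.cpu_count() or 1
    return [subprocess.Popen(cmd) for _ in range(copies)]


def stop_burn(procs):
    """Send SIGTERM to every burn process, then wait for each to exit."""
    for p in procs:
        p.terminate()
    for p in procs:
        p.wait()


# Usage sketch (assumes the cpuburn binary for this CPU is installed):
#   procs = start_burn(["burnK7"], copies=2)   # 2 copies for a dual
#   ... monitor temperatures for several hours ...
#   stop_burn(procs)
```

Defaulting to `os.cpu_count()` means the same script loads every processor whether the box is a single or a dual.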

We install the latest i2c and lm_sensors code and modify sensors.conf to
give easily read output so we can monitor temperatures while the machine is
under full load. We log temperatures every 10 seconds for a minimum of 4
hours to let the systems stabilize, then take readings to confirm that the
systems meet spec.
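A sketch of that monitoring loop follows. It assumes a modern Linux kernel that exposes sensors under /sys/class/hwmon (temp*_input values in millidegrees C); lm_sensors in 2005 read the chips through older interfaces, so the file layout is an assumption. The stabilization rule (readings within 1 degree C over a 5-minute window) is also an assumption, not the procedure described above.

```python
import glob
import time


def read_hwmon_temps():
    """Read every temperature input exposed under /sys/class/hwmon.

    Assumption: hwmon drivers are loaded; temp*_input files hold
    millidegrees Celsius.
    """
    temps = {}
    for path in sorted(glob.glob("/sys/class/hwmon/hwmon*/temp*_input")):
        try:
            with open(path) as f:
                temps[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor missing or unreadable; skip it
    return temps


def monitor(read_temps, interval_s=10, duration_s=4 * 3600, sleep=time.sleep):
    """Call `read_temps()` every `interval_s` seconds for `duration_s`
    seconds and return the samples (one dict per poll)."""
    samples = []
    for _ in range(int(duration_s // interval_s)):
        samples.append(read_temps())
        sleep(interval_s)
    return samples


def is_stable(samples, sensor, window=30, tolerance_c=1.0):
    """Treat a sensor as stabilized if its last `window` readings
    (30 x 10 s = 5 minutes by default) span no more than `tolerance_c`."""
    recent = [s[sensor] for s in samples[-window:] if sensor in s]
    return len(recent) == window and max(recent) - min(recent) <= tolerance_c
```

With the defaults, `monitor(read_hwmon_temps)` runs for the full 4 hours; passing `sleep` as a parameter makes it easy to exercise the loop with recorded or fake readings.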

We ran into a problem with the pair of Opteron 246s in one dual-processor
machine out of the 84 we were building for a cluster that had excellent
cooling. We spoke at length with AMD; they provided some spare Opteron 246s
for testing and took back the pair that was running 10 degrees hotter than
all the other processors we were installing. It turned out that the
processors running hot had heat spreaders that were not perfectly flat (they
were slightly concave). We would not have discovered the problem if we had
not done all the up-front work required to install and run the temperature
testing software.
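The hot pair was caught because it ran about 10 degrees above its peers under identical load. A minimal sketch of that comparison across a cluster is below; using the fleet median as the baseline and a 5 degree C threshold are assumptions, and the CPU labels are hypothetical.

```python
from statistics import median


def find_hot_outliers(temps_c, threshold_c=5.0):
    """Return the CPUs whose full-load temperature exceeds the fleet
    median by more than `threshold_c` degrees, hottest first.

    `temps_c` maps a CPU label (e.g. "node17/cpu0") to degrees Celsius.
    """
    mid = median(temps_c.values())
    hot = {cpu: t - mid for cpu, t in temps_c.items() if t - mid > threshold_c}
    return sorted(hot, key=hot.get, reverse=True)
```

The median is used rather than the mean so that one or two bad parts do not drag the baseline up and hide themselves.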

In existing machines, you can find better heatsink-fan (HSF) assemblies on
the web than those presently installed and extend the machine's usable
temperature range. An upgrade like this requires care, but the payoff can be
the elimination of most full-load temperature problems.

Some of the best heatsink-fan assemblies can be had from:

http://www.selectcool.com
http://www.micforg.co.jp/en/index.html
http://www.swiftnets.com/
http://www.zalman.co.kr/

When upgrading a machine, make sure you use the best available thermal
compound between the CPU and the HSF. We only use Arctic Silver 5, and the
results are well worth the extra $0.25 per CPU.

The total cost to fix a poorly cooled system with a better HSF is usually
about $20 to $30 per CPU. You might also want to replace the system's
cooling fans with higher-volume fans to get the heat out of the box.
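For a rough idea of how much "higher volume" is needed, a standard energy balance relates the heat load, the airflow, and how much warmer the exhaust air is than the intake. The sketch below assumes dry air near sea level (density about 1.2 kg/m^3) and ignores fan curves and case flow resistance, so real fans need margin above the computed figure; the 300 W load and 10 degree rise in the example are illustrative only.

```python
AIR_DENSITY_KG_M3 = 1.2        # assumption: dry air near sea level, ~20 C
AIR_CP_J_KG_K = 1005.0         # specific heat of air at constant pressure
M3_S_PER_CFM = 0.000471947443  # 1 cubic foot per minute, in m^3/s


def required_cfm(heat_w, delta_t_c):
    """Airflow (CFM) needed to carry `heat_w` watts out of a case while
    the exhaust air runs `delta_t_c` degrees above the intake air."""
    m3_per_s = heat_w / (AIR_DENSITY_KG_M3 * AIR_CP_J_KG_K * delta_t_c)
    return m3_per_s / M3_S_PER_CFM


# Illustrative: a 300 W box with a 10 C intake-to-exhaust rise.
print(f"{required_cfm(300, 10):.1f} CFM")  # 52.7 CFM
```

Halving the allowed temperature rise doubles the required airflow, which is why a box that must stay close to ambient needs noticeably bigger fans.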

Load throttling should never be necessary on a well designed and well built
machine.

Just my 2 cents.....

Seth Bardash

Integrated Solutions and Systems
1510 North Gate Road
Colorado Springs, CO 80921

719-495-5866
719-495-5870 Fax
719-337-4779 Cell

http://www.integratedsolutions.org

Failure can not cope with perseverance! 


