[Beowulf] 96 cores in silent and small enclosure

Gerry Creager gerry.creager at tamu.edu
Fri Apr 16 18:06:38 PDT 2010

I hadn't looked at -217 since, well, I was designing spaceflight 
hardware... This is a very nice set of references; I'm especially fond 
of perusing the Weibull data. I'd not looked at klabs before.

I'll echo the 2x factor in failure rate per 10 deg temperature rise. And 
I hear, all the time, from bean counters and room monitors, how we should 
run our machine rooms hotter. I've got 2 with ambient setpoints at 80F 
right now, and we see, in our 300 node cluster, an average of one DIMM 
and one hard drive failure per week. It's a real good thing the 
hardware's all still under maintenance, else we'd be out of systems 
already. Over the winter, when building thermal sink was lower, we also 
saw fewer failures.
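
For anyone who wants to put rough numbers on the 2x-per-10-degree rule of 
thumb, here is a minimal sketch of the Arrhenius acceleration factor. The 
activation energy of 0.7 eV is an assumption (a commonly quoted generic 
value, not something measured on our hardware), and the function names are 
made up for illustration. With that assumed value, going from 55 C to 65 C 
comes out at roughly a factor of 2. The second helper just converts a 
weekly failure count into failures per node per year. Treat all of it as a 
rule-of-thumb estimate, not a reliability prediction; the Blanks abstract 
quoted below argues that real failure rates usually depend on temperature 
less strongly than the Arrhenius factor suggests.

```python
import math

# Boltzmann constant in eV/K
BOLTZMANN_EV_PER_K = 8.617333262e-5


def arrhenius_acceleration(t_low_c, t_high_c, activation_energy_ev=0.7):
    """Arrhenius acceleration factor between two temperatures in deg C.

    activation_energy_ev defaults to 0.7 eV, which is an ASSUMED generic
    value; the real figure depends on the failure mechanism.
    """
    t_low_k = t_low_c + 273.15    # convert to kelvin
    t_high_k = t_high_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_low_k - 1.0 / t_high_k
    )
    return math.exp(exponent)


def failures_per_node_year(failures_per_week, nodes):
    """Convert a cluster-wide weekly failure count to failures per node-year."""
    return failures_per_week * 52.0 / nodes


if __name__ == "__main__":
    # 55 C vs 65 C, the case asked about in the quoted message
    print("acceleration 55C -> 65C:", arrhenius_acceleration(55.0, 65.0))
    # one failure per week across 300 nodes
    print("failures per node-year:", failures_per_node_year(1, 300))
```

The acceleration factor is very sensitive to the assumed activation 
energy, so it is worth trying a range of values before drawing any 
conclusions from it.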


Lux, Jim (337C) wrote:
> Try this
> http://rel.intersil.com/docs/rel/calculation_of_semiconductor_failure_rates.pdf
> You might also look for MIL-HDBK-217
> Of course, a paper by H.S. Blanks makes the following statement:
> Although the temperature dependence of failure rate can be very high, in most situations it is much less than that of the Arrhenius acceleration factor. It is very improbable that the temperature dependence of component failure rate can be meaningfully modelled for reliability prediction purposes or for the purpose of optimizing thermal design component layout.
> (from abstract for "Arrhenius and the temperature dependence of non-constant failure rate" Quality and Reliability Engineering International, Vol 6, #4, pp259-265, 20 Mar 2007)
> You might also browse around http://www.weibull.com/  or http://www.klabs.org/ 
> Jim
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Jon Tegner
> Sent: Wednesday, April 14, 2010 1:12 AM
> To: Mark Hahn
> Cc: beowulf at beowulf.org
> Subject: Re: Re: [Beowulf] 96 cores in silent and small enclosure
> the max temp spec is not some arbitrary knob that the chip vendors
> choose out of spiteful anti-green-ness. I wouldn't be surprised to see some
> ****************************************************************
> Issue is not the temp spec of current cpus, problem is that it is hard to get relevant information. I haven't found any that states that the failure rate in year 5 should be significantly higher if you operate the cpu at 65 C instead of 55 C. I'm just saying this kind of information would be valuable (and I would be glad to find it).
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
