Issues with 2466 based cluster

Mark Hahn hahn at physics.mcmaster.ca
Sat Oct 19 12:02:14 PDT 2002


> While i wait for all of your expert opinions on the
> issue i will go and oil the fans....
> They call it system administration :)

moving parts are the root of all evil.  well, second to heat, of course ;)

some cluster vendors actually brag about how many fans they manage
to stuff into their nodes.  this always puzzles me - afaict, it is 
based on a "redundant array of flakey fans" (RAFF a la RAID), which
seems kind of a dubious concept.  for instance, it assumes the failures
are fairly uniform in time (and they certainly aren't!).  and that 
a sort of effective "airflow balancing" happens if a fan dies (maybe...)
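
to put rough numbers on the "uniform in time" assumption, here is a minimal sketch.  it is only an illustration: the function name, the fan counts and the 10% failure probability are all made up, and the model simply treats each fan as failing independently, which is exactly the assumption questioned above.

```python
from math import comb

def p_enough_airflow(n_fans, min_working, p_fail):
    """Probability that at least min_working of n_fans are still running,
    ASSUMING each fan fails independently with probability p_fail."""
    return sum(
        comb(n_fans, k) * (1 - p_fail) ** k * p_fail ** (n_fans - k)
        for k in range(min_working, n_fans + 1)
    )

# hypothetical numbers, for illustration only
P_FAIL = 0.1

# independent failures: 4 fans, node stays cool if any 2 still spin
independent = p_enough_airflow(n_fans=4, min_working=2, p_fail=P_FAIL)

# fully correlated extreme: all fans wear out together, so the
# redundant fans add nothing over a single fan
fully_correlated = 1 - P_FAIL

print(f"independent:      {independent:.4f}")
print(f"fully correlated: {fully_correlated:.4f}")
```

real fans sit somewhere between the two extremes; the closer their wear-out times cluster (same batch, same dust, same heat), the less the extra fans are worth.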

other vendors take the approach of reducing the number of fans while
also using more reliable ones.  for instance, it's commonly held that 
centrifugal fans are more reliable than muffin fans.

high-end vendors do all of the above: replication and quality - 
in fact, our ES40's have multiple redundant fans, which, critically,
are tested at power-on, continuously monitored, and the redundant ones 
kept un-powered...

anyway, I've oiled fans, too.  my conclusion is that the margins for 
muffin fans are just so slim that everyone uses the same crappy technique
of using a plastic sticker to seal the bearing.  it doesn't matter whether
it's ball-bearing or sleeve - when the oil's gone, it's just going to 
sit there quivering (or seize entirely if the leaking oil collected 
enough dust...)

I'm just now completing the order for a new cluster: 1U dual Xeons
with passive heatsinks and centrifugal fans.  hopefully low-maintenance...



