[Beowulf] followup on 1000-node Caltech cluster

Mark Hahn hahn at physics.mcmaster.ca
Mon Jun 20 07:44:05 PDT 2005

> We have had issues with air-cooling racks that generate more than  
> about 8kW of energy.  We just couldn't get enough cold air into the  
> front of the rack.

how much airflow (CFM) do you see from the tiles in front of your racks?
for 8 kW, I'd expect maybe 800-900 CFM, or around 2 modestly-performing 
perforated tiles.  I'm reasonably happy with the tiles in my new machineroom:
about 600 CFM apiece, placed 2 per rack.

the rule-of-thumb is 1 ton of cooling (3.5 kW) per tile, but I think that
assumes somewhat older, lower-flow tiles.
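
as a rough cross-check of those numbers, here is a minimal sketch using the
usual HVAC sensible-heat relation for air (BTU/hr = 1.08 * CFM * delta-T in F).
the function names are made up for illustration, and the 1.08 constant assumes
roughly sea-level air density; treat the results as ballpark only.

```python
# Sensible-heat relation for air: BTU/hr = 1.08 * CFM * delta_T(F).
# The 1.08 constant assumes standard (roughly sea-level) air density.
BTU_PER_HR_PER_WATT = 3.412

# 1 ton of cooling is defined as 12000 BTU/hr, i.e. about 3.5 kW.
WATTS_PER_TON = 12000 / BTU_PER_HR_PER_WATT


def cfm_required(watts, temp_rise_f):
    """Airflow (CFM) needed to carry away `watts` at a given air temperature rise (F)."""
    return watts * BTU_PER_HR_PER_WATT / (1.08 * temp_rise_f)


def air_temp_rise_f(watts, cfm):
    """Front-to-back air temperature rise (F) for `watts` of heat at `cfm` of airflow."""
    return watts * BTU_PER_HR_PER_WATT / (1.08 * cfm)


if __name__ == "__main__":
    # 8 kW rack fed with 850 CFM (middle of the 800-900 CFM estimate)
    print(round(air_temp_rise_f(8000, 850), 1), "F rise at 850 CFM")
    # same rack fed by two 600 CFM tiles
    print(round(air_temp_rise_f(8000, 2 * 600), 1), "F rise at 1200 CFM")
    print(round(WATTS_PER_TON), "W per ton")
```

by this estimate, 8 kW at 800-900 CFM corresponds to roughly a 30 F rise
through the rack, and two 600 CFM tiles bring that down to roughly 21 F.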

> We also had issues of air from the back of the  
> rack recirculating around to the front of the racks (on the edges of  
> the cluster)...

how long are your rows?

> > The intakes for the HVAC units are ~3 feet from the backs of the  
> > racks.  We
> > expect to have almost laminar air flow, but with good local & roomwide
> > mixing due to an up&down airflow pattern from the supply ducts.

obviously the hot air will want to rise, but I suppose enough velocity
will make it go where you want.  my machineroom is "half-ducted" as well:
downflow chillers, 16" raised floor acting as a cold air plenum, but 
open space above for hot/return air.  it's a bit of a risk - it would 
offer a lot more control to have a suspended ceiling close to the top
of the racks, with the supra-ceiling space acting as a return plenum.

> When we lost air conditioning to our machine room (we currently have  
> about 300kW of gear) the room went up about 10 degrees Celsius in  
> about 10 minutes.  Now, we have automatic processes to shut machines  

that sounds surprisingly slow.  our older machineroom had only about 30 kW
in it, but it was fairly small.  when cooling was lost, we went up >15 C
in <5 minutes.
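
to put the two rooms on the same footing, here is a small sketch that turns
each observation into an "effective heat capacity" using a lumped model
(P = C * dT/dt).  this is an assumption-laden simplification: it ignores where
the heat actually goes (air, racks, walls) and treats the rise as linear.
the function name is hypothetical; the inputs are the figures quoted above.

```python
def effective_heat_capacity_kj_per_c(power_kw, delta_c, minutes):
    """Lumped-model heat capacity C = P / (dT/dt), in kJ per degree C.

    power_kw : heat load in kW (kJ/s)
    delta_c  : observed temperature rise in degrees C
    minutes  : time over which the rise was observed
    """
    rate_c_per_s = delta_c / (minutes * 60.0)
    return power_kw / rate_c_per_s


# 300 kW room: about 10 C in about 10 minutes
big_room = effective_heat_capacity_kj_per_c(300.0, 10.0, 10.0)

# 30 kW room: >15 C in <5 minutes, so this value is an upper bound
small_room = effective_heat_capacity_kj_per_c(30.0, 15.0, 5.0)

if __name__ == "__main__":
    print(f"300 kW room: ~{big_room:.0f} kJ/C")
    print(f" 30 kW room: <{small_room:.0f} kJ/C")
    print(f"ratio: >{big_room / small_room:.0f}x")
```

under these assumptions the 300 kW room behaves as if it had at least 30x the
thermal mass of the small one: it carries 10x the load yet warms at a third of
the rate or less.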

interestingly, there's no real point in keeping compute nodes up on UPS
unless the chillers+blowers are also on UPS or an automatic generator,
since the room overheats within minutes without cooling.
in fact, all of our new machines (~6K cpus across 4 large clusters) have 
UPS-less compute nodes.

> though our machine hadn't shut down (it didn't reach the critical  
> temperature) we were losing disks and dimms all over the place just  
> due to the sudden rate of rise of temperature.  When we spoke with HP  

hmm, I suppose the temperature is somewhat fractal, so that if you measure
it in different places, you'll see quite different rates of change.

in our old machineroom, I have 1-wire sensors strapped to the cold-water
pipes, in the cold air plenum, in the return plenum, and in a quiet corner 
of the room.  auto-shutdown watches for both a hotter return and 
too-warm "cold" air.  in both plenums the air is moving fast enough to give 
the sensors rapid response.
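
a minimal sketch of that kind of check, purely illustrative: the threshold
values, the names, and the choice to trigger on either condition alone are
assumptions, not the settings actually in use, and reading the sensors is left
out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # hypothetical thresholds in degrees C; pick values to suit the room
    cold_max_c: float = 25.0    # supply ("cold") plenum air
    return_max_c: float = 40.0  # return (hot) plenum air


def shutdown_reasons(cold_c, return_c, limits=Limits()):
    """Return a list of reasons to shut down; an empty list means all is well."""
    reasons = []
    if cold_c > limits.cold_max_c:
        reasons.append(f"cold air too warm: {cold_c:.1f} C > {limits.cold_max_c:.1f} C")
    if return_c > limits.return_max_c:
        reasons.append(f"return air too hot: {return_c:.1f} C > {limits.return_max_c:.1f} C")
    return reasons


if __name__ == "__main__":
    # example readings: normal, then a chiller failure warming the supply air
    for cold, ret in [(18.0, 30.0), (28.0, 36.0)]:
        why = shutdown_reasons(cold, ret)
        print("shutdown" if why else "ok", why)
```

returning the reasons rather than a bare boolean makes it easy to log why a
shutdown fired.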

regards, mark hahn.
