[Beowulf] followup on 1000-node Caltech cluster
David Kewley
kewley at gps.caltech.edu
Sat Jun 18 16:04:56 PDT 2005
Hi all,
I wrote to this list on 6/6 about the large cluster that we expect to
install at Caltech. I got a bunch of great replies (on- and off-list),
and wrote a brief followup on 6/8. Here's another, more definite &
detailed followup.
It is now confirmed: we will receive this cluster around
7/20. The shipment will be 26 fully-loaded racks plus various other
stuff. I'm working out the receiving details; I believe at this point
that with eight workers we can get all 30 tons or so of pallets off the
truck(s) and into the subbasement room inside one workday. I have been
working closely with the Caltech Transportation department to plan
this, and I will work with Dell & the shipper as much as required until
we have a very good plan.
Our goal is to have the cluster running well enough to show off our
near-realtime earthquake application in late August or September.
There is of course an enormous amount of work that needs to happen both
before and after that point.
I was informed yesterday that there will be only one technical staff
member supporting the cluster. We will work very closely with our
vendors to get things working right, and we will hire additional help
in the first couple of months if needed. But after the initial period,
I'll be the only support person. I'm excited & honored to work with
this cluster, but I'm not at all certain that my best efforts are
capable of giving the results they want. We'll see; I've made my
opinions known quite clearly. I've also advised them to consider what
happens if I get hit by a bus or otherwise become unavailable.
Many of your replies focused on the need for as much automation as
possible. We will in fact have a lot of automation capability
available; the challenge will be integrating it all. The room has a
500kVA/400kW Liebert UPS for the computing equipment, a 130kVA UPS for
the HVAC, and six ?-ton Liebert chilled-water HVAC units. The cooling
is adequate; the numbers suggest that we'll be close to maxing out the
400kW computer UPS. All of this equipment will be networked, and our
automation systems (to be built) will have access to lots of data and
control parameters on this equipment.
Our compute node racks will dissipate up to ~14kW each. The power strips
are three 5.7kW 208V 3-phase APC networked/switched "rack PDUs" per rack.
The intakes for the HVAC units are ~3 feet from the backs of the racks.
We expect nearly laminar air flow, but with good local & room-wide mixing
due to the up-and-down airflow pattern from the supply ducts.
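To make the "close to maxing out" statement above concrete, here's the
back-of-envelope arithmetic. Treating all 26 racks as fully-loaded ~14kW
compute racks is a worst-case simplification of mine, not a measured load:

# Worst-case power budget sketch (assumes every one of the 26 racks draws
# the full ~14kW of a loaded compute rack -- a pessimistic simplification).
racks       = 26
kw_per_rack = 14.0
pdu_kw      = 3 * 5.7      # three 5.7kW strips per rack
ups_kw      = 400.0        # computer-equipment UPS rating

print("per-rack PDU headroom: %.1f kW" % (pdu_kw - kw_per_rack))   # ~3.1 kW
print("room load vs UPS: %.0f kW of %.0f kW"
      % (racks * kw_per_rack, ups_kw))                             # 364 of 400
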
We also have a BTU meter hooked up to the chilled water supply/return
lines. Combined with the UPS data, this will let our automated systems
continuously calculate the energy balance -- is the
cooling system taking out the energy going in? The sign of the
instantaneous energy balance is a very good predictor of the
temperature trend in the room. Thus we will be able to initiate alarms
and/or shutdown procedures long before the temperature rise is
noticeable. This is all to be automated *before* the systems are
permanently powered up -- we will not have 24/7 human coverage, so the
automated response systems are critical.
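In rough pseudocode, the core of that check will look something like the
sketch below. The polling functions are placeholders for whatever
SNMP/Modbus interfaces we end up with, and the alarm margin is an
arbitrary number I picked for illustration:

import time

ALARM_MARGIN_KW = 20.0   # flag trouble if heat removal lags input by this much

def read_ups_load_kw():
    """Placeholder: poll the Liebert UPS for real power delivered to the room."""
    return 350.0   # stub value; the real thing will talk to the UPS over the net

def read_btu_meter_kw():
    """Placeholder: poll the chilled-water BTU meter for heat-removal rate."""
    return 345.0   # stub value

def alarm(msg):
    """Placeholder: page a human and/or kick off the shutdown sequence."""
    print("ALARM: " + msg)

while True:
    power_in = read_ups_load_kw()     # electrical power going into the room
    power_out = read_btu_meter_kw()   # heat being carried away by chilled water
    if power_in - power_out > ALARM_MARGIN_KW:
        alarm("energy balance positive: %.0f kW in, %.0f kW out"
              % (power_in, power_out))
    time.sleep(60)
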
It was calculated that if the air cooling suddenly stopped while the
power into the room was 400kW, but the fans kept the air circulating,
the average rate of temperature increase would be 4 degrees F per
second. I remain a bit skeptical of that number personally -- for
example, there was no allowance for the heat capacity of the
now-stationary chilled water in the heat-transfer coils, nor of the
walls, floors, ceiling, air ducts, and other surfaces in the room. But
the Planetary Science folks here know their atmospheric modeling (it's
what they do for a living), so I'm pretty confident that their results
are correct given their simple assumptions.
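For what it's worth, the shape of that estimate is simple. The air volume
below is a number I made up to show the sensitivity; it is not a
measurement of our room:

# Back-of-envelope: dT/dt = P / (rho * V * cp) for a sealed, well-mixed
# volume of air with no other heat sinks (coils, walls, floor ignored).
P_watts = 400e3        # full computer-UPS load dumped into the room as heat
rho_air = 1.2          # kg/m^3, air density near room temperature
cp_air  = 1005.0       # J/(kg*K), specific heat of air
V_air   = 150.0        # m^3 of free air -- an assumed figure, NOT our room's

dTdt_K = P_watts / (rho_air * V_air * cp_air)
print("dT/dt = %.1f K/s = %.1f F/s" % (dTdt_K, dTdt_K * 9.0 / 5.0))
# ~150 m^3 gives roughly the quoted 4 F/s; double the effective air volume
# (or add any real thermal mass) and the rate drops proportionally.
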
We plan to configure our automated systems to initiate a compute-node
shutdown immediately upon loss of power or loss of chilled water flow,
to be completed in ~30 seconds, with enforcement at the 30-second mark
via our separate controls on the individual power lines. Shutdown of
the more critical systems may be delayed, but will also occur
automatically in a continuing loss situation.
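Sketched as a loop, the watchdog logic is simple. All of the sensor and
power-control calls here are placeholders, and the node names are made up:

import time

GRACE_SECONDS = 30   # compute nodes get this long to shut down cleanly

def on_utility_power():
    """Placeholder: ask the UPS whether it is on utility power or on battery."""
    return True   # stub

def chilled_water_flowing():
    """Placeholder: check the flow sensor / BTU meter."""
    return True   # stub

def broadcast_shutdown(nodes):
    """Placeholder: tell every compute node to power off now (ssh, IPMI, ...)."""
    pass

def cut_outlet_power(nodes):
    """Placeholder: switch off the APC rack-PDU outlets feeding these nodes."""
    pass

compute_nodes = ["compute-%d" % i for i in range(1000)]   # hypothetical names

while True:
    if not on_utility_power() or not chilled_water_flowing():
        broadcast_shutdown(compute_nodes)
        time.sleep(GRACE_SECONDS)          # give the clean shutdowns a chance
        cut_outlet_power(compute_nodes)    # enforce the 30-second mark
        break
    time.sleep(5)
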
We'll also deal with events from the smoke alarm / precharge fire
suppression system and the subfloor water sensors in automated,
appropriate ways. We'll have a programmatic trigger for the Emergency
Power Off circuit for the room, as well as a number of
shielded-pushbutton and emergency-break-glass buttons in the room
itself.
I'm hoping to recruit five volunteers to stand at the six AC units with
me and test what happens if all the fans stop at once with the
computers running full-tilt. Hopefully we won't cook too quickly to
turn the units back on. The person who did the calculation refuses to
help with this test, and thinks we shouldn't do it at all. :)
We've already had a flood in the subbasement that left at least 2" of
water on the sub-raised-floor slab (cause of the flood: plumber errors
in a different area of the subbasement). The machines stayed up until
I shut them down manually (this happened on a Friday night; I got a
7:13 AM Saturday phone call), but the power outlets were on the slab
sitting in (clean) water for a few hours, so the electrical contacts
were completely corroded. The outlets are now raised 13", and the
conduits to the outlets are waterproof.
We also have calcium deposits on the unsealed cement slab, which sits on
soil. The cause is water impregnation of the slab (from the flood and
from failure of the under-slab sealing layer), which brings lime to the
surface; the water evaporates and leaves the lime behind. I will be
investigating how best to prevent this in the future, and will be
arranging for cleaning. The sub-raised-floor area is a high-speed air
conduit for the facility, and we've already had a couple of lime
snowstorms when previously-idle HVAC units were turned on.
Regarding automation capabilities of the computing equipment itself: We
will have console access via the Dell IPMI BMC (baseboard management
controller), over the ethernet network. I have a hint at least that
BIOS management and updates can be automated to some degree on these
machines, but I haven't verified the details. The BMC will give us
power-on/off/reset control, plus we have networked switched power
strips as another mechanism to control power to individual nodes.
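Power control through the BMCs should amount to little more than wrapping
ipmitool; something like the sketch below. The BMC hostname convention and
credentials are invented for the example:

import subprocess

def bmc_power(node, action, user="root", password="changeme"):
    """Send a chassis power command ('status', 'on', 'off', 'cycle' or
    'reset') to a node's BMC over the LAN interface using ipmitool."""
    bmc_host = node + "-bmc"    # invented naming convention for BMC addresses
    cmd = ["ipmitool", "-I", "lan", "-H", bmc_host,
           "-U", user, "-P", password, "chassis", "power", action]
    return subprocess.call(cmd)

# e.g. bmc_power("compute-17", "cycle") to power-cycle a wedged node
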
Rocks is installed on our existing 160-node cluster, but I am far from
sufficiently familiar with it. I will be learning a lot about it in
the near future, and I've gotten expressions of interest & support from
Rocks authors & users. One thing Rocks will do (combined with the Dell
PXE support) is automatic re-imaging of the compute nodes. So that's
taken care of.
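From what I've seen so far, kicking off a reinstall is a one-liner from the
frontend -- something like the sketch below, though I still need to verify
the exact invocation for our Rocks release:

import subprocess

def reimage(node):
    """Ask the Rocks frontend to PXE-reinstall a compute node.
    shoot-node is the Rocks reinstall helper; invocation details may
    differ between Rocks releases, so treat this as a sketch."""
    return subprocess.call(["shoot-node", node])

# e.g. reimage("compute-0-12")
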
We have next-business-day (NBD) onsite service on the compute nodes and 4hr onsite service
on the critical equipment. I fully intend and expect to get Dell to
help me tap this support efficiently. I am confident that will work OK
-- if it doesn't, they'll hear about it. We do have a few spare
compute nodes; I believe we'll also have a spare 48-port GigE Nortel
switch (these stackable switches form our GigE network). The node hard
disks are 10k & 15kRPM SCSI. Out of the 161 machines I have now, a
handful of disks have gone south (in ~6 months), and a few more nodes
have failed for other reasons.
We will have support for LSF and for Ibrix; I expect good support
whenever I need it. Of course, I'll have to become intimately familiar
with these technologies myself.
I have Myrinet running fine on our 160-node cluster. I had no
significant hardware issues, and the only software issues were getting
the required versions of the software for our hardware (not included
with Rocks 3.2), and doing several days of work to convert GM, for
example, into a proper rpm. GM's build tools don't conform well to rpm's
design assumptions, but I got a very good rpm in the end. Upgrades will now
be easy.
I will get very good support from Myricom -- they have been very helpful
already, and are very interested in seeing this cluster work well.
I have a good deal of Linux experience, and I have a good number of
immediate colleagues whose experience I tap regularly (and they mine).
We have good Linux resources on campus. Even so, I will be treading on
territory that no one on campus has seen before. This machine room
will be rivaled at Caltech only by the CACR facility, which houses a
large number of systems in one large room.
Our data storage will be a top-notch DataDirect SAN with 30-40 TB total
available after redundancy, built on FibreChannel disks. I expect to
have a single filesystem served by sixteen Ibrix segment nodes with
high-availability failover of the servers (and RAID and other
redundancy on the SAN side).
I am unsure whether I'll send all the storage data over the Nortel-based
GigE network (our initial design), or whether I'll kick some compute
nodes off the Myrinet and put the Ibrix segment servers on that. It
will depend on the performance and other issues we see, and I'll have
some experts to draw on when deciding these issues.
We will not support user storage of non-scratch data on the node-local
hard drives. A select subset of the multi-TB main data store will be
backed up using a 4.8TB-native LTO2 tape library.
The user pool is already fairly experienced on a couple of mid-sized
clusters, and I expect we'll keep that experience pool growing as
students etc. come and go. They will be expected to help each other
with the majority of helpdesk and howto issues.
Regarding mission creep, the problem is that I have a science background
and enjoy programming, so I'd *like* to work with users on their code
issues. :) I've been told, though, that application support is not my
responsibility. I will have to exert a lot of self-discipline not to
get scattered.
There ya go, a more complete description of our situation. I'll be
contacting our vendors (including those on this list) for help in
planning and execution in the next month and beyond. Beyond that, I
may not have sufficient time to respond to each person who writes to
me. All the same, I very much appreciate any feedback you send my way.
Thanks!
David