[Beowulf] number of admins
Bill Rankin
wrankin at ee.duke.edu
Wed Jun 8 11:26:15 PDT 2005
Chris Dagdigian wrote:
> My $.02
Actually the advice given is worth more than that - it's pretty much
right on target. I have a couple of additions.
> The number of sysadmins required is a function of how much
> infrastructure you have in place to reduce operational burden:
>
> - remote power control over all nodes
>
> - remote access to BIOS on all nodes via serial console
>
> - remote access to system console via serial port on all nodes
Alternatively, the above "good ideas" can be met by 1) a crash cart with a
display/keyboard/mouse, and 2) a 24/7 ops staff.
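For the remote power and console items above, IPMI management (which most
current server boards ship with) plus the ipmitool utility covers a lot of
ground. As a rough sketch only - the BMC hostnames and credentials here are
invented for illustration - remote power control over all nodes can be as
simple as:

#!/usr/bin/env python
"""Sketch: query/cycle node power via each node's BMC using ipmitool.

Assumes every node "nodeNNN" has its BMC reachable as "nodeNNN-ipmi" and
that the IPMI credentials below are valid - all of that is site-specific.
"""
import subprocess

IPMI_USER = "admin"     # site-specific
IPMI_PASS = "secret"    # better: read this from a root-only file
NODES = ["node%03d" % n for n in range(1, 1001)]

def power(node, action="status"):
    """Run 'ipmitool chassis power <action>' against one node's BMC."""
    cmd = ["ipmitool", "-I", "lanplus",
           "-H", node + "-ipmi",                # assumed BMC naming scheme
           "-U", IPMI_USER, "-P", IPMI_PASS,
           "chassis", "power", action]
    return subprocess.call(cmd)

if __name__ == "__main__":
    for node in NODES:
        power(node, "status")

Serial-over-LAN ("ipmitool ... sol activate") gets you console and BIOS
access the same way, without rolling a crash cart across the machine room.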
A quick word on facilities - a 1000-node cluster is quite an
installation and facilities nightmare. You are looking at 100 tons of AC
and 400 kVA of power (at least), not to mention UPS and generator backup.
Get outside consultants to come in and assess your needs. Make sure
that they know clusters. I have met many people who have years of
machine room operational experience, but still have difficulty wrapping
their heads around the concept of a single *rack* of equipment that can
radiate 14kW+ of heat, much less 25 or so of these racks packed
shoulder to shoulder.
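For a sanity check on those numbers, the arithmetic fits on a napkin. The
per-node draw and rack density below are assumptions - plug in your vendor's
figures:

# Back-of-the-envelope facilities estimate for a 1000-node cluster.
# Per-node draw and rack density are assumptions - use real vendor specs.
nodes          = 1000
watts_per_node = 400            # assumed average draw under load
nodes_per_rack = 40             # assumed 1U nodes in a 42U rack

total_kw    = nodes * watts_per_node / 1000.0       # ~400 kW
kw_per_rack = total_kw / (nodes / nodes_per_rack)   # ~16 kW per rack
tons_of_ac  = total_kw / 3.517                      # 1 ton of cooling ~ 3.517 kW

print("%.0f kW total, %.0f kW/rack, %.0f tons of AC" %
      (total_kw, kw_per_rack, tons_of_ac))
# -> 400 kW total, 16 kW/rack, 114 tons of AC (before UPS/generator sizing)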
If you do not have an around-the-clock on-site staff, then you will need
to sit down and carefully run through a couple of scenarios:
1) it's 3am on a Saturday night and you lose one of your coolers while
the cluster is at full load. You have about 15-30 minutes before
you start seeing hardware melting - who shows up, how long does it take
them and what do they do?
2) Same day, same time - a water pipe bursts. Same questions.
I am not sure of how your organization is structured, but I would highly
recommend meeting with the groups that run the major campus computing
infrastructure - the folks who do 24/7 support, the ones who run the
*big* machine rooms. Talk to them and bring them into the process
early. Get their advice.
> - unattended/automatic OS installation onto bare metal (autoYast,
> kickstart, systemimger etc.)
>
> - unattended/automatic OS incremental updates to running nodes
Absolutely required - here at Duke we use PXE/kickstart/yum to
auto-install and maintain patch levels on all nodes.
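For the incremental-update half of that: once every node points at a local
yum repository, the simplest thing that works is to fan "yum -y update" out
over ssh. A sketch, assuming passwordless root ssh to the nodes and node
names that are invented here:

#!/usr/bin/env python
"""Sketch: push "yum -y update" to all compute nodes over ssh in parallel.

Node names, root ssh access and the degree of parallelism are assumptions
about your setup; most sites wrap this in something smarter eventually.
"""
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = ["node%03d" % n for n in range(1, 1001)]

def update(node):
    # BatchMode makes ssh fail fast instead of prompting for a password.
    rc = subprocess.call(["ssh", "-o", "BatchMode=yes", "root@" + node,
                          "yum", "-y", "update"])
    return node, rc

with ThreadPoolExecutor(max_workers=32) as pool:
    for node, rc in pool.map(update, NODES):
        if rc != 0:
            print("update failed on %s (exit %d)" % (node, rc))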
Note that the success of your cluster will depend on the local/campus
Linux infrastructure - OS repositories, application repositories, local
knowledge. If you do not have this readily available, then you will
have to build it.
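Standing up the local repository side of that is not much work either. A
minimal sketch, assuming createrepo is installed and the head node exports
the directory over http (paths, hostnames and the repo id are made up):

#!/usr/bin/env python
"""Sketch: regenerate a local yum repository after dropping in new RPMs."""
import subprocess

REPO_DIR = "/var/www/html/cluster-updates"   # RPMs for the cluster live here

# Rebuild the repository metadata so the nodes see the new packages.
subprocess.call(["createrepo", REPO_DIR])

# Each node then carries a stanza like this in /etc/yum.repos.d/cluster.repo:
#
#   [cluster-updates]
#   name=Local cluster updates
#   baseurl=http://headnode/cluster-updates
#   enabled=1
#   gpgcheck=0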
Do not rely upon the availability of outside Linux resources that you
don't have at least some influence over.
> - documented plan for handling node hardware failures which includes
> specific info on when and how an admin is expected to spend time
> diagnosing a problem versus when the admin can just hand the node off
> to a vendor or someone else for simple planned replacement or advanced
> troubleshooting. For Dell systems you want to have an agreement in
> place where your sysadmin can make a judgement call that a node needs
> replacement WITHOUT having to first wade through the hell that is
> Dell's first tier of customer support.
I will second this recommendation. Dell customer support means well,
but they are trained to deal more with Mom&Dad's PC rather than a room
full of servers. "Reboot the machine with the diagnostics disk" is not
an option when it's a fileserver and one disk in a RAID has clearly died.
Maintain a full spares kit on-site. The kit should include at least a
hard drive, a replacement power supply, a replacement network switch, and
a full set of spares for your Myrinet or InfiniBand network. Don't forget
the cables.
> If you have the infrastructure in place where your admin(s) can do
> everything remotely including OS installs, console access and remote
> power control then you may be able to get away with a single admin (as
> long as his/her job is tightly scoped to keeping the cluster
> functional). If you have not pre-planned your architecture to make
> administration as easy and as "hands off" as possible then you are
> going to need many hands.
Even if you have, I would plan on at least two really *good* sys admins
to manage the cluster. I would add a third to manage the storage and
backups.
> The biggest reason for cluster deployment unhappiness can be traced to
> this:
>
> - management and users expect the cluster operators to also be
> experts with HPC programming, the applications in use, application
> integration issues and the cluster scheduler. This almost never works
> out well as the skills and background needed to keep a cluster running
> are often quite different from the expertise needed to understand the
> local research efforts and internal application mix.
Do not underestimate this. You should have at least one full-time
HPC applications person, probably two - often at the post-doc level. As
you build your cluster, it is vital that you build up your personnel to
match.
Hope this helps. Good luck!
-bill
--
bill rankin, ph.d. ........ director, cluster and grid technology group
wrankin at ee.duke.edu .......................... center for computational
duke university ...................... science engineering and medicine
http://www.ee.duke.edu/~wrankin .............. http://www.csem.duke.edu