[Beowulf] number of admins
Bill Rankin
wrankin at ee.duke.edu
Wed Jun 8 11:26:15 PDT 2005
Chris Dagdigian wrote:
> My $.02
Actually the advice given is worth more than that - it's pretty much
right on target. I have a couple of additions.
> The number of sysadmins required is a function of how much
> infrastructure you have in place to reduce operational burden:
>
> - remote power control over all nodes
>
> - remote access to BIOS on all nodes via serial console
>
> - remote access to system console via serial port on all nodes
Alternatively, the above "good ideas" can be met by 1) a crash cart with a
display/keyboard/mouse, and 2) a 24/7 ops staff.
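For the remote power and console items above, IPMI management (which most
current server boards ship with) plus the ipmitool utility covers a lot of
ground. As a rough sketch only - the BMC hostnames and credentials here are
invented for illustration - remote power control over all nodes can be as
simple as:

#!/usr/bin/env python
"""Sketch: query/cycle node power via each node's BMC using ipmitool.

Assumes every node "nodeNNN" has its BMC reachable as "nodeNNN-ipmi" and
that the IPMI credentials below are valid - all of that is site-specific.
"""
import subprocess

IPMI_USER = "admin"     # site-specific
IPMI_PASS = "secret"    # better: read this from a root-only file
NODES = ["node%03d" % n for n in range(1, 1001)]

def power(node, action="status"):
    """Run 'ipmitool chassis power <action>' against one node's BMC."""
    cmd = ["ipmitool", "-I", "lanplus",
           "-H", node + "-ipmi",                # assumed BMC naming scheme
           "-U", IPMI_USER, "-P", IPMI_PASS,
           "chassis", "power", action]
    return subprocess.call(cmd)

if __name__ == "__main__":
    for node in NODES:
        power(node, "status")

Serial-over-LAN ("ipmitool ... sol activate") gets you console and BIOS
access the same way, without rolling a crash cart across the machine room.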
A quick word on facilities - a 1000-node cluster is quite an
installation and facilities nightmare. You are looking at 100 tons of AC
and 400 kVA of power (at least), not to mention UPS and generator backup.
Get outside consultants to come in and assess your needs. Make sure
that they know clusters. I have met many people who have years of
machine room operational experience, but still have difficulty wrapping
their heads around the concept of a single *rack* of equipment that can
radiate 14kW+ of heat, much less 25 or so of these racks packed
shoulder to shoulder.
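For a sanity check on those numbers, the arithmetic fits on a napkin. The
per-node draw and rack density below are assumptions - plug in your vendor's
figures:

# Back-of-the-envelope facilities estimate for a 1000-node cluster.
# Per-node draw and rack density are assumptions - use real vendor specs.
nodes          = 1000
watts_per_node = 400            # assumed average draw under load
nodes_per_rack = 40             # assumed 1U nodes in a 42U rack

total_kw    = nodes * watts_per_node / 1000.0       # ~400 kW
kw_per_rack = total_kw / (nodes / nodes_per_rack)   # ~16 kW per rack
tons_of_ac  = total_kw / 3.517                      # 1 ton of cooling ~ 3.517 kW

print("%.0f kW total, %.0f kW/rack, %.0f tons of AC" %
      (total_kw, kw_per_rack, tons_of_ac))
# -> 400 kW total, 16 kW/rack, 114 tons of AC (before UPS/generator sizing)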
If you do not have an around-the-clock on-site staff, then you will need
to sit down and carefully run through a couple of scenarios:
1) it's 3am on a Saturday night and you lose one of your coolers while
the cluster is at full load. You have about 15-30 minutes before
you start seeing hardware melting - who shows up, how long does it take
them and what do they do?
2) Same day, same time - a water pipe bursts. Same questions.
I am not sure of how your organization is structured, but I would highly
recommend meeting with the groups that run the major campus computing
infrastructure - the folks who do 24/7 support, the ones who run the
*big* machine rooms. Talk to them and bring them into the process
early. Get their advice.
> - unattended/automatic OS installation onto bare metal (autoYast,
> kickstart, systemimger etc.)
>
> - unattended/automatic OS incremental updates to running nodes
Absolutely required - here at Duke we use PXE/kickstart/yum to
auto-install and maintain patch levels on all nodes.
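For the incremental-update half of that: once every node points at a local
yum repository, the simplest thing that works is to fan "yum -y update" out
over ssh. A sketch, assuming passwordless root ssh to the nodes and node
names that are invented here:

#!/usr/bin/env python
"""Sketch: push "yum -y update" to all compute nodes over ssh in parallel.

Node names, root ssh access and the degree of parallelism are assumptions
about your setup; most sites wrap this in something smarter eventually.
"""
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = ["node%03d" % n for n in range(1, 1001)]

def update(node):
    # BatchMode makes ssh fail fast instead of prompting for a password.
    rc = subprocess.call(["ssh", "-o", "BatchMode=yes", "root@" + node,
                          "yum", "-y", "update"])
    return node, rc

with ThreadPoolExecutor(max_workers=32) as pool:
    for node, rc in pool.map(update, NODES):
        if rc != 0:
            print("update failed on %s (exit %d)" % (node, rc))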
Note that the success of your cluster will depend on the local/campus
Linux infrastructure - OS repositories, application repositories, local
knowledge. If you do not have this readily available, then you will
have to build it.
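Standing up the local repository side of that is not much work either. A
minimal sketch, assuming createrepo is installed and the head node exports
the directory over http (paths, hostnames and the repo id are made up):

#!/usr/bin/env python
"""Sketch: regenerate a local yum repository after dropping in new RPMs."""
import subprocess

REPO_DIR = "/var/www/html/cluster-updates"   # RPMs for the cluster live here

# Rebuild the repository metadata so the nodes see the new packages.
subprocess.call(["createrepo", REPO_DIR])

# Each node then carries a stanza like this in /etc/yum.repos.d/cluster.repo:
#
#   [cluster-updates]
#   name=Local cluster updates
#   baseurl=http://headnode/cluster-updates
#   enabled=1
#   gpgcheck=0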
Do not rely upon the availability of outside Linux resources that you
don't have at least some influence over.
> - documented plan for handling node hardware failures which includes
> specific info on when and how an admin is expected to spend time
> diagnosing a problem versus when the admin can just hand the node off
> to a vendor or someone else for simple planned replacement or advanced
> troubleshooting. For Dell systems you want to have an agreement in
> place where your sysadmin can make a judgement call that a node needs
> replacement WITHOUT having to first wade through the hell that is
> Dell's first tier of customer support.
I will second this recommendation. Dell customer support means well,
but they are trained to deal more with Mom&Dad's PC rather than a room
full of servers. "Reboot the machine with the diagnostics disk" is not
an option when it's a fileserver and one disk in a RAID has clearly died.
Maintain a full spares kit on-site. The kit should include at least a
hard drive, a replacement power supply, a replacement network switch, and
a full set of spares for your Myrinet or InfiniBand network. Don't forget
the cables.
> If you have the infrastructure in place where your admin(s) can do
> everything remotely including OS installs, console access and remote
> power control then you may be able to get away with a single admin (as
> long as his/her job is tightly scoped to keeping the cluster
> functional). If you have not pre-planned your architecture to make
> administration as easy and as "hands off" as possible then you are
> going to need many hands.
Even if you have, I would plan on at least two really *good* sys admins
to manage the cluster. I would add a third to manage the storage and
backups.
> The biggest reason for cluster deployment unhappiness can be traced to
> this:
>
> - management and users expect the cluster operators to also be
> experts with HPC programming, the applications in use, application
> integration issues and the cluster scheduler. This almost never works
> out well as the skills and background needed to keep a cluster running
> are often quite different from the expertise needed to understand the
> local research efforts and internal application mix.
Do not underestimate this. You should have at least one full-time
HPC applications person, probably two - often at the post-doc level. As
you build your cluster, it is vital that you build up your personnel to
match.
Hope this helps. Good luck!
-bill
--
bill rankin, ph.d. ........ director, cluster and grid technology group
wrankin at ee.duke.edu .......................... center for computational
duke university ...................... science engineering and medicine
http://www.ee.duke.edu/~wrankin .............. http://www.csem.duke.edu