[Beowulf] number of admins

Robert G. Brown rgb at phy.duke.edu
Wed Jun 15 07:11:21 PDT 2005


Brian D. Ropers-Huilman writes:

> A 1,024 node cluster is sizable. However, given that you  are running
> Rocks, it is likely that you can get by with one sysadmin, so long as they
> know their way around the HPC world. But, I wouldn't stop there. I would
> augment that sysadmin with what I term a Scientific Computing support
> person who could aid in the admin, but is mostly responsible for the
> software stack and optimizing communications and storage for the users.

I was going to contribute to the answer earlier, but was screwing around
with my new laptop (and its ipw2200 driver) and got distracted.
Besides, I agreed with most of the answers that were posted.

One pretty important element that I don't recall being brought out in
the discussion so far, however, is that the answer contains a great
big "it depends" in it, and is somewhat time dependent.

As Brian has indicated, on a good day and with a good cluster
distribution or distribution add-on layer, a single sysadmin can manage
even 1024 nodes.  However, not all days are good days.

With a cluster this large, in the LONG run your administration needs
break down into distinct areas:  Human, software, hardware,
infrastructure.  

Human tends to be dominated by the 10% or so personality-disordered
individuals who use up 90% of your time (making you grit your teeth
while you help them).  If the cluster has a small user base of
knowledgeable people it may be a small demand timewise, but with such a
big cluster and a possibly large and ignorant user base it could
dominate.  It is also "bursty", giving you a choice of staffing to meet
the AVERAGE demand (requiring that people wait in line when a burst of
demand occurs) or staffing to meet the PEAK demand (so nobody ever has
to wait much to get help).  Both have costs.  The first costs you lost
productivity of the folks in line (and this will be a theme below, so
remember it).  The second costs you the idle time of the surplus staff,
unless you can redirect their opportunity cost labor into other
channels.
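
To put a rough number on that average-versus-peak tradeoff, here is a
little back-of-envelope Python sketch using the textbook Erlang C
queueing formula.  The arrival rate, service time, and staffing levels
are made-up numbers for illustration only, not measurements from any
real cluster:

    from math import factorial

    def erlang_c(servers, offered_load):
        # Probability that a new request has to queue (Erlang C formula)
        # for an M/M/c queue with `servers` admins on duty.
        a, c = offered_load, servers
        top = (a ** c / factorial(c)) * (c / (c - a))
        bottom = sum(a ** k / factorial(k) for k in range(c)) + top
        return top / bottom

    # Purely illustrative: requests arrive at 2/hour and take an admin
    # 20 minutes each on average, so the offered load is 2/3 erlang.
    arrival_rate = 2.0    # requests per hour (assumed)
    service_rate = 3.0    # requests one admin clears per hour (assumed)
    load = arrival_rate / service_rate

    for admins in (1, 2, 3):
        p_wait = erlang_c(admins, load)
        mean_wait = 60 * p_wait / (admins * service_rate - arrival_rate)
        print(f"{admins} admin(s): P(wait) = {p_wait:.2f}, "
              f"mean wait = {mean_wait:.1f} minutes")

With these numbers one admin leaves users waiting 40 minutes on
average; a second admin drops that to a couple of minutes but sits
partly idle.  Pick your poison.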

The human (software) costs in a linux system are minimal -- setting up
repositories for installation and maintenance, building key packages as
required, organizing a shared file system and account space, doing all
the "systems administration" things that make a cluster go.  With a
solid base distro, pxe/kickstart, and prebuilt packages either as part
of the distro or as an add-on layer (rocks vs warewulf, e.g.) most of
your time here will be taken up by any EXTRA packages or SPECIFIC tools
and applications you need to install and support.  These can be quite
costly to administer (floating licenses, per seat/node pricing,
configuration, training) but you'll have to assess your needs based on
the specifics of what you plan to do.  Batch managers also require a
fair bit of tweaking and administration at least at first.

Two things here limit its long run human cost.  One is that MOST of the
costs are front loaded.  Setting the cluster (software) UP is time
consuming, but it is only done once and can really be done (for the most
part) by one person no matter how many you have available to help out.
The second is that a lot of the administration in linux based systems is
now thoroughly automated and scales at the theoretical limit of
scalability.  Once it is set up, yum (e.g.) maintains it and
administrative load SHOULD go way, way down.  To where sure, one person
can run it and get change back from their dollar most days, barring some
really horrible problems with a specific, mission critical package.
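
To give a concrete flavor of what "scales at the theoretical limit"
means in practice, here is a minimal sketch of the sort of thing one
person can fire off against all 1024 nodes at once.  The node naming
scheme and passwordless root ssh are assumptions for illustration, not
a description of how any particular site actually does it:

    # A minimal sketch, assuming node0001..node1024 and key-based ssh.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    NODES = [f"node{i:04d}" for i in range(1, 1025)]  # hypothetical names

    def update(node):
        # Pull pending updates from the local yum repository, hands-off.
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", node, "yum", "-y", "update"],
            capture_output=True, text=True, timeout=1800)
        return node, result.returncode

    with ThreadPoolExecutor(max_workers=64) as pool:
        for node, rc in pool.map(update, NODES):
            if rc != 0:
                print(f"{node}: update failed (rc={rc}), needs a human")

Once the repository and kickstart images are in place, that kind of
loop is most of the day-to-day software "work".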

The human (hardware) costs are one place that I don't think has gotten
enough attention.  Just assembling 1024 nodes and racking them out of
the box is a time consuming and thankless job that will take quite a
long time depending on their state of assembly when purchased.
Attaching rails, racking, and cabling a node out of the box takes
perhaps a MINIMUM of 15 minutes per node, or four nodes per hour, or 32
per eight-hour day.  That adds up to on the order of five or six WEEKS
for a single person to install your cluster nodes from delivery boxes
to racks; divide by the number of people working.
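
The arithmetic, spelled out (the 15 minutes per node is the assumed
minimum from above; the real figure depends on the state of assembly
when purchased):

    nodes = 1024
    minutes_per_node = 15      # assumed minimum: rails, rack, cable
    hours_per_day = 8
    days_per_week = 5

    nodes_per_person_day = hours_per_day * 60 // minutes_per_node  # 32
    for crew in (1, 2, 4):
        days = nodes / (nodes_per_person_day * crew)
        print(f"{crew} worker(s): {days:.0f} working days "
              f"(~{days / days_per_week:.1f} weeks)")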

This sort of equation plays on through on the hardware maintenance side.
Even though the probability of a hardware event per node per day is
generally small, the probability per CLUSTER per day is NOT likely to be
small.  Then there is the dread word "generally", implying that there
are exceptions.  Two exceptions that have gotten lots of list attention
recently are the capacitor from hell (which affected close to half of
all cheap motherboards in the known universe) -- the boards ran
flawlessly for a year, but between years 2 and 4 some 80% of them would
blow, often messily, toasting said motherboard.  I still have some
cases with the little smokey smear on the inside from the spatter
pattern.  Then there were the eternal BIOS (etc.) problems of the Tyan
246x dual Athlon motherboard -- a great cluster motherboard
performance-wise, but a total time suck otherwise.

Let's call this the "lemon factor".  If you end up with a lemon for your
standard node (and NOTHING can prevent this, as some lemons only reveal
themselves in years 2+ when it is too late) then, well, a single person
will be in some sort of hardware hell forever, even if the cluster is
under a maintenance contract.  Having gotten bitten by BOTH the
capacitor problem AND the 246x problems, I've spent a fair bit of time
in this particular hell.  In this case one person will Not Be Enough.
How many will be enough?  Hard to say and depends heavily on whether
nodes are under maintenance, but probably a minimum of 3 unless your
users are comfortable having large numbers of nodes out of commission at
any given time.
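
To put a rough number behind the per-node versus per-CLUSTER
probability point above (the per-node rate here is an assumed,
made-up figure for perfectly ordinary hardware, not a measurement):

    p_node_day = 1.0 / 2000  # assumed: one hardware event per node
                             # every ~5.5 years, lemon-free hardware
    nodes = 1024

    p_cluster_day = 1 - (1 - p_node_day) ** nodes
    expected_per_day = nodes * p_node_day
    print(f"P(at least one failure in the cluster today) = "
          f"{p_cluster_day:.2f}")
    print(f"Expected failures per day, cluster-wide = "
          f"{expected_per_day:.2f}")

Even at that benign rate, something breaks somewhere in the cluster
roughly every other day.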

Even without lemons, maintenance costs tend to have a particular shape
that is peaked on the leading edge ("burn in", when most defects reveal
themselves) and ends with a rising curve as hardware ages out and starts
to fail.  Long term chronic problems tend to be power supplies, fans
(especially in a dusty environment), and hard drives, but anything else
CAN contribute as an individual lemon.  In years 2+, I personally would
say that hardware maintenance will DOMINATE your cluster maintenance
activity.  Expect failures of one part or another just about every day,
at a cost of hours per incident to resolve even with maintenance
contracts, more if not.  You can see that if the failure rates nudge
JUST A BIT up, they will overwhelm a single person, and by years 3+
they almost certainly will.  I don't think one person can manage 1024
nodes for a full 3 year life cycle (or beyond) without falling behind
at the end.
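
Again with made-up but plausible numbers, the "nudge JUST A BIT up"
arithmetic looks like this:

    failures_per_day = 1.0     # "just about every day", as above
    hours_per_incident = 3.0   # assumed average, even with contracts
    admin_hours_per_day = 8.0

    for bump in (1.0, 1.5, 2.0, 3.0):
        f = failures_per_day * bump
        load = f * hours_per_incident / admin_hours_per_day
        print(f"{f:.1f} failures/day: {load:.0%} of one admin's day")

Somewhere between two and three incidents a day, hardware alone eats
the entire working day of a single admin, and nothing else gets done.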

Finally, there is infrastructure.  This has a nontrivial human component
at least some of the time -- certainly during initial setup, but every
time e.g. power or AC fails, your cluster can require SIGNIFICANT human
effort.  As was pointed out, this most often happens in the middle of
the night.  To properly cover a 24x7 cluster, one therefore needs enough
humans to be able to respond promptly all year long, vacations and
holidays included.  That is a minimum of 2-4 people right there.
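
The 2-4 figure is easy to motivate with simple coverage arithmetic
(again just a sketch, assuming a standard 40-hour week):

    hours_needing_coverage = 24 * 7   # a 24x7 cluster
    hours_per_admin_week = 40
    print(f"Staffed shifts would need "
          f"{hours_needing_coverage / hours_per_admin_week:.1f} FTEs")

    # More realistically an on-call rotation: with N people each one
    # carries the pager every Nth week, before subtracting vacations
    # and holidays.
    for people in (2, 3, 4):
        print(f"{people}-person rotation: ~{52 / people:.0f} "
              f"on-call weeks per person per year")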

    rgb

> 
> I have a 512 node Linux cluster, a 128 node Linux cluster, a 32 node Apple
> cluster, a 16 node Linux cluster, and a new 32 processor Altix/Prism, which
> is on the way. I have a total of 3 sysadmins and currently only 1 Sci.
> Comp. support person, though I still have an opening there. I also have an
> opening for a true Help Desk support person to handle the more mundane
> aspects of serving users' needs on these systems. This staff works with ~5
> undergraduate student workers as well, though that number could be smaller
> or further augmented by a couple of good graduate students.
> 
> Your 1,024 node system will likely never run a 2,048 processor job, other
> than your initial HPL if you pursue that. I say this because there _will_
> be hardware issues. I do not have experience with Dell's HPC systems, but I
> know George Jones is doing a heck of a job getting them out there so I have
> to believe they work well. In terms of the Myrinet and other software, yes,
> the system should be quite stable given today's software stacks.
> 
> You ask about non-obvious skill sets. I would bring in someone who's good
> at scripting, which is not necessarily something a sysadmin will have. Any
> sysadmin will be able to do some level of scripting, but you'll want
> someone who is quite skilled in this area. This person can help you
> automate processes on the system such as: name space management, additional
> usage reports, disk scrubbers, automatic documentation of the installed
> software, and the like. We do everything via LDAP and have a series of
> command-line PHP scripts for managing user space and other things.
> 
> I'd be willing to talk more off-line too if you're interested.
> 
> David Kewley said the following on 2005.06.06 17:23:
>> Hi all,
>> 
>> We expect to get a large new cluster here, and I'd like to draw on the 
>> expertise on this list to educate management about the personnel 
>> needed.
>> 
>> The cluster is expected to be:
>> 
>> ~1000 Dell PE1850 dual CPU compute nodes
>> master & other auxiliary nodes on similar hardware
>> 1024-port Myrinet
>> Nortel stacked-switches-based GigE network
>> many-TB SAN built on Data Direct & Ibrix
>> Platform Rocks
>> Platform LSF HPC Rocks roll
>> Moab added later, quite possibly
>> tape library backup (software TBD)
>> NFS service to public workstations
>> nine man-weeks of Dell installation support
>> 10 man-days of Ibrix installation support
>> 
>> The users will be something like:
>> 
>> ~10 local academic groups, perhaps 60 users total
>> several different locally-written or -customized codebases
>> at least one near-real-time application with public exposure
>> 
>> We have some experience already with a 160-node Dell cluster that has 
>> some of the basic elements listed above, but several of the pieces will 
>> be totally new, and some of the pieces we already have will need 
>> greater care.
>> 
>> My questions to you are:
>> 
>> * How many sysadmins should we plan to have once the cluster is stable?
>> * Is there indeed any such thing as a "stable" cluster of this sort, and 
>> if so, should we get additional help during the initial phase of the 
>> project, when things are less stable (help beyond the vendor 
>> installation support listed above)?
>> * If we need more help in the initial phases, how might we go about 
>> finding people?  Contract workers?  Commercial or private 
>> consultancies?
>> * Should we look for any specific non-obvious skillset, or would skilled 
>> sysadmins be adequate?
>> 
>> And finally:
>> 
>> * If we only have one sysadmin, someone who is bright and capable, but 
>> is learning as they go, is that too small a support staff?
>> * If one such sysadmin is too little, then what would you expect the 
>> impact on the users to be?
>> 
>> I have been giving my opinion to management, but I'd really like to get 
>> (relatively unbiased) professional opinions from outside as well.  I 
>> thank you for any comments you can make!
>> 
>> David
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 
> --
> Brian D. Ropers-Huilman  .:. Asst. Director .:.  HPC and Computation
> Center for Computation & Technology (CCT)        bropers at cct.lsu.edu
> Johnston Hall, Rm. 350                           +1 225.578.3272 (V)
> Louisiana State University                       +1 225.578.5362 (F)
> Baton Rouge, LA 70803-1900  USA              http://www.cct.lsu.edu/