[Beowulf] number of admins

Brian D. Ropers-Huilman bropers at cct.lsu.edu
Wed Jun 8 10:23:19 PDT 2005


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

A 1,024 node cluster is sizable. However, given that you are running
Rocks, it is likely that you can get by with one sysadmin, so long as they
know their way around the HPC world. But, I wouldn't stop there. I would
augment that sysadmin with what I term a Scientific Computing support
person who could aid in the admin, but is mostly responsible for the
software stack and optimizing communications and storage for the users.

I have a 512 node Linux cluster, a 128 node Linux cluster, a 32 node Apple
cluster, a 16 node Linux cluster, and a new 32 processor Altix/Prism, which
is on the way. I have a total of 3 sysadmins and currently only 1 Sci.
Comp. support person, though I still have an opening there. I also have an
opening for a true Help Desk support person to handle the more mundane
aspects of serving users' needs on these systems. This staff works with ~5
undergraduate student workers as well, though that number could be smaller
or further augmented by a couple of good graduate students.

Your 1,024 node system will likely never run a 2,048 processor job, other
than your initial HPL run, if you pursue that. I say this because there
_will_ be hardware issues, and at this scale it is likely that at least one
node is down at any given time. I do not have experience with Dell's HPC
systems, but I
know George Jones is doing a heck of a job getting them out there so I have
to believe they work well. In terms of the Myrinet and other software, yes,
the system should be quite stable given today's software stacks.
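
To put a rough number on the full-machine point above, here is a
back-of-the-envelope sketch in Python. It assumes node failures are
independent and that each node is up some fixed fraction of the time. The
availability figures are illustrative assumptions, not measurements from
any real system.

```python
def prob_all_nodes_up(nodes: int, per_node_availability: float) -> float:
    """Probability that every node is up at the same moment.

    Assumes nodes fail independently, which is optimistic: shared
    switches, power, and cooling tend to take out groups of nodes.
    """
    return per_node_availability ** nodes


if __name__ == "__main__":
    # Hypothetical per-node availabilities, chosen only for illustration.
    for availability in (0.999, 0.995, 0.99):
        p = prob_all_nodes_up(1024, availability)
        print(f"per-node availability {availability:.3f}: "
              f"all 1,024 nodes up with probability {p:.4f}")
```

Even with every node up 99.9% of the time, this model gives only about a
one-in-three chance that all 1,024 are available at once, so a job that
needs every processor will usually be waiting on a repair.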

You ask about non-obvious skill sets. I would bring in someone who's good
at scripting, which is not necessarily a skill a sysadmin will have. Any
sysadmin will be able to do some level of scripting, but you'll want
someone who is quite skilled in this area. This person can help you
automate processes on the system such as: name space management, additional
usage reports, disk scrubbers, automatic documentation of the installed
software, and the like. We do everything via LDAP and have a series of
command-line PHP scripts for managing user space and other things.
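
As an illustration of the kind of automation I mean, here is a minimal
sketch of the reporting half of a disk scrubber. It is written in Python
rather than the PHP we use, and it is not one of our scripts. The function
name, the age threshold, and the scratch path in the usage note are all
hypothetical. It only lists candidates and deletes nothing.

```python
import os
import stat
import time


def find_stale_files(root, max_age_days, now=None):
    """Return sorted paths of regular files under root whose last
    modification is more than max_age_days old."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are inspected, not followed
                info = os.lstat(path)
            except OSError:
                continue  # file vanished between listing and stat
            if stat.S_ISREG(info.st_mode) and info.st_mtime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Something like find_stale_files("/scratch", 30) would produce a list to
mail to the owners before anything is removed. A real scrubber would also
need a warning period, per-group exemptions, and a log of what it deleted.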

I'd be willing to talk more off-line too if you're interested.

David Kewley said the following on 2005.06.06 17:23:
> Hi all,
> 
> We expect to get a large new cluster here, and I'd like to draw on the 
> expertise on this list to educate management about the personnel 
> needed.
> 
> The cluster is expected to be:
> 
> ~1000 Dell PE1850 dual CPU compute nodes
> master & other auxiliary nodes on similar hardware
> 1024-port Myrinet
> Nortel stacked-switches-based GigE network
> many-TB SAN built on Data Direct & Ibrix
> Platform Rocks
> Platform LSF HPC Rocks roll
> Moab added later, quite possibly
> tape library backup (software TBD)
> NFS service to public workstations
> nine man-weeks of Dell installation support
> 10 man-days of Ibrix installation support
> 
> The users will be something like:
> 
> ~10 local academic groups, perhaps 60 users total
> several different locally-written or -customized codebases
> at least one near-real-time application with public exposure
> 
> We have some experience already with a 160-node Dell cluster that has 
> some of the basic elements listed above, but several of the pieces will 
> be totally new, and some of the pieces we already have will need 
> greater care.
> 
> My questions to you are:
> 
> * How many sysadmins should we plan to have once the cluster is stable?
> * Is there indeed any such thing as a "stable" cluster of this sort, and 
> if so, should we get additional help during the initial phase of the 
> project, when things are less stable (help beyond the vendor 
> installation support listed above)?
> * If we need more help in the initial phases, how might we go about 
> finding people?  Contract workers?  Commercial or private 
> consultancies?
> * Should we look for any specific non-obvious skillset, or would skilled 
> sysadmins be adequate?
> 
> And finally:
> 
> * If we only have one sysadmin, someone who is bright and capable, but 
> is learning as they go, is that too small a support staff?
> * If one such sysadmin is too little, then what would you expect the 
> impact on the users to be?
> 
> I have been giving my opinion to management, but I'd really like to get 
> (relatively unbiased) professional opinions from outside as well.  I 
> thank you for any comments you can make!
> 
> David
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

- --
Brian D. Ropers-Huilman  .:. Asst. Director .:.  HPC and Computation
Center for Computation & Technology (CCT)        bropers at cct.lsu.edu
Johnston Hall, Rm. 350                           +1 225.578.3272 (V)
Louisiana State University                       +1 225.578.5362 (F)
Baton Rouge, LA 70803-1900  USA              http://www.cct.lsu.edu/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (Darwin)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCpymGwRr6eFHB5lgRAs+DAKCUyh4nBq5AecBpqlQLNu/cEsn2RACg+Vwq
aXqIFfr70DqO/40lOyQl93E=
=8lgx
-----END PGP SIGNATURE-----