[Beowulf] number of admins

Mike Davis jmdavis at mail2.vcu.edu
Wed Jun 8 09:39:33 PDT 2005


Excellent points, Chris.

Indeed, one of the biggest issues for HPC admins is focus drift. We are 
often expected to know everything about everything technical. I am very 
lucky to have very good and very smart admins in my group.

We have also been doing HPC long enough to have a real understanding 
of the apps and how they work, but that is often part of the problem: 
we can check F77 and F90 code and troubleshoot software, and sometimes 
those tasks take priority over installs and upgrades. One of my biggest 
managerial tasks is keeping things focused and prioritized.

Ten items that are all top priority means that none of them is.

Things that we do to alleviate some of the issues of having many 
machines and few admins include:

Shutting down machines that malfunction; i.e., if a node dies in a 
two-year-old cluster, we shut it down rather than spend admin time on 
it (a small script sketch follows this list).

Three-year hardware cycles. I plan for the next three years and make 
those plans a year in advance.

Hedging our bets with multiple technologies: OS X, Solaris, and Linux, 
and both SMP and cluster machines (SPARC, Opteron, G5, PIV).
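
As an illustration of the shutdown step: once the plumbing is in place 
it is a couple of commands per node and easy to script. This is a rough 
sketch only; LSF's badmin stands in for whatever scheduler you run, the 
BMC credentials are placeholders, and the "<node>-bmc" hostname pattern 
is an assumed naming convention:

#!/usr/bin/env python
# retire_node.py -- close a failed node in the scheduler, then power it
# off over IPMI. A sketch only: assumes LSF's badmin is on the PATH and
# each node's BMC is reachable as "<node>-bmc" (assumed convention).
import subprocess
import sys

IPMI_USER = "admin"       # placeholder BMC user
IPMI_PASS = "changeme"    # placeholder BMC password

def retire(node):
    # Close the host in LSF so no new jobs are dispatched to it.
    subprocess.call(["badmin", "hclose", "-C", "hardware failure", node])
    # Power the box off through its baseboard management controller.
    subprocess.call(["ipmitool", "-H", node + "-bmc", "-U", IPMI_USER,
                     "-P", IPMI_PASS, "chassis", "power", "off"])

if __name__ == "__main__":
    for node in sys.argv[1:]:
        retire(node)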

Mike



Chris Dagdigian wrote:

> My $.02
>
> The number of sysadmins required is a function of how much  
> infrastructure you have in place to reduce operational burden:
>
>  - remote power control over all nodes
>
>  - remote access to BIOS on all nodes via serial console
>
>  - remote access to system console  via serial port on all nodes
>
>  - unattended/automatic OS installation onto bare metal (AutoYaST, 
> kickstart, SystemImager, etc.)
>
>  - unattended/automatic OS incremental updates to running nodes
>
>  - documented plan for handling node hardware failures, which includes 
> specific info on when and how an admin is expected to spend time 
> diagnosing a problem versus when the admin can just hand the node off 
> to a vendor or someone else for simple planned replacement or advanced 
> troubleshooting.  For Dell systems you want to have an agreement in 
> place where your sysadmin can make a judgement call that a node needs 
> replacement WITHOUT having to first wade through the hell that is 
> Dell's first tier of customer support.
>
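
To make the incremental-update item concrete, here is a minimal sketch 
of the sort of thing that qualifies, assuming passwordless root ssh to 
the nodes and a yum-based node image; the "nodes.txt" inventory file is 
a placeholder:

#!/usr/bin/env python
# push_updates.py -- unattended incremental package updates on running
# nodes. A sketch only: assumes passwordless root ssh, a yum-based node
# image, and a "nodes.txt" file with one hostname per line.
import subprocess

def update(node):
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # and "yum -y" answers yes so the run needs no operator attention.
    rc = subprocess.call(["ssh", "-o", "BatchMode=yes", "root@" + node,
                          "yum", "-y", "update"])
    return rc == 0

if __name__ == "__main__":
    failed = [n for n in open("nodes.txt").read().split() if not update(n)]
    if failed:
        print("update failed on: " + " ".join(failed))
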
> If you have the infrastructure in place where your admin(s) can do 
> everything remotely, including OS installs, console access and remote 
> power control, then you may be able to get away with a single admin 
> (as long as his/her job is tightly scoped to keeping the cluster 
> functional). If you have not pre-planned your architecture to make 
> administration as easy and as "hands off" as possible, then you are 
> going to need many hands.
>
> The biggest reason for cluster deployment unhappiness can be traced  
> to this:
>
>  - management and users expect the cluster operators to also be 
> experts with HPC programming, the applications in use, application 
> integration issues and the cluster scheduler. This almost never works 
> out well, as the skills and background needed to keep a cluster 
> running are often quite different from the expertise needed to 
> understand the local research efforts and internal application mix.
>
> This is not a good arrangement. The cluster sysadmins should be 
> focused on the OS, hardware, interconnects and infrastructure.
>
> You probably need some additional staff resources to specifically cover:
>
>  o Someone who understands the research/work and can talk to end 
> users intelligently about how to use/integrate/run/troubleshoot the 
> cluster application mix. This person needs to understand the 
> science, research and applications involved, and probably also needs 
> to be a bit of a shell/perl toolsmith who can assist with workflows 
> and application integration. This person could actually be recruited 
> from the ranks of the users if there is a particular expert "power 
> user" who would be interested in the role.
>
> o Someone who understands high-performance scientific software 
> development, who can help the cluster admins deal with and 
> troubleshoot the Myrinet interconnect while also being able to help 
> the end users with HPC compiler issues, software dev issues and 
> application optimization issues.
>
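
The kind of glue that toolsmith person ends up writing is mundane but 
valuable. A hypothetical sketch (the queue name, scratch path and the 
"mysolver" binary are all made up) of a wrapper that stages an input 
file and submits it through LSF's bsub:

#!/usr/bin/env python
# run_sim.py -- hypothetical job wrapper a "power user" toolsmith might
# provide: stage an input file to scratch, then submit it via bsub.
# The queue name, scratch path and solver binary are placeholders.
import os
import shutil
import subprocess
import sys

QUEUE = "normal"        # placeholder LSF queue
SCRATCH = "/scratch"    # placeholder shared scratch area

def submit(input_file, nprocs):
    # Give each run its own working directory under scratch.
    workdir = os.path.join(SCRATCH, os.environ["USER"],
                           os.path.basename(input_file) + ".d")
    if not os.path.isdir(workdir):
        os.makedirs(workdir)
    shutil.copy(input_file, workdir)
    # bsub runs the job from the submission directory; -n is the slot
    # count, -q the queue, -o the output log (%J expands to the job id).
    os.chdir(workdir)
    subprocess.call(["bsub", "-n", str(nprocs), "-q", QUEUE,
                     "-o", "run.%J.out",
                     "mysolver", os.path.basename(input_file)])

if __name__ == "__main__":
    submit(sys.argv[1], int(sys.argv[2]))

Nothing deep, but it is exactly the layer the cluster sysadmins should 
not have to own.
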
> So the big message in my mind is:
>
>  o cluster operators should not be expected to be application experts
>  o cluster operators should not be expected to be HPC coding & 
> scientific software development experts
>  o significant effort needs to be put into training users in how to 
> use the cluster and the interconnect
>
> Short term you may also need an LSF expert on hand to help get the 
> cluster resource policies sorted, but that is short term only, as the 
> cluster admins can pick up the LSF specifics very quickly and easily.
>
>
> -Chris
> bioteam.net
>
>
>
>
> On Jun 6, 2005, at 6:23 PM, David Kewley wrote:
>
>> Hi all,
>>
>> We expect to get a large new cluster here, and I'd like to draw on the
>> expertise on this list to educate management about the personnel
>> needed.
>>
>> The cluster is expected to be:
>>
>> ~1000 Dell PE1850 dual CPU compute nodes
>> master & other auxiliary nodes on similar hardware
>> 1024-port Myrinet
>> Nortel stacked-switches-based GigE network
>> many-TB SAN built on Data Direct & Ibrix
>> Platform Rocks
>> Platform LSF HPC Rocks roll
>> Moab added later, quite possibly
>> tape library backup (software TBD)
>> NFS service to public workstations
>> nine man-weeks of Dell installation support
>> 10 man-days of Ibrix installation support
>>
>> The users will be something like:
>>
>> ~10 local academic groups, perhaps 60 users total
>> several different locally-written or -customized codebases
>> at least one near-real-time application with public exposure
>>
>> We have some experience already with a 160-node Dell cluster that has
>> some of the basic elements listed above, but several of the pieces  will
>> be totally new, and some of the pieces we already have will need
>> greater care.
>>
>> My questions to you are:
>>
>> * How many sysadmins should we plan to have once the cluster is  stable?
>> * Is there indeed any such thing as a "stable" cluster of this  sort, 
>> and
>> if so, should we get additional help during the initial phase of the
>> project, when things are less stable (help beyond the vendor
>> installation support listed above)?
>> * If we need more help in the initial phases, how might we go about
>> finding people?  Contract workers?  Commercial or private
>> consultancies?
>> * Should we look for any specific non-obvious skillset, or would  
>> skilled
>> sysadmins be adequate?
>>
>> And finally:
>>
>> * If we only have one sysadmin, someone who is bright and capable, but
>> is learning as they go, is that too small a support staff?
>> * If one such sysadmin is too little, then what would you expect the
>> impact on the users to be?
>>
>> I have been giving my opinion to management, but I'd really like to  get
>> (relatively unbiased) professional opinions from outside as well.  I
>> thank you for any comments you can make!
>>
>> David
>>
>
>


-- 
Mike Davis                                  jmdavis at mail2.vcu.edu
Director- Research Computing Services       (804) 828-3885 phone
Manager- Supercomputing Systems Group       (804) 828-1961 fax




