[Beowulf] number of admins
Sean Dilda agrajag at dragaera.net
Wed Jun 8 10:01:06 PDT 2005
David Kewley wrote:
> My questions to you are:
>
> * How many sysadmins should we plan to have once the cluster is stable?
>
> * If we only have one sysadmin, someone who is bright and capable, but
>   is learning as they go, is that too small a support staff?
> * If one such sysadmin is too little, then what would you expect the
>   impact on the users to be?

The answer to this question depends on what kind of non-sysadmin support staff will be around. I personally am the sole sysadmin for a ~300 node cluster with 7 research groups and over 100 users. Under current conditions, I can probably scale up towards 1000 nodes without a problem.

With that said, I have a lot of non-sysadmin support to help me out. There's a guy who does a lot of scientific computing support. He helps researchers write and optimize code, and also helps them with issues with Fortran compilers, etc. I also have a 24-hour Operations staff I can rely on. They take care of the server room. If I ever need a node rebooted or anything like that, I can give them a call and they'll take care of it. When a piece of hardware breaks, I pass the information on to them so they can sit on hold with Dell, handle shipping/receiving/etc., and all I have to do is turn the box off and switch out the part.

Without these people to help me, I'd probably be at my limit now. But because I have them to help, I can handle a bit more. I don't think my man-hours are stretched by the number of nodes as much as by the number of user requests and the number of different hardware models in the cluster. Those are the things that can really eat up time.

As for specific skills to look for, I recommend someone who knows the Linux distro well and is familiar with maintaining a large number of identical machines (clustered or not). We use CentOS (a Red Hat Enterprise Linux clone) on the cluster. By using tools like yum and kickstart, I've been able to minimize the amount of work required to keep up with hundreds of OS images. These same technologies are regularly used in computing labs, large web server farms, etc.

While I came into this job already familiar with Beowulf technology, I'm no MPI expert, and I hadn't even used SGE (our scheduler of choice) before I got here. I was able to pick up all the SGE knowledge I needed while I was here. The thing that I think really made it work well for me was knowing what I did about yum and kickstart, so that I had time to learn SGE and handle user requests.

I hope this helps.

Sean
