[Beowulf] While the knives are out... Wulf Keepers

Mike Davis jmdavis1 at vcu.edu
Mon Aug 21 14:25:18 PDT 2006

Robert G. Brown wrote:
> 
> Remember I'm just such a type as well.  So are a whole lot of primary
> contributors on this list.  Building and USING a cluster to perform
> actual work provides one with all sorts of real world experience that
> goes into building your next one, or helping others to do so.  Many
> people -- e.g. Greg Lindahl or Joe L. or Jim L.  -- seem to interpolate
> these worlds and more and use clusters to do research, engineer and
> manage clusters, do corporate stuff with or for clusters.
>
As would I. But I would prefer that they install their own versions and 
libraries in their own directories rather than in the universal, 
system-wide directories.



> Rather what I think he's saying is that in a large cluster environment
> where there are many and diverse user groups sharing an extended
> resource, careless management can cost productivity -- which is
> absolutely true.  Examples of careless management certainly include
> thoughtlessly updating some mission-critical library to solve a problem
> for group A at the expense of breaking applications for groups B and C,
> but this can actually be done just as easily by a professional
> administrator as by a research group.  The only difference is that a
> "cluster administrator" is usually professionally charged with not being
> so careless and with having the view of the whole and the time to
> properly test things and so on.  A good cluster administrator takes this
> responsibility seriously and may well seek to remain in firm control of
> updates and so on in order to accomplish this.

I exercise extreme caution when running research machines. For example, 
my older clusters have both g98 and g03 installed, to avoid possible 
differences between versions in long-running projects. Researchers have 
the choice of which version they will use. In general I do the same 
thing with GAMESS, BLAST, etc.: I will install a new version alongside 
the old one rather than immediately upgrade an existing version when 
there is no specific need to do so.
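
To make that concrete, here is a toy Python launcher in the spirit of 
what I mean. Everything in it is illustrative -- the /opt/g98 and 
/opt/g03 paths and the rungauss name are hypothetical, not our actual 
layout:

#!/usr/bin/env python
"""Toy launcher: run whichever version the user asks for.

Each version lives under its own root (hypothetical paths), so
installing g03 never disturbs a long-running g98 project."""
import os
import subprocess
import sys

# Hypothetical side-by-side install roots, one per version.
VERSIONS = {
    "g98": "/opt/g98/bin/g98",
    "g03": "/opt/g03/bin/g03",
}

def main():
    # The researcher names the version explicitly,
    # e.g. "rungauss g03 input.com".
    if len(sys.argv) < 3 or sys.argv[1] not in VERSIONS:
        sys.exit("usage: rungauss {g98|g03} input-file")
    exe = VERSIONS[sys.argv[1]]
    if not os.path.exists(exe):
        sys.exit("%s is not installed on this machine" % exe)
    # Hand off to the requested binary; never silently upgrade the user.
    sys.exit(subprocess.call([exe] + sys.argv[2:]))

if __name__ == "__main__":
    main()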

> 
> As you observe, ultimately this comes down to good communications and
> core competence among ALL people with root-level access for ANY LAN
> operation (not just cluster computing -- you can do the exact same thing
> in any old LAN).  There are many ways to enforce this -- fascist topdown
> management by a competent central IT group where they permit "no" direct
> user management of the cluster; completely permissive management where
> each group talks over any changes likely to affect others but retains
> privileges to access and root-manage at least the machines that they
> "own" in a collective cluster (yes, this can work and work well and is
> in fact workING in certain environments right now), something like COD
> whereby any selected subcluster can be booted in realtime into a user's
> own individually developed "cluster node image" via e.g. DHCP so that
> while you're using the nodes you TOTALLY own them but cannot screw up
> access to those same nodes when OTHER people boot them into THEIR own
> image, and lots more besides including topdown not-quite-so-conservative
> management (which is probably the norm).
> 

Many of us struggle with this. There are certainly good reasons for 
individual images. If people like Joe, Greg, and Jim are making those 
images, I feel pretty good about it.
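
For anyone who hasn't seen the COD-style trick rgb describes, the 
mechanism largely comes down to pointing each node's DHCP/PXE entry at 
the image that currently "owns" it. A rough Python sketch -- every 
hostname, MAC, and image name below is invented:

#!/usr/bin/env python
"""Sketch: emit ISC dhcpd host stanzas that PXE-boot a set of nodes
into one group's image.  All names, MACs, and paths are invented."""

# Hypothetical node inventory: hostname -> MAC address.
NODES = {
    "node01": "00:11:22:33:44:01",
    "node02": "00:11:22:33:44:02",
}

def stanza(host, mac, image):
    # Each host stanza points "filename" at the chosen image's PXE
    # payload, so rebooting the node hands it to a new owner.
    return ('host %s {\n'
            '    hardware ethernet %s;\n'
            '    filename "pxelinux/%s";\n'
            '}\n') % (host, mac, image)

if __name__ == "__main__":
    image = "physics-group.img"  # whichever image owns these nodes now
    for host, mac in sorted(NODES.items()):
        print(stanza(host, mac, image))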

One way that we resolve this issue is with a technology refresh plan 
that upgrades the computational infrastructure on a three-year cycle. 
For example, two years ago we purchased a 64p Xeon cluster. Last year 
we cooperated with physics researchers to purchase a 200p Opteron 
cluster. This year we are bringing up a 100p addition to the Opterons. 
Next year we will add ~300 processors and begin retiring the original 
64p Xeon cluster, but we will still have gone from 64 to 600 processors 
over that three-year period.
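
Taking the ~300 as exactly 300, the bookkeeping works out like this:

# Processor count over the refresh cycle described above.
count = 64           # 64p Xeon cluster
count += 200         # 200p Opteron cluster
count += 100         # 100p Opteron addition
count += 300 - 64    # ~300p added, 64p Xeons retired
print(count)         # -> 600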


> At a guess, Really Big Clusters -- ones big enough to have a full time
> administrator or even an administrative group -- are going to strongly
> favor topdown fascist administration as there are clear lines of
> responsibility and a high "cost" of downtime.  For these to be
> successful there have to be equally firm open lines of communication, so
> that researchers' work is (safely and competently) enabled regardless of
> the administration skills of members of any group.  Larger shared
> corporate clusters are also likely to very often fall into this
> category, although there are also many exceptions I'm sure at the
> workgroup level.  Small research-group owned clusters are likely as not
> to be locally owned and operated even today.  In between you're bound to
> see almost anything.
> 

Communication is definitely paramount. I meet with researchers, PIs, 
and postdocs almost daily. Part of that is simple outreach: showing 
people the new capabilities that we have and the performance 
improvements that they can expect. Part of it is building the trust 
that is necessary for the kind of cooperation that helps the Institution.

The "high cost" of downtime is very real. Unexpected downtime will bring 
calls to the Provost. I work to minimize that unexpected downtime. And, 
its not easy.

Last week an issue with an upgrade to the University backbone cost one 
of my subnets external communication for 16 hours. There was no effect 
on running jobs, but there was no way for users to log in between 6pm 
and 10am. That's not good. The network change notification specified 
intermittent short outages as the upgrade progressed, not a complete 
loss of communication for 16 hours. When the subnet had been down for 
3 hours, I began emailing users to let them know the situation and what 
was being done to correct it. Two clusters reside on that one subnet; 
other clusters, including our Opterons, were not affected.
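
As an illustration, the kind of simple watchdog that would have flagged 
this sooner looks something like the Python sketch below, run from a 
box outside the affected subnet. The probe address, the three-hour 
threshold, and the mail addresses are all invented for the example:

#!/usr/bin/env python
"""Sketch: mail cluster users when external connectivity has been
down past a threshold.  Addresses and host names are invented."""
import os
import smtplib
import time

PROBE = "192.0.2.1"            # hypothetical host on the cut-off subnet
THRESHOLD = 3 * 3600           # start mailing after 3 hours down
USERS = ["users@example.edu"]  # hypothetical notification list

def reachable(host):
    # One ICMP probe; exit status 0 means the host answered.
    return os.system("ping -c 1 -w 5 %s >/dev/null 2>&1" % host) == 0

down_since = None
while True:
    if reachable(PROBE):
        down_since = None
    elif down_since is None:
        down_since = time.time()
    elif time.time() - down_since > THRESHOLD:
        server = smtplib.SMTP("localhost")
        server.sendmail("root@cluster.example.edu", USERS,
                        "Subject: subnet outage\n\n"
                        "External communication has been down 3+ hours.")
        server.quit()
        down_since = time.time()  # restart the clock so we don't spam
    time.sleep(60)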

But there is no doubt that we must all communicate and cooperate to make 
things work in both the big and small pictures.


Mike Davis



