Advice for 2nd cluster installation

Fri Jan 10 09:56:18 PST 2003

On Fri, 10 Jan 2003, Kwan Wing Keung wrote:

> Dear experts,
> 
> We have installed a 32 nodes dual 1 GHz P3 clusters a year already.  Its
> performance is excellent, and the system stability is fairly OK.
> 
> Due to the increase in system loading, we are going to install
> additional nodes in the coming half year.  I would like to have more
> hints from our experts on the following questions:
> 
> (1) Currently most of the major vendors already propose P4 nodes of
>     at least 2.2 GHz.  In view of the difference in processor speed,
>     definitely the new nodes have to be separated from the old nodes.
>     We can either create new batch queues to accommodate these new nodes
>     (but still sharing the old file system), or build an entire
>     new cluster with its own front-node and file system.
>     What will be the relative advantages/disadvantages?

This depends on your work mix.  If your task mix is embarrassingly
parallel (as it sounds like it might be given that you use a batch
queue) and doesn't interact strongly with local resources or the
network, then there is very little disadvantage in perpetually
augmenting a cluster with newer faster nodes.  The biggest problem is
just keeping track of which queues are which and ensuring that one
user/group isn't accidentally starved relative to another by being
inequitably pushed onto the slower nodes.

If your task mix is NOT all EP -- some tasks are synchronous, some tasks
have large communications fractions in addition to computation, some
tasks use lots of data to initialize (that has to be distributed to all
the nodes to start a computation), some tasks create a lot of data on
the nodes (that has to be loaded back to and postprocessed or stored on
a single master node later) -- in any of these cases there can be any of
a variety of problems associated with mixed speed clusters.  For
example, load balancing of a partitioned, synchronous task (where the
overall process is limited to time of completion of each task increment
on the node where it takes the longest to complete) is much more
complicated if the nodes aren't all identical.  If identical, all nodes
can get identical sized chunks.  If the nodes are different, then it may
not be easy to even MEASURE the function for completion time of subtask
as a function of the partitioning, let alone dynamically correct the
partitioning to achieve balance, especially if the subtasks are
themselves not necessarily uniform in their structure.

This is visible even on very simple homogeneous clusters -- something
like a mandelbrot set problem parallelizes very easily by partitioning
the region into equal sized chunks and distributing them, but points in
some regions escape "quickly" (after only a few iterations) while points
in other regions don't escape at all.  One often finds that just a few
strips of points take much longer than the rest, blocking progress until
they complete.  An inhomogeneous cluster just makes this problem worse,
and creates the problem even if all the strips do require roughly the
same computational effort to complete.

Similar considerations exist for any potentially bottlenecked, shared
resource.  For "ordinary" filesystem access, NFS from a single server is
fine on far more than 32 nodes, so if all you do is start up jobs, read
a bit of data in (up to a few MB, perhaps), run a long time, and write
out a bit of data (up to a few MB) there may be no reason at all to
create a second cluster with a duplicate head node/server structure.  If
you plan to start up a job, read in a 2 GB data file, process for a
relatively short time, and then write back a 2 GB data file, well,
you've got a problem that requires careful thought to distribute nicely
even on 32 nodes, and NFS may not be the answer.

So the answer is, as one might expect, "it depends".  I have generations
of increasing speed nodes in several miniclusters working from a common
set of servers on a common, flat network, and for my EP tasks this is
just fine -- I track node speed mentally and distribute work
accordingly.  Other people might find this collection of resources and
its architecture almost completely useless for their problem(s).  Others
might be in between. YMMV, and you'll have to analyze your problem mix
to figure this sort of thing out.

> 
> (2) If we real intend to build a completely new cluster to house all the
>     new nodes, i.e. with its own front-node and file system, is it
>     possible for us to build some backup resilience between the 2
>     clusters as well?

Sure, lots of ways (depending on what you mean).  Do you mean backup
resilience in the sense of mirroring/synchronizing data between them on
some reasonable granularity?  See rsync and rdump.  Do you mean
operational resilience so that if one (sync'd) server fails the other
can take over?  Sure, just put two NICs into your servers and
interconnect the surviving server across the two (necessarily
connectionally contiguous) lans.  And you might (as noted above) find
that just one server works fine in this configuration for both clusters
even if you do keep them on separate lans.

It is relatively easy to set up hot spare servers and the like,
especially if you can tolerate a loss of (say) the last eight hours of
filesystem changes on a serious crash.  Just rsync'ing two servers three
times a day will give you that.  Make sure that you use fast disk(s) and
have adequate memory and horsepower in the servers, though, so they can
function satisfactorily during the rsync.

>     Another way is just a cross-mounting of two file servers.
>     Likely the postgrads will be on the old server while
>     the researchers will be on the new one.  During normal operation,
>     each cluster is only going to use its local file system, but the two
>     servers will be "rsych" during the night time.

Yeah, like this.  Lots of ways to engineer it.

>     In case a file system is inaccessible,  all users will be allowed
>     to access the remaining available file system (after the sysadm. has
>     done some work).
> 
>     This sounds complicated, but should be much cheaper.  Any expert
>     has such experience?

It isn't that complicated, and can be as cheap as "free" (runs on
existing resources).  Alternatively you can keep your servers on
same-day contracts from e.g. Dell or maintain a host that can function
as a not-quite-hot spare, use rdump to a dedicated hot-backup host a few
times a day, and count on rapid disk replacement followed by rapid
restore from the hot-backup host to the repaired system or reconfigured
spare to get back online in, say, two to six hours.

We do something like that here -- we have a RAID (and yes, it also
crashed in the last year:-p but with an rdump backup of the primary
workspace -- which also means nearly-instantaneous restores of files
users just trash, as they often do -- plus a decent server contract
there are several ways to bring our primary server back online in a
matter of hours.  Most of these ways come at the cost of maybe losing
work done in the last 0-24 hours, and should not be considered an
adequate replacement for a real tape backup scheme.

> (3) Most of the major vendors proposed blade server approach as alternate
>     proposal to the conventional 1U server.  By stuffing 14
>     processors board into a blade centre (actually just another type of
>     rack-mounted chassis occupying 7U), the "processor density" can be
>     double.

The issue here is more one of whether or not space is "expensive" to
you.  If not, blade solutions are going to be relatively more expensive
per flop and relatively less powerful in terms of networking and storage
options and configurations beyond a single/mere latency question.  In
order of cost per raw aggregate GFLOP (by whatever measure you like) it
runs blade, rackmount, tower/shelving.  In terms of configurability and
available node options, the order is reversed -- access to the full bus
and maximal disk and cooling options in a tower, usually a riser subset
of 1-3 slots in a rackmount chassis, and quite possibly no bus at all or
a single expansion option in a blade design.

Consequently, one usually chooses a cluster configuration based at least
in part on the "cost" of space to you.  If you have a gym-sized room,
mostly empty and nicely climate controlled, and only plan to EVER own
maybe 64 to 128 nodes at any one time, heck, save the money, buy more
nodes, and use tower chassis and heavy duty shelving from Home Depot.
If you have a decent sized server room (maybe five to ten meters square)
you're more likely to need to go with rackmounts to keep things neat and
clean and provide room to work and for additional systems as time
passes.  If you have a glorified broom closet with a window unit for an
air conditioner, a blade system suddenly looks very attractive.

There can of course be alternative ways costs and benefits can be
locally tallied to push toward one or the other configuration; this is
just intended to be illustrative.  The point is, do a cost-benefit
analysis, being fairly honest about costs and benefits in your local
environment, and be properly skeptical about vendor's or integrator's
claims for the same.  Whatever you do should make sense, and not just be
done because that was the way you thought everybody did it.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu