[Beowulf] Project Planning: Storage, Network, and Redundancy Considerations

Mon Mar 19 08:23:58 PDT 2007

Hey list,

I am seeking some advice regarding our latest project.  Currently, our 
shop runs 5 different clusters of varying size and handles the 
maintenance and administration of each.  I've been planning for some 
time to finally consolidate all of these machines together under a 
single head-node with a common storage pool (for /home, /opt, 
/usr/local), utilizing SGE for our resource management.  A lot of times 
on this list, the point comes up that many things depend upon your 
applications so I'll make it clear here:  Our "application" is quite 
varied.  Our users come from a wide variety of disciplines and the 
nature of our group is as a sort of tier-2 scientific computing lab 
where we provide hardware, development environments, and support for 
developing and running a wide range of applications, i.e. general-purpose.

We have fairly robust systems in place for node provisioning (an 
in-house system that utilizes kickstart and anaconda and supports 
multiple architectures) and resource management (SGE has proven extremely 
reliable and more than capable of managing our fairly quaint resources).

Currently, my two largest problems are figuring out our storage needs 
(in terms of device bandwidth and throughput) and our network needs.  
When all is said and done, this is the hardware I expect to have:

~60x 16GiB RAM, Dual-Dual-Core AMD Opterons, IB-connected, 
GigE-connected, with modest local storage
8x 16GiB RAM, Dual-Dual-Core AMD Opterons & 24x 8GiB RAM, Dual-Opterons, 
Myrinet-connected, GigE-connected, modest storage (cluster1)

We wish to add to this cluster the following existing configurations:
12x AMD Opteron 246, 4GiB RAM, Myrinet-connected, etc. (cluster2)
38x AMD Opteron 246, 4GiB RAM, GigE-connected (clusters 3 & 4)
~40x Intel P4 Xeon @2.66GHz, 2GiB RAM, GigE-connected (cluster5)

Yes, I know the last sets of machines are approaching (or already are) 
legacy status (especially the last batch), but these machines are still 
useful at running the problems they were originally purchased for 
(especially the Opterons), and are still very good at some other general 
tasks (Distributed Matlab, commercial FE codes, instructional use, etc). 

Currently, each cluster has its own local storage, averaging about 1TB 
each.  We have about 4TB of total data across all of 
these machines but anticipate this number possibly doubling within the 
next 12-18 months.  The first phase of this plan (which must occur in 
concert with the second) is to consolidate all of these disparate arrays 
into one volume that is accessible by every node in the cluster.  I know 
that some of the supercomputing centers like NCSA have dealt with much 
larger-scale storage issues than this so I'd love to hear from one of 
you.  The current ideas that we have been floating around include the 
following:

1. Proprietary parallel storage systems (like Panasas, etc.):  It 
provides the per-node bandwidth, aggregate bandwidth, caching 
mechanisms, fault-tolerance, and redundancy that we require (plus having 
a vendor offering 24x7x365 support & 24-hour turnaround is quite a breath 
of fresh air for us).  The price point is a little high for the amount of 
storage that we would get, though: little more than double our current 
overall capacity.  As far as I can tell, I can use this device as a 
permanent data store (like /home) and also as the user's scratch space 
so that there is only a single point for all data needs across the 
cluster.  It does, however, require the installation of vendor kernel 
modules which do often add overhead to system administration (as they 
need to be compiled, linked, and tested before every kernel update).

2. Separate /home and /scratch volumes.  /home would be NFS exported 
read-only to all hosts (to prevent writes during run-time).  The volume 
would reside on one or two file servers (Sun's Thumper/X4500, etc. 
either on JFS or GFS (or perhaps ZFS???), depending on hardware) and at 
current prices, we would be able to acquire around 20TB.  We would 
double this purchase and provide the same setup off-site for redundancy 
(including our tape-backup regime).  Bandwidth for reads is more than 
sufficient for the needs of our current users.  The scratch space would 
consist of 8-12 nodes, each with 0.5 TB of RAID1 storage, utilizing either 
PVFS2 (which has worked exceptionally well for us previously) or Lustre 
(which we have not tested very well yet).  Both require separate kernel 
modules (this seems to be a recurring theme) and hence some additional 
administration.  Neither is well-suited for general tasks such as 
compiling (though there are ways around this) or problems involving many 
short writes, but most of the applications being run do not fit this 
profile.  8-12 nodes should provide us between 3 and 6TB of usable scratch.  
We would like a little more, but again, this is sufficient for our 
current usage patterns.  The pricing for this might be somewhat less 
than the proprietary system described above.
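
To put rough numbers on option 2, here is a minimal sketch of the 
arithmetic in Python.  The function names are made up for illustration, 
and it assumes each scratch node contributes 0.5 TB after RAID1 
mirroring; the result is a raw figure before any filesystem overhead, 
so real usable space will be somewhat lower.

```python
# Back-of-the-envelope sizing for option 2 (separate /home and /scratch).
# Assumption: each scratch node contributes 0.5 TB after RAID1 mirroring.

def scratch_capacity_tb(nodes, usable_per_node_tb=0.5):
    """Aggregate scratch capacity in TB, before filesystem overhead."""
    return nodes * usable_per_node_tb


def headroom(volume_tb, data_tb):
    """How many times the given data set fits into the volume."""
    return volume_tb / data_tb


if __name__ == "__main__":
    for n in (8, 12):
        print(f"{n} scratch nodes -> {scratch_capacity_tb(n):.1f} TB")

    current_tb = 4.0               # total data today, from the text above
    projected_tb = current_tb * 2  # "possibly doubling" in 12-18 months
    home_tb = 20.0                 # the ~20TB /home purchase in option 2
    print(f"/home headroom after growth: {headroom(home_tb, projected_tb):.1f}x")
```

Under those assumptions the 20TB /home volume still holds the projected 
data set two and a half times over.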

Can anyone suggest any other approaches to this problem?

We also have a problem regarding how to link these clusters together 
over a single network fabric (GigE).  It will be possible for all nodes 
to utilize this network for Message Passing, but it is highly improbable 
that such a scenario will ever be played out since almost all of our MPI 
jobs will no doubt run on either the Infiniband or Myrinet nodes (there 
are SGE policies in place to help ensure this). 

Currently, each cluster has its own GigE network for provisioning, 
administration, and resource management.  Some of these hosts utilize it 
for communications (clusters 3, 4, & 5) and all of them will no doubt 
need to utilize it for filesystem access.  Clusters 3 and 4 can be 
consolidated to a single GigE HP switch that will have a couple of ports 
left over.  Cluster 5 will have to be kept as-is and clusters 1 and 2 
will fit on a single switch as well.  I have discussed with our campus 
network admin the possibility of using two recent Cisco switches that 
would support failover and load balancing as a redundant and 
high-bandwidth "trunk" for each of these networks, obviously with the 
capacity to grow in the future.  Each of our existing 3 switches would 
have up to two links to each "trunk" switch and our file servers (in 
whichever configuration we eventually choose) would also be attached to 
these switches.  There should be enough bandwidth to go around under 
this plan.  I'm just curious if this seems doable and if it is, are 
there any obvious pitfalls that I have overlooked?  Is there perhaps a 
better way to approach this (perhaps a single, large switch instead)?
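
One way to sanity-check "enough bandwidth to go around" is to estimate 
the oversubscription of each edge switch.  The sketch below is only 
that: it assumes every host link and every uplink runs at 1 Gb/s and 
that each edge switch gets the full two links to each of the two trunk 
switches (4 uplinks), neither of which is settled yet.  Host counts are 
my reading of the hardware lists above, and cluster 5 is approximate.

```python
# Rough oversubscription estimate for each edge switch under the trunk plan.
# Assumptions: 1 Gb/s host links, 1 Gb/s uplinks, 4 uplinks per edge switch.

def oversubscription(hosts, uplinks, host_gbps=1.0, uplink_gbps=1.0):
    """Worst-case host demand divided by uplink capacity for one switch."""
    if uplinks <= 0:
        raise ValueError("a switch needs at least one uplink")
    return (hosts * host_gbps) / (uplinks * uplink_gbps)


# Host counts read off the hardware lists above; cluster 5 is "~40".
EDGE_SWITCHES = {
    "clusters 1+2": 8 + 24 + 12,
    "clusters 3+4": 38,
    "cluster 5": 40,
}

if __name__ == "__main__":
    for name, hosts in EDGE_SWITCHES.items():
        ratio = oversubscription(hosts, uplinks=4)
        print(f"{name}: {hosts} hosts, {ratio:.1f}:1 oversubscribed")
```

If the trunk switches end up running in pure failover rather than load 
balancing, only half of those uplinks carry traffic at once, so calling 
the function with uplinks=2 gives the more pessimistic figure.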

Our final problem is a relatively simple one though I am definitely a 
newbie to the H.A. world.  Under this consolidation plan, we will have 
only one point of entry to this cluster and hence a single point of 
failure.  Have any beowulfers had experience with deploying clusters 
with redundant head nodes in a pseudo-H.A. fashion (heartbeat 
monitoring, fail-over, etc.) and what experiences have you had in 
adapting your resource manager to this task?  Would it simply be more 
feasible to move the resource manager to another machine at this point 
(and have both headnodes act as submit and administrative clients)?  My 
current plan is unfortunately light on the details of handling SGE in 
such an environment.  It includes purchasing two identical 1U boxes 
(with good support contracts).  They will monitor each other for 
availability and the goal is to have the spare take over if the master 
fails.  While the spare is not in use, I was planning on dispatching 
jobs to it. 
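
To make the "monitor each other and take over" idea concrete, here is a 
toy sketch of the logic the spare would run.  It is not how any 
particular H.A. package or SGE does it; the class name and the 
three-missed-heartbeats threshold are invented for illustration, and it 
ignores the hard parts (fencing the old master, split-brain, moving the 
resource manager's state).

```python
# Toy failover logic for the spare head node.
# Assumption: some external check reports, once per interval, whether a
# heartbeat from the master arrived. Threshold of 3 is arbitrary.

class FailoverMonitor:
    """Promotes the spare after too many consecutive missed heartbeats."""

    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self.missed = 0
        self.role = "spare"

    def observe(self, heartbeat_received):
        """Record one heartbeat interval and return the current role."""
        if self.role == "master":
            return self.role          # already promoted; nothing to do
        if heartbeat_received:
            self.missed = 0           # master is alive, reset the counter
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.role = "master"  # take over as head node
        return self.role


if __name__ == "__main__":
    monitor = FailoverMonitor(max_missed=3)
    for beat in [True, False, False, True, False, False, False]:
        print(beat, "->", monitor.observe(beat))
```

Requiring several consecutive misses keeps a single dropped packet from 
triggering a takeover, at the cost of a slower reaction to a real 
failure.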

There are a number of blanks in this plan currently (and I have 
a month in which to fine-tune the rest of it), so if anyone would 
be kind enough to offer suggestions on how to fill in a few, I'd 
appreciate it. 

Thanks to all in advance for any help!

Brian Smith

-- 
--------------------------------------------------------
+ Brian R. Smith                                       +
+ HPC Systems Analyst & Programmer                     +
+ Research Computing, University of South Florida      +
+ 4202 E. Fowler Ave. LIB618                           +
+ Office Phone: 1 (813) 974-1467                       +
+ Mobile Phone: 1 (813) 230-3441                       +
+ Organization URL: http://rc.usf.edu                  +
--------------------------------------------------------