[Beowulf] Project Planning: Storage, Network, and Redundancy Considerations

Mon Mar 19 13:00:30 PDT 2007

John,

John Hearns wrote:
> Brian R. Smith wrote:
>> Hey list,
>>
>> 1. Proprietary parallel storage systems (like Panasas, etc.):  It 
>> provides the per-node bandwidth, aggregate bandwidth, caching 
>> mechanisms, fault-tolerance, and redundancy that we require (plus 
>> having a vendor offering 24x7x365 support & 24 hour turnover is quite 
>> a breath of fresh air for us).  Price point is a little high for the 
>> amount of storage that we will get though, little more than doubling 
>> our current overall capacity.  As far as I can tell, I can use this 
>> device as a permanent data store (like /home) and also as the user's 
>> scratch space so that there is only a single point for all data needs 
>> across the cluster.  It does, however, require the installation of 
>> vendor kernel modules which do often add overhead to system 
>> administration (as they need to be compiled, linked, and tested 
>> before every kernel update).
>
> If you like Panasas, go with them.
> The kernel module thing isn't all that big a deal - they are quite 
> willing to 'cook' the modules for you.
> but YMMV
>
After some discussion, it came to my attention that it might not be the 
best solution.  I will probably still need to fork over for a separate 
/home solution.  I'll contact them anyway just to be sure.
>>
>> Our final problem is a relatively simple one though I am definitely a 
>> newbie to the H.A. world.  Under this consolidation plan, we will 
>> have only one point of entry to this cluster and hence a single point 
>> of failure.  Have any beowulfers had experience with deploying 
>> clusters with redundant head nodes in a pseudo-H.A. fashion 
>> (heartbeat monitoring, fail-over, etc.) and what experiences have you 
>> had in adapting your resource manager to this task?  Would it simply be more 
>> feasible to move the resource manager to another machine at this 
>> point (and have both headnodes act as submit and administrative 
>> clients)?  My current plan is unfortunately light on the details of 
>> handling SGE in such an environment.  It includes purchasing two 
>> identical 1U boxes (with good support contracts).  They will monitor 
>> each other for availability and the goal is to have the spare take 
>> over if the master fails.  While the spare is not in use, I was 
>> planning on dispatching jobs to it.
>
> I have constructed several clusters using HA.
> I believe Joe Landman has also - as you are in the States why not give 
> some thought to contacting Scalable and getting them to do some more 
> detailed designs for you?
>
> For HA clusters, I have implemented several clusters using Linux-HA 
> and heartbeat. This is an active/passive setup, with a primary and a 
> backup head node. On failover, the backup head node starts up cluster 
> services.
> Failing over SGE is (relatively) easy - the main part is making sure 
> that the cluster spool directory is on shared storage.
> And mounting that shared storage on one machine or the other :-)
Yeah, they have good failover support and we are already running the 
Berkeley database (I was planning on this happening one day), so moving 
over to a master/shadow configuration should be easy.  Shared storage will 
be whatever we end up purchasing for that purpose, so it will be 
available.  I've always run SGE over an NFS share.
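
As a rough illustration of the master/shadow idea (not SGE's actual 
implementation), the shadow host only needs to decide when the master's 
heartbeat on shared storage has gone stale.  The sketch below assumes a 
heartbeat timestamp that the master refreshes periodically; the class 
name, the 60-second timeout, and the "three misses in a row" rule are all 
hypothetical choices made for the example.

```python
class ShadowMonitor:
    """Decide when a shadow host should take over from the master.

    Hypothetical sketch: the timeout and miss count are illustrative
    defaults, not values taken from SGE or Linux-HA.
    """

    def __init__(self, timeout=60.0, misses_required=3):
        self.timeout = timeout                  # seconds before a heartbeat counts as stale
        self.misses_required = misses_required  # consecutive stale checks needed to fail over
        self.misses = 0

    def observe(self, last_beat, now):
        """Record one check; return "takeover" or "standby"."""
        if now - last_beat > self.timeout:
            self.misses += 1
        else:
            # A fresh heartbeat resets the count, so brief hiccups
            # do not accumulate into a failover.
            self.misses = 0
        return "takeover" if self.misses >= self.misses_required else "standby"
```

In practice last_beat might come from the modification time of a 
heartbeat file in the shared spool directory, checked at a fixed 
interval.  Requiring several consecutive misses is one way to avoid 
failing over on a single slow NFS response.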
>
> The harder part is failing over NFS - again I've done it.
> I gather there is a wrinkle or two with NFS v4 on Linux-HA type systems.
>
Shouldn't be a problem.  NFS will be served from a dedicated host and 
will have an off-site mirror that can take its place over a VLAN.  Not as 
fast, but the data is there and the line is dedicated.  I'll have to 
work on other failover plans (perhaps mirroring in the same room, with 
tapes for "off-site"-edness?).
> The second way to do this would be to look at using shared storage,
> and using the Gridengine queue master failover mechanism. This is a 
> different approach, in that you have two machines running, using 
> either a NAS type storage server or Panasas/Lustre. The SGE spool 
> directory is on this, and the SGE qmaster will start on the second 
> machine if the first fails to answer its heartbeat.
>
>
> ps. 1U boxes? Think something a bit bigger - with hot swap PSUs.
> You also might have to fit a second network card for your HA heartbeat 
> link (link plural - you need two links) plus a SCSI card, so think 
> slightly bigger boxes for the two head nodes.
Yes, good recommendations (no SCSI needed, thankfully).  Those are 
definitely a couple of factors I forgot to consider when opting for 1U.
> You can spec 1U nodes for interactive login/compile/job submission 
> nodes. Maybe you could run a DNS round robin type load balancer for 
> redundancy on these boxes - they should all be similar, and if one 
> stops working then ho-hum.
>
> pps. "when the spare is not in use dispatching jobs to it"
> Actually, we also do a cold failover setup which is just like that, 
> and the backup node is used for running jobs when it is idle.
>
>
Thanks a lot for the help!

-Brian
>
>

-- 
--------------------------------------------------------
+ Brian R. Smith                                       +
+ HPC Systems Analyst & Programmer                     +
+ Research Computing, University of South Florida      +
+ 4202 E. Fowler Ave. LIB618                           +
+ Office Phone: 1 (813) 974-1467                       +
+ Mobile Phone: 1 (813) 230-3441                       +
+ Organization URL: http://rc.usf.edu                  +
--------------------------------------------------------