Advice for 2nd cluster installation
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Robert G. Brown rgb at phy.duke.eduFri Jan 10 09:56:18 PST 2003
- Previous message: Advice for 2nd cluster installation
- Next message: Advice for 2nd cluster installation
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Fri, 10 Jan 2003, Kwan Wing Keung wrote: > Dear experts, > > We have installed a 32 nodes dual 1 GHz P3 clusters a year already. Its > performance is excellent, and the system stability is fairly OK. > > Due to the increase in system loading, we are going to install > additional nodes in the coming half year. I would like to have more > hints from our experts on the following questions: > > (1) Currently most of the major vendors already propose P4 nodes of > at least 2.2 GHz. In view of the difference in processor speed, > definitely the new nodes have to be separated from the old nodes. > We can either create new batch queues to accommodate these new nodes > (but still sharing the old file system), or build an entire > new cluster with its own front-node and file system. > What will be the relative advantages/disadvantages? This depends on your work mix. If your task mix is embarrassingly parallel (as it sounds like it might be given that you use a batch queue) and doesn't interact strongly with local resources or the network, then there is very little disadvantage in perpetually augmenting a cluster with newer faster nodes. The biggest problem is just keeping track of which queues are which and ensuring that one user/group isn't accidentally starved relative to another by being inequitably pushed onto the slower nodes. If your task mix is NOT all EP -- some tasks are synchronous, some tasks have large communications fractions in addition to computation, some tasks use lots of data to initialize (that has to be distributed to all the nodes to start a computation), some tasks create a lot of data on the nodes (that has to be loaded back to and postprocessed or stored on a single master node later) -- in any of these cases there can be any of a variety of problems associated with mixed speed clusters. For example, load balancing of a partitioned, synchronous task (where the overall process is limited to time of completion of each task increment on the node where it takes the longest to complete) is much more complicated if the nodes aren't all identical. If identical, all nodes can get identical sized chunks. If the nodes are different, then it may not be easy to even MEASURE the function for completion time of subtask as a function of the partitioning, let alone dynamically correct the partitioning to achieve balance, especially if the subtasks are themselves not necessarily uniform in their structure. This is visible even on very simple homogeneous clusters -- something like a mandelbrot set problem parallelizes very easily by partitioning the region into equal sized chunks and distributing them, but points in some regions escape "quickly" (after only a few iterations) while points in other regions don't escape at all. One often finds that just a few strips of points take much longer than the rest, blocking progress until they complete. An inhomogeneous cluster just makes this problem worse, and creates the problem even if all the strips do require roughly the same computational effort to complete. Similar considerations exist for any potentially bottlenecked, shared resource. For "ordinary" filesystem access, NFS from a single server is fine on far more than 32 nodes, so if all you do is start up jobs, read a bit of data in (up to a few MB, perhaps), run a long time, and write out a bit of data (up to a few MB) there may be no reason at all to create a second cluster with a duplicate head node/server structure. If you plan to start up a job, read in a 2 GB data file, process for a relatively short time, and then write back a 2 GB data file, well, you've got a problem that requires careful thought to distribute nicely even on 32 nodes, and NFS may not be the answer. So the answer is, as one might expect, "it depends". I have generations of increasing speed nodes in several miniclusters working from a common set of servers on a common, flat network, and for my EP tasks this is just fine -- I track node speed mentally and distribute work accordingly. Other people might find this collection of resources and its architecture almost completely useless for their problem(s). Others might be in between. YMMV, and you'll have to analyze your problem mix to figure this sort of thing out. > > (2) If we real intend to build a completely new cluster to house all the > new nodes, i.e. with its own front-node and file system, is it > possible for us to build some backup resilience between the 2 > clusters as well? Sure, lots of ways (depending on what you mean). Do you mean backup resilience in the sense of mirroring/synchronizing data between them on some reasonable granularity? See rsync and rdump. Do you mean operational resilience so that if one (sync'd) server fails the other can take over? Sure, just put two NICs into your servers and interconnect the surviving server across the two (necessarily connectionally contiguous) lans. And you might (as noted above) find that just one server works fine in this configuration for both clusters even if you do keep them on separate lans. It is relatively easy to set up hot spare servers and the like, especially if you can tolerate a loss of (say) the last eight hours of filesystem changes on a serious crash. Just rsync'ing two servers three times a day will give you that. Make sure that you use fast disk(s) and have adequate memory and horsepower in the servers, though, so they can function satisfactorily during the rsync. > Another way is just a cross-mounting of two file servers. > Likely the postgrads will be on the old server while > the researchers will be on the new one. During normal operation, > each cluster is only going to use its local file system, but the two > servers will be "rsych" during the night time. Yeah, like this. Lots of ways to engineer it. > In case a file system is inaccessible, all users will be allowed > to access the remaining available file system (after the sysadm. has > done some work). > > This sounds complicated, but should be much cheaper. Any expert > has such experience? It isn't that complicated, and can be as cheap as "free" (runs on existing resources). Alternatively you can keep your servers on same-day contracts from e.g. Dell or maintain a host that can function as a not-quite-hot spare, use rdump to a dedicated hot-backup host a few times a day, and count on rapid disk replacement followed by rapid restore from the hot-backup host to the repaired system or reconfigured spare to get back online in, say, two to six hours. We do something like that here -- we have a RAID (and yes, it also crashed in the last year:-p but with an rdump backup of the primary workspace -- which also means nearly-instantaneous restores of files users just trash, as they often do -- plus a decent server contract there are several ways to bring our primary server back online in a matter of hours. Most of these ways come at the cost of maybe losing work done in the last 0-24 hours, and should not be considered an adequate replacement for a real tape backup scheme. > (3) Most of the major vendors proposed blade server approach as alternate > proposal to the conventional 1U server. By stuffing 14 > processors board into a blade centre (actually just another type of > rack-mounted chassis occupying 7U), the "processor density" can be > double. The issue here is more one of whether or not space is "expensive" to you. If not, blade solutions are going to be relatively more expensive per flop and relatively less powerful in terms of networking and storage options and configurations beyond a single/mere latency question. In order of cost per raw aggregate GFLOP (by whatever measure you like) it runs blade, rackmount, tower/shelving. In terms of configurability and available node options, the order is reversed -- access to the full bus and maximal disk and cooling options in a tower, usually a riser subset of 1-3 slots in a rackmount chassis, and quite possibly no bus at all or a single expansion option in a blade design. Consequently, one usually chooses a cluster configuration based at least in part on the "cost" of space to you. If you have a gym-sized room, mostly empty and nicely climate controlled, and only plan to EVER own maybe 64 to 128 nodes at any one time, heck, save the money, buy more nodes, and use tower chassis and heavy duty shelving from Home Depot. If you have a decent sized server room (maybe five to ten meters square) you're more likely to need to go with rackmounts to keep things neat and clean and provide room to work and for additional systems as time passes. If you have a glorified broom closet with a window unit for an air conditioner, a blade system suddenly looks very attractive. There can of course be alternative ways costs and benefits can be locally tallied to push toward one or the other configuration; this is just intended to be illustrative. The point is, do a cost-benefit analysis, being fairly honest about costs and benefits in your local environment, and be properly skeptical about vendor's or integrator's claims for the same. Whatever you do should make sense, and not just be done because that was the way you thought everybody did it. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
- Previous message: Advice for 2nd cluster installation
- Next message: Advice for 2nd cluster installation
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
