[Beowulf] diskless cluster nfs
Robert G. Brown rgb at phy.duke.edu
Wed Dec 8 06:21:51 PST 2004
On Tue, 7 Dec 2004, Josh Kayse wrote:

> Ok, my first post, so please be gentle.
>
> I've recently been tasked to build a diskless cluster for one of our
> engineers. This was easy because we already had an image for the set
> of machines. Once we started testing, the performance was very poor.
> Basic setup follows:
>
> Master node: system drive is 1 36GB SCSI drive
>              /home raid5 5x 36GB SCSI drives
> Master node exports /tftpboot/192.168.1.x for the nodes.
>
> all of the nodes are diskless and get their system from the master
> node over gigabit ethernet.
> All that works fine.
>
> The engineers use files over nfs for message passing, and no, they
> will not change their code to mpi even though it would be an
> improvement in terms of manageability and probably performance.
>
> Basically, my question is: what are some ways of testing the
> performance of nfs and then, how can I improve the performance?
>
> Thanks for any help in advance.
>
> PS: nfs mount options: async,rsize=8192,wsize=8192,hard
>     file sizes: approx 2MB

Is this a trick question? You begin by saying performance is poor. Then you say that you (they) won't take the obvious step to improve your performance. Sigh.

OK, let's start by analyzing the problem. You haven't said much about the application. Testing NFS is a fine idea, but before spending too much time on any single metric of performance let's analyze your cluster and task.

You say you have gigabit ethernet between nodes. You don't say how MANY nodes you have, or how fast/what kind they are (even in general terms), or how much memory they have, or whether they have one or two processors. These all matter.

Then there is the application. If the nodes compute for five minutes, then write a 2 MB file, then read a 2 MB file (or even several 2 MB files) parallel scaling is likely to be pretty good, even on top of NFS. If they compute 0.001 seconds, then write 2 MB and read 2+ MB, parallel scaling is likely to be poor (NFS or not). Why?
If you don't already know the answer, you should check out my online book (http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php) and read up on Amdahl's Law and parallel scaling.

Let's do some estimation. Forget NFS. The theoretical peak bandwidth of gigabit ethernet is 1000/8 = 125 MB/sec (this ignores headers and all sorts of reality). It takes (therefore) a minimum of 0.016 seconds to send 2 MB. In the real world, bandwidth is generally well under 125 MB/sec for a variety of reasons -- say 100 MB/sec, or 0.02 seconds per 2 MB file. If you are computing for only 0.001 seconds and then communicating for 0.04 seconds (0.02 to write 2 MB plus 0.02 to read 2 MB), parallel scaling will be, um, "poor", MPI or NFS notwithstanding.

Once you understand that fundamental ratio, you can determine what the EXPECTED parallel scaling of the application might be in a good world. A good world would be one where each node only communicated with one other node (2 MB each way) AND the communications could proceed in parallel. A worse but still tolerable world might be one where the communications can proceed at least partially in parallel without a bottleneck -- one-to-many communications require (e.g. tree) algorithms or broadcasts to proceed efficiently.

However, you do NOT live in a good world. You have N hosts engaged in what sounds like a synchronous computation (where everybody has to finish a step before going on to the next) with a single communications master (the NFS server). Writing to the NFS server and reading from the NFS server is strictly serialized. If you have N hosts, it will take at least Nx0.02 seconds to write all of the output files from a step of computation, at least Nx0.02 seconds to READ all of the output files (and that's assuming each node just reads one) and now you've got something like 0.001 seconds of computation compared to Nx(0.04) or worse seconds of communication. The more nodes you add, the slower it goes!
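The arithmetic above can be written out as a small Python sketch. This is only a back-of-envelope model under the assumptions stated in the text (2 MB files, roughly 100 MB/sec of usable bandwidth, one file written and one file read per node per step, all transfers serialized through the single NFS server). The function and constant names are made up for illustration; nothing here measures a real cluster.

```python
# Back-of-envelope model of the estimate above (illustrative only).
GIGABIT_PEAK_MB_S = 1000 / 8   # 125 MB/sec theoretical peak
REALISTIC_MB_S = 100.0         # "say 100 MB/sec" (assumed, not measured)
FILE_MB = 2.0                  # approximate file size from the original post


def transfer_time(size_mb, bandwidth_mb_s):
    """Seconds needed to move size_mb megabytes at bandwidth_mb_s."""
    return size_mb / bandwidth_mb_s


def serialized_step_time(n_nodes, t_compute,
                         size_mb=FILE_MB, bandwidth_mb_s=REALISTIC_MB_S):
    """Wall-clock time of one synchronous step when every node writes one
    file to, and reads one file from, a single server that handles the
    transfers one after another."""
    t_xfer = transfer_time(size_mb, bandwidth_mb_s)
    # Nodes compute concurrently; N writes plus N reads are serialized.
    return t_compute + n_nodes * 2 * t_xfer


if __name__ == "__main__":
    print(f"2 MB at peak:      {transfer_time(FILE_MB, GIGABIT_PEAK_MB_S):.3f} s")
    print(f"2 MB at ~100 MB/s: {transfer_time(FILE_MB, REALISTIC_MB_S):.3f} s")
    for n in (1, 2, 4, 8, 16, 32):
        step = serialized_step_time(n, t_compute=0.001)
        print(f"N={n:3d}  time per step = {step:.3f} s")
```

The per-step time grows linearly with N because the communication term is multiplied by the node count while the compute term is not.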
In fact, if you just KEEP the data on a single node it takes Nx0.001 seconds to advance the computation a step compared to 0.001+Nx0.04 seconds in the cluster! Even if you were computing one second instead of 0.001, this sort of scaling relation will kill parallel speedup at some number of nodes.

Note that I've gone into some detail here, because you are going to have to explain this, in some detail, to your engineers after working out the parallel scaling for the task at hand. There is no way out of this or around this. Tuning the hell out of NFS is going to yield at most a factor of 2 or so in speedup of the communications phase, and your problem is probably a SCALING relation and bottlenecked COMMUNICATIONS PATTERN that couldn't care less about factors of 2 and is intrinsic to serialized NFS.

In other words, your engineers are either going to have to accept algebraically derived reality and restructure their code so it speeds up (almost certainly abandoning the NFS communications model) or accept that it not only won't speed up, it will actually slow down when run in parallel on your cluster.

Engineers tend to be pretty smart, especially about the constraints imposed by unforgiving nature. If you give them a short presentation and teach them about parallel scaling, they'll probably understand that it isn't a matter of tweaking something and NFS working, it is a fundamental mathematical relation that keeps NFS from EVER working in this context as an IPC channel.

Unless, of course, you compute for minutes compared to communicating for seconds, in which case your speedup should be fine. That's why you have to analyze the problem itself and learn the details of its work pattern BEFORE designing a cluster or parallelization.
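As a rough illustration of the scaling relation described above, the following Python sketch compares the single-node time (N x t_compute) with the cluster time (t_compute + N x t_comm) for a few compute times. The 0.04 second communication cost per node is the estimate used earlier in this message, not a measurement, and the function names are hypothetical.

```python
# Speedup under the serialized-communication model (illustrative only).
T_COMM = 0.04  # assumed seconds per node per step: write 2 MB + read 2 MB


def speedup(n_nodes, t_compute, t_comm=T_COMM):
    """Single-node time divided by cluster time for one step.

    Single node: does all N pieces of work itself, no communication.
    Cluster: nodes compute concurrently, then N transfers are serialized.
    """
    single = n_nodes * t_compute
    cluster = t_compute + n_nodes * t_comm
    return single / cluster


def speedup_ceiling(t_compute, t_comm=T_COMM):
    """Limit of speedup() as the node count grows without bound."""
    return t_compute / t_comm


if __name__ == "__main__":
    for t_compute in (0.001, 1.0, 300.0):
        print(f"compute time per step = {t_compute} s, "
              f"ceiling = {speedup_ceiling(t_compute):.3g}")
        for n in (2, 8, 32, 128):
            print(f"  N={n:4d}  speedup = {speedup(n, t_compute):.3f}")
```

In this model a value below 1 means the cluster is slower than a single node. With 0.001 seconds of computation the ratio never reaches 1; with one second it flattens out at about 25 no matter how many nodes are added; with five minutes of computation it stays close to N over the node counts shown.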
   rgb

> --
> Joshua Kayse
> Computer Engineering
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
