heterogenous clusters
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Alan Ward award at mypic.adMon Jul 3 09:47:22 PDT 2000
- Previous message: Apps & Design
- Next message: Apps & Design
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Thanks for the comments. I think I'll try varying the slice lengths though - more convenient in my case. As for Java speed, JDK 1.3 can be about as fast as C (!). This is unlike older versions of the JVM that I agree were relatively slow. Some hard evidence: these are the results of a calculation-intensive program using 64-bit FP, run on a single 166 MHz node with no networking. C under Linux : 28.66 s Java 1.1.6 under Linux : 51.01 s Java 1.1.6 under Windows : 85.32 s Java 1.3 under Windows : 23.68 s These are averages over several runs. Similar results were obtained on an AMD 400. Though I haven't tried Java 1.3 under Linux yet, it should run quite well. Best regards, Alan Ward ---------- De: Luc.Vereecken at chem.kuleuven.ac.be A: award at mypic.ad; beowulf at beowulf.org Asunto: Re: heterogenous clusters Fecha: dilluns, 3 / juliol / 2000 06:07 I used MPI. It has the advantage of transparantly translating machine- specific data representation (big/little-endian,...) while still being fast enough (faster that Java I bet). While the heterogenous approach doesn't allow vendor-optimised versions (IBM's MPI on SP,...) a distribution like MPICH runs on enough platforms to give you the interconnectivity you want. >2) When you slice up a job to distribute it between nodes, >how does slice size affect performance when large speed >differences exist (i.e. ratios of 1:2 up to 1:5)? >My experience indicates that (a) too small a slice gives you a >large communication overhead, and (b) too large a slice makes >you lose time at overall completion while the slower nodes >finish up their last slice. I use two ways to get around the speed difference problem (which in our setup is around 1:10 or 1:15, depending on which machines I use). 1. Don't use slices, but groups of slices, where the slices are fairly small (but e.g. fixed in size if it is easier to deal with in your application), and the groups contain a variable number of slices. Early in the calculation you distribute groups with a large number of slices, and as the number of remaining slices decreases, you decrease the number of slices in the groups. The faster nodes finish faster, get new groups of slices which early in the calculation are still fairly large (=low communication overhead). As the calculation progresses, the outstanding groups get smaller and smaller, so the slower nodes won't stall the calculation due to the wait time for the last group (=little loss at end of calculation). If you use a good size of "early" group size, and a good function to decrease the size of the groups through time (ending with 1-slice groups), you get most of the benefits of large groups (=low communication overhead), but strongly reduce the idle time at the end of the global process since only small groups are outstanding by then. I found you get most of the benefits and little of the drawbacks even with fairly rough "initial size" and "decreasing function" guesses. Example : 240 slices, 10 processes, I start with 10-slice-groups, and decrease to 1 near the end. Our program is also build on a Master-Slave paradigm. Just in case it is run in a homogenous environment, I avoid idling the Master most of the time, and then flooding it all of a sudden, by not giving all of the Slave processes the same initial number of slices in a group : example : 240 slices, 10 processes, I start with 10-slice-groups, of which I distribute a few, then some 9-slice groups, then some 8-slice groups, until all processes are busy. This distributes the number of Slave-to-Master requests through time. Since the size of the slice-groups changes through time, the chances of the processes getting into sync on accident are remote. It's a fairly coarse method, I admit, but it works like a charm. 2. Start a different number of Slave processes on different processors. On processors which are 5 times faster, I start e.g. 3 Slave processes (reducing the relative speed of the processes running on this processor) When the slaves are idle at the end of the global calculations, they enter a sleep-mode (blocking communinication usually does this automatically, since it waits for a network event). The remaining processes then get more of the fast processor, so the last Slaves tend to work through their last slice faster, again decreasing idle time on the other processors at the end of the calculation. Of course, if you start e.g. polling at the end of the calculation, this method won't work of course. For the MPICH version I use (can't remember the version#) the blocking communication doesn't use polling, so I just make the Slaves blocking-receive a communication from the Master (which in our case are the specificiations of the job they have to to next). Starting 2 or more processes per processor also tends to work nicely without additional programming time if your Master has to work a bit on the results of the Slave before it can send it its next job. The waiting Slave does nothing (make sure it doesn't do polling), so the second Slave gets all (most) of the processor which is then in near 100% productive use. I don't try to "benchmark" the nodes to determine the chunck of work they have to work through. The first method above adjusts automatically if one of the fast processors becomes "slow" (e.g. due to an interactive just sucking CPUtime all of a sudden) Just some idea's that work nicely on our problem. Luc Vereecken
- Previous message: Apps & Design
- Next message: Apps & Design
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
