Newbie question
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Robert G. Brown rgb at phy.duke.eduSun May 26 09:53:53 PDT 2002
- Previous message: Newbie question
- Next message: Newbie question
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Sat, 25 May 2002, Mike Ladwig wrote: > Hi. I am new to this stuff and am working through Robert Brown's very helpful > "Engineering a Beowulf-style Compute Cluster", and decided to take his advice > to "ask on the beowulf list before tackling a new parallelization project". > > But if this question isn't appropriate here or there is a better place to > ask...please let me know. > > A brief description: I started with an algorithm that does something very > slowly. Each individual calculation takes about 1.1 seconds on a .9 Ghz P3. > Every work unit is made up of at best 100k and at worst 3M calculations to > complete; average is 400k. Several work units have to be completed before a > set of results becomes meaningful, and when some day in a real world system > numerous result sets need to be completed in human-friendly time. > > In preparation for moving to a parallel execution environment, I have > redesigned the algorithm so that a controller spawns off a number of > threads, each of which takes as input a common to the work unit 1.5MB binary > blob and a unique 50KB binary blob. The threads do not communicate between > themselves. After each thread completes it's calculation, it reports > results, requests another unique 50k block, and performs another calculation. > The controller is typically able to detect early completion of a given work > unit based on the results from every 2000 calculations. So it sounds like (after initialization when you send out the common 1.5 MB initialization data) you get a 50KB dataset, work a second or so, send back a "result" (presumed small enough to fit in a single IP packet) and iterate until you complete a work unit. On a single CPU a work unit would take between a bit more than a day to five or six weeks. This sounds like a fairly classic master/slave computation. To see how far you can scale it, you need to estimate or measure some more numbers, in particular how long your IPC's will take relative to your computation. To estimate IPC time on 100BT (presuming that your 50KB can be formatted as one big message) you can use 10 MB/sec as a reasonable rough guess as to best-case available bandwidth). The message could thus take as little as 0.005 seconds. This time will almost certainly be increased, though, by overhead. You can get better idea of overhead by using e.g. netpipe's NPmpi or NPpvm. You'll also have to deal with the time for the return packet to be sent to and received by the master -- I'm going to fairly arbitrarily guesstimate this at 100 microseconds (0.0001 seconds) plus possible library/implementation overhead and ignore it relative to the 0.005+, but this may not be justified. Assuming that processing incoming results and preparing the next 50KB block for a slave takes zero time (relative to 0.005 seconds) on the master (again, might well NOT be true) you can send out 200 50KB blocks per second to slaves AND catch their returned short packet at the end of 1 second of processing. If you try to go beyond this to, say, 300 slaves, all slaves will have to wait close to half a second after returning their results before getting their next chunk of work. If the master has to spend MORE time preparing, sending, receiving a block per slave, you might well find yourself saturating at only 100 slaves, or even fewer. Now, I have no idea how many slaves you can afford or the relative value of your result. 100 cpu clusters aren't THAT expensive. Looking back at your problem, you can reduce your minimum time to order of 1000 seconds from 100,000 seconds with 100 nodes, your maximum from over a month to considerably less than a day. If that latter isn't good enough, and you need to get to even larger numbers of nodes, and the result is WORTH the expense of building a cluster with 500 or 1000 nodes, you will need to consider any of the following (all of which will increase the scaling limit): a) Increase the amount of time spent per calculation relative to the IPC time. If you could arrange it for a node to do 10 seconds of computation per 50KB block, you could scale (in principle) to 2000 nodes, not 200, and instead of taking 500 seconds minimum you could take 50 seconds minimum, presuming that you have $2-3 million dollars to buy the requisite CPUs and care for and feed them. Even your biggest work unit would complete in only 45 minutes or so instead of five or six weeks. b) Decrease the amount of time spent per calculation doing the IPC time. If you went to Gigabit ethernet AND actually realized 100 MB/sec total bandwidths AND things scale perfectly (they probably wouldn't) you'd once again get the factor of ten better node scaling and could run efficiently up to as many as 2000 nodes. c) Go to a multiple masters parallelization -- arrange a sort of a tree of masters, each with 200 slaves, themselves communicating back with a supermaster. Presuming that whatever it is you are doling out in the 50K blocks can itself be at least quasi-independently computed, you can likely arrange for several masters to be handling disjoint distributions and once again get out to more than 200 slaves. If (as is not unlikely) your result is "valuable but not that valuable" and you're talking about 32 node or maybe 64 node clusters tops, then DEPENDING ON HOW MUCH WORK the master has to do to prepare the next 50KB block and deal with the returns, you are probably pretty safe and should get pretty much linear scaling with ordinary switched 100BT -- a 64 CPU cluster (plus a master) isn't particularly expensive -- with rackmount dual Athlons I'd guesstimate less about $1000-1100 per cpu, not much different for rackmount single athlons or single P4's, a few hundred less per CPU if you use tower cases and steel shelving. Your next step in the design phase is probably to write a prototype and measure both parallel speedup and (if possible) the communication times. Think clever thoughts about the 50KB block and the resultant calculation. Can you increase the "size" (completion time) of the calculation RELATIVE to the communication time of the block? If so, this is "good". In a lot of cases, this is possible -- for example, if the master is just incrementing some counters and generating the next set of data in a predictable sequence, you might be able to send the counter boundaries and let the slaves generate a whole sequence per IPC. In others it isn't -- each slave gets a unique set of numbers being (say) read from a big file on the master's disk (which, by the way, might take a nontrivial amount of time to locate on disk, read into a buffer, and arrange to send to the next free slave, hence the need to estimate the total amount of SERIAL work done by the master to send an IPC packet off to the next free slave!). Hope this helps. This is a "fun" sort of question as analyzing execution patterns and speedup is moderately rewarding. Be sure to at least consider consulting a book on parallel programming if there is any chance that some part of your underlying algorithms can be parallelized more efficiently than master/slave or embarrassingly parallel. BTW, this may be a discouraging reply in that to work-unit times down from days to weeks to "human timescales" (minutes to hours) will cost you a LOT of systems, but tanstaafl, after all. The absolute best you can hope for (barring superlinear speedup somewhere in your computation) is to speed up your work units linearly, with completion time inversely proportional to the number of participating nodes. As you should know from looking at my book, many computations don't, at least for very large numbers of nodes. Your computation has a reasonable chance of scaling linearly out to quite a few nodes with fairly straightforward cluster designs. That still leaves you with the problem of paying for the nodes and possibly a high-end communications network. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
- Previous message: Newbie question
- Next message: Newbie question
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
