Robert G. Brown
rgb at phy.duke.edu
Sun May 26 09:53:53 PDT 2002
On Sat, 25 May 2002, Mike Ladwig wrote:
> Hi. I am new to this stuff and am working through Robert Brown's very helpful
> "Engineering a Beowulf-style Compute Cluster", and decided to take his advice
> to "ask on the beowulf list before tackling a new parallelization project".
> But if this question isn't appropriate here or there is a better place to
> ask...please let me know.
> A brief description: I started with an algorithm that does something very
> slowly. Each individual calculation takes about 1.1 seconds on a .9 Ghz P3.
> Every work unit is made up of at best 100k and at worst 3M calculations to
> complete; average is 400k. Several work units have to be completed before a
> set of results becomes meaningful, and some day, in a real-world system,
> numerous result sets will need to be completed in human-friendly time.
> In preparation for moving to a parallel execution environment, I have
> redesigned the algorithm so that a controller spawns off a number of
> threads, each of which takes as input a 1.5MB binary blob common to the
> work unit and a unique 50KB binary blob. The threads do not communicate
> among themselves. After each thread completes its calculation, it reports
> results, requests another unique 50KB block, and performs another calculation.
> The controller is typically able to detect early completion of a given work
> unit based on the results from every 2000 calculations.
So it sounds like (after initialization when you send out the common 1.5
MB initialization data) you get a 50KB dataset, work a second or so,
send back a "result" (presumed small enough to fit in a single IP
packet) and iterate until you complete a work unit. On a single CPU a
work unit would take between a bit more than a day to five or six weeks.
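As a quick sanity check on those figures (a sketch using only the numbers quoted in the question):

```python
# Serial work-unit times from the numbers in the question:
# 1.1 s per calculation, 100k/400k/3M calculations per work unit.
SECONDS_PER_CALC = 1.1

for label, n_calcs in [("best", 100_000), ("average", 400_000), ("worst", 3_000_000)]:
    seconds = n_calcs * SECONDS_PER_CALC
    print(f"{label:>7}: {seconds:>12,.0f} s  (~{seconds / 86_400:.1f} days)")
```

The worst case works out to about 38 days, i.e. the five or six weeks above.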
This sounds like a fairly classic master/slave computation. To see how
far you can scale it, you need to estimate or measure some more numbers,
in particular how long your IPCs will take relative to your computation time.
To estimate IPC time on 100BT (presuming that your 50KB can be formatted
as one big message) you can use 10 MB/sec as a reasonable rough guess at
the best-case available bandwidth. The message could thus take as
little as 0.005 seconds. This time will almost certainly be increased,
though, by overhead. You can get a better idea of the overhead by using e.g.
netpipe's NPmpi or NPpvm. You'll also have to deal with the time for
the return packet to be sent to and received by the master -- I'm going
to fairly arbitrarily guesstimate this at 100 microseconds (0.0001
seconds) plus possible library/implementation overhead and ignore it
relative to the 0.005+, but this may not be justified.
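A sketch of that estimate (the bandwidth and return-latency figures are the guesses above, not measurements):

```python
# Rough per-assignment IPC cost on 100BT, best case.
MSG_BYTES = 50 * 1024          # the 50KB block sent to a slave
BANDWIDTH = 10 * 1024 * 1024   # ~10 MB/sec usable bandwidth (best case)
RETURN_LATENCY = 1e-4          # guessed cost of the short result packet

send_time = MSG_BYTES / BANDWIDTH
print(f"send: {send_time * 1000:.1f} ms, return: {RETURN_LATENCY * 1000:.2f} ms")
```

Real overhead (protocol, library, contention) will push the send time up; that is exactly what NPmpi or NPpvm would measure for you.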
Assuming that processing incoming results and preparing the next 50KB
block for a slave takes zero time (relative to 0.005 seconds) on the
master (again, might well NOT be true) you can send out 200 50KB blocks
per second to slaves AND catch their returned short packet at the end of
1 second of processing.
If you try to go beyond this to, say, 300 slaves, all slaves will have
to wait close to half a second after returning their results before
getting their next chunk of work. If the master has to spend MORE time
preparing, sending, receiving a block per slave, you might well find
yourself saturating at only 100 slaves, or even fewer.
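The saturation argument is just a ratio -- one master can keep roughly (compute time)/(dispatch time) slaves busy. A toy model with the estimates above (assumed numbers, not measurements):

```python
# A master saturates when it spends the whole compute interval dispatching.
def max_slaves(t_compute, t_dispatch):
    """Slaves one master can keep busy: compute time / per-slave dispatch time."""
    return round(t_compute / t_dispatch)

print(max_slaves(1.0, 0.005))   # ~200 slaves with the estimates above
print(max_slaves(1.0, 0.010))   # halves if the master's per-slave work doubles
```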
Now, I have no idea how many slaves you can afford or the relative value
of your result. 100-CPU clusters aren't THAT expensive. Looking back
at your problem, with 100 nodes you can reduce your minimum time from
100,000 seconds to on the order of 1000 seconds, and your maximum from
over a month to considerably less than a day.
If that latter isn't good enough, and you need to get to even larger
numbers of nodes, and the result is WORTH the expense of building a
cluster with 500 or 1000 nodes, you will need to consider any of the
following (all of which will increase the scaling limit):
a) Increase the amount of time spent per calculation relative to the
IPC time. If you could arrange it for a node to do 10 seconds of
computation per 50KB block, you could scale (in principle) to 2000
nodes, not 200, and instead of taking 500 seconds minimum you could take
50 seconds minimum, presuming that you have $2-3 million to buy the
requisite CPUs and care for and feed them. Even your biggest work
unit would complete in only 45 minutes or so instead of five or six
weeks.
b) Decrease the amount of time spent per calculation doing the IPC
time. If you went to Gigabit ethernet AND actually realized 100 MB/sec
total bandwidths AND things scale perfectly (they probably wouldn't)
you'd once again get the factor of ten better node scaling and could run
efficiently up to as many as 2000 nodes.
   c) Go to a multiple-master parallelization -- arrange a tree of
masters, each with 200 slaves, themselves communicating back
with a supermaster. Presuming that whatever it is you are doling out in
the 50K blocks can itself be at least quasi-independently computed, you
can likely arrange for several masters to be handling disjoint
distributions and once again get out to more than 200 slaves.
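Plugging (a) and (b) into the same back-of-the-envelope ratio of compute time per block to dispatch time per block (assumed numbers, ignoring all the caveats above):

```python
# Scaling limit ~ compute seconds per block / dispatch seconds per block.
block = 50 * 1024                                    # bytes per work assignment

# (a) coarser grain: 10 s of compute per block on 100BT (~10 MB/sec)
print(round(10.0 / (block / (10 * 1024 * 1024))))    # prints 2048, i.e. ~2000

# (b) faster network: 1 s of compute, gigabit at a realized ~100 MB/sec
print(round(1.0 / (block / (100 * 1024 * 1024))))    # prints 2048, i.e. ~2000
```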
If (as is not unlikely) your result is "valuable but not that valuable"
and you're talking about 32 node or maybe 64 node clusters tops, then
DEPENDING ON HOW MUCH WORK the master has to do to prepare the next 50KB
block and deal with the returns, you are probably pretty safe and should
get pretty much linear scaling with ordinary switched 100BT -- a 64 CPU
cluster (plus a master) isn't particularly expensive -- with rackmount
dual Athlons I'd guesstimate about $1000-1100 per CPU, not much
different for rackmount single Athlons or single P4s, a few hundred
less per CPU if you use tower cases and steel shelving.
Your next step in the design phase is probably to write a prototype and
measure both parallel speedup and (if possible) the communication times.
Think clever thoughts about the 50KB block and the resultant
calculation. Can you increase the "size" (completion time) of the
calculation RELATIVE to the communication time of the block? If so,
this is "good". In a lot of cases, this is possible -- for example, if
the master is just incrementing some counters and generating the next
set of data in a predictable sequence, you might be able to send the
counter boundaries and let the slaves generate a whole sequence per IPC.
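A hypothetical sketch of that counter-boundary trick (`generate_input` stands in for whatever deterministic rule actually builds your 50KB blobs -- it is an assumption, not your code):

```python
# Instead of shipping N pre-built blocks, the master sends a counter range
# and each slave regenerates its own inputs locally: N sends become one.

def generate_input(counter):
    """Stand-in for the deterministic rule that builds a (here tiny) blob."""
    return bytes((counter + i) % 256 for i in range(8))

def slave_work(start, stop):
    # One IPC delivers the whole range; the slave expands it itself.
    results = []
    for counter in range(start, stop):
        blob = generate_input(counter)
        results.append(sum(blob))   # stand-in for the real 1.1 s calculation
    return results

print(len(slave_work(0, 2000)))     # one message covered 2000 calculations
```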
In others it isn't -- each slave gets a unique set of numbers being
(say) read from a big file on the master's disk (which, by the way,
might take a nontrivial amount of time to locate on disk, read into a
buffer, and arrange to send to the next free slave, hence the need to
estimate the total amount of SERIAL work done by the master to send an
IPC packet off to the next free slave!).
Hope this helps.
This is a "fun" sort of question as analyzing execution patterns and
speedup is moderately rewarding. Be sure to at least consider
consulting a book on parallel programming if there is any chance that
some part of your underlying algorithms can be parallelized more
efficiently than master/slave or embarrassingly parallel.
BTW, this may be a discouraging reply in that getting work-unit times
down from days or weeks to "human timescales" (minutes to hours) will cost
you a LOT of systems, but tanstaafl, after all. The absolute best you
can hope for (barring superlinear speedup somewhere in your computation)
is to speed up your work units linearly, with completion time inversely
proportional to the number of participating nodes. As you should know
from looking at my book, many computations don't scale this well, at
least for very large numbers of nodes. Your computation has a reasonable chance of
scaling linearly out to quite a few nodes with fairly straightforward
cluster designs. That still leaves you with the problem of paying for
the nodes and possibly a high-end communications network.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu