[Beowulf] $2500 cluster. What it's good for?
Robert G. Brown
rgb at phy.duke.edu
Mon Dec 20 06:55:38 PST 2004
On Sun, 19 Dec 2004, Jim Lux wrote:
> This brings up an interesting optimization question. Just like in many
> things (I'm thinking RF amplifiers in specific) it's generally cheaper/more
> cost effective to buy one big thing IF it's fast enough to meet the
> requirements. Once you get past what ONE widget can do, then, you're forced
> to some form of parallelism or combining smaller widgets, and to a certain
> extent it matters not how many you need to combine (to an order of
> magnitude). The trade comes from the inevitable increase in system
> management/support/infrastructure to support N things compared to supporting
> just one. (This leaves aside high availability/high reliability kinds of
> So, for clusters, where's the breakpoint? Is it at whatever the fastest
> currently available processor is? This is kind of the question that's been
> raised before.. Do I buy N processors now with my grant money, or do I wait
> a year and buy N processors that are 2x as fast and do all the computation
> in the second of two years? If one can predict the speed of future
> processors, this might guide you whether you should wait for that single
> faster processor, or decide that no matter if you wait 3 years, you'll need
> more than the crunch of a single processor to solve your problem, so you
> might as well get cracking on the cluster.
This has actually been discussed on list several times, and some actual
answers posted. The interesting thing is that it is susceptible to
algebraic analysis and can actually be answered, at least in a best
approximation (since there are partially stochastic delays that
contribute to the actual optimal solution).
The optimal solution depends on a number of parameters, of course:
The problem. EP (embarrassingly parallel) problems are far more
flexible as far as mixing CPU speeds and hardware types goes.
Synchronous, fine-grained computations are far more difficult to
implement efficiently on mixed hardware.
Moore's Law (smoothed) for all the various components. You have to be
able to predict the APPROXIMATE rate of growth in hardware speed at
constant cost to be able to determine how to spend your money optimally.
Moore's Law (corrected). Moore's Law advances are NOT smooth -- they
are discrete and punctuated by sudden jumps. Worse, those jumps aren't
even uniform -- sometimes a processor or chipset is introduced that
speeds up some operations by X and others by Y, so mere clock-speed
scaling isn't a good predictor -- or an underlying subsystem (memory,
for example) is suddenly changed while the processor remains the
same. One cannot tell the future with any precision, but one needs to
pay attention to (for example) the "roadmaps" published by Intel and AMD
and IBM and Motorola and all the other major chip manufacturers that
make key components that affect the work flow for your task(s).
"TCO". Gawd, I hate that term, because it is much-abused by
marketeers, but truly it IS something to think about. There are
(economic) risks associated with building a cluster with bleeding-edge
technology. There are risks associated with mixing hardware from many
low-bid vendors. There are administrative costs (sometimes big ones)
associated with mixing hardware architectures, even generally similar
ones such as Intel and AMD or i386 and x86_64. Maintenance costs are
sometimes as important to consider as pure Moore's Law and hardware
costs. Human time requirements can vary wildly and are often neglected
when doing the cost-benefit analysis (CBA) for a cluster.
Infrastructure costs are also an important specific factor in TCO. In
fact, they (plus Moore's Law) tend to put an absolute upper bound on the
useful lifetime of any given cluster node. Node power consumption (per
CPU) scales up, but it seems to be following a much slower curve than
Moore's Law -- slower than linear. A "node CPU" has drawn on the order
of 100W for quite a few years now -- a bit over 100W for the highest
clock, highest end nodes, but well short of the MW that would be
required if power had followed anything like a ML trajectory from e.g.
the original IBM PC. Consequently, just the cost of the >>power<< to run
and cool older nodes at some point exceeds the cost of buying and
running a single new node of equivalent aggregate compute power. This
is probably the most predictable point of all -- a sort of "corollary"
to Moore's Law. If one assumes a node cost of $1000/CPU and a node
power cost of $100/year (for 100W nodes) and a ML doubling time of 18
months, then sometime between year four and year six -- depending on the
particular discrete jumps -- it becomes break-even to buy a new node for
$1000 and pay $100 for its power versus operating 11 old nodes of
equivalent aggregate crunch for the year.
Except that Amdahl's Law guarantees that this is an upper bound time --
eleven old nodes deliver less than eleven nodes' worth of usable crunch
once parallel overhead is counted -- so for most non-EP tasks the break
even point will come earlier.
Except that TCO costs for maintaining the node start to escalate after
roughly year three (when most extended warranties stop and getting
replacement hardware gets very difficult indeed).
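The power break-even above is easy to sketch in a few lines. This is
only a back-of-envelope model, assuming the $1000/node, $100/year and
18-month doubling figures from the paragraph above:

```python
import math

node_cost = 1000.0     # $ per new node (figure assumed in the post)
power_cost = 100.0     # $ per node-year of power + cooling at ~100 W
doubling_time = 1.5    # assumed ML doubling time, in years

# A node bought in year t matches 2**(t / doubling_time) year-zero nodes.
# Break even when a year of power for that many old nodes costs as much
# as buying one new node and powering it for the year:
#   power_cost * 2**(t / doubling_time) = node_cost + power_cost
equiv_nodes = (node_cost + power_cost) / power_cost
t_break = doubling_time * math.log2(equiv_nodes)
print(f"old nodes replaced by one new node: {equiv_nodes:.0f}")
print(f"break-even year: {t_break:.1f}")
```

With these numbers one new node replaces eleven old ones at about year
5.2, squarely inside the "between year four and year six" window, with
the discrete ML jumps moving the real crossover around within it.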
Finally, there is one consideration that often trumps all of the
above. Many clusters if not most clusters are built to perform some
specific, grant funded or corporate funded, piece of work. Even if it
turns out to be "optimal" to wait until the end of year three, buy all
your hardware then, and work for one year to complete the absolute most
work that could be done on a four year grant, it is simply impossible to
actually DO this. So people do the opposite -- spend all their money in
year one and forgo the gains they could have had riding ML -- or, if
they are very clever and their task permits, spend it in thirds (say at
years zero, one and a half, and three, with node speed doubling every
18 months) and get (1/3)*4*1 + (1/3)*2.5*2.0 + (1/3)*1*4 = 13/3 = 4 1/3
work units done instead of the 4 they'd get from a flat year-one
investment.
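The arithmetic can be checked directly. The purchase times of 0, 1.5
and 3 years and the doubling-per-18-months speedup are my reading of
the factors in the sum above:

```python
# Three-way split: spend 1/3 of the budget at years 0, 1.5 and 3 of a
# 4-year grant, with crunch per dollar doubling every 18 months
# (smoothed Moore's Law -- an assumed model, not a measured one).
doubling = 1.5
horizon = 4.0
buy_times = [0.0, 1.5, 3.0]

# Each chunk runs from its purchase time to the end of the grant at the
# speed Moore's Law grants to money spent at that time.
split = sum((1 / 3) * (horizon - t) * 2 ** (t / doubling) for t in buy_times)
flat = 1.0 * horizon          # whole budget spent in year one, speed 1
print(f"split purchases: {split:.2f} work units vs flat: {flat:.2f}")
```

This reproduces the 13/3 ≈ 4.33 work units versus 4 for the flat
year-one purchase.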
It is amusing to note that it is break even to buy in year one and run
for four years versus buy at the end of year three and run for one year,
EXCEPT for TCO. TCO makes the latter much, much cheaper, as it includes
the infrastructure and administrative cost for running the nodes for
four years instead of one, which are likely to equal or exceed the cost
of all the hardware combined! However, you will convince very few
researchers or granting agencies that the best/optimal course is for
them to do nothing for the next three years and then work for one year
-- and it probably isn't true. The truth is that there are nonlinear
social and economic benefits from doing the work over time, even at a
less than totally efficient rate.
If there is a rule of thumb, though, it is that a true optimum given
this sort of macroeconomic consideration is likely the distributed
expenditure model. It is generally better for MANY kinds of tasks or
task organizations to take any fixed budget for N>3 years and split it
up into N-1 chunks or thereabouts and try to ride the ML breaks as they
come. This means that in your organization you always have access to a
cluster that is new/current technology and can exploit its nonlinear
benefits; you have access to a workhorse cluster that is only 1-2 years
old. You have access to a mish-mosh cluster that is 2-4 years old but
still capable of doing useful work for lots of kinds of tasks (including
e.g. prototyping, code development, EP tasks as well as some
production). From there, as warranties expire and maintenance costs
escalate, you retire them and ultimately (one hopes) recycle them in
some socially responsible way.
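As a sketch only, the N-1 chunk rule of thumb might be modeled like
this; the even spacing of purchases from year zero through year N-1 is
my assumption, not something the argument above pins down:

```python
# Hypothetical generalization of the split-purchase rule of thumb:
# divide an n-year budget into k equal chunks spent at evenly spaced
# times from year 0 through year n-1, with crunch per dollar doubling
# every 18 months. The function name and spacing are illustrative.
def work_units(n_years, k_chunks, doubling=1.5):
    total = 0.0
    for i in range(k_chunks):
        # purchase time of chunk i (all at year 0 if there's one chunk)
        t = i * (n_years - 1) / (k_chunks - 1) if k_chunks > 1 else 0.0
        total += (1 / k_chunks) * (n_years - t) * 2 ** (t / doubling)
    return total

print(work_units(4, 1))   # flat year-one purchase: 4 work units
print(work_units(4, 3))   # N-1 = 3 chunks over a 4-year grant
```

For a four-year grant this recovers the 13/3 figure for three chunks,
and lets you play with other splits and doubling times.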
> > I also have an interest in seeing a cluster version of Octave or SciLab
> > set to work like a server. (as I recall rgb had some reasons not to use
> > these high level tools, but we can save this discussion for later)
> I'd be real interested in this... Mathworks hasn't shown much interest in
> accommodating clusters in the Matlab model, and I spend a fair amount of time
> running Matlab code.
I believe that there is an MPI library and some sort of compiler thing
for making your own libraries, though. I don't use the tool and don't
keep close track, although that will change next year as I'll be using
it in teaching. The real problem is that people who CAN program matlab
to do stuff in parallel aren't the people who are likely to use matlab
in the first place. And since matlab is far, far from open source --
actually annoyingly expensive to run and carefully licensed -- the
people who might be the most inclined to invest the work don't/can't do
so in a way that is generally useful. One of the many evils of closed
source, non-free applications. So I think Doug is on track here -- work
should really be devoted to octave, where it can nucleate a serious
community development effort and possibly give researchers a solid
reason to choose octave instead of matlab in the first place.
> > What I can say as part of the project, we will be collecting a software
> > list of applications and projects.
> > Finally, once we all have our local clusters and software running to our
> > hearts content, maybe we can think about a grid to provide spare compute
> > cycles to educational and public projects around the world.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email: rgb at phy.duke.edu