Building a beowulf with old computers

Robert G. Brown rgb at phy.duke.edu
Mon Mar 10 05:37:08 PST 2003


On Sun, 9 Mar 2003, Robert Myers wrote:

> Robert G. Brown wrote:
> 
> >The sad truth is that cluster nodes have an ECONOMICALLY useful lifetime
> >of somewhere between 18 months and 3 years, depending on lots of things,
> >although one can arguably get work done out to 5 years on nodes that
> >require no human time to run or repair that other people are paying to
> >feed and cool.
> >
> >  
> >
> That makes a strong argument for considering energy consumption when 
> building a cluster in the first place.  Lower energy consumption = Lower 
> energy cost, longer economically useful life = Lower TCO/year.

It is absolutely unwise to ignore energy consumption (for power and
cooling), renovation costs, "rent" on the physical space, human
management costs and so forth -- really all the infrastructure and
management costs -- when building a cluster.  That doesn't stop many
folks from doing so.  Let us do a little review of cluster economics.

When computing TCO for clusters being inserted into existing facilities
and run by existing personnel (as opposed to ones created from whole
cloth, with every expense a line item and full-time staff) a lot of the
expense is "opportunity cost".  (Opportunity cost is an economic term
for the value of the time, space, and power that people spend building
and managing a cluster that THEY COULD HAVE SPENT ON SOMETHING ELSE.  Or
more particularly, it is the cost of that something else.  This needs to
be compared to the BENEFIT of building the cluster and diverting all of
those resources.)

If you have a suitable space and preexisting LAN infrastructure handy,
some low priority/low return tasks that can easily be displaced or put
off, and a decent "return" from building a cluster (in terms of whatever
goals you might have) then cluster infrastructure, except for power, can
be nearly "free".  If instead you have to renovate a space in a crowded
building (displacing or preventing other work from being done), manage
the cluster with the time of an ALREADY overworked systems manager who
has to delay other tasks, reducing the productivity of the environment,
and use a LAN infrastructure built and managed from the ground up just
for the cluster and task at hand, the cost of a cluster can be high,
much higher than you likely estimated when considering building it.

MANY of the original beowulfish clusters, and many clusters today, have
a very low relative TCO because much of their indirect (non-hardware)
expense is/was opportunity cost labor and other resources that wouldn't
otherwise be able to produce anything of comparable value, OR because
they had dedicated resources but those resources were still far less
costly than alternative approaches to getting the same work done with
big iron.  Or both.  Frankly, this is still very much true, but as
clusters get bigger the differences between their infrastructure
requirements and those of big iron clusters get smaller.

When clusters have more than 8-16 nodes, the recurring costs for
operating them start to get too large to safely ignore.  I think the
$1/watt/year figure comes as a bit of a surprise to a lot of folks (it's
based on electrical costs of $0.08 per kilowatt-hour, 8760 hours per
year, or about $0.70 for the electricity up front, plus another
estimated $0.30 to remove the heat with an air conditioner with a
coefficient of efficiency in the range of 2-3, so you can see that there
is nothing up my sleeve).  Of course it could be off by a factor of 2
either way depending on energy costs and AC efficiency in your actual
environment.  I remain very aware of these numbers as I pay them out of
pocket for my home beowulf.  It's "different" when it is your
money...;-)
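
For anyone who wants to see the whole calculation laid out, here it is
as a trivial python snippet (the $0.08/kWh rate and the AC coefficient
of 2.5 are just the numbers I assumed above; plug in your own):

    # rough cost per watt-year to run (and cool) a node
    rate_per_kwh = 0.08                    # assumed electricity cost, $/kWh
    hours_per_year = 8760
    electricity = 0.001 * hours_per_year * rate_per_kwh  # 1 W for a year: ~$0.70
    cooling = electricity / 2.5            # remove the heat at a coefficient of ~2.5: ~$0.28
    print(electricity + cooling)           # ~$0.98, i.e. about $1/watt/year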

> Same argument works for server blades, and I'm amazed that energy costs 
> don't come up as a consideration more often.
> 
> A researcher at LANL has built a cluster based on Transmeta chips called 
> Green Destiny, making the energy cost argument, which is documented in
> 
> http://public.lanl.gov/feng/Bladed-Beowulf.pdf
> 
> He claims a much lower TCO for his Transmeta-based system, but only a 
> small part of the claimed savings is electricity costs.

I'm not convinced that these save THAT much money (or any at all) for
the following reasons:

  a) From on-list discussions in the past it does not appear that one
saves THAT much energy PER FLOP (or aggregate MHz, or bogomip, or
whatever measure of performance you like).  Indeed, it seems likely that
in a lot of cases one will end up spending MORE energy per unit of
actual work done.

This is for a lot of reasons, the most fundamental of which is that it
takes a certain amount of energy to switch the state of a flip-flop and
a certain amount of energy to hold that state.  That energy isn't scale
invariant in either the spatial domain or the temporal domain.  Also,
there is the energy cost of shared infrastructure for the CPU -- the
number of disks, the amount of memory, the number and kind of peripheral
cards supported PER UNIT OF WORK DONE (not per CPU).  I'm not certain
that anybody has done a systematic study of the scaling laws for WORK
done for typical numerical tasks, but my feeling (which could be
incorrect) is that one's total energy cost per unit of work done by a
non-idle system actually decreases with e.g. CPU clock and VLSI
generation.  As in my guesstimate for the P5's vs P6's, a 2.4 GHz
P6-class CPU might well get 24 times as much work done as a 200 MHz
P5-class CPU, but I don't think that there is any way in hell that it
draws 24 times as much power.  More like 2-3.

If we are generous and presume four times as much power for twenty-four
times as much work over six years, this suggests VERY CRUDELY that there
is a Moore's Law-like scaling law that decreases power cost per unit of
work done with a time constant maybe twice that of Moore's Law itself.
Then there is Amdahl's law, which dictates that it is (nearly, barring
accidental superlinear speedup) ALWAYS less efficient to use three 800
MHz CPUs than one 2400 MHz CPU, all things being equal.  Some
computer-science-economist out there may have done the math and
published it, or somebody out there may read this and find a master's
thesis or senior honors thesis in it...;-)
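
To put a crude number on that guess (and it is only a guess, using the
24x-work-for-4x-power-over-six-years figures I pulled out of the air
above):

    import math
    # how fast does energy per unit of work fall, given the guesses above?
    work_ratio, power_ratio, years = 24.0, 4.0, 6.0
    improvement = work_ratio / power_ratio          # ~6x less energy per unit of work
    halving_time = years * math.log(2) / math.log(improvement)
    print(halving_time)  # ~2.3 years, vs the canonical ~1.5 year Moore's Law doubling time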

So I'm by no means convinced that blades significantly lower total power
consumed per unit of work done relative to much higher clock, hotter,
but faster node units, and I wouldn't be horribly surprised if it were
the opposite.

  b) On top of that, blades are very, very expensive per raw FLOP (or
whatever measure du jour you like/need).  At any given point in time,
there is some hardware combination out there that gives you optimum work
accomplished per dollar spent for your particular task.  For a CPU-bound
task, that is currently likely to be something in the lowball
Celeron/Duron/Athlon/P4 family, in a tower case (cheapest but space
inefficient) or rack case (if space is an issue) depending on how
sensitive your task is to memory type and speed and CPU cache size.  For
a memory bound task, likely a P4 or Athlon on a relatively good
motherboard with a high FSB clock and high end memory.  For
communications bound parallel tasks, a 64/66 PCI bus and a high end
communications card are necessary.  Since CPU >>prices<< tend to vary
highly nonlinearly with clock (and many times work done scales at least
approximately with clock) one can just go down the list and pick the
most cost efficient CPU clock and packaging.  I don't think it will be a
blade.
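
Just to be concrete about "going down the list", the selection is
nothing fancier than this (the price list below is completely made up,
purely for illustration):

    # pick the part with the lowest price per unit of work from a price list
    # (prices and relative-work numbers are invented for illustration only)
    cpus = [("1.8 GHz", 130, 18), ("2.4 GHz", 160, 24), ("3.0 GHz", 550, 30)]
    best = min(cpus, key=lambda c: c[1] / c[2])   # dollars per unit of work
    print(best)   # here the 2.4 GHz part wins; the flagship clock essentially never does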

Now, if blade packaging of a relatively low-clock CPU is only TWICE as
expensive per unit of work done as a current generation cost-optimum
packaging, the energy savings over the lifetime of the unit in no way
justify the higher cost.  The cost per year to operate a 2.4 GHz CPU is
likely to be in the $100-150 range, and over a three year lifetime that
is a TOTAL cost of $300-450.  The marginal cost of three 800 MHz blades
(presuming work done that DOES scale perfectly with clock) is way, way
higher than $300, and although they >>may<< draw less total power, they
aren't going to draw no power at all, so even that savings is further
reduced.
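
Even granting the blades a generous factor of two in power, the three
year arithmetic looks something like this (every number below is
illustrative, and the blade hardware premium in particular is purely
hypothetical):

    # three year power-vs-price comparison, all numbers illustrative
    fast_node_power_cost = 125 * 3                # $100-150/year for a 2.4 GHz node: ~$375
    blade_power_cost = fast_node_power_cost / 2   # suppose the blades really do halve it
    power_savings = fast_node_power_cost - blade_power_cost   # ~$190 saved
    blade_premium = 1000    # hypothetical extra hardware cost for the same work done
    print(blade_premium - power_savings)          # the premium swamps the power savings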

  c) Even the management/ease of installation element of TCO touted for
blades is, in my opinion, very questionable.  Or rather, they may well
be extremely easy to install and manage, but >>so are plenty of
alternative, more traditional hardware configurations<<.  We are well
past the "hobby cluster" stage, and linux has come a long way from when
nodes had to be cobbled together and installed "by hand" one at a time,
taking perhaps an hour or hours each.  The
http://www.phy.duke.edu/brahma/linux-mag.html collection is just a small
and probably highly incomplete snapshot of installation and management
methodologies that reduce the marginal cost per node for installation
and management to near zero (once a fixed cost of setting up for the
methodology of your choice is paid).  Between RH
archive+DHCP/PXE+Kickstart+Yum, Debian archive+DHCP/PXE+Apt, Scyld,
Clustermatic, and various turnkey vendors, there are open source and
free installation methodologies,
shrink-wrapped methodologies that you can buy with support, and turnkey
clusters you can buy where you've already paid the fixed cost of
installation and setup and even the cost of programming and customizing
your primary application(s), usually for a fairly paltry 10-20% of the
per-node hardware price.  Even hardware reliability isn't necessarily a
significant differentiator, as one can easily enough buy nodes with 3
year onsite service plans, or maintain a small stock of spare parts to
minimize hardware downtime.
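
The "near zero marginal cost" claim is just amortization arithmetic,
e.g. (the hours and node counts below are placeholders, not
measurements):

    # amortizing a one-time setup cost over N automatically installed nodes
    setup_hours = 16        # one-time: set up DHCP/PXE/Kickstart or equivalent
    minutes_per_node = 5    # marginal: boot a node and let it install itself
    nodes = 64
    print(setup_hours / nodes + minutes_per_node / 60.0)  # ~0.33 hours/node, vs an hour or more by hand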

The BEST reason to consider a blade cluster, in my opinion, is to save
the nonlinear and potentially high capital costs for large spaces and/or
renovation in environments where the "cost" of space and renovation is
high, or there is no way to reasonably amortize those costs over (say)
10 years (where they diminish to being a comfortably small fraction of
the recurring costs for power and cooling you're paying anyway).


Bottom line: I think one has to do a fairly serious CBA for the various
alternative ways of building a cluster for accomplishing any particular
task in any particular environment.  One will very likely need to either
ignore vendor claims for their "TCO" or take them with a large, shiny
grain of salt and do your own less biased estimates.  Perhaps estimates
derived by companies that make roughly equal amounts of money selling
bladed, rackmount, and tower/shelf clusters (all three) can be
trusted; I don't know.  However, there is no substitute for running, and
fully understanding, the numbers yourself.
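
If it helps, the level of "running the numbers" I have in mind is no
deeper than something like the following (every input is a placeholder
to be replaced with your own quotes, power measurements, and local
costs):

    # skeleton TCO comparison; replace every number with your own
    def tco(nodes, price, watts, admin_per_year, years=3, dollars_per_watt_year=1.0):
        power = nodes * watts * years * dollars_per_watt_year
        return nodes * price + power + admin_per_year * years

    # three times as many slower nodes for the same work, per the 800 vs 2400 MHz example above
    towers = tco(nodes=16, price=1500, watts=150, admin_per_year=2000)
    blades = tco(nodes=48, price=2000, watts=50, admin_per_year=2000)
    print(towers, blades)   # then divide each by the work YOUR task actually gets done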

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





