[Beowulf] The True Cost of HPC Cluster Ownership

Tue Aug 11 18:50:26 PDT 2009

On Tue, 11 Aug 2009, Joe Landman wrote:

> There is a cost to going cheap.  This cost is time, and loss of productivity. 
> If your time (your students time) is free, and you don't need to pay for 
> consequences (loss of grants, loss of revenue, loss of productivity, ...) in 
> delayed delivery of results from computing or storage systems, then, by all 
> means, roll these things yourself, and deal with the myriad of debugging 
> issues in making the complex beasts actually work.  You have hardware stack 
> issues, software stack issues, interaction issues, ...

Oh, damn, might as well demonstrate that I'm not dead yet.  I'm getting
better.  Actually, I'm just getting back from Beaufort and so fish are
not calling me and neither is the mountain of unpacking, so I might as
well chip in.

My own experiences in this regard are that one can span a veritable
spectrum of outcomes from great and very cost efficient to horrible and
expensive in money and time.  For larger projects the odds of the latter go
up, as what is a small inefficiency in 16-32 systems becomes an enormous
and painful one for 1024.

I'll skip the actual anecdotes -- most of them are probably in the
archives anyway -- and just go straight to the (IMO) conclusions.

The price of your systems should scale with the number you buy.
Building an 8 node starter cluster?  Tiger.com $350 specials are fine.
Building a professional/production cluster with 32 or more nodes?  Go
rackmount (already a modest premium) and start kicking in for service
contracts and a few extras to help keep things smooth.  Building 32
nodes with an eye on expandability?  Go tier 1 or tier 2 vendor (or a
professional and experienced cluster consultant, such as Joe), with four
year service, after asking on list to see if the vendor is competent and
uses or builds good hardware.  IBM nodes are great.  Penguin nodes (in
my own experience) are great.  Dell nodes are "ok", sort of the low end
of high end.  I don't have much experience with HP in a cluster setting.
And do not, not, not, get no-name nodes from a cheap online vendor
unless you value pain.

This is advice that works for anything on the DIY side -- even a 1024
node cluster can be built by just you (or you and a friend/flunky) as
long as you allow a realistic amount of time to install it and debug it
-- say 10-15 minutes a node in production with electric screwdrivers to
rack them from delivery box to rack (after a bit of practice, and you'll
get LOTS of practice:-) Call it a day per rack, so yeah, 3-4 human-weeks
of FTE.  PLUS at least 10 minutes of install/debugging time, on average,
per node.  These aren't fixed numbers -- I'm sure there are humans who
can rack/derack a node in five minutes, and if you get your vendor to
premount the rails then ANYBODY can rack a node in two or three minutes
in production.  Then there are people like me, who might edge over
closer to twenty minutes, or circumstances like "oops, this batch of rack
screws is the wrong size, time for crazed phone calls to and overnights
from the vendor" that can ruin your expected average fast.
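
For anybody who wants to plug in their own numbers, here is a minimal
sketch of that arithmetic in Python.  The function name and the 8-hour
day / 5-day week defaults are my assumptions, not anything fixed; the
per-node minutes are the ones quoted above.  The answer in weeks moves
around a lot depending on how long a "day" and a "week" you assume, so
treat the output as a rough planning figure only.

```python
def racking_effort(nodes, rack_minutes, install_minutes=0.0,
                   hours_per_day=8.0, days_per_week=5.0):
    """Return (hours, days, weeks) of hands-on time for a cluster.

    rack_minutes and install_minutes are per-node averages.
    hours_per_day and days_per_week are assumptions about the working
    schedule; change them to match your own situation.
    """
    hours = nodes * (rack_minutes + install_minutes) / 60.0
    days = hours / hours_per_day
    weeks = days / days_per_week
    return hours, days, weeks


if __name__ == "__main__":
    # Per-node racking times mentioned in the text: premounted rails,
    # the 10-15 minute range, and a slow 20 minutes.
    for minutes in (3, 10, 15, 20):
        hours, days, weeks = racking_effort(1024, minutes, install_minutes=10)
        print(f"{minutes:2d} min/node racking + 10 min install: "
              f"{hours:6.1f} h, {days:5.1f} days, {weeks:4.1f} weeks")
```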

(Software) install time depends on your general competence in Linux and
clustering, and on how much energy you expended ahead of time setting up
servers to accommodate the cluster install.  If you are a Linux god, a
cluster god, and have a thoroughly debugged e.g. kickstart server (and
got the vendor to default the BIOS to "DHCP boot" on fallthrough from a
naked hard drive) then you might knock the install time down to making a
table entry and turning on the systems -- and debugging the ones that
failed to boot and install, or (in the case of diskless systems) boot
and operate.  A less gifted and experienced sysadmin might have to hook
up a console to each system and hand install it, but nowadays even doing
this isn't very time consuming as many installs can proceed in parallel.
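
As an illustration of what "making a table entry" can look like, here
is a small Python sketch that turns a list of MAC addresses into host
stanzas in ISC dhcpd.conf syntax.  This assumes the "table" is a dhcpd
host list, which is only one possibility; the function name, the node
naming scheme, the 192.168.1.x subnet, and the MAC addresses are all
made up for the example, and a real network-install setup would also
need the boot server and kickstart pieces, which are not shown.

```python
def dhcp_host_entries(macs, prefix="node", subnet="192.168.1.", first_host=101):
    """Build one ISC dhcpd 'host' stanza per MAC address.

    prefix, subnet and first_host are hypothetical defaults chosen for
    illustration; nodes are named node001, node002, ... in list order.
    """
    stanzas = []
    for i, mac in enumerate(macs):
        name = f"{prefix}{i + 1:03d}"
        stanzas.append(
            f"host {name} {{\n"
            f"  hardware ethernet {mac};\n"
            f"  fixed-address {subnet}{first_host + i};\n"
            f"}}"
        )
    return "\n".join(stanzas)


if __name__ == "__main__":
    # Placeholder MAC addresses, not real hardware.
    print(dhcp_host_entries(["00:11:22:33:44:55", "00:11:22:33:44:56"]))
```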

For non-DIY clusters -- turnkey, or contract built by somebody else --
the same general principles apply.  If the cluster is fairly small, you
aren't horribly at risk if you get relatively inexpensive nodes, bearing
in mind that you're still trading off money SOMEWHERE later to save
money now.  If you are getting a medium large cluster or if downtime is
very expensive, don't skimp on nodes -- get nodes that have a solid
vendor standing behind them, with guaranteed onsite service for 3-4
years (the expected service life of your cluster).  Here, in addition,
you need to be damn sure you get your turnkey cluster from somebody who
is not an idiot, who knows what they are doing and can actually deliver
a functional cluster no less efficiently than described above and who
will stand with you through the inevitable problems that will surface
installing a larger cluster.

In a nutshell, the "cost of going cheap" isn't linear, with or without
student/cheap labor.  For small clusters installed by somebody who knows
what they are doing and e.g. operated and used by the owner or the
owner's lab including students, or operated by departmental sysadmins with
cluster experience and enough warm bodies to have some opportunity cost
(OC) labor handy -- sure, go cheap -- if a node or two is DOA or fails, so
what?  It takes you an extra day or two to get the cluster going, but
most of that time is waiting for parts -- OC time is much smaller, and
everybody has other things to do while waiting.  But as clusters get
larger, the marginal cost of the differential failure rate between cheap
and expensive scales up badly and can easily exceed the OC labor pool's
capacity, especially if by bad luck you get a cheap node and it turns
out to be a "lemon" and the faraway dot com that sold it to you refuses
to fix or replace it.  The turnover from cheap to much more expensive
than just getting good nodes from a reputable vendor (which don't
usually cost THAT much more than cheap) can happen real fast, and the
time wasted can go from a few days to months equally fast.
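
A toy model makes the turnover easier to see.  The sketch below is
Python, and everything in it is an assumption for illustration: the
function name, the idea that cheap nodes differ only by an extra
failure rate, and every number in the example except the $350 node
price mentioned earlier.  Repair labor is treated as free up to the
size of the OC labor pool and as costing real money (staff time plus
downtime) beyond it.

```python
def cheap_vs_solid(nodes, cheap_price, solid_price, extra_failure_rate,
                   hours_per_failure, free_hours, paid_hour_cost):
    """Return (cheap_total, solid_total) cost for a cluster of `nodes`.

    extra_failure_rate: extra fraction of cheap nodes that fail,
        relative to the solid ones (hypothetical).
    free_hours: repair hours the OC labor pool can absorb at no cost.
    paid_hour_cost: cost per repair hour once that pool is exhausted.
    """
    repair_hours = nodes * extra_failure_rate * hours_per_failure
    overflow_hours = max(0.0, repair_hours - free_hours)
    cheap_total = nodes * cheap_price + overflow_hours * paid_hour_cost
    solid_total = nodes * solid_price
    return cheap_total, solid_total


if __name__ == "__main__":
    # Hypothetical inputs: $500 solid nodes, 10% extra failures,
    # 8 hours per failure, 40 free hours, $300 per overflow hour.
    for n in (8, 32, 128, 1024):
        cheap, solid = cheap_vs_solid(n, 350, 500, 0.10, 8, 40, 300)
        print(f"{n:5d} nodes: cheap ${cheap:10.0f}   solid ${solid:10.0f}")
```

With inputs like these the cheap nodes win for a small cluster and
lose for a large one; where the crossover falls depends entirely on
the numbers you feed in, so use your own.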

So be aware of this.  It is easy to find people on this list with horror
stories associated with building larger clusters with cheap nodes.  With
smaller clusters it isn't horrible -- it is annoying.  You can often
afford to throw e.g. a bad motherboard away and just buy another one and
reinstall a better one in the nodes one at a time for eight or a dozen
nodes.  You can't do this, sanely, for 64, or 128, or 1024.

One last thing to be aware of is the politics of grants.  Few people out
there buy nodes out of pocket.  They pay for clusters using OPM (Other
People's Money).  In many cases it is MUCH EASIER to budget expensive
nodes, with onsite service and various guarantees, up front in the
initial purchase in a grant than it is to budget less money on more,
cheaper nodes and then budget ENOUGH money to be able to handle
any possible failure contingency in the next three years of the grant
cycle.  Granting agencies actually might even prefer it this way (they
should if they have any sense).  Imagine their excitement if, halfway
through the computation they are funding, your entire cluster blows a
cheap capacitor made in Taiwan (same one on every motherboard) and your
cheap vendor vanishes like dust in the wind, bankrupted like everybody
else who sold motherboards with that cap on them.  Now they have a
choice -- buy you a NEW cluster so you can finish, or write off the
whole project (and quite possibly write off you as well, for ever and
ever).  Conservatism may cost you a few nodes that you could have added
if you went cheap, but it is INSURANCE that the nodes you get will be
around to reliably complete the computation.

    rgb

>
> What I am saying is that Doug is onto something here.  It ain't easy. Doug 
> simply expressed that it isn't.
>
> As for the article being self serving?  I dunno, I don't think so.  Doug runs 
> a consultancy called Basement Supercomputing that provides services for such 
> folks.  I didn't see overt advertisements, or even, really, covert "hire us" 
> messages.  I think this was fine as a white paper, and Doug did note that it 
> started life as one.
>
> My $0.02
>
> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics Inc.
> email: landman at scalableinformatics.com
> web  : http://scalableinformatics.com
>       http://scalableinformatics.com/jackrabbit
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu