[Beowulf] Station wagon full of tapes

Tue May 26 07:33:03 PDT 2009

On Tue, 26 May 2009, Jeff Layton wrote:

> I haven't seen the cloud ready yet for anything other than embarrassingly
> parallel codes (i.e. single node, small IO requirements). Has anyone seen
> differently? (as an example of what might work, CloudBurst seems to be
> gaining some traction - doing sequencing in the cloud. The only problem
> is that sequencing can generate a great deal of data pretty rapidly).

I'm pretty skeptical of commercial rent-a-cluster business models.
Businesses have to exploit a window of cost-benefit between the DIY cost
and the rent-it cost.  For example, leasing a server plus routine
sysadmin in a commercial server farm makes sense because a tier 3
commercial data center is very expensive to build, and humans are very
costly to hire for a single server or small constellation of servers.
Hence server farms can pay the amortized cost of a large server facility,
pay for the physical colocated hardware, pay for sysadmins that provide
24x7 coverage for a very large number of server nodes, charge
"outrageous" prices to their clients (often equal to the cost of buying
the hardware outright, per year), and still make money on the deal while
saving their clients money.

I'm not convinced that this model works for cluster computing.  Those
outrageous prices don't exactly make server farm colocation companies
outrageously rich -- their base operation IS quite expensive and there
ARE scaling limits -- every N nodes they have to add another sysadmin,
every N nodes require Nx power, Nx cooling capacity, Nx space, and there
are nonlinear breaks of MAJOR expense adding more space, power, cooling
capacity where you have to pay for excess and there is a large cost for
undersubscription of capacity.  There is a window of reasonable profit
for server farms precisely because MOST businesses DON'T need the
minimum number of servers to make a home grown operation profitable.

Is this true in cluster computing?  Nearly every cluster computer user I
know of wants "infinite capacity", not finite capacity.  They are
limited by their budget, not their needs.  Give them more capacity,
they'll scale up their computation and finish it faster or do it bigger
or both.  The "buy in" cost of cluster computing is far less than the
cost of a commercial server operation.  This list exists because for
over a decade now DIY clusters have absolutely dominated cluster
computing, even when DIY is done on a large scale basis and may well
involve hiring consultants or buying a turnkey cluster.  You build
your OWN cluster on your OWN site and run it YOURSELF, shared or for a
single purpose.  Even the notion of a "Grid" of interlinked resource
shared clusters never quite got off the ground and appears to be a niche
rather than the dominant paradigm in spite of its apparent economies of
scale.

Where are the windows of opportunity here?  If we use the commercial
server colocation/rental model as a first estimate of the resale/rental
costs required to make a profit, these companies will have to rent
cluster nodes at rates that will basically pay for the nodes in a single
node-year of operation.  Their customers will forever want to use only
their NEWEST, FASTEST nodes, so they will have to discount those
rates as soon as hardware is 2 or more years old, and that hardware will
be EOL/obsolete after three or four years of operation.  Unlike the
server model (where one can use VM technology to provide sandboxes per
node and where rental is continuous) a cluster model will probably NOT
be continuous, it will be on demand, and the software required is task
specific and may require a specific operating system with specific
libraries and a specific (local) build to run efficiently.
Administrative overhead can easily be higher, since supporting a
transient computation can amount to a complete reinstall of a node's OS
image, followed by another complete reinstall for the next customer's
image.  The kind of thing Cluster On Demand was supposed to
provision cheaply, but I don't know that it was ever finished and turned
into a commercial grade product.
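
The provider's side of that arithmetic can be sketched roughly as follows.
This is only an illustration under stated assumptions: the full rental rate
equals the hardware price per node-year (the estimate above), while the
50% discount on hardware 2 or more years old, the utilization fraction,
and the 3000 price used in the usage note are hypothetical placeholders,
not figures from any real provider.

```python
# Rough lifetime rental revenue for one cluster node.
# The discount and utilization values are hypothetical assumptions.

def lifetime_revenue(hardware_cost, lifetime_years, utilization,
                     discount_after_years=2, discount=0.5):
    """Total rental income from one node over its service life.

    hardware_cost        -- purchase price; also the full yearly rental rate
    lifetime_years       -- years until the node is EOL/obsolete
    utilization          -- fraction of each year the node is actually rented
    discount_after_years -- age (in years) at which the rate gets discounted
    discount             -- fractional rate cut applied to older hardware
    """
    total = 0.0
    for age in range(lifetime_years):
        rate = hardware_cost  # full rate pays for the node in one node-year
        if age >= discount_after_years:
            rate *= (1.0 - discount)  # customers want the newest nodes
        total += rate * utilization   # on-demand use is not continuous
    return total


if __name__ == "__main__":
    for u in (1.0, 0.5, 0.25):
        income = lifetime_revenue(3000.0, lifetime_years=4, utilization=u)
        print(f"utilization {u:.2f}: lifetime income {income:.0f}")
```

Under these placeholder inputs the lifetime income falls in direct
proportion to utilization, which is why on-demand (non-continuous) rental
matters so much to whether the model pays.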

Worse, the very bread and butter of a server farm is the small client.
For webservers and business servers, one only needs a few, usually --
sometimes only one, maybe as many as three or four, for a quite large
business (all the way through "small businesses" and up to many "medium
sized businesses").  Lots of potential clients where outsourcing will be
very definitely cheaper than doing it in house, and with better
Internet bandwidth as well to the webservers, better security, better
failover and backup (all available for rent) as well.

In cluster computing, historically, small clients ALWAYS do it
themselves, using opportunity cost labor and incidental space.  They see
the "cost" as the cost of the hardware (in year one) plus a bit of their
time setting it all up, time that they will have to spend ANYWAY to use
a remote cluster but with the extra hassle of having the nodes they are
setting up far far away and not under their direct access and control.
Sure, there are fixed costs of a few hundred per node per year in power
and cooling and "opportunity cost" reuse of space, but the amortized
cost of a cluster you OWN is easily half the amortized cost of RENTING
the exact same cluster from somebody else, depending on how you value
the differential costs of your own time.
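
A minimal sketch of that own-versus-rent comparison, assuming a three year
service life, a rental rate that pays for the node in a single node-year
(the estimate used earlier in this message), and a few hundred per node
per year in power and cooling.  The 3000 hardware price and the function
and parameter names are hypothetical, chosen only for illustration.

```python
# Rough own-vs-rent comparison per node-year. All dollar figures are
# hypothetical placeholders, not measured data.

def owned_cost_per_node_year(hardware_cost, lifetime_years,
                             power_cooling_per_year, extra_time_cost=0.0):
    """Amortized yearly cost of a node you own.

    extra_time_cost is the differential value of your own time per
    node-year, i.e. time you would NOT also spend using a rented node.
    """
    return (hardware_cost / lifetime_years
            + power_cooling_per_year
            + extra_time_cost)


def rented_cost_per_node_year(hardware_cost, rent_fraction=1.0):
    """Yearly rent, expressed as a fraction of the hardware purchase price.

    rent_fraction=1.0 encodes the estimate that rental pays for the
    node in a single node-year.
    """
    return hardware_cost * rent_fraction


if __name__ == "__main__":
    hardware = 3000.0  # assumed purchase price per node
    own = owned_cost_per_node_year(hardware, lifetime_years=3,
                                   power_cooling_per_year=300.0)
    rent = rented_cost_per_node_year(hardware)
    print(f"own:  {own:.0f} per node-year")
    print(f"rent: {rent:.0f} per node-year")
    print(f"own/rent ratio: {own / rent:.2f}")
```

With these placeholder inputs owning comes out a bit under half the cost
of renting; raising extra_time_cost moves the ratio toward parity, which
is the "depending on how you value your own time" caveat.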

Large scale cluster users are IMO always going to do it themselves.
They face exactly the same cost/benefit landscape that the would-be
commercial provider does in terms of Nx scaling, and can keep the
"profit" that the commercial provider expects to make from the LARGER
cost they have to resell the resource for.  Also, building their own
cluster gives them control over I/O and IPCs and almost certainly will
give them better/cheaper performance than "vanilla" cluster hardware in
a commercial setup even at EQUIVALENT cost.

Is there room in between small and large?  A set of clients that need
too many nodes to do on an opportunity cost basis, too few to justify
hiring a full time sysadmin or setting up a dedicated space?  I'm
guessing the answer is "sort of" yes.  Especially in environments that
can run commercial, prebuilt cluster software or a standard commercial
cluster OS (e.g. RHEL, Windows) and that are highly deficient in some
mix of human resources that can do sysadmin intelligently,
space/power/cooling, and enlightened management capable of doing a
cost-benefit analysis.
Plus edge cases -- somebody that needs a cluster desperately, but only
for six months and to do one single computation (how common is that?
not very, I think).

For umpty years, the notion of thin clients backed by powerful servers
that do all the "real work" has been one of the most enduring MYTHS of
computing.  The very few exceptions (Google!) are very important, to be
sure, but in truth from the time of the PC on, the cost barrier of
collectivized remote resources PLUS the thin clients has consistently
outweighed the benefits, and this has become only MORE true as PCs have
long since become far more powerful individually than enormous
supercomputers were when the notion was first proposed. Remote cluster
computing is yet another variation of this same idea, with individual
PCs being the relatively thin clients used to access the resource.  For
many years now it has been true that simply buying more of those local
PCs and doing the work locally ends up being cheaper than going "thin",
with a few exceptions that somehow end up being transient as certain
HIDDEN costs of client/server, local/remote, thin/remote computing
surface in actual application.

Exactly WHY does Amazon expect to make money where so many have failed
before?  Because their server provisioning is (like Google's) so close
to the theoretical minimum in cost that they have enough margin to play
with that they can compete with DIY and still make money?  Because they
don't understand the technical challenges of provisioning?  Because
they've identified niche markets for which they'll have an edge and
therefore will "win", resulting in a cash cow of moderate size that will
last until the next change in compute technology that makes it obsolete
(that is, they'll be lucky if it lasts five years)?

    rgb

>
> Jeff

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu