[Beowulf] The True Cost of HPC Cluster Ownership

Joshua Baker-LePain jlb17 at duke.edu
Wed Aug 12 08:43:03 PDT 2009


On Tue, 11 Aug 2009 at 9:50pm, Robert G. Brown wrote:

> In a nutshell, the "cost of going cheap" isn't linear, with or without
> student/cheap labor.  For small clusters installed by somebody who knows
> what they are doing and e.g. operated and used by the owner or the
> owner's lab including students, operated by departmental sysadmins with
> cluster experience and enough warm bodies to have some opportunity cost
> labor handy -- sure, go cheap -- if a node or two is DOA or fails, so
> what?  It takes you an extra day or two to get the cluster going, but
> most of that time is waiting for parts -- OC time is much smaller, and
> everybody has other things to do while waiting.  But as clusters get
> larger, the marginal cost of the differential failure rate between cheap
> and expensive scales up badly and can easily exceed the OC labor pool's
> capacity, especially if by bad luck you get a cheap node and it turns
> out to be a "lemon" and the faraway dot com that sold it to you refuses
> to fix or replace it.  The turnover from cheap to much more expensive
> than just getting good nodes from a reputable vendor (which don't
> usually cost THAT much more than cheap) can happen real fast, and the
> time wasted can go from a few days to months equally fast.

One thing I haven't seen addressed is the proposed usage of the 
cluster.  If most of the code to be run on the cluster is embarrassingly 
parallel, then the cost of a node going down or the network being less 
than optimal is fairly low.  In this case, IMO, it's pretty easy to make 
the argument to go the DIY route (depending on size and available labor 
pool, of course, as others have mentioned).  If, OTOH, you intend to run 
tightly coupled MPI code across the entire cluster, then it becomes very 
valuable to ensure that everything is working together just so.  There, a 
turn-key vendor (and/or highly skilled third party) can make more sense.
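
To put rough numbers on that difference, here is a minimal sketch.  It 
assumes a toy model that is not from this thread: every node fails 
independently with the same probability p over the length of a job, and a 
tightly coupled job dies if any one of its nodes does.  The function names 
and the 1% figure are made up for illustration.

```python
def independent_work_lost(p_fail: float) -> float:
    """Embarrassingly parallel case (toy model).

    Each node's tasks are independent, so the expected fraction of work
    lost is just the per-node failure probability, whatever the node count.
    """
    return p_fail


def coupled_job_interrupted(p_fail: float, n_nodes: int) -> float:
    """Tightly coupled case (toy model).

    The job spans n_nodes and is assumed to abort if any single node
    fails, so it survives only if all n_nodes survive.
    """
    return 1.0 - (1.0 - p_fail) ** n_nodes


if __name__ == "__main__":
    p = 0.01  # hypothetical per-node failure probability per job
    for n in (8, 64, 512):
        print(
            f"{n:4d} nodes: independent work lost ~ {independent_work_lost(p):.1%}, "
            f"coupled job interrupted ~ {coupled_job_interrupted(p, n):.1%}"
        )
```

Under those assumptions the independent workload loses about the same 
small fraction at any size, while the chance of a full-cluster coupled job 
being interrupted climbs toward certainty as the node count grows.  
Checkpointing and real failure statistics would change the numbers, but 
not the direction.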

In other words, the answer, as always, is "It depends."

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


