[Beowulf] Re: Good upgrade intervals (Was: Oldest functioning clusters)

Robert G. Brown rgb at phy.duke.edu
Tue Nov 23 10:43:27 PST 2004

On Tue, 23 Nov 2004, Josip Loncaric wrote:

> So, the work/cost optimal policy is roughly this:
> (1) Initially, pick a budget p which you can sustain every 3-4 years
> (2) Buy the highest performing system available at price p
> (3) Every time you can get about 5x performance at cost p, repeat (2)
> This simple calculation assumes complete equipment replacement at a 
> fixed budget.  The above does not take into account component upgrades 
> along the way which may extend the useful life of the original 
> equipment, nor inflation-adjusted budget increases.  However, as Robert 
> has pointed out, software is a moving target, and eventually old 
> hardware just won't comfortably run new software.
> Each situation is a bit different, but the above "5x performance 
> upgrades" rule is not a bad choice.
> Sincerely,
> Josip

The only really significant modification to Josip's beautiful math
I'd suggest is one associated with hardware reliability.

I've encountered two generically different kinds of hardware:

  a) Hardware that you buy and is almost totally trouble free "forever".
An original 1982 IBM PC, for example, was still running when I gave it
away to my kid's kindergarten in 1994 or thereabouts.  Aside from a
single hard drive crash, it had never required any sort of repair.

  b) #*!&@ hardware that has anything from a single bad component that
fails repeatedly to a total bad mix of components that are prone to
failure.  We've all seen this, some of us by specific brand or
configuration.  This is hardware that, service contract or not, eats
(our) time and (somebody's, possibly our) money like crazy.

Josip's budget computation presumes "a" type hardware.  The fraction of
"b-ness" of any given hardware batch shifts the curve, possibly
significantly, towards earlier replacement.  We have some node hardware
that we are counting the days on, literally, until we can decommission
it and stop fixing it when it (regularly, frequently) breaks.  It is
"severe b" type and breaks repeatedly even when we replace components
with new ones (we've nearly totally rebuilt some of the nodes several
times with warranty replacement parts and parts we've bought ourselves).
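
As a rough illustration of how "b-ness" pulls the replacement date
earlier, here is a minimal sketch.  It is not Josip's actual calculation.
It assumes that performance at fixed price doubles every 18 months (an
assumed figure) and that flaky hardware can be summarized by a single
"availability" fraction, the share of nominal throughput the old cluster
really delivers once broken nodes are counted.  The function name and
both parameters are hypothetical.

```python
import math

def years_to_ratio(target_ratio=5.0, doubling_years=1.5, availability=1.0):
    """Years until new hardware at the same price delivers target_ratio
    times the EFFECTIVE throughput of the old cluster.

    Assumptions (illustrative only, not from the original post):
      - performance per dollar doubles every `doubling_years`
      - the old cluster delivers `availability` of its nominal throughput
    """
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    # Solve 2**(t / doubling_years) / availability == target_ratio for t.
    t = doubling_years * math.log2(target_ratio * availability)
    return max(0.0, t)

if __name__ == "__main__":
    for a in (1.0, 0.9, 0.8):
        print(f"availability {a:.1f}: replace after "
              f"{years_to_ratio(availability=a):.2f} years")
```

With these assumed numbers, pure "a" hardware reaches the 5x point at
about 3.5 years, while hardware that is effectively only 80% available
reaches it at about 3 years.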

One cannot usually justify throwing grant-purchased hardware out and
asking for more before the third year is up, but if one wants to get OUT
to 3.5 years or more, try very hard to ensure the "a-ness" of your
hardware.

Another thing to add is that the right policy for reasonably a-like
hardware IN its third+ year that still has some use in it (and has
somebody else paying for electricity ;-) is "eat your dead" -- let nodes
fail, use them for
spare parts, and gradually let your node count diminish until you cannot
be bothered to take the time to mess with node repair out of the
boneyard (which happens, believe me).  This depends, of course, on
having opportunity-cost time available or a systems person with a bit of
spare time on their hands.  If your operation already saturates your
labor pool, it may be better to let failure mean failure after year 3
and just donate the dead to a recycling or charitable organization.
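
The "eat your dead" policy can be sketched as a toy simulation.  This is
only an illustration under made-up assumptions: every node fails
independently with the same monthly probability, every failure needs
exactly one spare part, and every scrapped node yields a fixed number of
usable spares.  All names and default values below are hypothetical.

```python
import random

def eat_your_dead(nodes=32, months=24, monthly_fail_prob=0.03,
                  spares_per_dead_node=3, seed=0):
    """Return the live node count at the end of each month.

    Toy model (assumptions, not measurements): a failed node is repaired
    from the boneyard if a spare is available; otherwise it is scrapped
    and its parts are added to the boneyard.
    """
    rng = random.Random(seed)   # seeded so runs are repeatable
    spares = 0
    history = []
    for _ in range(months):
        # Count this month's failures among the nodes still alive.
        failures = sum(rng.random() < monthly_fail_prob for _ in range(nodes))
        for _ in range(failures):
            if spares > 0:
                spares -= 1                     # repair out of the boneyard
            else:
                nodes -= 1                      # node joins the boneyard
                spares += spares_per_dead_node  # and donates its parts
        history.append(nodes)
    return history

if __name__ == "__main__":
    print(eat_your_dead())
```

The node count can only shrink or stay flat, which is the point: the
cluster degrades gracefully without new money, at the cost of somebody's
time for each repair.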


Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
