[Beowulf] Lifespan of a cluster
Gavin W. Burris
bug at wharton.upenn.edu
Mon Apr 28 06:02:14 PDT 2014
I would say five years is an OK lifetime. If you want to be aggressive
about your lifecycle, a case can be made for three years.
Things keeping a cluster running longer:
- lack of funding
- one time cost
- lack of communication
- it isn't broken
- researchers don't pay for electricity, cooling, facilities
- users do not want to migrate
- some applications may be difficult to map to new hardware / OS
Things that should convince you to update:
- two new servers can replace an entire rack of 10yo hardware
- the savings in electricity could equal the new hardware cost
- space is limited, new in, old out, temporary overlap
- IO and core performance are way up!
- warranty support = staff AND researchers sleep at night
- refreshing the OS and software is a very good thing
- new car smell
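
To put rough numbers on the electricity point, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name, the power draws, the price per kWh and the cooling overhead factor are hypothetical, so plug in your own metered figures.

```python
HOURS_PER_YEAR = 24 * 365


def payback_years(hardware_cost, old_kw, new_kw, price_per_kwh,
                  cooling_overhead=1.0):
    """Years until electricity savings equal the new hardware cost.

    cooling_overhead is a PUE-style multiplier (assumed, not measured):
    1.0 means only the IT load is counted, 1.5 adds 50% for cooling
    and facilities.
    """
    saved_kw = (old_kw - new_kw) * cooling_overhead
    if saved_kw <= 0:
        # New kit draws as much or more power: it never pays back
        # on electricity alone.
        return float("inf")
    annual_savings = saved_kw * HOURS_PER_YEAR * price_per_kwh
    return hardware_cost / annual_savings


if __name__ == "__main__":
    # Hypothetical example: a 10 kW rack replaced by 1 kW of new servers.
    years = payback_years(hardware_cost=20000, old_kw=10, new_kw=1,
                          price_per_kwh=0.10, cooling_overhead=1.5)
    print(f"Payback in about {years:.1f} years")
```

If the payback period comes out shorter than the lifetime you plan for the new hardware, the electricity argument alone may carry the replacement. This ignores staff time, migration effort and who actually pays the power bill, which is often the sticking point.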
That said, I know clusters that won't be turned off until a data center
migration happens. I think the key here is to set expectations and have
an SLA before deploying anything.
On Sun 04/27/14 09:45AM +0100, Jörg Saßmannshausen wrote:
> Dear all,
> in some of the discussions here I came across the 'lifespan of a cluster'
> argument. What I was wondering is: how long is that in HPC for number
> Is it 3 years (end of warranty), 5 years (making good use of hardware) or
> The reason behind that asking is: I got clusters here which are 10 years old,
> and quite a number of them, and I would like to get a scheme implemented to
> get the hardware replaced every X years with X being the 'lifespan of a
> cluster'. One of the various options which are currently thrown around is to
> move from my local data-centre (3 rooms, one is purely for the backup/file
> storage and the other two for HPC) into the College shared data centre (single
> room). IF we are doing that, I am a bit worried that I get told in 5 years
> time (for the sake of that argument): your clusters are end of lifetime, you
> have to get rid of them as we need space / they are consuming too much energy.
> Thus, I am looking to get some answers for: how long are clusters run
> typically and how is that done in other shared data centres?
> The current funding situation here means it is difficult, if not impossible, to
> get HPC hardware from funding agencies. Even if you get a bit of money, it is
> just enough to get a new node. So most clusters are a bit organically grown
> which makes administration difficult if you want to get really the best out of
> what you paid for. In an ideal world, I would like to have that replaced every
> 5 years: old kit out, new kit in. In the real world, I got to run the kit
> until it falls apart and hope that the Principal Investigator, i.e. the owner
> of the cluster, got some money to replace the old/broken nodes. Hence the
> questions so I can build up a good case to change there.
> I hope that makes sense to you.
> All the best from an overcast London!
> Dr. Jörg Saßmannshausen, MRSC
> University College London
> Department of Chemistry
> Gordon Street
> WC1H 0AJ
> email: j.sassmannshausen at ucl.ac.uk
> web: http://sassy.formativ.net
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html
Gavin W. Burris
Senior Project Leader for Research Computing
The Wharton School
University of Pennsylvania