[Beowulf] Lifespan of a cluster

Mon Apr 28 06:02:14 PDT 2014

Hi, Jörg.

I would say five years is an OK lifetime.  If you want to be aggressive
about your lifecycle, a case can be made for three years.

Things keeping a cluster running longer:
- lack of funding
- one-time cost
- lack of communication
- it isn't broken
- researchers don't pay for electricity, cooling, facilities
- users do not want to migrate
- some applications may be difficult to map to new hardware / OS

Things that should convince you to update:
- two new servers can replace an entire rack of 10-year-old hardware
- the savings in electricity could equal the new hardware cost
- space is limited, new in, old out, temporary overlap
- IO and core performance are way up!
- warranty support = staff AND researchers sleep at night
- refreshing the OS and software is a very good thing
- new car smell
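
To put a rough number on the electricity point, here is a minimal
back-of-the-envelope sketch.  Every figure in it is a made-up placeholder
(power draw, hardware price, electricity rate, and the 1.5 cooling overhead
factor are all assumptions, not measurements), so plug in your own:

```python
def payback_years(old_kw, new_kw, hardware_cost, price_per_kwh, overhead=1.5):
    """Years until electricity savings equal the cost of the new hardware.

    old_kw / new_kw: average power draw of the old and new kit, in kW.
    overhead: assumed multiplier for cooling and facilities (PUE-like).
    All inputs are illustrative placeholders, not real measurements.
    """
    hours_per_year = 24 * 365
    saved_kw = (old_kw - new_kw) * overhead
    if saved_kw <= 0:
        # New kit draws as much or more power: no payback from electricity.
        return float("inf")
    annual_saving = saved_kw * hours_per_year * price_per_kwh
    return hardware_cost / annual_saving


if __name__ == "__main__":
    # Hypothetical: a 10 kW rack replaced by two servers drawing 1 kW total.
    years = payback_years(old_kw=10.0, new_kw=1.0,
                          hardware_cost=20000.0, price_per_kwh=0.15)
    print(f"Payback in roughly {years:.1f} years")
```

If the result comes out shorter than the warranty period, that is a strong
argument to bring to whoever pays the power bill.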

That said, I know clusters that won't be turned off until a data center
migration happens.  I think the key here is to set expectations and have
an SLA before deploying anything.  

Cheers.

On Sun 04/27/14 09:45AM +0100, Jörg Saßmannshausen wrote:
> Dear all,
> 
> in some of the discussions here I came across the 'lifespan of a cluster' 
> argument. What I was wondering is: how long is that in HPC for number 
> crunching?
> Is it 3 years (end of warranty), 5 years (making good use of hardware) or 
> longer?
> 
> The reason behind that asking is: I got clusters here which are 10 years old, 
> and quite a number of them, and I would like to get a scheme implemented to 
> get the hardware replaced every X years with X being the 'lifespan of a 
> cluster'. One of the various options which are currently thrown around is to 
> move from my local data-centre (3 rooms, one is purely for the backup/file 
> storage and the other two for HPC) into the College shared data centre (single 
> room). IF we are doing that, I am a bit worried that I get told in 5 years 
> time (for the sake of that argument): your clusters are end of lifetime, you 
> have to get rid of them as we need space / they are consuming too much energy.
> 
> Thus, I am looking to get some answers for: how long are clusters run 
> typically and how is that done in other shared data centres?
> 
> The current funding situation here means it is difficult, if not impossible, to 
> get HPC hardware from funding agencies. Even if you get a bit of money, it is 
> just enough to get a new node. So most clusters are a bit organically grown 
> which makes administration difficult if you want to get really the best out of 
> what you paid for. In an ideal world, I would like to have that replaced every 
> 5 years: old kit out, new kit in. In the real world, I got to run the kit 
> until it falls apart and hope that the Principal Investigator, i.e. the owner 
> of the cluster, got some money to replace the old/broken nodes. Hence the 
> questions so I can build up a good case to change there.
> 
> I hope that makes sense to you.
> 
> All the best from an overcast London!
> 
> Jörg
> 
> 
> -- 
> *************************************************************
> Dr. Jörg Saßmannshausen, MRSC
> University College London
> Department of Chemistry
> Gordon Street
> London
> WC1H 0AJ 
> 
> email: j.sassmannshausen at ucl.ac.uk
> web: http://sassy.formativ.net
> 
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html


-- 
Gavin W. Burris
Senior Project Leader for Research Computing
The Wharton School
University of Pennsylvania