[Beowulf] Fault tolerance & scaling up clusters (was Re: Bright Cluster Manager)

Chris Samuel chris at csamuel.org
Sat May 12 00:33:05 PDT 2018

On Wednesday, 9 May 2018 2:34:11 AM AEST Lux, Jim (337K) wrote:

> While I’d never claim my pack of beagles is HPC, it does share some aspects
> – there’s parallel work going on, the nodes need to be aware of each other
> and synchronize their behavior (that is, it’s not an embarrassingly
> parallel task that’s farmed out from a queue), and most importantly, the
> management has to be scalable.   While I might have 4 beagles on the bench
> right now – the idea is to scale the approach to hundreds.  Typing “sudo
> apt-get install tbd-package” on 4 nodes sequentially might be ok (although
> pdsh and csshx help a lot) it’s not viable for 100 nodes.

At ${JOB-1} we moved to diskless nodes booting RAMdisk images from the 
management node back in 2013, and it worked really well for us.  You no longer 
have the issue of nodes getting out of step because one of them was down 
when you ran a package install across the cluster, you remove HDD 
failures from the picture (though that's likely less of an issue with SSDs 
these days), and did I mention the peace of mind of knowing everything is the 
same? 
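
For context, that kind of RAMdisk boot is usually driven by PXE from the 
management node; a minimal pxelinux fragment might look like this (the 
filenames and paths here are hypothetical, adjust for your own setup):

```
# /var/lib/tftpboot/pxelinux.cfg/default on the management node
# (hypothetical paths -- every node gets the same kernel + image)
DEFAULT ramdisk
LABEL ramdisk
  KERNEL vmlinuz-compute
  APPEND initrd=compute-image.img root=/dev/ram0 console=ttyS0
```

Because every node pulls the identical image at boot, reinstalling a node is 
just a reboot.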

It's not new: the Blue Gene systems we had (BG/P 2010-2012 and BG/Q 2012-2016) 
booted RAMdisks, as they were designed from the beginning to scale up to huge 
systems and to remove as many points of failure as possible - no moving parts 
on the node cards, no local storage, no local state.

Where I am now we're pretty much the same, except that instead of booting a 
pure RAMdisk we boot an initrd that pivots onto an image stored on our Lustre 
filesystem.  These nodes do have local SSDs for local scratch, but again no 
real local state.
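
The pivot step amounts to something like the following, in rough shell-style 
pseudocode (the Lustre fsname and paths are made up, and real initramfs 
scripts need overlays and error handling on top of this):

```
# sketch of the tail end of an initramfs init script -- not a drop-in
modprobe lustre                           # Lustre client baked into the initrd
mount -t lustre mgs@tcp:/images /sysroot  # hypothetical shared OS image
exec switch_root /sysroot /sbin/init      # pivot onto the shared image
```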

I think the place where this is going to get hard is on the application side 
of things.  There were efforts like Fault-Tolerant MPI (which got subsumed into 
Open MPI), but they still rely on the application being written to use them and 
cope with failure.   Slurm includes fault tolerance support too, in that you 
can request an allocation and, should a node fail, have "hot-spare" nodes 
replace the dead one - but again your application needs to be able to cope 
with the change.
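
On the Slurm side that looks roughly like the batch fragment below (the 
application binary is hypothetical; --no-kill and srun's --kill-on-bad-exit 
are the relevant real options):

```
#!/bin/bash
#SBATCH --nodes=16
#SBATCH --no-kill     # keep the allocation alive if a node dies
# Slurm only preserves the allocation -- the application itself
# still has to notice the dead rank and recover from it.
srun --kill-on-bad-exit=0 ./my_resilient_app   # hypothetical binary
```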

It's a fascinating subject, and the exascale folks have been talking about it 
for a while - LLNL's Dona Crawford gave a keynote about it at the Slurm User 
Group in 2013, and her slides are well worth a read.


Slide 21 talks about the reliability/recovery side of things:

# Mean time between failures of minutes or seconds for exascale
# Need 100X improvement in MTTI so that applications can run for many hours.
# Goal is 10X improvement in hardware reliability.
# Local recovery and migration may yield another 10X.
# However, for exascale, applications will need to be fault resilient.

She also made the point that checkpoint/restart doesn't scale: at exascale you 
will likely end up spending all your compute time doing C/R due to failures 
and never actually getting any work done.
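
You can make that point quantitative with Young's approximation for the 
optimal checkpoint interval; a rough sketch (the checkpoint times and MTBF 
numbers below are illustrative, not from the talk):

```python
import math

def cr_overhead(checkpoint_secs, mtbf_secs):
    """Young's approximation for periodic checkpointing:
    returns the optimal checkpoint interval and the fraction of
    wall time lost to writing checkpoints plus recomputing work."""
    tau = math.sqrt(2 * checkpoint_secs * mtbf_secs)   # optimal interval
    lost = checkpoint_secs / tau + tau / (2 * mtbf_secs)
    return tau, min(lost, 1.0)

# Petascale-ish numbers: 5-minute checkpoint, 24-hour system MTBF
_, lost_peta = cr_overhead(300, 24 * 3600)
# Exascale-ish numbers: 5-minute checkpoint, 30-minute system MTBF
_, lost_exa = cr_overhead(300, 30 * 60)
print(f"lost to C/R: petascale ~{lost_peta:.0%}, exascale ~{lost_exa:.0%}")
```

With those illustrative numbers the overhead jumps from under a tenth of the 
machine to well over half of it, which is exactly the "all your compute time 
doing C/R" failure mode.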

All the best,
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
