[Beowulf] Fault tolerance & scaling up clusters (was Re: Bright Cluster Manager)
Lux, Jim (337K)
james.p.lux at jpl.nasa.gov
Tue May 15 15:19:38 PDT 2018
Yes... the checkpoint/restart thing was discussed on the list some years ago.
The reason I hadn't looked at "diskless boot from a server" is the size of the
image - assume you don't have a high-bandwidth or reliable link.
On 5/12/18, 12:33 AM, "Beowulf on behalf of Chris Samuel" <beowulf-bounces at beowulf.org on behalf of chris at csamuel.org> wrote:
On Wednesday, 9 May 2018 2:34:11 AM AEST Lux, Jim (337K) wrote:
> While I’d never claim my pack of beagles is HPC, it does share some aspects
> – there’s parallel work going on, the nodes need to be aware of each other
> and synchronize their behavior (that is, it’s not an embarrassingly
> parallel task that’s farmed out from a queue), and most importantly, the
> management has to be scalable. While I might have 4 beagles on the bench
> right now – the idea is to scale the approach to hundreds. Typing “sudo
> apt-get install tbd-package” on 4 nodes sequentially might be OK (although
> pdsh and csshx help a lot), but it’s not viable for 100 nodes.
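For that middle ground pdsh does scale surprisingly far; a minimal sketch
(the host range and the package name are placeholders, and it assumes
password-less sudo on the nodes):

    pdsh -w 'node[001-100]' 'sudo apt-get -y install tbd-package'

But the nicer answer is to stop doing per-node installs altogether: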
At ${JOB-1} we moved to diskless nodes booting RAMdisk images from the
management node back in 2013, and it worked really well for us. You no longer
have the problem of nodes getting out of step because one of them was down
when you pushed a package install across the cluster, it takes HDD failures
out of the picture (though that's likely less of an issue with SSDs these
days), and did I mention the peace of mind of knowing everything is the same?
:-)
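The mechanics are mostly stock PXE; a minimal sketch of the management-node
side (the paths and filenames are assumptions for illustration, not our
exact setup):

    # /var/lib/tftpboot/pxelinux.cfg/default
    DEFAULT compute
    LABEL compute
        KERNEL vmlinuz-compute
        APPEND initrd=compute-ramdisk.img

With no root= argument the kernel just runs /init out of the unpacked
initramfs, so the RAMdisk image *is* the node's root filesystem; upgrading
the cluster becomes "rebuild one image, reboot".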
It's not new: the Blue Gene systems we had (BG/P 2010-2012 and BG/Q 2012-2016)
booted RAMdisks too, as they were designed from the beginning to scale up to
huge systems and to remove as many points of failure as possible - no moving
parts on the node cards, no local storage, no local state.
Where I am now we're pretty much the same, except that instead of booting a
pure RAM disk we boot an initrd that pivots onto an image stored on our
Lustre filesystem. These nodes do have local SSDs for local scratch, but
again no real local state.
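The hand-off is the usual initramfs dance; a rough sketch (the server name,
filesystem name and image path are placeholders, assuming the Lustre client
modules are packed into the initrd):

    #!/bin/sh
    # Late initrd hook: mount the shared image, then make it the root.
    modprobe lustre
    mkdir -p /mnt/lustre
    mount -t lustre -o ro mgs01@tcp:/images /mnt/lustre
    # switch_root wants a mountpoint, so bind-mount the image subtree.
    mkdir -p /sysroot
    mount --bind /mnt/lustre/compute-current /sysroot
    exec switch_root /sysroot /sbin/init

The local SSDs then get mounted for scratch by the normal boot once the real
root is up.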
I think the place where this is going to get hard is on the application side
of things. There were efforts like Fault-Tolerant MPI (which got subsumed
into Open MPI), but they still rely on the application being written to use
and cope with that. Slurm includes fault-tolerance support too, in that you
can request an allocation and, should a node fail, have "hot-spare" nodes
replace the dead one - but again your application needs to be able to cope
with it!
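The batch-system side of that is only a couple of lines; a hedged sketch
(--no-kill is a real sbatch/srun option, the resilient application is the
hypothetical part):

    #!/bin/bash
    #SBATCH --nodes=17      # e.g. 16 workers plus one spare
    #SBATCH --no-kill       # keep the allocation alive if a node dies
    # The application itself must detect and survive lost ranks.
    srun --nodes=16 ./my_resilient_app

Slurm will happily keep the allocation going; whether the job survives is
entirely down to the code.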
It's a fascinating subject, and the exascale folks have been talking about it
for a while - LLNL's Dona Crawford gave a keynote about it at the Slurm User
Group in 2013 and it's well worth a read.
https://slurm.schedmd.com/SUG13/keynote.pdf
Slide 21 talks about the reliability/recovery side of things:
# Mean time between failures of minutes or seconds for exascale
[...]
# Need 100X improvement in MTTI so that applications
# can run for many hours. Goal is 10X improvement in
# hardware reliability. Local recovery and migration may
# yield another 10X. However, for exascale, applications
# will need to be fault resilient
She also made the point that checkpoint/restart doesn't scale: at exascale
you will likely end up spending all your compute time doing C/R because of
failures and never actually getting any work done.
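A back-of-the-envelope with Young's approximation makes that concrete (the
numbers here are invented for illustration, not from the talk):

    T_opt ~= sqrt(2 * delta * M)   # delta = checkpoint write time, M = MTBF
    delta = 10 min, M = 30 min  =>  T_opt ~= sqrt(600) ~= 24 min

so you would spend roughly 10 of every 34 minutes (~30%) just writing
checkpoints, before even counting the work lost and redone after each
failure - and it only gets worse as the MTBF shrinks.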
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf