[Beowulf] Fault tolerance & scaling up clusters (was Re: Bright Cluster Manager)

Tue May 15 15:19:38 PDT 2018

Yes.. the checkpoint restart thing was discussed on the list some years ago..

The reason I hadn't looked at "diskless boot from a server" is the size of the image - assume you don't have a high bandwidth or reliable link. 

On 5/12/18, 12:33 AM, "Beowulf on behalf of Chris Samuel" <beowulf-bounces at beowulf.org on behalf of chris at csamuel.org> wrote:

    On Wednesday, 9 May 2018 2:34:11 AM AEST Lux, Jim (337K) wrote:

    > While I’d never claim my pack of beagles is HPC, it does share some aspects
    > – there’s parallel work going on, the nodes need to be aware of each other
    > and synchronize their behavior (that is, it’s not an embarrassingly
    > parallel task that’s farmed out from a queue), and most importantly, the
    > management has to be scalable.   While I might have 4 beagles on the bench
    > right now – the idea is to scale the approach to hundreds.  Typing “sudo
    > apt-get install tbd-package” on 4 nodes sequentially might be ok (although
    > pdsh and csshx help a lot) it’s not viable for 100 nodes.

    At ${JOB-1) we moved to diskless nodes and booting RAMdisk images from the 
    management node back in 2013 and it worked really well for us.  You no longer 
    have the issue about nodes getting out of step because one of them was down 
    when you ran your install of a package across the cluster, removed HDD 
    failures from the picture (though that's likely less an issue with SSDs these 
    days) and did I mention the peace of mind of knowing everything is the same? 
    :-)

    It's not new, the Blue Gene systems we had (BG/P 2010-2012 and BG/Q 2012-2016) 
    booted RAMdisks as they were designed to scale up to huge systems from the 
    beginning and to try and remove as many points of failure as possible - no 
    moving parts on the node cards, no local storage, no local state,

    Where I am now we're pretty much the same, except instead of booting a pure 
    RAM disk we boot an initrd that pivots onto an image stored on our Lustre 
    filesystem instead.  These nodes do have local SSDs for local scratch, but 
    again no real local state.

    I think the place where this is going to get hard is on the application side 
    of things, there were things like Fault-Tolerant MPI (which got subsumed into 
    Open-MPI) but it still relies on the applications being written to use and 
    cope with that.   Slurm includes fault tolerance support too, in that you can 
    request an allocation and should a node fail you can have "hot-spare" nodes 
    replace the dead node but again your application needs to be able to cope with 
    it!

    It's a fascinating subject, and the exascale folks have been talking about it 
    for a while - LLNL's Dona Crawford keynote was about it at the Slurm User 
    Group in 2013 and is well worth a read.

    https://slurm.schedmd.com/SUG13/keynote.pdf

    Slide 21 talks about the reliability/recovery side of things:

    # Mean time between failures of minutes or seconds for exascale
    [...]
    # Need 100X improvement in MTTI so that applications
    # can run for many hours. Goal is 10X improvement in
    # hardware reliability. Local recovery and migration may
    # yield another 10X. However, for exascale, applications
    # will need to be fault resilient

    She also made the point that checkpoint/restart doesn't scale, you will likely 
    end up spending all your compute time doing C/R at exascale due to failures 
    and never actually getting any work done.

    All the best,
    Chris
    -- 
     Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

    _______________________________________________
    Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
    To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf