What happens with a failed node? (Scyld)

Thu Feb 7 08:37:58 PST 2002

Sean,

Okay, so suppose no mail is sent and the slave node either keeps rebooting
itself (for instance, because its network connection is down) or never comes
back up (it died).  What happens to the process that was running on it?
Does the host node reassign it to another slave node after some period of 
time?  What becomes of this "lost" process?  If users are never told that
their process was lost when the node went down, and the host node never
reassigns it for completion, then it's conceivable that an entire chain of
processing could be brought to a halt because of this
silent failure.  Since the host node maintains the master process list, it 
should be aware that a process was running on a node that it now lists as 
down.  What happens to the representation of this process in the table?

Thanks for the help,

-Tony

>From: Sean Dilda <agrajag at scyld.com>
>To: tonystocker at mail.com
>CC: beowulf at beowulf.org
>Subject: Re: What happens with a failed node? (Scyld)
>Date: Thu, 7 Feb 2002 00:56:30 -0500
>
>On Wed, 06 Feb 2002, Tony Stocker wrote:
>
> >
> > Hi All,
> >
 > > Quick question.  What happens if a compute node fails or loses its
 > > network connectivity while processing something (non-parallelized)?  How
 > > long does it take the host node to realize something is wrong?  What does
 > > the host node do then?  Does it send mail reporting which node went down
 > > and what was running on it at the time?
>
>The master node continually pings all the slave nodes, and the slave
>nodes continually ping the master node.  If the master node doesn't get
>a ping response in 30 seconds, it automatically sets the node to down (it
>doesn't tell the node this, but changes its internal representation of
>the node's state).
>
>If the slave node doesn't get a ping response in 30 seconds, the node
>will reboot.  On boot it will then try to connect to the master again,
>and if there are problems it will keep rebooting until it can connect.
>
>There is no mail sent, just the cluster trying to auto-fix itself.
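The master-side half of the ping scheme described above can be pictured
roughly as follows.  This is only a sketch under stated assumptions, not
Scyld's actual code: the NodeTable class, its method names and the "up"/"down"
strings are all hypothetical, and only the 30-second timeout comes from the
message.  It shows a table that records the time of the last ping reply from
each node and quietly flips a node to down once it has been silent too long.

```python
import time

TIMEOUT_SECONDS = 30.0  # the 30-second figure quoted in the message


class NodeTable:
    """Hypothetical master-side view of node state (illustration only)."""

    def __init__(self, node_ids, now=time.monotonic):
        # 'now' is injectable so the logic can be exercised without waiting.
        self._now = now
        start = now()
        self.last_reply = {n: start for n in node_ids}
        self.state = {n: "up" for n in node_ids}

    def record_reply(self, node_id):
        """Call whenever a ping response arrives from node_id."""
        self.last_reply[node_id] = self._now()
        self.state[node_id] = "up"

    def sweep(self):
        """Mark silent nodes as down; return the nodes that just went down.

        Only the master's internal record changes here.  Nothing is sent
        to the node itself, and no mail is generated.
        """
        current = self._now()
        newly_down = []
        for node_id, seen in self.last_reply.items():
            silent_for = current - seen
            if self.state[node_id] == "up" and silent_for > TIMEOUT_SECONDS:
                self.state[node_id] = "down"
                newly_down.append(node_id)
        return newly_down
```

A site that does want mail could call sweep() periodically from its own
monitoring script and report whatever it returns; that hook is an assumption
of this sketch, not something the message says the cluster provides.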
> >
 > > What about if the node was running a parallelized program that is also
 > > being run by other elements of the cluster?  What are the node-fault
 > > procedures/setup in that case?
>
>The status of your parallelized program depends on what you're using to
>parallelize it.  The implementation of MPI that we ship (mpich) will end
>up falling over if one of its nodes disappears under its feet, and as
>far as I know, so will all other implementations.  It is for this reason
>that we recommend users with long-running programs have their programs
>regularly checkpoint, so that in the unlikely event that there is a
>problem, minimal work will be lost.
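As an illustration of the checkpointing advice above, here is a minimal
sketch in Python.  It assumes the program's state fits in a small JSON file
and that the file lives on storage that survives the loss of a compute node
(for example, a filesystem served by the master).  The function names, the
file layout and the toy workload are all hypothetical; a real MPI job would
need each rank to save its own state in a coordinated way.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the new file in as a single step, so a crash
        # mid-write leaves the previous checkpoint untouched.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, every=100):
    """Toy long-running job: sums 0..total_steps-1, resuming if possible."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        state["total"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the node dies partway through, rerunning run() with the same path picks
up from the last saved step, so at most "every" steps of work are repeated.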
>
>
>Sean