What happens with a failed node? (Scyld)
Sean Dilda agrajag at scyld.com
Thu Feb 7 11:19:14 PST 2002
On Thu, 07 Feb 2002, Tony Stocker wrote:
> Sean,
>
> Okay, if there's no mail sent and the slave node keeps rebooting itself (for
> instance if its network connection is down) or if the slave node never comes
> back up (it died). What happens to the process that was running on it?

The node reboots, so all processes that were running on it die.

> Does the host node reassign it to another slave node after some period of
> time? What becomes of this "lost" process? If there's no information
> provided to a user that their process was lost when the node went down, and
> the host node never reassigns it to be completed then it's conceivable that
> an entire string of processing could be brought to a halt because of this
> silent failure. Since the host node maintains the master process list, it
> should be aware that a process was running on a node that it now lists as
> down. What happens to the representation of this process in the table?

I'm not certain, but I believe that when a node goes down, all processes on it
exit as far as the master node is concerned.

As for the lost process getting reassigned, that all depends on what you are
using to spawn jobs. If it knows enough to realize that a node went down and a
process didn't exit properly, it could theoretically respawn the job on another
node. Nothing we ship does this.

Most of our split jobs are done with MPI. With MPI, the current state of your
job is the processes on /all/ the nodes plus all data that is currently on the
wire (in transit over the network). This makes it nearly impossible to restart
just one of the processes, rework all the network connections, and keep all the
internal data representations consistent.

This is why checkpointing is best: it allows you to save data in a consistent
state, then reload it in that same state. Without knowing the internal workings
of your program, it's essentially impossible for the spawning program/library
to do this properly for you.
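To make the checkpointing idea concrete, here is a minimal single-process sketch in Python. It is not anything Scyld ships: the file name job.ckpt, the state layout, and the function names are all hypothetical, and it assumes the checkpoint lives on storage that survives the node failure (for example a filesystem served from the master). An MPI job would additionally need every rank to checkpoint at an agreed point where no messages are in flight, which this sketch does not attempt.

```python
import os
import pickle
import tempfile

CHECKPOINT_PATH = "job.ckpt"  # hypothetical file name, assumed to be on shared storage


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write state to a temporary file, then rename it over the old checkpoint.

    The rename is atomic, so a crash mid-write leaves the previous
    checkpoint intact rather than a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the last saved state, or a fresh state if none exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def run(steps, checkpoint_every=10, path=CHECKPOINT_PATH):
    """Toy job: sum the integers below `steps`, resuming from a checkpoint."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], steps):
        state["total"] += step          # the "work" for this step
        state["next_step"] = step + 1   # record progress alongside the data
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state["total"]
```

If the node dies partway through, rerunning run() on another node picks up from the last saved step instead of from zero. The point of saving the step counter and the data together is that only the program itself knows which pieces of state must be captured as a unit, which is why the spawning layer cannot do this on the program's behalf.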
More information about the Beowulf mailing list
