Two heads are better than one! :)
Bob Drzyzgula bob at drzyzgula.org
Thu Oct 31 19:27:30 PST 2002
Joe,

Thanks, that's a good point. This becomes more or less of an issue, I suppose, depending on the length of time it takes any single "run" of an application to complete. In the past, most of our focus from a systems redundancy perspective has been to minimize the amount of time it would take to get the users back up and running in the event of a system failure; if it takes us more than an hour or so we start feeling bad about it. Checkpointing or other restart capability has been more of a user responsibility -- they know that if they can't find a way to break the job up, and it winds up taking weeks for a single thread to complete, they are running a risk we can't much help them with.

Luckily for us (the system builders and managers), it is mostly the research (read: low priority) stuff that tends to run for weeks. These are econometric models and simulations, FWIW. Generally, production applications are designed to be completable by deadline with a reasonable margin of safety, while the analysts don't feel so constrained when selecting problems for research projects, which often have no firm deadlines.

Still, with the application parallelized and scattered to a number of machines like this, there are new dimensions to this problem. As Don mentioned in another post in this thread, one issue is what you do with the processes that continue to run on the cluster compute nodes if the master fails -- what is the recovery/cleanup procedure and how much can you salvage.

Beyond designing the system to be as fault-resilient as possible, of course, this all becomes an exercise in expectation management -- making sure that the user community knows where the risks are and where they aren't, so that they can plan accordingly. Your point is particularly valuable in that regard.
:-)

--Bob

On Thu, Oct 31, 2002 at 09:26:55PM -0500, Joseph Landman wrote:
>
> On Thu, 2002-10-31 at 20:56, Bob Drzyzgula wrote:
>
> > Thus, the question becomes whether any of the various
> > cluster APIs and services such as PVM, MPI, BPROC, PBS,
> > etc. are dependent on the selection of a single, exclusive
> > master. Clearly if multiple simultaneously operating
> > masters are tolerated in the API, you can just have
> > multiple head nodes which are available all the time. If
> > an API requires a single master, one might have to effect
> > some sort of manual switch-over in the event of a head
> > node failure; this would then raise the question of the
> > complexity of such a switch-over, e.g. would compute node
> > reconfiguration be required or would it simply be a matter
> > of starting up the controller service on a new system.
>
> Hi Bob:
>
> It is more complex than that, in that you would need to preserve state
> changes over the length of the program, and PVM/MPI/et al do not
> preserve this state information. The folks at LANL had a fault tolerant
> MPI at one point, but I haven't heard much of it recently.
>
> Joe
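
For readers wondering what the user-level checkpointing mentioned above might look like in practice, here is a minimal sketch in Python. It is not taken from the thread and is not what anyone on the list was running; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON state layout, and the `stop_after` parameter (used only to simulate a failure) are all assumptions made for illustration. The idea is simply that a long single-threaded job periodically writes its state to disk, so that a restart resumes from the last saved step rather than from the beginning.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk
        os.replace(tmp, path)     # atomic swap; never a half-written file
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0.0}


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Run (or resume) the job. stop_after is hypothetical: it simulates a crash."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated failure: work since last checkpoint is lost
        state["total"] += state["step"] * 0.5  # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `run("job.ckpt", 1000)` a second time after an interruption picks up from the last multiple of `checkpoint_every` that was saved. Note that this only covers a single process; as Joe points out, a parallel job under PVM or MPI would also need the communication state preserved, which this sketch does not attempt.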
More information about the Beowulf mailing list
