[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?
John Hearns john.hearns at clustervision.com
Wed Nov 3 01:05:36 PST 2004
On Tue, 2 Nov 2004, Brian Dobbins wrote:

> I have just begun looking into a checkpoint / restart capability for
> clusters, but looking into the archives here and doing a search has
> shown few viable solutions. Some, like CKPOX (1), appear to be only
> written for the 2.4 series kernels, and I recall seeing one product that
> seemed to indicate it had full support for these operations, but it was
> a commercial product.

From what you say below, you mean suspending user jobs rather than entire systems. I was rather taken by 'swsusp' at one time; this is a Linux suspend-to-disk. The homepage is down today. Does anyone know the state of this?

> Additionally, though this is a much wider question (and one tackled
> before!), what are people's pros and cons of the various queuing
> systems? I've played with OpenPBS before, and 'seen' SGE, but once
> again, I thought it'd be nice to hear what some of the heavy hitters on
> this list prefer.

I am in no way a heavy hitter! I would say go for Gridengine. It has the checkpointing and suspend facilities you are after. However, see below.

> Background: The reason we're looking for a checkpoint/restart option
> has more to do with preempting a running job (of a lower priority) by
> checkpointing it than it does with saving the state in case of a crash.

In Gridengine, there is the concept of a 'subordinate' queue. The lower-priority queue is suspended on a node when a higher-priority queue needs to run jobs there.

> While functionally these may be pretty close or the same, if that gives
> rise to another solution, I'd like to hear it. In essence, we have some
> Monte Carlo sims which are highly parallel, and could run 24-7 for many
> months, but we want to be able to submit a high priority CFD code that
> will take over, run for a few days or so, and then have the system
> automagically restart the MC sim.

I must say, though, that from what I know, checkpointing and restarting serial codes is OK. Checkpointing parallel jobs is problematic and, from what I've read, not recommended: the various processes are passing messages to each other, so how do you checkpoint them in a consistent state? I haven't implemented it myself.

This is worth a discussion from the said heavy hitters. Comments on parallel jobs?
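For the serial case, one way to see why checkpoint/restart is manageable is to do it inside the application rather than in the kernel. The sketch below is only an illustration under my own assumptions, not something taken from Gridengine or any checkpointing package: the function names (`run_pi_estimate`, `save_checkpoint`), the `stop_after` parameter that stands in for a preemption request, and the toy pi estimate are all hypothetical. It saves the sample count, the running tally, and the random number generator state, so a restarted run continues the same random stream.

```python
import os
import pickle
import random


def save_checkpoint(path, done, hits, rng):
    """Write the state to a temp file, then rename it over the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"done": done, "hits": hits, "rng": rng.getstate()}, f)
    os.replace(tmp, path)  # a crash mid-write leaves the previous checkpoint intact


def run_pi_estimate(total_samples, ckpt_path, ckpt_every=1000,
                    stop_after=None, seed=0):
    """Toy Monte Carlo estimate of pi that resumes from ckpt_path if it exists.

    stop_after is a hypothetical stand-in for being preempted: after that many
    samples in this invocation, the state is saved and None is returned.
    """
    rng = random.Random(seed)
    done, hits = 0, 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
        done, hits = state["done"], state["hits"]
        rng.setstate(state["rng"])  # continue the same random stream

    steps_this_run = 0
    while done < total_samples:
        if stop_after is not None and steps_this_run >= stop_after:
            save_checkpoint(ckpt_path, done, hits, rng)
            return None  # "preempted": the caller reruns later to resume
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
        done += 1
        steps_this_run += 1
        if done % ckpt_every == 0:
            save_checkpoint(ckpt_path, done, hits, rng)

    save_checkpoint(ckpt_path, done, hits, rng)
    return 4.0 * hits / total_samples
```

Calling `run_pi_estimate(10000, "mc.ckpt", stop_after=2500)` and then `run_pi_estimate(10000, "mc.ckpt")` should give the same estimate as one uninterrupted run with the same seed. A parallel code has no such easy answer, because every process would have to save its state at a point where no messages are in flight.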
