[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?

Wed Nov 3 01:05:36 PST 2004

On Tue, 2 Nov 2004, Brian Dobbins wrote:

>   I have just begun looking into a checkpoint / restart capability for
> clusters, but looking into the archives here and doing a search has
> shown few viable solutions.  Some, like CKPOX (1), appear to be only
> written for the 2.4 series kernels, and I recall seeing one product that
> seemed to indicate it had full support for these operations, but it was
> a commercial product.
From what you say below, you mean suspending user jobs
rather than entire systems.
I was rather taken by 'swsusp' (software suspend) at one time; this is
Linux suspend-to-disk for the whole machine. Its homepage is down today.
Anyone know the state of this?

> 
> 
>   Additionally, though this is a much wider question (and one tackled
> before!), what are people's pros and cons of the various queuing
> systems?  I've played with OpenPBS before, and 'seen' SGE, but once
> again, I thought it'd be nice to hear what some of the heavy hitters on
> this list prefer.
I am in no way a heavy hitter! 
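I would say go for Gridengine.
It has the suspend facility you are after, plus a checkpointing
interface (as far as I know it drives a checkpointer that you supply,
rather than checkpointing jobs itself).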
However - see below.

> 
>   Background: The reason we're looking for a checkpoint/restart option
> has more to do with preempting a running job (of a lower priority) by
> checkpointing it than it does with saving the state in case of a crash.
In Gridengine, there is the concept of a 'subordinate' queue.
Jobs in the lower-priority (subordinate) queue are suspended on a node
when the higher-priority queue on that node has jobs to run.
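
To make "suspended" concrete: as far as I know, Gridengine's default
way of suspending a job is simply to send it SIGSTOP, and SIGCONT to
resume it. The sketch below imitates that by hand on a single child
process. It is only an illustration, not Gridengine code; the function
names suspend() and resume() are my own, and it assumes a Linux/POSIX
system.

```python
import os
import signal
import subprocess
import sys


def suspend(proc):
    """Send SIGSTOP and wait until the kernel reports the child stopped."""
    os.kill(proc.pid, signal.SIGSTOP)
    _, status = os.waitpid(proc.pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)


def resume(proc):
    """Send SIGCONT and wait until the kernel reports the child running."""
    os.kill(proc.pid, signal.SIGCONT)
    _, status = os.waitpid(proc.pid, os.WCONTINUED)
    return os.WIFCONTINUED(status)


if __name__ == "__main__":
    # Stand-in for the low-priority job: a child that just sleeps.
    job = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    print("suspended:", suspend(job))
    # ... the high-priority job would run here ...
    print("resumed:", resume(job))
    job.kill()
    job.wait()
```

Note that a stopped process still holds its memory, so suspending is
not the same as checkpointing the job out to disk.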

> While functionally these may be pretty close or the same, if that gives
> rise to another solution, I'd like to hear it.  In essence, we have some
> Monte Carlo sims which are highly parallel, and could run 24-7 for many
> months, but we want to be able to submit a high priority CFD code that
> will take over, run for a few days or so, and then have the system
> automagically restart the MC sim.
I must say, though, that from what I know, checkpointing/restarting
serial codes is OK.
Checkpointing parallel jobs is problematic and, from what I've read,
not recommended (the various processes are passing messages, so how
do you checkpoint them all in a consistent state?).

I haven't implemented it.
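
To illustrate the consistency problem with message passing, here is a
toy model, not a real checkpointer. Two "ranks" pass tokens to each
other; the class Rank and both checkpoint functions are made up for
this example. Saving each rank's local state while a message is still
in flight loses that message, whereas delivering all pending messages
first (one simple form of coordination) gives a consistent picture.

```python
from collections import deque


class Rank:
    """Toy process that holds tokens and has an inbox of pending messages."""

    def __init__(self, name, tokens):
        self.name = name
        self.tokens = tokens
        self.inbox = deque()

    def send(self, other, n):
        self.tokens -= n
        other.inbox.append(n)  # the message is now "in flight"

    def deliver_all(self):
        while self.inbox:
            self.tokens += self.inbox.popleft()


def naive_checkpoint(ranks):
    """Save local state only; in-flight messages are silently dropped."""
    return {r.name: r.tokens for r in ranks}


def coordinated_checkpoint(ranks):
    """Quiesce first: deliver every pending message, then save state."""
    for r in ranks:
        r.deliver_all()
    return {r.name: r.tokens for r in ranks}


a, b = Rank("a", 10), Rank("b", 10)
a.send(b, 3)  # 3 tokens are in flight when the checkpoint is taken

naive = naive_checkpoint([a, b])
coordinated = coordinated_checkpoint([a, b])

print(sum(naive.values()))        # 17: the in-flight tokens were lost
print(sum(coordinated.values()))  # 20: total is conserved
```

In a real MPI job the hard part is getting every process to reach such
a quiet point at the same time, which this toy skips entirely.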

This is worth a discussion from the said heavy hitters. Comments on
parallel jobs?