[Beowulf] Checkpoint / Restart on 2.6 series kernels for clusters?

Reuti reuti at staff.uni-marburg.de
Tue Nov 2 15:47:59 PST 2004


>   I have just begun looking into a checkpoint / restart capability for
> clusters, but looking into the archives here and doing a search has
> shown few viable solutions.  Some, like CKPOX (1), appear to be only
> written for the 2.4 series kernels, and I recall seeing one product that
> seemed to indicate it had full support for these operations, but it was
> a commercial product.
> 
>   What solutions have people on this list used for this functionality?
> Am I restricted to going back to the 2.4 series?  (I'd prefer to run
> 2.6 on the AMD64 hardware I've got.)

The add-ons that provide checkpointing in the Linux kernel often impose some
restrictions (e.g. no forks and/or threads in the application). Checkpointing
at the application level seems to me to be the better solution for now. You
may also check the Condor project. It depends on your applications whether
you can use the built-in checkpointing there. It's also a queueing system. I
think the main goal of the development of Condor was to use idle workstations
during the night (maybe I'm wrong - just correct me), so some features may
not be applicable in a cluster configuration with dedicated machines and
servers.

>   Additionally, though this is a much wider question (and one tackled
> before!), what are people's pros and cons of the various queuing
> systems?  I've played with OpenPBS before, and 'seen' SGE, but once
> again, I thought it'd be nice to hear what some of the heavy hitters on
> this list prefer.

I think the statements you'll find in the archive still hold. Although both
are free (and you get the source), SGE is much more stable. I have seen
clusters running OpenPBS where one failing node holds up the whole system.
Furthermore, SGE is in active development, and it controls the slave tasks of
MPI jobs (even if you don't have the sources of the applications).

> Monte Carlo sims which are highly parallel, and could run 24-7 for many
> months, but we want to be able to submit a high priority CFD code that
> will take over, run for a few days or so, and then have the system
> automagically restart the MC sim.

Do you have the sources of the applications? Maybe it's easier to shut down
the application in a proper way (i.e. add some kind of checkpointing support
triggered by a signal), and restart it later. SGE can use this to get exactly
the behavior you need.
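
To make the signal-triggered idea a bit more concrete, here is a minimal
sketch in Python of a toy Monte Carlo run (estimating pi) that saves its state
and exits when asked to, and picks up again from the saved state on the next
start. Everything in it is an assumption for illustration: the file name
mc_checkpoint.json, the function names, the fixed seed, and the choice of
SIGUSR1/SIGTERM as the "please stop" signals. Which signal your queueing
system actually delivers depends on how it is configured, and a real code
would have far more state to save than two counters and a random generator.

```python
import json
import os
import random
import signal

CHECKPOINT = "mc_checkpoint.json"  # hypothetical file name
stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag, the main loop does the real work."""
    global stop_requested
    stop_requested = True


def sample(rng):
    """One Monte Carlo trial: 1 if a random point falls inside the unit circle."""
    x, y = rng.random(), rng.random()
    return 1 if x * x + y * y <= 1.0 else 0


def save_state(path, done, hits, rng):
    """Write the checkpoint to a temp file, then rename it into place atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"done": done, "hits": hits, "rng": rng.getstate()}, f)
    os.replace(tmp, path)


def load_state(path, seed):
    """Resume from a checkpoint if one exists, otherwise start fresh."""
    rng = random.Random(seed)
    if not os.path.exists(path):
        return 0, 0, rng
    with open(path) as f:
        saved = json.load(f)
    version, internal, gauss_next = saved["rng"]
    # JSON stored the inner tuple as a list; setstate() needs a tuple again.
    rng.setstate((version, tuple(internal), gauss_next))
    return saved["done"], saved["hits"], rng


def run(total, path=CHECKPOINT, seed=12345):
    """Return the pi estimate, or None if stopped early after checkpointing."""
    done, hits, rng = load_state(path, seed)
    while done < total:
        if stop_requested:
            save_state(path, done, hits, rng)
            return None
        hits += sample(rng)
        done += 1
    if os.path.exists(path):
        os.remove(path)  # finished: a stale checkpoint must not be reused
    return 4.0 * hits / total


if __name__ == "__main__":
    # Assumed signals; adjust to whatever the queueing system sends.
    signal.signal(signal.SIGUSR1, request_stop)
    signal.signal(signal.SIGTERM, request_stop)
    result = run(200_000)
    print("checkpointed, exiting" if result is None else f"pi ~ {result}")
```

The handler itself only sets a flag; the state is written from the main loop,
at a point where it is known to be consistent. Saving the random generator's
state along with the counters means a stopped and restarted run continues with
the same random sequence as an uninterrupted one.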

Cheers - Reuti


