Is there any work management tools like that.

Donald Becker becker at
Tue Jul 30 09:01:54 PDT 2002

On Tue, 30 Jul 2002, William Thies wrote:

> We need such kind of work management tools working on
> a 32-node cluster.
> 1. We will always run a very large master-slave
> program on this cluster.
> 2. Sometimes, we need to use this cluster to do other
> works. 

Most any scheduling system can handle this kind of job allocation, at
least for new jobs.

The devil is in the details.  For the large job workload, is that job a
number of short-lived independent processes, or a single
job with many long-lived communicating processes?

> (1) We want to power off 8 nodes first,

Why power off?  You can use WOL or IPMI, but that power-cycle will take
on the order of minutes -- far longer than scheduling, and significantly
longer than other approaches to clearing the machine state.  The Scyld
system can clear the machine state in just a few seconds.
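
For reference, the WOL side of this is simple to sketch. A Wake-on-LAN
"magic packet" is six 0xFF bytes followed by the target MAC address
repeated 16 times, usually sent as a UDP broadcast. WOL only powers a
node on; powering it off still needs IPMI or a shutdown from the OS.
The Python below is a minimal sketch: the function names, the broadcast
address and the choice of port 9 are illustrative assumptions, not part
of any Scyld tool.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the Wake-on-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected 12 hex digits, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # Six 0xFF bytes, then the target MAC repeated 16 times (102 bytes total).
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (address and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Even when the packet goes out instantly, the node still has to boot,
which is where the minutes go.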

> And at that time we don't want the GA program to use those 8 nodes

Every scheduling system can prevent job #1 from allocating new
processes on the reserved nodes.  The question is, what happens to
the processes of job #1?
    Are they short-lived enough that they will terminate naturally in a
      few seconds?
    Can the slave processes just be suspended?
    Do you expect the system to check-point and restart them later?
     (If so, what about the non-check-pointed processes they are
      communicating with?)
    Do you expect the system to migrate them to another node?
     (Again, what are your communication expectations?)
    Can the processes be signalled to check-point or migrate themselves?
      (Scyld Beowulf provides tools to make this very easy, but it's not
       a common feature on other scheduling systems.)
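
Two of the options above, suspending a slave and signalling it to
check-point itself, can be sketched with plain POSIX signals. This is
a generic Python illustration, not the Scyld tooling mentioned above:
the CheckpointingWorker class, the use of SIGUSR1 as the "check-point
now" signal and the JSON state file are all hypothetical choices made
for the example.

```python
import json
import os
import signal


def suspend(pid: int) -> None:
    """Freeze a process in place; SIGSTOP cannot be caught or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a previously stopped process continue."""
    os.kill(pid, signal.SIGCONT)


class CheckpointingWorker:
    """Hypothetical slave process that saves its state when sent SIGUSR1."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.state = {"iteration": 0}
        self.checkpoint_requested = False
        # The handler only sets a flag; the state is written at a safe point.
        signal.signal(signal.SIGUSR1, self._request_checkpoint)

    def _request_checkpoint(self, signum, frame) -> None:
        self.checkpoint_requested = True

    def step(self) -> None:
        self.state["iteration"] += 1  # stand-in for one unit of real work
        if self.checkpoint_requested:
            self.save()
            self.checkpoint_requested = False

    def save(self) -> None:
        # Write to a temporary file, then rename, so that a crash mid-write
        # never leaves a truncated check-point behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.path)
```

Neither helper addresses the communication question: a stopped or
check-pointing slave simply goes quiet, and any peer waiting on a
message from it will wait too.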

> 3. This should be a multi-user management tool.
> Would you like to recommend some tools like that?
> Thanks very much!

We provide a queuing, scheduling and node allocation system(*) that can
accomplish this within a cluster.  If you need site-wide scheduling
(multiple OSes, a mix of cluster and independent nodes, crossing
firewall boundaries, etc.) you should look at PBSPro, LSF, and SGE.

Donald Becker				becker at
Scyld Computing Corporation
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993
