Are there any work management tools like that?

Donald Becker becker at scyld.com
Tue Jul 30 09:01:54 PDT 2002


On Tue, 30 Jul 2002, William Thies wrote:

> We need that kind of work management tool for
> a 32-node cluster.
..
> 1. We will always run a very large master-slave
> program on this cluster.
..
> 2. Sometimes we need to use this cluster for other
> work.

Most any scheduling system can handle this kind of job allocation, at
least for new jobs.
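
The core mechanism is just pool-based placement: new work only goes
to nodes in the general pool, so reserving 8 nodes means moving them
to a second pool.  A toy Python sketch of the idea (not any real
scheduler's API):

    # Hypothetical pool-based allocator for a 32-node cluster.
    nodes = ["node%02d" % i for i in range(32)]
    general, reserved = nodes[:24], nodes[24:]

    def place(job, pool):
        # New jobs draw only from the pool they are given; the
        # reserved pool is simply never offered to them.
        if not pool:
            raise RuntimeError("no free node for %s" % job)
        return pool.pop()

    # place("ga-slave-17", general)   # never touches the reserved 8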

The devil is in the details.  For the large job workload, is that job a
number of short-lived independent processes, or a single
job with many long-lived communicating processes?

> (1) We want to power off 8 nodes first,

Why power off?  You can use WOL or IPMI, but that power-cycle will take
on the order of minutes -- far longer than scheduling, and significantly
longer than other approaches to clearing the machine state.  The Scyld
system can clear the machine state in just a few seconds.
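
For reference, sending the WOL packet itself is trivial; the cost is
the reboot, not the wake-up.  A minimal Python sketch of the standard
"magic packet" (6 bytes of 0xFF followed by the target MAC repeated
16 times, sent as a UDP broadcast):

    import socket

    def wake_on_lan(mac, broadcast="255.255.255.255", port=9):
        # Build the magic packet from a MAC like "00:11:22:33:44:55".
        mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
        if len(mac_bytes) != 6:
            raise ValueError("expected a 6-byte MAC address")
        packet = b"\xff" * 6 + mac_bytes * 16
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(packet, (broadcast, port))
        s.close()

    # wake_on_lan("00:11:22:33:44:55")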

> And at that time we don't want the GA program to use those 8 nodes

Every scheduling system can prevent job #1 from allocating new
processes on the reserved nodes.  The question is, what happens to
the processes of job #1?
    Are they short-lived enough that they will terminate naturally in a
      few seconds?
    Can the slave processes just be suspended?
    Do you expect the system to check-point and restart them later?
     (If so, what about the non-check-pointed processes they are
      communicating with?)
    Do you expect the system to migrate them to another node?
     (Again, what are your communication expectations?)
    Can the processes be signalled to check-point or migrate
      themselves?  (See the sketch after this list.  Scyld Beowulf
      provides tools to make this very easy, but it's not a common
      feature in other scheduling systems.)
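
To make the signalling option concrete, here is a minimal, generic
Python sketch -- not Scyld's actual tools, and pickle stands in for a
real check-point library.  The slave dumps a hypothetical STATE on
SIGUSR1, and the master can park slaves with SIGSTOP/SIGCONT, which
frees the CPU but not their memory:

    import os, pickle, signal

    STATE = {"generation": 0, "population": []}  # stand-in GA slave state

    def checkpoint(signum, frame):
        # On SIGUSR1, write our state to disk so the scheduler can
        # kill this process and restart it from the file later.
        with open("slave-%d.ckpt" % os.getpid(), "wb") as f:
            pickle.dump(STATE, f)

    signal.signal(signal.SIGUSR1, checkpoint)

    # Master side, given a slave's pid:
    #   os.kill(pid, signal.SIGUSR1)  # ask it to check-point
    #   os.kill(pid, signal.SIGSTOP)  # suspend (cannot be caught)
    #   os.kill(pid, signal.SIGCONT)  # resume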

> 3. This should be a multi-user management tool.
> Could you recommend some tools like that?
> Thanks very much!

We provide queuing, scheduling, and node allocation systems(*) that can
accomplish this within a cluster.  If you need site-wide scheduling
(multiple OSes, a mix of cluster and independent nodes, crossing
firewall boundaries, etc.) you should look at PBSPro, LSF, and SGE.


-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993



