Are there any work management tools like that?
Donald Becker
becker at scyld.com
Tue Jul 30 09:01:54 PDT 2002
On Tue, 30 Jul 2002, William Thies wrote:
> We need this kind of work management tool running on
> a 32-node cluster.
..
> 1. We will always run a very large master-slave
> program on this cluster.
..
> 2. Sometimes, we need to use this cluster to do other
> work.
Most any scheduling system can handle this kind of job allocation, at
least for new jobs.
The devil is in the details. For the large job, is the workload a
number of short-lived independent processes, or a single job with
many long-lived communicating processes?
> (1) We want to power off 8 nodes first,
Why power off? You can use WOL or IPMI, but that power-cycle will take
on the order of minutes -- far longer than scheduling, and significantly
longer than other approaches to clearing the machine state. The Scyld
system can clear the machine state in just a few seconds.
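For what it's worth, the wake-up packet itself is trivial to generate;
the minutes go to the BIOS and boot sequence, not to the wake-up. A
minimal Python sketch (the MAC address is a placeholder):

#!/usr/bin/env python
"""Send a Wake-on-LAN magic packet. Sketch only; substitute the
target node's real MAC address for the placeholder below."""
import socket

def wake_node(mac):
    # A magic packet is 6 bytes of 0xFF followed by the target
    # MAC address repeated 16 times.
    mac_bytes = bytes.fromhex(mac.replace(':', ''))
    packet = b'\xff' * 6 + mac_bytes * 16
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    s.sendto(packet, ('255.255.255.255', 9))  # port 9: discard
    s.close()

wake_node('00:11:22:33:44:55')  # placeholder MAC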
> And at that time we don't want the GA program to use those 8 nodes
Every scheduling system can prevent job #1 from allocating new
processes on the reserved nodes. The question is, what happens to
the processes of job #1?
Are they short-lived enough that they will terminate naturally in a
few seconds?
Can the slave processes just be suspended?
Do you expect the system to check-point and restart them later?
(If so, what about the non-check-pointed processes they are
communicating with?)
Do you expect the system to migrate them to another node?
(Again, what are your communication expectations?)
Can the processes be signalled to check-point or migrate themselves?
(Scyld Beowulf provides tools to make this very easy, but it's not
a common feature on other scheduling systems; see the sketch below.)
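To make the signalling idea concrete, here is a rough sketch (a generic
illustration, not Scyld's mechanism) of a slave that checkpoints itself
on SIGUSR1 and then stops until the scheduler sends SIGCONT; the
checkpoint path and state contents are placeholders:

#!/usr/bin/env python
"""Self-checkpointing slave sketch: SIGUSR1 dumps state and stops
the process; SIGCONT resumes it where it left off."""
import os
import pickle
import signal
import time

state = {'generation': 0}  # stand-in for the GA slave's real state

def checkpoint(signum, frame):
    with open('slave-%d.ckpt' % os.getpid(), 'wb') as f:
        pickle.dump(state, f)
    os.kill(os.getpid(), signal.SIGSTOP)  # wait here for SIGCONT

signal.signal(signal.SIGUSR1, checkpoint)

while True:
    state['generation'] += 1  # stand-in for one unit of GA work
    time.sleep(1)

Note that this says nothing about messages in flight to the slave's
peers, which is exactly the hard part raised above.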
> 3. This should be a multi-user management tool.
> Would you like to recommend some tools like that?
> Thanks very much!
We provide a queuing, scheduling, and node allocation system(*) that can
accomplish this within a cluster. If you need site-wide scheduling
(multiple OSes, a mix of cluster and independent nodes, crossing
firewall boundaries, etc.) you should look at PBSPro, LSF, and SGE.
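As a point of reference, in PBS-style systems the 24/8 split maps onto
the resource request at submission time. qsub reads "#PBS" comment
lines as directives whatever the script's interpreter, so a Python
sketch (node count and queue name are illustrative):

#!/usr/bin/env python
#PBS -N ga_master
#PBS -l nodes=24
#PBS -q batch
# Illustrative job script: requesting 24 of the 32 nodes leaves 8
# free for other work. Counts and the queue name are placeholders.
import os

# PBS writes the list of allocated nodes to $PBS_NODEFILE.
with open(os.environ['PBS_NODEFILE']) as f:
    nodes = [line.strip() for line in f if line.strip()]
print('GA master was allocated %d nodes' % len(nodes))
# ... launch the slave processes across `nodes` here ...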
--
Donald Becker                           becker at scyld.com
Scyld Computing Corporation             http://www.scyld.com
410 Severn Ave. Suite 210               Second Generation Beowulf Clusters
Annapolis MD 21403                      410-990-9993