[Beowulf] Re: Grid Engine, Parallel Environment, Scheduling, Myrinet, and MPICH
William Burke
wburke999 at msn.com
Wed Mar 23 12:42:39 PST 2005
I can't get a PE (parallel environment) to work on a 50-node Class II Beowulf
cluster. It has a front-end Sunfire v40 (qmaster host) and 49 Sunfire v20s
(execution hosts) running Linux, configured to communicate data over Myrinet
using MPICH-GM version 1.26.14a.
These are the requirements the N1GE environment has to handle (a rough
submission sketch follows the list):
1. Serial jobs that pre-process the data - average runtime 15 minutes.
2. Parallel jobs that consume the pre-processing output - runtime ranges
from 1 to 6 hours.
3. Serial post-processing jobs that run concurrently.
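To make that flow concrete, here is a minimal sketch of how such a chain
could be submitted with job-name dependencies; the script names (pre.sh,
solve.sh, post.sh) are placeholders, not our actual production scripts:

# serial pre-processing step
qsub -N pre pre.sh
# parallel step starts only after pre-processing finishes
qsub -N solve -hold_jid pre -pe mpich-gm 1-4 solve.sh
# serial post-processing starts only after the parallel step finishes
qsub -N post -hold_jid solve post.sh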
I have set up a parallel environment called mpich-gm and a straightforward
FIFO scheduling scheme for testing. When I submit parallel jobs they hang in
the 'qw' (queued, waiting) state and are never dispatched. I am not sure why
the scheduler does not see the jobs that I submit.
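For reference, the PE looks roughly like this; the slot count, paths and
allocation rule below are from memory and meant as an illustration, not a
verbatim dump:

$ qconf -sp mpich-gm
pe_name            mpich-gm
slots              196
user_lists         NONE
xuser_lists        NONE
start_proc_args    $SGE_ROOT/<sge_cell>/mpi/myrinet/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     $SGE_ROOT/<sge_cell>/mpi/myrinet/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE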
I used the Myrinet MPICH template located in the $SGE_ROOT/<sge_cell>/mpi/myrinet
directory to configure the PE, and I also copied the sge_mpirun script to the
$SGE_ROOT/<sge_cell>/bin directory. I configured a Production.q queue that runs
only parallel jobs. As a last sanity check I ran a trace on the scheduler,
submitted a simple parallel job, and these are the results that I got from
the logs:
JOB RUN Window
[wems at wems examples]$ qsub -now y -pe mpich-gm 1-4 -b y hello++
Your job 277 ("hello++") has been submitted.
Waiting for immediate job to be scheduled.
Your qsub request could not be scheduled, try again later.
[wems at wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
Your job 278 ("hello++") has been submitted.
[wems at wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
Your job 279 ("hello++") has been submitted.
This is the second window (SCHEDULER LOG):
[root at wems bin]# qconf -tsm
[root at wems bin]# qconf -tsm
[root at wems bin]# cat /WEMS/grid/default/common/schedd_runlog
Wed Mar 23 06:08:55 2005|-------------START-SCHEDULER-RUN-------------
Wed Mar 23 06:08:55 2005|queue instance "all.q at wems10.grid.wni.com" dropped
because it is temporarily not available
Wed Mar 23 06:08:55 2005|queue instance "Production.q at wems10.grid.wni.com"
dropped because it is temporarily not available
Wed Mar 23 06:08:55 2005|queues dropped because they are temporarily not
available: all.q at wems10.grid.wni.com Production.q at wems10.grid.wni.com
Wed Mar 23 06:08:55 2005|no pending jobs to perform scheduling on
Wed Mar 23 06:08:55 2005|--------------STOP-SCHEDULER-RUN-------------
Wed Mar 23 06:11:37 2005|-------------START-SCHEDULER-RUN-------------
Wed Mar 23 06:11:37 2005|queue instance "all.q at wems10.grid.wni.com" dropped
because it is temporarily not available
Wed Mar 23 06:11:37 2005|queue instance "Production.q at wems10.grid.wni.com"
dropped because it is temporarily not available
Wed Mar 23 06:11:37 2005|queues dropped because they are temporarily not
available: all.q at wems10.grid.wni.com Production.q at wems10.grid.wni.com
Wed Mar 23 06:11:37 2005|no pending jobs to perform scheduling on
Wed Mar 23 06:11:37 2005|--------------STOP-SCHEDULER-RUN-------------
[root at wems bin]# qstat
job-ID  prior    name      user   state  submit/start at       queue   slots  ja-task-ID
-----------------------------------------------------------------------------------------
   279  0.55500  hello++   wems   qw     03/23/2005 06:11:43            1
[root at wems bin]#
BTW, the node wems10.grid.wni.com has connectivity issues and I have not yet
removed it from the cluster queues.
What would cause N1GE to report "no pending jobs to perform scheduling on" in
the schedd_runlog even though there are available slots ready to take jobs?
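In case it helps anyone spot the problem, these are the checks I can think of
running next; I am assuming that the queue's pe_list and the job's scheduling
info are the right places to look:

# is the mpich-gm PE actually attached to the queue?
qconf -sq Production.q | grep pe_list
# ask the scheduler why the pending job is not being dispatched
qstat -j 279
# verify (without submitting) whether such a job could be scheduled at all
qsub -w v -pe mpich-gm 1-4 -b y hello++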
I had no problem submitting serial jobs; only the parallel jobs fail this way.
Are there N1GE/Myrinet issues that I am not aware of? FYI, the same binary
(hello++) runs with no problems from the command line.
Since I generally run scripts from qsub instead of binaries, I also created a
script to run the MPICH executable, but that yielded the same result.
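For completeness, the wrapper script was along these lines; I am assuming the
$TMPDIR/machines file that the PE start procedure writes, and whether plain
mpirun or the template's sge_mpirun is the right launcher here is exactly the
part I am not sure about:

#!/bin/sh
#$ -N hello_gm
#$ -pe mpich-gm 1-4
#$ -cwd
# NSLOTS and TMPDIR/machines are set up by the PE start procedure
# (startmpi.sh), as far as I understand the template
mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./hello++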
I have an additional question regarding the queue configuration parameter
"subordinate_list". How should it be read in the output of qconf -mq
<queue_name>? For example:

subordinate_list low_pri.q=5,small.q

Which queue has priority over the other, based on the slots?
William Burke
Tellitec Sollutions