[Beowulf] Re: Grid Engine, Parallel Environment, Scheduling, Myrinet, and MPICH
Andrew Wang
andrewxwang at yahoo.com.tw
Fri Mar 25 06:51:25 PST 2005
Please send your question to the SGE mailing list:
http://gridengine.sunsource.net/project/gridengine/maillist.html
The "users" list is what you want.
BTW, you should try commands like "qstat -f" or
"qhost" to find out the status of the machines.
Also, do serial jobs work?
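
A minimal sketch of that status check, assuming the SGE client tools are on PATH. The job ID (279), PE name (mpich-gm), and queue name (Production.q) are taken from the quoted message below; the `run` helper is hypothetical and only echoes a command when the tool is not installed. Whether "qstat -j" reports scheduling reasons depends on the scheduler configuration, so treat its output as a hint rather than a full diagnosis.

```shell
#!/bin/sh
# Placeholders taken from the quoted message; override via the environment.
JOB_ID=${JOB_ID:-279}
PE_NAME=${PE_NAME:-mpich-gm}
QUEUE=${QUEUE:-Production.q}

# Hypothetical helper: echo the command, then run it only if the
# binary exists, so the sketch is harmless on a machine without SGE.
run() {
    echo "+ $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    fi
}

run qhost                  # load and reachability of each execution host
run qstat -f               # state of every queue instance
run qstat -j "$JOB_ID"     # details of the pending job
run qconf -sp "$PE_NAME"   # parallel environment definition (slots, allocation_rule)
run qconf -sq "$QUEUE"     # queue configuration, including its pe_list
```

If the PE is not attached to any queue, or the PE has fewer slots than the job requests, the job may stay in "qw" even though serial jobs run.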
Andrew.
--- William Burke <wburke999 at msn.com> wrote:
> I can't get the PE to work on a 50-node class II
> Beowulf. It has a front-end
> Sunfire v40 (qmaster host) and 49 Sunfire v20s
> (execution hosts) running
> Linux, configured to communicate over Myrinet
> using MPICH-GM version
> 1.26.14a.
>
>
>
> These are the workloads the N1GE environment has
> to handle:
>
> 1. Serial jobs for pre-processing the data, with an
> average runtime of 15 minutes.
> 2. Parallel processing jobs that take the
> pre-processing output as input, with runtimes
> ranging from 1 to 6 hours.
> 3. Post-processing serial jobs that run
> concurrently.
>
> I have set up a Parallel Environment called mpich-gm
> and a straightforward
> FIFO scheduling scheme for testing. When I submit
> parallel jobs they stay in
> the 'qw' state and are never dispatched. I am not
> sure why the scheduler
> does not see the jobs that I submit.
>
>
>
> I used the Myrinet MPICH template located in the
> $SGE_ROOT/< sge_cell >/mpi/myrinet
> directory to configure the PE (parallel environment),
> and I copied the
> sge_mpirun script to the $SGE_ROOT/< sge_cell >/bin
> directory. I configured
> a Production.q queue that runs only parallel jobs.
> As a last sanity check I
> ran a trace on the scheduler, submitted a simple
> parallel job, and these are
> the results that I got from the logs:
>
>
>
>
>
> JOB RUN Window
>
> [wems at wems examples]$ qsub -now y -pe mpich-gm 1-4 -b y hello++
> Your job 277 ("hello++") has been submitted.
> Waiting for immediate job to be scheduled.
>
> Your qsub request could not be scheduled, try again later.
>
> [wems at wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
> Your job 278 ("hello++") has been submitted.
>
> [wems at wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
> Your job 279 ("hello++") has been submitted.
>
>
>
> This is the 2nd window: SCHEDULER LOG
>
> [root at wems bin]# qconf -tsm
> [root at wems bin]# qconf -tsm
> [root at wems bin]# cat /WEMS/grid/default/common/schedd_runlog
>
> Wed Mar 23 06:08:55 2005|-------------START-SCHEDULER-RUN-------------
> Wed Mar 23 06:08:55 2005|queue instance "all.q at wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:08:55 2005|queue instance "Production.q at wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:08:55 2005|queues dropped because they are temporarily not available: all.q at wems10.grid.wni.com Production.q at wems10.grid.wni.com
> Wed Mar 23 06:08:55 2005|no pending jobs to perform scheduling on
> Wed Mar 23 06:08:55 2005|--------------STOP-SCHEDULER-RUN-------------
>
> Wed Mar 23 06:11:37 2005|-------------START-SCHEDULER-RUN-------------
> Wed Mar 23 06:11:37 2005|queue instance "all.q at wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:11:37 2005|queue instance "Production.q at wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:11:37 2005|queues dropped because they are temporarily not available: all.q at wems10.grid.wni.com Production.q at wems10.grid.wni.com
> Wed Mar 23 06:11:37 2005|no pending jobs to perform scheduling on
> Wed Mar 23 06:11:37 2005|--------------STOP-SCHEDULER-RUN-------------
>
> [root at wems bin]# qstat
> job-ID  prior    name     user  state  submit/start at      queue  slots  ja-task-ID
> ------------------------------------------------------------------------------------
>    279  0.55500  hello++  wems  qw     03/23/2005 06:11:43             1
>
> [root at wems bin]#
>
>
>
> BTW that node wems10.grid.wni.com has connectivity
> issues and I have not
> removed it from the cluster queue.
>
>
>
> What causes N1GE to report
> "no pending jobs to
> perform scheduling on" in the schedd_runlog even
> though there are available
> slots ready to take jobs?
>
> I had no problem submitting serial jobs; only the
> parallel jobs behave this
> way. Are there N1GE/Myrinet issues that I am not
> aware of? FYI the same
> binary (hello++) runs with no problems from the
> command line.
>
> Since I generally run scripts from qsub instead of
> binaries, I created a
> script to run the MPICH executable, but that yielded
> the same result.
>
>
>
> I have an additional question about a
> queue configuration parameter
> called "subordinate_list". How should it be read in
> the output of qconf -mq
> <queue_name>?
>
> For example:
>
> subordinate_list
> low_pri.q=5,small.q
>
>
>
> Which queue has priority over the other based on the
> slots?
>
>
>
>
>
> William Burke
>
> Tellitec Sollutions
>