<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 11 (filtered medium)">
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Courier;
panose-1:2 7 4 9 2 2 5 2 4 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman";}
a:link, span.MsoHyperlink
{color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{color:purple;
text-decoration:underline;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:Arial;
color:windowtext;}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.25in 1.0in 1.25in;}
div.Section1
{page:Section1;}
/* List Definitions */
@list l0
{mso-list-id:1946233728;
mso-list-type:hybrid;
mso-list-template-ids:-134322820 67698703 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
{mso-level-tab-stop:.5in;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level2
{mso-level-tab-stop:1.0in;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level3
{mso-level-tab-stop:1.5in;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level4
{mso-level-tab-stop:2.0in;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level5
{mso-level-tab-stop:2.5in;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level6
{mso-level-tab-stop:3.0in;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level7
{mso-level-tab-stop:3.5in;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level8
{mso-level-tab-stop:4.0in;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level9
{mso-level-tab-stop:4.5in;
mso-level-number-position:left;
text-indent:-.25in;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
-->
</style>
</head>
<body lang=EN-US link=blue vlink=purple>
<div class=Section1>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>I can’t get a parallel environment (PE) to work on a 50-node
class II Beowulf. It has a front-end Sunfire v40 (qmaster host) and 49 Sunfire v20s
(execution hosts) running Linux, configured to communicate data over Myrinet using MPICH-GM
version 1.26.14a. <o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'><o:p> </o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>The N1GE environment needs to handle the following workload:
<o:p></o:p></span></font></p>
<ol style='margin-top:0in' start=1 type=1>
<li class=MsoNormal style='mso-list:l0 level1 lfo1'><font size=2 face=Arial><span
style='font-size:11.0pt;font-family:Arial'>Serial type jobs for pre-processing
the data – average runtime 15 minutes.</span></font> <font size=2
face=Arial><span style='font-size:11.0pt;font-family:Arial'><o:p></o:p></span></font></li>
<li class=MsoNormal style='mso-list:l0 level1 lfo1'><font size=2 face=Arial><span
style='font-size:11.0pt;font-family:Arial'>Parallel processing jobs that take the
pre-processing output as pipelined input – runtimes range from 1 to 6 hours.</span></font>
<font size=2 face=Arial><span style='font-size:11.0pt;font-family:Arial'><o:p></o:p></span></font></li>
<li class=MsoNormal style='mso-list:l0 level1 lfo1'><font size=2 face=Arial><span
style='font-size:11.0pt;font-family:Arial'>Serial post-processing jobs that run
concurrently.</span></font> <font size=2 face=Arial><span
style='font-size:11.0pt;font-family:Arial'><o:p></o:p></span></font></li>
</ol>
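<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>To make the intended workflow concrete, here is a rough sketch
of how I picture the submissions. It only builds the qsub argument lists and runs
nothing. The script names (preprocess.sh, solver.sh, postprocess.sh) and the job
names are hypothetical, and I am assuming that -hold_jid is the right way to make
the parallel job wait for the pre-processing job, and that post-processing is
independent of both.<o:p></o:p></span></font></p>
<pre>
```python
# Sketch only: build qsub argument lists for the three job types listed above.
# Script and job names are hypothetical placeholders.

def build_pipeline(pe_name="mpich-gm", slots="1-4"):
    """Return qsub command lines as lists of strings (nothing is executed)."""
    # Serial pre-processing job, named so that later jobs can refer to it.
    pre = ["qsub", "-N", "preproc", "preprocess.sh"]
    # Assumption: -hold_jid keeps the parallel job pending until the
    # pre-processing job has finished.
    par = ["qsub", "-N", "solver", "-hold_jid", "preproc",
           "-pe", pe_name, slots, "solver.sh"]
    # Assumption: post-processing is independent, so it carries no hold.
    post = ["qsub", "-N", "postproc", "postprocess.sh"]
    return [pre, par, post]


if __name__ == "__main__":
    for cmd in build_pipeline():
        print(" ".join(cmd))
```
</pre>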
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>I have set up a parallel environment called mpich-gm and a
straightforward FIFO scheduling scheme for testing. When I submit parallel
jobs they stay pending in the ‘qw’ state and are never dispatched. I am
not sure why the scheduler does not see the jobs that I submit. <o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'><o:p> </o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>I used the Myrinet MPICH template located in the
$SGE_ROOT/&lt;sge_cell&gt;/mpi/myrinet directory to configure the PE (parallel
environment), and I copied the sge_mpirun script to the $SGE_ROOT/&lt;sge_cell&gt;/bin
directory. I configured a Production.q queue that runs only parallel
jobs. As a last sanity check I ran a trace on the scheduler and submitted a simple
parallel job. These are the results that I got from the logs:<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'><o:p> </o:p></span></font></p>
<p class=MsoNormal><u><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>JOB RUN Window<o:p></o:p></span></font></u></p>
<p class=MsoNormal><b><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial;font-weight:bold'>[wems@wems examples]$</span></font></b><font
size=2 face=Arial><span style='font-size:10.0pt;font-family:Arial'> qsub -now y
-pe mpich-gm 1-4 -b y hello++<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Your job 277 ("hello++") has been submitted.<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Waiting for immediate job to be scheduled.<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'><o:p> </o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Your qsub request could not be scheduled, try again later.<o:p></o:p></span></font></p>
<p class=MsoNormal><b><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial;font-weight:bold'>[wems@wems examples]$</span></font></b><font
size=2 face=Arial><span style='font-size:10.0pt;font-family:Arial'> qsub -pe
mpich-gm 1-4 -b y hello++<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Your job 278 ("hello++") has been submitted.<o:p></o:p></span></font></p>
<p class=MsoNormal><b><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial;font-weight:bold'>[wems@wems examples]$</span></font></b><font
size=2 face=Arial><span style='font-size:10.0pt;font-family:Arial'> qsub -pe
mpich-gm 1-4 -b y hello++<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Your job 279 ("hello++") has been submitted.<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'><o:p> </o:p></span></font></p>
<p class=MsoNormal><u><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>This is the 2<sup>nd</sup> window SCHEDULER LOG<o:p></o:p></span></font></u></p>
<p class=MsoNormal><b><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial;font-weight:bold'>[root@wems bin]#</span></font></b><font
size=2 face=Arial><span style='font-size:10.0pt;font-family:Arial'> qconf
-tsm<o:p></o:p></span></font></p>
<p class=MsoNormal><b><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial;font-weight:bold'>[root@wems bin</span></font></b><font
size=2 face=Arial><span style='font-size:10.0pt;font-family:Arial'>]# qconf
-tsm<o:p></o:p></span></font></p>
<p class=MsoNormal><b><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial;font-weight:bold'>[root@wems bin]#</span></font></b><font
size=2 face=Arial><span style='font-size:10.0pt;font-family:Arial'> cat
/WEMS/grid/default/common/schedd_runlog<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Wed Mar 23 06:08:55
2005|-------------START-SCHEDULER-RUN-------------<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Wed Mar 23 06:08:55 2005|queue instance "all.q@wems10.grid.wni.com"
dropped because it is temporarily not available<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Wed Mar 23 06:08:55 2005|queue instance
"Production.q@wems10.grid.wni.com" dropped because it is temporarily
not available<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Wed Mar 23 06:08:55 2005|queues dropped because they are temporarily
not available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Wed Mar 23 06:08:55 2005|no pending jobs to perform
scheduling on<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Wed Mar 23 06:08:55
2005|--------------STOP-SCHEDULER-RUN-------------<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Wed Mar 23 06:11:37 2005|-------------START-SCHEDULER-RUN-------------<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Wed Mar 23 06:11:37 2005|queue instance
"all.q@wems10.grid.wni.com" dropped because it is temporarily not
available<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Wed Mar 23 06:11:37 2005|queue instance
"Production.q@wems10.grid.wni.com" dropped because it is temporarily
not available<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Wed Mar 23 06:11:37 2005|queues dropped because they are
temporarily not available: all.q@wems10.grid.wni.com
Production.q@wems10.grid.wni.com<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Wed Mar 23 06:11:37 2005|no pending jobs to perform
scheduling on<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Wed Mar 23 06:11:37
2005|--------------STOP-SCHEDULER-RUN-------------<o:p></o:p></span></font></p>
<p class=MsoNormal><b><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial;font-weight:bold'>[root@wems bin]#</span></font></b><font
size=2 face=Arial><span style='font-size:10.0pt;font-family:Arial'> qstat<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>job-ID prior
name
user state submit/start
at
queue
slots ja-task-ID<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>-----------------------------------------------------------------------------------------------------------------<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'> 279 0.55500 hello++
wems qw
03/23/2005
06:11:43
1<o:p></o:p></span></font></p>
<p class=MsoNormal><b><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial;font-weight:bold'>[root@wems bin]#<o:p></o:p></span></font></b></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'><o:p> </o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>BTW that node wems10.grid.wni.com has connectivity issues
and I have not removed it from the cluster queue. <o:p></o:p></span></font></p>
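<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>In case it helps anyone reproduce what I am looking at, here
is a small sketch for summarizing a schedd_runlog like the one above. It assumes
every line has the "timestamp|message" shape shown in my log; the function name
summarize_runlog is my own and not part of N1GE.<o:p></o:p></span></font></p>
<pre>
```python
import re

# Matches messages such as:
#   queue instance "all.q@host" dropped because it is temporarily not available
DROPPED = re.compile(r'queue instance "([^"]+)" dropped because (.+)$')


def summarize_runlog(lines):
    """Collect dropped queue instances and count 'no pending jobs' runs."""
    dropped = {}
    no_pending_runs = 0
    for line in lines:
        # Each entry is "timestamp|message"; keep only the message part.
        message = line.rstrip("\n").split("|", 1)[-1]
        match = DROPPED.search(message)
        if match:
            # Map the queue instance name to the reason the scheduler gave.
            dropped[match.group(1)] = match.group(2)
        elif message.startswith("no pending jobs"):
            no_pending_runs += 1
    return dropped, no_pending_runs


if __name__ == "__main__":
    sample = [
        'Wed Mar 23 06:08:55 2005|queue instance "all.q@wems10.grid.wni.com" '
        'dropped because it is temporarily not available',
        'Wed Mar 23 06:08:55 2005|no pending jobs to perform scheduling on',
    ]
    print(summarize_runlog(sample))
```
</pre>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>Run against my log, this would list only the two wems10 queue
instances as dropped, which is why I do not think the dropped node explains the
pending jobs.<o:p></o:p></span></font></p>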
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'><o:p> </o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>What causes N1GE to report “no
pending jobs to perform scheduling on” in the schedd_runlog even though there
are free slots ready to take jobs? <o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>I had no problem submitting serial jobs; only the parallel
jobs end up like this. Are there N1GE/Myrinet issues that I am not aware
of? FYI, the same binary (hello++) runs with no problems from the command
line.<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>Since I generally submit scripts rather than binaries through qsub, I
created a script to run the MPICH executable, but that yielded the same result.<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'><o:p> </o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>I have an additional question about a queue configuration
parameter called "subordinate_list". How should its value be read in the output
of qconf -mq &lt;queue_name&gt;?<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>For example:<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2
face=Courier><span style='font-size:9.5pt;font-family:Courier'>subordinate_list
low_pri.q=5,small.q</span></font><font size=2><span
style='font-size:9.5pt'><o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face="Times New Roman"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>In a setting like this, which queue has priority over the
other, and how do the slot values decide it?<o:p></o:p></span></font></p>
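<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>To show how I am currently reading the value, here is a small
sketch. It rests on my own assumption, which is what I would like confirmed, that
the value is a comma-separated list of queue names, each with an optional "=N"
slot threshold. The function name parse_subordinate_list is hypothetical and not
part of N1GE.<o:p></o:p></span></font></p>
<pre>
```python
def parse_subordinate_list(value):
    """Split 'low_pri.q=5,small.q' into (queue_name, threshold) pairs.

    threshold is an int when '=N' is present, otherwise None.
    Assumption: entries are comma-separated and '=N' is a slot count.
    """
    entries = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty pieces such as a trailing comma
        name, sep, threshold = item.partition("=")
        entries.append((name, int(threshold) if sep else None))
    return entries


if __name__ == "__main__":
    print(parse_subordinate_list("low_pri.q=5,small.q"))
```
</pre>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:11.0pt;
font-family:Arial'>With my example this gives low_pri.q with a threshold of 5 and
small.q with no threshold, but I do not know how the scheduler treats the two
cases differently.<o:p></o:p></span></font></p>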
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'><o:p> </o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'><o:p> </o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>William Burke<o:p></o:p></span></font></p>
<p class=MsoNormal><font size=2 face=Arial><span style='font-size:10.0pt;
font-family:Arial'>Tellitec Sollutions<o:p></o:p></span></font></p>
</div>
</body>
</html>