<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:st1="urn:schemas-microsoft-com:office:smarttags" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 11 (filtered medium)">
<title>Reuti,</title>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags"
name="PersonName"/>
<!--[if !mso]>
<style>
st1\:*{behavior:url(#default#ieooui) }
</style>
<![endif]-->
<style>
<!--
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman";}
a:link, span.MsoHyperlink
{color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{color:purple;
text-decoration:underline;}
p.MsoPlainText, li.MsoPlainText, div.MsoPlainText
{margin:0in;
margin-bottom:.0001pt;
font-size:10.0pt;
font-family:"Courier New";}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 77.95pt 1.0in 77.95pt;}
div.Section1
{page:Section1;}
-->
</style>
</head>
<body lang=EN-US link=blue vlink=purple>
<div class=Section1>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Reuti,<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>>> I'd suggest to move over to the SGE users list at: <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>>> <a
href="http://gridengine.sunsource.net/servlets/ProjectMailingListList">http://gridengine.sunsource.net/servlets/ProjectMailingListList</a><o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>I have but I do not see my name yet? How long is the verification
process?<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>>> Although there is a special Myrinet directory, you can also
try to use <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>>> the files in the mpi directory instead.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>The mpi directory's mpich.template doesn't use mpirun.ch_gm so how does
it know what version of mpirun to use? If I use the mpi what changes do I have
to make?<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>>> Can you please give more details of your queue and PE setup
(qconf -sq/sp <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>>> output<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>SEE BELOW<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>>> Do you have an admin account for SGE? I'd prefer not to do
anything in <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>>> SGE as root.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Yes, its grid... SEE BELOW<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>>> Not really an issue: you have to make a small change to the <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>>> mpirun.ch_gm.pl to make all jobs staying in the same process
group to get <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>>> them correctly killed in case of a jobb abort:<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>I have to double check that in:<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>http://gridengine.sunsource.net/howto/mpich-integration.html<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><u><font size=2 face="Courier New"><span
style='font-size:10.0pt'>Here is the new problem I have this situation in the
PE:<o:p></o:p></span></font></u></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>My jobs won’t run when I run my script it goes into pending mode
for about 10 sec (status qw), SGE submits to N number of hosts (status t), jobs
hangs in a status t mode, then quickly exit. When I investigated both the Jobscript_name.{pe|po}JobID
output it states that SGE can't make links in the <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>/WEMS/grid/tmp/549.1.Production.q/ directory. <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>It looks like the startmpi.sh script links files in $TMPDIR and from my
understanding the value of $TMPDIR <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>is derived from the tmpdir parameter in the queue's configuration. I
have designated this attribute as <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>'/WEMS/grid/tmp/' but according to the errorlog qsub_wrf.sh.pe549 it is
'/WEMS/grid/tmp/549.1.Production.q/' <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Possibly the source of the problem is here, so what created the
'549.1.Production.q' addendum? <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>I then checked the permission of /WEMS/grid/tmp<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>[wems@wems grid]$ ls -ltr /WEMS/grid | grep tmp<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>drwxrwxrwx 2 root
root 4096 Mar 26 17:34 tmp<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>As a sanity check within the startmpi.sh I echo out the ls -ltr of
$TMPDIR:<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>drwxr-xr-x 2 65534
65534 4096 Mar 26 2005
549.1.Production.q<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>as expected there is no UID/GID that is 65534 in my /etc/passwd.
Furthermore there are only write permission <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>for UID/GID 65534 so if it(N1GE) is the only one writing and reading
this directory what else could be <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>preventing the writing into that directory? I thought maybe there was a
lock file in /WEMS/grid/tmp so I checked..<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>[wems@wems tmp]$ ls -al /WEMS/grid/tmp<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>total 8<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>drwxrwxrwx 2 root
root 4096 Mar 26 17:34 <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>drwxr-xr-x 22 grid
grid 4096 Mar 26 04:20 .<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>No Avail, so I am out of solutions. Is this a known issue when using
myrinet, mpich, tight integration or am I overlook something?? I am using the
sge_mpirun script instead of mpirun script. Have you seen any problem like this
before?<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>I also suspect that the editor may be reading the PE mpich
configuration file’s argument <i><span style='font-style:italic'>start_proc_args</span></i>
incorrectly since the editor wraps the string of argument around to the next
line according to the /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.pbs
file. In CHECK 5 it says “The mpirun command "\" does not exist”<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>SEE CHECK 5, CHECK 8, then CHECK 7 BELOW.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Oh yeah, this may be a silly question but where does SGE get
$pe_hostfile, $TMPDIR from and what is the process of how it acquires these
variables? I would like some clarification.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Thanks,<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>William<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Things that I checked<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>CHECK 0.5<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>[root@wems wrfprd]# cat qsub_wrf.sh<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#!/bin/sh<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#$ -S /bin/ksh<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#$ -pe mpich 32<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#$ -l h_rt=10800<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#$ -q Production.q<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#. /WEMS/wems/external/WRF/wrfsi/etc/setup-mpi.sh<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>cd /WEMS/wems/data/WRF/wni001a/wrfprd<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>echo 'This is the job ID '$JOB_ID >
/WEMS/wems/data/WRF/wni001a/log/050811200.wrf.pbs<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>echo 'This is the pe_hostfile '$PE_HOSTFILE >>
/WEMS/wems/data/WRF/wni001a/log/050811200.wrf.ps<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>echo 'This is the tmpdir '$TMPDIR >>
/WEMS/wems/data/WRF/wni001a/log/050811200.wrf.ps<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>/WEMS/grid/mpi/myrinet/sge_mpirun
/WEMS/wems/external/WRF/wrfsi/../run/wrf.exe >> /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.pbs
2>&1<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>CHECK 1<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>[wems@wems wems]$ qsub -pe mpich 32 -P test -q Production.q
/WEMS/wems/data/WRF/wni001a/wrfprd/qsub_wrf.sh<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>CHECK 2<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>[wems@wems grid]$ cat qsub_wrf.sh.pe549<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>ln: creating symbolic link `/WEMS/grid/tmp/549.1.Production.q/mpirun.sge'
to <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>`/WEMS/pkgs/mpich-gm-1.2.6.14a/bin/mpirun.ch_gm': Permission denied<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>/WEMS/grid/mpi/myrinet/startmpi.sh[142]: cannot create
/WEMS/grid/tmp/549.1.Production.q/machines: Permission denied<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>cat: /WEMS/grid/tmp/549.1.Production.q/machines: No such file or
directory<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>ln: creating symbolic link `/WEMS/grid/tmp/549.1.Production.q/rsh' to
`/WEMS/grid/mpi/rsh': Permission denied<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>CHECK 3<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>[wems@wems grid]$ cat qsub_wrf.sh.po549<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>-catch_rsh /WEMS/grid/wems-hosts2 /WEMS/pkgs/mpich-gm-1.2.6.14a/bin/mpirun.ch_gm<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>this is the value of mpirun
/WEMS/pkgs/mpich-gm-1.2.6.14a/bin/mpirun.ch_gm<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>I am doing a ls -ltr on $TMPDIR<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>total 4<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>drwxr-xr-x 2 65534
65534 4096 Mar 26 2005
549.1.Production.q<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Machine file is /WEMS/grid/tmp/549.1.Production.q/machines<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>CHECK 4<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>[wems@wems grid]$ cat Queue-config<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>qname
Production.q<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>hostlist
@Parallel<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>seq_no
0<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>load_thresholds np_load_avg=1.75<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>suspend_thresholds NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>nsuspend
1<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>suspend_interval 00:05:00<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>priority
0<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>min_cpu_interval 00:05:00<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>processors
2<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>qtype
BATCH<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>ckpt_list
NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>pe_list
mpich<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>rerun
FALSE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>slots
2<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>tmpdir /WEMS/grid/tmp<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>shell
/bin/ksh<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>prolog
NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>epilog
NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>shell_start_mode posix_compliant<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>starter_method NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>suspend_method NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>resume_method NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>terminate_method NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>notify
00:00:60<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>owner_list
NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>user_lists
Test_A<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>xuser_lists
NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>subordinate_list NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>complex_values NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>projects
test<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>xprojects
NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>calendar
NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>initial_state default<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>s_rt
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>h_rt
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>s_cpu
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>h_cpu
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>s_fsize
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>h_fsize
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>s_data
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>h_data
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>s_stack
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>h_stack
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>s_core
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>h_core
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>s_rss
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>h_rss
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>s_vmem
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>h_vmem
INFINITY<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>CHECK 5<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>[wems@wems grid]$ cat mpich-PE-config<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>pe_name
mpich<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>slots
78<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>user_lists Test_A<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>xuser_lists NONE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>start_proc_args /WEMS/grid/mpi/myrinet/startmpi.sh
-catch_rsh \<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>
/WEMS/grid/wems-hosts2 \<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>
/WEMS/pkgs/mpich-gm-1.2.6.14a/bin/mpirun.ch_gm<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>stop_proc_args /WEMS/grid/mpi/myrinet/stopmpi.sh<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>allocation_rule $fill_up<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>control_slaves TRUE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>job_is_first_task FALSE<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>urgency_slots min<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>CHECK 6<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>[wems@wems wems]# cat /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.ps<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>This is the pe_hostfile
/WEMS/grid/default/spool/wems18/active_jobs/388.1/pe_hostfile<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>This is the tmpdir /WEMS/grid/tmp/388.1.Production.q<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>This is the pe_hostfile
/WEMS/grid/default/spool/wems07/active_jobs/389.1/pe_hostfile<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>This is the tmpdir /WEMS/grid/tmp//389.1.Production.q<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>This is the pe_hostfile
/WEMS/grid/default/spool/wems24/active_jobs/390.1/pe_hostfile<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>This is the tmpdir /WEMS/grid/tmp/398.1.Production.q<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>This is the pe_hostfile /WEMS/grid/default/spool/wems22/active_jobs/549.1/pe_hostfile<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>This is the tmpdir /WEMS/grid/tmp/549.1.Production.q<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>This is the pe_hostfile<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>This is the tmpdir<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>CHECK 7<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>[wems@wems wems]$ cat /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.pbs<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>This is the job ID 549<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>The mpirun command "\" does not exist<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>There must be a problem with the mpich parallel environment<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>CHECK 8<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>[root@wems wrfprd]# cat qsub_wrf.sh<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#!/bin/sh<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#$ -S /bin/ksh<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#$ -pe mpich 32<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#$ -l h_rt=10800<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#$ -q Production.q<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>#. /WEMS/wems/external/WRF/wrfsi/etc/setup-mpi.sh<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>cd /WEMS/wems/data/WRF/wni001a/wrfprd<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>echo 'This is the job ID '$JOB_ID >
/WEMS/wems/data/WRF/wni001a/log/050811200.wrf.pbs<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>echo 'This is the pe_hostfile '$PE_HOSTFILE >>
/WEMS/wems/data/WRF/wni001a/log/050811200.wrf.ps<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>echo 'This is the tmpdir '$TMPDIR >>
/WEMS/wems/data/WRF/wni001a/log/050811200.wrf.ps<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>/WEMS/grid/mpi/myrinet/sge_mpirun
/WEMS/wems/external/WRF/wrfsi/../run/wrf.exe >> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>/WEMS/wems/data/WRF/wni001a/log/050811200.wrf.pbs 2>&1<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>exit<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>-----Original Message-----<br>
From: Reuti [mailto:reuti@staff.uni-marburg.de] <br>
Sent: Wednesday, March 23, 2005 6:26 PM<br>
To: William Burke<br>
Cc: beowulf@beowulf.org<br>
Subject: Re: [Beowulf]</span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Hi,<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>I'd suggest to move over to the SGE users list at: <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>http://gridengine.sunsource.net/servlets/ProjectMailingListList<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>But anyway, let's sort the things out:<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Quoting William Burke <<st1:PersonName w:st="on">wburke999@msn.com</st1:PersonName>>:<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> I can't get PE to work on a 50 node class II Beowulf. It has a
front-end<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Sunfire v40 (qmaster host) and 49 Sunfire v20s (execution hosts)
running<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Linux configured to communicate data over Myrinet using MPICH-GM
version<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> 1.26.14a. <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Although there is a special Myrinet directory, you can also try to use
the <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>files in the mpi directory instead.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> These are the requirements of the N1GE environment to handle: <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> 1. Serial type jobs for pre-processing the data - average
runtime 15<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> minutes. <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> 2. Output is pipelined into parallel processing jobs - range
of runtime<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> 1- 6 hours. <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> 3. Concurrently running is post-processing serial jobs. <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> I have setup a Parallel Environment called mpich-gm and a
straight-forward<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> FIFO scheduling schema for testing. When I submit parallel jobs
they hang<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> in<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> limbo in a 'qw' state pending submission. I am not sure why the
scheduler<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> does not see jobs that I submit. <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> I used the myrinet mpich template located $SGE_ROOT/< sge_cell<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> >/mpi/myrinet<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> directory to configure the pe (parallel environment) plus I copied
the<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> sge_mpirun script to the $SGE_ROOT/< sge_cell >/bin
directory. I<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> configured<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> a Production.q queue that runs only parallel jobs. As a last
sanity check I<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> ran a trace on the scheduler, submitted a simple parallel job, and
this is<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> the results that I got from the logs:<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Can you please give more details of your queue and PE setup (qconf
-sq/sp <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>output).<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> JOB RUN Window<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> [wems@wems examples]$ qsub -now y -pe mpich-gm 1-4 -b y hello++<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Your job 277 ("hello++") has been submitted.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Waiting for immediate job to be scheduled.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Your qsub request could not be scheduled, try again later.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> [wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Your job 278 ("hello++") has been submitted.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> [wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Your job 279 ("hello++") has been submitted.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>You can't start a parallel job this way, as there is no mpirun used.
When you <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>used your mentioned script, you get the same behavior (and there you
used <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>mpirun -np $NSLOTS ...)?<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> This is the 2nd window SCHEDULER LOG<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> [root@wems bin]# qconf -tsm<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> [root@wems bin]# qconf -tsm<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> [root@wems bin]# cat /WEMS/grid/default/common/schedd_runlog<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Wed Mar 23 06:08:55
2005|-------------START-SCHEDULER-RUN-------------<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Wed Mar 23 06:08:55 2005|queue instance
"all.q@wems10.grid.wni.com" dropped<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> because it is temporarily not available<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Wed Mar 23 06:08:55 2005|queue instance
"Production.q@wems10.grid.wni.com"<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> dropped because it is temporarily not available<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Wed Mar 23 06:08:55 2005|queues dropped because they are
temporarily not<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Wed Mar 23 06:08:55 2005|no pending jobs to perform scheduling on<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Wed Mar 23 06:08:55
2005|--------------STOP-SCHEDULER-RUN-------------<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Wed Mar 23 06:11:37
2005|-------------START-SCHEDULER-RUN-------------<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Wed Mar 23 06:11:37 2005|queue instance
"all.q@wems10.grid.wni.com" dropped<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> because it is temporarily not available<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Wed Mar 23 06:11:37 2005|queue instance
"Production.q@wems10.grid.wni.com"<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> dropped because it is temporarily not available<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Wed Mar 23 06:11:37 2005|queues dropped because they are
temporarily not<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> available: all.q@wems10.grid.wni.com
Production.q@wems10.grid.wni.com<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Wed Mar 23 06:11:37 2005|no pending jobs to perform scheduling on<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Wed Mar 23 06:11:37 2005|--------------STOP-SCHEDULER-RUN-------------<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> [root@wems bin]# qstat<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> job-ID prior name
user state submit/start
at queue<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> slots ja-task-ID<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>>
----------------------------------------------------------------------------<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> -------------------------------------<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> 279 0.55500 hello++
wems qw
03/23/2005 06:11:43<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> 1<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> [root@wems bin]#<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Do you have an admin account for SGE? I'd prefer not to do anything in
SGE as <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>root.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> BTW that node wems10.grid.wni.com has connectivity issues and I
have not<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> removed it from the cluster queue. <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> What causes this type of problem in N1GE to return "no
pending jobs to<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> perform scheduling on" in the schedd_runlog even though there
are available<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> slots ready to take jobs? <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> I had no problem submitting serial jobs, only the parallel jobs
resulted as<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> such. Are there N1GE - Myrinet issue that I am not aware of?
FYI the same<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> binary (hello++) runs with no problems from the command line.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>If you just start hello++, it will not run in parallel I think.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Not really an issue: you have to make a small change to the
mpirun.ch_gm.pl to <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>make all jobs staying in the same process group to get them correctly
killed in <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>case of a jobb abort:<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>http://gridengine.sunsource.net/howto/mpich-integration.html<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Since I generally run scripts from qsub instead of binaries I
created a<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> script to run the mpich executable but that yield the same result.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> I have an additional question regarding setting a queue.conf
parameter<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> called "subordinate_list". How is it read from the
result of qconf -mq<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <queue_name>?<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> Example <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>> <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>>
i.e., subordinate_list low_pri.q=5,small.q.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>The queue "low_pri.q" will be suspended, when 5 or more slots
of "<queue_name>" <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>are filled. The "small.q" will be suspened, if all slots of
"<queue_name>" are <o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>filled.<o:p></o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'><o:p> </o:p></span></font></p>
<p class=MsoPlainText><font size=2 face="Courier New"><span style='font-size:
10.0pt'>Cheers - Reuti<o:p></o:p></span></font></p>
</div>
</body>
</html>