[Beowulf] Tight MPICH2 Integration with SGE

Chris Dagdigian dag at sonsorol.org
Sat Jan 26 19:40:40 PST 2008


Hi Sangamesh,

First things first -

Not sure if this affects you but the mpich2-1.0.6p1 release does not
currently work with tight SGE integration.

The specific SGE mailing list thread where this is discussed is linked  
to from here:
http://gridengine.info/articles/2008/01/25/tight-mpich2-integration-broken-with-mpich2-1-0-6p1

Another problem I see is inside your job script:

>> $MPI_HOME/mpiexec -np 4 -machinefile /root/MFM
>> /opt/MEME-MAX/bin/meme_p /opt/MEME-MAX/NCCS/samevivo_sample.txt -dna
>> -mod tcm -nmotifs 10 -nsites 100 -minw 5 -maxw 50 -revcomp -text
>> -maxsize 200500


In this command you are explicitly asking for 4 CPUs and hard-coding
the path to an MPI machines file. This makes nonsense of the entire
concept of Grid Engine MPICH integration, the whole point of which is
to allow the SGE scheduler to control how many CPUs your job gets and
(more importantly) where those CPUs actually are.

Your mpiexec command needs to take the value for "-np" and the value  
for "-machinefile" from the SGE scheduler. This is done via  
environment variables.

Your command should probably look something like this:

$MPI_HOME/mpiexec -np $NSLOTS -machinefile $TMPDIR/machines <rest of  
command ... >
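
To make that concrete, here is a rough sketch of how the launch line
could be wrapped in a job script. The run_mpi helper name and the
dry-run behaviour when JOB_ID is unset are my own illustration, not
something SGE or MPICH2 ships. NSLOTS, TMPDIR and JOB_ID are set by SGE
inside a running job, and $TMPDIR/machines is assumed to have been
written by the PE start script.

```shell
#!/bin/sh
#$ -cwd
#$ -pe MPICH2-SM 4
#$ -v MPI_HOME=/opt/MPI_LIBS/MPICH2-GNU/MPICH2-SM/bin

# run_mpi is a hypothetical helper (not part of SGE or MPICH2).
# It launches a program on the slots and hosts granted by the scheduler.
run_mpi() {
    # NSLOTS is the slot count SGE granted; TMPDIR is the per-job
    # scratch directory where the machines file is assumed to live.
    if [ -z "$NSLOTS" ] || [ -z "$TMPDIR" ]; then
        echo "NSLOTS or TMPDIR is not set; is this running under a PE?" >&2
        return 1
    fi

    # Prepend the mpiexec invocation to the caller's program and arguments.
    set -- "$MPI_HOME/mpiexec" -np "$NSLOTS" \
        -machinefile "$TMPDIR/machines" "$@"

    if [ -n "$JOB_ID" ]; then
        "$@"        # inside an SGE job: really launch
    else
        echo "$@"   # outside SGE: dry run, only print the command
    fi
}
```

The last line of the job script would then be something like
"run_mpi /opt/MEME-MAX/bin/meme_p <rest of command ... >". The point is
only that nothing about the CPU count or host list is typed in by hand.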


Finally, your PE configuration does not match what you say is in the  
documentation:

> start_proc_args /usr/sge/mpich2_smpd_rsh/startmpich2.sh -catch_rsh  
> $pe_hostfile

vs.

> start_proc_args   /opt/gridengine/mpi/MPICH2-SM/startmpich2sm.sh


I would guess that not passing $pe_hostfile to startmpich2.sh in your  
start_proc_args is probably the reason for the specific error you quote.
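
For illustration, this is roughly the kind of work a PE start script
does with that argument. It is a sketch and NOT the startmpich2.sh that
ships with the howto: the function names are made up, and I am assuming
the usual pe_hostfile layout of one "host slots queue processor-range"
line per host and a machines file with one hostname per slot.

```shell
#!/bin/sh
# Sketch only; pe_hostfile_to_machines and start_pe are hypothetical names.

# Expand each "host slots ..." line into one hostname line per slot.
pe_hostfile_to_machines() {
    while read -r host slots _rest; do
        # Skip blank or malformed lines that carry no slot count.
        [ -n "$slots" ] || continue
        i=0
        while [ "$i" -lt "$slots" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done < "$1"
}

# Entry point: expects exactly one argument, the path SGE substitutes
# for $pe_hostfile in start_proc_args.
start_pe() {
    if [ "$#" -ne 1 ]; then
        echo "got wrong number of arguments" >&2
        return 1
    fi
    pe_hostfile_to_machines "$1" > "$TMPDIR/machines"
}
```

If start_proc_args names the script without $pe_hostfile, a script
shaped like this is called with no arguments, bails out early, and
never writes $TMPDIR/machines, which would also explain the later
"rm: cannot remove" messages.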


So my specific advice boils down to:

(1) Make sure you are not using the MPICH2 release (1.0.6p1) that has
been causing problems for SGE people recently
(2) Fix your SGE job script to use "-np $NSLOTS" and
"-machinefile $TMPDIR/machines" instead of the hard-coded values
(3) Pass the parameter $pe_hostfile to your start_proc_args line in  
your parallel environment (PE) config

Regards,
Chris






On Jan 25, 2008, at 9:41 AM, Sangamesh B wrote:

>  Hi all,
>
>     I'm doing the Tight MPICH2 (not MPICH)  Integration with SGE on  
> a cluster with, dual core dual AMD64 opteron processor.
>
>  Followed the sun document located at:
>
>   http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>
>   The document explains the following three kinds of TI:
>     Tight Integration (TI) using Process Manager (PM): gforker
>     TI using PM: SMPD – Daemonless
>     TI using PM: SMPD – Daemonbased
>
> I did the TI with gforker and tested it successfully.
>
>
> But failed to do TI with daemonless-SMPD.
>
> Let me explain what I did.
>
> Installed the MPICH2 with smpd configuration.
>
> The sge is installed at: /opt/gridengine
>
> And created MPICH2-SM folder in /opt/gridengine/mpi by referring the  
> following lines from the document
>
> start_proc_args   /usr/sge/mpich2_smpd_rsh/startmpich2.sh -catch_rsh  
> $pe_hostfile
> stop_proc_args    /usr/sge/mpich2_smpd_rsh/stopmpich2.sh
>
> Copied the startmpi.sh, stopmpi.sh from /opt/gridengine/mpi to the
> /opt/gridengine/mpi/MPICH2-SM dir, because the doc does not say
> what to include in these scripts.
>
> Using qmon, created MPICH2-GF pe .
>
> # qconf -sp MPICH2-SM
> pe_name           MPICH2-SM
> slots             999
> user_lists        rootuserset
> xuser_lists       NONE
> start_proc_args   /opt/gridengine/mpi/MPICH2-SM/startmpich2sm.sh
> stop_proc_args    /opt/gridengine/mpi/MPICH2-SM/stopmpich2sm.sh
> allocation_rule   $round_robin
> control_slaves    FALSE
> job_is_first_task TRUE
> urgency_slots     min
>
> Added this PE to default queue all.q .
>
> Then submitted the job with following script:
>
> # cat sgeSM.sh
> #!/bin/sh
>
> #$ -cwd
>
> #$ -pe MPICH2-SM 4
>
> #$ -e msge2.Err
>
> #$ -o msge2.out
>
> #$ -v MPI_HOME=/opt/MPI_LIBS/MPICH2-GNU/MPICH2-SM/bin
>
> #$ -v MEME_DIRECTORY=/opt/MEME-MAX
>
> $MPI_HOME/mpiexec -np 4 -machinefile /root/MFM
> /opt/MEME-MAX/bin/meme_p /opt/MEME-MAX/NCCS/samevivo_sample.txt -dna
> -mod tcm -nmotifs 10 -nsites 100 -minw 5 -maxw 50 -revcomp -text
> -maxsize 200500
>
> It gave following error:
>
> # cat msge2.Err
>
> startmpich2sm.sh: got wrong number of arguments
> rm: cannot remove `/tmp/92.1.all.q/machines': No such file or directory
> rm: cannot remove `/tmp/92.1.all.q/rsh': No such file or directory
>
> I guess the problem might be with the scripts startmpich2sm.sh and  
> stopmpich2sm.sh.
>
> Can any one guide me to resolve this issue..
>
> Thanks & Regards,
> Sangamesh
> HPC Engineer
>
>
>




