[Beowulf] Tight MPICH2 Integration with SGE

Sangamesh B forum.san at gmail.com
Fri Jan 25 06:41:20 PST 2008


 Hi all,

    I'm setting up tight integration of MPICH2 (not MPICH) with SGE on a
cluster whose nodes have dual dual-core AMD64 Opteron processors.

 I followed the Sun document located at:


http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html

  The document explains the following three kinds of Tight Integration (TI),
each using a different Process Manager (PM):

    TI using PM: gforker
    TI using PM: SMPD – Daemonless
    TI using PM: SMPD – Daemon-based

I did the TI with gforker and tested it successfully, but I failed to get TI
working with daemonless SMPD.

Let me explain what I did.

I installed MPICH2 configured with the smpd process manager.

SGE is installed at: /opt/gridengine

I created an MPICH2-SM folder in /opt/gridengine/mpi, based on the following
lines from the document:

start_proc_args   /usr/sge/mpich2_smpd_rsh/startmpich2.sh -catch_rsh $pe_hostfile
stop_proc_args    /usr/sge/mpich2_smpd_rsh/stopmpich2.sh

I copied startmpi.sh and stopmpi.sh from /opt/gridengine/mpi to the
/opt/gridengine/mpi/MPICH2-SM directory, because the document does not say
what these scripts should contain.
Using qmon, I created the MPICH2-SM PE.

# qconf -sp MPICH2-SM
pe_name           MPICH2-SM
slots             999
user_lists        rootuserset
xuser_lists       NONE
start_proc_args   /opt/gridengine/mpi/MPICH2-SM/startmpich2sm.sh
stop_proc_args    /opt/gridengine/mpi/MPICH2-SM/stopmpich2sm.sh
allocation_rule   $round_robin
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min

I added this PE to the default queue, all.q.

Then I submitted the job with the following script:
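
For comparison with the howto excerpt quoted earlier, a PE definition that passes $pe_hostfile to the start script might look like the following sketch. The file name mpich2-sm.pe is hypothetical, and the values control_slaves TRUE and job_is_first_task FALSE are assumptions about what tight integration needs; they should be checked against the howto.

```shell
#!/bin/sh
# Sketch only: write a PE definition to a file (hypothetical name mpich2-sm.pe).
# The quoted 'EOF' keeps $pe_hostfile and $round_robin literal for SGE to expand.
cat > mpich2-sm.pe <<'EOF'
pe_name           MPICH2-SM
slots             999
user_lists        rootuserset
xuser_lists       NONE
start_proc_args   /opt/gridengine/mpi/MPICH2-SM/startmpich2sm.sh -catch_rsh $pe_hostfile
stop_proc_args    /opt/gridengine/mpi/MPICH2-SM/stopmpich2sm.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
EOF
# Assumption: load the file into SGE as a manager with
#   qconf -Mp mpich2-sm.pe
```

The only differences from the PE shown above are the arguments to start_proc_args and the two assumed settings.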

# cat sgeSM.sh
#!/bin/sh

#$ -cwd

#$ -pe MPICH2-SM 4

#$ -e msge2.Err

#$ -o msge2.out

#$ -v MPI_HOME=/opt/MPI_LIBS/MPICH2-GNU/MPICH2-SM/bin

#$ -v MEME_DIRECTORY=/opt/MEME-MAX

$MPI_HOME/mpiexec -np 4 -machinefile /root/MFM /opt/MEME-MAX/bin/meme_p \
  /opt/MEME-MAX/NCCS/samevivo_sample.txt -dna -mod tcm -nmotifs 10 -nsites 100 \
  -minw 5 -maxw 50 -revcomp -text -maxsize 200500
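
A job script that relies on SGE's slot allocation rather than a fixed machine file could look roughly like the sketch below. It assumes the PE start script writes a machines file to $TMPDIR/machines, the same path the rm messages in the error output refer to, and it uses $NSLOTS, which SGE sets to the number of granted slots. The file name sgeSM-sketch.sh is hypothetical, and any mpiexec options specific to daemonless SMPD are left out.

```shell
#!/bin/sh
# Sketch only: generate a job script (hypothetical name sgeSM-sketch.sh).
# The quoted 'EOF' keeps all variables literal so they expand at job run time.
cat > sgeSM-sketch.sh <<'EOF'
#!/bin/sh
#$ -cwd
#$ -pe MPICH2-SM 4
#$ -e msge2.Err
#$ -o msge2.out
#$ -v MPI_HOME=/opt/MPI_LIBS/MPICH2-GNU/MPICH2-SM/bin
#$ -v MEME_DIRECTORY=/opt/MEME-MAX

# NSLOTS and TMPDIR are set by SGE for the running job.
# Assumption: the PE start script has created $TMPDIR/machines.
$MPI_HOME/mpiexec -np "$NSLOTS" -machinefile "$TMPDIR/machines" \
    /opt/MEME-MAX/bin/meme_p /opt/MEME-MAX/NCCS/samevivo_sample.txt \
    -dna -mod tcm -nmotifs 10 -nsites 100 \
    -minw 5 -maxw 50 -revcomp -text -maxsize 200500
EOF
# Submit with: qsub sgeSM-sketch.sh
```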

It gave the following error:

# cat msge2.Err

startmpich2sm.sh: got wrong number of arguments
rm: cannot remove `/tmp/92.1.all.q/machines': No such file or directory
rm: cannot remove `/tmp/92.1.all.q/rsh': No such file or directory
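
The first line of the error is consistent with a start script that expects exactly one argument, the path held in $pe_hostfile, being called with none, as in the start_proc_args shown above. A minimal sketch of that kind of check, together with a conversion of the hostfile into a machines file, follows. It assumes the usual pe_hostfile layout of host, slots, queue and processor range on each line. The function names are hypothetical, and the host:slots output format is an assumption about what the MPICH2 mpiexec accepts.

```shell
#!/bin/sh
# Sketch only: check_args and pe_hostfile_to_machines are hypothetical names,
# not functions taken from the SGE or MPICH2 distributions.

# Fail the way the error log does when the hostfile argument is missing.
check_args() {
    if [ "$#" -ne 1 ]; then
        echo "startmpich2sm.sh: got wrong number of arguments" >&2
        return 1
    fi
}

# Turn pe_hostfile lines ("host slots queue processors") into
# "host:slots" lines, using the short host name.
pe_hostfile_to_machines() {
    while read -r host nslots _rest; do
        if [ -n "$host" ]; then
            echo "${host%%.*}:$nslots"
        fi
    done < "$1"
}
```

In a start script these would be used roughly as `check_args "$@" && pe_hostfile_to_machines "$1" > "$TMPDIR/machines"`. If the start script never creates that file, the stop script has nothing to remove, which would fit the two rm messages.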

I guess the problem is with the scripts startmpich2sm.sh and
stopmpich2sm.sh.

Can anyone guide me in resolving this issue?

Thanks & Regards,
Sangamesh
HPC Engineer