[Beowulf] SGE + LAM

C.L. Lai [ALAN] clai33 at uwo.ca
Mon Aug 16 20:50:34 PDT 2004


I'm trying that now.
I don't think SGE6+LAM7 is a very common combination;
the only information from the SGE side about LAM is from July 2003, and it gives a test script
for SGE5.6 + LAM6.5 integration.

Alan.

On Tue, 17 Aug 2004, Andrew Wang wrote:

> Did you try the SGE mailing list? There are several
> people using SGE+LAM on Linux.
> 
> Andrew.
> 
>  --- "C.L. Lai [ALAN]" <clai33 at uwo.ca> wrote:
> > 
> > I have been trying to do an SGE6+LAM7 integration, but no luck so far.
> > 
> > After a long conversation on the LAM mailing list, I still don't know
> > whether the problem comes from my settings, LAM, SGE, or SGE+LAM, but some
> > people pointed out that the error suggests the rsh/rshd from SGE isn't
> > working properly.
> > 
> > I am not getting any useful SGE log; here is some log output generated by the
> > sge-lam script:
> > 
> > This is 'sge-lam start'
> > 
> > SGE-LAM DEBUG: LAMHOME = /usr
> > SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
> > SGE-LAM DEBUG: PATH = /tmp/537.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/home/compute/sge/bin/lx26-amd64:/usr/bin
> > SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
> > SGE-LAM DEBUG: ARGV = ""
> > SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
> > SGE-LAM DEBUG: func=start
> > SGE-LAM DEBUG: LAMBOOT ARGS: -nn -ssi boot rsh -ssi boot_rsh_agent /home/compute/sge/lam/sge-lam qrsh-remote -c /home/compute/sge/lam/sge-lam-conf.lamd -v -d /tmp/537.1.all.q/lamhostfile /tmp/537.1.all.q/lamhostfile
> > SGE-LAM DEBUG: LAMHOSTSLIST: rational.math.uwo.ca cpu=2
> > n0<24778> ssi:boot: Opening
> > n0<24778> ssi:boot: looking for module named rsh
> > n0<24778> ssi:boot: opening module rsh
> > n0<24778> ssi:boot: initializing module rsh
> > n0<24778> ssi:boot:rsh: module initializing
> > n0<24778> ssi:boot:rsh:agent: /home/compute/sge/lam/sge-lam qrsh-remote
> > n0<24778> ssi:boot:rsh:username: <same>
> > n0<24778> ssi:boot:rsh:verbose: 1000
> > n0<24778> ssi:boot:rsh:algorithm: linear
> > n0<24778> ssi:boot:rsh:priority: 10
> > n0<24778> ssi:boot: Selected boot module rsh
> > n0<24778> ssi:boot:base: looking for boot schema in following directories:
> > n0<24778> ssi:boot:base:   <current directory>
> > n0<24778> ssi:boot:base:   $TROLLIUSHOME/etc
> > n0<24778> ssi:boot:base:   $LAMHOME/etc
> > n0<24778> ssi:boot:base:   /etc/lam
> > n0<24778> ssi:boot:base: looking for boot schema file:
> > n0<24778> ssi:boot:base:   /tmp/537.1.all.q/lamhostfile
> > n0<24778> ssi:boot:base: found boot schema: /tmp/537.1.all.q/lamhostfile
> > n0<24778> ssi:boot:rsh: found the following hosts:
> > n0<24778> ssi:boot:rsh:   n0 rational.math.uwo.ca (cpu=2)
> > n0<24778> ssi:boot:rsh: resolved hosts:
> > n0<24778> ssi:boot:rsh:   n0 rational.math.uwo.ca --> 129.100.75.80
> > n0<24778> ssi:boot:rsh: starting RTE procs
> > n0<24778> ssi:boot:base:linear: starting
> > n0<24778> ssi:boot:base:server: opening server TCP socket
> > n0<24778> ssi:boot:base:server: opened port 35804
> > n0<24778> ssi:boot:base:linear: booting n0 (rational.math.uwo.ca)
> > n0<24778> ssi:boot:rsh: starting lamd on (rational.math.uwo.ca)
> > n0<24778> ssi:boot:rsh: starting on n0 (rational.math.uwo.ca): hboot -t -c /home/compute/sge/lam/sge-lam-conf.lamd -d -v -sessionsuffix sge-537-0 -I -H 129.100.75.80 -P 35804 -n 0 -o 0
> > n0<24778> ssi:boot:rsh: launching locally
> > n0<24778> ssi:boot:rsh: successfully launched on n0 (rational.math.uwo.ca)
> > n0<24778> ssi:boot:base:server: expecting connection from finite list
> > n0<24778> ssi:boot:base:server: got connection from 0.0.0.0
> > -----------------------------------------------------------------------------
> > The lamboot agent timed out while waiting for the newly-booted process
> > to call back and indicated that it had successfully booted.
> > 
> > As far as LAM could tell, the remote process started properly, but
> > then never called back.  Possible reasons that this may happen:
> > 
> >         - There are network filters between the lamboot agent host and
> >           the remote host such that communication on random TCP ports
> >           is blocked
> >         - Network routing from the remote host to the local host isn't
> >           properly configured (this is uncommon)
> > 
> > You can check these things by watching the output from "lamboot -d".
> > 
> > 1. On the command line for hboot, there are two important parameters:
> >    one is the IP address of where the lamboot agent was invoked, the
> >    other is the port number that the lamboot agent is expecting the
> >    newly-booted process to call back on (this will be a random
> >    integer).
> > 
> > 2. Manually login to the remote machine and try to telnet to the port
> >    indicated on the hboot command line.  For example,
> >        telnet <ipnumber> <portnumber>
> >    If all goes well, you should get a "Connection refused" error.  If
> >    you get any other kind of error, it could indicate either of the
> >    two conditions above.  Consult with your system/network
> >    administrator.
> > -----------------------------------------------------------------------------
> > n0<24778> ssi:boot:base:server: failed to connect to remote lamd!
> > n0<24778> ssi:boot:base:server: closing server socket
> > n0<24778> ssi:boot:base:linear: aborted!
> > -----------------------------------------------------------------------------
> > lamboot encountered some error (see above) during the boot process,
> > and will now attempt to kill all nodes that it was previously able to
> > boot (if any).
> > 
> > Please wait for LAM to finish; if you interrupt this process, you may
> > have LAM daemons still running on remote nodes.
> > -----------------------------------------------------------------------------
> > lamboot did NOT complete successfully
> > 
> > 
> > 
> > This is 'sge-lam qrsh-local'
> > 
> > SGE-LAM DEBUG: LAMHOME = /usr
> > SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
> > SGE-LAM DEBUG: PATH = /tmp/537.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/home/compute/sge/bin/lx26-amd64:/usr/bin:/home/compute/sge/bin/lx26-amd64:/usr/bin
> > SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
> > SGE-LAM DEBUG: ARGV = "/usr/bin/lamd" "-H" "129.100.75.80" "-P" "35804" "-n" "0" "-o" "0" "-d" "-sessionsuffix" "sge-537-0"
> > SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
> > SGE-LAM DEBUG: func=qrsh-local
> > SGE-LAM DEBUG: QRSH LOCAL CONFIG: -inherit -nostdin -V rational.math.uwo.ca /usr/bin/lamd -H 129.100.75.80 -P 35804 -n 0 -o 0 -d -sessionsuffix sge-537-0
> > SGE-LAM DEBUG: Exec qrsh-local: /home/compute/sge/bin/lx26-amd64/qrsh -inherit -nostdin -V rational.math.uwo.ca /usr/bin/lamd -H 129.100.75.80 -P 35804 -n 0 -o 0 -d -sessionsuffix sge-537-0
> > rcmd: socket: Permission denied
> > 
> > 
> > 
> > The last line above is the one people think is
> > qrsh/rsh/rshd related.
> > 
> > 
> > 
> > %qconf -sp lam
> > pe_name           lam
> > slots             100
> > user_lists        NONE
> > xuser_lists       NONE
> > start_proc_args   /home/compute/sge/lam/sge-lam start
> > stop_proc_args    /home/compute/sge/lam/sge-lam stop
> > allocation_rule   $fill_up
> > control_slaves    TRUE
> > job_is_first_task FALSE
> > urgency_slots     min
> > 
> > 
> > Thanks,
> > 
> === message truncated ===
> > #!/usr/bin/perl
> > 
> > ### INSTALL DIRECTIONS:
> > #
> > #  1. Install this PERL executable, sge-lam inside the LAM bin dir.
> > #     Make sure it is executable.
> > #  2. Modify the following variables: LAMHOME below to fit your site setup.
> > #
> > 
> > $LAMHOME="/usr";
> > 
> > #  3. Create an SGE PE that can be used to submit lam jobs. The following
> > #     is an example assuming the scripts exist in /usr/local/lam/bin.
> > #     You should replace the queue_list and slots with your site specific
> > #     values or set it to "all" to use all the queues.
> > #
> > #        % qconf -sp lammpi
> > #        pe_name lammpi
> > #        queue_list all
> > #        slots 6
> > #        user_lists NONE
> > #        xuser_lists NONE
> > #        start_proc_args /usr/local/lam/bin/sge-lam start
> > #        stop_proc_args /usr/local/lam/bin/sge-lam stop
> > #        allocation_rule $fill_up
> > #        control_slaves TRUE
> > #        job_is_first_task FALSE
> > #
> > #    NOTE: It is probably easiest to use the qmon GUI to create the PE.
> > #
> > #   4. Add a new LAM node process schema into the $LAMHOME/etc area
> > #      named sge-lam-conf.lamd. This should be a single line that
> > #      adds the "sge-lam qrsh-local" prefix to the lamd startup.
> > #
> > #       % cat /usr/local/lam/etc/sge-lam-conf.lamd
> > #       /usr/local/lam/bin/sge-lam qrsh-local /usr/local/lam/bin/lamd
> > #         $inet_topo $debug $session_prefix $session_suffix
> > #
> > #### Submitting SGE JOBS
> > #
> > #   Once this is set up, users can submit jobs as normal and should not need to
> > #   lamboot on their own. Users need only call mpirun for their MPI programs.
> > #   Here is an example job:
> > #
> > #        % cat lamjob.csh
> > #        #$ -cwd
> > #        set path=(/usr/local/lam/bin $path)
> > #        echo "Starting my LAM MPI job"
> > #        mpirun C conn-60
> > #        echo "LAM MPI job done"
> > #
> > #
> > #
> > #### Comments/Issues email: christopher.duncan at xxxxxxx
> > #
> > # END INSTALL
> > 
> > 
> > $verbose=1;
> > #$debug=0;
> > $debug=1;
> > 
> > # close STDIN to avoid stdio race conditions and tty issues
> > close(STDIN);
> > 
> > if( $debug eq 1){
> > 	open(SGEDEBUG,"> /tmp/sgedebug.$ENV{JOB_ID}.$$");
> > 	select(SGEDEBUG); $|=1;
> > 	open(STDERR,">> /tmp/sgedebug.$ENV{JOB_ID}.$$");
> > }
> > 
> > # set output for stderr and stdout to be unbuffered
> > select(STDERR); $|=1;
> > select(STDOUT); $|=1;
> > 
> > $lamboot="$LAMHOME/bin/lamboot";
> > $lamhalt="$LAMHOME/bin/lamhalt";
> > #$sgelamconf="${SGE_ROOT}/lam/sge-lam-conf.lamd";
> > 
> > # read in the args to figure out our task
> > $func=shift @ARGV;
> > 
> > $SGE_ROOT="$ENV{SGE_ROOT}";
> > $sgelamconf="$SGE_ROOT/lam/sge-lam-conf.lamd";
> > 
> > 
> > $arch=`${SGE_ROOT}/util/arch`;
> > chomp($arch);
> > $qrsh="${SGE_ROOT}/bin/${arch}/qrsh";
> > 
> > # add LAM and SGE to path
> > $ENV{'PATH'}.=":${SGE_ROOT}/bin/${arch}";
> > $ENV{'PATH'}.=":${LAMHOME}/bin";
> > 
> > #debug_print("TMPDIR = $ENV{TMPDIR}");
> > debug_print("LAMHOME = $LAMHOME");
> > debug_print("SGE_ROOT = $SGE_ROOT");
> > debug_print("PATH = $ENV{PATH}");
> > debug_print("qrsh = $qrsh");
> > debug_print("ARGV = \"".join("\" \"",@ARGV)."\"");
> > debug_print("sgelamconf = $sgelamconf");
> > 
> > if("$func" eq "start"){
> > 	debug_print("func=start");
> > 	print "Starting SGE + LAM Integration\n";
> > 	print "\t using tight integration scheme\n";
> > 	start_proc_args();
> > }elsif("$func" eq "stop"){
> > 	debug_print("func=stop");
> > 	print "Stopping SGE + LAM Integration\n";
> > 	stop_proc_args();
> > }elsif("$func" eq "qrsh-remote"){
> > 	debug_print("func=qrsh-remote");
> >         qrsh_remote();
> > }elsif("$func" eq "qrsh-local"){
> > 	debug_print("func=qrsh-local");
> >         qrsh_local();
> > }else{
> > 	print STDERR "\nusage: $0 {start|stop|qrsh-remote|qrsh-local}\n\n";
> > 	exit(-1);
> > }
> > 
> > 
> > sub start_proc_args()
> > {
> > 
> >   # we currently place the LAM host file in the TMPDIR that SGE uses.
> >   # if we place it elsewhere we need to clean it up
> >   $lamhostsfile="$ENV{TMPDIR}/lamhostfile";
> > 
> >   # flags and options for lamboot (-x, -s and -np may be useful in some envs)
> >   @lambootargs=("-nn","-ssi","boot","rsh","-ssi","boot_rsh_agent","$SGE_ROOT/lam/sge-lam qrsh-remote","-c","$sgelamconf");
> >   if($verbose){ push(@lambootargs,"-v"); }
> >   if($debug){ push(@lambootargs,"-d"); }
> >   push(@lambootargs,"$lamhostsfile");
> >   debug_print("LAMBOOT ARGS: @lambootargs $lamhostsfile");
> > 
> >   ### Need to convert the SGE hostfile to a LAM hostfile format
> >   # open and read the PE hostfile
> >   #system("cp $pe_hostfile /tmp");
> > 
> >   open(SGEHOSTFILE,"< $ENV{PE_HOSTFILE}");
> >   # convert to LAM bhost file format
> >   @lamhostslist=();
> >   while(<SGEHOSTFILE>){
> > 	($host,$ncpu,$junk)=split(/\s+/);
> > 	push( @lamhostslist,"$host cpu=$ncpu");
> >   }
> >   close(SGEHOSTFILE);
> > 
> >   debug_print("LAMHOSTSLIST: @lamhostslist");
> >   # create the new lam bhost file
> >   open(LAMHOSTFILE,"> $lamhostsfile");
> >   print LAMHOSTFILE join("\n",@lamhostslist);
> >   print LAMHOSTFILE "\n";
> >   close(LAMHOSTFILE);
> > 
> > 
> >   if($debug){ close(SGEDEBUG); }
> >   debug_print("Exec Lamboot: $lamboot @lambootargs");
> >   exec($lamboot,@lambootargs);
> > }
> > 
> > 
> > sub stop_proc_args(){
> > 
> >   if($verbose){ push(@lamhaltargs,"-v"); }
> >   if($debug){ push(@lamhaltargs,"-d"); }
> > 
> > #  if($debug){ close(SGEDEBUG); }
> >   debug_print("Exec Lamhalt: $lamhalt @lamhaltargs");
> >   exec($lamhalt,@lamhaltargs);
> > }
> > 
> > 
> > 
> === message truncated ===
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf
> 
> 
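The lamboot help text quoted above asks you to read the callback address and port off the hboot command line and then telnet to them from the remote node. Here is a minimal sketch of pulling those two values out of saved "lamboot -d" output. It assumes the log has the same "-H <ip> -P <port>" layout as the one in this thread and that the hboot line is not wrapped across several lines; the function name callback_target is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "lamboot -d" output on stdin and print
# "<ip> <port>" taken from the first hboot command line found.
callback_target() {
    sed -n 's/.*hboot .*-H \([0-9.][0-9.]*\) -P \([0-9][0-9]*\).*/\1 \2/p' | head -n 1
}

# Demo using the hboot line from the log in this thread.
printf '%s\n' 'n0<24778> ssi:boot:rsh: starting on n0 (rational.math.uwo.ca): hboot -t -c /home/compute/sge/lam/sge-lam-conf.lamd -d -v -sessionsuffix sge-537-0 -I -H 129.100.75.80 -P 35804 -n 0 -o 0' | callback_target
# prints: 129.100.75.80 35804
```

The two values can then be handed to the telnet check that the LAM message describes, run from the node where lamd was supposed to start.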


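The start_proc_args routine in the quoted script turns the SGE PE hostfile into a LAM boot schema by keeping the first two columns of each line and writing "host cpu=N". To check that conversion by hand, outside the Perl script, the same transformation can be sketched with awk. The function name pe_to_bhost is made up here, and the sample input line is only an assumption about what $PE_HOSTFILE contained for job 537; it is not taken from the original post.

```shell
#!/bin/sh
# Hypothetical helper: read SGE PE hostfile lines ("host slots queue ...")
# on stdin and write LAM boot schema lines ("host cpu=slots") on stdout.
# Lines with fewer than two fields (e.g. blank lines) are skipped.
pe_to_bhost() {
    awk 'NF >= 2 { print $1 " cpu=" $2 }'
}

# Assumed sample line; a real run would use: pe_to_bhost < "$PE_HOSTFILE"
printf '%s\n' 'rational.math.uwo.ca 2 all.q@rational.math.uwo.ca UNDEFINED' | pe_to_bhost
# prints: rational.math.uwo.ca cpu=2
```

Unlike the Perl loop, this sketch drops malformed lines instead of emitting a "host cpu=" entry with an empty count, which makes a bad hostfile easier to spot.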


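On the "rcmd: socket: Permission denied" line: rcmd() prints that when it cannot get a reserved port, which normally requires root privileges, so one thing worth ruling out is an rsh client that is not setuid root. A minimal sketch of that check follows. The path $SGE_ROOT/utilbin/<arch>/rsh is my assumption about where this SGE install keeps its rsh client and is not confirmed anywhere in the thread; the function name check_setuid is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether a file has the setuid bit set.
# Ownership by root still has to be confirmed separately (e.g. ls -l).
check_setuid() {
    if [ -u "$1" ]; then
        echo "setuid bit set: $1"
    else
        echo "setuid bit NOT set (or file missing): $1"
        return 1
    fi
}

# Assumed location of SGE's rsh client; adjust to your installation:
# check_setuid "$SGE_ROOT/utilbin/lx26-amd64/rsh"
```

If the bit is missing, or the binary sits on a filesystem mounted nosuid, that would be consistent with the error above, though it is not the only possible cause.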
