[Beowulf] SGE + LAM

C.L. Lai [ALAN] clai33 at uwo.ca
Mon Aug 16 20:50:34 PDT 2004


I'm trying that now.
I don't think SGE6+LAM7 is that popular;
the only info from SGE on LAM is from July 2003, which gives a test script
for SGE5.6 + LAM6.5 integration.

Alan.

On Tue, 17 Aug 2004, Andrew Wang wrote:

> Did you try the SGE mailing list? There are several
> people using SGE+LAM on Linux.
> 
> Andrew.
> 
>  --- "C.L. Lai [ALAN]" <clai33 at uwo.ca> wrote:
> > 
> > I have been trying to do an SGE6+LAM7 integration, but no luck so far.
> > 
> > After a long conversation on the LAM mailing list, I still don't know
> > whether the problem comes from my settings, LAM, SGE, or SGE+LAM, but
> > some people pointed out that the error suggests the rsh/rshd from SGE
> > is not working properly.
> > 
> > I am not getting any useful SGE logs; here is the output generated by
> > the sge-lam script:
> > 
> > This is 'sge-lam start'
> > 
> > SGE-LAM DEBUG: LAMHOME = /usr
> > SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
> > SGE-LAM DEBUG: PATH = /tmp/537.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/home/compute/sge/bin/lx26-amd64:/usr/bin
> > SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
> > SGE-LAM DEBUG: ARGV = ""
> > SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
> > SGE-LAM DEBUG: func=start
> > SGE-LAM DEBUG: LAMBOOT ARGS: -nn -ssi boot rsh -ssi boot_rsh_agent /home/compute/sge/lam/sge-lam qrsh-remote -c /home/compute/sge/lam/sge-lam-conf.lamd -v -d /tmp/537.1.all.q/lamhostfile /tmp/537.1.all.q/lamhostfile
> > SGE-LAM DEBUG: LAMHOSTSLIST: rational.math.uwo.ca cpu=2
> > n0<24778> ssi:boot: Opening
> > n0<24778> ssi:boot: looking for module named rsh
> > n0<24778> ssi:boot: opening module rsh
> > n0<24778> ssi:boot: initializing module rsh
> > n0<24778> ssi:boot:rsh: module initializing
> > n0<24778> ssi:boot:rsh:agent: /home/compute/sge/lam/sge-lam qrsh-remote
> > n0<24778> ssi:boot:rsh:username: <same>
> > n0<24778> ssi:boot:rsh:verbose: 1000
> > n0<24778> ssi:boot:rsh:algorithm: linear
> > n0<24778> ssi:boot:rsh:priority: 10
> > n0<24778> ssi:boot: Selected boot module rsh
> > n0<24778> ssi:boot:base: looking for boot schema in following directories:
> > n0<24778> ssi:boot:base:   <current directory>
> > n0<24778> ssi:boot:base:   $TROLLIUSHOME/etc
> > n0<24778> ssi:boot:base:   $LAMHOME/etc
> > n0<24778> ssi:boot:base:   /etc/lam
> > n0<24778> ssi:boot:base: looking for boot schema file:
> > n0<24778> ssi:boot:base:   /tmp/537.1.all.q/lamhostfile
> > n0<24778> ssi:boot:base: found boot schema: /tmp/537.1.all.q/lamhostfile
> > n0<24778> ssi:boot:rsh: found the following hosts:
> > n0<24778> ssi:boot:rsh:   n0 rational.math.uwo.ca (cpu=2)
> > n0<24778> ssi:boot:rsh: resolved hosts:
> > n0<24778> ssi:boot:rsh:   n0 rational.math.uwo.ca --> 129.100.75.80
> > n0<24778> ssi:boot:rsh: starting RTE procs
> > n0<24778> ssi:boot:base:linear: starting
> > n0<24778> ssi:boot:base:server: opening server TCP socket
> > n0<24778> ssi:boot:base:server: opened port 35804
> > n0<24778> ssi:boot:base:linear: booting n0 (rational.math.uwo.ca)
> > n0<24778> ssi:boot:rsh: starting lamd on (rational.math.uwo.ca)
> > n0<24778> ssi:boot:rsh: starting on n0 (rational.math.uwo.ca): hboot -t -c /home/compute/sge/lam/sge-lam-conf.lamd -d -v -sessionsuffix sge-537-0 -I -H 129.100.75.80 -P 35804 -n 0 -o 0
> > n0<24778> ssi:boot:rsh: launching locally
> > n0<24778> ssi:boot:rsh: successfully launched on n0 (rational.math.uwo.ca)
> > n0<24778> ssi:boot:base:server: expecting connection from finite list
> > n0<24778> ssi:boot:base:server: got connection from 0.0.0.0
> > -----------------------------------------------------------------------------
> > The lamboot agent timed out while waiting for the newly-booted process
> > to call back and indicated that it had successfully booted.
> > 
> > As far as LAM could tell, the remote process started properly, but
> > then never called back.  Possible reasons that this may happen:
> > 
> >         - There are network filters between the lamboot agent host and
> >           the remote host such that communication on random TCP ports
> >           is blocked
> >         - Network routing from the remote host to the local host isn't
> >           properly configured (this is uncommon)
> > 
> > You can check these things by watching the output from "lamboot -d".
> > 
> > 1. On the command line for hboot, there are two important parameters:
> >    one is the IP address of where the lamboot agent was invoked, the
> >    other is the port number that the lamboot agent is expecting the
> >    newly-booted process to call back on (this will be a random
> >    integer).
> > 
> > 2. Manually login to the remote machine and try to telnet to the port
> >    indicated on the hboot command line.  For example,
> >        telnet <ipnumber> <portnumber>
> >    If all goes well, you should get a "Connection refused" error.  If
> >    you get any other kind of error, it could indicate either of the
> >    two conditions above.  Consult with your system/network
> >    administrator.
> > -----------------------------------------------------------------------------
> > n0<24778> ssi:boot:base:server: failed to connect to remote lamd!
> > n0<24778> ssi:boot:base:server: closing server socket
> > n0<24778> ssi:boot:base:linear: aborted!
> > -----------------------------------------------------------------------------
> > lamboot encountered some error (see above) during the boot process,
> > and will now attempt to kill all nodes that it was previously able to
> > boot (if any).
> > 
> > Please wait for LAM to finish; if you interrupt this process, you may
> > have LAM daemons still running on remote nodes.
> > -----------------------------------------------------------------------------
> > lamboot did NOT complete successfully
> > 
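
For anyone following along: the telnet check that the LAM message above
recommends can also be scripted. This is only a minimal sketch in Python,
not part of LAM or SGE; the function name probe_tcp and the 3-second
timeout are my own choices. It tries a TCP connect and reports whether the
port is open, refused, or unreachable. As the LAM text says, once lamboot
has exited a "refused" result is the good outcome, because it shows the
host is reachable on that port.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a TCP connect to (host, port) and classify the outcome.

    Returns "open" if something accepted the connection, "refused" if the
    host answered but nothing is listening, "timeout" if no answer arrived
    within `timeout` seconds, or "error: ..." for any other failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # e.g. "No route to host" or a name-resolution failure
        return "error: %s" % exc
```

With the values from the hboot command line in the log, the call would be
probe_tcp("129.100.75.80", 35804), run from the remote node.
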
> > 
> > 
> > This is 'sge-lam qrsh-local'
> > 
> > SGE-LAM DEBUG: LAMHOME = /usr
> > SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
> > SGE-LAM DEBUG: PATH = /tmp/537.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/home/compute/sge/bin/lx26-amd64:/usr/bin:/home/compute/sge/bin/lx26-amd64:/usr/bin
> > SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
> > SGE-LAM DEBUG: ARGV = "/usr/bin/lamd" "-H" "129.100.75.80" "-P" "35804" "-n" "0" "-o" "0" "-d" "-sessionsuffix" "sge-537-0"
> > SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
> > SGE-LAM DEBUG: func=qrsh-local
> > SGE-LAM DEBUG: QRSH LOCAL CONFIG: -inherit -nostdin -V rational.math.uwo.ca /usr/bin/lamd -H 129.100.75.80 -P 35804 -n 0 -o 0 -d -sessionsuffix sge-537-0
> > SGE-LAM DEBUG: Exec qrsh-local: /home/compute/sge/bin/lx26-amd64/qrsh -inherit -nostdin -V rational.math.uwo.ca /usr/bin/lamd -H 129.100.75.80 -P 35804 -n 0 -o 0 -d -sessionsuffix sge-537-0
> > rcmd: socket: Permission denied
> > 
> > 
> > 
> > The last line above is the one people think is qrsh/rsh/rshd related.
> > 
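
A possible cause, not a confirmed diagnosis: rcmd() has to bind a reserved
(privileged) port, so an rsh client that runs without root privileges
typically fails with "rcmd: socket: Permission denied". With qrsh -inherit
the client is usually SGE's own rsh binary under $SGE_ROOT/utilbin/<arch>/,
which is normally installed setuid root. The sketch below, assuming that
layout, checks a file's ownership and setuid bit. The function name is
hypothetical, and the path in the usage note is a guess for this install,
not something taken from the log.

```python
import os
import stat


def setuid_root_status(path):
    """Report whether `path` is owned by root and has the setuid bit set."""
    st = os.stat(path)
    return {
        "owner_is_root": st.st_uid == 0,
        "setuid": bool(st.st_mode & stat.S_ISUID),
    }
```

A call such as
setuid_root_status("/home/compute/sge/utilbin/lx26-amd64/rsh") (assumed
path) should report both values as True on a working install. Note that a
filesystem mounted with nosuid would defeat the bit even when it is set,
and this check does not detect that.
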
> > 
> > 
> > %qconf -sp lam
> > pe_name           lam
> > slots             100
> > user_lists        NONE
> > xuser_lists       NONE
> > start_proc_args   /home/compute/sge/lam/sge-lam start
> > stop_proc_args    /home/compute/sge/lam/sge-lam stop
> > allocation_rule   $fill_up
> > control_slaves    TRUE
> > job_is_first_task FALSE
> > urgency_slots     min
> > 
> > 
> > Thanks,
> > 
> === message truncated ===
> > #!/usr/bin/perl
> > 
> > ### INSTALL DIRECTIONS:
> > #
> > #  1. Install this Perl executable, sge-lam, inside the LAM bin dir.
> > #     Make sure it is executable.
> > #  2. Modify the LAMHOME variable below to fit your site setup.
> > #
> > 
> > $LAMHOME="/usr";
> > 
> > #  3. Create an SGE PE that can be used to submit lam jobs. The following
> > #     is an example assuming the scripts exist in /usr/local/lam/bin.
> > #     You should replace the queue_list and slots with your site specific
> > #     values or set it to "all" to use all the queues.
> > #
> > #        % qconf -sp lammpi
> > #        pe_name lammpi
> > #        queue_list all
> > #        slots 6
> > #        user_lists NONE
> > #        xuser_lists NONE
> > #        start_proc_args /usr/local/lam/bin/sge-lam start
> > #        stop_proc_args /usr/local/lam/bin/sge-lam stop
> > #        allocation_rule $fill_up
> > #        control_slaves TRUE
> > #        job_is_first_task FALSE
> > #
> > #    NOTE: It is probably easiest to use the qmon GUI to create the PE.
> > #
> > #   4. Add a new LAM node process schema into the $LAMHOME/etc area
> > #      named sge-lam-conf.lamd. This should be a single line that
> > #      adds the "sge-lam qrsh-local" prefix to the lamd startup.
> > #
> > #       % cat /usr/local/lam/etc/sge-lam-conf.lamd
> > #       /usr/local/lam/bin/sge-lam qrsh-local /usr/local/lam/bin/lamd
> > #         $inet_topo $debug $session_prefix $session_suffix
> > #
> > #### Submitting SGE JOBS
> > #
> > #   Once this is set up, users can submit jobs as normal and should not
> > #   need to run lamboot on their own. Users need only call mpirun for
> > #   their MPI programs. Here is an example job:
> > #
> > #        % cat lamjob.csh
> > #        #$ -cwd
> > #        set path=(/usr/local/lam/bin $path)
> > #        echo "Starting my LAM MPI job"
> > #        mpirun C conn-60
> > #        echo "LAM MPI job done"
> > #
> > #
> > #
> > #### Comments/Issues email: christopher.duncan at xxxxxxx
> > #
> > # END INSTALL
> > 
> > 
> > $verbose=1;
> > #$debug=0;
> > $debug=1;
> > 
> > # close STDIN to avoid stdio race conditions and tty issues
> > close(STDIN);
> > 
> > if( $debug eq 1){
> > 	open(SGEDEBUG,"> /tmp/sgedebug.$ENV{JOB_ID}.$$");
> > 	select(SGEDEBUG); $|=1;
> > 	open(STDERR,">> /tmp/sgedebug.$ENV{JOB_ID}.$$");
> > }
> > 
> > # set output for stderr and stdout to be unbuffered
> > select(STDERR); $|=1;
> > select(STDOUT); $|=1;
> > 
> > $lamboot="$LAMHOME/bin/lamboot";
> > $lamhalt="$LAMHOME/bin/lamhalt";
> > #$sgelamconf="${SGE_ROOT}/lam/sge-lam-conf.lamd";
> > 
> > # read in the args to figure out our task
> > $func=shift @ARGV;
> > 
> > $SGE_ROOT="$ENV{SGE_ROOT}";
> > $sgelamconf="$SGE_ROOT/lam/sge-lam-conf.lamd";
> > 
> > 
> > $arch=`${SGE_ROOT}/util/arch`;
> > chomp($arch);
> > $qrsh="${SGE_ROOT}/bin/${arch}/qrsh";
> > 
> > # add LAM and SGE to path
> > $ENV{'PATH'}.=":${SGE_ROOT}/bin/${arch}";
> > $ENV{'PATH'}.=":${LAMHOME}/bin";
> > 
> > #debug_print("TMPDIR = $ENV{TMPDIR}");
> > debug_print("LAMHOME = $LAMHOME");
> > debug_print("SGE_ROOT = $SGE_ROOT");
> > debug_print("PATH = $ENV{PATH}");
> > debug_print("qrsh = $qrsh");
> > debug_print("ARGV = \"".join("\" \"",@ARGV)."\"");
> > debug_print("sgelamconf = $sgelamconf");
> > 
> > if("$func" eq "start"){
> > 	debug_print("func=start");
> > 	print "Starting SGE + LAM Integration\n";
> > 	print "\t using tight integration scheme\n";
> > 	start_proc_args();
> > }elsif("$func" eq "stop"){
> > 	debug_print("func=stop");
> > 	print "Stopping SGE + LAM Integration\n";
> > 	stop_proc_args();
> > }elsif("$func" eq "qrsh-remote"){
> > 	debug_print("func=qrsh-remote");
> >         qrsh_remote();
> > }elsif("$func" eq "qrsh-local"){
> > 	debug_print("func=qrsh-local");
> >         qrsh_local();
> > }else{
> > 	print STDERR "\nusage: $0 {start|stop}\n\n";	
> > 	exit(-1);
> > }
> > 
> > 
> > sub start_proc_args()
> > {
> > 
> >   # we currently place the LAM host file in the TMPDIR that SGE uses.
> >   # if we place it elsewhere we need to clean it up
> >   $lamhostsfile="$ENV{TMPDIR}/lamhostfile";
> > 
> >   # flags and options for lamboot (-x, -s and -np may be useful in some envs)
> >   @lambootargs=("-nn","-ssi","boot","rsh","-ssi","boot_rsh_agent","$SGE_ROOT/lam/sge-lam qrsh-remote","-c","$sgelamconf");
> >   if($verbose){ push(@lambootargs,"-v"); }
> >   if($debug){ push(@lambootargs,"-d"); }
> >   push(@lambootargs,"$lamhostsfile");
> >   debug_print("LAMBOOT ARGS: @lambootargs $lamhostsfile");
> > 
> >   ### Need to convert the SGE hostfile to a LAM hostfile format
> >   # open and read the PE hostfile
> >   #system("cp $pe_hostfile /tmp");
> > 
> >   open(SGEHOSTFILE,"< $ENV{PE_HOSTFILE}");
> >   # convert to LAM bhost file format
> >   @lamhostslist=();
> >   while(<SGEHOSTFILE>){
> > 	($host,$ncpu,$junk)=split(/\s+/);
> > 	push( @lamhostslist,"$host cpu=$ncpu");
> >   }
> >   close(SGEHOSTFILE);
> > 
> >   debug_print("LAMHOSTSLIST: @lamhostslist");
> >   # create the new lam bhost file
> >   open(LAMHOSTFILE,"> $lamhostsfile");
> >   print LAMHOSTFILE join("\n",@lamhostslist);
> >   print LAMHOSTFILE "\n";
> >   close(LAMHOSTFILE);
> > 
> > 
> >   if($debug){ close(SGEDEBUG); }
> >   debug_print("Exec Lamboot: $lamboot @lambootargs");
> >   exec($lamboot,@lambootargs);
> > }
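
The hostfile conversion in start_proc_args() above is the part that is
easiest to test outside of SGE. Here is a minimal sketch of the same step
in Python. It is not the script's own code, and the function name is
hypothetical. Like the Perl loop, it uses the first two columns of each
PE hostfile line (host and slot count) and ignores the rest. Unlike the
Perl, it skips blank or short lines.

```python
def pe_hostfile_to_bhost(lines):
    """Convert SGE PE hostfile lines to LAM boot schema lines.

    Each input line is expected to start with "<host> <slots>"; any further
    columns are ignored. Lines with fewer than two fields are skipped.
    """
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        host, slots = fields[0], fields[1]
        out.append("%s cpu=%s" % (host, slots))
    return out
```

Given a line starting with "rational.math.uwo.ca 2", this produces
"rational.math.uwo.ca cpu=2", which matches the LAMHOSTSLIST entry in the
debug log.
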
> > 
> > 
> > sub stop_proc_args(){
> > 
> >   if($verbose){ push(@lamhaltargs,"-v"); }
> >   if($debug){ push(@lamhaltargs,"-d"); }
> > 
> > #  if($debug){ close(SGEDEBUG); }
> >   debug_print("Exec Lamhalt: $lamhalt @lamhaltargs");
> >   exec($lamhalt,@lamhaltargs);
> > }
> > 
> > 
> > 
> === message truncated ===
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf