[Beowulf] SGE + LAM
C.L. Lai [ALAN]
clai33 at uwo.ca
Mon Aug 16 20:50:34 PDT 2004
I am trying that now.
I don't think SGE6+LAM7 is a popular combination; the only information
from SGE on LAM dates from July 2003 and provides a test script
for SGE5.6 + LAM6.5 integration.
Alan.
On Tue, 17 Aug 2004, Andrew Wang wrote:
> Did you try the SGE mailing list? There are several
> people using SGE+LAM on Linux.
>
> Andrew.
>
> --- "C.L. Lai [ALAN]" <clai33 at uwo.ca> wrote:
> >
> > I have been trying to do an SGE6+LAM7 integration,
> > but no luck so far.
> >
> > After a long conversation on the LAM mailing list, I still don't know
> > whether the problem lies in my settings, LAM, SGE, or the SGE+LAM
> > combination, but some people pointed out that the rsh/rshd shipped
> > with SGE did not seem to work properly.
> >
> > I am not getting any useful SGE logs. Here is some log output generated
> > by the sge-lam script:
> >
> > This is 'sge-lam start'
> >
> > SGE-LAM DEBUG: LAMHOME = /usr
> > SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
> > SGE-LAM DEBUG: PATH = /tmp/537.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/home/compute/sge/bin/lx26-amd64:/usr/bin
> > SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
> > SGE-LAM DEBUG: ARGV = ""
> > SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
> > SGE-LAM DEBUG: func=start
> > SGE-LAM DEBUG: LAMBOOT ARGS: -nn -ssi boot rsh -ssi boot_rsh_agent /home/compute/sge/lam/sge-lam qrsh-remote -c /home/compute/sge/lam/sge-lam-conf.lamd -v -d /tmp/537.1.all.q/lamhostfile /tmp/537.1.all.q/lamhostfile
> > SGE-LAM DEBUG: LAMHOSTSLIST: rational.math.uwo.ca cpu=2
> > n0<24778> ssi:boot: Opening
> > n0<24778> ssi:boot: looking for module named rsh
> > n0<24778> ssi:boot: opening module rsh
> > n0<24778> ssi:boot: initializing module rsh
> > n0<24778> ssi:boot:rsh: module initializing
> > n0<24778> ssi:boot:rsh:agent: /home/compute/sge/lam/sge-lam qrsh-remote
> > n0<24778> ssi:boot:rsh:username: <same>
> > n0<24778> ssi:boot:rsh:verbose: 1000
> > n0<24778> ssi:boot:rsh:algorithm: linear
> > n0<24778> ssi:boot:rsh:priority: 10
> > n0<24778> ssi:boot: Selected boot module rsh
> > n0<24778> ssi:boot:base: looking for boot schema in following directories:
> > n0<24778> ssi:boot:base: <current directory>
> > n0<24778> ssi:boot:base: $TROLLIUSHOME/etc
> > n0<24778> ssi:boot:base: $LAMHOME/etc
> > n0<24778> ssi:boot:base: /etc/lam
> > n0<24778> ssi:boot:base: looking for boot schema file:
> > n0<24778> ssi:boot:base: /tmp/537.1.all.q/lamhostfile
> > n0<24778> ssi:boot:base: found boot schema: /tmp/537.1.all.q/lamhostfile
> > n0<24778> ssi:boot:rsh: found the following hosts:
> > n0<24778> ssi:boot:rsh: n0 rational.math.uwo.ca (cpu=2)
> > n0<24778> ssi:boot:rsh: resolved hosts:
> > n0<24778> ssi:boot:rsh: n0 rational.math.uwo.ca --> 129.100.75.80
> > n0<24778> ssi:boot:rsh: starting RTE procs
> > n0<24778> ssi:boot:base:linear: starting
> > n0<24778> ssi:boot:base:server: opening server TCP socket
> > n0<24778> ssi:boot:base:server: opened port 35804
> > n0<24778> ssi:boot:base:linear: booting n0 (rational.math.uwo.ca)
> > n0<24778> ssi:boot:rsh: starting lamd on (rational.math.uwo.ca)
> > n0<24778> ssi:boot:rsh: starting on n0 (rational.math.uwo.ca): hboot -t -c /home/compute/sge/lam/sge-lam-conf.lamd -d -v -sessionsuffix sge-537-0 -I -H 129.100.75.80 -P 35804 -n 0 -o 0
> > n0<24778> ssi:boot:rsh: launching locally
> > n0<24778> ssi:boot:rsh: successfully launched on n0 (rational.math.uwo.ca)
> > n0<24778> ssi:boot:base:server: expecting connection from finite list
> > n0<24778> ssi:boot:base:server: got connection from 0.0.0.0
> > -----------------------------------------------------------------------------
> > The lamboot agent timed out while waiting for the newly-booted process
> > to call back and indicated that it had successfully booted.
> >
> > As far as LAM could tell, the remote process started properly, but
> > then never called back.  Possible reasons that this may happen:
> >
> >     - There are network filters between the lamboot agent host and
> >       the remote host such that communication on random TCP ports
> >       is blocked
> >     - Network routing from the remote host to the local host isn't
> >       properly configured (this is uncommon)
> >
> > You can check these things by watching the output from "lamboot -d".
> >
> > 1. On the command line for hboot, there are two important parameters:
> >    one is the IP address of where the lamboot agent was invoked, the
> >    other is the port number that the lamboot agent is expecting the
> >    newly-booted process to call back on (this will be a random
> >    integer).
> >
> > 2. Manually login to the remote machine and try to telnet to the port
> >    indicated on the hboot command line.  For example,
> >        telnet <ipnumber> <portnumber>
> >    If all goes well, you should get a "Connection refused" error.  If
> >    you get any other kind of error, it could indicate either of the
> >    two conditions above.  Consult with your system/network
> >    administrator.
> > -----------------------------------------------------------------------------
> > n0<24778> ssi:boot:base:server: failed to connect to remote lamd!
> > n0<24778> ssi:boot:base:server: closing server socket
> > n0<24778> ssi:boot:base:linear: aborted!
> > -----------------------------------------------------------------------------
> > lamboot encountered some error (see above) during the boot process,
> > and will now attempt to kill all nodes that it was previously able to
> > boot (if any).
> >
> > Please wait for LAM to finish; if you interrupt this process, you may
> > have LAM daemons still running on remote nodes.
> > -----------------------------------------------------------------------------
> > lamboot did NOT complete successfully
> >
> >
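The check described in the lamboot message above (take the -H address and -P port from the hboot command line, then try to connect from the remote node) can be sketched in Python instead of telnet. This is only a sketch under assumptions: extract_callback and probe are names made up here, not part of LAM or SGE, and the regular expressions assume the -H/-P layout seen in the log above.

```python
import re
import socket


def extract_callback(hboot_cmdline):
    """Pull the -H address and -P port out of an hboot command line."""
    host = re.search(r"-H\s+(\S+)", hboot_cmdline)
    port = re.search(r"-P\s+(\d+)", hboot_cmdline)
    if host is None or port is None:
        raise ValueError("no -H/-P pair found in command line")
    return host.group(1), int(port.group(1))


def probe(host, port, timeout=5.0):
    """Try one TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timed out"
    except OSError as exc:
        return "error: %s" % exc


if __name__ == "__main__":
    # Command line copied from the lamboot debug output quoted above.
    cmd = ("hboot -t -c /home/compute/sge/lam/sge-lam-conf.lamd -d -v "
           "-sessionsuffix sge-537-0 -I -H 129.100.75.80 -P 35804 -n 0 -o 0")
    ip, port = extract_callback(cmd)
    print(ip, port, probe(ip, port))
```

Going by the lamboot text, "refused" after lamboot has exited is the healthy result, "connected" means something is still listening on that port, and a timeout or other error would point at filtering or routing.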
> >
> > This is 'sge-lam qrsh-local'
> >
> > SGE-LAM DEBUG: LAMHOME = /usr
> > SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
> > SGE-LAM DEBUG: PATH = /tmp/537.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/home/compute/sge/bin/lx26-amd64:/usr/bin:/home/compute/sge/bin/lx26-amd64:/usr/bin
> > SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
> > SGE-LAM DEBUG: ARGV = "/usr/bin/lamd" "-H" "129.100.75.80" "-P" "35804" "-n" "0" "-o" "0" "-d" "-sessionsuffix" "sge-537-0"
> > SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
> > SGE-LAM DEBUG: func=qrsh-local
> > SGE-LAM DEBUG: QRSH LOCAL CONFIG: -inherit -nostdin -V rational.math.uwo.ca /usr/bin/lamd -H 129.100.75.80 -P 35804 -n 0 -o 0 -d -sessionsuffix sge-537-0
> > SGE-LAM DEBUG: Exec qrsh-local: /home/compute/sge/bin/lx26-amd64/qrsh -inherit -nostdin -V rational.math.uwo.ca /usr/bin/lamd -H 129.100.75.80 -P 35804 -n 0 -o 0 -d -sessionsuffix sge-537-0
> > rcmd: socket: Permission denied
> >
> >
> >
> > The last line above is the one people think is qrsh/rsh/rshd related.
> >
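For what it's worth, "rcmd: socket: Permission denied" is the message rcmd() prints when it cannot bind a reserved port, which normally needs root privileges. One possible cause (an assumption here, nothing in this thread confirms it) is that the rsh client used by qrsh is not setuid root, or sits on a filesystem mounted nosuid. A rough Python sketch to check the first part follows; the utilbin/lx26-amd64/rsh path is a guess at the install layout, and is_setuid_root and check_helper are made-up names.

```python
import os
import stat


def is_setuid_root(mode, uid):
    """True when a file mode has the setuid bit and the owner is root (uid 0)."""
    return bool(mode & stat.S_ISUID) and uid == 0


def check_helper(path):
    """Stat a helper binary and report whether it is setuid root."""
    st = os.stat(path)
    return is_setuid_root(st.st_mode, st.st_uid)


if __name__ == "__main__":
    # Assumed layout: SGE's own rsh client under $SGE_ROOT/utilbin/<arch>/rsh.
    sge_root = os.environ.get("SGE_ROOT", "/home/compute/sge")
    helper = os.path.join(sge_root, "utilbin", "lx26-amd64", "rsh")
    try:
        print(helper, "setuid root:", check_helper(helper))
    except OSError as exc:
        print("cannot stat", helper, "-", exc)
```

This only inspects the file mode and owner; it cannot tell whether the mount options honour the setuid bit, so that still has to be checked separately.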
> >
> >
> > %qconf -sp lam
> > pe_name lam
> > slots 100
> > user_lists NONE
> > xuser_lists NONE
> > start_proc_args /home/compute/sge/lam/sge-lam start
> > stop_proc_args /home/compute/sge/lam/sge-lam stop
> > allocation_rule $fill_up
> > control_slaves TRUE
> > job_is_first_task FALSE
> > urgency_slots min
> >
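The PE settings that matter for tight integration can be checked mechanically. The Python sketch below parses "qconf -sp" style output into a dictionary and flags the obvious problems; parse_pe and tight_integration_problems are names invented for this example, and the only rule relied on is that qrsh -inherit needs control_slaves set to TRUE.

```python
def parse_pe(text):
    """Parse 'qconf -sp' style output into a dict of attribute -> value."""
    pe = {}
    for raw in text.splitlines():
        parts = raw.split(None, 1)
        if not parts:
            continue  # skip blank lines
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        pe[key] = value
    return pe


def tight_integration_problems(pe):
    """Return a list of human-readable problems found in a parsed PE."""
    problems = []
    if pe.get("control_slaves", "").upper() != "TRUE":
        problems.append("control_slaves must be TRUE for qrsh -inherit")
    for key in ("start_proc_args", "stop_proc_args"):
        if pe.get(key, "NONE") in ("", "NONE"):
            problems.append(key + " is not set")
    return problems


if __name__ == "__main__":
    # PE definition copied from the qconf output quoted above.
    sample = """pe_name lam
slots 100
start_proc_args /home/compute/sge/lam/sge-lam start
stop_proc_args /home/compute/sge/lam/sge-lam stop
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
"""
    print(tight_integration_problems(parse_pe(sample)))
```

Run against the PE shown above, no problems are reported, which suggests the PE definition itself is not where the failure comes from.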
> >
> > Thanks,
> >
> === message truncated ===
> > #!/usr/bin/perl
> >
> > ### INSTALL DIRECTIONS:
> > #
> > # 1. Install this PERL executable, sge-lam inside the LAM bin dir.
> > #    Make sure it is executable.
> > # 2. Modify the following variables: LAMHOME below to fit your site setup.
> > #
> >
> > $LAMHOME="/usr";
> >
> > # 3. Create an SGE PE that can be used to submit lam jobs. The following
> > #    is an example assuming the scripts exist in /usr/local/lam/bin.
> > #    You should replace the queue_list and slots with your site specific
> > #    values or set it to "all" to use all the queues.
> > #
> > #    % qconf -sp lammpi
> > #    pe_name            lammpi
> > #    queue_list         all
> > #    slots              6
> > #    user_lists         NONE
> > #    xuser_lists        NONE
> > #    start_proc_args    /usr/local/lam/bin/sge-lam start
> > #    stop_proc_args     /usr/local/lam/bin/sge-lam stop
> > #    allocation_rule    $fill_up
> > #    control_slaves     TRUE
> > #    job_is_first_task  FALSE
> > #
> > #    NOTE: It is probably easiest to use the qmon GUI to create the PE.
> > #
> > # 4. Add a new LAM node process schema into the $LAMHOME/etc area
> > #    named sge-lam-conf.lamd. This should be a single line that
> > #    adds the "sge-lam qrsh-local" prefix to the lamd startup.
> > #
> > #    % cat /usr/local/lam/etc/sge-lam-conf.lamd
> > #    /usr/local/lam/bin/sge-lam qrsh-local /usr/local/lam/bin/lamd
> > #    $inet_topo $debug $session_prefix $session_suffix
> > #
> > #### Submitting SGE JOBS
> > #
> > # Once this is setup users can submit jobs as normal and should not need to
> > # lamboot on their own. Users need only call mpirun for their MPI programs.
> > # Here is an example job:
> > #
> > # % cat lamjob.csh
> > # #$ -cwd
> > # set path=(/usr/local/lam/bin $path)
> > # echo "Starting my LAM MPI job"
> > # mpirun C conn-60
> > # echo "LAM MPI job done"
> > #
> > #
> > #
> > #### Comments/Issues email: christopher.duncan at xxxxxxx
> > #
> > # END INSTALL
> >
> >
> > $verbose=1;
> > #$debug=0;
> > $debug=1;
> >
> > # close STDIN to avoid stdio race conditions and tty issues
> > close(STDIN);
> >
> > if( $debug eq 1){
> > open(SGEDEBUG,"> /tmp/sgedebug.$ENV{JOB_ID}.$$");
> > select(SGEDEBUG); $|=1;
> > open(STDERR,">> /tmp/sgedebug.$ENV{JOB_ID}.$$");
> > }
> >
> > # set output for stderr and stdout to be unbuffered
> > select(STDERR); $|=1;
> > select(STDOUT); $|=1;
> >
> > $lamboot="$LAMHOME/bin/lamboot";
> > $lamhalt="$LAMHOME/bin/lamhalt";
> > #$sgelamconf="${SGE_ROOT}/lam/sge-lam-conf.lamd";
> >
> > # read in the args to figure out our task
> > $func=shift @ARGV;
> >
> > $SGE_ROOT="$ENV{SGE_ROOT}";
> > $sgelamconf="$SGE_ROOT/lam/sge-lam-conf.lamd";
> >
> >
> > $arch=`${SGE_ROOT}/util/arch`;
> > chomp($arch);
> > $qrsh="${SGE_ROOT}/bin/${arch}/qrsh";
> >
> > # add LAM and SGE to path
> > $ENV{'PATH'}.=":${SGE_ROOT}/bin/${arch}";
> > $ENV{'PATH'}.=":${LAMHOME}/bin";
> >
> > #debug_print("TMPDIR = $ENV{TMPDIR}");
> > debug_print("LAMHOME = $LAMHOME");
> > debug_print("SGE_ROOT = $SGE_ROOT");
> > debug_print("PATH = $ENV{PATH}");
> > debug_print("qrsh = $qrsh");
> > debug_print("ARGV = \"".join("\" \"",@ARGV)."\"");
> > debug_print("sgelamconf = $sgelamconf");
> >
> > if("$func" eq "start"){
> > debug_print("func=start");
> > print "Starting SGE + LAM Integration\n";
> > print "\t using tight integration scheme\n";
> > start_proc_args();
> > }elsif("$func" eq "stop"){
> > debug_print("func=stop");
> > print "Stopping SGE + LAM Integration\n";
> > stop_proc_args();
> > }elsif("$func" eq "qrsh-remote"){
> > debug_print("func=qrsh-remote");
> > qrsh_remote();
> > }elsif("$func" eq "qrsh-local"){
> > debug_print("func=qrsh-local");
> > qrsh_local();
> > }else{
> > print STDERR "\nusage: $0 {start|stop}\n\n";
> > exit(-1);
> > }
> >
> >
> > sub start_proc_args()
> > {
> >
> > # we currently place the LAM host file in the TMPDIR that SGE uses.
> > # if we place it elsewhere we need to clean it up
> > $lamhostsfile="$ENV{TMPDIR}/lamhostfile";
> >
> > # flags and options for lamboot (-x, -s and -np may be useful in some envs)
> > @lambootargs=("-nn","-ssi","boot","rsh","-ssi","boot_rsh_agent","$SGE_ROOT/lam/sge-lam qrsh-remote","-c","$sgelamconf");
> > if($verbose){ push(@lambootargs,"-v"); }
> > if($debug){ push(@lambootargs,"-d"); }
> > push(@lambootargs,"$lamhostsfile");
> > debug_print("LAMBOOT ARGS: @lambootargs $lamhostsfile");
> >
> > ### Need to convert the SGE hostfile to a LAM hostfile format
> > # open and read the PE hostfile
> > #system("cp $pe_hostfile /tmp");
> >
> > open(SGEHOSTFILE,"< $ENV{PE_HOSTFILE}");
> > # convert to LAM bhost file format
> > @lamhostslist=();
> > while(<SGEHOSTFILE>){
> > ($host,$ncpu,$junk)=split(/\s+/);
> > push( @lamhostslist,"$host cpu=$ncpu");
> > }
> > close(SGEHOSTFILE);
> >
> > debug_print("LAMHOSTSLIST: @lamhostslist");
> > # create the new lam bhost file
> > open(LAMHOSTFILE,"> $lamhostsfile");
> > print LAMHOSTFILE join("\n",@lamhostslist);
> > print LAMHOSTFILE "\n";
> > close(LAMHOSTFILE);
> >
> >
> > if($debug){ close(SGEDEBUG); }
> > debug_print("Exec Lamboot: $lamboot @lambootargs");
> > exec($lamboot,@lambootargs);
> > }
> >
> >
> > sub stop_proc_args(){
> >
> > if($verbose){ push(@lamhaltargs,"-v"); }
> > if($debug){ push(@lamhaltargs,"-d"); }
> >
> > # if($debug){ close(SGEDEBUG); }
> > debug_print("Exec Lamhalt: $lamhalt @lamhaltargs");
> > exec($lamhalt,@lamhaltargs);
> > }
> >
> >
> >
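The hostfile conversion done in start_proc_args (read $PE_HOSTFILE, keep the host name and slot count, write "host cpu=N" lines) is easy to test outside of SGE. Here is a rough Python equivalent for experimenting; pe_hostfile_to_bhost is a made-up name, and the sample line assumes the usual "host slots queue processor-range" layout of the PE hostfile.

```python
def pe_hostfile_to_bhost(lines):
    """Convert SGE PE hostfile lines into LAM boot schema text."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], fields[1]
        entries.append("%s cpu=%d" % (host, int(slots)))
    # Like the Perl script, always end the file with a newline.
    return "\n".join(entries) + "\n"


if __name__ == "__main__":
    # Illustrative input only; a real job would read os.environ["PE_HOSTFILE"].
    sample = ["rational.math.uwo.ca 2 all.q@rational.math.uwo.ca UNDEFINED\n"]
    print(pe_hostfile_to_bhost(sample), end="")
```

Unlike the Perl loop, this version skips blank or short lines instead of emitting an empty "cpu=" entry, which is a deliberate difference rather than something the original script does.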
> === message truncated ===
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf
> >
>