[Beowulf] SGE + LAM
Andrew Wang
andrewxwang at yahoo.com.tw
Mon Aug 16 19:58:37 PDT 2004
Did you try the SGE mailing list? There are several
people using SGE+LAM on Linux.
Andrew.
--- "C.L. Lai [ALAN]" <clai33 at uwo.ca> wrote:
>
> I have been trying to do an SGE6+LAM7 integration, but no luck so far.
>
> After a long conversation on the LAM mailing list, I still don't know
> whether the problem comes from my settings, LAM, SGE, or SGE+LAM, but
> some people pointed out an error suggesting that the rsh/rshd from SGE
> didn't work properly.
>
> I am not getting any useful SGE log. Here is some log output generated
> by the sge-lam script:
>
> This is 'sge-lam start'
>
> SGE-LAM DEBUG: LAMHOME = /usr
> SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
> SGE-LAM DEBUG: PATH = /tmp/537.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/home/compute/sge/bin/lx26-amd64:/usr/bin
> SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
> SGE-LAM DEBUG: ARGV = ""
> SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
> SGE-LAM DEBUG: func=start
> SGE-LAM DEBUG: LAMBOOT ARGS: -nn -ssi boot rsh -ssi boot_rsh_agent /home/compute/sge/lam/sge-lam qrsh-remote -c /home/compute/sge/lam/sge-lam-conf.lamd -v -d /tmp/537.1.all.q/lamhostfile /tmp/537.1.all.q/lamhostfile
> SGE-LAM DEBUG: LAMHOSTSLIST: rational.math.uwo.ca cpu=2
> n0<24778> ssi:boot: Opening
> n0<24778> ssi:boot: looking for module named rsh
> n0<24778> ssi:boot: opening module rsh
> n0<24778> ssi:boot: initializing module rsh
> n0<24778> ssi:boot:rsh: module initializing
> n0<24778> ssi:boot:rsh:agent: /home/compute/sge/lam/sge-lam qrsh-remote
> n0<24778> ssi:boot:rsh:username: <same>
> n0<24778> ssi:boot:rsh:verbose: 1000
> n0<24778> ssi:boot:rsh:algorithm: linear
> n0<24778> ssi:boot:rsh:priority: 10
> n0<24778> ssi:boot: Selected boot module rsh
> n0<24778> ssi:boot:base: looking for boot schema in following directories:
> n0<24778> ssi:boot:base: <current directory>
> n0<24778> ssi:boot:base: $TROLLIUSHOME/etc
> n0<24778> ssi:boot:base: $LAMHOME/etc
> n0<24778> ssi:boot:base: /etc/lam
> n0<24778> ssi:boot:base: looking for boot schema file:
> n0<24778> ssi:boot:base: /tmp/537.1.all.q/lamhostfile
> n0<24778> ssi:boot:base: found boot schema: /tmp/537.1.all.q/lamhostfile
> n0<24778> ssi:boot:rsh: found the following hosts:
> n0<24778> ssi:boot:rsh: n0 rational.math.uwo.ca (cpu=2)
> n0<24778> ssi:boot:rsh: resolved hosts:
> n0<24778> ssi:boot:rsh: n0 rational.math.uwo.ca --> 129.100.75.80
> n0<24778> ssi:boot:rsh: starting RTE procs
> n0<24778> ssi:boot:base:linear: starting
> n0<24778> ssi:boot:base:server: opening server TCP socket
> n0<24778> ssi:boot:base:server: opened port 35804
> n0<24778> ssi:boot:base:linear: booting n0 (rational.math.uwo.ca)
> n0<24778> ssi:boot:rsh: starting lamd on (rational.math.uwo.ca)
> n0<24778> ssi:boot:rsh: starting on n0 (rational.math.uwo.ca): hboot -t -c /home/compute/sge/lam/sge-lam-conf.lamd -d -v -sessionsuffix sge-537-0 -I -H 129.100.75.80 -P 35804 -n 0 -o 0
> n0<24778> ssi:boot:rsh: launching locally
> n0<24778> ssi:boot:rsh: successfully launched on n0 (rational.math.uwo.ca)
> n0<24778> ssi:boot:base:server: expecting connection from finite list
> n0<24778> ssi:boot:base:server: got connection from 0.0.0.0
> -----------------------------------------------------------------------------
> The lamboot agent timed out while waiting for the newly-booted process
> to call back and indicated that it had successfully booted.
>
> As far as LAM could tell, the remote process started properly, but
> then never called back. Possible reasons that this may happen:
>
>     - There are network filters between the lamboot agent host and
>       the remote host such that communication on random TCP ports
>       is blocked
>     - Network routing from the remote host to the local host isn't
>       properly configured (this is uncommon)
>
> You can check these things by watching the output from "lamboot -d".
>
> 1. On the command line for hboot, there are two important parameters:
>    one is the IP address of where the lamboot agent was invoked, the
>    other is the port number that the lamboot agent is expecting the
>    newly-booted process to call back on (this will be a random
>    integer).
>
> 2. Manually login to the remote machine and try to telnet to the port
>    indicated on the hboot command line. For example,
>        telnet <ipnumber> <portnumber>
>    If all goes well, you should get a "Connection refused" error. If
>    you get any other kind of error, it could indicate either of the
>    two conditions above. Consult with your system/network
>    administrator.
> -----------------------------------------------------------------------------
> n0<24778> ssi:boot:base:server: failed to connect to remote lamd!
> n0<24778> ssi:boot:base:server: closing server socket
> n0<24778> ssi:boot:base:linear: aborted!
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
> lamboot did NOT complete successfully
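
The manual check that the lamboot help text describes (telnet to the
address and port taken from the hboot command line) can also be
scripted. The following is a minimal Python sketch, not part of LAM or
SGE; the function name probe_port and the default timeout are
illustrative assumptions. It classifies a TCP connection attempt as
accepted, refused, or failed for some other reason (timeout, no route),
which is the distinction the help text asks you to look for.

```python
import socket

def probe_port(host, port, timeout=5.0):
    """Classify a TCP connection attempt as 'open', 'refused', or 'unreachable'."""
    try:
        # create_connection resolves the host and connects with a timeout
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port
        return "refused"
    except OSError:
        # Timeouts, routing failures, and name resolution errors end up here
        return "unreachable"
```

Run on the remote node, a call such as probe_port("129.100.75.80", 35804)
uses the address and port from the log above. Going by the help text,
"refused" is the healthy answer once lamboot has given up, while
"unreachable" would point at filtering or routing.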
>
>
>
> This is 'sge-lam qrsh-local'
>
> SGE-LAM DEBUG: LAMHOME = /usr
> SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
> SGE-LAM DEBUG: PATH = /tmp/537.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/home/compute/sge/bin/lx26-amd64:/usr/bin:/home/compute/sge/bin/lx26-amd64:/usr/bin
> SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
> SGE-LAM DEBUG: ARGV = "/usr/bin/lamd" "-H" "129.100.75.80" "-P" "35804" "-n" "0" "-o" "0" "-d" "-sessionsuffix" "sge-537-0"
> SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
> SGE-LAM DEBUG: func=qrsh-local
> SGE-LAM DEBUG: QRSH LOCAL CONFIG: -inherit -nostdin -V rational.math.uwo.ca /usr/bin/lamd -H 129.100.75.80 -P 35804 -n 0 -o 0 -d -sessionsuffix sge-537-0
> SGE-LAM DEBUG: Exec qrsh-local: /home/compute/sge/bin/lx26-amd64/qrsh -inherit -nostdin -V rational.math.uwo.ca /usr/bin/lamd -H 129.100.75.80 -P 35804 -n 0 -o 0 -d -sessionsuffix sge-537-0
> rcmd: socket: Permission denied
>
>
>
> The last line above is the one people think is related to
> qrsh/rsh/rshd.
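
For what it's worth, "rcmd: socket: Permission denied" is what rcmd()
reports when it cannot obtain a reserved (privileged) port, which
normally requires root privileges; rsh-style clients are therefore
usually installed setuid root. Whether that is the cause here is only a
guess, but it is cheap to check. A small Python sketch follows. The
helper names are illustrative, and the path $SGE_ROOT/utilbin/<arch>/rsh
is my assumption about where SGE keeps its rsh client, so adjust it to
your installation.

```python
import os
import stat

def is_setuid_root(mode, uid):
    """True if the mode bits include setuid and the owner is root (uid 0)."""
    return bool(mode & stat.S_ISUID) and uid == 0

def check_binary(path):
    """Stat a file and report whether it is installed setuid root."""
    st = os.stat(path)
    return is_setuid_root(st.st_mode, st.st_uid)
```

For example, check_binary("/home/compute/sge/utilbin/lx26-amd64/rsh")
(assumed path) returning False would be consistent with the error
above. Also note that a filesystem mounted nosuid ignores the bit even
when it is set, which this sketch does not detect.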
>
>
>
> % qconf -sp lam
> pe_name           lam
> slots             100
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /home/compute/sge/lam/sge-lam start
> stop_proc_args    /home/compute/sge/lam/sge-lam stop
> allocation_rule   $fill_up
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
>
> Thanks,
>
=== message truncated ===
> #!/usr/bin/perl
>
> ### INSTALL DIRECTIONS:
> #
> # 1. Install this PERL executable, sge-lam, inside the LAM bin dir.
> #    Make sure it is executable.
> # 2. Modify the following variable, LAMHOME, to fit your site setup.
> #
>
> $LAMHOME="/usr";
>
> # 3. Create an SGE PE that can be used to submit lam jobs. The following
> #    is an example assuming the scripts exist in /usr/local/lam/bin.
> #    You should replace the queue_list and slots with your site specific
> #    values or set it to "all" to use all the queues.
> #
> #    % qconf -sp lammpi
> #      pe_name           lammpi
> #      queue_list        all
> #      slots             6
> #      user_lists        NONE
> #      xuser_lists       NONE
> #      start_proc_args   /usr/local/lam/bin/sge-lam start
> #      stop_proc_args    /usr/local/lam/bin/sge-lam stop
> #      allocation_rule   $fill_up
> #      control_slaves    TRUE
> #      job_is_first_task FALSE
> #
> #    NOTE: It is probably easiest to use the qmon GUI to create the PE.
> #
> # 4. Add a new LAM node process schema into the $LAMHOME/etc area
> #    named sge-lam-conf.lamd. This should be a single line that
> #    adds the "sge-lam qrsh-local" prefix to the lamd startup.
> #
> #    % cat /usr/local/lam/etc/sge-lam-conf.lamd
> #      /usr/local/lam/bin/sge-lam qrsh-local /usr/local/lam/bin/lamd $inet_topo $debug $session_prefix $session_suffix
> #
> #### Submitting SGE JOBS
> #
> # Once this is set up, users can submit jobs as normal and should not need to
> # lamboot on their own. Users need only call mpirun for their MPI programs.
> # Here is an example job:
> #
> #    % cat lamjob.csh
> #      #$ -cwd
> #      set path=(/usr/local/lam/bin $path)
> #      echo "Starting my LAM MPI job"
> #      mpirun C conn-60
> #      echo "LAM MPI job done"
> #
> #
> #
> #### Comments/Issues email: christopher.duncan at xxxxxxx
> #
> # END INSTALL
>
>
> $verbose=1;
> #$debug=0;
> $debug=1;
>
> # close STDIN to avoid stdio race conditions and tty issues
> close(STDIN);
>
> if( $debug eq 1){
> open(SGEDEBUG,"> /tmp/sgedebug.$ENV{JOB_ID}.$$");
> select(SGEDEBUG); $|=1;
> open(STDERR,">> /tmp/sgedebug.$ENV{JOB_ID}.$$");
> }
>
> # set output for stderr and stdout to be unbuffered
> select(STDERR); $|=1;
> select(STDOUT); $|=1;
>
> $lamboot="$LAMHOME/bin/lamboot";
> $lamhalt="$LAMHOME/bin/lamhalt";
> #$sgelamconf="${SGE_ROOT}/lam/sge-lam-conf.lamd";
>
> # read in the args to figure out our task
> $func=shift @ARGV;
>
> $SGE_ROOT="$ENV{SGE_ROOT}";
> $sgelamconf="$SGE_ROOT/lam/sge-lam-conf.lamd";
>
>
> $arch=`${SGE_ROOT}/util/arch`;
> chomp($arch);
> $qrsh="${SGE_ROOT}/bin/${arch}/qrsh";
>
> # add LAM and SGE to path
> $ENV{'PATH'}.=":${SGE_ROOT}/bin/${arch}";
> $ENV{'PATH'}.=":${LAMHOME}/bin";
>
> #debug_print("TMPDIR = $ENV{TMPDIR}");
> debug_print("LAMHOME = $LAMHOME");
> debug_print("SGE_ROOT = $SGE_ROOT");
> debug_print("PATH = $ENV{PATH}");
> debug_print("qrsh = $qrsh");
> debug_print("ARGV = \"".join("\" \"",@ARGV)."\"");
> debug_print("sgelamconf = $sgelamconf");
>
> if("$func" eq "start"){
> debug_print("func=start");
> print "Starting SGE + LAM Integration\n";
> print "\t using tight integration scheme\n";
> start_proc_args();
> }elsif("$func" eq "stop"){
> debug_print("func=stop");
> print "Stopping SGE + LAM Integration\n";
> stop_proc_args();
> }elsif("$func" eq "qrsh-remote"){
> debug_print("func=qrsh-remote");
> qrsh_remote();
> }elsif("$func" eq "qrsh-local"){
> debug_print("func=qrsh-local");
> qrsh_local();
> }else{
> print STDERR "\nusage: $0 {start|stop}\n\n";
> exit(-1);
> }
>
>
> sub start_proc_args()
> {
>
> # we currently place the LAM host file in the TMPDIR that SGE uses.
> # if we place it elsewhere we need to clean it up
> $lamhostsfile="$ENV{TMPDIR}/lamhostfile";
>
> # flags and options for lamboot (-x, -s and -np may be useful in some envs)
> @lambootargs=("-nn","-ssi","boot","rsh","-ssi","boot_rsh_agent","$SGE_ROOT/lam/sge-lam qrsh-remote","-c","$sgelamconf");
> if($verbose){ push(@lambootargs,"-v"); }
> if($debug){ push(@lambootargs,"-d"); }
> push(@lambootargs,"$lamhostsfile");
> debug_print("LAMBOOT ARGS: @lambootargs $lamhostsfile");
>
> ### Need to convert the SGE hostfile to a LAM hostfile format
> # open and read the PE hostfile
> #system("cp $pe_hostfile /tmp");
>
> open(SGEHOSTFILE,"< $ENV{PE_HOSTFILE}");
> # convert to LAM bhost file format
> @lamhostslist=();
> while(<SGEHOSTFILE>){
> ($host,$ncpu,$junk)=split(/\s+/);
> push( @lamhostslist,"$host cpu=$ncpu");
> }
> close(SGEHOSTFILE);
>
> debug_print("LAMHOSTSLIST: @lamhostslist");
> # create the new lam bhost file
> open(LAMHOSTFILE,"> $lamhostsfile");
> print LAMHOSTFILE join("\n",@lamhostslist);
> print LAMHOSTFILE "\n";
> close(LAMHOSTFILE);
>
>
> if($debug){ close(SGEDEBUG); }
> debug_print("Exec Lamboot: $lamboot @lambootargs");
> exec($lamboot,@lambootargs);
> }
>
>
> sub stop_proc_args(){
>
> if($verbose){ push(@lamhaltargs,"-v"); }
> if($debug){ push(@lamhaltargs,"-d"); }
>
> # if($debug){ close(SGEDEBUG); }
> debug_print("Exec Lamhalt: $lamhalt @lamhaltargs");
> exec($lamhalt,@lamhaltargs);
> }
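
The hostfile conversion in start_proc_args() above can be tried outside
of SGE. Below is a rough Python equivalent, assuming each $PE_HOSTFILE
line starts with a host name followed by a slot count (any remaining
fields are ignored, as in the Perl code). The function name and the
queue field in the sample line are illustrative, not taken from a real
job. Unlike the Perl loop, this sketch skips blank or short lines.

```python
def pe_to_lam_hosts(pe_lines):
    """Convert SGE PE hostfile lines to LAM boot schema lines ("host cpu=N")."""
    lam_lines = []
    for line in pe_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, ncpu = fields[0], int(fields[1])
        lam_lines.append(f"{host} cpu={ncpu}")
    return lam_lines

# Hypothetical PE hostfile line; only the first two fields are used
sample = ["rational.math.uwo.ca 2 all.q@rational.math.uwo.ca UNDEFINED\n"]
print("\n".join(pe_to_lam_hosts(sample)))  # prints: rational.math.uwo.ca cpu=2
```

The printed line matches the LAMHOSTSLIST entry in the debug log, so
the conversion step itself looks fine for this single-host job.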
>
>
>
=== message truncated ===