[Beowulf] SGE + LAM

C.L. Lai [ALAN] clai33 at uwo.ca
Thu Aug 12 09:37:20 PDT 2004

I have been trying to set up an SGE6+LAM7 integration, with no luck so far.

After a long conversation on the LAM mailing list, I still don't know
whether the problem lies in my settings, LAM, SGE, or SGE+LAM, but some
people pointed to an error suggesting that the rsh/rshd from SGE doesn't work.

I am not getting any useful SGE logs; here is some log output generated by
the sge-lam script:

This is 'sge-lam start'

SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
SGE-LAM DEBUG: func=start
SGE-LAM DEBUG: LAMBOOT ARGS: -nn -ssi boot rsh -ssi boot_rsh_agent
/home/compute/sge/lam/sge-lam qrsh-remote -c
/home/compute/sge/lam/sge-lam-conf.lamd -v -d /tmp/537.1.all.q/lamhostfile
SGE-LAM DEBUG: LAMHOSTSLIST: rational.math.uwo.ca cpu=2
n0<24778> ssi:boot: Opening
n0<24778> ssi:boot: looking for module named rsh
n0<24778> ssi:boot: opening module rsh
n0<24778> ssi:boot: initializing module rsh
n0<24778> ssi:boot:rsh: module initializing
n0<24778> ssi:boot:rsh:agent: /home/compute/sge/lam/sge-lam qrsh-remote
n0<24778> ssi:boot:rsh:username: <same>
n0<24778> ssi:boot:rsh:verbose: 1000
n0<24778> ssi:boot:rsh:algorithm: linear
n0<24778> ssi:boot:rsh:priority: 10
n0<24778> ssi:boot: Selected boot module rsh
n0<24778> ssi:boot:base: looking for boot schema in following directories:
n0<24778> ssi:boot:base:   <current directory>
n0<24778> ssi:boot:base:   $TROLLIUSHOME/etc
n0<24778> ssi:boot:base:   $LAMHOME/etc
n0<24778> ssi:boot:base:   /etc/lam
n0<24778> ssi:boot:base: looking for boot schema file:
n0<24778> ssi:boot:base:   /tmp/537.1.all.q/lamhostfile
n0<24778> ssi:boot:base: found boot schema: /tmp/537.1.all.q/lamhostfile
n0<24778> ssi:boot:rsh: found the following hosts:
n0<24778> ssi:boot:rsh:   n0 rational.math.uwo.ca (cpu=2)
n0<24778> ssi:boot:rsh: resolved hosts:
n0<24778> ssi:boot:rsh:   n0 rational.math.uwo.ca -->
n0<24778> ssi:boot:rsh: starting RTE procs
n0<24778> ssi:boot:base:linear: starting
n0<24778> ssi:boot:base:server: opening server TCP socket
n0<24778> ssi:boot:base:server: opened port 35804
n0<24778> ssi:boot:base:linear: booting n0 (rational.math.uwo.ca)
n0<24778> ssi:boot:rsh: starting lamd on (rational.math.uwo.ca)
n0<24778> ssi:boot:rsh: starting on n0 (rational.math.uwo.ca): hboot -t -c
/home/compute/sge/lam/sge-lam-conf.lamd -d -v -sessionsuffix sge-537-0 -I
-H -P 35804 -n 0 -o 0
n0<24778> ssi:boot:rsh: launching locally
n0<24778> ssi:boot:rsh: successfully launched on n0 (rational.math.uwo.ca)
n0<24778> ssi:boot:base:server: expecting connection from finite list
n0<24778> ssi:boot:base:server: got connection from
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

As far as LAM could tell, the remote process started properly, but
then never called back.  Possible reasons that this may happen:

        - There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
        - Network routing from the remote host to the local host isn't
          properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line.  For example, 
       telnet <ipnumber> <portnumber>
   If all goes well, you should get a "Connection refused" error.  If
   you get any other kind of error, it could indicate either of the
   two conditions above.  Consult with your system/network
n0<24778> ssi:boot:base:server: failed to connect to remote lamd!
n0<24778> ssi:boot:base:server: closing server socket
n0<24778> ssi:boot:base:linear: aborted!
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
lamboot did NOT complete successfully

This is 'sge-lam qrsh-local'

SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
"/usr/bin/lamd" "-H" "" "-P" "35804" "-n" "0" "-o" "0" "-d" "-sessionsuffix" "sge-537-0"
SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
SGE-LAM DEBUG: func=qrsh-local
SGE-LAM DEBUG: QRSH LOCAL CONFIG: -inherit -nostdin -V
rational.math.uwo.ca /usr/bin/lamd -H -P 35804 -n 0 -o 0 -d
-sessionsuffix sge-537-0
SGE-LAM DEBUG: Exec qrsh-local: /home/compute/sge/bin/lx26-amd64/qrsh
-inherit -nostdin -V rational.math.uwo.ca /usr/bin/lamd -H
-P 35804 -n 0 -o 0 -d -sessionsuffix sge-537-0
rcmd: socket: Permission denied

The last line above is the one people think is qrsh/rsh/rshd related.
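
For reference, "rcmd: socket: Permission denied" is what rcmd() reports when it cannot get a reserved (privileged) source port, which commonly means the rsh client being run is not setuid root (for example, because it sits on a filesystem mounted nosuid). A minimal sketch of a check follows. The function name check_setuid_root is made up for illustration, and where the rsh binary used by qrsh actually lives is an assumption about the install, not something the logs above show.

```shell
#!/bin/sh
# Hypothetical helper: report whether a binary has the setuid bit set
# and is owned by root (uid 0). Prints one of:
#   missing | not-setuid | not-root-owned | ok
check_setuid_root() {
    f=$1
    if [ ! -e "$f" ]; then
        echo "missing"
    elif [ ! -u "$f" ]; then
        echo "not-setuid"
    elif [ "$(ls -lnL "$f" | awk '{print $3}')" != "0" ]; then
        # third field of "ls -ln" is the numeric owner uid
        echo "not-root-owned"
    else
        echo "ok"
    fi
}
```

Assuming the SGE rsh client is at $SGE_ROOT/utilbin/lx26-amd64/rsh (an assumed path), it would be called as `check_setuid_root "$SGE_ROOT/utilbin/lx26-amd64/rsh"`. Anything other than "ok" would be worth looking into before blaming LAM.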

%qconf -sp lam
pe_name           lam
slots             100
user_lists        NONE
xuser_lists       NONE
start_proc_args   /home/compute/sge/lam/sge-lam start
stop_proc_args    /home/compute/sge/lam/sge-lam stop
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min


-------------- next part --------------

#  1. Install this PERL executable, sge-lam inside the LAM bin dir. 
#     Make sure it is executable.
#  2. Modify the following variables: LAMHOME below to fit your site setup. 


#  3. Create an SGE PE that can be used to submit lam jobs. The following 
#     is an example assuming the scripts exist in /usr/local/lam/bin. 
#     You should replace the queue_list and slots with your site specific 
#     values or set it to "all" to use all the queues.  
#        % qconf -sp lammpi 
#        pe_name lammpi
#        queue_list all
#        slots 6
#        user_lists NONE
#        xuser_lists NONE
#        start_proc_args /usr/local/lam/bin/sge-lam start
#        stop_proc_args /usr/local/lam/bin/sge-lam stop
#        allocation_rule $fill_up
#        control_slaves TRUE
#        job_is_first_task FALSE
#    NOTE: It is probably easiest to use the qmon GUI to create the PE.
#   4. Add a new LAM node process schema into the $LAMHOME/etc area
#      named sge-lam-conf.lamd. This should be a single line that
#      adds the "sge-lam qrsh-local" prefix to the lamd startup.
#       % cat /usr/local/lam/etc/sge-lam-conf.lamd
#       /usr/local/lam/bin/sge-lam qrsh-local /usr/local/lam/bin/lamd  
#         $inet_topo $debug $session_prefix $session_suffix
#### Submitting SGE JOBS
#   Once this is set up, users can submit jobs as normal and should not need to 
#   run lamboot on their own. Users need only call mpirun for their MPI programs. 
#   Here is an example job:
#        % cat lamjob.csh
#        #$ -cwd
#        set path=(/usr/local/lam/bin $path)
#        echo "Starting my LAM MPI job"
#        mpirun C conn-60
#        echo "LAM MPI job done"
#### Comments/Issues email: christopher.duncan at xxxxxxx


# close STDIN to avoid stdio race conditions and tty issues

if( $debug eq 1){
	open(SGEDEBUG,"> /tmp/sgedebug.$ENV{JOB_ID}.$$");
	select(SGEDEBUG); $|=1;
	open(STDERR,">> /tmp/sgedebug.$ENV{JOB_ID}.$$");

# set output for stderr and stdout to be unbuffered
select(STDERR); $|=1;
select(STDOUT); $|=1;


# read in the args to figure out our task
$func=shift @ARGV;



# add LAM and SGE to path

#debug_print("TMPDIR = $ENV{TMPDIR}");
debug_print("LAMHOME = $LAMHOME");
debug_print("SGE_ROOT = $SGE_ROOT");
debug_print("PATH = $ENV{PATH}");
debug_print("qrsh = $qrsh");
debug_print("ARGV = \"".join("\" \"",@ARGV)."\"");
debug_print("sgelamconf = $sgelamconf");

if("$func" eq "start"){
	print "Starting SGE + LAM Integration\n";
	print "\t using tight integration scheme\n";
}elsif("$func" eq "stop"){
	print "Stopping SGE + LAM Integration\n";
}elsif("$func" eq "qrsh-remote"){
}elsif("$func" eq "qrsh-local"){
	print STDERR "\nusage: $0 {start|stop}\n\n";	

sub start_proc_args()

  # we currently place the LAM host file in the TMPDIR that SGE uses.
  # if we place it elsewhere we need to clean it up

  # flags and options for lamboot (-x, -s and -np may be useful in some envs)
  @lambootargs=("-nn","-ssi","boot","rsh","-ssi","boot_rsh_agent","$SGE_ROOT/lam/sge-lam qrsh-remote","-c","$sgelamconf");
  if($verbose){ push(@lambootargs,"-v"); }
  if($debug){ push(@lambootargs,"-d"); }
  debug_print("LAMBOOT ARGS: @lambootargs $lamhostsfile");

  ### Need to convert the SGE hostfile to a LAM hostfile format
  # open and read the PE hostfile
  #system("cp $pe_hostfile /tmp");

  # convert to LAM bhost file format
	push( @lamhostslist,"$host cpu=$ncpu");

  debug_print("LAMHOSTSLIST: @lamhostslist");
  # create the new lam bhost file
  open(LAMHOSTFILE,"> $lamhostsfile");
  print LAMHOSTFILE join("\n",@lamhostslist);
  print LAMHOSTFILE "\n";

  if($debug){ close(SGEDEBUG); }
  debug_print("Exec Lamboot: $lamboot @lambootargs");
  exec($lamboot,@lambootargs);

sub stop_proc_args(){

  if($verbose){ push(@lamhaltargs,"-v"); }
  if($debug){ push(@lamhaltargs,"-d"); }

#  if($debug){ close(SGEDEBUG); }
  debug_print("Exec Lamhalt: $lamhalt @lamhaltargs");
  exec($lamhalt,@lamhaltargs);

sub qrsh_remote()

  @myargs=("-inherit","-nostdin","-V",@ARGV);

  debug_print("QRSH REMOTE CONFIG: @myargs");
#  if($debug){ close(SGEDEBUG); }
  debug_print("Exec qrsh-remote: $qrsh @myargs");
  exec($qrsh,@myargs);

sub qrsh_local()
  # we are running a local qrsh to add the lamd into the current job
  # on the local node using the LAM boot schema

  # get the hostname to pass to qrsh

  # tell SGE to add this command into the JOB_ID job by using qrsh -inherit
  # the hostname is not passed in this case in ARGV by lamboot
  @myargs=("-inherit","-nostdin","-V","$hostname",@ARGV);

  debug_print("QRSH LOCAL CONFIG: @myargs");
#  if($debug){ close(SGEDEBUG); }
  debug_print("Exec qrsh-local: $qrsh @myargs");
  exec($qrsh,@myargs);

sub debug_print()
    print SGEDEBUG "SGE-LAM DEBUG: @_\n";
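
The hostfile conversion step in start_proc_args above can be sketched in shell as follows. This is only an illustration, not the script's actual code: it assumes each line of the SGE PE hostfile has the form "host slots queue processor-range", and it emits the "host cpu=N" lines seen in the LAMHOSTSLIST debug output. The function name pe_to_lam_hosts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert an SGE PE hostfile into LAM boot schema lines.
# Assumed input line:  <host> <slots> <queue> <processor-range>
# Output line:         <host> cpu=<slots>
pe_to_lam_hosts() {
    # skip blank or malformed lines that lack a slot count
    awk 'NF >= 2 { print $1 " cpu=" $2 }' "$1"
}
```

In a PE start procedure this could be run as `pe_to_lam_hosts "$PE_HOSTFILE" > "$TMPDIR/lamhostfile"`, with the resulting file handed to lamboot as the boot schema.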
