pbslam: PBS Interface Script for LAM

Tom Crockett tom at compsci.wm.edu
Wed Feb 21 08:17:04 PST 2001


Hi,

Attached to this message is a script and accompanying HTML man page for
launching LAM MPI jobs under PBS.  Perhaps some of you will find it
useful.  It provides the following services:

  - Mapping of processes onto assigned processors (by constructing
    a LAM application schema).

  - Initialization of the LAM runtime environment (lamboot).

  - Execution of the LAM program (via mpirun).
 
  - Shutting down the LAM runtime system (equivalent to wipe, only
    faster).
 
  - Intercepting abnormal termination conditions (qdel requests,
    over-limit conditions, keyboard interrupts, etc.) in order to
    clean up LAM processes before aborting a PBS job. 
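
To make the first item concrete, here is a minimal sketch of how a PBS
node file can be turned into a process placement.  It is not the
attached script, just an illustration of the two placement orders its
comments describe (VP order and node order).  The sub name map_procs
and the host names a, b, c are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map $nproc processes onto the hosts listed in a
# PBS-style node file (one entry per virtual processor).
# Returns a list of LAM node indices, one per process.
sub map_procs {
    my ($order, $nproc, @vplist) = @_;

    # Number distinct hosts in order of first appearance; count VPs on each.
    my (%nodenum, %ncpus, @nodes);
    for my $vp (@vplist) {
        if (!exists $nodenum{$vp}) {
            $nodenum{$vp} = scalar @nodes;
            push @nodes, $vp;
        }
        $ncpus{$vp}++;
    }

    my @placement;
    if ($order eq 'vp') {
        # VP order: follow the node file as given, wrapping around if needed.
        for my $i (0 .. $nproc - 1) {
            push @placement, $nodenum{ $vplist[$i % @vplist] };
        }
    }
    else {
        # Node order: deal processes out one per node per pass, skipping
        # nodes with no free VP; once every VP is taken, restart at node 0
        # and round-robin over all nodes.
        my %free = %ncpus;
        my $j = 0;
        for my $i (0 .. $nproc - 1) {
            if ($i < @vplist) {
                $j = ($j + 1) % @nodes while $free{ $nodes[$j] } == 0;
                $free{ $nodes[$j] }--;
            }
            elsif ($i == @vplist) {
                $j = 0;
            }
            push @placement, $j;
            $j = ($j + 1) % @nodes;
        }
    }
    return @placement;
}

my @vps = qw(a a b b c);   # two dual-CPU hosts and one single-CPU host
print "VP order:   @{[ map_procs('vp',   5, @vps) ]}\n";
print "node order: @{[ map_procs('node', 5, @vps) ]}\n";
```

Each index would then become one line of the application schema, in
the form "n<index> <program> <args>".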

This script was originally developed a couple of years ago for the Coral
cluster, a Linux/Pentium system located at ICASE.  I recently did a
complete rewrite in Perl 5 for William and Mary's SciClone cluster, a
Solaris/UltraSPARC system.  The primary improvements in the new version
include:

  - More flexibility in mapping LAM processes onto PBS virtual
    processors. This is particularly useful in clusters with
    multi-processor nodes or nodes with differing numbers of
    processors.

  - Faster and more robust cleanup of LAM processes following either
    normal or abnormal termination of the job.  In particular, LAM's
    "wipe" command has been replaced with multiple concurrent
    remote invocations of "tkill".  On SciClone, this reduced the
    time to clean up 108 nodes from 200 seconds to 6 seconds.

  - Option to abort the job if one or more nodes is busier than some
    user-specified threshold.

  - Potentially more informative exit status.
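
The concurrent cleanup in the second item boils down to a bounded pool
of child processes.  A minimal sketch, assuming a hypothetical helper
run_bounded and a stand-in for the remote command (each child simply
exits 0 instead of running rsh and tkill):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run $work->($item) in a child process for each
# item, keeping at most $max_active children alive at once.
# Returns a hash mapping each item to its child's exit code
# (-1 if the fork itself failed).
sub run_bounded {
    my ($max_active, $work, @items) = @_;
    my (%running, %status);

    # Wait for any one child and record its exit code.
    my $reap = sub {
        my $pid = wait;
        if ($pid == -1) {          # no children left; avoid spinning
            %running = ();
            return;
        }
        my $item = delete $running{$pid};
        $status{$item} = $? >> 8 if defined $item;
    };

    for my $item (@items) {
        # Pool is full: wait for somebody to finish before forking again.
        $reap->() while keys(%running) >= $max_active;

        my $pid = fork;
        if (!defined $pid) {       # fork failed; note it and move on
            $status{$item} = -1;
            next;
        }
        if ($pid == 0) {           # child: do the work and exit
            exit $work->($item);
        }
        $running{$pid} = $item;    # parent: remember who is running
    }

    # Drain the remaining children.
    $reap->() while %running;
    return %status;
}

# Stand-in for the remote tkill: every "node" reports success.
my %st = run_bounded(3, sub { return 0 }, map { "node$_" } 1 .. 8);
printf "%d nodes processed\n", scalar keys %st;
```

The attached script additionally arms an alarm in each child so that a
hung remote command cannot stall the cleanup.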

The version attached here has been tested with LAM 6.3.2 and OpenPBS
2.3.8 under Solaris 7.  Porting it to Linux or other platforms shouldn't
be too difficult.  The things I know about that are either site-specific
or platform-dependent include:

  - A mechanism for detecting CPU load on all of the nodes belonging
    to the PBS job.  Most systems provide several different ways of
    obtaining this information.  On SciClone, I wrote a program called
    "cpubusy" that accepts as arguments a list of nodes and then uses
    the rstat protocol to measure CPU utilization on each of them over
    some interval (I use 5 seconds).  If anyone wants this, I will be
    happy to send it to you.  If you can't figure out how to get this
    information efficiently on your system, you can just eliminate the
    -C and -X options from pbslam.

  - The specific set of signals (and corresponding signal numbers)
    which need to be intercepted will vary with the OS.  Basically,
    you need to catch any signal which could otherwise cause the
    pbslam script to exit without cleaning up the nodes.  SIGTERM
    gets special handling because PBS uses it to notify jobs that
    it's getting ready to kill them.

  - The procedure for checking to see that pbslam is in the right
    place in the process hierarchy relative to pbs_mom depends on
    the behavior of your "ps" command.  Based on very limited
    testing, I think the options I use under Solaris will also
    work under Linux with minor adjustments.  The same information
    can also be obtained on Linux by sifting through /proc.

  - Pathnames of various system commands (pbs_mom, rsh, ps, date,
    etc.) may have to be adjusted from one platform to the next.

  - The default value of LAMHOME depends on where your LAM software
    is installed.

  - Probably some other stuff I've overlooked.
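
The special treatment of SIGTERM mentioned above can be sketched in
isolation as follows.  The handler names follow the script; the @events
array and the self-directed kill calls are assumptions added only to
make the example self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @events;   # record of what the handlers did (illustration only)

# Second-stage handler: in pbslam this is where LAM gets shut down.
sub cleanup {
    my ($sig) = @_;            # Perl passes the signal name, e.g. "TERM"
    push @events, "cleanup on SIG$sig";
}

# First-stage handler: let the first SIGTERM pass so the application can
# shut down gracefully, then re-arm so the next one triggers cleanup.
sub term1 {
    my ($sig) = @_;
    push @events, "first SIG$sig noted";
    $SIG{$sig} = \&cleanup;
}

$SIG{TERM} = \&term1;

# Simulate PBS sending its warning SIGTERM, then a second one later.
kill 'TERM', $$;
kill 'TERM', $$;

print "$_\n" for @events;
```

Which other signals need a handler is the platform-dependent part; the
two-stage pattern itself should carry over unchanged.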


If you find bugs or know of a better way to do something, please let me
know.

-Tom

-- 
Tom Crockett

College of William and Mary               email:  tom at compsci.wm.edu
Computational Science Cluster             phone:  (757) 221-2762
Savage House                              fax:    (757) 221-2023
P.O. Box 8795
Williamsburg, VA  23187-8795

Home Page:  http://www.compsci.wm.edu/~tom/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20010221/4229e976/attachment.html>
-------------- next part --------------
#!/usr/local/bin/perl
#
#  Runs a LAM 6.3 job under PBS.
#  Based on a script originally developed at ICASE.
#
#  Revised:
#     04/15/99 tom    - Original version.
#     04/28/99 tom    - Add PBS info to job header in verbose mode.
#     06/01/99 tom    - Use full path to pbsdsh; eliminate -v from mpirun;
#                       export LAMHOME.
#     05/15/00 josip  - Allow multiple hosts in PBS_NODEFILE, use
#                       /usr/local/lam as default.
#     02/09/01 tom    - Adapt for use on SciClone under Solaris 7:
#                     - Rewrite in Perl for more flexibility.
#                     - Generate LAM schemas for better control of process
#                       placement with PBS.
#                     - Support process placement on virtual processors in
#                       either PBS or round-robin order.
#     02/15/01 tom    - Supply more informative return codes.  If pbslam
#                       terminates by catching a signal, return the signal
#                       number; otherwise return the exit status from lamboot
#                       or mpirun.
#     02/19/01 tom    - Change of terminology: "PBS order" is now "VP order"
#                       and "round-robin" is now "node order".  -r option is
#                       changed to -n accordingly.
#

#
#  External packages
#
use Cwd;

#
# Disable buffering to avoid incomplete or out-of-order output.
#
select(STDERR);
$| = 1;
select(STDOUT);
$| = 1;


#
#  Construct help messages
#
$Usage = "Usage:  exec pbslam [-dfghnOtTvx] [-c <#>] [-C|-X load] [-D | -W dir] <program> [<args>]\n";

$Help = 
"Synopsis:     exec pbslam [-dfghnOtTvx] [-c <#>] [-C|-X load] \\
                           [-D | -W dir] <program> [<args>]

Description:  Run a LAM MPI application under PBS.
              For proper cleanup, must be exec'ed from top-level shell.

Options:      -c          Run # copies of the program on the allocated nodes.
              -C          Check if processor activity exceeds \"load\".
              -d          Use indirect communication via LAM daemons.
              -D          Use location of <program> as working directory.
              -f          Do not configure stdio descriptors.
              -g          Enable Guaranteed Envelope Resources (GER) mode.
              -h          Print this help message.
              -n          Use node-order process schema (vs. VP order).
              -O          System is heterogeneous; enable data conversion.
              -t          Enable tracing with generation initially off.
              -T          Enable tracing with generation initially on.
              -v          Verbose mode.
              -W          Use \"dir\" as working directory.
              -x          Fault tolerant (heartbeat) mode.
              -X          Abort if processor activity exceeds \"load\". Implies -C.
              <program>   Executable MPI application.
              <args>      Arguments for application program.

Defaults:     Configure stdio; heartbeat off; don't check processor load;
              data conversion off; GER off; direct communication (daemons off);
              tracing disabled; one process on each PBS virtual processor in
              VP order.

Example:      [prompt] qsub -l nodes=4:dual:ppn=2 -l walltime=600
              cd ~/mydir
              exec pbslam -vx -X 0.01 ./prog1 arg1 arg2 arg3
              ^D
\n";

#
#  Initialization and defaults
#
$0 =~ s|^.*/||;  # remove directory prefix to reduce verbosity in err msgs
($LAMHOME = $ENV{'LAMHOME'}) || ($LAMHOME = $ENV{'LAMHOME'} = '/usr/local/lam');
$Nnode = 1;
$Ncpu = 1;
$Nproc = 0;
$C = "";
$D = "";
$WDir = "";
$v = "";
$x = "";
$O = "-O";
$c2c = "-c2c";
$f = "";
$ger = "-nger";
$n = "";
$t = "";
$CpuBusy = 0.01;  # maximum benign processor activity level
$Ts = '%a %b %d %Y %X %Z';
$Mom = "/usr/local/pbs/sbin/pbs_mom";
$Rsh = "/bin/rsh";
$TimeOut = 45;
$MaxActive = 30;
$BootSchema = "/tmp/pbslam.boot_schema.$$";
$AppSchema = "/tmp/pbslam.app_schema.$$";
$rc = 0;
$sep = "----------------------------------------";
$sep = "$sep$sep\n";

#
#  Process command line options
#
while ($arg = shift)
   {

   # single-letter options
   if ($arg =~ /^-[dDfghnOtTvx]+$/)
      {
      foreach $key (split(//, $arg))
         {
         if ($key eq '-') { next; } 
         if ($key eq 'd') { $c2c .= " -lamd"; next; } 
         if ($key eq 'D') { $D = "-D";  next; } 
         if ($key eq 'f') { $f = " -f"; next; }
         if ($key eq 'g') { $ger = " -ger"; next; }
         if ($key eq 'h') { print $Help; exit 0; }
         if ($key eq 'n') { $n = "-n"; next; }
         if ($key eq 'O') { $O = ""; next; }
         if ($key eq 't') { $t = "-toff"; next; }
         if ($key eq 'T') { $t = "-ton"; next; }
         if ($key eq 'v') { $v = "-v"; next; }
         if ($key eq 'x') { $x = "-x"; next; }
         }
      next;
      }

   # no. of processes
   if ($arg eq '-c')
      {
      (@ARGV > 0) || die $Usage;
      $Nproc = shift;
      (($Nproc =~ /^\d+$/) && ($Nproc > 0)) ||
         die "$0: -c: no. of processes must be a positive integer.\n";
      next;
      }

   # load threshold checks
   if (($arg eq '-C') || ($arg eq '-X'))
      {
      (@ARGV > 0) || die $Usage;
      $C = $arg;
      $load = shift;
      (($load =~ /^\d+$|^\d+\.\d+$/) && (($load >= 0.0) && ($load <= 1.0))) ||
         die "$0: $arg: load threshold must be a decimal number between 0.0 and 1.0.\n";
      next;
      }

   # working directory
   if ($arg eq '-W')
      {
      (@ARGV > 0) || die $Usage;
      $D = $arg;
      $WDir = shift;
      next;
      }

   # unknown option
   if ($arg =~ /^-.*/) { die $Usage; }

   # no more options
   last; 
   }

#
#  Get program and program args
#
if (! $arg)  # no args left
   { die $Usage; }

$Cmd = $arg;
($Prog = $arg) =~ s|^.*/||;
$Args = join(" ", @ARGV);

#
#  Find the directory containing the executable
#
$Dir = "";
if ($Prog eq $Cmd)  # need to find the executable
   {
   @pathdirs = split(':', $ENV{'PATH'});
   SRCHDIRS: foreach $dir (@pathdirs)
      {
      opendir(DIR, $dir) || next;
      @files = readdir(DIR);
      foreach $file (@files)
         {
         if ((-x "$dir/$file") && ($file eq $Prog))  # found it
            {
            $Dir = $dir;
            closedir(DIR);
            last SRCHDIRS;
            }
         }
      closedir(DIR);
      }
   $Dir || die "$0: $Cmd not found or is not executable.\n";
   }
else  # use directory prefix to locate the executable
   {
   ($Dir = $Cmd) =~ s|/[^/]*$||;
   (-x "$Dir/$Prog") ||
      die "$0: $Dir/$Prog does not exist or is not executable.\n";
   }
if ($Dir eq ".") { $Dir = getcwd; }

#
#  Change to alternate working directory if requested.
#
if ($D eq "-W")
   {
   (-d $WDir) || die "$0: invalid working directory: $WDir\n";
   Cwd::chdir($WDir);
   }
elsif ($D eq "-D")
   {
   $WDir = $Dir;
   Cwd::chdir($WDir);
   }

#
#  Make sure we're running under PBS
#
$ENV{'PBS_ENVIRONMENT'} || die "$0: not executing within a PBS environment.\n";
$ENV{'PBS_NODEFILE'} ||
   die "$0: PBS_NODEFILE is undefined.\n";

#
#  For proper signal handling, pbslam must run in place of the top-level shell.
#  For interactive jobs, this means that pbs_mom must be the parent.
#  For batch jobs, pbs_mom must be the grandparent.
#
$ppid = getppid;
if ($ENV{'PBS_ENVIRONMENT'} eq "PBS_INTERACTIVE")  # look at parent
   {
   chop($Parent = `/usr/bin/ps -p $ppid -o comm=`);
   }
elsif ($ENV{'PBS_ENVIRONMENT'} eq "PBS_BATCH")  # look at grandparent
   {
   chop($gppid = `/usr/bin/ps -p $ppid -o ppid=`);
   chop($Parent = `/usr/bin/ps -p $gppid -o comm=`);
   }
else
   {
   die "$0: unexpected value for PBS_ENVIRONMENT: $ENV{'PBS_ENVIRONMENT'}\n"; 
   }
if ($Parent ne $Mom)
   {
   warn "$0 must be exec'ed from the top-level shell.\n";
   die "$Usage";
   }

#
#  Print out job info
#
if ($v)
   {
   print "\n$sep";
   print "\n";
   chop($date = `/bin/date +\'$Ts\'`);
   print "W&M SciClone Cluster:  $date\n";
   print "Working directory:     $ENV{'PWD'}\n";
   print "LAM directory:         $LAMHOME\n";
   print "PBS job name:          $ENV{'PBS_JOBNAME'}\n";
   print "PBS job id:            $ENV{'PBS_JOBID'}\n";
   print "PBS queue:             $ENV{'PBS_QUEUE'}\n";
   print "\n";
   }

#
#  Figure out how many nodes, processors, and processes:
#    Ncpu  = no. of PBS virtual processors assigned to job
#    Nnode = no. of distinct nodes
#    Nproc = no. of processes
#
#  Get list of virtual processors from PBS
#
if ($v)
   {
   print "$sep\n";
   print "PBS Virtual Processor Allocation\n\n";
   }
open(PBS_NODEFILE, $ENV{'PBS_NODEFILE'}) ||
    die "$0: open failed for $ENV{'PBS_NODEFILE'}.\n";
(@vplist = <PBS_NODEFILE>) || die "$0: $ENV{'PBS_NODEFILE'} is empty.\n" ;
close(PBS_NODEFILE);
$v && print @vplist, "\n";
chop(@vplist);
$Ncpu = @vplist; 

#
#  Nproc = Ncpu unless -c option requests otherwise.
#
if ($Nproc <= 0)  # default is one process per virtual processor
   { $Nproc = $Ncpu; }

#
#  Build node list and determine no. of VP's per node
#
$Nnode = 0;
foreach $vp (@vplist)
   {
   if ($nodecpus{$vp})  # already have this node, increment vp count
      {
      $nodecpus{$vp}++;
      }
   else  # first time we've seen this node
      {
      $nodenum{$vp} = $Nnode;
      $nodecpus{$vp} = 1;
      push(@nodelist, $vp);
      $Nnode++;
      }
   }

($Nnode <= $Ncpu) || die "$0: internal error: Nnode > Ncpu\n";

#
#  Create LAM boot schema (host list).
#
open(BOOTSCHEMA, ">$BootSchema") ||
   die "$0: open failed for boot schema file $BootSchema\n";
foreach (@nodelist) { print BOOTSCHEMA "$_\n"; }
close(BOOTSCHEMA);

#
#  Build LAM application (process) schema using one of two options for
#  process placement:
#    - VP order: Processes are assigned one per VP in the order listed
#      in PBS_NODEFILE.  If there are more processes than VP's, wrap
#      around and start over at the beginning of the VP list, assigning
#      one process per VP.  Repeat until all processes have been assigned
#      to VP's.
#    - Node order:  Processes are assigned one per node, wrapping
#      around until all VP slots on all nodes are filled.  If there are
#      more processes than nodes, start over at the beginning of the node
#      list, assigning one process per node.  Repeat until all processes have
#      been assigned to nodes.

if ($v)
   {
   print "$sep\n";
   print "LAM Process Mapping\n\n";
   print "Process   Node No.   Node Name\n";
   print "-------   --------   ---------\n";
   }

if ($n eq "-n")
   { &rr_schema; }
else
   { &pbs_schema; }

$v && print "\n";

#
#  Check node activity levels.
#  Uses locally-developed "cpubusy" utility based on Sun rstat RPC.
#  For accurate stats, local host should appear first in nodelist.
#
if ($C)
   {
   if ($v)
      {
      print "$sep\n";
      printf "Checking node cpu activity with threshold of %.4f\n\n", $load;
      }
   $nodes = join(" ", @nodelist);
   $tmp = `/usr/local/bin/cpubusy 5 $nodes 2>/dev/null`;
   $tmp =~ s/^\s*//;
   $tmp =~ s/\s*$//;
   @nodebusy = split(/\s/, $tmp);
   for ($i = 0, $over = 0; $i < $Nnode; $i++)
      {
      if ($nodebusy[$i] > $load)
         {
         $over++;
         printf "%-11s cpu load %.4f exceeds threshold\n",
                 "$nodelist[$i]:", $nodebusy[$i];
         }
      elsif ($nodebusy[$i] == -1.0)  # something is probably broken
         {
         $over++;
         printf "%16s: load info unavailable\n", $nodelist[$i];
         }
      }
   if ($v)
      {
      if ($over)
         { print "\n"; }
      else
         { print "Cpu load is within tolerance on all nodes.\n\n"; }
      }
   ($over && ($C eq "-X")) &&
       die "$0: cpu load exceeds threshold, job aborted.\n";
   }

#
#  Build LAM commands
#
$Exec = "$LAMHOME/bin/mpirun -pty -w -wd $ENV{'PWD'} $v $O $f $c2c $ger $t $AppSchema";
$Boot = "$LAMHOME/bin/lamboot $v $x $BootSchema";

#
#  Catch termination requests and other aborts.
#  Let the first SIGTERM go by so that the application will have time to do a
#  graceful shutdown if it so desires.  Catch the second SIGTERM to clean up
#  the nodes.  Try to catch all other signals which might cause PBSLAM to
#  abort.  Don't catch SIGCHLD!
#
@Signals = (HUP, INT, QUIT, ILL, TRAP, ABRT, EMT, FPE, BUS, SEGV, SYS, PIPE, ALRM, USR1, USR2, POLL, STOP, TSTP, TTIN, TTOU, VTALRM, PROF, XCPU, XFSZ);
%SigNum = (HUP, 1, INT, 2, QUIT, 3, ILL, 4, TRAP, 5, ABRT, 6, EMT, 7, FPE, 8,
           KILL, 9, BUS, 10, SEGV, 11, SYS, 12, PIPE, 13, ALRM, 14, TERM, 15,
           USR1, 16, USR2, 17, CHLD, 18, PWR, 19, WINCH, 20, URG, 21, POLL, 22,
           STOP, 23, TSTP, 24, CONT, 25, TTIN, 26, TTOU, 27, VTALRM, 28,
           PROF, 29, XCPU, 30, XFSZ, 31, WAITING, 32, LWP, 33, FREEZE, 34,
           THAW, 35, CANCEL, 36);
foreach $sig (@Signals)
   { $SIG{$sig} = 'cleanup'; }
$SIG{TERM} = 'term1';

#
#  Boot LAM on the allocated nodes
#
$v && print "\n$sep\n";
chop($date = `/bin/date +\'$Ts\'`);
print "$date: booting LAM ...\n";
$v && print "\n$Boot\n\n";
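$rc = system($Boot);
# system() returns a wait status: exit code in the high byte, signal in the low bits
$rc = ($rc == -1) ? 255 : (($rc >> 8) || ($rc & 127));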

#
#  Run the program
#
$v && print "\n$sep\n";
chop($date = `/bin/date +\'$Ts\'`);
print "$date: executing $Prog ...\n";
$v && print "\n$Exec\n\n$sep\n";
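$rc = system("$Exec");
# system() returns a wait status: exit code in the high byte, signal in the low bits
$rc = ($rc == -1) ? 255 : (($rc >> 8) || ($rc & 127));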

#
#  Shutdown LAM, clean up temp files, and exit
#
&cleanup;

#
#  Should never get here.
#
exit 1;


#
#  Build LAM process schema in PBS virtual processor order
#
sub pbs_schema
{

my ($i, $j);

open(APPSCHEMA, ">$AppSchema") ||
   die "$0: open failed for process schema file $AppSchema\n";

for ($i = 0, $j = 0; $i < $Nproc; $i++)
   {
   print APPSCHEMA "n$nodenum{$vplist[$j]} $Dir/$Prog $Args\n";
   ($v) && printf "p%-6d   n%-7d   %s\n", $i, $nodenum{$vplist[$j]},
                  $vplist[$j];
   if (++$j == $Ncpu) { $j = 0; }
   }

close(APPSCHEMA);
return $Nproc;
}


#
#  Build LAM process schema in round-robin node order.
#
sub rr_schema
{

my ($i, $j);
my %vpavail = %nodecpus;

open(APPSCHEMA, ">$AppSchema") ||
   die "$0: open failed for process schema file $AppSchema\n";

for ($i = 0, $j = 0; $i < $Nproc; $i++)
   {
   if ($i < $Nnode)  # first pass, assign one process per node
      { $vpavail{$nodelist[$j]}--; }
   elsif ($i < $Ncpu)  # fill in any free VP slots
      {
      while ($vpavail{$nodelist[$j]} == 0)  # find the next free VP
         { if (++$j == $Nnode) { $j = 0; } }
      $vpavail{$nodelist[$j]}--;
      }
   else  # remaining processes use strict round-robin by node
      {
      if ($i == $Ncpu) { $j = 0; }  # start over with node 0
      }
   print APPSCHEMA "n$j $Dir/$Prog $Args\n";
   ($v) && printf "p%-6d   n%-7d   %s\n", $i, $j, $nodelist[$j];
   if (++$j == $Nnode) { $j = 0; }  # wrap around to beginning of node list
   }

close(APPSCHEMA);
return $Nproc;
}


#
#  Intercept the first SIGTERM and redirect subsequent SIGTERMs to the
#  cleanup routine.  This strategy gives the application a queue-dependent
#  amount of time to clean up before we shut down LAM.  (Maximum PBS grace
#  period for over-limit jobs is two minutes.)
#
sub term1
{
my ($caught) = @_;
chop($date = `/bin/date +\'$Ts\'`);
print "$0: ($date) caught SIG$caught; waiting for application to shut down...\n";
$SIG{$caught} = 'cleanup';
return 0;
}


#
#  Function to clean up after normal or abnormal termination.
#  If an argument is present, it is assumed to be the name of the signal
#  (e.g. "TERM") which caused a trap to "cleanup".
#  
#  With large numbers of nodes (beyond about 40), LAM's "wipe" command takes
#  so long to run that PBS may generate a SIGKILL to forcibly terminate
#  the job before LAM can be shut down on all of the nodes.  To speed things
#  up, we bypass "wipe" and run numerous copies of "tkill" concurrently.
#
sub cleanup
{

my ($pid, $caught);

($caught) = @_;
 
# Ignore signals while we shut down LAM.
# Need SIGCHLD to detect completion of remote commands.
foreach $sig (@Signals, TERM)
   {
   if ($sig eq 'CHLD') # safety check just in case
      { $SIG{$sig} = 'DEFAULT'; }
   else
      { $SIG{$sig} = 'IGNORE'; }
   }

chop($date = `/bin/date +\'$Ts\'`);
if ($caught)
   {
   print "$0: ($date) caught SIG$caught; shutting down LAM...\n";
   $rc = $SigNum{$caught};
   }

$v && print "\n$sep\n";
print "$date: cleaning up...\n";
$v && print "\n$Rsh -n ... $LAMHOME/bin/tkill $v\n\n";

# Shutdown LAM
$active = 0;
foreach $node (@nodelist)
   {
   $pid = fork;
   if (!defined($pid))  # fork failed; skip this node instead of falling into the child branch
      { print "$0: fork failed for $node\n"; next; }
   if ($pid == 0)  # child
      {
      $SIG{ALRM} = 'timeout';
      alarm $TimeOut;
      open(REMOTE, "$Rsh -n $node $LAMHOME/bin/tkill $v 2>&1 |") || 
         die "$0: open failed for $Rsh to $node\n";
      while (<REMOTE>)
         { print "$node: $_"; }
      close(REMOTE);  # close returns status from remote command
      $? && print "$node: exit status = $?\n";
      exit 0;
      }
   else  # parent
      {
      $procs{$pid} = $node;
      $active++;
      }
   if ($active == $MaxActive)
      { &reap; }  # wait for somebody to complete before spawning more children
   }

# Wait for all nodes to finish.
while (scalar(%procs))
   { &reap; }

#  Delete temp files
unlink($BootSchema, $AppSchema);

($v) && print "\n";
chop($date = `/bin/date +\'$Ts\'`);
print "$date: cleanup complete; exiting with status = $rc.\n";
($v) && print "\n$sep\n";

exit $rc;
}


#
#  wait for a child process to finish
#
sub reap
{
my ($pid, $node);
$pid = wait;
if ($pid == -1) 
   { die "$0: waiting for non-existent processes!\n"; }
if (exists($procs{$pid}))
   {
   $node = delete $procs{$pid};
   $active--;
   }
return 0;
}


#
#  alarm handler for remote command timeouts
#
sub timeout
{
die "$node: tkill timed out\n";
}

