data transfer and condor

Robert G. Brown rgb at phy.duke.edu
Fri Aug 3 08:53:55 PDT 2001


On Fri, 3 Aug 2001, Steven Berukoff wrote:

>
> Hi all,
>
> We're looking at using Condor on our cluster for its checkpointing and
> job-handling abilities, as the routines we're running don't require much
> in the way of internode communication.  We have an NFS file server which
> contains our entire fileset (something on the order of 100s of GB), a
> master node for the cluster, and several nodes.  Outside of Condor, our
> algorithm requires that each of the nodes get some subset of the data (on
> the order of perhaps 100MB) and runs the analysis code on this data
> segment.  Obviously, each node must gather its share of the data from the
> NFS file server; this of course requires a large amount of network
> traffic.
>
> Does anyone have a clever idea about scheduling the data
> transfers so that it is accomplished in a reasonable fashion?  We were
> hoping Condor provides this functionality to some degree, but it doesn't
> seem to.

Seems like a trivial thing to code in, e.g., a perl or bash script on the
server without even using NFS (which will definitely slow your code down
if you are crunching through 100 MB).

Let's assume that your nodes each have a working directory
/mybigdatadir/nodeXX in NFS space where they create a "ready4newslice"
timestamp file when they start a job and that they also have a /tmp
directory on a local disk space or RAMDISK big enough to hold the 100 MB
slice.
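
For concreteness, a one-time server-side setup might look something like
the sketch below (illustrative bash, assuming node names node1, node2,
node3 and the paths above; seeding the initial ready4newslice flag from
the server is just one way to bootstrap, since the nodes can equally well
create it themselves when they start):

# one-time setup (sketch): create the per-node NFS work directories
# and seed each node's initial "ready for a slice" flag
for node in node1 node2 node3; do
    mkdir -p /mybigdatadir/$node
    touch /mybigdatadir/$node/ready4newslice
done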

Then run a server-side script something like the following (in
pseudocode -- convert to your favorite scripting language):

cd /mybigdatadir
# ...the NFS shared working directory described above, containing one
# subdirectory per node: /mybigdatadir/node1, /mybigdatadir/node2,
# /mybigdatadir/node3...

nodelist = node*
#..make an array or list of all the node names however you like.

# "createslice" just builds the next slice to be distributed, however
# you do that, and returns true on success; once there are no slices
# left to run it returns false.
while(createslice mybigfileset newslice){

  foreach node (nodelist) {

     if (exists node/ready4newslice) {

        scp newslice node:/tmp/newslice
        # copy the new slice to the node ("node" here stands for the
        # node's hostname).  scp, rcp, even trigger a script to NFS
        # copy (yuk).

        rm newslice
        # Or not -- if you want to get fancy, save the slices
        #   mv newslice newslice.node
        # and only remove them when the node reports that it has
        # successfully finished a slice.  Or whatever -- make it
        # robust to the degree that you don't want to have to go back
        # and run makeup jobs by hand.

        mv node/ready4newslice node/newsliceready
        # This basically sets a flag for the node that it's ok to
        # proceed (see below).

        break
     }

  }
  sleep awhile
  # ...depending on how long it takes to create a slice and the time
  # granularity you expect slices to be completed with.  Polling too
  # fast is "expensive", too slow and your nodes might end up waiting
  # idle for a slice.
}

mailamessagetome("Gee, all out of work")

On the nodes (e.g. nodeXX), run the job from a wrapper/loop script like:

while(1){
  # loop forever (or add something to check for "all done").

  if(-f /mybigdatadir/nodeXX/newsliceready){
    mv /tmp/newslice /tmp/runslice
    # ...so the next slice pushed by the server cannot overwrite it
    # while we are running.

    mv /mybigdatadir/nodeXX/newsliceready /mybigdatadir/nodeXX/ready4newslice
    # ...the next slice SHOULD then arrive while we are running, so it
    # is ready to go when we are done.

    runthejob /tmp/runslice
    # so off we go, working from local disk or even ramdisk if you have
    # enough memory and the inclination.
  } else {
    mailamessagetome("Whoa!  nodeXX didn't get a slice in time!")
  }
  sleep awhile	# Or get hammered with messages and burn mindless CPU

}
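
Again purely as an illustration, the same wrapper in bash might look
like this (runthejob stands in for your real analysis code, the mail
address is a placeholder, and the sleep interval is arbitrary):

#!/bin/bash
# Sketch of the node-side wrapper loop for node "nodeXX".
FLAGDIR=/mybigdatadir/nodeXX

while true; do
    if [ -f "$FLAGDIR/newsliceready" ]; then
        # rename the slice so the server's next copy cannot clobber it
        mv /tmp/newslice /tmp/runslice
        # tell the server it may push the next slice while we crunch
        mv "$FLAGDIR/newsliceready" "$FLAGDIR/ready4newslice"
        # run the analysis (hypothetical command) from local disk
        runthejob /tmp/runslice
    else
        echo "nodeXX did not get a slice in time" | \
            mail -s "idle node" yourself@example.com
    fi
    sleep 60
done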

This is just intended to give you the idea and hence is sloppy -- you
should really add some checks and warnings (as I learned the hard way in
the old days, when I ran a lot of jobs in perpetuity distributed all over
the place, including places that didn't share an NFS mount with my
primary host).  With e.g. expect (in perl or Tcl) you can run this sort
of script-managed job distribution over a network of systems you have
access to totally automagically, even if you would otherwise have to log
in to the systems by hand with a password to start or manage the jobs.

There are quite probably tools nowadays that will do a lot of this for
you, but there is something to be said for scripting it yourself and
knowing exactly what is going on.  If you know perl or bash fairly well,
implementing the general idea above and testing/debugging it should be a
morning's work, no more, and it gives you the opportunity to add all sorts
of run checking and logging as you go along.  Basically you are creating
a very simple state engine where the jobs run according to the state
left by the server and record their own state for the server to check in
turn.  The lockfiles just keep things from happening until it is
time/safe for them to happen.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
