[Beowulf] copying big files

Sun Aug 10 21:25:27 PDT 2008

On Fri, Aug 08, 2008 at 05:52:40PM +0200, Jan Heichler wrote:
>    Hello Henning,
>    HF> Hi everybody,
> 
>    HF> One needs basically a daemon which handles copying requests and
>    HF> establishes the connection to next node in the chain.
> 
>    Why a daemon? Just MPI that starts up the processes on the remote nodes
>    during program startup. Advantage is that you can use any
>    high-speed-interconnect which you have an MPI for.
> 
>    HF> Has somebody written such a tool?

  -?- Is this an administrative tool or an MPI application need?
  -?- If MPI, is this the executable file itself or a common data file?

Administrative tools could leverage torrent ideas, or scp or rsync
trees, with modest scripting to distribute and check the file.
I have seen a handful of solutions: good and bad, slow and fast,
reliable and fragile.  QLogic has a tool, "scpall", in their
Fast fabric tools to address this, and Rocks has additional tools.

MPI is interesting because of its power and because most MPI
clusters have VERY FAST links available to MPI.  However, it can be unclear
where the original file resides, where it will go, and how to manage
multi-core complications, file naming conventions, and cleanup.

Assuming that the file is visible to one rank and only needs to be
deposited on the nodes involved in the MPI job, a standard MPI library
using MPI data transfers could be used to move the data.  The internals
of the library could use trees, rings or tree rings to move the data;
who cares, once a clean interface is established.
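
To make the "trees" option concrete, here is a minimal sketch of a
k-ary relay plan rooted at rank 0.  It only computes who would send to
whom; it does no MPI calls.  The function names and the heap-style rank
numbering are my own assumptions for illustration, and a real MPI
library would normally pick its own algorithm behind MPI_Bcast.

```python
def children(rank, size, fanout=2):
    """Ranks that `rank` would forward the file to (hypothetical numbering)."""
    first = rank * fanout + 1
    return [c for c in range(first, first + fanout) if c < size]


def parent(rank, fanout=2):
    """Rank that `rank` would receive from; the root (rank 0) has none."""
    return None if rank == 0 else (rank - 1) // fanout


def relay_plan(size, fanout=2):
    """Return a list of levels, each a list of (sender, receiver) pairs.

    A level can only start once the previous level has finished,
    because its senders must already hold the file.
    """
    levels = []
    frontier = [0]  # ranks that received the file in the previous level
    while frontier:
        sends = [(s, c) for s in frontier for c in children(s, size, fanout)]
        if not sends:
            break
        levels.append(sends)
        frontier = [c for _, c in sends]
    return levels
```

With one rank per node, relay_plan(1000, 8) has 4 levels, so the file
crosses at most 4 hops to reach any node.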

One classic MPI problem is the user launching "mpirun ./justbuilt.exe"
on his local system when ./ on the compute nodes does not have a copy.
Batch systems could help here...

If the problem is that a data file must be predistributed, say for a dusty
deck that will only open a single fixed path to data, then the batch
system may be the ideal place to manage the transfer in a %pre
launch task.   In such a case the administrative tool could be leveraged,
but again it must be multi-core safe and aware.
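
One way to read "multi core safe" is that several ranks on the same
node may try to stage the same file at once, and none of them should
ever open a half-written copy.  A minimal sketch, assuming the target
directory is node-local and that "same size" is a good enough test for
"already staged" (both are my assumptions; stage_file is a hypothetical
helper, not part of any batch system):

```python
import os
import shutil
import tempfile


def stage_file(src, dst):
    """Copy src to dst without ever exposing a partial dst.

    Returns True if this caller did the copy, False if dst was
    already in place.
    """
    if os.path.exists(dst) and os.path.getsize(dst) == os.path.getsize(src):
        return False
    dst_dir = os.path.dirname(os.path.abspath(dst))
    os.makedirs(dst_dir, exist_ok=True)
    # Each caller writes to its own temporary file in the same directory.
    fd, tmp = tempfile.mkstemp(dir=dst_dir, prefix=".stage-")
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)
        # Rename is atomic within one filesystem, so readers see either
        # no file or the complete file.
        os.replace(tmp, dst)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    return True
```

If two ranks race, both may copy, but the result is still a complete
file.  A lock file would avoid the duplicate work at the cost of more
cleanup cases.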

Another problem might be that the executable and libraries need to be
predistributed so that execution start up and paging are improved.  On a
1000 node cluster running an 8000 rank MPI program that lives on a taxed
NFS resource, the 8000 startup reads could improve 1000 fold by executing
something like /localscratch/my.exe, IFF the %pre could distribute it in
a x8 deep tree in *8 time.   This can be important for start up time.
The batch system could quickly consult a lookup table for the N ranks
to see if ./my.exe is on NFS and should be predistributed and launched
as /local2nodeCache/my-unique-something.exe.   Policy on some large
clusters is such that this issue has a forced solution that only permits
the launching of /opt/blessedbymanagement/bin
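
The "my-unique-something.exe" part could be as simple as a name derived
from the file contents, so that a rebuilt executable never collides
with a stale cached copy.  A minimal sketch, where cached_name and the
choice of SHA-256 are my assumptions and the cache directory is just
the example path from above:

```python
import hashlib
import os


def cached_name(exe_path, cache_dir="/localscratch"):
    """Return a node-local path whose name depends on the file contents."""
    h = hashlib.sha256()
    with open(exe_path, "rb") as f:
        # Hash in 1 MiB chunks so large executables are not read at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    stem, ext = os.path.splitext(os.path.basename(exe_path))
    return os.path.join(cache_dir, f"{stem}-{h.hexdigest()[:16]}{ext}")
```

The %pre task would copy ./my.exe to that path on each node and the
launcher would run the cached path instead.  Deciding whether ./my.exe
really lives on NFS is site specific and is left out here.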

Another permutation is large (sparse) data files where each
rank is responsible (read and/or write) for a region that is a
function(of-rank)....  Such applications might come from developers
trained on large SMP systems with many IO channels, like an old but big
SGI Origin system.
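
The simplest function(of-rank) is an even split of the file into
contiguous byte ranges.  A minimal sketch, assuming one contiguous
region per rank (real applications may use strided or overlapping
layouts; rank_region is a hypothetical name):

```python
def rank_region(file_size, nranks, rank):
    """Return (offset, length) in bytes of the region owned by `rank`.

    The first (file_size % nranks) ranks get one extra byte, so the
    regions tile the file exactly with no gaps or overlap.
    """
    base, extra = divmod(file_size, nranks)
    offset = rank * base + min(rank, extra)
    length = base + (1 if rank < extra else 0)
    return offset, length
```

Each rank would then seek to its offset and read or write only its own
length, for example with MPI_File_read_at if MPI-IO is in use.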

Next is the topic of size.   With some applications the data sets
are vast and cannot or should not be distributed.   In the 8000 rank
case, how is it possible to know what portion of a data file
is exclusive input or output for a rank?   Smart parallel file systems
can improve things here.

I am sure I missed some topics....
Others should add to the checklist.

Summary: one size does not (currently) fit all.
What problem is being addressed?

-- 
	T o m  M i t c h e l l 
	Got a great hat... now what.