[Beowulf] copying big files (Henning Fehrmann)

Mon Aug 11 09:45:02 PDT 2008

Hi,

I found some time to play with dolly and nettee. They do what I was
looking for. Thank you for the hints.

> > I will say that my dream would be for something like dolly to get some
> > sort of transfer recovery mechanism, though I realize that would be quite
> > difficult in such a topology.
> 
> nettee has some failover and continuation capabilities at different
> points - but not what I think you want. The development version has a
> few extra modes for cases where data is being merged, but that isn't
> relevant to this discussion. When setting up the initial chain nettee
> can connect to an alternate node (from a list of failovers) if the
> target node will not answer.  It also has the ability to keep going if
> the local disk becomes unwritable, and it can continue a download on a
> chain down to the node above the point of failure. 
> 
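
The failover step at chain setup described above can be sketched roughly
as follows. This is not nettee's actual code, only an illustration in
Python of "try the target, then the alternates"; the function name
connect_with_failover, the candidate list format and the timeout value
are all assumptions made for the example.

```python
import socket

def connect_with_failover(candidates, timeout=5.0):
    """Return (sock, (host, port)) for the first candidate that answers.

    candidates is an ordered list of (host, port) pairs: the preferred
    target first, then the failover nodes. (Hypothetical helper.)
    """
    last_error = None
    for host, port in candidates:
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
            return sock, (host, port)
        except OSError as exc:
            # Node did not answer (refused, timed out, unreachable): try the next.
            last_error = exc
    raise ConnectionError(f"no candidate answered: {last_error}")
```

Note that this only covers the moment the chain is built; it does nothing
for a node that dies in the middle of a transfer.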
> However, nettee cannot at present rewire around a failed node to
> continue a download to the node(s) below it.  That would indeed be quite
> difficult, since one could have a situation like this:
> 
>   A -> B  (A knows it has sent 100MB)
>   B -> C  (B knows it has sent  98MB, then it blows up)
>   C       (C knows it has received 98 MB)
> 
> A and C will eventually figure out that B has died, and they could
> conceivably negotiate a new connection, but A may no longer have the
> missing 2 MB (it might have been sent out a pipe, processed, and not
> stored in the raw state anywhere.)  On the other hand, the development
> version uses ring buffers, and one could set those to be very large,
> enabling a certain level of "redo" from A.  So if C comes back and says
> "I only have 98MB" A can see if it has the missing parts and go on if it
> does.  It still might not though.  If B has stalled for long enough
> the ring buffer on A may have completely filled from the previous node,
> overwriting the data needed to recover.  I guess it would be possible to
> implement a "safety region" in the ring buffer which could not be
> overwritten.
> 
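
To make the "safety region" idea concrete, here is a minimal sketch in
Python. It is not taken from nettee; the class name ReplayRing and its
methods are hypothetical, and it assumes the sender receives some form of
acknowledgement telling it how many bytes are safely stored downstream.
Bytes that are not yet acknowledged are never overwritten, so a write may
accept less than it was given and the sender has to stall.

```python
class ReplayRing:
    """Ring buffer that refuses to overwrite unacknowledged bytes."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.written = 0  # absolute offset: total bytes accepted so far
        self.acked = 0    # everything below this offset is confirmed downstream

    def write(self, data):
        """Store as much of data as fits; return the number of bytes accepted."""
        free = self.capacity - (self.written - self.acked)
        n = min(len(data), free)
        for i in range(n):
            self.buf[(self.written + i) % self.capacity] = data[i]
        self.written += n
        return n

    def ack(self, offset):
        """Record that downstream has safely received all bytes below offset."""
        if not self.acked <= offset <= self.written:
            raise ValueError("ack offset out of range")
        self.acked = offset

    def replay_from(self, offset):
        """Return bytes [offset, written) if they are still in the buffer."""
        oldest = max(0, self.written - self.capacity)
        if offset < oldest or offset > self.written:
            raise LookupError("requested data is no longer buffered")
        return bytes(self.buf[i % self.capacity]
                     for i in range(offset, self.written))


ring = ReplayRing(8)
ring.write(b"abcdefgh")      # buffer is now full
ring.ack(6)                  # downstream confirms the first 6 bytes
ring.write(b"ijkl")          # fits: only acknowledged bytes are overwritten
accepted = ring.write(b"mnop")
print(accepted)              # 2: the rest would overwrite unacked data
print(ring.replay_from(6))   # b'ghijklmn'
```

The price is visible in the last write: once the protected region fills
the buffer, the sender blocks until an acknowledgement arrives, so a
stalled node still stalls everything above it.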

I successfully distributed a 10G file to 50 nodes. The rate was 140Mb/s
with nettee and a bit slower with dolly. I guess the difference was due
to a busy node somewhere in the chain.
Increasing the number of clients to 100 failed in both cases.

For nettee I got:
nettee: fatal error writing to child: Connection reset by peer

For dolly:
Sent MB: 40, MB/s: 66.752, Current MB/s: 35.710
movebytes read/write: Connection reset by peer
         errno = 104

I will do more systematic tests over the next few days.
David Mathog, are you interested in bug reports?

Cheers,
Henning Fehrmann