[Beowulf] Torrents for HPC

Mon Jun 11 11:02:43 PDT 2012

On Mon, Jun 11, 2012 at 01:49:23PM -0400, Joshua Baker-LePain wrote:
>On Fri, 8 Jun 2012 at 5:06pm, Bill Broadley wrote
>
>> Do you think it's worth bundling up for others to use?
>>
>> This is how it works:
>> 1) User runs publish <directory> <name> before they start submitting
>>    jobs.
>> 2) The publish command makes a torrent of that directory and starts
>>    seeding that torrent.
>> 3) The user submits an arbitrary number of jobs that need that
>>    directory.  Inside the job they "$ subscribe <name>"
>> 4) The subscribe command launches one torrent client per node (not per
>>    job) and blocks until the directory is completely downloaded
>> 5) /scratch/<user>/<name> has the user's data
>>
>> Not nearly as convenient as having a fast parallel filesystem, but seems
>> potentially useful for those who have large read only datasets, GigE and
>> NFS.
>>
>> Thoughts?
>
>I would definitely be interested in a tool like this.  Our situation is
>about as you describe -- we don't have the budget or workload to justify
>any interconnect higher-end than GigE, but have folks who pound our
>central storage to get at DBs stored there.

I looked into doing something like this on a 50-node cluster to
synchronize several hundred GB of semi-static data used in /scratch.
I found that building the torrent files--calculating checksums and
such--was *far* more time-consuming than the actual file
distribution.  This is on top of the rather severe IO hit on the "seed"
box as well.

I fought with it for a while, but came to the conclusion that *for
_this_ data*, and how quickly it changed, torrents weren't the way to
go--largely because of the cost of creating the torrent in the first
place.
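
To give a feel for where that time goes: a (v1) torrent stores a SHA-1
hash for every fixed-size piece of the data, so creating one means
reading every byte from disk on the seed box.  A rough sketch, assuming
a single file and a 256 KiB piece size (both my choices for
illustration; real torrent tools hash across file boundaries and pick
their own piece size):

```python
import hashlib

def piece_hashes(path, piece_size=256 * 1024):
    """Return the SHA-1 digest of each fixed-size piece of one file.

    piece_size is an assumed value; the whole file is read once.
    """
    digests = []
    with open(path, "rb") as f:
        while True:
            piece = f.read(piece_size)
            if not piece:
                break  # end of file
            digests.append(hashlib.sha1(piece).digest())
    return digests
```

Any change to the data means redoing this pass, which is what hurt with
a few hundred GB that changed regularly.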

However, I do think that similar systems could be very useful, if
perhaps a bit less strict in their tests.  The peer-to-peer model is
useful, and (in some cases) a simple size/date check could be enough to
determine when to (re)copy a file.
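
By "size/date check" I mean something along these lines, similar in
spirit to rsync's default quick check.  This is only a sketch: the
function name is made up, and it assumes whatever does the copying
preserves modification times (otherwise every file looks changed):

```python
import os

def needs_copy(src, dst):
    """Return True if dst is missing or differs from src in size or mtime.

    No file contents are read, only metadata.  mtimes are compared
    to whole seconds to tolerate filesystems with coarse timestamps.
    """
    try:
        d = os.stat(dst)
    except FileNotFoundError:
        return True  # nothing there yet, so copy
    s = os.stat(src)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)
```

It will miss a file that was modified without changing size or mtime,
which is the trade-off for not checksumming everything.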

One thing torrents don't handle is file deletions, which opens up a
few new problems.
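
For example, each node needs some way to find files that are still on
local disk but are no longer part of the published set.  A minimal
sketch, assuming the publisher hands out a list of relative paths (the
"manifest" here is hypothetical, not something a torrent client
provides for this purpose):

```python
import os

def stale_files(root, manifest):
    """Return relative paths under root that are not listed in manifest."""
    wanted = set(manifest)
    stale = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if rel not in wanted:
                stale.append(rel)
    return sorted(stale)
```

Deciding when it is safe to actually remove those files, while jobs may
still be reading them, is a separate question.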

Eventually, I moved to a distributed rsync tree, which worked for a
while but was slightly fragile.  In the end, we dropped the whole
thing when we purchased a sufficiently fast storage system.

-- 
Jesse Becker
NHGRI Linux support (Digicon Contractor)