[Beowulf] Torrents for HPC
Bill Broadley
bill at cse.ucdavis.edu
Tue Jun 12 15:42:47 PDT 2012
Many thanks for the online and offline feedback.
I've been reviewing the mentioned alternatives. From what I can tell
none of them allow nodes to join/leave at random. Our problem is that a
user might submit 500-50,000 jobs that depend on a particular dataset
and have a variable number of jobs/nodes running at any given time. So
ideally each node that a job lands on would do something like:
1) Is this node subscribed to this dataset? If not start a client.
2) Is the dataset completely downloaded? If not wait.
Because of the node churn we didn't want the send <file/dir> <list of
nodes> approach.
We also wanted to handle multiple file transfers of multiple directories
for multiple users at once. From what I tell, most (all?) other
approaches assume a mostly idle network and don't robustly handle cases
where 1/3rd of the nodes have highly contended links.
Because we are using the links for MPI, NFS, and torrents we didn't want
to use an approach that wasn't robust with highly variable per node
bandwidth. Any comments on how well the various alternatives work with
a busy network? Seems like any tree based approach would have problems.
As far as the torrent creation process. My small 5 disk RAID manages
300-400MB/sec and manages around 80% of that for creating torrents. It
looks single threaded, parallel friendly, and easy to parallelize. But
from what I can tell torrent creation is I/O limited at least for us. I
already have some parallel checksumming code around for another project,
I could likely tweak it to create torrents if people out there thing
this is a real bottleneck. I like the torrent behavior of guaranteed
file integrity and self-healing files.
Using MPI does make quite a bit of sense for clusters with high speed
interconnects. Although I suspect that being network bound for IO is
less of a problem. I'd consider it though, I do have sdr/ddr/qdr
clusters around, but so far (knock on wood) not IO limited. I've done a
fair bit of MPI programming, but I'm not sure it's easy/possible to have
nodes dynamically join/leave. Worst case I guess you could launch a
thread/process for each pair of peers that wanted to trade blocks and
still use TCP for swapping metadata about what peers to connect to and
block to trade.
More information about the Beowulf
mailing list