[Beowulf] Doing i/o at a small cluster
Ellis H. Wilson III
ellis at cse.psu.edu
Sat Aug 18 06:19:20 PDT 2012
On 08/17/2012 12:04 PM, Vincent Diepeveen wrote:
> Yes, I realize that. In principle you're looking at around 1050 files
> that effectively get generated, and to generate each file half a dozen
> huge files get created.
>
> Now in theory generating them is embarrassingly parallel, except that
> generating one large set of EGTBs requires a working set of around 3TB.
>
> Now comes the Achilles heel. At the start of the generation it needs
> really fast access to some earlier generated sets; the initial
> generation of the conversion bitmap needs to access other EGTBs,
> especially because pieces can promote. Luckily this is a single pass,
> but it's a slow and intensive one.
>
> In such a case, access over the network is important.
>
> So there is huge locality except for one pass. The really fast
> generation that hammers the drives, reading and writing quickly, can
> be done entirely locally. The first pass, which generates a single
> 'exchange bitmap', needs to look up EGTBs generated earlier. For
> example, if we have the EGTB KQRP KRP then it has to look up the much
> larger EGTB that holds KQRB KRP, and a few others.
>
> So the dependence on the other nodes' drives is limited to, say, a few
> percent of the total I/O getting done.
>
> Since we're talking about complex file management here, it's difficult
> to do this by hand.
>
> In other words, efficient usage of the available hard drive space is
> important.
Without knowing more about your workload (can you avoid
read-modify-writes? how small or large are your individual I/Os, and
can you adjust them?) this really does seem ideal for Hadoop...
I've dug into the code and have a reasonably firm understanding of the
nitty-gritty of PVFS 1 and 2, PanFS, NFS, and HDFS, and I can promise
you that this, on the surface at least, appears well suited to the
MapReduce paradigm.
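To make the shape concrete, here is a rough sketch of what I have in
mind -- a map-only job (zero reducers) where each input line names one
tablebase to build and the mapper just drives your existing generator
against node-local disk. The generateLocally() call is a hypothetical
stand-in for your code, not anything Hadoop gives you:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // One map task per tablebase; Hadoop spreads the tasks over the nodes
  // and the heavy generation runs against whatever disk is local there.
  public class EgtbGenerateMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {
      @Override
      protected void map(LongWritable offset, Text egtbName, Context ctx)
              throws IOException, InterruptedException {
          // e.g. egtbName = "KQRP KRP": run the generator on local
          // scratch space, then copy the finished file back into HDFS.
          // generateLocally(egtbName.toString());  // hypothetical helper
          ctx.write(egtbName, NullWritable.get());  // record completion
      }
  }

With the reducer count set to zero the map output goes straight to the
job's output directory in HDFS, so the all-local generation pass barely
touches the network beyond the final copy-out.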
I understand your hesitation about Java (not the security concerns,
since this cluster should not be exposed directly to the internet
anyhow... but that's beside the point), but let me put it to you this
way: I've got a cluster with 50 individual drives, all in 50 separate,
just dual-core machines, connected via plain old 1Gb Ethernet, and I
can push close to 4.5GB/s when doing embarrassingly parallel, mainly
sequential workloads on them. Obviously, if I start to do anything more
random or less embarrassingly parallel, this number is halved if not
worse. If your work allows you to operate in the environment that
MapReduce and HDFS allow and encourage, I would strongly suggest you
pursue that route. That's the only distributed environment I can think
of off the top of my head that can properly handle (out of the box) the
division you strike between local and remote accesses.
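For the one pass that does have to reach across nodes, the HDFS client
API looks roughly like the following (the path and offset are made up
purely for illustration): any node can open a previously generated
table by name and seek into it, and HDFS serves the blocks from
wherever they live, reading from local disk when a block happens to sit
on the same node.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RemoteEgtbLookup {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          // Hypothetical path/offset: a lookup into a bigger, earlier
          // generated EGTB during the exchange-bitmap pass.
          FSDataInputStream in = fs.open(new Path("/egtb/KQRB-KRP.tb"));
          byte[] buf = new byte[64 * 1024];
          in.seek(1234L * 1024 * 1024);
          in.readFully(buf, 0, buf.length);
          in.close();
      }
  }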
> Compress it with 7-zip and move it away indeed. It'll compress to 3TB
As a side note -- Hadoop provides support for compression on transfers
that might help you immensely. You can pick from a few codecs, but LZO
tends to give the best speed/compression trade-off for my workloads.
This could really help you when you need to do that one pass where all
nodes are exchanging data with each other.
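In case it helps, turning that on is just a couple of job settings. A
minimal sketch, assuming the separate hadoop-lzo package is installed
on every node (the property and class names below are the Hadoop 1.x
ones, so double-check them against your version):

  import org.apache.hadoop.mapred.JobConf;

  public class LzoShuffleSetup {
      public static JobConf configure(JobConf conf) {
          // Compress intermediate (map -> reduce) data so the all-to-all
          // exchange pass pushes less over the 1Gb links.
          conf.setCompressMapOutput(true);
          // LzoCodec ships in hadoop-lzo, not in stock Hadoop.
          conf.set("mapred.map.output.compression.codec",
                   "com.hadoop.compression.lzo.LzoCodec");
          return conf;
      }
  }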
Best of luck!
ellis