[Beowulf] Doing i/o at a small cluster

Ellis H. Wilson III ellis at cse.psu.edu
Sat Aug 18 06:19:20 PDT 2012


On 08/17/2012 12:04 PM, Vincent Diepeveen wrote:
> Yes, I realize that. In principle you're looking at roughly 1,050 files
> that effectively get generated, and to generate each file half a dozen
> huge files get created.
>
> Now, in theory generating them is embarrassingly parallel, except that
> generating one large set of EGTBs requires a working set of around 3TB.
>
> Now comes the Achilles' heel. At the start of the generation it needs
> very quick access to some earlier generated sets; the initial generation
> of the conversion bitmap needs to access other EGTBs, especially because
> pieces can promote. Luckily this is a single pass, but it's a slow and
> intensive pass.
>
> In that case, access over the network is important.
>
> So there is huge locality except for one pass. The really fast generation
> that hammers the drives, reading and writing quickly, can be done entirely
> locally. The first pass, which generates a single 'exchange bitmap', needs
> to look up EGTBs generated earlier. For example, if we have the EGTB
> KQRP KRP, then it has to look up the much larger EGTB that holds
> KQRB KRP, and a few others.
>
> So the cohesion to the other nodes' drives is limited to, say, a few
> percent of the total I/O getting done.
>
> Since we're talking about complex file management here, it's difficult
> to do this by hand.
>
> In other words, efficient usage of the available hard drive space is
> important.

Without knowing more about your workload (can you avoid 
read-modify-writes? How small or large are your individual I/Os, and can 
you adjust them?), this really seems ideal for Hadoop...

I've dug into the code and have a reasonably firm understanding of the 
nitty-gritty of PVFS 1 and 2, PanFS, NFS, and HDFS, and I can promise 
you that this, on the surface at least, appears to be well suited to the 
MapReduce paradigm.
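
To make that concrete, here is a minimal sketch of what a map-only job
could look like, where each map task picks up the name of one tablebase
and runs the generator against the node's local disks. The class names
and generateLocally() below are hypothetical stand-ins for your own
generator, not anything Hadoop provides:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class EgtbGenerate {

        // Each input line names one tablebase to build (e.g. "KQRPvKRP").
        // The heavy generation runs on whatever node gets the task,
        // using that node's local scratch disks.
        public static class GenerateMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String tablebase = value.toString().trim();
                String resultPath = generateLocally(tablebase);
                context.write(new Text(tablebase), new Text(resultPath));
            }

            // Placeholder for your own native generator.
            private String generateLocally(String tablebase) {
                return "/local/scratch/" + tablebase;
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "egtb-generate");
            job.setJarByClass(EgtbGenerate.class);
            job.setMapperClass(GenerateMapper.class);
            job.setNumReduceTasks(0);   // map-only: no shuffle needed
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }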

I understand your hesitation about Java (not the security concerns, since 
this cluster should not be exposed directly to the internet anyhow... but 
that's beside the point), but let me put it to you this way: I've got a 
cluster with 50 individual drives in 50 separate, just dual-core 
machines, connected via plain old 1Gb Ethernet, and I can push close to 
4.5GB/s when doing embarrassingly parallel, mainly sequential workloads 
with them. Obviously, if I start doing anything more random or less 
embarrassingly parallel, that number is halved if not worse. If your work 
allows you to operate in the environment that MapReduce and HDFS allow 
and encourage, I would strongly suggest you pursue that route. It's the 
only distributed environment I can think of off the top of my head that 
properly handles, out of the box, the division you strike between local 
and remote accesses.
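
If you want to see how well that local/remote split would hold up on
HDFS, you can ask the filesystem which hosts actually hold the blocks of
a file; the MapReduce scheduler uses the same information to place map
tasks near their data. A short sketch (the /egtb path is made up):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockHosts {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Hypothetical path to one previously generated EGTB file.
            Path egtb = new Path("/egtb/KQRBvKRP.bin");
            FileStatus status = fs.getFileStatus(egtb);
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            // Print which datanodes hold each block of the file.
            for (BlockLocation block : blocks) {
                System.out.println(block.getOffset() + "-"
                        + (block.getOffset() + block.getLength()) + ": "
                        + Arrays.toString(block.getHosts()));
            }
        }
    }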

> Compress it with 7-zip and move it away indeed. It'll compress to 3TB

As a side note -- Hadoop provides support for compression on transfers, 
which might help you immensely. You can pick from a few codecs, but LZO 
tends to give the best speed/compression trade-off for my workloads. This 
could really help you with that one pass where all nodes are exchanging 
data with each other.
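
For what it's worth, turning on compression of the intermediate map
output is just a couple of configuration knobs. The property names below
are the classic 1.x ones, and the LzoCodec class comes from the separate
hadoop-lzo package, so treat this as a sketch rather than a copy-paste
recipe:

    import org.apache.hadoop.conf.Configuration;

    public class CompressionSettings {
        public static Configuration withLzoMapOutput() {
            Configuration conf = new Configuration();
            // Compress the intermediate map output that crosses the
            // network during the shuffle; this is where LZO's speed
            // pays off.
            conf.setBoolean("mapred.compress.map.output", true);
            conf.set("mapred.map.output.compression.codec",
                     "com.hadoop.compression.lzo.LzoCodec");
            return conf;
        }
    }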

Best of luck!

ellis


