[Beowulf] Doing i/o at a small cluster

Ellis H. Wilson III ellis at cse.psu.edu
Fri Aug 17 07:42:23 PDT 2012

On 08/17/12 08:03, Vincent Diepeveen wrote:
> hi,
> Which free or very cheap distributed file system choices do i have
> for a 8 node cluster that has QDR infiniband (mellanox)?
> Each node could have a few harddrives. Up to 8 or so SATA2. Could
> also use some raid cards.

Lots of choices, but are you talking about putting a bunch of disks in 
all those PCs or having one I/O server?  The latter is the classic 
solution but there are ways to do the former.

Short answer is there are complicated ways to fling your hdds into 
distributed machines using PVFS and get good performance provided you 
are okay with those non-posix semantics and guarantees.  There are also 
ways to get decent performance from the Hadoop Distributed File System, 
which can handle a distributed set of nodes and internal HDDs well, but 
for a /constrained set of applications./  Based on your previous posts 
about GPUs and whatnot, I'm going to assume you will have little to zero 
interest in Hadoop.  Last, there's a new NFS version out (pNFS, or NFS 
v4.1) that you can probably use to great impact with proper tuning.  No 
comments on tuning it however, as I haven't yet tried myself.  That may 
be your best out of the box solution.

Also, I assume you're talking about QDR 1X here, so just 8Gb/s per node. 
Correct me if that's wrong.

> And i'm investigating what i need.
> I'm investigating to generate the 7 men EGTBs at the cluster. This is
> a big challenge.

For anyone who doesn't know (probably many who aren't into chess, I had 
to look this up myself), EGTB is end game table bases, and more info is 
available at:

Basically it's just a giant dump of exhaustive moves for N men left on 
the board.

> To generate it is high i/o load. I'm looking at around a 4 GB/s i/o
> from which a tad more than
> 1GB/s is write and a tad less than 3GB/s is readspeed from harddrives.
> This for 3+ months nonstop. Provided the CPU's can keep up with that.
> Otherwise a few months more.
> This 4GB/s i/o is aggregated speed.

I would LOVE to hear what Joe has to say on this, but going out on a 
limb here, it will be almost impossible to get that much out of your 
HDDs with 8 nodes without serious planning and an extremely narrow 
use-case.  I assume you are talking about putting drives in each node at 
this point, because with just QDR you cannot feed aggregate 4GB/s 
without bonding from one node.

We need to know more about generating this tablebase -- I can only 
assume you are planning to do analyses on it after you generate all 
possible combinations, right?  We need to know more about how that 
follow-up analysis can be divided before commenting on possible storage 
solutions.  If everything is totally embarrassingly parallel you're in a 
good spot to not bother with a parallel filesystem.  In that case you 
just might be able to squeeze 4GB/s out of your drives.

But with all the nodes accessing all the disks at once, hitting 4GB/s 
with just strung together FOSS software is really tough for anything but 
the most basic and most embarrassingly parallel stuff.  It requires 
serious tuning over months or buying a product that has already done 
this (e.g. a solution like Joe's company Scalable Informatics makes or 
Panasas, the company I work for, makes).  People always love to say, 
"Oh, that's 100MB/s per drive!  So with 64 drives I should be able to 
get 6.4GB/s!  Yea!"  Sadly, that's really only the case when these 
drives are accessed completely sequentially and completely separately 
(i.e. not put together into a distributed filesystem).

> What raid system you'd recommend here?

Uh, you looking for software or hardware or object RAID?

> A problem is the write speed + read speed i need. From what i
> understand at the edges of drives the speed is
> roughly 133MB/s SATA2 moving down to a 33MB/s at the innersides.
> Is that roughly correct?

I hate this as much as anybody, but........ It Depends (TM).
You talking plain-jane "dd".  Sure, that might be reasonable for certain 

> Of course there will be many solutions. I could use some raid cards
> or i could equip each node with some drives.
> Raid card is probably sata-3 nowadays. Didn't check speeds there.
> Total storage is some dozen to a few dozens of terabytes.
> Does the filesystem automatically optimize for writing at the edges
> instead of starting at the innerside?
> which 'raid' level would you recommend for this if any is appropriate
> at all :)

Again, depends on RAID card and whatnot.  Some do, some don't.

> How many harddrives would i need? What failure rate can i expect with
> modern SATA drives there?
> I had several fail at a raid0+1 system before when generating some
> EGTBs some years ago.

Yup, things will break especially during the shakeout (first few days or 
weeks).  I assume you're buying commodity drives here, not enterprise, 
so you should prepare for upwards of, /after the shakeout/, maybe 4-8 of 
your drives to fail or start throwing SMART errors in the first year 
(ball-parking it here based solely on experience).  Rebuilds will suck 
for you with lots of data unless you have really thought that out 
(typically limited to speed of a single disk -- therefore 2TB drive 
rebuilding itself at 50MB/s (that's best case scenario) is like 11 
hours.  I hope you haven't bought all your drives from the same batch 
from the same manufacturer as well -- that often results in very similar 
failure times (i.e. concurrent failures in a day).  Very non-uniform.

> Note there is more questions. Like which buffer size i must read/
> write. Most files get streamed.
>   From 2 files that i do reading from, i read big blocks from a random
> spot in that file. Each file is
> a couple of hundreds of gigabyte.
> I used to grab chunks of 64KB from each file, but don't see how to
> get to gigabytes a second i/o with
> todays hardware that manner.
> Am considering now to read blocks of 10MB. Which size will get me
> there to the maximum bandwidth the i/o
> can deliver?

I actually do wonder if Hadoop won't work for you.  This sounds like a 
very Hadoop-like workload, assuming you are OK with write-once read-many 
semantics.  But I need to know way more about what you want to do with 
the data afterwards.  Moving data off of HDFS sucks.



More information about the Beowulf mailing list