[Beowulf] Doing i/o at a small cluster

Fri Aug 17 12:32:20 PDT 2012

The idea someone suggested to me in a private email is
to use a distributed file system and split each drive into 2 partitions:

the outside, which is fastest, for local storage, and the inside for a
global distributed partition for long-term
storage of end results, with a script automatically compressing the
results and decompressing them when it
seems a specific EGTB will be needed soon.

Then using 3 disks per node I can get 133-150 MB/s on the outside
of the drives in a RAID-0.
That'll be around 3TB, the minimum needed for generation.

And the inside then gets a partition that uses redundancy, maybe
RAID-6?
Any thoughts there?

So say I can dedicate 4 nodes or so to this. That's 12 drives.

Then I'll take 6 months instead of 3 months to generate, but I have 4
other nodes free for other jobs.
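
A rough back-of-the-envelope check of those numbers, as a sketch only. It assumes the 133-150 MB/s figure is per drive on the outer zone and that RAID-0 scales linearly across drives and nodes; neither is confirmed, and the function name is made up for illustration. The 4000 MB/s target is the aggregate figure quoted further down.

```python
def aggregate_mb_per_s(nodes, drives_per_node, mb_per_s_per_drive):
    """Best-case aggregate streaming rate, assuming perfect linear scaling."""
    return nodes * drives_per_node * mb_per_s_per_drive

# Aggregate I/O target from the original post (about 4 GB/s).
TARGET_MB_PER_S = 4000

# 4 nodes with 3 drives each, at the low and high end of the per-drive guess.
for per_drive in (133, 150):
    total = aggregate_mb_per_s(4, 3, per_drive)
    share = total / TARGET_MB_PER_S
    print(f"{per_drive} MB/s per drive: {total} MB/s aggregate, "
          f"{share:.0%} of the 4 GB/s target")
```

Under these assumptions 12 drives land well under the 4 GB/s target, so how much longer the run takes depends on how much of it is I/O-bound rather than CPU-bound.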

Also I need to spend less on hard drives then. The question now is whether
I'll go for the 3TB or the 2TB drives.

As for the filesystem, that's the most interesting part of this. Is Gluster
a good idea here?

Can it handle this split between local and global partitions? Does
it have RAID-6 or maybe some other
sort of redundancy you'd advise?

As for hadoop that's a java thing you know. If i want to get my  
cluster hacked from India i know an easier way to get that done :)

Thanks in Advance,
Vincent

On Aug 17, 2012, at 4:42 PM, Ellis H. Wilson III wrote:

> On 08/17/12 08:03, Vincent Diepeveen wrote:
>> hi,
>>
>> Which free or very cheap distributed file system choices do I have
>> for an 8 node cluster that has QDR infiniband (mellanox)?
>> Each node could have a few hard drives. Up to 8 or so SATA2. Could
>> also use some RAID cards.
>
> Lots of choices, but are you talking about putting a bunch of disks in
> all those PCs or having one I/O server?  The latter is the classic
> solution but there are ways to do the former.
>
> Short answer is there are complicated ways to fling your hdds into
> distributed machines using PVFS and get good performance provided you
> are okay with those non-posix semantics and guarantees.  There are also
> ways to get decent performance from the Hadoop Distributed File System,
> which can handle a distributed set of nodes and internal HDDs well, but
> for a /constrained set of applications./  Based on your previous posts
> about GPUs and whatnot, I'm going to assume you will have little to zero
> interest in Hadoop.  Last, there's a new NFS version out (pNFS, or NFS
> v4.1) that you can probably use to great impact with proper tuning.  No
> comments on tuning it however, as I haven't yet tried myself.  That may
> be your best out of the box solution.
>
> Also, I assume you're talking about QDR 1X here, so just 8Gb/s per node.
> Correct me if that's wrong.
>
>> And i'm investigating what i need.
>>
>> I'm investigating generating the 7-men EGTBs on the cluster. This is
>> a big challenge.
>
> For anyone who doesn't know (probably many who aren't into chess, I had
> to look this up myself), EGTB is end game table bases, and more info is
> available at:
> http://en.wikipedia.org/wiki/Endgame_tablebase
>
> Basically it's just a giant dump of exhaustive moves for N men left on
> the board.
>
>> Generating it is a high I/O load. I'm looking at around 4 GB/s of I/O,
>> of which a tad more than 1 GB/s is write and a tad less than 3 GB/s is
>> read speed from the hard drives.
>>
>> This is for 3+ months nonstop, provided the CPUs can keep up with that.
>> Otherwise a few months more.
>>
>> This 4 GB/s of I/O is the aggregate speed.
>
> I would LOVE to hear what Joe has to say on this, but going out on a
> limb here, it will be almost impossible to get that much out of your
> HDDs with 8 nodes without serious planning and an extremely narrow
> use-case.  I assume you are talking about putting drives in each node at
> this point, because with just QDR you cannot feed aggregate 4GB/s
> without bonding from one node.
>
> We need to know more about generating this tablebase -- I can only
> assume you are planning to do analyses on it after you generate all
> possible combinations, right?  We need to know more about how that
> follow-up analysis can be divided before commenting on possible storage
> solutions.  If everything is totally embarrassingly parallel you're in a
> good spot to not bother with a parallel filesystem.  In that case you
> just might be able to squeeze 4GB/s out of your drives.
>
> But with all the nodes accessing all the disks at once, hitting 4GB/s
> with just strung together FOSS software is really tough for anything but
> the most basic and most embarrassingly parallel stuff.  It requires
> serious tuning over months or buying a product that has already done
> this (e.g. a solution like Joe's company Scalable Informatics makes or
> Panasas, the company I work for, makes).  People always love to say,
> "Oh, that's 100MB/s per drive!  So with 64 drives I should be able to
> get 6.4GB/s!  Yea!"  Sadly, that's really only the case when these
> drives are accessed completely sequentially and completely separately
> (i.e. not put together into a distributed filesystem).
>
>> What raid system you'd recommend here?
>
> Uh, you looking for software or hardware or object RAID?
>
>> A problem is the combined write + read speed I need. From what I
>> understand, at the outer edge of SATA2 drives the speed is
>> roughly 133MB/s, dropping to about 33MB/s at the inner side.
>>
>> Is that roughly correct?
>
> I hate this as much as anybody, but........ It Depends (TM).
> You talking plain-jane "dd"?  Sure, that might be reasonable for certain
> vendors.
>
>> Of course there will be many solutions. I could use some raid cards
>> or i could equip each node with some drives.
>> Raid card is probably sata-3 nowadays. Didn't check speeds there.
>>
>> Total storage is some dozen to a few dozens of terabytes.
>>
>> Does the filesystem automatically optimize for writing at the outer edge
>> instead of starting at the inner side?
>> Which 'raid' level would you recommend for this, if any is appropriate
>> at all :)
>
> Again, depends on RAID card and whatnot.  Some do, some don't.
>
>> How many hard drives would I need? What failure rate can I expect with
>> modern SATA drives there?
>> I had several fail in a RAID 0+1 system before when generating some
>> EGTBs some years ago.
>
> Yup, things will break especially during the shakeout (first few days or
> weeks).  I assume you're buying commodity drives here, not enterprise,
> so you should prepare for upwards of, /after the shakeout/, maybe 4-8 of
> your drives to fail or start throwing SMART errors in the first year
> (ball-parking it here based solely on experience).  Rebuilds will suck
> for you with lots of data unless you have really thought that out
> (typically limited to speed of a single disk -- therefore a 2TB drive
> rebuilding itself at 50MB/s (that's best case scenario) is like 11
> hours).  I hope you haven't bought all your drives from the same batch
> from the same manufacturer as well -- that often results in very similar
> failure times (i.e. concurrent failures in a day).  Very non-uniform.
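
The rebuild estimate above can be reproduced with a few lines, as a sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant rebuild rate limited by a single disk; the function name is hypothetical, and the 3TB row is only there because that drive size is under consideration.

```python
def rebuild_hours(capacity_tb, rebuild_mb_per_s):
    """Hours to rewrite a whole drive at a constant single-disk rate."""
    capacity_bytes = capacity_tb * 10**12
    bytes_per_second = rebuild_mb_per_s * 10**6
    return capacity_bytes / bytes_per_second / 3600

# 2TB is the case from the paragraph above; 3TB is the other size considered.
for tb in (2, 3):
    print(f"{tb}TB at 50MB/s: about {rebuild_hours(tb, 50):.1f} hours")
```

Any production load on the array during the rebuild would push these numbers up, so they are best-case figures.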
>
>> Note there are more questions, like which buffer size I must read/write
>> with. Most files get streamed.
>> From 2 of the files that I read from, I read big blocks from a random
>> spot in the file. Each file is
>> a couple of hundred gigabytes.
>>
>> I used to grab chunks of 64KB from each file, but don't see how to
>> get to gigabytes a second of I/O with
>> today's hardware that way.
>>
>> I am now considering reading blocks of 10MB. Which size will get me
>> to the maximum bandwidth the I/O
>> can deliver?
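
One way to answer the block-size question empirically is to time random reads at several sizes on the actual hardware. The following is a minimal sketch of that idea, not the generator's own I/O code; the function name and parameters are made up for illustration. It goes through the OS page cache, so it should be run against a file much larger than RAM (or after dropping caches) to measure the disks rather than memory.

```python
import os
import random
import time

def random_read_throughput(path, block_size, num_reads, seed=0):
    """Read num_reads blocks of block_size bytes from random offsets.

    Returns (total_bytes_read, megabytes_per_second).
    """
    rng = random.Random(seed)
    file_size = os.path.getsize(path)
    # Keep every block fully inside the file when the file is large enough.
    max_offset = max(file_size - block_size, 0)
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(num_reads):
            f.seek(rng.randint(0, max_offset))
            total += len(f.read(block_size))
    elapsed = time.perf_counter() - start
    mb_per_s = total / 1e6 / elapsed if elapsed > 0 else float("inf")
    return total, mb_per_s
```

Calling it with 64 * 1024 and then 10 * 1024 * 1024 as block_size on one of the large input files would show how much the larger blocks amortize seek time on a given drive or array.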
>
> I actually do wonder if Hadoop won't work for you.  This sounds like a
> very Hadoop-like workload, assuming you are OK with write-once read-many
> semantics.  But I need to know way more about what you want to do with
> the data afterwards.  Moving data off of HDFS sucks.
>
> Best,
>
> ellis