[Beowulf] Doing i/o at a small cluster

Fri Aug 17 13:03:27 PDT 2012

The homepage looks very commercial and they have a free trial on it.
Are you referring to the free trial?

I'll leave it at that.

Putting everything in 1 basket means an extra machine that burns juice,
of course. That's the first disadvantage. Not the most serious one, of
course, as i could equip one of the nodes with it.

It means buying a raid controller. That's extra cost; how much depends
on what the controller costs.

The mellanox infiniband can on paper handle it, it's 8 GB/s.

So if i would use it for 4 nodes at 2GB/s, of which the majority is reads,
it should be possible, maybe. Of course, reads or writes make no big
difference to the network.

Don't think the mellanox has any problems there. It'll do it hands down.

Loading a network that much is asking for trouble for the motherboards
though, i would suppose.

But it does mean that every node, with every disk read and write it does,
hammers that single basket at the same time.

I don't see how you can get good performance like that out of 1 cheap box.

What's the write latency for a disk write?

I'm no expert there. 7 milliseconds or so?

If i read 10MB blocks at a time from a semi-random location on disk,
and i have 32-64 cores doing that, then at 2GB/s that's going to be
around 200 blocks of 10MB a second, reads and writes combined.
The majority will be reads, say 70-30 or so.

Which raid controller can handle that?

Probably not a $200 one i suppose? Start adding zeros?
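
To put rough numbers on the above, here is a minimal back-of-envelope sketch in Python. It assumes decimal units, takes the 7 ms guess as the cost of one semi-random access, and borrows the 133MB/s outer-edge streaming rate mentioned further down the thread. The function names are made up for illustration, and the model ignores controller overhead, contention between nodes and RAID parity, so treat the result as an optimistic lower bound on drive count, not a sizing.

```python
import math

# Figures taken from the thread; decimal units (1 GB = 1000 MB).
AGGREGATE_MB_PER_S = 2000      # 2GB/s target hitting the single box
BLOCK_MB = 10                  # 10MB per request
READ_FRACTION = 0.70           # "70-30 or so"
ACCESS_MS = 7.0                # assumption: one seek per block, the 7 ms guess
DRIVE_STREAM_MB_PER_S = 133.0  # assumption: outer-edge SATA2 rate from the thread


def blocks_per_second(aggregate_mb_per_s, block_mb):
    """Number of block requests per second needed to sustain the target rate."""
    return aggregate_mb_per_s / block_mb


def effective_drive_rate(block_mb, access_ms, stream_mb_per_s):
    """MB/s one drive sustains if each block costs one access plus a sequential transfer."""
    transfer_s = block_mb / stream_mb_per_s
    return block_mb / (access_ms / 1000.0 + transfer_s)


total = blocks_per_second(AGGREGATE_MB_PER_S, BLOCK_MB)
reads = total * READ_FRACTION
writes = total - reads
per_drive = effective_drive_rate(BLOCK_MB, ACCESS_MS, DRIVE_STREAM_MB_PER_S)
drives_needed = math.ceil(AGGREGATE_MB_PER_S / per_drive)

print(f"requests/s: {total:.0f} total, {reads:.0f} reads, {writes:.0f} writes")
print(f"one drive, seek + transfer: {per_drive:.1f} MB/s")
print(f"drives needed in the ideal case: {drives_needed}")
```

With 10MB blocks the seek is small next to the transfer time, so the request rate itself is modest; the hard part is whether one controller and one box can stream that much from that many spindles at once.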

On Aug 17, 2012, at 9:45 PM, Andrew Holway wrote:

> How about something like putting all your disks in one basket and
> getting a ZFS / NFSoRDMA solution such as nexenta.
>
> They have a nice open source distribution.
>
> 2012/8/17 Vincent Diepeveen <diep at xs4all.nl>:
>> The idea someone brought me on to by means of a private email is
>> to use a distributed file system and split each drive in 2 partitions:
>>
>> the outside, which is fastest, for local storage, and the inside for a
>> global distributed partition for long-term storage of end results,
>> with a script automatically compressing results and decompressing
>> them when it seems a specific EGTB will be needed soon.
>>
>> Then, using 3 disks a node, i can get 133MB-150MB/s on the outside
>> of the drives in a raid-0.
>> That'll be around 3TB, the minimum needed for generation.
>>
>> And the inside then gets a partition that uses redundancy, maybe
>> raid-6? Any thoughts there?
>>
>> So say a node or 4 i can dedicate to this. That's 12 drives.
>>
>> Then i'll take 6 months instead of 3 months to generate but i have 4
>> other nodes free for other jobs.
>>
>> Also i need to pay less for harddrives then. The question now is whether
>> i'll go for the 3TB then or the 2TB.
>>
>> As for which filesystem is most interesting for doing this: is gluster
>> a good idea for this?
>>
>> Can it handle this split between partitions in local and global? Does
>> it have raid-6 or maybe some other sort of redundancy you'd advise?
>>
>> As for hadoop that's a java thing you know. If i want to get my
>> cluster hacked from India i know an easier way to get that done :)
>>
>> Thanks in Advance,
>> Vincent
>>
>>
>> On Aug 17, 2012, at 4:42 PM, Ellis H. Wilson III wrote:
>>
>>> On 08/17/12 08:03, Vincent Diepeveen wrote:
>>>> hi,
>>>>
>>>> Which free or very cheap distributed file system choices do i have
>>>> for a 8 node cluster that has QDR infiniband (mellanox)?
>>>> Each node could have a few harddrives. Up to 8 or so SATA2. Could
>>>> also use some raid cards.
>>>
>>> Lots of choices, but are you talking about putting a bunch of disks in
>>> all those PCs or having one I/O server?  The latter is the classic
>>> solution but there are ways to do the former.
>>>
>>> Short answer is there are complicated ways to fling your hdds into
>>> distributed machines using PVFS and get good performance provided you
>>> are okay with those non-posix semantics and guarantees.  There are also
>>> ways to get decent performance from the Hadoop Distributed File System,
>>> which can handle a distributed set of nodes and internal HDDs well, but
>>> for a /constrained set of applications./  Based on your previous posts
>>> about GPUs and whatnot, I'm going to assume you will have little to zero
>>> interest in Hadoop.  Last, there's a new NFS version out (pNFS, or NFS
>>> v4.1) that you can probably use to great impact with proper tuning.  No
>>> comments on tuning it however, as I haven't yet tried myself.  That may
>>> be your best out of the box solution.
>>>
>>> Also, I assume you're talking about QDR 1X here, so just 8Gb/s per node.
>>> Correct me if that's wrong.
>>>
>>>> And i'm investigating what i need.
>>>>
>>>> I'm investigating to generate the 7 men EGTBs at the cluster. This is
>>>> a big challenge.
>>>
>>> For anyone who doesn't know (probably many who aren't into chess, I had
>>> to look this up myself), EGTB is end game table bases, and more info is
>>> available at:
>>> http://en.wikipedia.org/wiki/Endgame_tablebase
>>>
>>> Basically it's just a giant dump of exhaustive moves for N men left on
>>> the board.
>>>
>>>> To generate it is high i/o load. I'm looking at around a 4 GB/s i/o
>>>> from which a tad more than 1GB/s is write and a tad less than 3GB/s
>>>> is readspeed from harddrives.
>>>>
>>>> This for 3+ months nonstop. Provided the CPU's can keep up with that.
>>>> Otherwise a few months more.
>>>>
>>>> This 4GB/s i/o is aggregated speed.
>>>
>>> I would LOVE to hear what Joe has to say on this, but going out on a
>>> limb here, it will be almost impossible to get that much out of your
>>> HDDs with 8 nodes without serious planning and an extremely narrow
>>> use-case.  I assume you are talking about putting drives in each node at
>>> this point, because with just QDR you cannot feed aggregate 4GB/s
>>> without bonding from one node.
>>>
>>> We need to know more about generating this tablebase -- I can only
>>> assume you are planning to do analyses on it after you generate all
>>> possible combinations, right?  We need to know more about how that
>>> follow-up analysis can be divided before commenting on possible storage
>>> solutions.  If everything is totally embarrassingly parallel you're in a
>>> good spot to not bother with a parallel filesystem.  In that case you
>>> just might be able to squeeze 4GB/s out of your drives.
>>>
>>> But with all the nodes accessing all the disks at once, hitting 4GB/s
>>> with just strung together FOSS software is really tough for anything but
>>> the most basic and most embarrassingly parallel stuff.  It requires
>>> serious tuning over months or buying a product that has already done
>>> this (e.g. a solution like Joe's company Scalable Informatics makes or
>>> Panasas, the company I work for, makes).  People always love to say,
>>> "Oh, that's 100MB/s per drive!  So with 64 drives I should be able to
>>> get 6.4GB/s!  Yea!"  Sadly, that's really only the case when these
>>> drives are accessed completely sequentially and completely separately
>>> (i.e. not put together into a distributed filesystem).
>>>
>>>> What raid system you'd recommend here?
>>>
>>> Uh, you looking for software or hardware or object RAID?
>>>
>>>> A problem is the write speed + read speed i need. From what i
>>>> understand at the edges of drives the speed is
>>>> roughly 133MB/s SATA2 moving down to a 33MB/s at the innersides.
>>>>
>>>> Is that roughly correct?
>>>
>>> I hate this as much as anybody, but........ It Depends (TM).
>>> You talking plain-jane "dd"?  Sure, that might be reasonable for certain
>>> vendors.
>>>
>>>> Of course there will be many solutions. I could use some raid cards
>>>> or i could equip each node with some drives.
>>>> Raid card is probably sata-3 nowadays. Didn't check speeds there.
>>>>
>>>> Total storage is some dozen to a few dozens of terabytes.
>>>>
>>>> Does the filesystem automatically optimize for writing at the edges
>>>> instead of starting at the innerside?
>>>> Which 'raid' level would you recommend for this if any is appropriate
>>>> at all :)
>>>
>>> Again, depends on RAID card and whatnot.  Some do, some don't.
>>>
>>>> How many harddrives would i need? What failure rate can i expect with
>>>> modern SATA drives there?
>>>> I had several fail at a raid0+1 system before when generating some
>>>> EGTBs some years ago.
>>>
>>> Yup, things will break especially during the shakeout (first few days or
>>> weeks).  I assume you're buying commodity drives here, not enterprise,
>>> so you should prepare for upwards of, /after the shakeout/, maybe 4-8 of
>>> your drives to fail or start throwing SMART errors in the first year
>>> (ball-parking it here based solely on experience).  Rebuilds will suck
>>> for you with lots of data unless you have really thought that out
>>> (typically limited to speed of a single disk -- therefore 2TB drive
>>> rebuilding itself at 50MB/s (that's best case scenario) is like 11
>>> hours).  I hope you haven't bought all your drives from the same batch
>>> from the same manufacturer as well -- that often results in very similar
>>> failure times (i.e. concurrent failures in a day).  Very non-uniform.
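
The rebuild estimate quoted above is easy to reproduce. A minimal sketch, assuming decimal terabytes and that the rebuild runs at a constant single-disk rate with no competing i/o (the function name is made up for illustration):

```python
def rebuild_hours(capacity_tb, rate_mb_per_s):
    """Hours to rewrite a whole drive at a constant rate; 1 TB = 1,000,000 MB."""
    capacity_mb = capacity_tb * 1_000_000
    return capacity_mb / rate_mb_per_s / 3600.0


# The 2TB and 3TB drive sizes under discussion, at the quoted 50MB/s.
for capacity_tb in (2, 3):
    hours = rebuild_hours(capacity_tb, 50)
    print(f"{capacity_tb}TB at 50MB/s: {hours:.1f} hours")
```

For the 2TB drive this gives roughly the 11 hours mentioned; a 3TB drive at the same rate takes half as long again, and any production i/o running alongside the rebuild only stretches it further.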
>>>
>>>> Note there are more questions. Like which buffer size i must read/write.
>>>> Most files get streamed.
>>>> From 2 files that i do reading from, i read big blocks from a random
>>>> spot in that file. Each file is a couple of hundred gigabytes.
>>>>
>>>> I used to grab chunks of 64KB from each file, but don't see how to
>>>> get to gigabytes a second i/o with today's hardware that way.
>>>>
>>>> Am considering now to read blocks of 10MB. Which size will get me
>>>> there to the maximum bandwidth the i/o
>>>> can deliver?
>>>
>>> I actually do wonder if Hadoop won't work for you.  This sounds like a
>>> very Hadoop-like workload, assuming you are OK with write-once read-many
>>> semantics.  But I need to know way more about what you want to do with
>>> the data afterwards.  Moving data off of HDFS sucks.
>>>
>>> Best,
>>>
>>> ellis
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf