[Beowulf] Doing i/o at a small cluster

Vincent Diepeveen diep at xs4all.nl
Fri Aug 17 09:04:33 PDT 2012


On Aug 17, 2012, at 4:42 PM, Ellis H. Wilson III wrote:

> On 08/17/12 08:03, Vincent Diepeveen wrote:
>> hi,
>>
>> Which free or very cheap distributed file system choices do i have
>> for a 8 node cluster that has QDR infiniband (mellanox)?
>> Each node could have a few harddrives. Up to 8 or so SATA2. Could
>> also use some raid cards.
>

Thanks for your quick response!

> Lots of choices, but are you talking about putting a bunch of disks in
> all those PCs or having one I/O server?  The latter is the classic
> solution but there are ways to do the former.
>
> Short answer is there are complicated ways to fling your hdds into
> distributed machines using PVFS and get good performance provided you
> are okay with those non-posix semantics and guarantees.  There are  
> also

I need a solution just for this workload; it doesn't need to work for  
anything else.
So if i have to take the long way around to get something done with it,
yet it delivers the speed i need, then that's acceptable here.

> ways to get decent performance from the Hadoop Distributed File  
> System,
> which can handle a distributed set of nodes and internal HDDs well,  
> but
> for a /constrained set of applications./  Based on your previous posts
> about GPUs and whatnot, I'm going to assume you will have little to  
> zero
> interest in Hadoop.  Last, there's a new NFS version out (pNFS, or NFS
> v4.1) that you can probably use to great impact with proper  
> tuning.  No
> comments on tuning it however, as I haven't yet tried myself.  That  
> may
> be your best out of the box solution.
>
> Also, I assume you're talking about QDR 1X here, so just 8Gb/s per  
> node.
> Correct me if that's wrong.

This is correct. The motherboards are PCIe 2.0.

Each node is a $200 machine with 2 x L5420 and the Seaburg chipset.

As you will realize, i don't have money for professional solutions.

>
>> And i'm investigating what i need.
>>
>> I'm investigating to generate the 7 men EGTBs at the cluster. This is
>> a big challenge.
>
> For anyone who doesn't know (probably many who aren't into chess, I  
> had
> to look this up myself), EGTB is end game table bases, and more  
> info is
> available at:
> http://en.wikipedia.org/wiki/Endgame_tablebase
>
> Basically it's just a giant dump of exhaustive moves for N men left on
> the board.

Correct. Basically you need to do a lot of calculations to produce 1  
byte of useful positions; in the end 1 byte stores 5 positions.
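
The post doesn't say how the 5 positions are packed into a byte, but a natural guess (purely my assumption, not Vincent's stated format) is five 3-valued results per byte in base 3, since 3^5 = 243 fits in one byte. A minimal sketch:

#include <stdint.h>

/* Hypothetical packing: five 3-valued results (0 = loss, 1 = draw,
 * 2 = win) per byte, base-3 encoded. 3^5 = 243 <= 256, so it fits. */
static uint8_t pack5(const uint8_t r[5]) {
    uint8_t b = 0;
    for (int i = 4; i >= 0; i--)
        b = (uint8_t)(b * 3 + r[i]);
    return b;
}

static void unpack5(uint8_t b, uint8_t r[5]) {
    for (int i = 0; i < 5; i++) {
        r[i] = b % 3;   /* least significant base-3 digit = position i */
        b /= 3;
    }
}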

The end result will be 82 TB of data uncompressed; i'll store it  
compressed.
Generating it means at least 3 months of nonstop i/o at 4 GB/s, or  
more months if the i/o rate is lower.

Simply compressed it will easily fit on a 3 TB drive, yet in  
compressed form it's not usable during generation.

Historic attempts, such as the famous ones by Ken Thompson, stored  
positions in less efficient formats, which also compress less well.

>
>> To generate it is high i/o load. I'm looking at around a 4 GB/s i/o
>> from which a tad more than
>> 1GB/s is write and a tad less than 3GB/s is readspeed from  
>> harddrives.
>>
>> This for 3+ months nonstop. Provided the CPU's can keep up with that.
>> Otherwise a few months more.
>>
>> This 4GB/s i/o is aggregated speed.
>
> I would LOVE to hear what Joe has to say on this, but going out on a
> limb here, it will be almost impossible to get that much out of your
> HDDs with 8 nodes without serious planning and an extremely narrow
> use-case.  I assume you are talking about putting drives in each  
> node at
> this point, because with just QDR you cannot feed aggregate 4GB/s
> without bonding from one node.

Yes i realize that. In principle you're looking at around 1050 files  
that effectively get generated,
and to generate each file half a dozen huge files get created.

Now in theory generating them is embarrassingly parallel, except that  
generating 1 large set of EGTBs
requires a working set of around 3 TB.

So with 64 cores or so, running one set per core would mean the  
immense size of some terabytes times 64 cores.

Therefore generating it all embarrassingly parallel would need a  
pretty expensive sort of i/o.

I intend to have each node of 8 cores work on 1 or 2 sets at the same  
time, probably 1.
So 8 cores then require a maximum of 3 TB.

So far it seemed i could maybe get away with equipping each node with  
4 drives of 1.5 TB in RAID10,
or 2 drives in RAID0 (and then just restart the job when a drive fails).
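
For what it's worth, a back-of-the-envelope sketch of what that layout implies per node, under my own assumptions about sustained drive throughput (the thread's ~133 MB/s outer-track / ~33 MB/s inner-track figures bracket it):

#include <stdio.h>

int main(void) {
    /* Assumptions, mine rather than from the thread: the 4 GB/s aggregate
     * target, 8 nodes, and a conservative ~100 MB/s sustained per SATA2
     * drive when streaming. */
    const double aggregate_mb_s = 4096.0;
    const int    nodes          = 8;
    const double per_drive_mb_s = 100.0;

    double per_node_mb_s = aggregate_mb_s / nodes;          /* ~512 MB/s */
    double drives_needed = per_node_mb_s / per_drive_mb_s;  /* ~5 drives */

    printf("per node: %.0f MB/s -> roughly %.1f drives streaming flat out\n",
           per_node_mb_s, drives_needed);

    /* Capacity side: 4 x 1.5 TB in RAID10 or 2 x 1.5 TB in RAID0 both
     * give 3 TB usable, i.e. exactly the per-set working set, with no
     * headroom left over. */
    return 0;
}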

Yet there is a catch.

Now comes the Achilles heel. At the start of the generation it needs  
to access some earlier
generated sets really quickly; the initial generation of the  
conversion bitmap needs to access other EGTBs,
especially because pawns can promote. Luckily this is a single pass,  
but it's a slow and intensive pass.

In such a case accessing data over the network is important.

So there is huge locality except for 1 pass. The really fast  
generation that hammers the drives, reading and writing
quickly, can be done entirely locally. The first pass, generating a  
single 'exchange bitmap', needs to look up
EGTBs generated earlier. For example, for the EGTB KQRP KRP it has to  
look up the much larger
EGTB that holds KQRB KRP, and a few others.
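
A minimal sketch of such a lookup: read one big block (10 MB here, the size mentioned later in the thread, just as an example) at a random offset with pread(), so the remote drive spends its time streaming rather than seeking. The path and offset are made up:

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK (10 * 1024 * 1024)   /* 10 MB per read */

int main(void) {
    /* Hypothetical path; in practice an EGTB generated earlier, possibly
     * sitting on another node's disks over NFS/pNFS. */
    const char *path = "/egtb/kqrb_krp.etb";

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Access pattern is random big blocks, so turn off the kernel's
     * sequential readahead heuristics between our reads. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    char *buf = malloc(BLOCK);
    if (!buf) { close(fd); return 1; }

    off_t where = (off_t)1234 * BLOCK;     /* example block-aligned offset */
    ssize_t got = pread(fd, buf, BLOCK, where);
    if (got < 0) perror("pread");
    else printf("read %zd bytes at offset %lld\n", got, (long long)where);

    free(buf);
    close(fd);
    return 0;
}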

So the coupling to the other nodes' drives is limited to, say, a few  
percent of the total i/o getting done.

Since we're talking about complex file management here, it's difficult  
to do this by hand.

In other words, efficient usage of the available hard drive space is  
important.

>
> We need to know more about generating this tablebase -- I can only
> assume you are planning to do analyses on it after you generate all
> possible combinations, right?  We need to know more about how that
> follow-up analysis can be divided before commenting on possible  
> storage
> solutions.  If everything is totally embarrassingly parallel you're  
> in a
> good spot to not bother with a parallel filesystem.  In that case you
> just might be able to squeeze 4GB/s out of your drives.
>
> But with all the nodes accessing all the disks at once, hitting 4GB/s
> with just strung together FOSS software is really tough for  
> anything but
> the most basic and most embarrassingly parallel stuff.  It requires
> serious tuning over months or buying a product that has already done
> this (e.g. a solution like Joe's company Scalable Informatics makes or
> Panasas, the company I work for, makes).  People always love to say,
> "Oh, that's 100MB/s per drive!  So with 64 drives I should be able to
> get 6.4GB/s!  Yea!"  Sadly, that's really only the case when these
> drives are accessed completely sequentially and completely separately
> (i.e. not put together into a distributed filesystem).
>
>> What raid system you'd recommend here?
>
> Uh, you looking for software or hardware or object RAID?

I'm open to anything that can work for me.

I didn't buy the hard drives yet, and doubt i will; i'm just investigating
what i need to get the job done.

After that i will know the price of the drives.

>
>> A problem is the write speed + read speed i need. From what i
>> understand at the edges of drives the speed is
>> roughly 133MB/s SATA2 moving down to a 33MB/s at the innersides.
>>
>> Is that roughly correct?
>
> I hate this as much as anybody, but........ It Depends (TM).
> You talking plain-jane "dd".  Sure, that might be reasonable for  
> certain
> vendors.
>
>> Of course there will be many solutions. I could use some raid cards
>> or i could equip each node with some drives.
>> Raid card is probably sata-3 nowadays. Didn't check speeds there.
>>
>> Total storage is some dozen to a few dozens of terabytes.
>>
>> Does the filesystem automatically optimize for writing at the edges
>> instead of starting at the innerside?
>> which 'raid' level would you recommend for this if any is appropriate
>> at all :)
>
> Again, depends on RAID card and whatnot.  Some do, some don't.
>
>> How many harddrives would i need? What failure rate can i expect with
>> modern SATA drives there?
>> I had several fail at a raid0+1 system before when generating some
>> EGTBs some years ago.
>
> Yup, things will break especially during the shakeout (first few  
> days or
> weeks).  I assume you're buying commodity drives here, not enterprise,
> so you should prepare for upwards of, /after the shakeout/, maybe  
> 4-8 of
> your drives to fail or start throwing SMART errors in the first year
> (ball-parking it here based solely on experience).  Rebuilds will suck
> for you with lots of data unless you have really thought that out
> (typically limited to speed of a single disk -- therefore 2TB drive
> rebuilding itself at 50MB/s (that's best case scenario) is like 11
> hours.  I hope you haven't bought all your drives from the same batch
> from the same manufacturer as well -- that often results in very  
> similar
> failure times (i.e. concurrent failures in a day).  Very non-uniform.

Yes, i have always had many hard drive failures when doing EGTB work.

Usually 100% break within a few years. Actually only 1 old SCSI drive  
survived;
all the parallel ATA and SATA drives die one day.

The biggest failure rate is obviously in the first few weeks indeed;  
after that most die
after 1 or 2 years of intensive usage.

>
>> Note there is more questions. Like which buffer size i must read/
>> write. Most files get streamed.
>>   From 2 files that i do reading from, i read big blocks from a  
>> random
>> spot in that file. Each file is
>> a couple of hundreds of gigabyte.
>>
>> I used to grab chunks of 64KB from each file, but don't see how to
>> get to gigabytes a second i/o with
>> todays hardware that manner.
>>
>> Am considering now to read blocks of 10MB. Which size will get me
>> there to the maximum bandwidth the i/o
>> can deliver?
>
> I actually do wonder if Hadoop won't work for you.  This sounds like a
> very Hadoop-like workload, assuming you are OK with write-once read- 
> many
> semantics.  But I need to know way more about what you want to do with
> the data afterwards.  Moving data off of HDFS sucks.

Compress it with 7-zip and move it away, indeed. That 82 TB  
uncompressed will compress to 3 TB, i guess.
Probably during the generation of that 82 TB of data i must already  
start compressing to save
space.
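
A trivial sketch of that compress-as-it-completes idea, just shelling out to 7-zip once a file is finished and no longer needed as input (the path is hypothetical):

#include <stdio.h>
#include <stdlib.h>

/* Sketch: once a tablebase file is fully generated and no longer needed
 * as input, compress it and remove the original to free space for the
 * next set. */
static int compress_and_drop(const char *path) {
    char cmd[1024];
    /* 7z 'a' adds to an archive; -mx=9 is the strongest compression level. */
    snprintf(cmd, sizeof cmd, "7z a -mx=9 '%s.7z' '%s' && rm '%s'",
             path, path, path);
    return system(cmd);
}

int main(void) {
    return compress_and_drop("/egtb/out/kqrp_krp.etb");
}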

The majority of those chess files is useless crap actually; just a few  
are really interesting to have.
The interesting ones are actually a few smaller ones with many pawns,  
but to generate them you need to first do all that work :)

Then after that, supercompress them using chess-technical knowledge;  
that'll get the 82 TB down to 100 GB or so, but it will take
many years to do so, as that's very compute intensive and not high  
priority.

>
> Best,
>
> ellis
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin  
> Computing
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf
