[Beowulf] hadoop

Tue Nov 27 08:50:00 PST 2012

On 11/27/2012 11:34 AM, Eugen Leitl wrote:
> On Tue, Nov 27, 2012 at 11:13:25AM -0500, Ellis H. Wilson III wrote:
>
>> Are these problems EP such that they could be entirely Map tasks?
>
> Not at all. This particular application is to derive optimal
> feature extraction algorithms from high-resolution volumetric data
> (mammal or primate connectome). At ~8 nm, even a mouse will
> produce a mountain of structural data.

Pardon my possible naiveté on the applied science here, but it's unclear 
to me why the state space explosion is tied to it being embarrassingly 
parallel or not.  To reword my question: can you describe whether, and
if so at what frequency, the extraction algorithms will need to
barrier sync to communicate?  If this is indeed "not at all" EP, then
you will likely have a serious communication problem and 1GbE will not 
work if you need to transmit some or all of the data you are reading 
locally to some other remote node.
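
As a rough back-of-envelope sketch of that concern (Python): the 1 TB of
node-local data and the 0.9 link-efficiency factor are assumptions picked
purely for illustration, not figures from this thread, and the function
name is made up.

```python
def transfer_seconds(num_bytes, link_gbit_per_s=1.0, efficiency=0.9):
    """Seconds to push num_bytes over a single link.

    efficiency is an assumed fudge factor for protocol overhead.
    """
    usable_bytes_per_s = link_gbit_per_s * 1e9 / 8 * efficiency
    return num_bytes / usable_bytes_per_s


# Hypothetical: 1 TB of raw data sitting on one node's local disk.
local_data_bytes = 1e12

# How long does it take to ship various fractions of it to a remote node?
for fraction in (1.0, 0.1, 1e-3, 1e-6):
    secs = transfer_seconds(local_data_bytes * fraction)
    print(f"ship {fraction:g} of local data: {secs:,.1f} s")
```

Under these assumptions, shipping the full terabyte costs a node roughly
two and a half hours of wire time per pass, while shipping a 10^-3
fraction costs about nine seconds.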

>> Because otherwise you are going to have a fairly significant shuffle
>> stage in your MapReduce application that will lead to overheads moving
>> the data over the network and in and out of memory/disk/etc.  Shuffling
>> can be a real PITA, but it tends to be present in most real-world
>> applications I've run into.
>
> The extracted feature set would be much more compact than the
> raw dataset (at least 10^3 to 10^6 more compact), and could
> be loaded over the GBit/s network into the main cluster with
> no problems.

How are you getting the raw data onto the cluster?  That load time may
become the dominant cost if this is not a write-once, read-very-many
type of situation.  Maybe you have lots of different feature extraction
algorithms to use on that raw data?
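
To make the write-once, read-very-many point concrete, here is a minimal
sketch (Python) of how the one-time load is amortized over repeated
extraction passes.  The 10-hour load and 1-hour pass are hypothetical
placeholders, and the function name is made up.

```python
def load_share(load_hours, pass_hours, num_passes):
    """Fraction of total wall time spent getting the raw data onto the cluster."""
    total_hours = load_hours + pass_hours * num_passes
    return load_hours / total_hours


# Hypothetical: loading takes 10 h, one extraction pass over local data takes 1 h.
for num_passes in (1, 10, 100):
    share = load_share(10.0, 1.0, num_passes)
    print(f"{num_passes:3d} passes: load is {share:.0%} of total time")
```

With a single pass the load dominates; only with many passes over the
same raw data does it fade into the background.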

>> Maybe you weren't referring to using Hadoop, in which case this
>> basically looks just like the FAWN project I had mentioned in the past
>> that came out of CMU (with the addition of tiered storage).
>
> http://www.cs.cmu.edu/~fawnproj/ ?

Yep, that's the one.

> Cute, and probably the right application for the
> Adapteva project. If the boards are credit-card
> sized you can mount them on a rackmount tray
> along with a 24-port switch, with a couple of
> fans.
>
> However, I'm thinking about a board you directly plug
> your SATA or SAS hard drive into, probably using
> the hard drive itself (which should be 5k rpm then)
> as a heatsink.

Why do you want the HDD to be a heatsink (i.e. why is that better in any
way than just having the HDD right there and using a normal passive
sink)?  And can you expound upon how what you are describing differs
from the FAWN setup with a HDD saddled right next to it?  I feel like
you're saying the exact same thing, except that you connect a HDD for
capacity and use the onboard flash as cache instead, both of which are
reasonably trivial.

Just trying to get a handle on your (interesting IMHO) idea here, no 
non-constructive criticism intended,

ellis