[Beowulf] hadoop

Tue Nov 27 08:13:25 PST 2012

On 11/27/2012 08:59 AM, Eugen Leitl wrote:
> On Tue, Nov 27, 2012 at 09:10:32AM +0100, Jonathan Aquilina wrote:
>> Hey guys, I was looking at the Hadoop page and it got me wondering: is it
>> possible to cluster together storage servers? If so, how efficient would a
>> cluster of them be?
>
> An interesting problem would be to use reasonably powerful but
> cheap ARM SoCs with a few GBytes of onboard RAM and some flash
> for hybrid filesystems for each hard drive, and cluster
> them via GBit Ethernet on a very large scale.
>
> That would be a custom Beowulf for more storage-related
> tasks. E.g. an application I have in mind is volumetric
> datasets with e.g. 8 nm voxels for biological systems,
> which are way too large to process in memory.

Are these problems embarrassingly parallel (EP), such that they could
run entirely as Map tasks? Otherwise you are going to have a fairly
significant shuffle stage in your MapReduce application, which adds the
overhead of moving the data over the network and in and out of
memory/disk/etc. Shuffling can be a real PITA, but it tends to be
present in most real-world applications I've run into.
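
To make the distinction concrete, here is a toy in-memory sketch in
Python. It is not Hadoop code: the function names, the "reducer i lives
on node i" placement, and the integer-key partitioning are all
assumptions made up for illustration. It only shows that a map-only job
leaves every record where it started, while a keyed reduce forces
records to move between nodes.

```python
from collections import defaultdict


def map_only(partitions, map_fn):
    """Map-only job: each partition is processed where it lives.

    No record leaves its node, so there is no shuffle cost.
    """
    return [[map_fn(rec) for rec in part] for part in partitions]


def map_shuffle_reduce(partitions, map_fn, reduce_fn, num_reducers):
    """Map, shuffle by key, then reduce.

    Assumptions (illustrative only): keys are non-negative integers,
    a key goes to reducer `key % num_reducers`, and reducer i runs on
    node i. Returns (results, moved), where `moved` counts the
    (key, value) pairs that had to leave the node that produced them.
    """
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    moved = 0
    for node, part in enumerate(partitions):
        for rec in part:
            for key, value in map_fn(rec):
                dest = key % num_reducers
                if dest != node:
                    moved += 1  # this pair would cross the network
                buckets[dest][key].append(value)

    results = {}
    for bucket in buckets:
        for key, values in bucket.items():
            results[key] = reduce_fn(values)
    return results, moved
```

With two partitions of three records each, something like
map_shuffle_reduce(parts, lambda x: [(x % 2, x)], sum, 2) already moves
most of the records, whereas map_only(parts, f) moves none. On real
hardware that difference is what ends up on the GBit Ethernet links.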

Maybe you weren't referring to using Hadoop, in which case this
basically looks just like the FAWN project out of CMU that I mentioned
in the past (with the addition of tiered storage).

Best,

ellis