[Beowulf] hadoop

Ellis H. Wilson III ellis at cse.psu.edu
Tue Nov 27 05:30:35 PST 2012

On 11/27/2012 08:14 AM, Vincent Diepeveen wrote:
> Don't post something ridicioulous like that.

You both are right, so let's stop antagonizing the antagonist here. 
Laptops are a reasonable place to toy around with Hadoop and educate 
oneself about it, but they are also not the ideal environment for Hadoop 
(obviously, and I don't think the poster intended to claim otherwise).

However, this is not because of InfiniBand or some such thing -- most 
production Hadoop setups use 1Gb Ethernet, or 10GbE if they are very 
lucky.  Only a few use InfiniBand, and any efforts to use Hadoop over 
RDMA are very, very recent (saw a few at SC12) and the benefits are yet 
to be determined IMHO.

But this isn't the real reason why Hadoop and portable commodity HW 
won't work well together -- the real reason is that, to really take 
advantage of Hadoop, you should have multiple HDDs per box and, most 
importantly, a LOT of RAM to get the best performance (complicated 
explanation revolving around data spilling during computation and 
whatnot).  Laptops tend not to have lots of RAM packed in.
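
As a rough illustration of the spilling point, here is a back-of-the-envelope 
sketch. It assumes a simplified model in which a map task buffers its output 
in memory and writes a spill file to disk each time the buffer passes a fill 
threshold. The function name and the model are hypothetical. The defaults are 
meant to mirror the Hadoop 1.x settings io.sort.mb (100) and 
io.sort.spill.percent (0.80), and the real accounting has more to it (record 
metadata, combiners, merge passes).

```python
import math

def estimate_spills(map_output_mb, sort_buffer_mb=100, spill_fraction=0.80):
    """Estimate how many spill files one map task writes (simplified model).

    map_output_mb  -- total intermediate output of the map task, in MB
    sort_buffer_mb -- in-memory sort buffer size, in MB (assumed default)
    spill_fraction -- buffer fill level that triggers a spill (assumed default)
    """
    usable_mb = sort_buffer_mb * spill_fraction
    # Even output that fits in the buffer is flushed once at the end.
    return max(1, math.ceil(map_output_mb / usable_mb))

if __name__ == "__main__":
    for buffer_mb in (100, 500, 2000):
        print(buffer_mb, "MB buffer:", estimate_spills(1000, buffer_mb), "spill(s)")
```

Under this model, a larger buffer means fewer spill files and less disk 
traffic to merge them later, and a box short on RAM cannot afford a large 
buffer for every task slot.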

So, to get back to your original query here Jonathan, yes, you can run 
Hadoop (with some effort) on a "storage server," which I interpret in 
this case to be a NAS box of some sort.  However, please note that 
typical Hadoop already couples the "compute server" and the "storage 
server" by co-locating both the compute and storage daemons on the same 
box.  The MapReduce scheduler attempts (sometimes poorly; this is in 
need of serious improvement) to schedule tasks on machines which already 
have the data, which mitigates the need for super huge network pipes in 
order to push the data to that node.  So in a way, by using Hadoop in 
its traditional incarnation you already achieve this.
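
The locality idea described above can be sketched as a toy greedy scheduler. 
This is not Hadoop's actual scheduler code. The function name, the data 
structures, and the two-pass strategy are all assumptions made for 
illustration.

```python
def assign_tasks(block_locations, free_slots):
    """Toy locality-aware task assignment (illustrative only).

    block_locations -- dict: task id -> list of nodes holding a replica
    free_slots      -- dict: node -> number of free task slots
    Returns dict: task id -> (node, is_data_local)
    """
    slots = dict(free_slots)  # copy so the caller's dict is untouched
    assignment = {}
    pending = []

    # Pass 1: place each task on a node that already stores its data.
    for task, nodes in block_locations.items():
        local = next((n for n in nodes if slots.get(n, 0) > 0), None)
        if local is None:
            pending.append(task)
        else:
            slots[local] -= 1
            assignment[task] = (local, True)

    # Pass 2: leftover tasks run anywhere and pull their data over the network.
    for task in pending:
        remote = next((n for n, free in slots.items() if free > 0), None)
        if remote is None:
            break  # cluster is full; remaining tasks wait
        slots[remote] -= 1
        assignment[task] = (remote, False)

    return assignment

if __name__ == "__main__":
    blocks = {"t1": ["a", "b"], "t2": ["a"], "t3": ["a"]}
    print(assign_tasks(blocks, {"a": 1, "b": 1, "c": 1}))
```

With the sample input, t1 takes the only slot on node a, so t2 and t3 end up 
running remotely on b and c, even though placing t1 on b would have kept t2 
local. That is a small example of how a greedy locality heuristic can 
schedule poorly.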

I've done some testing of Hadoop within NAS, although most of my 
research has been on Hadoop /on/ NAS, or in other words, getting rid of 
the storage aspect of Hadoop and solely leveraging the MapReduce 
framework atop existing NAS storage.  This can be done efficiently, but 
it's easy to mistakenly head down an inefficient path.

Hopefully this (finally) answers some part of your original question,

