[Beowulf] hadoop
Ellis H. Wilson III
ellis at cse.psu.edu
Tue Nov 27 05:30:35 PST 2012
On 11/27/2012 08:14 AM, Vincent Diepeveen wrote:
> Don't post something ridiculous like that.
You are both right, so let's stop antagonizing the antagonist here.
Laptops are a reasonable place to toy around with Hadoop and educate
oneself about it, but they are not the ideal environment for it
(obviously, and I don't think the poster meant to suggest otherwise).
However, this is not because of InfiniBand or some such thing -- most
production Hadoop setups use 1Gb Ethernet, or 10GbE if they are very
lucky. Only a few use InfiniBand, and any efforts to run Hadoop over
RDMA are very, very recent (I saw a few at SC12), with benefits that
are yet to be determined IMHO.
But the network isn't the real reason why Hadoop and portable commodity
HW won't work well together -- the real reason is that, to really take
advantage of Hadoop, you should have multiple HDDs per box and, most
importantly, a LOT of RAM to get the best performance (the explanation
is complicated and revolves around data spilling during computation and
whatnot). Laptops tend not to have lots of RAM packed in.
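As a rough illustration of why RAM matters for spilling, here is a toy
Python estimate of how many spill files a single map task would write
for a given in-memory sort buffer. The function name, its parameters,
and the numbers are made up for illustration and are not actual Hadoop
settings; the model ignores per-record bookkeeping and compression.

```python
import math


def estimate_spills(map_output_mb, sort_buffer_mb, spill_fraction=0.8):
    """Toy estimate of spill files written by one map task.

    Assumption: the task buffers its output in memory and writes a spill
    file each time the buffer reaches spill_fraction of its size. Output
    is always written to disk at least once when the task finishes.
    """
    if map_output_mb <= 0:
        return 0
    usable_mb = sort_buffer_mb * spill_fraction
    return max(1, math.ceil(map_output_mb / usable_mb))


if __name__ == "__main__":
    # Same 1000 MB of map output, small buffer versus large buffer.
    for buffer_mb in (100, 1600):
        print(buffer_mb, "MB buffer ->", estimate_spills(1000, buffer_mb), "spill(s)")
```

Under these assumptions, 1000 MB of map output with a 100 MB buffer
produces 13 spill files that must later be merged, while a 1600 MB
buffer produces a single one. Every extra spill means more disk
traffic, which is also why multiple HDDs per box help.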
So, to get back to your original query here, Jonathan: yes, you can run
Hadoop (with some effort) on a "storage server," which I interpret in
this case to be a NAS box of some sort. However, please note that
typical Hadoop already couples the "compute server" and the "storage
server" by co-locating both the compute and storage daemons on the same
box. The MapReduce scheduler attempts (sometimes poorly; this is in
need of serious improvement) to schedule tasks on machines which already
have the data, which mitigates the need for super huge network pipes in
order to push the data to that node. So in a way, by using Hadoop in
its traditional incarnation you already achieve this.
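The locality idea can be sketched with a toy greedy assignment in
Python. This is not Hadoop's scheduler, only a minimal illustration
under assumed inputs: assign_tasks, the task names, and the node names
are all hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Greedy toy scheduler: prefer a node that already holds the data.

    block_locations: dict of task id -> list of nodes holding a replica
    free_slots:      dict of node -> number of free task slots
    Returns a dict of task id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is untouched
    assignment = {}
    leftover = []

    # First pass: place each task next to its data when a slot is free.
    for task, replicas in block_locations.items():
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node = max(local, key=lambda n: slots[n])
            slots[node] -= 1
            assignment[task] = (node, True)
        else:
            leftover.append(task)

    # Second pass: remaining tasks run wherever a slot is free, so
    # their input has to be pulled over the network.
    for task in leftover:
        candidates = [n for n, s in slots.items() if s > 0]
        if not candidates:
            break  # cluster is full; the task waits
        node = max(candidates, key=lambda n: slots[n])
        slots[node] -= 1
        assignment[task] = (node, False)

    return assignment


if __name__ == "__main__":
    blocks = {"t1": ["a", "b"], "t2": ["a"], "t3": ["a"]}
    for task, (node, is_local) in assign_tasks(blocks, {"a": 1, "b": 1, "c": 1}).items():
        print(task, node, "local" if is_local else "remote")
```

A greedy pass like this can easily hand node "a" to t1 even though t1
could have run on "b", leaving t2 to run remotely. That is one simple
way a locality-aware scheduler ends up scheduling poorly.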
I've done some testing of Hadoop within NAS, although most of my
research has been on Hadoop /on/ NAS, or in other words, getting rid of
the storage aspect of Hadoop and solely leveraging the MapReduce
framework atop existing NAS storage. This can be done efficiently, but
it's easy to mistakenly head down an inefficient path.
Hopefully this (finally) answers some part of your original question,
ellis