[Beowulf] hadoop
Prentice Bisbal
prentice.bisbal at rutgers.edu
Mon Feb 9 10:48:26 PST 2015
Time for my two cents...
For the best understanding of Hadoop, I think Google's original papers
on MapReduce and GFS (the Google File System) are still the best
starting point. If for no other reason, they were written before the
Hadoop hype train left the station, so they don't claim that MapReduce
can fix every problem.
http://research.google.com/archive/mapreduce.html
http://research.google.com/archive/gfs.html
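For anyone who hasn't read them yet, the programming model is tiny: you
supply a map function that emits key/value pairs and a reduce function
that folds together all the values for a given key, and the framework
handles the shuffle in between. A minimal sketch in plain Python (word
count, the canonical example from the paper; this illustrates only the
model, not Hadoop itself):

    from collections import defaultdict

    def map_fn(document):
        # map phase: emit a (word, 1) pair for every word in the record
        for word in document.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # reduce phase: combine all values emitted for one key
        return (word, sum(counts))

    def mapreduce(documents):
        # shuffle phase: group intermediate pairs by key
        groups = defaultdict(list)
        for doc in documents:
            for key, value in map_fn(doc):
                groups[key].append(value)
        return [reduce_fn(k, vs) for k, vs in groups.items()]

    print(mapreduce(["the quick brown fox", "the lazy dog"]))

The papers are worth reading for everything this toy leaves out:
partitioning, fault tolerance, and data locality on GFS.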
On 02/07/2015 01:38 PM, Douglas Eadline wrote:
>> Hello Jonathan.
>> Here is a good document to get you thinking.
>> http://www.cs.berkeley.edu/~rxin/db-papers/WarehouseScaleComputing.pdf
>>
>> Although Doug said "Oh, and Hadoop clusters are not going to supplant your
>> HPC
>> cluster"
> I should have continued, ... and there will be overlap.
Definitely. Glenn Lockwood has written a great article on exactly this,
which I think sums up the issues perfectly and has been discussed on
this list in the past:
http://glennklockwood.blogspot.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html
In my opinion, a common source of confusion is that people use the
term 'big data' to refer to ALL data that takes up GBs, TBs, or PBs. My
definition includes 'unstructured data from disparate sources', meaning
data from different sources, in different formats. A relational
database, like Oracle or MySQL, regardless of its size, doesn't fit
this definition, since it's structured. I also don't consider the
output of HPC simulations to be 'big data', since the output files of
those simulations are usually structured in some way (HDF5, or
whatever). Are those examples a LOT of data? Yes, but I don't consider
them to be 'big data'.
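To make the 'structured' point concrete, here is a toy sketch of the
kind of self-describing output an HPC code typically writes (assuming
h5py and NumPy are installed; the file and dataset names here are made
up):

    import numpy as np
    import h5py

    # write a structured, self-describing output file
    with h5py.File("timestep_0042.h5", "w") as f:
        f.attrs["code_version"] = "1.0"   # hypothetical metadata
        f.create_dataset("temperature", data=np.random.rand(64, 64))
        f.create_dataset("pressure", data=np.random.rand(64, 64))

    # any reader can discover the layout; no out-of-band schema needed
    with h5py.File("timestep_0042.h5", "r") as f:
        print(list(f.keys()), dict(f.attrs))

However large that file grows, the schema travels with the data, which
is exactly what a pile of server logs, tweets, and sensor dumps doesn't
give you.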
I hope I didn't just start a religious war.
--
Prentice