[Beowulf] hadoop
Prentice Bisbal
prentice.bisbal at rutgers.edu
Mon Feb 9 10:48:26 PST 2015
Time for my two cents...
For the best understanding of Hadoop, I think Google's original papers
on MapReduce and GFS (the Google File System) are still the best
starting point. If for no other reason, they were written before the
Hadoop hype train left the station, so they don't claim that MapReduce
can fix every problem.
http://research.google.com/archive/mapreduce.html
http://research.google.com/archive/gfs.html
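For anyone who hasn't read them yet, the programming model is tiny: you
supply a map function that emits key/value pairs and a reduce function
that folds together all the values for a given key, and the framework
handles the shuffle in between. A minimal sketch in plain Python (word
count, the canonical example from the paper; this illustrates only the
model, not Hadoop itself):

    from collections import defaultdict

    def map_fn(document):
        # map phase: emit a (word, 1) pair for every word in the record
        for word in document.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # reduce phase: combine all values emitted for one key
        return (word, sum(counts))

    def mapreduce(documents):
        # shuffle phase: group intermediate pairs by key
        groups = defaultdict(list)
        for doc in documents:
            for key, value in map_fn(doc):
                groups[key].append(value)
        return [reduce_fn(k, vs) for k, vs in groups.items()]

    print(mapreduce(["the quick brown fox", "the lazy dog"]))

The papers are worth reading for everything this toy leaves out:
partitioning, fault tolerance, and data locality on GFS.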
On 02/07/2015 01:38 PM, Douglas Eadline wrote:
>> Hello Jonathan.
>> Here is a good document to get you thinking.
>> http://www.cs.berkeley.edu/~rxin/db-papers/WarehouseScaleComputing.pdf
>>
>> Although Doug said "Oh, and Hadoop clusters are not going to supplant your
>> HPC
>> cluster"
> I should have continued, ... and there will be overlap.
Definitely. Glenn Lockwood has written a great article on exactly this,
which I think sums up the issues perfectly and has been discussed on
this list in the past:
http://glennklockwood.blogspot.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html
In my opinion, a common source of confusion is that people use the
term 'big data' to refer to ALL data that takes up GBs, TBs, or PBs. My
definition includes 'unstructured data from disparate sources', meaning
data from different sources, in different formats. A relational
database, like Oracle or MySQL, regardless of its size, doesn't fit
this definition, since it's structured. I also don't consider the
output of HPC simulations to be 'big data', since the output files of
those simulations are usually structured in some way (HDF5, or
whatever). Are those examples a LOT of data? Yes, but I don't consider
them to be 'big data'.
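To make the 'structured' point concrete, here is a toy sketch of the
kind of self-describing output an HPC code typically writes (assuming
h5py and NumPy are installed; the file and dataset names here are made
up):

    import numpy as np
    import h5py

    # write a structured, self-describing output file
    with h5py.File("timestep_0042.h5", "w") as f:
        f.attrs["code_version"] = "1.0"   # hypothetical metadata
        f.create_dataset("temperature", data=np.random.rand(64, 64))
        f.create_dataset("pressure", data=np.random.rand(64, 64))

    # any reader can discover the layout; no out-of-band schema needed
    with h5py.File("timestep_0042.h5", "r") as f:
        print(list(f.keys()), dict(f.attrs))

However large that file grows, the schema travels with the data, which
is exactly what a pile of server logs, tweets, and sensor dumps doesn't
give you.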
I hope I didn't just start a religious war.
--
Prentice