mattw at madmonks.org
Sat Feb 7 00:45:19 PST 2015
> On 7 Feb 2015, at 6:20 pm, Jonathan Aquilina <jaquilina at eagleeyet.net> wrote:
> Can someone explain to me what exactly the purpose of hadoop is and what we mean when we say big data? Is this for data storage and retrieval? Number crunching?
Hadoop can be thought of as HTC, High Throughput Computing, over a collection of simple servers. Where in HPC you might have hundreds of nodes with a shared file system working on the same copy of the data, Hadoop distributes the data to local storage in each node of the cluster using the Hadoop Distributed File System (HDFS), and then collects the output at the end. I believe it has built-in redundancy, allowing you to distribute the same job to 2 or 3 nodes for fault tolerance. It means your "cluster" can be very simple: no complex parallel filesystems, no specialised networks, no redundancy at the hardware level.
Hadoop was originally built with MapReduce as its core application, but a number of other applications that run on top of it can be found on the Apache website.
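To make the MapReduce model concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) doing a word count. This is a toy illustration of the programming model only, not Hadoop's actual Java API; all function names here are my own.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit (key, value) pairs -- here, one (word, 1) per word.
    for word in record.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between
    # the map and reduce phases (across the network, in a real cluster).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine all values for one key -- here, a simple sum.
    return key, sum(values)

records = ["the quick brown fox", "the lazy dog and the cat"]
pairs = [kv for r in records for kv in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 3
```

In a real cluster each mapper and reducer runs on a different node, next to its slice of the data; only the shuffled intermediate pairs cross the network.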
As for big data, this is basically about taking things like 10 billion tweets, breaking them up into chunks of 500,000 or so, and doing analytics on them. Things like that break up very easily for distribution, as there is usually very little linkage between each tweet.
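The "break it into chunks and analyse each independently" pattern above can be sketched in a few lines. This is a toy example: the chunk size, the sample data, and the `analyze` function are all placeholders of my own, standing in for whatever analytics you would actually ship to each node.

```python
def chunk(items, size):
    # Split a large dataset into fixed-size batches.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def analyze(batch):
    # Placeholder analytic: count tweets that mention "hadoop".
    return sum(1 for tweet in batch if "hadoop" in tweet.lower())

tweets = ["I love Hadoop", "big data!", "hadoop at scale"] * 1000

# Because tweets rarely reference one another, each batch can be
# shipped to a different node and processed with no coordination;
# only the small partial results come back to be combined.
partials = [analyze(batch) for batch in chunk(tweets, 500)]
print(sum(partials))  # 2000
```

The lack of linkage between records is what makes this workload "embarrassingly parallel" and such a good fit for Hadoop.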
Hadoop came out of the need for places like Google, Yahoo, PayPal and eBay to process terabytes of transaction logs an hour. They already had the servers, but they were in data centres all over the world. Rather than hook them all up to some common file server, they built a system to package up the data and the application and send it wherever could process it the quickest. Send it 3 times to make sure it gets done, then pull back the results at the end.