[Beowulf] Hadoop

Mon Dec 29 14:09:34 PST 2008

On Fri, Dec 26, 2008 at 05:16:04PM -0600, Gerry Creager wrote:

> We've a user who has requested its installation on one of our clusters,  
> a high-throughput system.

You didn't say anything about what they wanted to do. Hadoop is
designed to store a lot of data, and then enable what we HPC people
would call nearly-embarrassingly-parallel computation with good
locality -- it schedules each shard of a mapreduce computation on the
same system that holds the disk shard being processed.
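
To make the locality point concrete, here is a rough sketch of
locality-aware task placement. It is not Hadoop's actual scheduler; the
function name, the data layout and the tie-breaking rule are all
assumptions made for illustration.

```python
def assign_map_tasks(shard_locations, free_slots):
    """Place one map task per data shard, preferring nodes that store the shard.

    shard_locations: dict of shard id -> list of nodes holding a local copy
    free_slots:      dict of node -> number of map tasks it can still accept
    Returns a dict of shard id -> (node, is_local).
    Hypothetical helper for illustration, not part of any Hadoop API.
    """
    slots = dict(free_slots)  # work on a copy; the caller's dict is untouched
    assignment = {}
    for shard, holders in sorted(shard_locations.items()):
        # First choice: a node that already has this shard on local disk.
        local = [n for n in holders if slots.get(n, 0) > 0]
        # Fallback: any node with a free slot, at the cost of a remote read.
        candidates = local or [n for n, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError("no free slot left for shard %r" % shard)
        # Pick the least loaded candidate; break ties by node name.
        node = min(candidates, key=lambda n: (-slots[n], n))
        slots[node] -= 1
        assignment[shard] = (node, bool(local))
    return assignment


shards = {"s0": ["n1", "n2"], "s1": ["n2"], "s2": ["n3"]}
slots = {"n1": 1, "n2": 1, "n3": 0, "n4": 1}
print(assign_map_tasks(shards, slots))
```

In this made-up example s0 and s1 run next to their data, while s2 has
to run on n4 and read remotely because n3 has no free slot.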

This means you'll have to dedicate systems over the long term to store
the data (much like PVFS), and all of these systems will have to be a
part of their mapreduce jobs. So if your queue system can run
whole-cluster jobs easily, no problem.

If, instead, they're just looking for a simple way to do
embarrassingly parallel computations, without lots of persistent data,
then you can probably point them at something easier and more friendly
to your queue system.
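
As one possible shape for that simpler route, here is a sketch of
splitting a list of independent inputs across tasks that a queue system
launches separately. The function name and file names are made up, and
how each task learns its index depends on your scheduler, so treat that
part as an assumption.

```python
def inputs_for_task(inputs, task_index, task_count):
    """Return the inputs that one independent task should process.

    task_index is assumed to be a 0-based index supplied by the queue
    system to each task; task_count is the total number of tasks.
    Hypothetical helper for illustration.
    """
    if not 0 <= task_index < task_count:
        raise ValueError("task_index must be in [0, task_count)")
    # Round-robin split: task i takes items i, i + task_count, ...
    return inputs[task_index::task_count]


files = ["input_%d.dat" % i for i in range(7)]
print(inputs_for_task(files, 0, 3))
```

Each task works only on its own slice and needs no data from the
others, so the tasks can be queued, run and rerun in any order.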

-- greg