[Beowulf] Hadoop's Uncomfortable Fit in HPC

Ellis H. Wilson III ellis at cse.psu.edu
Mon May 19 17:48:26 PDT 2014


On 05/19/2014 03:26 PM, Douglas Eadline wrote:
>> Great write-up by Glenn Lockwood about the state of Hadoop in HPC. It
>> pretty much nails it, and offers a nice overview of the ongoing
>> efforts to make it relevant in that field.
>>
>> http://glennklockwood.blogspot.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html
>>
>> Most spot on thing I've read in a while. Thanks Glenn.
>
> I concur. A good assessment of Hadoop. Most HPC users would
> benefit from reading Glenn's post. I would offer
> a few other thoughts (and a free pdf).

The write-up is interesting, and I know Doug's PDF (and the full book, 
for that matter, as I was honored to be asked to help review it) is 
worth reading if you want to understand the many subcomponents of the 
Hadoop project beyond the buzzwords.  Chapter 1 has a very enlightening 
history.

However, I do take issue with a few points in the original write-up (not 
the book):

1. I wish "Hadoop" would die.  The term, that is.  Hadoop exists less 
and less by the year.  HDFS exists.  MapReduce exists.  The YARN 
scheduler exists.  As far as I'm concerned, "Hadoop" exists as much as 
"Big Data" exists.  It's too much wrapped into one thing, and only leads 
to ideological conversations (best left for the suits).  It's there for 
historical reasons, and needs to be cut out of our language ASAP.  It 
leads to more confusion than anything else.

2. You absolutely do not need to use all of the Hadoop sub-projects to 
use MapReduce, which is the real purpose of using "Hadoop" in HPC at 
all.  There are already perfectly good, high-bandwidth, low-latency, 
scalable, semantically rich file systems in place, far more mature than 
HDFS.  So why even bother with HOD (or myhadoop) at all?  Just use 
MapReduce on your existing files.  You don't need HDFS, just a Java 
installation and non-default URIs (see the sketch below).  Running a MR 
job via Torque/PBS/et al. is reasonably trivial.  FAR more trivial than 
importing "lakes" of data (as Doug calls them) from your HPC storage 
into your "Hadoop" (HDFS) instance, which is what you have to do with 
these inane "on-demand" solutions.  I will be addressing this in a paper 
I present at ICDCS this June in Spain, if any of you are going.  Just 
let me know if interested and I'll share a copy of the paper and the 
presentation times.
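To make the "non-default URIs" point concrete, here is a minimal sketch 
(the mount point and paths are hypothetical stand-ins, and it assumes 
your parallel file system is POSIX-mounted on every node): Hadoop's 
FileSystem API will happily operate on file:// URIs with no HDFS 
daemons running at all.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class PosixFsDemo {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Non-default URI: use the local POSIX mount, not hdfs://
      conf.set("fs.defaultFS", "file:///");
      FileSystem fs = FileSystem.get(conf);
      // List a (hypothetical) scratch directory on the parallel FS
      for (FileStatus s : fs.listStatus(new Path("/lustre/scratch/mydata"))) {
        System.out.println(s.getPath() + " " + s.getLen());
      }
    }
  }

MR jobs pick up the same setting, so their input and output paths 
resolve straight against the existing file system with no import step.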

3. Java really doesn't matter for MR or for YARN.  For HDFS (which, as 
mentioned, you shouldn't be using in HPC anyhow), yeah, I'm not happy 
with it being in Java, but for MR, if you are experiencing issues 
relating to Java, you shouldn't be coding in MR.  It's not a MR-ready 
job.  You're being lazy.  MR should really only be used (in the HPC 
context) for pre- and post-computation analytics.  For all I care, it 
could be written in BASIC (or Visual BASIC, to bring the point home). 
Steady-state bandwidth to disk in Java is nearly equivalent to C.  Ease 
of coding and scalability are what make MR great.

Example:
Your 6-week run of your insane climate framework completed on 
ten-thousand machines and gobbled up a petabyte of intermediate data. 
All you really need to know is where temperatures in some arctic region 
are rising faster than some specified rate.  Spending another two weeks 
writing a C+MPI/etc. program from scratch to do a fancy grep is a total 
waste of time and capacity.  This is where MR shines.  Half-day code-up 
(see the sketch below), scale up very fast, get your results, delete 
the intermediate data.  Science complete.
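
As a rough illustration of that half-day of coding, here is a sketch of 
such a map-only "fancy grep" job.  The record layout (lat,lon,year,delta 
as CSV), the threshold, the class names, and the paths are all 
hypothetical; the point is how little code the filter takes:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class ArcticGrep {
    public static class GrepMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {
      private static final double THRESHOLD = 2.0;  // hypothetical rate
      @Override
      protected void map(LongWritable key, Text value, Context ctx)
          throws IOException, InterruptedException {
        // Hypothetical CSV record: lat,lon,year,tempDelta
        String[] f = value.toString().split(",");
        double lat = Double.parseDouble(f[0]);
        double delta = Double.parseDouble(f[3]);
        if (lat > 66.5 && delta > THRESHOLD)  // above the arctic circle
          ctx.write(NullWritable.get(), value);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Again: read the existing parallel FS via file://, no HDFS import
      conf.set("fs.defaultFS", "file:///");
      Job job = Job.getInstance(conf, "arctic-grep");
      job.setJarByClass(ArcticGrep.class);
      job.setMapperClass(GrepMapper.class);
      job.setNumReduceTasks(0);  // map-only: this really is just a grep
      job.setOutputKeyClass(NullWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Compile it against the Hadoop jars, submit it under Torque/PBS like any 
other Java job, and point args[0] at the intermediate data sitting on 
your parallel file system.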

4. Although I'm not the biggest fan of HDFS, this post misses the 
/entire/ point of HDFS: reliability in the face of (numerous) failures. 
HDFS (which has its heritage in the Google File System, which the post 
fails to mention, despite tracing MR's heritage to the same shop) really 
was designed to be put on crappy hardware and provide really nice 
throughput.  Complaints about responsiveness, POSIX semantics, etc., are 
really misplaced.  It's like complaining that your dump truck doesn't do 
0-60 like a Lamborghini.  That was never the intention here, and thus I 
continue to believe HDFS should not be used in HPC environments, where 
most demand the Lamborghini for 90% of executions.

Just my (beer-laden) 2c,

ellis

-- 
Ph.D. Candidate
Department of Computer Science and Engineering
The Pennsylvania State University
www.ellisv3.com


