[Beowulf] Hadoop's Uncomfortable Fit in HPC
Ellis H. Wilson III
ellis at cse.psu.edu
Mon May 19 17:48:26 PDT 2014
On 05/19/2014 03:26 PM, Douglas Eadline wrote:
>> Great write-up by Glenn Lockwood about the state of Hadoop in HPC. It
>> pretty much nails it, and offers a nice overview of the ongoing
>> efforts to make it relevant in that field.
>>
>> http://glennklockwood.blogspot.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html
>>
>> Most spot on thing I've read in a while. Thanks Glenn.
>
> I concur. A good assessment of Hadoop. Most HPC users would
> benefit from reading Glenn's post. I would offer
> a few other thoughts (and a free pdf).
The write-up is interesting, and I know Doug's PDF (and full book for
that matter, as I was honored to be asked to help review it) to be worth
reading if you want to understand the many subcomponents of the Hadoop
project beyond the buzzwords. Very enlightening history in Chapter 1.
However, I do take issue with a few points in the original write-up
(not the book):
1. I wish "Hadoop" would die. The term, that is. Hadoop exists less and
less by the year. HDFS exists. MapReduce exists. The YARN scheduler
exists. As far as I'm concerned, "Hadoop" exists about as much as "Big
Data" exists. It's too much wrapped into one thing, and that only leads
to ideological conversations (best left to the suits). It's there for
historical reasons, causes more confusion than anything else, and needs
to be cut out of our language ASAP.
2. You absolutely do not need to use all of the Hadoop sub-projects to
use MapReduce, which is the real reason to use "Hadoop" in HPC at all.
There are already perfectly good, high-bandwidth, low-latency, scalable,
semantically rich file systems in place that are far more mature than
HDFS. So why even bother with HOD (or myHadoop) at all? Just use
MapReduce on your existing files. You don't need HDFS, just a Java
installation and non-default URIs (rough sketch at the end of this
list). Running an MR job via Torque/PBS et al. is reasonably trivial.
FAR more trivial than importing the "lakes" of data (as Doug calls them)
from your HPC storage into your "Hadoop" (HDFS) instance, which is what
you have to do with these inane "on-demand" solutions. I will be
addressing this in a paper I'm presenting at ICDCS this June in Spain,
if any of you are going. Just let me know if you're interested and I'll
share a copy of the paper and the session times.
3. Java really doesn't matter for MR or for YARN. For HDFS (which, as
mentioned, you shouldn't be using in HPC anyhow), yeah, I'm not happy
with it being in Java, but for MR, if you are running into problems
caused by Java, you shouldn't be coding in MR in the first place. It's
not an MR-ready job; you're being lazy. MR should really only be used
(in the HPC context) for pre- and post-computation analytics. For all I
care, it could be written in BASIC (or Visual BASIC, to bring the point
home). Steady-state bandwidth to disk from Java is nearly equivalent to
C. Ease of coding and scalability are what make MR great.
Example:
Your six-week run of your insane climate framework completed on
ten thousand machines and gobbled up a petabyte of intermediate data.
All you really need to know is where temperatures in some arctic region
are rising faster than some specified rate. Spending another two weeks
writing a C+MPI/etc. program from scratch to do a fancy grep is a total
waste of time and capacity. This is where MR shines (see the sketch at
the end of this list): half a day to code it up, scale it up very fast,
get your results, delete the intermediate data. Science complete.
4. Although I'm not the biggest fan of HDFS, this post misses the
/entire/ point of HDFS: reliability in the face of (numerous) failures.
HDFS (which has its heritage in the Google File System, which the post
fails to mention despite noting MR's heritage out of the same shop)
really was designed to be put on crappy hardware and still provide
really nice throughput. Complaints about responsiveness, POSIX
semantics, etc., are therefore beside the point. It's like complaining
that your dump truck doesn't do 0-60 as fast as a Lamborghini. That was
never the intention, and thus I continue to believe HDFS should not be
used in HPC environments, where most sites demand the Lamborghini for
90% of their executions.
Just my (beer-laden) 2c,
ellis
--
Ph.D. Candidate
Department of Computer Science and Engineering
The Pennsylvania State University
www.ellisv3.com