[Beowulf] Hadoop's Uncomfortable Fit in HPC
Prentice Bisbal
prentice.bisbal at rutgers.edu
Tue May 20 07:50:30 PDT 2014
On 05/19/2014 08:48 PM, Ellis H. Wilson III wrote:
> On 05/19/2014 03:26 PM, Douglas Eadline wrote:
>>> Great write-up by Glenn Lockwood about the state of Hadoop in HPC. It
>>> pretty much nails it, and offers a nice overview of the current
>>> ongoing efforts to make it relevant in that field.
>>>
>>> http://glennklockwood.blogspot.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html
>>>
>>>
>>> Most spot on thing I've read in a while. Thanks Glenn.
>>
>> I concur. A good assessment of Hadoop. Most HPC users would
>> benefit from reading Glenn's post. I would offer
>> a few other thoughts (and a free pdf).
>
> The write-up is interesting, and I know Doug's PDF (and full book for
> that matter, as I was honored to be asked to help review it) to be
> worth reading if you want to understand the many subcomponents of the
> Hadoop project beyond the buzzwords. Very enlightening history in
> Chapter 1.
>
> However, I do take a few issues with the original write-up (not book)
> in general:
>
> 1. I wish "Hadoop" would die. The term that is. Hadoop exists less
> and less by the year. HDFS exists. MapReduce exists. The YARN
> scheduler exists. As far as I'm concerned, "Hadoop" exists as much as
> "Big Data" exists. It's too much wrapped into one thing, and only
> leads to ideological conversations (best left for the suits). It's
> there for historical reasons, and needs to be cut out of our language
> ASAP. It leads to more confusion than anything else.
>
> 2. You absolutely do not need to use all of the Hadoop sub-projects to
> use MapReduce, which is the real purpose of using "Hadoop" in HPC at
> all. There are already perfectly good, high-bandwidth, low-latency,
> scalable, semantically-rich file systems in-place and far more mature
> than HDFS. So why even bother with HOD (or myhadoop) at all? Just
> use MapReduce on your existing files. You don't need HDFS, just a
> Java installation and non-default URIs. Running an MR job via
> Torque/PBS/et al. is reasonably trivial. FAR more trivial than
> importing "lakes" of data as Doug refers to them from your HPC
> instance to your "Hadoop" (HDFS) instance anyhow, which is what you
> have to do with these inane "on-demand" solutions. I will be
> addressing this in a paper I'm presenting at ICDCS this June in
> Spain, if any are going. Just let me know if interested and I'll
> share a copy of the paper and times.
Ellis, it sounds like this would be a good thing to write up as a
tutorial and share with the list. I'd be interested in getting a copy of
that paper when it's available.
>
> 3. Java really doesn't matter for MR or for YARN. For HDFS (which, as
> mentioned, you shouldn't be using in HPC anyhow), yea, I'm not happy
> with it being in Java, but for MR, if you are experiencing issues
> relating to Java, you shouldn't be coding in MR. It's not an MR-ready
> job. You're being lazy. MR should really only be used (in the HPC
> context) for pre- and post-computation analytics. For all I care, it
> could be written in BASIC (or Visual BASIC, to bring the point home).
> Steady-state bandwidth to disk in Java is nearly equivalent to C.
> Ease of coding and scalability is what makes MR great.
>
> Example:
> Your 6-week run of your insane climate framework completed on
> ten-thousand machines and gobbled up a petabyte of intermediate data.
> All you really need to know is where temperatures in some arctic
> region are rising faster than some specified value. Spending another
> two weeks writing a C+MPI/etc program from scratch to do a fancy grep
> is a total waste of time and capacity. This is where MR shines.
> Half-day code-up, scale-up very fast, get your results, delete the
> intermediate data. Science Complete.
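To make the "fancy grep" concrete, here is a minimal sketch of that kind of job as a map step and a reduce step, run locally in plain Python with no Hadoop involved. Everything specific in it is an assumption for illustration: the 'region,year,temp_c' record format, the THRESHOLD value, the first-to-last-reading rate calculation, and the run_local() driver, which only stands in for the framework's shuffle.

```python
from collections import defaultdict

# Hypothetical cutoff in degrees C per year; not a value from the thread.
THRESHOLD = 0.05


def mapper(line):
    """Map: parse one 'region,year,temp_c' record into (region, (year, temp))."""
    region, year, temp = line.strip().split(",")
    yield region, (int(year), float(temp))


def reducer(region, readings):
    """Reduce: emit the region only if its warming rate exceeds THRESHOLD."""
    readings = sorted(readings)
    (y0, t0), (y1, t1) = readings[0], readings[-1]
    if y1 == y0:
        return  # a single year of data gives no rate
    rate = (t1 - t0) / (y1 - y0)
    if rate > THRESHOLD:
        yield region, rate


def run_local(lines):
    """Stand-in for the shuffle: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for region, rate in reducer(key, groups[key]):
            results[region] = rate
    return results


if __name__ == "__main__":
    sample = [
        "arctic-a,2000,-10.0",
        "arctic-a,2010,-9.0",
        "arctic-b,2000,-12.0",
        "arctic-b,2010,-11.9",
    ]
    print(run_local(sample))
```

The point of the split is that mapper() and reducer() carry no state between records, so a framework can run them on as many nodes as the data needs without changes to the logic.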
>
> 4. Although I'm not the biggest fan of HDFS, this post misses the
> /entire/ point of HDFS: reliability in the face of (numerous)
> failures. HDFS (which has a heritage in the Google FS, which the post
> also fails to mention, despite mentioning MR's heritage out of the
> same shop) really was designed to be put on crappy hardware and
> provide really nice throughput. Responsiveness, POSIX semantics, etc,
> are all really inappropriate remarks. It's like complaining about
> your dump truck not doing 0-60 faster than a Lamborghini. Not the
> intention here and thus, I continue to believe it should not be used
> in HPC environments when most of them demand the Lamborghini for 90%
> of executions.
>
> Just my (beer-laden) 2c,
A drunk man is a sober man telling the truth.
>
> ellis
>