[Beowulf] Hadoop's Uncomfortable Fit in HPC

Prentice Bisbal prentice.bisbal at rutgers.edu
Tue May 20 07:50:30 PDT 2014


On 05/19/2014 08:48 PM, Ellis H. Wilson III wrote:
> On 05/19/2014 03:26 PM, Douglas Eadline wrote:
>>> Great write-up by Glenn Lockwood about the state of Hadoop in HPC. It
>>> pretty much nails it, and offers a nice overview of the ongoing
>>> efforts to make it relevant in that field.
>>>
>>> http://glennklockwood.blogspot.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html 
>>>
>>>
>>> Most spot-on thing I've read in a while. Thanks Glenn.
>>
>> I concur. A good assessment of Hadoop. Most HPC users would
>> benefit from reading Glenn's post. I would offer
>> a few other thoughts (and a free pdf).
>
> The write-up is interesting, and I know Doug's PDF (and full book for 
> that matter, as I was honored to be asked to help review it) to be 
> worth reading if you want to understand the many subcomponents of the 
> Hadoop project beyond the buzzwords.  Very enlightening history in 
> Chapter 1.
>
> However, I do take a few issues with the original write-up (not book) 
> in general:
>
> 1. I wish "Hadoop" would die.  The term that is.  Hadoop exists less 
> and less by the year.  HDFS exists.  MapReduce exists.  The YARN 
> scheduler exists.  As far as I'm concerned, "Hadoop" exists as much as 
> "Big Data" exists.  It's too much wrapped into one thing, and only 
> leads to ideological conversations (best left for the suits).  It's 
> there for historical reasons, and needs to be cut out of our language 
> ASAP.  It leads to more confusion than anything else.
>
> 2. You absolutely do not need to use all of the Hadoop sub-projects to 
> use MapReduce, which is the real purpose of using "Hadoop" in HPC at 
> all.  There are already perfectly good, high-bandwidth, low-latency, 
> scalable, semantically rich file systems in place that are far more 
> mature than HDFS.  So why even bother with HOD (or myHadoop) at all? 
> Just use MapReduce on your existing files.  You don't need HDFS, just 
> a Java installation and non-default URIs.  Running a MR job via 
> Torque/PBS/et al. is reasonably trivial.  FAR more trivial than 
> importing "lakes" of data (as Doug calls them) from your HPC 
> instance into your "Hadoop" (HDFS) instance, which is what you have 
> to do with these inane "on-demand" solutions.  I will be addressing 
> this in a paper I present at ICDCS this June in Spain, if any are 
> going.  Just let me know if interested and I'll share a copy of the 
> paper and times.

Ellis, it sounds like this would be a good thing to write up as a 
tutorial and share with the list. I'd be interested in getting a copy 
of that paper when it's available.
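For readers following along, a rough sketch of what Ellis describes: 
a Torque/PBS job script that runs MR straight against the existing 
parallel file system, no HDFS instance and no data import. The queue 
directives, jar name, class name, and paths here are all made up for 
illustration, and the -D override assumes the job's driver uses 
Hadoop's ToolRunner/GenericOptionsParser:

```shell
#!/bin/bash
#PBS -l nodes=16:ppn=8
#PBS -l walltime=02:00:00
cd $PBS_O_WORKDIR

# Point MapReduce at the POSIX mount (Lustre, GPFS, NFS...) via
# file:/// URIs instead of hdfs:// -- the files stay where they are.
# myanalysis.jar, MyAnalysis, and the scratch paths are hypothetical.
hadoop jar myanalysis.jar MyAnalysis \
    -D fs.defaultFS=file:/// \
    file:///lustre/scratch/$USER/climate-output \
    file:///lustre/scratch/$USER/mr-results
```

The point being that this is an ordinary batch job to the scheduler; 
there is no separate "Hadoop cluster" to stand up or tear down.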

>
> 3. Java really doesn't matter for MR or for YARN.  For HDFS (which, as 
> mentioned, you shouldn't be using in HPC anyhow), yea, I'm not happy 
> with it being in Java, but for MR, if you are experiencing issues 
> relating to Java, you shouldn't be coding in MR.  It's not a MR-ready 
> job.  You're being lazy.  MR should really only be used (in the HPC 
> context) for pre- and post-computation analytics.  For all I care, it 
> could be written in BASIC (or Visual BASIC, to bring the point home). 
> Steady-state bandwidth to disk in Java is nearly equivalent to C's.  
> Ease of coding and scalability are what make MR great.
>
> Example:
> Your 6-week run of your insane climate framework completed on ten 
> thousand machines and gobbled up a petabyte of intermediate data. 
> All you really need to know is where temperatures in some arctic 
> region are rising faster than some specified value. Spending another 
> two weeks writing a C+MPI/etc program from scratch to do a fancy grep 
> is a total waste of time and capacity. This is where MR shines.  
> Half-day code-up, scale-up very fast, get your results, delete the 
> intermediate data. Science Complete.
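
The core of that "fancy grep" really is a half-day code-up. A minimal 
sketch in plain Java (stdlib only); the record format 
("region,degCPerDecade") and the threshold are invented for 
illustration, and in a real MR job this filter would be the body of 
the Mapper's map() method:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical "fancy grep" over intermediate climate records.
// The "region,degCPerDecade" format and threshold are invented;
// in a real MR job this logic would live inside Mapper.map().
public class ArcticGrep {

    static Map<String, Double> hotspots(List<String> records, double threshold) {
        return records.stream()
                .map(r -> r.split(","))                            // parse each record
                .filter(f -> Double.parseDouble(f[1]) > threshold) // keep fast-warming regions
                .collect(Collectors.toMap(f -> f[0],
                                          f -> Double.parseDouble(f[1])));
    }

    public static void main(String[] args) {
        List<String> sample = List.of("svalbard,0.9", "barents,0.3", "laptev,1.2");
        // Prints the regions warming faster than 0.5 degC/decade
        System.out.println(hotspots(sample, 0.5));
    }
}
```

Scale the same map logic across the cluster and you have your answer 
without ever touching C+MPI.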
>
> 4. Although I'm not the biggest fan of HDFS, this post misses the 
> /entire/ point of HDFS: reliability in the face of (numerous) 
> failures.  HDFS (which has a heritage in the Google FS, which the post 
> also fails to mention, despite mentioning MR's heritage out of the 
> same shop) really was designed to be put on crappy hardware and 
> provide really nice throughput.  Complaints about responsiveness, 
> POSIX semantics, etc. are really misplaced.  It's like complaining about 
> your dump truck not doing 0-60 faster than a Lamborghini.  Not the 
> intention here and thus, I continue to believe it should not be used 
> in HPC environments when most of them demand the Lamborghini for 90% 
> of executions.
>
> Just my (beer-laden) 2c,

A drunk man is a sober man telling the truth.
>
> ellis
>


More information about the Beowulf mailing list