[Beowulf] Hadoop's Uncomfortable Fit in HPC
jlowey at gmail.com
Mon May 19 20:28:57 PDT 2014
I think looking at technology such as what MapR is using could address the suboptimal HDFS; there are opportunities to be had with this framework. As for Java, I could pontificate, but with this group I sense that would be pointless... The right tool for the job will win out in the end.
> On May 19, 2014, at 5:48 PM, "Ellis H. Wilson III" <ellis at cse.psu.edu> wrote:
> On 05/19/2014 03:26 PM, Douglas Eadline wrote:
>>> Great write-up by Glenn Lockwood about the state of Hadoop in HPC. It
>>> pretty much nails it, and offers a nice overview of the current
>>> ongoing efforts to make it relevant in that field.
>>> Most spot on thing I've read in a while. Thanks Glenn.
>> I concur. A good assessment of Hadoop. Most HPC users would
>> benefit from reading Glenn's post. I would offer
>> a few other thoughts (and a free pdf).
> The write-up is interesting, and I know Doug's PDF (and full book for that matter, as I was honored to be asked to help review it) to be worth reading if you want to understand the many subcomponents of the Hadoop project beyond the buzzwords. Very enlightening history in Chapter 1.
> However, I do take a few issues with the original write-up (not book) in general:
> 1. I wish "Hadoop" would die. The term, that is. Hadoop exists less and less by the year. HDFS exists. MapReduce exists. The YARN scheduler exists. As far as I'm concerned, "Hadoop" exists as much as "Big Data" exists. It wraps too much into one thing, and only leads to ideological conversations (best left for the suits). It's there for historical reasons, and needs to be cut out of our language ASAP. It leads to more confusion than anything else.
> 2. You absolutely do not need to use all of the Hadoop sub-projects to use MapReduce, which is the real purpose of using "Hadoop" in HPC at all. There are already perfectly good, high-bandwidth, low-latency, scalable, semantically rich file systems in place that are far more mature than HDFS. So why even bother with HOD (or myhadoop) at all? Just use MapReduce on your existing files. You don't need HDFS, just a Java installation and non-default URIs. Running an MR job via Torque/PBS/et al. is reasonably trivial. FAR more trivial than importing "lakes" of data, as Doug refers to them, from your HPC instance to your "Hadoop" (HDFS) instance anyhow, which is what you have to do with these inane "on-demand" solutions. I will be addressing this in a paper I am presenting at ICDCS this June in Spain, if any are going. Just let me know if interested and I'll share a copy of the paper and times.
> 3. Java really doesn't matter for MR or for YARN. For HDFS (which, as mentioned, you shouldn't be using in HPC anyhow), yeah, I'm not happy with it being in Java, but for MR, if you are experiencing issues relating to Java, you shouldn't be coding in MR. It's not an MR-ready job. You're being lazy. MR should really only be used (in the HPC context) for pre- and post-computation analytics. For all I care, it could be written in BASIC (or Visual BASIC, to bring the point home). Steady-state bandwidth to disk in Java is nearly equivalent to C. Ease of coding and scalability are what make MR great.
> Your 6-week run of your insane climate framework completed on ten-thousand machines and gobbled up a petabyte of intermediate data. All you really need to know is where temperatures in some arctic region are rising faster than some specified value. Spending another two weeks writing a C+MPI/etc program from scratch to do a fancy grep is a total waste of time and capacity. This is where MR shines. Half-day code-up, scale-up very fast, get your results, delete the intermediate data. Science Complete.
> 4. Although I'm not the biggest fan of HDFS, this post misses the /entire/ point of HDFS: reliability in the face of (numerous) failures. HDFS (which has a heritage in the Google FS, which the post also fails to mention, despite mentioning MR's heritage out of the same shop) really was designed to be put on crappy hardware and provide really nice throughput. Complaints about responsiveness, POSIX semantics, etc. are all really beside the point. It's like complaining about your dump truck not doing 0-60 faster than a Lamborghini. That was never the intention, and thus I continue to believe it should not be used in HPC environments when most of them demand the Lamborghini for 90% of executions.
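
The "fancy grep" in point 3 can be sketched as a map step and a reduce step. This is only a rough, in-memory illustration in plain Python, not Hadoop code and not anything posted in the thread: the record layout (region, year, mean temperature), the sample values, and the names map_record, reduce_region and run_job are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical sample records: (region, year, mean_temp_c). Invented for illustration.
RECORDS = [
    ("arctic-a", 2000, -12.0),
    ("arctic-a", 2010, -10.5),
    ("arctic-b", 2000, -15.0),
    ("arctic-b", 2010, -14.8),
]


def map_record(record):
    """Map step: key each observation by its region."""
    region, year, temp = record
    yield region, (year, temp)


def reduce_region(region, observations):
    """Reduce step: warming rate (deg C per year) between first and last year seen."""
    obs = sorted(observations)
    (y0, t0), (y1, t1) = obs[0], obs[-1]
    if y1 == y0:
        return region, 0.0
    return region, (t1 - t0) / (y1 - y0)


def run_job(records, threshold):
    """Group mapped pairs by key, reduce each group, keep regions above threshold."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)

    result = {}
    for region, obs in groups.items():
        _, rate = reduce_region(region, obs)
        if rate > threshold:
            result[region] = rate
    return result


if __name__ == "__main__":
    # Regions warming faster than 0.1 deg C per year in the sample data.
    print(run_job(RECORDS, threshold=0.1))
```

In a real MR job the grouping done inside run_job is what the framework's shuffle would handle across machines; only the two small functions would need to be written by hand, which is the half-day code-up argument.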
> Just my (beer-laden) 2c,
> Ph.D. Candidate
> Department of Computer Science and Engineering
> The Pennsylvania State University
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf