[Beowulf] Hadoop's Uncomfortable Fit in HPC
glock at sdsc.edu
Mon May 19 20:26:37 PDT 2014
I appreciate your commentary, Ellis. I agree and disagree with you on your various points, and a lot of that comes from what (I suspect) is the difference in our perspectives:
On May 19, 2014, at 5:48 PM, Ellis H. Wilson III <ellis at cse.psu.edu> wrote:
> 1. I wish "Hadoop" would die. The term that is. Hadoop exists less and less by the year. HDFS exists. MapReduce exists. The YARN scheduler exists. As far as I'm concerned, "Hadoop" exists as much as "Big Data" exists.
I agree with you on your last point to a large degree; I use "Hadoop" to refer to the Hadoop product (which includes a MapReduce implementation and HDFS) but deliberately avoid using it to describe Hadoop-like (but not Hadoop) applications. Ultimately though, Hadoop is a product whereas "Big Data" is a truly vacuous term. "MapReduce" is a generic term, and "Hadoop's implementation of MapReduce (HIOMR)" makes people look at you funny.
> It's too much wrapped into one thing, and only leads to ideological conversations (best left for the suits).
Maybe I'm guilty of being a suit, or maybe I work for suits. But it's what people understand, and it's not an imprecise term to use in many contexts.
> It's there for historical reasons, and needs to be cut out of our language ASAP. It leads to more confusion than anything else.
I would argue that using some all-inclusive generic term leads to more confusion. I say this from a pedagogical perspective; when you first learn what an atom is in grade school, you are told that it's the smallest indivisible piece of matter. This, of course, is not true, but it helps people at the fringes of the community understand what you're talking about just a little better. I am very deliberate in my use of "Hadoop" as a generic term. "Big Data," on the other hand, is universally imprecise and confusing. I also refrain from using it in print (proposals aside) for this reason.
> 2. You absolutely do not need to use all of the Hadoop sub-projects to use MapReduce, which is the real purpose of using "Hadoop" in HPC at all. There are already perfectly good, high-bandwidth, low-latency, scalable, semantically-rich file systems in-place and far more mature than HDFS. So why even bother with HOD (or myhadoop) at all? Just use MapReduce on your existing files.
This isn't a bad idea, and if it works for a specific problem, run with it. I don't advocate this for four reasons though:
1. Configuration and debugging can get very obscure very quickly as you try using other tools on top of MapReduce.
2. The performance still isn't good--the original MapReduce-on-Lustre work goes into nice detail about this. It's still not as scalable as using a bunch of independent disks.
3. It's a weird way of doing it. It's hard to summarize this succinctly, but we've found that the farther away we are from the standard way of doing anything (e.g., using Hadoop), the more complaints we get from users.
4. If there was no myHadoop, I'd have far less to do on my Saturday mornings.
> You don't need HDFS, just a java installation and non-default URIs. Running a MR job via Torque/PBS et al. is reasonably trivial. FAR more trivial than importing "lakes" of data as Doug refers to them from your HPC instance to your "Hadoop" (HDFS) instance anyhow,
It's not hard to do this. One command and a little bit of time. We're also developing ways around forcing users to do even this.
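For the curious, the "non-default URIs" approach boils down to pointing Hadoop's default file system at the existing POSIX-mounted parallel file system instead of HDFS. A minimal sketch (the property name is standard Hadoop configuration; whether file:/// performs well on your particular parallel file system is another matter entirely):

```xml
<!-- core-site.xml: use the mounted (parallel) file system instead of HDFS.
     fs.defaultFS is the current property name; older Hadoop releases
     used fs.default.name for the same setting. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
</configuration>
```

The same can be done per-job with Hadoop's generic -fs option (e.g., hadoop jar job.jar -fs file:/// ...), so MapReduce reads and writes the Lustre/GPFS mount directly with no import step.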
> which is what you have to do with these inane "on-demand" solutions.
These "inane" solutions were missing from the national cyberinfrastructure portfolio here in the US, they were requested by users, and they were requested by our federal government. Being able to run Hadoop on a conventional HPC platform is a capability, and it uniquely enables us to do a variety of things (teaching, exploratory work, etc) that cannot be done on any other system in the US CI portfolio.
This is where my perspective from the operational HPC angle becomes apparent. Is this the best way of doing it? Absolutely not. Does it work, does it work reasonably well, does it provide a unique capability for the national research community, and does it satisfy our funding body? Yes, yes, yes, and yes.
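Concretely, the "on-demand" model is nothing more than a batch job that provisions a throwaway Hadoop cluster inside its own node allocation, runs the user's work, and tears everything down. A sketch of such a Torque/PBS job script, loosely in the myHadoop style (module names, script names, and paths here are illustrative and vary by site):

```shell
#!/bin/bash
#PBS -l nodes=4:ppn=8
#PBS -l walltime=01:00:00
# Illustrative only: provision a per-job Hadoop cluster, run one
# MapReduce job, then tear everything down before the job exits.
module load hadoop myhadoop                 # site-specific module names
export HADOOP_CONF_DIR=$PBS_O_WORKDIR/hadoop-conf.$PBS_JOBID

myhadoop-configure.sh -c $HADOOP_CONF_DIR   # build per-job configs from $PBS_NODEFILE
start-all.sh                                # bring up HDFS + MapReduce daemons

hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount input/ output/

stop-all.sh
myhadoop-cleanup.sh                         # remove per-job scratch state
```

Nothing about this requires root or a dedicated cluster, which is the whole point: the Hadoop instance lives and dies with the batch job.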
> 3. Java really doesn't matter for MR or for YARN. For HDFS (which, as mentioned, you shouldn't be using in HPC anyhow), yea, I'm not happy with it being in Java, but for MR, if you are experiencing issues relating to Java, you shouldn't be coding in MR. It's not a MR-ready job. You're being lazy.
I'm not clear on the point being made here. Should such a user not use Hadoop's MapReduce at all and defer to something else?
> MR should really only be used (in the HPC context) for pre- and post-computation analytics.
Says who? I don't disagree, but you're sounding like a computer scientist here. From the operational perspective, one never prescribes the tools/languages/libraries a researcher should use to solve a problem. This was one of the most insulting things I was told by HPC staff when I was a researcher.
> For all I care, it could be written in BASIC (or Visual BASIC, to bring the point home). Steady-state bandwidth to disk in Java is nearly equivalent to C. Ease of coding and scalability is what makes MR great.
Now is this generic MapReduce, or Hadoop's implementation of MapReduce (HIOMR)? Perhaps we should just call it "Hadoop." All jesting aside, this point isn't as clear to me. I will say, though, that I'm a big advocate of using non-Java languages to teach Hadoop to researchers. There are many cases where one language makes a lot more sense than another when writing MapReduce jobs at a practical level.
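To make the non-Java point concrete, here is a toy word-count sketch in Python in the style of a Hadoop Streaming job. The function names and the local sorted() stand-in for Hadoop's shuffle are mine, not any Hadoop API:

```python
from itertools import groupby

def mapper(lines):
    # Emit one (word, 1) pair per token. A real Hadoop Streaming mapper
    # would print these as tab-separated "word\t1" lines on stdout.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Hadoop sorts and shuffles mapper output by key before the reduce
    # phase, so identical keys arrive grouped together; sorted() stands
    # in for that shuffle here.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Local demonstration of the dataflow (no Hadoop required):
counts = dict(reducer(mapper(["the quick brown fox", "the lazy dog"])))
# counts["the"] == 2
```

Under Hadoop Streaming these would live in two separate scripts reading stdin and writing stdout, invoked roughly as hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input in/ -output out/ (the streaming jar's path varies by distribution).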
> 4. I'm not the biggest fan of HDFS, but this post misses the /entire/ point of HDFS: reliability in the face of (numerous) failures.
This is not the entire point of HDFS. MapReduce does not work without a distributed file system. Thus, I would argue that giving MapReduce a scalable foundation from which it can read its input data outstrips reliability in terms of its primary feature. You can HDFS without replication, but you can't HDFS without the DFS part.
> HDFS (which has a heritage in the Google FS, which the post also fails to mention, despite mentioning MR's heritage out of the same shop)
It sounds like you're just nitpicking here. Please be nice. I'm not that bad of a guy, honest.
> really was designed to be put on crappy hardware and provide really nice throughput. Responsiveness, POSIX semantics, etc, are all really inappropriate remarks.
Not from the user perspective. HDFS is a pain to use for our naive users. Again, from an operational perspective, 99% of our users want something that's easy to use. HDFS is not exactly there, so users get turned off by it. Users getting turned off means adoption is low. Adoption being low was the whole point of the post, so I maintain that it was not an inappropriate remark at all.
> It's like complaining about your dump truck not doing 0-60 faster than a Lamborghini.
And this is why I was careful to explain that there are reasons why the HDFS dump truck doesn't run like a Lamborghini. As above, this doesn't really make the user any happier, but that 1% will appreciate why it's clunky and take that into consideration as they figure out if it will solve their problem.
> Not the intention here and thus, I continue to believe it should not be used in HPC environments when most of them demand the Lamborghini for 90% of executions.
This is a bit silly. That's like saying accelerators "should not be used in HPC" because 90% of people don't use them. This is the difference between capability and capacity. If you provide it, you enable new things that could never be done before for a small subset of users.
Again, operationally, as long as you aren't enabling a capability at great expense to serving the capacity of your established user demand, what's wrong with letting people use Hadoop in an HPC environment? Sure, it's less pure, but operational HPC is a dirty business.