<div dir="ltr"><div dir="ltr">On Wed, Oct 14, 2020 at 11:32 AM Douglas Eadline <<a href="mailto:deadline@eadline.org">deadline@eadline.org</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

IMO, both Hadoop and Spark did not use MPI because they had<br>

a highly defined algorithm with specific performance goals.<br>

Many MR jobs, like those with Hadoop, are dynamic, requiring a<br>

varied resource load over the course of their lifetime.<br>

(Mapping uses a lot of resources, Reducing usually uses much less)<br>

<br>

Thus, the Hadoop scheduler, YARN, can dynamically reduce or<br>

increase the resources assigned to a running job. MPI does not<br>

provide such a dynamic resource allocation.<br>

Basically, MPI did not address their project goals.<br>

The authors were certainly aware of MPI (I worked with<br>

some of them on a book about YARN)<br>

</blockquote><div><br></div><div>Doug, I agree. Just to clarify, I was not asking why Spark or Hadoop did not start with MPI, but why the whole data science/ML/AI world did not look at MPI first and try to use it as the underlying mechanism (you answered that as well). I have nothing against MPI. In the DS/ML/AI world you also have things like Akka, which is basically message passing with the added bonus of being usable in "actor" settings that can be persisted over time (think of models that use new information to add to existing knowledge derived from previous information). Akka also brings the benefits of Scala: strong typing, correctness, lazy evaluation, reasoning about code, etc. Of course, something like Akka would never be applicable to the traditional HPC world, since we still live in the timeline and setting dictated by the 1970s (1960s?) concept of a job. Nothing is wrong with the job concept either (Spark/Hadoop also live in that context); I just thought I'd mention it.<br></div></div></div>