[Beowulf] [OT] MPI-haters

Justin Y. Shi shi at temple.edu
Sun Mar 6 14:58:17 PST 2016


Thanks for the questions.

It has been proved theoretically that it is impossible to implement
reliable communication in the face of [either sender or receiver] crashes.
Therefore, any parallel or distributed computing API that forces the
runtime system to generate fixed program-processor assignments is
theoretically incorrect. This answer also relates to your second question:
the impossibility means that 100% reliable communication is impossible.

Ironically, 100% reliable packet transmission is theoretically and
practically possible, as John and Nancy proved for John's dissertation.
These two seemingly conflicting results are in fact complementary. They
basically say that distributed and parallel application programming cannot
rely on reliable packet transmission, as all of our current distributed
and parallel programming APIs do.

Thus, MPI cannot be cost-effective in proportion to reliability, because
of the impossibility. The same applies to all other APIs that allow direct
program-program communications. We have found that <key, value> APIs, such
as those of Hadoop and Spark, are the only exceptions, for they allow the
runtime system to generate dynamic program-device bindings. To solve the
problem completely, the application programming logic must include the
correct retransmission discipline. I call this Statistic Multiplexed
Computing, or SMC. The Hadoop and Spark implementations did not go this far.
If we do complete the paradigm shift, then there will be no single point of
failure regardless of how the application scales. This claim covers all
computing and communication devices. This is the ultimate extreme scale
computing paradigm.
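
To make the idea concrete, here is a minimal sketch in Python. It is not
the Synergy or AnkaCom API; the TupleSpace class, its method names, and the
timeout value are all hypothetical and chosen only for illustration. Tasks
are posted under a key, any worker may take any task, and a task whose
worker never answers is offered again after a timeout.

```python
import time


class TupleSpace:
    """Toy in-memory <key, value> work pool (hypothetical, illustration only)."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds before an unanswered task is re-offered
        self.pending = {}       # key -> payload waiting for any worker
        self.leased = {}        # key -> (payload, time the task was handed out)
        self.results = {}       # key -> result

    def put_task(self, key, payload):
        self.pending[key] = payload

    def take_task(self, now=None):
        """Hand the next task to whichever worker asks; no fixed binding."""
        now = time.monotonic() if now is None else now
        # Retransmission discipline: re-offer tasks whose worker went silent.
        for key, (payload, leased_at) in list(self.leased.items()):
            if now - leased_at >= self.timeout:
                del self.leased[key]
                self.pending[key] = payload
        if not self.pending:
            return None
        key = next(iter(self.pending))
        payload = self.pending.pop(key)
        self.leased[key] = (payload, now)
        return key, payload

    def put_result(self, key, value):
        # First result wins, so a late duplicate from a slow worker is harmless.
        self.leased.pop(key, None)
        self.pending.pop(key, None)
        self.results.setdefault(key, value)


def demo():
    space = TupleSpace(timeout=1.0)
    for i in range(4):
        space.put_task(i, i)
    # Worker A takes one task at t=0 and then crashes without replying.
    lost_key, _ = space.take_task(now=0.0)
    # Worker B keeps pulling; after the timeout it also gets the lost task.
    now = 0.0
    while len(space.results) < 4:
        now += 0.5
        task = space.take_task(now=now)
        if task is not None:
            key, payload = task
            space.put_result(key, payload * payload)
    return lost_key, space.results
```

In this toy run the crashed worker's task is completed by the surviving
worker, so the job finishes as long as at least one worker remains. A real
runtime would also have to replicate the space itself, which this sketch
does not attempt.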

These answers are rooted in statistic multiplexing protocol research
(packet switching), which has proven in theory and in practice that 100%
reliable and scalable communications are indeed possible. Since all HPC
applications must deploy a large number of computing units via some sort of
interconnect (HP's The Machine may be the only exception), the only correct
APIs for extreme scale HPC are the ones that allow for complete
program-processor decoupling at runtime. Even the HP machine will benefit
from this research. Please note that the 100% reliability is conditional on
the availability of the "minimal viable set of resources". In computing and
communication, the minimal set size is 1 for every critical path.

My critics argued that there is no way a statistic multiplexed computing
runtime can compete against bare metal programs, such as MPI. We have
evidence to the contrary. In fact, the SMC runtime allows dynamic
adjustment of processing granularity without reprogramming. We can
demonstrate faster performance not only with heterogeneous processors but
also with homogeneous processors. We see this capability as critical for
extracting efficiency out of HPC clouds.
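
As a rough illustration of what adjusting granularity without
reprogramming can mean, the sketch below treats the size of a work unit as
a runtime parameter. The function names and the sum-of-squares workload are
hypothetical and are not taken from Synergy; the units are computed
sequentially here, where a real runtime would hand them to whichever
workers are available.

```python
def partition(n_items, granularity):
    """Split range(n_items) into (start, stop) units of at most `granularity` items."""
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    return [(start, min(start + granularity, n_items))
            for start in range(0, n_items, granularity)]


def sum_of_squares(data, granularity):
    """Compute the same answer for any granularity; only the unit size changes."""
    total = 0
    for start, stop in partition(len(data), granularity):
        # Each unit is independent, so any worker could compute it.
        total += sum(x * x for x in data[start:stop])
    return total
```

The application logic stays the same whether the data is cut into many
small units or a few large ones, so the unit size can be tuned to the
processors at hand.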

SERC 310
Temple University
shi at temple.edu

On Sun, Mar 6, 2016 at 5:02 PM, Peter St. John <peter.st.john at gmail.com>
wrote:

> Justin,
> I'm unsure just what you mean by some of what you said.
> "Any fixed program-processor binding is a single point failure"
> I'm troubled by the word "any". What about running two copies of a
> program, each with its own copy of the same data, on two processors (e.g.
> on a Tandem machine)? Surely that is not a single point of failure; is it
> not a "fixed program-processor binding"?
> "... it is impossible to implement reliable communication in the face
> of..."
> If by "reliable" you mean "perfectly reliable" then the thesis is trivial
> and does not require proof. Reliability is a metrical value with costs; the
> cost is space (e.g. for error-correcting codes) or time (e.g. for
> re-transmissions) or whatever. Do you mean that MPI is cost-ineffective in
> proportion to reliability? If so, why?
> Thanks,
> Peter
> On Sun, Mar 6, 2016 at 11:10 AM, Justin Y. Shi <shi at temple.edu> wrote:
>> Actually my interest in your group is not much between "hate" and "love"
>> of MPI or any other APIs. I am more interested in the "correctness" of
>> parallel APIs.
>> Three decades ago, not doing "bare metal" computing was impossible for
>> effective parallel processing. Today, insisting on "bare metal" computing
>> is detrimental to extreme scale efforts.
>> Any fixed program-processor binding is a single point failure. The
>> problem only shows when the application scales. And it is impossible to
>> implement reliable communication in the face of crashes [Alan Fekete, Nancy
>> Lynch and John Spinelli's 93 JACM paper proved this theoretically].
>> Therefore, any direct program-program communication API are theoretically
>> incorrect for extreme scaling applications.
>> The <key, value> pair API seems the only theoretically correct parallel
>> programming API that can take us out of the abyss of impossibilities.
>> However, systems like Hadoop and Spark have only showed the great promises
>> of program-device decoupling, they were not really designed for tackling
>> HPC applications. And their decoupling is incomplete by their runtime
>> implementations.
>> I proposed a Statistic Multiplexed Computing idea leveraging the
>> successes of <key, value> api systems and old Tuple Space semantics. My
>> github contribution is called Synergy3.0+.  You are welcome to check it out
>> and do a "bare metal" comparison against MPI and any other.
>> Our latest development is AnkaCom that was designed to tackling data
>> intensive HPC without scaling limits.
>> My apologies in advance for my shameless self-advertising.  I am looking
>> for serious collaborators who are interested in breaking this decade-old
>> barrier.
>> Justin Y. Shi
>> shi at temple.edu
>> SERC 310
>> Temple University
>> +1(215)204-6437
>> On Fri, Mar 4, 2016 at 10:14 AM, C Bergström <cbergstrom at pathscale.com>
>> wrote:
>>> A few people have subscribed and it's great to see some interest -
>>> hopefully we can start some interesting discussions. Actually - my
>>> background is more on the "web" side of HPC. I took a big jump when I
>>> started working @pathscale - Over the past 6 years I've cringed more
>>> than once when I see design that looks ***worse*** (I didn't think
>>> possible) than hibernate with tons of outer joins and evil xml
>>> configs.. (Java references for anyone unfortunate enough to get what
>>> I'm saying)
>>> On Fri, Mar 4, 2016 at 10:05 PM, Justin Y. Shi <shi at temple.edu> wrote:
>>> > Thank you for creating the list. I have subscribed.
>>> >
>>> > Justin
>>> >
>>> > On Fri, Mar 4, 2016 at 5:43 AM, C Bergström <cbergstrom at pathscale.com>
>>> > wrote:
>>> >>
>>> >> Sorry for the shameless self indulgence, but there seems to be a
>>> >> growing trend of love/hate around MPI. I'll leave my opinions aside,
>>> >> but at the same time I'd love connect and host a list where others who
>>> >> are passionate about scalability can vent and openly discuss ideas.
>>> >>
>>> >> Despite the comical name, I've created mpi-haters mailing list
>>> >>
>>> >> http://lists.pathscale.com/mailman/listinfo/mpi-haters_lists.pathscale.com
>>> >>
>>> >> To start things off - Some of the ideas I've been privately bouncing
>>> >> around
>>> >>
>>> >> Can current directive based approaches (OMP/ACC) be extended to scale
>>> >> out. (I've seen some research out of Japan on this or similar)
>>> >>
>>> >> Is Chapel c-like syntax similar enough to easily implement in clang
>>> >>
>>> >> Can one low level library succeed at creating a clean interface across
>>> >> all popular industry interconnects (libfabrics vs UCX)
>>> >>
>>> >> Real world success or failure of "exascale" runtimes? (What's your
>>> >> experience - lets not pull any punches)
>>> >>
>>> >> I won't claim to see ridiculous scalability in most web applications
>>> >> I've worked on, but they had so many tools available - Why have I
>>> >> never heard of memcache being used in a supercomputer and or why isn't
>>> >> sharding ever mentioned...
>>> >>
>>> >> Everyone is welcome and lets keep it positive and fun - invite your
>>> >> friends
>>> >>
>>> >>
>>> >> ./C
>>> >>
>>> >> ps - Apologies if you get this message more than once.
>>> >> _______________________________________________
>>> >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>> >> To change your subscription (digest mode or unsubscribe) visit
>>> >> http://www.beowulf.org/mailman/listinfo/beowulf
>>> >
>>> >