[Beowulf] Redundant Array of Independent Memory - fork(Re: Checkpointing using flash)

Justin YUAN SHI shi at temple.edu
Mon Sep 24 04:10:06 PDT 2012


I think the Redundant Memory paper is badly framed. It applies a
storage solution to a volatile-memory problem while still insisting
on eliminating volatility. It looks very much messed up.

My earlier comment on the OSI model still stands, even though MPI
implementations sit so far down the stack that they may not fit the
OSI model well. The MPI implementation, even at the transport layer,
does NOT re-transmit messages.

As you know, there are semantic differences between an MPI message and
a packet. Reliable packet transmission does not equal reliable
message transmission. When a machine running the MPI protocol stack
hangs, the entire app hangs. This is the root cause of all our
fault tolerance problems.
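
To make the packet-versus-message distinction concrete, here is a
minimal sketch in Python. It is not MPI code and does not reflect any
MPI implementation: the queues stand in for a transport that always
delivers, and the names flaky_worker, request_with_retransmit, and
demo are hypothetical. The peer receives every request (the "packet"
arrives) but produces no reply for the first few, as a hung node
would. A plain blocking receive would wait forever; resending the
whole message after a timeout lets the caller make progress.

```python
import queue
import threading


def flaky_worker(inbox, outbox, drop_first):
    """Hypothetical peer: ignores the first `drop_first` requests.

    The transport (the queue) delivered each request, but no reply
    is produced, which models a node that hung after delivery.
    """
    seen = 0
    while True:
        msg = inbox.get()
        if msg is None:  # shutdown signal
            return
        seen += 1
        if seen <= drop_first:
            continue  # request arrived, but the "message" is lost
        outbox.put(msg * 2)


def request_with_retransmit(inbox, outbox, payload,
                            timeout=0.2, max_tries=5):
    """Resend the whole request whenever no reply arrives in time."""
    for attempt in range(1, max_tries + 1):
        inbox.put(payload)
        try:
            return outbox.get(timeout=timeout), attempt
        except queue.Empty:
            continue  # no reply: retransmit at the message level
    raise TimeoutError("peer never answered")


def demo(drop_first=2):
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=flaky_worker,
                              args=(inbox, outbox, drop_first),
                              daemon=True)
    worker.start()
    try:
        return request_with_retransmit(inbox, outbox, 21)
    finally:
        inbox.put(None)  # let the worker thread exit
```

Calling demo() returns the reply together with the number of attempts
it took. Without the timeout and resend loop, the first outbox.get()
would block indefinitely, which is the hang described above.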

It also seems hard to fix. The cause is MPI's direct messaging
interface design (group communication excepted), and the current
group communication protocol implementations still do not handle
the issue.

Justin

On Mon, Sep 24, 2012 at 4:52 AM, Andrew Holway <andrew.holway at gmail.com> wrote:
>> I made a sketch :) http://bit.ly/TlkHpH
>
> Really? scheduled downtime? on a monday morning?
>
> new link :) http://bit.ly/RbpKW8


