[Beowulf] Checkpointing using flash
Justin YUAN SHI
shi at temple.edu
Sat Sep 29 20:40:17 PDT 2012
Something like that. But we don't want the app code to look too ugly.
My idea is to use a data parallel API. This is nothing new. In theory,
every MPI program can be translated into data parallel form. The magic
is the total transformation of the application architecture.
Traditionally computer, network and application architectures are
separate concerns. I think this is all WRONG.
Ultimately, the application architecture matters to the scientists.
Thus API design is more important than just passing information.
Using the data parallel API, we have the opportunity to build a smart
infrastructure that counters both communication and computation
transient faults by automatically re-transmitting data tokens that are
too slow to return results. The application can then claim performance
and reliability at the same time when scaling up the number of nodes.
The app code can still look elegant. For example, you can have multiple
Infiniband interfaces (some machines already have them) to help counter
the speed disparity between computing and communication.
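A minimal sketch of the re-transmission idea, assuming a shared pool
that hands out data tokens and re-issues any token whose result is
overdue. The TokenPool class, its method names, and the logical-tick
timeout are all hypothetical, made up for illustration; this is not an
existing API.

```python
from dataclasses import dataclass, field


@dataclass
class TokenPool:
    """Hypothetical pool that re-issues data tokens whose results are overdue."""
    timeout: int                                    # logical ticks before re-issue
    pending: dict = field(default_factory=dict)     # token_id -> payload
    in_flight: dict = field(default_factory=dict)   # token_id -> (payload, issue tick)
    results: dict = field(default_factory=dict)     # token_id -> result

    def put(self, token_id, payload):
        self.pending[token_id] = payload

    def fetch(self, now):
        """Hand out the next waiting token, or None if nothing is waiting."""
        self._reissue_overdue(now)
        if not self.pending:
            return None
        token_id, payload = next(iter(self.pending.items()))
        del self.pending[token_id]
        self.in_flight[token_id] = (payload, now)
        return token_id, payload

    def submit(self, token_id, result):
        """Record a result; return False if the token was already answered."""
        if token_id in self.results:
            return False
        self.results[token_id] = result
        self.in_flight.pop(token_id, None)
        self.pending.pop(token_id, None)
        return True

    def _reissue_overdue(self, now):
        # Any token that has been out for `timeout` ticks goes back in the pool.
        for token_id, (payload, issued) in list(self.in_flight.items()):
            if now - issued >= self.timeout:
                del self.in_flight[token_id]
                self.pending[token_id] = payload


pool = TokenPool(timeout=3)
pool.put("t0", 10)
pool.put("t1", 20)
a = pool.fetch(now=0)                 # worker A takes t0, then stalls
b = pool.fetch(now=0)                 # worker B takes t1
pool.submit(b[0], b[1] * 2)
c = pool.fetch(now=5)                 # t0 is overdue, so worker C gets it
pool.submit(c[0], c[1] * 2)
late = pool.submit(a[0], a[1] * 2)    # A finally answers: redundant
```

The app code only calls fetch and submit. The retry logic stays inside
the infrastructure, which is the point of the design.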
The only explicit error handling is for redundant result submissions.
I think the app can do a quorum if you care about Byzantine failures.
Otherwise, you can just harvest the first result and ignore the others.
I cannot figure out a better way to avoid this step. Maybe you all can
help.
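The two harvesting policies can be sketched roughly as follows. The
function names and the (token_id, result) submission format are
assumptions for illustration only, and results are assumed to be
hashable so they can be counted.

```python
from collections import Counter


def harvest_first(submissions):
    """Keep the first result per token; later duplicates are ignored."""
    accepted = {}
    for token_id, result in submissions:
        accepted.setdefault(token_id, result)
    return accepted


def harvest_quorum(submissions, quorum):
    """Accept a token's result only once `quorum` submissions agree on it."""
    votes = {}      # token_id -> Counter of results seen so far
    accepted = {}
    for token_id, result in submissions:
        if token_id in accepted:
            continue                      # already decided, ignore extras
        counts = votes.setdefault(token_id, Counter())
        counts[result] += 1
        if counts[result] >= quorum:
            accepted[token_id] = result
    return accepted


# The first answer for t0 is corrupted (99); two later answers agree on 20.
submissions = [("t0", 99), ("t1", 40), ("t0", 20), ("t0", 20), ("t1", 40)]
first = harvest_first(submissions)
agreed = harvest_quorum(submissions, quorum=2)
```

Here the first-wins policy accepts the bad value for t0, while the
quorum policy waits for two matching submissions. The quorum costs
extra redundant work, which is the trade-off if you care about
Byzantine failures.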
On Sat, Sep 29, 2012 at 9:46 AM, Lux, Jim (337C)
<james.p.lux at jpl.nasa.gov> wrote:
> On 9/29/12 2:29 AM, "Justin YUAN SHI" <shi at temple.edu> wrote:
>>I missed this thread. Got busy with classes. Sorry.
>>Going back to Jim's comments on Infiniband and OSI and MPI. I see
>>that exascale computing requires us to rethink MPI's insistence on
>>sending messages directly. Even with the group communicators, the
>>insists on the same.
>>The problem with direct communication is that you leave the
>>application without a recourse when the transmission fails. As we have
>>discussed, any transient fault can cause that to happen. It is
>>practically impossible to provide redundancy for every transmission
>>unless we change our API design to eliminate the reliable
>>communication assumption. Application-level re-transmission will
>>allow the application to survive NOT only communication failures
>>but also node failures (when you lose a chunk of memory). But the MPI
>>semantics do not allow this to happen, even if the implementation
>>tries to re-transmit a failed message.
> So what you're thinking is that the conceptual message passing be more
> like UDP sockets?
> That we explicitly accept that a "send" might not work, and in fact, may
> "fail silently".
> Yes.. That is a key aspect, and the higher level algorithm that uses it
> needs to explicitly account for it: by multiple transmissions, multiple
> paths, coding (in the ECC sense) or something else.