<div dir="ltr">Thanks, Chris, for the links. I took a quick look at the ULFM work. It is really encouraging to see these types of efforts instead of sweeping transient failures under the carpet.<div><br></div><div>
The hard issues are hidden inside those transient failures, though. For exascale applications, one must prove an architecture that gains application performance and reliability at the same time as processors and interconnects are added.</div>
<div><br></div><div>The fix is easier than it seems on the surface. We call it Statistic Multiplexed Computing (SMC). I gave a talk on it last year at NCAR; here is the link to the slides and video: <a href="https://sea.ucar.edu/event/statistic-multiplexed-computing-smc-neglected-path-unlimited-application-scalability">https://sea.ucar.edu/event/statistic-multiplexed-computing-smc-neglected-path-unlimited-application-scalability</a></div>
<div><br></div><div>I am also helping organize InterCloud HPC 2014 in Italy this year; please submit your work if you are interested.</div><div><br></div><div>Many thanks in advance!</div><div><br></div><div>Justin Y. Shi</div><div><a href="mailto:shi@temple.edu">shi@temple.edu</a></div>
</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Mar 5, 2014 at 9:49 PM, Christopher Samuel <span dir="ltr"><<a href="mailto:samuel@unimelb.edu.au" target="_blank">samuel@unimelb.edu.au</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">-----BEGIN PGP SIGNED MESSAGE-----<br>
Hash: SHA1<br>
<br>
On 06/03/14 03:07, Joe Landman wrote:<br>
<br>
> I've not done much with MPI in a few years, have they extended it<br>
> beyond MPI_Init yet? Can MPI procs just join a "borgified"<br>
> collective, preserve state so restarts/moves/reschedules of ranks<br>
> are cheap? If not, what is the replacement for MPI that will do<br>
> this?<br>
<br>
</div>Oops, I forgot this in my previous email - I stumbled across the University of<br>
Tennessee's ULFM (User Level Failure Mitigation) project, which has a<br>
WordPress blog here:<br>
<br>
<a href="http://fault-tolerance.org/" target="_blank">http://fault-tolerance.org/</a><br>
<br>
There is a PDF of a two-page flyer from SC13 on the site which<br>
gives an overview and describes it thus:<br>
<br>
<a href="http://fault-tolerance.org/wp-content/uploads/2013/12/SC13-ULFM.pdf" target="_blank">http://fault-tolerance.org/wp-content/uploads/2013/12/SC13-ULFM.pdf</a><br>
<br>
# User Level Failure Mitigation is a set of MPI interface extensions<br>
# enabling Message Passing programs to restore MPI communication<br>
# capabilities affected by process failures. It supports rebuilding<br>
# communicators, RMA windows and I/O Files<br>
<br>
All the best,<br>
<div class="">Chris<br>
- --<br>
Christopher Samuel Senior Systems Administrator<br>
VLSCI - Victorian Life Sciences Computation Initiative<br>
Email: <a href="mailto:samuel@unimelb.edu.au">samuel@unimelb.edu.au</a> Phone: <a href="tel:%2B61%20%280%293%20903%2055545" value="+61390355545">+61 (0)3 903 55545</a><br>
<a href="http://www.vlsci.org.au/" target="_blank">http://www.vlsci.org.au/</a> <a href="http://twitter.com/vlsci" target="_blank">http://twitter.com/vlsci</a><br>
<br>
-----BEGIN PGP SIGNATURE-----<br>
Version: GnuPG v1.4.14 (GNU/Linux)<br>
Comment: Using GnuPG with Thunderbird - <a href="http://www.enigmail.net/" target="_blank">http://www.enigmail.net/</a><br>
<br>
</div>iEYEARECAAYFAlMX4kMACgkQO2KABBYQAh9iXgCffxwP07z91by2FCHxVRwtTl4Q<br>
yTUAni3Xn0C+Nla0rS4HwW2dfF4Czb0Q<br>
=yWTJ<br>
<div class="HOEnZb"><div class="h5">-----END PGP SIGNATURE-----<br>
_______________________________________________<br>
Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>
</div></div></blockquote></div><br></div>