<div dir="ltr"><div><div><div><div><div><div><div><div><br>
> We need to get to that place. Right now, our job scheduling, while
> quite sophisticated in rule sets, is firmly entrenched in ideas from the
> 70's and 80's. "New" concepts in (pub sub, etc.) schedulers are needed
> for really huge scale. Fully distributed, able to route around
> failure. Not merely tolerate it, but adapt to it.

There is a definite 'sea change' in HPC / Beowulfery at the moment.
Beowulf has always been about adopting COTS technology of course - but
these days I see COTS as being what the web scale folks use.
So: using configuration management such as Chef/Puppet, and using
OpenStack etc. for deployment. Also the hardware from Open Compute.
And Map/Reduce etc., and, as you say Joe, alternative schedulers.

A few months ago I went to a seminar on Maxeler's dataflow computing.
You can imagine launching a compute job on a fabric of not very powerful
nodes, each of which works on a part of the problem. The problem 'flows'
across the fabric - but if there are holes in the fabric (i.e. node
failures) the computation still proceeds. Definitely hand waving on my
part here, but as you say, we have to get to where node failure is
something the system simply works around. A toy sketch of what I mean
is below.
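
To make my hand waving a little less vague, here is a toy Python sketch
(nothing to do with Maxeler's actual tooling - the fabric, the node
functions and the failure set are all invented) of work flowing across a
fabric and simply routing around the holes:

    # Toy sketch only - nothing to do with Maxeler's real tooling. Work items
    # flow across the columns of a fabric of small nodes; any healthy node in
    # a column can do that column's step, so holes are simply routed around.
    import random

    FABRIC_WIDTH = 8                       # columns the data flows across
    NODES_PER_COLUMN = 4                   # small, not-very-powerful nodes
    failed = {(2, 1), (5, 0), (5, 3)}      # holes in the fabric: (column, node)

    # each node is just a tiny pure function here
    fabric = [[(lambda x, c=col: x + c) for _ in range(NODES_PER_COLUMN)]
              for col in range(FABRIC_WIDTH)]

    def flow(value):
        """Push one work item across the fabric, column by column."""
        for col in range(FABRIC_WIDTH):
            healthy = [n for n in range(NODES_PER_COLUMN) if (col, n) not in failed]
            if not healthy:
                raise RuntimeError("column %d is completely dead" % col)
            value = fabric[col][random.choice(healthy)](value)
        return value

    print([flow(v) for v in range(5)])     # still completes despite the holes

A real dataflow engine does this in hardware rather than in a Python
loop, of course, but that is the behaviour I would like our schedulers
and runtimes to have.
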
As an aside, I am just tuning into the Bright Computing webinar, and
they are coming out strongly in favour of OpenStack.

Also, if anyone else is in the UK, look for Scale Summit on the 21st of
March. I should be there.

PS. Joe - what is 'pub sub'? Yes, I know I can Google.

On 5 March 2014 16:07, Joe Landman <landman@scalableinformatics.com> wrote:
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">On 03/05/2014 10:55 AM, Douglas Eadline wrote:<br>
</div><div class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
-----BEGIN PGP SIGNED MESSAGE-----<br>
Hash: SHA1<br>
<br>
On 05/03/14 13:52, Joe Landman wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I think the real question is would the system be viable in a<br>
commercial sense, or is this another boondoggle?<br>
</blockquote>
>>>
>>> At the Slurm User Group last year Dona Crawford of LLNL gave the
>>> keynote and as part of that talked about some of the challenges of
>>> exascale.
>>>
>>> The one everyone thinks about first is power, but the other one she
>>> touched on was reliability and uptime.
>>
>> Indeed, the fact that these issues were not even mentioned
>> means to me the project is not very well thought out.
>> At exascale (using current tech) failure recovery must be built
>> into any design, in software and/or hardware.
> Yes ... such designs must assume that there will be failure, and manage
> this. The issue, last I checked, is that most people coding to MPI can't
> use, or haven't used, MPI's resiliency features.
>
> Checkpoint/restart (CPR) on this scale is simply not an option, given
> that the probability of a failure occurring during CPR very rapidly
> approaches unity. CPR is built with the implicit assumption that copy
> out/copy back is *absolutely* reliable and will not fail. Ever.
>
> One way to circumvent portions of the issue is to use the SSD-on-DIMM
> designs to do very local "snapshot"-like CPR, and add in erasure coding
> and other FEC for the data, so you can accept some small amount of
> failure in the copy out or copy back.
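
If I have understood you right Joe, in miniature that is something like
the sketch below: a node-local snapshot split into chunks plus one
parity block, so a single lost or unreadable chunk can be rebuilt. The
paths and the single XOR parity are mine and purely illustrative; a real
system would use proper erasure codes (Reed-Solomon and friends) and the
SSD-on-DIMM device rather than /tmp.

    # Rough sketch of 'local snapshot plus a little FEC' - the paths and the
    # single XOR parity block are purely illustrative; a real system would use
    # proper erasure codes and the SSD-on-DIMM device, not /tmp.
    import pickle, functools
    from pathlib import Path

    def xor_blocks(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def snapshot(state, path, k=4):
        """Write k data chunks plus one parity chunk to node-local storage,
        so any single lost or unreadable chunk can be rebuilt."""
        d = Path(path); d.mkdir(parents=True, exist_ok=True)
        blob = pickle.dumps(state)
        size = -(-len(blob) // k)                   # ceiling division
        chunks = [blob[i*size:(i+1)*size].ljust(size, b"\0") for i in range(k)]
        for i, chunk in enumerate(chunks):
            (d / ("chunk%d" % i)).write_bytes(chunk)
        (d / "parity").write_bytes(functools.reduce(xor_blocks, chunks))
        (d / "meta").write_text("%d %d" % (k, len(blob)))

    def restore(path):
        """Reload a snapshot, reconstructing at most one missing chunk."""
        d = Path(path)
        k, length = map(int, (d / "meta").read_text().split())
        chunks = []
        for i in range(k):
            f = d / ("chunk%d" % i)
            chunks.append(f.read_bytes() if f.exists() else None)
        if None in chunks:                          # rebuild the lost chunk from parity
            survivors = [c for c in chunks if c is not None] + [(d / "parity").read_bytes()]
            chunks[chunks.index(None)] = functools.reduce(xor_blocks, survivors)
        return pickle.loads(b"".join(chunks)[:length])

    snapshot({"step": 12345, "u": [0.1] * 1000}, "/tmp/ckpt")   # hypothetical local path
    (Path("/tmp/ckpt") / "chunk2").unlink()                     # lose a chunk ...
    print(restore("/tmp/ckpt")["step"])                         # ... and still recover 12345
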
>>
>>> Basically if you scale a current petascale system up to exascale you
>>> are looking at an expected full-system uptime of between seconds and
>>> minutes. For comparison Sequoia, their petaflop BG/Q, has a
>>> systemwide MTBF of about a day.
>>
>> I recall that HPL will take about 6 days to run
>> on an exascale machine.
>>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
That causes problems if you're expecting to do checkpoint/restart to<br>
cope with failures, so really you've got to look at fault tolerances<br>
within applications themselves. Hands up if you've got (or know of)<br>
a code that can gracefully tolerate and meaningfully continue if nodes<br>
going away whilst the job is running?<br>
</blockquote>
>>
>> I would hate to have my $50B machine give me the wrong answer
>> when such large amounts of money are involved. And we all know
>> it is going to kick out "42" at some point.
> Or the complete works of Shakespeare
> (http://en.wikipedia.org/wiki/Infinite_monkey_theorem), though this
> would be more troubling than 42.
>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
The Slurm folks is already looking at this in terms of having some way<br>
of setting up a bargaining with the scheduler in case of node failure<br>
</blockquote>
>>
>> As a side point, the Hadoop YARN scheduler allows dynamic resource
>> negotiation while the program is running, so if a node or rack dies
>> a job can request more resources. For MR this is rather easy to do
>> because of the functional nature of the process.
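
That functional nature really is the crux: a pure task can simply be run
again somewhere else. Below is a completely made-up toy (no actual YARN
calls; the node names and failure set are invented) showing why
rescheduling is trivial when tasks have no side effects:

    # Made-up toy: pure map tasks have no side effects, so when a node or
    # rack dies the scheduler just runs the same task again elsewhere.
    FAILED_NODES = {"node07", "node12"}
    nodes = ["node%02d" % i for i in range(1, 17)]

    def run_on(node, task, datum):
        if node in FAILED_NODES:                 # the node (or its rack) is gone
            raise ConnectionError(node)
        return task(datum)

    def schedule(task, data):
        results, pending = {}, list(enumerate(data))
        while pending:
            idx, datum = pending.pop()
            node = nodes[idx % len(nodes)]       # naive placement
            try:
                results[idx] = run_on(node, task, datum)
            except ConnectionError:
                nodes.remove(node)               # give that slot up ...
                pending.append((idx, datum))     # ... and re-run the pure task elsewhere
        return [results[i] for i in range(len(data))]

    print(schedule(lambda x: x * x, range(10)))  # completes despite the dead node
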
> We need to get to that place. Right now, our job scheduling, while
> quite sophisticated in rule sets, is firmly entrenched in ideas from the
> 70's and 80's. "New" concepts in (pub sub, etc.) schedulers are needed
> for really huge scale. Fully distributed, able to route around failure.
> Not merely tolerate it, but adapt to it.
>
> This is going to require that we code to reality, not a fictional
> universe where nodes never fail, storage/networking never goes
> offline ...
>
> I've not done much with MPI in a few years - have they extended it
> beyond MPI_Init yet? Can MPI procs just join a "borgified" collective,
> preserve state so restarts/moves/reschedules of ranks are cheap? If not,
> what is the replacement for MPI that will do this?
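
As far as I know, MPI-2 did add dynamic process management
(MPI_Comm_spawn, MPI_Comm_connect/MPI_Comm_accept), which is the closest
thing to ranks joining at runtime, though it is a long way from a
'borgified' collective, and not every MPI installation or resource
manager combination will let you spawn. A rough self-spawning mpi4py
sketch, purely illustrative:

    # spawn_demo.py - rough sketch of MPI-2 dynamic process management via
    # mpi4py; run as:  mpiexec -n 1 python spawn_demo.py
    # (the file name and the trivial 'square a number' task are made up)
    import sys
    from mpi4py import MPI

    parent = MPI.Comm.Get_parent()

    if parent == MPI.COMM_NULL:
        # Original process: spawn 4 workers at runtime, well after MPI_Init,
        # and hand each one a piece of work.
        workers = MPI.COMM_SELF.Spawn(sys.executable, args=[__file__], maxprocs=4)
        n = workers.Get_remote_size()
        for rank in range(n):
            workers.send({"task": rank}, dest=rank)
        print("gathered:", [workers.recv(source=rank) for rank in range(n)])
        workers.Disconnect()
    else:
        # Spawned worker: fetch work from the parent, do it, send the result back.
        task = parent.recv(source=0)
        parent.send(task["task"] ** 2, dest=0)
        parent.Disconnect()
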
> FWIW, folks on Wall Street use pub sub, message passing (a la AMPS,
> *MQ, ...) to handle some elements of this.
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics, Inc.
> email: landman@scalableinformatics.com
> web  : http://scalableinformatics.com
> twtr : @scalableinfo
> phone: +1 734 786 8423 x121
> cell : +1 734 612 4615
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf