<div dir="ltr">We routinely run jobs that last for months. Some are codes that have an endpoint; others are processes that provide some service (SOLR, ElasticSearch, etc.) and have no defined endpoint. Unless you have some seriously flaky hardware or ongoing power/cooling issues, there is nothing special needed to get high uptimes. I'd suggest making NFS mounts hard, so that I/O blocks and retries until the server comes back and processes can recover from an NFS server reboot. Another good idea is to start several copies of an important run on several different nodes, preferably on different racks/PDUs/UPSes. <div><br></div><div>The framing of your question as a thought exercise does open up the possibility for commentary, though. Challenge accepted. </div><div><br></div><div>A question like this will pique the interest of anyone seeking to justify their existence through the application of BS, myself included. To a project manager, this problem you have is a veritable goldmine of opportunity. To a vendor, you have opened up Pandora's box and are the potential sale that will allow them to buy their kid the GI Joe with the kung-fu grip this Christmas. No solution is too costly or too complex to apply to this challenge, and the planning and execution must be detailed and require committees, phone calls, copious emails, a procurement process, budgetary analysis, more training, and most certainly additional staffing...</div><div><br></div><div>Fortunately, xkcd has a cartoon for this: <a href="http://xkcd.com/1445/">http://xkcd.com/1445/</a></div><div><br></div><div>You should take this opportunity to perform an experiment for us. You'll need two groups. First, the experimental group. Contact a local project manager (just follow the trail of Six Sigma motivational posters) and explain your problem. Be detailed; spend some time googling up as many buzzwords as possible for your explanation. Look earnest and sincere as you present your case and drop hints about how you too may be interested in some of that "wonderful sigma training". 
Start a timer.</div><div><br></div><div>Now, the control group. When you get back to your desk, go drag out some of those old workstations that were destined for surplus. Put them on some scrounged-up, half-dead desk-side UPS and start a half dozen copies of your code. Start a second timer, then go back to your normal duties.</div><div><br></div><div>In a few months, report back which approach produced the better results as measured by time-to-run-completion and cost in dollars per completed run. Don't forget to have the project manager provide a detailed time record of hours spent by all the people they involve in the process. </div><div><br></div><div>I look forward to the results. </div><div><br></div><div>jbh</div></div><br><div class="gmail_quote"><div dir="ltr">On Wed, Oct 26, 2016 at 1:57 AM Skylar Thompson &lt;<a href="mailto:skylar.thompson@gmail.com">skylar.thompson@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Assuming you can contain a run on a single node, you could use<br class="gmail_msg">
containers and the freezer controller (plus maybe LVM snapshots) to do<br class="gmail_msg">
checkpoint/restart.<br class="gmail_msg">
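A rough sketch of the freezer half of that idea, assuming a cgroup-v1 freezer hierarchy where writing FROZEN or THAWED to a group's freezer.state file pauses or resumes its tasks. The names frozen, snapshot_while_frozen and take_snapshot are made up for illustration, and the demo uses a temporary directory as a stand-in for a real cgroup path. The freezer only pauses tasks; it does not save process memory, so this just keeps the on-disk state quiet while a snapshot is taken.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def frozen(cgroup_dir):
    """Pause all tasks in a cgroup-v1 freezer group; resume them on exit.

    cgroup_dir is assumed to be a freezer cgroup directory, for example
    something under /sys/fs/cgroup/freezer/ (the exact path is site-specific).
    """
    state = Path(cgroup_dir) / "freezer.state"
    state.write_text("FROZEN\n")
    # On a real system the group may report FREEZING for a moment;
    # a careful version would poll the file until it reads FROZEN.
    try:
        yield state
    finally:
        # Always thaw, even if the snapshot step raised an exception.
        state.write_text("THAWED\n")


def snapshot_while_frozen(cgroup_dir, take_snapshot):
    """Run take_snapshot() (a hypothetical callable, e.g. one that
    shells out to an LVM snapshot command) while the group is frozen."""
    with frozen(cgroup_dir):
        return take_snapshot()


# Demo against a stand-in directory, so nothing real is frozen.
demo_dir = tempfile.mkdtemp()
snapshot_while_frozen(demo_dir, lambda: print("snapshot would run here"))
print((Path(demo_dir) / "freezer.state").read_text().strip())
```

Using a context manager means the run is thawed even when the snapshot step fails, which matters when the frozen thing is a three-week job.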
<br class="gmail_msg">
Skylar<br class="gmail_msg">
<br class="gmail_msg">
On 10/25/2016 11:24 AM, Michael Di Domenico wrote:<br class="gmail_msg">
> here's an interesting thought exercise and a real problem i have to tackle.<br class="gmail_msg">
><br class="gmail_msg">
> i have researchers that want to run magma codes for three weeks or<br class="gmail_msg">
> so at a time. the process is unfortunately sequential in nature and<br class="gmail_msg">
> magma doesn't support check pointing (as far as i know) and (I don't<br class="gmail_msg">
> know much about magma)<br class="gmail_msg">
><br class="gmail_msg">
> So the question is;<br class="gmail_msg">
><br class="gmail_msg">
> what kind of a system could one design/buy using any combination of<br class="gmail_msg">
> hardware/software that would guarantee that this program would run for<br class="gmail_msg">
> 3 wks or so and not fail<br class="gmail_msg">
><br class="gmail_msg">
> and by "fail" i mean from some system type error, ie memory faulted,<br class="gmail_msg">
> cpu faulted, network io slipped (nfs timeout) as opposed to "there's a<br class="gmail_msg">
> bug in magma" which already bit us a few times<br class="gmail_msg">
><br class="gmail_msg">
> there's probably some commercial or "unreleased" commercial product on<br class="gmail_msg">
> the market that might fill this need, but i'm also looking for<br class="gmail_msg">
> something "creative" as well<br class="gmail_msg">
><br class="gmail_msg">
> three weeks isn't a big stretch compared to some of the other codes<br class="gmail_msg">
> i've heard around the DOE that run for months, but it's still pretty<br class="gmail_msg">
> painful to have a run go for three weeks and then fail 2.5 weeks in<br class="gmail_msg">
> and have to restart. most modern day hardware would probably support<br class="gmail_msg">
> this without issue, but i'm looking for more of a guarantee than a<br class="gmail_msg">
> prayer<br class="gmail_msg">
><br class="gmail_msg">
> double bonus points for anything that runs at high clock speeds >3GHz<br class="gmail_msg">
><br class="gmail_msg">
> any thoughts?<br class="gmail_msg">
> _______________________________________________<br class="gmail_msg">
> Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" class="gmail_msg" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br class="gmail_msg">
> To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" rel="noreferrer" class="gmail_msg" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br class="gmail_msg">
><br class="gmail_msg">
<br class="gmail_msg">
</blockquote></div><div dir="ltr">-- <br></div><div data-smartmail="gmail_signature"><div dir="ltr"><div>‘[A] talent for following the ways of yesterday, is not sufficient to improve the world of today.’</div><div> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC</div></div></div>