<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>I would be laughing if this wasn't so true. <br>
</p>
<p>The sad thing is, the person who took on this convoluted,
BS-heavy approach would probably get promoted for managing a
"large, complicated project with many moving parts" while the guy
who took Gavin's approach would continue to toil away in his
basement office for quietly getting the job done quickly and
saving money. <br>
</p>
<pre class="moz-signature" cols="72">Prentice </pre>
<div class="moz-cite-prefix">On 10/25/2016 11:45 PM, John Hanks
wrote:<br>
</div>
<blockquote
cite="mid:CAGrHuK7woJdEfA+Y18=BVdcmbsek73_O7oO5U71bZ=PS8dTKeA@mail.gmail.com"
type="cite">
<div dir="ltr">We routinely run jobs that last for months. Some
are codes that have an endpoint; others are processes that
provide some service (SOLR, ElasticSearch, etc.) and have
no defined endpoint. Unless you have some seriously flaky
hardware or ongoing power/cooling issues, there is nothing
special needed to get high uptimes. I'd suggest making NFS
mounts hard, so processes can recover from an NFS server reboot.
Another good idea is to start several copies of an important run
on several different nodes, preferably in different
racks/PDUs/UPS.
<div><br>
</div>
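<div>As a rough sketch of why several copies help: if each copy
fails independently with the same probability p over the length
of the run, the chance that at least one of n copies finishes is
1 - p^n. Independence is an assumption here (copies sharing a
rack, PDU or UPS are not independent, which is the reason to
spread them out), and the function name and the 10% figure below
are made up for illustration.</div>
<pre>
```python
def chance_one_copy_finishes(p_fail, copies):
    """Probability that at least one of `copies` independent runs completes.

    p_fail is an assumed probability that a single run is killed by a
    hardware or infrastructure fault before it finishes.
    """
    # The whole effort fails only if every copy fails: p_fail ** copies.
    return 1.0 - p_fail ** copies


# Hypothetical figure: a 10% chance that any one three-week run dies.
for copies in (1, 2, 3, 6):
    print(copies, round(chance_one_copy_finishes(0.10, copies), 6))
```
</pre>
<div>The real failure rate of a given set of nodes is something
you would have to measure; the point is only that each extra
independent copy multiplies the chance of total failure by p.</div>
<div><br>
</div>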
<div>The frame of your question as a thought exercise does open
up the possibility for commentary, though. Challenge accepted. </div>
<div><br>
</div>
<div>A question like this will pique the interest of anyone
seeking to justify their existence through the application of
BS, myself included. To a project manager, this problem you
have is a veritable goldmine of opportunity. To a vendor, you
have opened up Pandora's box and are the potential sale that
will allow them to buy their kid the GI Joe with the kung-fu
grip this Christmas. No solution is too costly or too complex
to apply to this challenge, and the planning and execution
must be detailed, require committees, phone calls, copious
emails, a procurement process, budgetary analysis, more
training, most certainly additional staffing....</div>
<div><br>
</div>
<div>Fortunately xkcd has a cartoon for this: <a
moz-do-not-send="true" href="http://xkcd.com/1445/">http://xkcd.com/1445/</a></div>
<div><br>
</div>
<div>You should take this opportunity to perform an experiment
for us. You'll need two groups. First, the experimental group.
Contact a local project manager (just follow the trail of Six
Sigma motivational posters) and explain your problem. Be
detailed, spend some time googling up as many buzzwords as
possible for your explanation. Look earnest and sincere as you
present your case and drop hints about how you too may be
interested in some of that "wonderful sigma training". Start a
timer.</div>
<div><br>
</div>
<div>Now, the control group. When you get back to your desk, go
drag out some of those old workstations that were destined for
surplus. Put them on some scrounged-up, half-dead desk-side
UPS and start a half dozen copies of your code. Start a second
timer, then go back to your normal duties.</div>
<div><br>
</div>
<div>In a few months, report back which approach produced the
best results as measured by time-to-run-completion and cost in
dollars per completed run. Don't forget to have the project
manager provide a detailed time record of hours spent by all
the people they involve in the process. </div>
<div><br>
</div>
<div>I look forward to the results. </div>
<div><br>
</div>
<div>jbh</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr">On Wed, Oct 26, 2016 at 1:57 AM Skylar Thompson
&lt;<a moz-do-not-send="true"
href="mailto:skylar.thompson@gmail.com">skylar.thompson@gmail.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Assuming you
can contain a run on a single node, you could use<br
class="gmail_msg">
containers and the freezer controller (plus maybe LVM
snapshots) to do<br class="gmail_msg">
checkpoint/restart.<br class="gmail_msg">
<br class="gmail_msg">
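A minimal sketch of the freezer step only, assuming a cgroup v1
freezer hierarchy and a hypothetical cgroup directory for the
job: writing FROZEN or THAWED to freezer.state pauses or resumes
every process in the group. The helper names are made up for
illustration.<br class="gmail_msg">
<pre>
```python
from pathlib import Path


def set_freezer_state(cgroup_dir, frozen):
    """Write FROZEN or THAWED to the cgroup v1 freezer.state file."""
    state = "FROZEN" if frozen else "THAWED"
    Path(cgroup_dir, "freezer.state").write_text(state)
    return state


def read_freezer_state(cgroup_dir):
    """Return the current contents of freezer.state, stripped."""
    return Path(cgroup_dir, "freezer.state").read_text().strip()


# Hypothetical usage (needs root and an existing freezer cgroup that
# contains the job's processes; the path is an assumption):
#   set_freezer_state("/sys/fs/cgroup/freezer/magma_run", True)
#   ... take the LVM snapshot here ...
#   set_freezer_state("/sys/fs/cgroup/freezer/magma_run", False)
```
</pre>
Freezing only pauses the processes in memory, so on its own it
does not protect against the node dying. Taking the snapshot
while the group is frozen at least means the files on disk are
not changing underneath it.<br class="gmail_msg">
<br class="gmail_msg">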
Skylar<br class="gmail_msg">
<br class="gmail_msg">
On 10/25/2016 11:24 AM, Michael Di Domenico wrote:<br
class="gmail_msg">
> here's an interesting thought exercise and a real problem
i have to tackle.<br class="gmail_msg">
><br class="gmail_msg">
> i have researchers that want to run magma codes for
three weeks or<br class="gmail_msg">
> so at a time. the process is unfortunately sequential in
nature and<br class="gmail_msg">
> magma doesn't support checkpointing (as far as i know)
and (I don't<br class="gmail_msg">
> know much about magma)<br class="gmail_msg">
><br class="gmail_msg">
> So the question is;<br class="gmail_msg">
><br class="gmail_msg">
> what kind of a system could one design/buy using any
combination of<br class="gmail_msg">
> hardware/software that would guarantee that this program
would run for<br class="gmail_msg">
> 3 wks or so and not fail<br class="gmail_msg">
><br class="gmail_msg">
> and by "fail" i mean from some system type error, ie
memory faulted,<br class="gmail_msg">
> cpu faulted, network io slipped (nfs timeout) as opposed
to "there's a<br class="gmail_msg">
> bug in magma" which already bit us a few times<br
class="gmail_msg">
><br class="gmail_msg">
> there's probably some commercial or "unreleased"
commercial product on<br class="gmail_msg">
> the market that might fill this need, but i'm also
looking for<br class="gmail_msg">
> something "creative" as well<br class="gmail_msg">
><br class="gmail_msg">
> three weeks isn't a big stretch compared to some of the
other codes<br class="gmail_msg">
> i've heard around the DOE that run for months, but it's
still pretty<br class="gmail_msg">
> painful to have a run go for three weeks and then fail
2.5 weeks in<br class="gmail_msg">
> and have to restart. most modern day hardware would
probably support<br class="gmail_msg">
> this without issue, but i'm looking for more of a
guarantee than a<br class="gmail_msg">
> prayer<br class="gmail_msg">
><br class="gmail_msg">
> double bonus points for anything that runs at high clock
speeds >3GHz<br class="gmail_msg">
><br class="gmail_msg">
> any thoughts?<br class="gmail_msg">
> _______________________________________________<br
class="gmail_msg">
> Beowulf mailing list, <a moz-do-not-send="true"
href="mailto:Beowulf@beowulf.org" class="gmail_msg"
target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin
Computing<br class="gmail_msg">
> To change your subscription (digest mode or unsubscribe)
visit <a moz-do-not-send="true"
href="http://www.beowulf.org/mailman/listinfo/beowulf"
rel="noreferrer" class="gmail_msg" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br
class="gmail_msg">
><br class="gmail_msg">
<br class="gmail_msg">
</blockquote>
</div>
<div dir="ltr">-- <br>
</div>
<div data-smartmail="gmail_signature">
<div dir="ltr">
<div>‘[A] talent for following the ways of yesterday, is not
sufficient to improve the world of today.’</div>
<div> - King Wu-Ling, ruler of the Zhao state in northern
China, 307 BC</div>
</div>
</div>
<br>
</blockquote>
<br>
</body>
</html>