[Beowulf] non-stop computing
landman at scalableinformatics.com
Wed Oct 26 06:52:13 PDT 2016
Licensing might impede this ... Usually does.
On 10/26/2016 09:50 AM, Prentice Bisbal wrote:
> There is an amazing beauty in this simplicity.
> On 10/25/2016 02:46 PM, Gavin W. Burris wrote:
>> Hi, Michael.
>> What if the same job ran on two separate nodes, with IO to local
>> scratch? What are the odds both nodes would fail in that three-week
>> period? No special hardware / software required. Simple. Done.
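The "what are the odds" question can be put into rough numbers. Below is a minimal sketch, assuming each node fails independently with a constant failure rate (an exponential model) and a purely hypothetical per-node MTBF of one year. The function names and the MTBF value are illustrative assumptions, not measurements from any real cluster.

```python
import math

def p_fail(run_days, mtbf_days):
    """Probability that one node fails at least once during the run,
    assuming a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-run_days / mtbf_days)

def p_all_fail(run_days, mtbf_days, copies):
    """Probability that every one of `copies` independent nodes fails
    during the run, i.e. no copy of the job survives."""
    return p_fail(run_days, mtbf_days) ** copies

RUN_DAYS = 21.0            # the three-week run from the thread
ASSUMED_MTBF_DAYS = 365.0  # hypothetical value, chosen only for illustration

for copies in (1, 2, 3):
    p = p_all_fail(RUN_DAYS, ASSUMED_MTBF_DAYS, copies)
    print(f"{copies} copy/copies: P(no run survives) = {p:.6f}")
```

The independence assumption is the weak point: shared power, shared cooling, or a shared NFS mount can take out both nodes at once, which is why IO to local scratch matters for this scheme.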
>> On Tue 10/25/16 02:24PM EDT, Michael Di Domenico wrote:
>>> here's an interesting thought exercise and a real problem i have to
>>> i have researchers who want to run magma codes for three weeks or
>>> so at a time. the process is unfortunately sequential in nature and
>>> magma doesn't support checkpointing (as far as i know) and (I don't
>>> know much about magma)
>>> So the question is:
>>> what kind of a system could one design/buy using any combination of
>>> hardware/software that would guarantee that this program would run for
>>> 3 wks or so and not fail
>>> and by "fail" i mean from some system type error, ie memory faulted,
>>> cpu faulted, network io slipped (nfs timeout) as opposed to "there's a
>>> bug in magma" which already bit us a few times
>>> there's probably some commercial or "unreleased" commercial product on
>>> the market that might fill this need, but i'm also looking for
>>> something "creative" as well
>>> three weeks isn't a big stretch compared to some of the other codes
>>> i've heard around the DOE that run for months, but it's still pretty
>>> painful to have a run go for three weeks and then fail 2.5 weeks in
>>> and have to restart. most modern day hardware would probably support
>>> this without issue, but i'm looking for more of a guarantee than a
>>> double bonus points for anything that runs at high clock speeds >3GHz
>>> any thoughts?
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>>> To change your subscription (digest mode or unsubscribe) visit
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
e: landman at scalableinformatics.com
p: +1 734 786 8423 x121
c: +1 734 612 4615