[Beowulf] non-stop computing
pbisbal at pppl.gov
Wed Oct 26 06:50:51 PDT 2016
There is an amazing beauty in this simplicity.
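To put rough numbers on the two-node idea quoted below, here is a back-of-the-envelope sketch. It assumes node failures are independent and follow an exponential model. The MTBF figure is purely hypothetical, not a measurement from any real cluster, and the function names are made up for illustration.

```python
import math


def p_fail(run_hours, mtbf_hours):
    """Chance that one node fails at least once during the run.

    Assumes an exponential failure model with the given MTBF.
    """
    return 1.0 - math.exp(-run_hours / mtbf_hours)


def p_all_fail(run_hours, mtbf_hours, copies=2):
    """Chance that every copy of the job is lost.

    Assumes the copies fail independently of each other.
    """
    return p_fail(run_hours, mtbf_hours) ** copies


RUN_HOURS = 21 * 24          # three weeks of wall-clock time
MTBF_HOURS = 5 * 365 * 24    # hypothetical: one failure per node every five years

if __name__ == "__main__":
    print(f"one node fails:  {p_fail(RUN_HOURS, MTBF_HOURS):.4%}")
    print(f"both nodes fail: {p_all_fail(RUN_HOURS, MTBF_HOURS):.6%}")
```

Squaring the single-node probability only holds if the two nodes share nothing that can take both down at once (power, cooling, a common NFS mount), which is why the local-scratch part of the suggestion matters.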
On 10/25/2016 02:46 PM, Gavin W. Burris wrote:
> Hi, Michael.
> What if the same job ran on two separate nodes, with IO to local scratch? What are the odds both nodes would fail in that three-week period? No special hardware / software required. Simple. Done.
> On Tue 10/25/16 02:24PM EDT, Michael Di Domenico wrote:
>> here's an interesting thought exercise and a real problem i have to tackle.
>> i have researchers who want to run magma codes for three weeks or
>> so at a time. the process is unfortunately sequential in nature and
>> magma doesn't support checkpointing (as far as i know, and i don't
>> know much about magma)
>> So the question is:
>> what kind of a system could one design/buy using any combination of
>> hardware/software that would guarantee that this program would run for
>> 3 wks or so and not fail
>> and by "fail" i mean from some system type error, i.e. memory faulted,
>> cpu faulted, network io slipped (nfs timeout) as opposed to "there's a
>> bug in magma" which already bit us a few times
>> there's probably some commercial or "unreleased" commercial product on
>> the market that might fill this need, but i'm also looking for
>> something "creative" as well
>> three weeks isn't a big stretch compared to some of the other codes
>> i've heard around the DOE that run for months, but it's still pretty
>> painful to have a run go for three weeks and then fail 2.5 weeks in
>> and have to restart. most modern day hardware would probably support
>> this without issue, but i'm looking for more of a guarantee than a
>> double bonus points for anything that runs at high clock speeds >3Ghz
>> any thoughts?