[Beowulf] non-stop computing

Joe Landman landman at scalableinformatics.com
Wed Oct 26 06:52:13 PDT 2016


Licensing might impede this ...  Usually does.


On 10/26/2016 09:50 AM, Prentice Bisbal wrote:
> There is an amazing beauty in this simplicity.
>
> Prentice
>
> On 10/25/2016 02:46 PM, Gavin W. Burris wrote:
>> Hi, Michael.
>>
>> What if the same job ran on two separate nodes, with IO to local
>> scratch?  What are the odds both nodes would fail in that three-week
>> period?  No special hardware / software required.  Simple. Done.
>>
>> Cheers.
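
[A rough illustration of the redundancy argument above, not part of the
original message. It assumes node failures are independent and follow an
exponential model, and the MTBF figure is a hypothetical placeholder, not
a measured value. Shared failure modes (power, cooling, a common file
server) would break the independence assumption, which is one reason the
suggestion keeps IO on local scratch.]

```python
import math


def p_fail(run_hours: float, mtbf_hours: float) -> float:
    """Probability that a single node fails at least once during the run,
    assuming an exponential failure model with the given MTBF."""
    return 1.0 - math.exp(-run_hours / mtbf_hours)


def p_all_fail(run_hours: float, mtbf_hours: float, copies: int) -> float:
    """Probability that every one of `copies` independent nodes fails
    during the run, i.e. that no copy of the job finishes."""
    return p_fail(run_hours, mtbf_hours) ** copies


RUN_HOURS = 21 * 24                # the three-week run from the thread
ASSUMED_MTBF_HOURS = 5 * 365 * 24  # hypothetical per-node MTBF (5 years)

if __name__ == "__main__":
    for copies in (1, 2, 3):
        p = p_all_fail(RUN_HOURS, ASSUMED_MTBF_HOURS, copies)
        print(f"{copies} copy/copies: P(no run completes) = {p:.2e}")
```

[Under these assumptions the chance of losing every copy drops
geometrically with the number of copies, at the cost of one node (and
possibly one license) per extra copy.]
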
>>
>> On Tue 10/25/16 02:24PM EDT, Michael Di Domenico wrote:
>>> here's an interesting thought exercise and a real problem i have to 
>>> tackle.
>>>
>>> i have researchers who want to run magma codes for three weeks or
>>> so at a time.  the process is unfortunately sequential in nature and
>>> magma doesn't support checkpointing (as far as i know; i don't
>>> know much about magma).
>>>
>>> so the question is:
>>>
>>> what kind of a system could one design/buy, using any combination of
>>> hardware/software, that would guarantee that this program would run for
>>> 3 wks or so and not fail?
>>>
>>> and by "fail" i mean from some system-type error, i.e. memory faulted,
>>> cpu faulted, network io slipped (nfs timeout), as opposed to "there's a
>>> bug in magma", which already bit us a few times
>>>
>>> there's probably some commercial or "unreleased" commercial product on
>>> the market that might fill this need, but i'm also looking for
>>> something "creative" as well
>>>
>>> three weeks isn't a big stretch compared to some of the other codes
>>> i've heard of around the DOE that run for months, but it's still pretty
>>> painful to have a run go for three weeks and then fail 2.5 weeks in
>>> and have to restart.  most modern-day hardware would probably support
>>> this without issue, but i'm looking for more of a guarantee than a
>>> prayer
>>>
>>> double bonus points for anything that runs at high clock speeds (>3 GHz)
>>>
>>> any thoughts?
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin 
>>> Computing
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
e: landman at scalableinformatics.com
w: http://scalableinformatics.com
t: @scalableinfo
p: +1 734 786 8423 x121
c: +1 734 612 4615


