[Beowulf] non-stop computing

John Hearns John.Hearns at xma.co.uk
Wed Oct 26 07:02:24 PDT 2016


Well you CAN have RAM arranged in banks which mirror each other in a RAID-1 fashion.

But heck, why not have THREE servers running the same problem - then two of them can vote out the other one,
and start to mutter about it behind its back...
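
A rough sketch of the voting idea, in Python. The function names and the 5% per-node failure chance used below are illustrative assumptions, not measurements from any real cluster, and the arithmetic assumes that nodes fail independently of each other.

```python
from collections import Counter
from math import comb


def majority_vote(results):
    """Return the value reported by a strict majority of replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None


def prob_at_least_k_survive(n, k, p_fail):
    """Probability that at least k of n replicas finish the run.

    Assumes each replica fails independently with probability p_fail
    (a hypothetical figure; real node failures can be correlated).
    """
    p_ok = 1.0 - p_fail
    return sum(comb(n, i) * p_ok**i * p_fail ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Three replicas report a result; one of them disagrees and is outvoted.
    print(majority_vote(["result-A", "result-A", "result-B"]))
    # Two replicas, at least one must finish (the two-node proposal).
    print(prob_at_least_k_survive(2, 1, 0.05))
    # Three replicas, at least two must finish so a vote is still possible.
    print(prob_at_least_k_survive(3, 2, 0.05))
```

With that assumed 5% figure, the chance that both of two independent nodes fail in the same run is 0.05 * 0.05 = 0.25%. Voting needs two survivors out of three, so it trades a little availability for the ability to spot a replica that finished with a wrong answer.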


-----Original Message-----
From: Beowulf [mailto:beowulf-bounces at beowulf.org] On Behalf Of Prentice Bisbal
Sent: 26 October 2016 14:51
To: beowulf at beowulf.org
Subject: Re: [Beowulf] non-stop computing

There is an amazing beauty in this simplicity.

Prentice

On 10/25/2016 02:46 PM, Gavin W. Burris wrote:
> Hi, Michael.
>
> What if the same job ran on two separate nodes, with IO to local scratch?  What are the odds both nodes would fail in that three-week period?  No special hardware / software required.  Simple.  Done.
>
> Cheers.
>
> On Tue 10/25/16 02:24PM EDT, Michael Di Domenico wrote:
>> here's an interesting thought exercise and a real problem i have to tackle.
>>
>> i have researchers that want to run magma codes for three weeks or
>> so at a time.  the process is unfortunately sequential in nature and
>> magma doesn't support checkpointing (as far as i know, and i don't
>> know much about magma)
>>
>> So the question is;
>>
>> what kind of a system could one design/buy using any combination of
>> hardware/software that would guarantee that this program would run
>> for 3 wks or so and not fail
>>
>> and by "fail" i mean from some system type error, ie memory faulted,
>> cpu faulted, network io slipped (nfs timeout) as opposed to "there's
>> a bug in magma" which already bit us a few times
>>
>> there's probably some commercial or "unreleased" commercial product
>> on the market that might fill this need, but i'm also looking for
>> something "creative" as well
>>
>> three weeks isn't a big stretch compared to some of the other codes
>> i've heard around the DOE that run for months, but it's still pretty
>> painful to have a run go for three weeks and then fail 2.5 weeks in
>> and have to restart.  most modern day hardware would probably support
>> this without issue, but i'm looking for more of a guarantee than a
>> prayer
>>
>> double bonus points for anything that runs at high clock speeds >3GHz
>>
>> any thoughts?
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>> Computing To change your subscription (digest mode or unsubscribe)
>> visit http://www.beowulf.org/mailman/listinfo/beowulf


