[Beowulf] non-stop computing

Tue Oct 25 11:33:16 PDT 2016

On 10/25/2016 02:24 PM, Michael Di Domenico wrote:
> here's an interesting thought exercise and a real problem i have to tackle.
>
> i have researchers who want to run magma codes for three weeks or
> so at a time.  the process is unfortunately sequential in nature and
> magma doesn't support checkpointing (as far as i know; i don't know
> much about magma)
>
> So the question is;
>
> what kind of a system could one design/buy using any combination of
> hardware/software that would guarantee that this program would run for
> 3 wks or so and not fail
>
> and by "fail" i mean from some system type error, ie memory faulted,
> cpu faulted, network io slipped (nfs timeout) as opposed to "there's a
> bug in magma" which already bit us a few times

You'd need to design an HA network and storage system to handle the
possibility of external failure.  For internal failure, you'd want to
run this in a KVM guest very close to the metal, and snapshot/checkpoint
the VM every so often to local/remote VERY FAST storage.
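
A minimal sketch of the periodic snapshot idea, assuming the guest is
managed by libvirt and that `virsh snapshot-create-as` is available on
the host.  The domain name "magma-vm", the interval, and the helper
names are hypothetical.  Whether guest memory is captured, and how long
the guest pauses, depends on the disk format and libvirt setup, so test
a restore before trusting it with a three week run.

```python
import subprocess
import time
from datetime import datetime, timezone


def snapshot_command(domain, when):
    """Build the virsh command that snapshots `domain`, named by UTC timestamp."""
    name = f"{domain}-{when.strftime('%Y%m%dT%H%M%SZ')}"
    return ["virsh", "snapshot-create-as", domain, name]


def snapshot_loop(domain, interval_s, count, run=subprocess.run, sleep=time.sleep):
    """Take `count` snapshots of `domain`, `interval_s` seconds apart.

    `run` and `sleep` are injectable so the loop can be dry-run without
    libvirt installed.  Returns the list of commands issued.
    """
    issued = []
    for i in range(count):
        cmd = snapshot_command(domain, datetime.now(timezone.utc))
        run(cmd, check=True)  # raises CalledProcessError if virsh fails
        issued.append(cmd)
        if i < count - 1:
            sleep(interval_s)
    return issued
```

For example, snapshot_loop("magma-vm", 3600, 504) would take one
snapshot an hour for 21 days.  Old snapshots are not pruned here, so
the target storage needs room for all of them or a separate cleanup.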

That said, it would help to start with a system that can handle
hard/heavy load for that period of time w/o failure.  We have units at
various places around the world that continuously sustain many GB/s of
IO for more than a year of operation, under fairly intense loads.

Choose your systems wisely, and don't let brand names decide the outcome.

> there's probably some commercial or "unreleased" commercial product on
> the market that might fill this need, but i'm also looking for
> something "creative" as well

Start with good hardware.  If you ping me about our burn-in test case,
I'll be happy to send it over.  It's running y-cruncher to do burn-in on
all CPUs/RAM continuously.  It's pretty good at catching bad MB/CPU/RAM.
Previously, I had a GAMESS run I used for this (also very good).
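
As a companion to a burn-in run, here is a rough sketch of one way to
notice memory errors while the load is running: read the Linux EDAC
counters before and after and compare.  This is not the burn-in test
case mentioned above.  It assumes ECC memory and a loaded EDAC driver,
which expose ce_count/ue_count under /sys/devices/system/edac/mc; the
function names are hypothetical.

```python
from pathlib import Path


def edac_error_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (correctable, uncorrectable)} from EDAC sysfs counters."""
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        ce = int((mc / "ce_count").read_text().strip())
        ue = int((mc / "ue_count").read_text().strip())
        counts[mc.name] = (ce, ue)
    return counts


def new_errors(before, after):
    """Return controllers whose counters changed between two readings."""
    return {
        mc: (after[mc][0] - ce, after[mc][1] - ue)
        for mc, (ce, ue) in before.items()
        if mc in after and after[mc] != (ce, ue)
    }
```

Take one reading before the burn-in starts and another when it ends.
Any controller reported by new_errors() is worth a closer look before
that box gets a three week job.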

>
> three weeks isn't a big stretch compared to some of the other codes
> i've heard around the DOE that run for months, but it's still pretty
> painful to have a run go for three weeks and then fail 2.5 weeks in
> and have to restart.  most modern day hardware would probably support
> this without issue, but i'm looking for more of a guarantee than a
> prayer
>
> double bonus points for anything that runs at high clock speeds >3GHz

See above.  This is fairly *easy* for various definitions of easy.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
e: landman at scalableinformatics.com
w: http://scalableinformatics.com
t: @scalableinfo
p: +1 734 786 8423 x121
c: +1 734 612 4615