[Beowulf] non-stop computing

Michael Di Domenico mdidomenico4 at gmail.com
Tue Oct 25 11:24:39 PDT 2016


here's an interesting thought exercise and a real problem i have to tackle.

i have researchers who want to run magma codes for three weeks or
so at a time.  the process is unfortunately sequential in nature, and
magma doesn't support checkpointing (as far as i know, and i don't
know much about magma).

so the question is:

what kind of system could one design/buy, using any combination of
hardware/software, that would guarantee that this program would run for
3 weeks or so and not fail?

by "fail" i mean some system-level error, i.e. a memory fault, cpu
fault, or network i/o hiccup (nfs timeout), as opposed to "there's a
bug in magma", which has already bitten us a few times.

there's probably some commercial or "unreleased" commercial product on
the market that might fill this need, but i'm also looking for
something "creative".

three weeks isn't a big stretch compared to some of the other codes
i've heard of around the DOE that run for months, but it's still pretty
painful to have a three-week run fail 2.5 weeks in and have to
restart.  most modern-day hardware would probably support this without
issue, but i'm looking for more of a guarantee than a prayer.

double bonus points for anything that runs at high clock speeds (>3GHz).

any thoughts?

