[Beowulf] non-stop computing

Paul McIntosh paul.mcintosh at monash.edu
Tue Oct 25 14:46:44 PDT 2016


Hi Michael,

You could try BLCR for checkpointing. I have only tested it briefly, but it
checkpointed OpenFOAM fine on one node (though I think that was a
single-threaded run):
http://crd.lbl.gov/departments/computer-science/CLaSS/research/BLCR/

So it is likely to work on magma as well.

There is also CRIU (https://criu.org/Main_Page), which I have never tried
because it needs a newer kernel.
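
To make the BLCR suggestion a bit more concrete, here is a rough sketch of a
periodic checkpoint driver in Python. It assumes the job was started under
BLCR's cr_run so that it can be checkpointed, and that cr_checkpoint and
cr_restart are on the PATH. The function names, the directory layout and the
"context.PID.SEQ" file naming are all made up for illustration, and I have not
tested this against magma.

```python
import subprocess
import time
from pathlib import PurePosixPath


def blcr_checkpoint_cmd(pid, out_dir, seq):
    """Build a cr_checkpoint command line writing one context file per checkpoint.

    Assumption: the target process was launched via `cr_run <program>` so that
    BLCR's checkpoint support is loaded into it.
    """
    # Hypothetical naming scheme; BLCR itself defaults to context.<pid>.
    context = PurePosixPath(out_dir) / f"context.{pid}.{seq:04d}"
    return ["cr_checkpoint", "-f", str(context), str(pid)]


def blcr_restart_cmd(context_file):
    """Build the command line that resumes a job from a saved context file."""
    return ["cr_restart", str(context_file)]


def checkpoint_loop(pid, out_dir, interval_s, count,
                    run=subprocess.run, sleep=time.sleep):
    """Take `count` checkpoints of `pid`, waiting `interval_s` seconds before each.

    `run` and `sleep` are injectable so the loop can be dry-run without BLCR.
    Returns the list of context file paths that were requested.
    """
    written = []
    for seq in range(count):
        sleep(interval_s)
        cmd = blcr_checkpoint_cmd(pid, out_dir, seq)
        run(cmd, check=True)  # raises CalledProcessError if the checkpoint fails
        written.append(cmd[2])
    return written


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    checkpoint_loop(1234, "/scratch/ckpt", 0, 2,
                    run=lambda cmd, check: print(" ".join(cmd)),
                    sleep=lambda s: None)
    print(" ".join(blcr_restart_cmd("/scratch/ckpt/context.1234.0001")))
```

Keeping several numbered context files rather than overwriting one means a
checkpoint that gets interrupted by the very failure you are guarding against
does not destroy the last good one.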

Cheers,

Paul


-----Original Message-----
From: Beowulf [mailto:beowulf-bounces at beowulf.org] On Behalf Of Michael Di Domenico
Sent: Wednesday, 26 October 2016 5:25 AM
To: Beowulf Mailing List <Beowulf at beowulf.org>
Subject: [Beowulf] non-stop computing

here's an interesting thought exercise and a real problem i have to tackle.

i have researchers who want to run magma codes for three weeks or so at a
time.  the process is unfortunately sequential in nature, and as far as i
know magma doesn't support checkpointing (though i don't know much about
magma)

so the question is:

what kind of system could one design/buy, using any combination of
hardware/software, that would guarantee that this program would run for
three weeks or so and not fail?

and by "fail" i mean from some system-type error, i.e. memory faulted, cpu
faulted, network io slipped (nfs timeout), as opposed to "there's a bug in
magma", which has already bitten us a few times

there's probably some commercial or "unreleased" commercial product on the
market that might fill this need, but i'm also looking for something
"creative" as well

three weeks isn't a big stretch compared to some of the other codes i've
heard about around the DOE that run for months, but it's still pretty painful
to have a three-week run fail 2.5 weeks in and have to restart.  most
modern-day hardware would probably support this without issue, but i'm
looking for more of a guarantee than a prayer

double bonus points for anything that runs at high clock speeds (>3GHz)

any thoughts?