[Beowulf] Checkpointing using flash

Andrew Holway andrew.holway at gmail.com
Mon Sep 24 09:57:49 PDT 2012

> Haha, I doubt it -- probably the opposite in terms of development cost.
> Which is why I question the original statement on the grounds that
> "cost" isn't well defined.  Maybe the cost is just performance-wise, but
> that's not even clear to me when we consider things at huge scales.

40 years ago an army of cheap software developers was needed to
service a single very expensive box. Now the boxes are super cheap and
the price of decent software developers is very high.

With hardware, you only have to solve the problem once. If we push
this exascale node-failure problem to software, every single
application that wants to scale to those heights will have to find its
own way to apply the method and live with its restrictions. This
approach also hits us very hard in the place where we are hurting the
most: our developers. In Germany, at present, there is I believe a
fairly significant net surplus of compute resource as our scientists
try to wrap their heads around parallel programming to take advantage
of this exponentially increasing resource.

Checkpointing to some kind of non-volatile disk might work for some
codes, but it's not a universal solution. Some MPI tricks might work
for another code. What about QCD codes that are almost completely I/O
bound? I can't wrap my head around how either solution would work in
that circumstance, but then again I am not a computer scientist and
have only a moderately weak grasp of the mechanics.

It's easy to forget the golden rule of HPC: "Never underestimate the
crappiness of the code!" It is our task to provide a safe and elegant
playground for our users so that this crappiness matters a bit
less :)

> I think the MapReduce framework actually makes a good case (admittedly
> for non-general, fairly sequential workloads) for the ability of
> software to scale cheaply and at reasonable performance with added
> hardware.  Don't expect to do any real physics on MR of course, but for
> huge data crunches it is quite nice.  A /totally/ general framework that
> scales on a number of platforms is one of those cure-alls we aren't
> likely to see for a decade or three.  Just my pessimistic perspective
> though :D.
> Best,
> ellis
