[Beowulf] Checkpointing using flash

Tue Sep 25 05:19:49 PDT 2012

On 09/24/2012 12:57 PM, Andrew Holway wrote:
>> Haha, I doubt it -- probably the opposite in terms of development cost.
>>    Which is why I question the original statement on the grounds that
>> "cost" isn't well defined.  Maybe the costs just performance-wise, but
>> that's not even clear to me when we consider things at huge scales.
>
> 40 years ago an army of cheap software developers were needed to
> service a single very expensive box. Now the boxes are super cheap and
> the price for decent software developers is very high.

40 years ago the demand for this type of job was...what?  Incredibly 
limited, I'd bet, if not a downright niche (supercomputing, 
defense-related calculations, business apps, maybe a handful of other 
purposes).  And the boxes aren't super cheap because things have been 
"solved" in hardware rather than software -- the fabs for modern 
processors are much, much more expensive than they used to be, but 
economies of scale kick in to make things cheap, since so many people 
want PCs.

> With hardware, you just have to solve the problem once. With this

I am totally unconvinced about this...if I solve something in software, 
don't I only need to solve it once as well, open-source my code, and 
share it?  While I agree certain things are downright destined for 
hardware (computer vision problems, arithmetic, etc.), it is completely 
unclear to me that something as unsolved and as high-level as parallel 
programming for exascale computing should even be tackled in hardware.  
How are you expecting the developers to code then, if they cannot 
understand parallel programming?  Serial codes? 
Good luck finding or writing a compiler (also software) that will turn a 
serial code into a parallel code perfectly.  That's many decades down 
the line.

> Checkpointing to some kind of non volatile disk might work for some
> codes but its not a universal solution. Some MPI tricks might work for

Uhh...I think it's the opposite.  We've been discussing checkpointing in 
this thread as a general solution that almost always works (I mean 
you're literally snapshotting your memory, I cannot think of an instance 
where that would not work), but it's not a solution that we'd like to 
continue using for most of our codes in the future.  It's just inefficient.
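
To make the "snapshot and restart" idea concrete, here is a minimal 
application-level sketch in Python.  It is only an illustration under 
assumptions of mine: the function names, the toy "computation" (a running 
sum), and the fail_at parameter used to simulate a crash are all 
hypothetical, and a real MPI code would have to coordinate the snapshot 
across ranks.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write the whole state to `path` atomically (temp file, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach stable storage
    os.replace(tmp, path)     # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, every, fail_at=None):
    """Toy computation: sum the step indices, checkpointing every `every` steps.

    `fail_at` is a hypothetical knob that simulates a failure at that step.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    return state["acc"]
```

Note that every checkpoint rewrites the entire state, which is exactly 
the inefficiency I mean: the cost grows with memory footprint, not with 
how much has changed since the last snapshot.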

> another code. What about QCD codes that are almost completely I/O
> bound....I cant wrap my head around how either solution would work in
> that circumstance but then again I am not a computer scientist and
> have a moderately weak grasp on the mechanics.

What does I/O-bound or CPU-bound have to do with correctness of a 
checkpoint?  Do you mean data continues to be streamed in real-time like 
from a collider so we have to deal with that during the checkpoint?  Or 
are you referring to something else entirely?

> Its easy to underestimate the golden rule of HPC! "Never underestimate
> the crappyness of the code!". It is our task to provide a safe an
> elegant playground for our users so that this crappyness matters a bit
> less :)

On a related note (I assume a majority of your users are scientists), 
regarding your or somebody else's post a bit back about how poor 
scientists are at coding -- I've witnessed the exact opposite.  Now, 
this is going on limited experience and all, but when I interned at 
Argonne National Laboratory near Chicago I saw some absolutely amazing code 
written by people without a computer science background that ran on what 
was then one of the top supers in the country (Intrepid).  The point is, 
they need to get their work done, and they know just how painful and 
slow poor code will be.  Moreover, their careers rest on the 
premise that their calculations and resultant code are correct, and they 
have deadlines like the rest of us, which means their code has to 
finish running by then.  My golden rule of HPC is 
therefore quite the opposite: "Never underestimate the cleverness of 
your users."  Their code might do "weird" things, but it's simply 
because your framework wasn't adaptive enough.  I have supreme respect 
for most of the "users" I've dealt with, but as I said before, this is 
admittedly based on limited experience, and mine could be an exceptional case.

Best,

ellis