[Beowulf] HPC workflows

Fri Dec 7 10:07:12 PST 2018

On 12/7/18, 8:46 AM, "Beowulf on behalf of Michael Di Domenico" <beowulf-bounces at beowulf.org on behalf of mdidomenico4 at gmail.com> wrote:

    On Fri, Dec 7, 2018 at 11:35 AM John Hanks <griznog at gmail.com> wrote:
    >
    >  But, putting it in a container wouldn't make my life any easier and would, in fact, just add yet another layer of something to keep up to date.

    i think the theory behind this is the containers allow the sysadmins
    to kick the can down the road and put the onus of updates on the
    container developer.  but then you get into a circle of trust issue,
    whereby now you have to trust the container developers are doing
    something sane and in a timely manner.

    a perfect example that we pitched up to our security team was (this
    was a few years ago mind you); what happens when someone embeds openssl
    libraries in the container.  who's responsible for updating them?
    what happens when that container gets abandoned by the dev?  and those
    containers are running with some sort of docker/root privilege
    menagerie.  this was back when openssl had bugs coming up left and
    right.  yeah, that conversation stopped dead in its tracks and we put
    a moratorium on docker.

    but i don't think the theory lines up with the practice, and that's
    why devs shouldn't be doing ops

This is a generic problem in areas other than HPC.  Over the past few years, a fair amount of the software I've been working with has been targeted at spacecraft platforms, and we had an interesting exercise over the past couple of years.  I was porting a standard orbit propagation package (SGP4, see http://www.celestrak.com/ for the Pascal version from 2000), which is available in many different languages. I happened to be implementing the C version in RTEMS running on a SPARC V8 processor (the LEON2 and LEON3, as it happens).  The software itself is quite compact, has no dependencies other than math.h, stdio.h, and stdlib.h, and derives from an original Fortran version.  RTEMS is a real-time operating system that exposes a POSIX API, so it's easy to work with.  What we did was create a wrapper for SGP4 that matches a standardized set of APIs for software radios (Space Telecommunications Radio System, STRS).

But here's the problem - there are really 4 different target hardware platforms, all theoretically the same, but not. In the space flight software business, you choose a toolchain and development environment at the beginning of the project (Phase A - Formulation) and you stay with that for the life of the mission, unless there's a compelling reason to change.   In the course of the last 10 years, we've gone through 5 versions of RTEMS (4.8, 4.10, 4.11, 4.12, 5.0), 3 different source management tools (cvs, svn, git), an IDE that came and went (Eclipse), not to mention a variety of versions of the gcc toolchain.  Each mission has its own set of all of this, plus a bunch of homegrown makefiles and related build processes. And, of course, it's a hodgepodge of CentOS, Scientific Linux, Ubuntu, Debian, and RH, depending on what was the "most supported distro" at the time the mission picked it (which might depend on who the SysAdmin on the project was).

10 years is *forever* in the software development world. I've not yet had the experience of working with a developer who was born after the first version of the flight software they're working on was created - but I know that other people at JPL have (when it takes 7 years to get to where you're going, and the mission lasts 10-15 years after that...).  And this is perfectly reasonable - SGP4, for instance, basically implements the laws of physics as a numerical model - it worked fine in 2000, it works fine now, it's going to work just fine in 2030, with no significant changes. "The SGP4 and SDP4 models were published along with sample code in FORTRAN IV in 1988 with refinements over the original model to handle the larger number of objects in orbit since" (Wikipedia article on SGP)

So, "inheriting" the SGP4 propagator from one project into another is not just a matter of moving the source code for SGP4. You have to compile it with all the other stuff, and there are myriad hidden dependencies - does this platform have hardware floating point or software-emulated floating point, and if the latter, which of several flavors?  Where in the source tree (for that project) does it sit? What's the permissions strategy? Where do you add it in the build process?
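
One way to make a few of those hidden dependencies visible is to have each build stamp itself.  What follows is only a minimal sketch, not anything from our actual flight code: PROJ_SOFT_FLOAT and PROJ_RTEMS_RELEASE are hypothetical macros that a project's makefile would have to pass in with -D, and proj_build_fingerprint is a made-up name.  The only thing assumed from the toolchain is __VERSION__, which GCC (and Clang) predefine.

```c
#include <stdio.h>

/* Hypothetical project macro: the makefile would pass -DPROJ_SOFT_FLOAT
   for targets that use software-emulated floating point.
   This is NOT a real GCC or RTEMS macro. */
#ifdef PROJ_SOFT_FLOAT
#define PROJ_FP_MODE "soft"
#else
#define PROJ_FP_MODE "hard"
#endif

/* __VERSION__ is predefined by GCC and Clang; fall back elsewhere. */
#ifdef __VERSION__
#define PROJ_COMPILER __VERSION__
#else
#define PROJ_COMPILER "unknown"
#endif

/* Hypothetical: the RTEMS release string would be injected by the
   makefile, e.g. -DPROJ_RTEMS_RELEASE='"4.10"'. */
#ifndef PROJ_RTEMS_RELEASE
#define PROJ_RTEMS_RELEASE "unspecified"
#endif

/* Write a one-line build fingerprint into buf.
   Returns what snprintf returns: the length the full string needs. */
int proj_build_fingerprint(char *buf, size_t len)
{
    return snprintf(buf, len, "compiler=%s rtems=%s fp=%s sizeof(double)=%u",
                    PROJ_COMPILER, PROJ_RTEMS_RELEASE, PROJ_FP_MODE,
                    (unsigned)sizeof(double));
}
```

Printing that string at startup (or dropping it into telemetry) at least tells you which of the several build environments a given binary came out of.  It does nothing for the source tree layout or permissions questions.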

And then contemplate propagating a bug fix over all those platforms.  You might make a decision to propagate a change to some, but not all, platforms - maybe the spacecraft you're contemplating is getting towards the end of its life, and you'll never again use the function you developed 4 years ago. Do you put that bug fix to address the incorrect gravitational parameter at Mars into the systems that are orbiting Earth?
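
To make the Mars example concrete, here is a rough sketch of how a per-body gravitational parameter might sit in shared code.  This is not SGP4 (which is an Earth-orbit model) and not our flight code; the enum, the table, and circular_speed_sq are all hypothetical names, and the GM values are approximate, for illustration only.

```c
/* Hypothetical central-body selector; each mission would use one entry. */
typedef enum { BODY_EARTH, BODY_MARS, BODY_COUNT } central_body;

/* Approximate gravitational parameters GM in km^3/s^2.
   Illustrative values only, not taken from any flight software. */
static const double MU_KM3_S2[BODY_COUNT] = {
    398600.4418, /* Earth */
    42828.37     /* Mars  */
};

/* Squared speed of a circular orbit, v^2 = mu / r, in km^2/s^2.
   Returns a negative value if the body or radius is invalid. */
double circular_speed_sq(central_body body, double radius_km)
{
    if ((int)body < 0 || (int)body >= BODY_COUNT || radius_km <= 0.0)
        return -1.0;
    return MU_KM3_S2[body] / radius_km;
}
```

Correcting the Mars entry changes nothing that an Earth orbiter computes, yet the source file (and so the build, the test campaign, and the paperwork) changes for every platform that carries it.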

Yes - folks have said "put it in containers" and in the last few years, folks have started spinning up VMs to manage this. Historically, we keep "systems under glass" - once you've got the build PCs working, you preserve them for the project forever.   The problem is that PCs fail eventually.  But whether it is keeping half a dozen PCs on a shelf running, or half a dozen VMs running, it's really the same administrative burden - they all need to have annual security audits, and perhaps have patches applied (if it's "on the network").  And you've really not addressed the underlying problem of needing to support a remarkable variety of heterogeneous platforms.  You've basically saved the physical space on a shelf for all those PCs.