[Beowulf] Revelations on Roadrunner's Retirement

Eugen Leitl eugen at leitl.org
Fri Apr 5 08:48:56 PDT 2013


Revelations on Roadrunner's Retirement

Nicole Hemsoth

Earlier this week we reported on the decommissioning of the Roadrunner
supercomputer at Los Alamos National Laboratory, which was being shuttered
following a stint of fame as the first system to break the petascale barrier
back in 2008.

According to Paul Henning from the computational physics division at Los
Alamos, Roadrunner’s checkout made big news, but the end of the line for the
super was well-planned, if not right on schedule.

The system served its purpose, chewing through a bevy of mostly classified
and some key civilian code. In the end, however, a finite contract, an
extinct chip, the cost of crumpling up code to fit into IBM’s Cell, and the
promise of swifter, more efficient technologies were the main factors in the
planned, clipped lifecycle of the petaflop pioneer.

“Rather than think of these machines as physical entities, we think of them
as projects,” he explained. “At the beginning of the Roadrunner acquisition
we laid out a project lifetime for this—and that lifetime considered a number
of things, including the cost of maintenance, power, vendor and licensing
contracts, and how we would upgrade the system.”

Henning detailed that the support contract with IBM was up and, since IBM no
longer even produces the core of the machine’s architecture, the Cell,
scrounging up spare parts would have presented a rather tricky issue. The
retirement party had been planned years ago anyway, but there are some meaty
learning opportunities to glean from the scrap metal.

When any system at the lab is shuttered, an autopsy is performed that looks
at everything from the integrity of the memory and OS to the more
nuts-and-bolts physical properties. A key finding of the post-mortem revolves
around the condition of the boxes after five years of heat, wear and tear,
and this is where the materials analysis begins. It has given the renowned
materials science team at the center an insider’s view into the real stress
on systems after high-yield, high-heat production, and from what we read
between the lines, these boxes are maxed out.

Then again, there were never any plans to build the system out to new glory
à la the Jaguar-to-Titan transformation. Anyway, even if the hardware wasn’t
on its last, weak leg, considering they’d have to retrofit the entire system
since IBM would return a 404 on their build-out needs, it makes sense that
they’d want to rip…and of course, replace.

Currently, Los Alamos has sent its applications on a redirect course to the
smaller, slightly more efficient and roughly performance-equivalent Cielo
system, which is housed in the same space as the now-defunct Roadrunner.
Henning said the developer-friendly architecture saves time and money on code
retooling, ostensibly while they try to fit something new into their

And so here is where things get interesting, because we can speculate on what
Los Alamos might dream up to fill the 6,000 square foot gap left behind.
That’s a pretty large expanse of empty space for any upstart system to settle
into. Titan’s sprawl is just under 5,000 square feet, and a lot of flops have
fit in less than that.

There are a few hints at what might sit on the charred spot Roadrunner once
occupied post-ripdown. It’s worth noting that a quick perusal of the NNSA’s
procurement plans for the next year turns up a project on the order of $50
million to (yes) one billion dollars, which is currently accepting
proposals. And it’s kind of hard to imagine what else would be
filed under tech procurements to that monetary tune. If any of you know
anything about this, that comments section down there looks awfully
empty….(hint, hint).

All speculation aside, it looks like we’ll find out soon enough—probably
later this year—just what will turn off that vacancy sign at the lab. Until
then, the Roadrunner story serves as a reminder about how quickly the tides
of this type of tech shift and leave superhero machines drifting into
forgotten waters.

When national labs and large HPC sites sit down to spill ink on new system
designs, they’re hedging their bets on what future technologies will look
like. It’s rare, unless folks are on a TACC/Stampede-like course to go from
ground to super in a tick over a year, to know what innovations on the
architecture, efficiency or acceleration front will yield big
price-performance dividends. So at the time that Los Alamos set about
architecting Roadrunner based on the unique Cell approach, they were
placing their bets on the future of that technology.

Since that development cycle, the rise of GPU acceleration, the introduction
of the promising Phi, and some efficiency tweaks on the software side have
left some of what made Roadrunner shine looking rather dated. It’s now
possible to get more compute power in a smaller power envelope…and with a lot
less in the way of programming hassle, as well, notes Henning. However, for
the NNSA and Los Alamos, whatever the clandestine code was they cooked around
the Cell, it must have been worth the effort on the retooling side.

Although the story of the Roadrunner being forced into retirement found its
way into a number of mainstream tech media stories over the course of the
week, this is a pretty standard order of operations for large HPC centers,
especially national labs. Henning stressed that the shutdown of the
once-famous system is not unlike the series of other supers they’ve shuttered
in succession at the center. They build a plan for acquisition, see a machine
run its course, learn from it post-mortem and shuttle it off in parts to make
way for something fresh.
