[Beowulf] Revelations on Roadrunner's Retirement

Joshua mora acosta joshua_mora at usa.net
Fri Apr 5 09:00:59 PDT 2013


It would be good to know the levels of efficiency the applications
achieved wrt FLOP/s and GB/s, and the typical node counts for the runs.
Then compare that against the current PF/s systems.
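
As a back-of-the-envelope sketch of the comparison I have in mind, in
Python (every peak and sustained number below is an illustrative
placeholder, not a measured Roadrunner figure):

# Sketch of per-application efficiency metrics. All numbers are
# made-up placeholders for illustration, not measured figures.

PEAK_FLOPS = 1.4e15   # system peak, FLOP/s (~1.4 PF/s, approximate)
PEAK_BW    = 3.0e14   # aggregate memory bandwidth, B/s (assumed)

def efficiency(sustained, peak):
    # Fraction of peak the application actually sustained.
    return sustained / peak

# Hypothetical application run: sustained rates and node count.
sustained_flops = 2.8e14   # 0.28 PF/s sustained (made-up value)
sustained_bw    = 9.0e13   # 90 TB/s aggregate (made-up value)
nodes = 3000

print("FLOP/s efficiency: %.1f%%" % (100.0 * efficiency(sustained_flops, PEAK_FLOPS)))
print("GB/s efficiency:   %.1f%%" % (100.0 * efficiency(sustained_bw, PEAK_BW)))
print("typical node count: %d" % nodes)

Running the same arithmetic on a current PF/s system's numbers would
make the comparison concrete.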

Joshua

------ Original Message ------
Received: 05:49 PM CEST, 04/05/2013
From: Eugen Leitl <eugen at leitl.org>
To: Beowulf at beowulf.org
Subject: [Beowulf] Revelations on Roadrunner's Retirement

> 
> http://www.hpcwire.com/hpcwire/2013-04-04/revelations_on_roadrunner_s_retirement.html?featured=top
> 
> Revelations on Roadrunner's Retirement
> 
> Nicole Hemsoth
> 
> Earlier this week we reported on the decommissioning of the Roadrunner
> supercomputer at Los Alamos National Laboratory, which was being shuttered
> following a stint of fame as the first system to break the petascale
> barrier back in 2008.
> 
> According to Paul Henning from the computational physics division at Los
> Alamos, Roadrunner’s checkout made big news, but the end of the line for
> the super was well-planned, if not right on schedule.
> 
> The system served its purpose chewing through a bevy of mostly classified
> and some key civilian code. However, in the end, the combination of a
> finite contract, an extinct chip, the cost of crumpling up code to fit
> into IBM’s Cell, and the promise of swifter, more efficient technologies
> were the main factors in the planned, clipped lifecycle of the petaflop
> pioneer.
> 
> “Rather than think of these machines as physical entities, we think of
> them as projects,” he explained. “At the beginning of the Roadrunner
> acquisition we laid out a project lifetime for this—and that lifetime
> considered a number of things, including the cost of maintenance, power,
> vendor and licensing contracts, and how we would upgrade the system.”
> 
> Henning detailed that the support contract with IBM was up, and since IBM
> no longer produces the core of the machine’s architecture, the Cell, even
> scrounging up spare parts would have presented a rather tricky issue. The
> retirement party had been planned years ago anyway, but there are some
> meaty learning opportunities to glean from the scrap metal.
> 
> When any system at the lab is shuttered, an autopsy is performed that
> looks at everything from the integrity of the memory and OS to the more
> nuts-and-bolts physical properties. A key finding of the post-mortem
> revolves around the condition of the boxes after five years of heat, wear
> and tear—it’s here where the materials analysis begins. It’s given the
> renowned materials science team at the center an insider’s view into the
> real stress on systems after high-yield, high-heat production—and from
> what we read between the lines, these boxes are maxed out.
> 
> Then again, there were never any plans to build the system out to new
> glory à la the Jaguar-to-Titan transformation. Anyway, even if the
> hardware wasn’t on its last, weak leg, considering they’d have to retrofit
> the entire system since IBM would return a 404 on their build-out needs,
> it makes sense that they’d want to rip…and of course, replace.
> 
> Currently, Los Alamos has sent its applications on a redirect course to
> the smaller, slightly more efficient and roughly performance-equivalent
> Cielo system, which is housed in the same space as the now-defunct
> Roadrunner. Henning said the developer-friendly architecture saves time
> and money on code retooling, ostensibly while they try to fit something
> new into their environment.
> 
> And so here is where things get interesting, because we can speculate on
> what Los Alamos might dream up to fill the 6,000 square foot gap left
> behind. That’s a pretty large stretch of empty space for any upstart
> system to settle into. Titan’s sprawl is right under 5,000 square feet,
> and a lot of flops have fit in less than that.
> 
> There are a few hints at what might sit on the charred spot Roadrunner
> once occupied post-ripdown. However, it’s worth noting that a quick
> perusal of the NNSA’s procurement plans for the next year turns up
> something on the order of a $50 million to (yes) one billion dollar
> project, which is currently accepting proposals. And it’s kind of hard to
> imagine what else would be filed under tech procurements to that monetary
> tune. If any of you know anything about this, that comments section down
> there looks awfully empty….(hint, hint).
> 
> All speculation aside, it looks like we’ll find out soon enough—probably
> later this year—just what will turn off that vacancy sign at the lab.
> Until then, the Roadrunner story serves as a reminder about how quickly
> the tides of this type of tech shift and leave superhero machines drifting
> into forgotten waters.
> 
> When national labs and large HPC sites sit down to spill ink on new
> system designs, they’re hedging their bets on what future technologies
> will look like. It’s rare, unless folks are on a TACC/Stampede-like course
> to go from ground to super in a tick over a year, to know what innovations
> on the architecture, efficiency or acceleration front will yield big
> price-performance dividends. So at the time that Los Alamos set about
> architecting Roadrunner based on the unique Cell approach, they were
> placing their bets on the future of that technology.
> 
> Since that development cycle, the rise of GPU acceleration, the
> introduction of the promising Phi, and some efficiency tweaks on the
> software side have rendered some of what made Roadrunner shine seem rather
> dated. It’s now possible to get more compute power in a smaller power
> envelope…and with a lot less in the way of programming hassle as well,
> notes Henning. However, for the NNSA and Los Alamos, whatever the
> clandestine code was they cooked around the Cell, it must have been worth
> the effort on the retooling side.
> 
> Although the story of the Roadrunner being forced into retirement found
> its way into a number of mainstream tech media stories over the course of
> the week, this is a pretty standard order of operations for large HPC
> centers, especially national labs. Henning stressed that the shutdown of
> the once-famous system is not unlike the series of other supers they’ve
> shuttered in succession at the center. They build a plan for acquisition,
> see a machine run its course, learn from it post-mortem and shuttle it off
> in parts to make way for something fresh.