[Beowulf] Supercomputers face growing resilience problems

Fri Nov 23 09:51:04 PST 2012

On 22 Nov 2012, at 16:12, Lux, Jim (337C) wrote:

> And on slashdot, as well..
> http://hardware.slashdot.org/story/12/11/21/2259233/supercomputers-growing-resilience-problems

I think the original suggestion is right on the money; I've always wondered if it would be possible to do something like this.  The traditional argument is that RAM is too expensive and disks are too slow, but neither of those is as compelling as it was ten years ago.  There are some interesting, non-trivial ways in which compression, xor overlays of multiple messages and redundancy levels interact to dictate the amount of storage that would be required, more so if you only try to protect against a single failure.  As Jim also said, as soon as you break the assumption that hardware is reliable you immediately get into application-specific, or at least algorithm-specific, approaches.  Nearest-neighbour comms and global reduce have vastly different characteristics, both in the amount of data that would need to be saved and in how many processes would need to be involved in any rollback.  Trying to write next-generation applications and middleware whilst keeping within the confines of the MPI specification is probably not ideal either.
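To make the xor overlay point a bit more concrete, here is a minimal sketch in Python.  It assumes the saved messages are plain byte strings and that only one of them is lost at a time; the function names (xor_overlay, recover_missing) and the sample payloads are made up for illustration and are not taken from any real MPI or checkpointing library.

```python
def xor_overlay(messages):
    """XOR several messages into one parity buffer, zero-padding shorter ones."""
    size = max(len(m) for m in messages)
    parity = bytearray(size)
    for msg in messages:
        for i, byte in enumerate(msg):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(parity, survivors, missing_len):
    """Rebuild the single lost message from the parity and every survivor."""
    rebuilt = bytearray(parity)
    for msg in survivors:
        for i, byte in enumerate(msg):
            rebuilt[i] ^= byte
    # The original length has to be recorded separately, since padding hides it.
    return bytes(rebuilt[:missing_len])


# Hypothetical payloads standing in for logged messages.
messages = [b"halo row 17", b"halo row 18", b"residual=0.0031"]
parity = xor_overlay(messages)

# Pretend the second message was lost along with the node that held it.
lost = messages[1]
rebuilt = recover_missing(parity, [messages[0], messages[2]], len(lost))
assert rebuilt == lost

# The redundant copy costs one buffer the size of the longest message,
# rather than a full second copy of every message.
extra_storage = len(parity)
full_copy_storage = sum(len(m) for m in messages)
```

The trade-off is the one described above: the overlay only covers a single failure within the group, and recovery needs every other message in that group to still be available, so the group size and redundancy level together set how much storage is needed.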

> Bringing up a topic that this list is well suited to address.
> 
> How do you describe the difference between HPC of the type discussed here
> and, say, Google's search engine/processing, or Amazon's cloud offerings.
> What, exactly, makes it special?

Determinism is the big difference here: a lot of the big map/reduce players don't need the same answers 100% of the time.  Really though, the similarities are bigger than the differences, and there isn't nearly as much cross-flow of ideas as there should be.  At least in part this is due to a "not invented here" attitude on both sides, but commercial pressures also keep a lot of the work and algorithms secret.  Just look at the number of people from HPC who have signed on with Amazon/Google and then seemingly disappeared from the community completely.

Ashley.