[Beowulf] [tt] One million ARM chips challenge Intel bumblebee

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Thu Jul 7 11:26:04 PDT 2011

> > It's all about ultimate scalability.  Anybody with a moderate competence (certainly anyone on this
> > list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work
> > in unit time.  It's substantially more challenging to devise a scheme to do 1000 quanta of work in
> > unit time on, say, 1500 processors with a 20% failure rate.  Or even in 1.2*unit time.
> >
> Just to be clear - I wasn't saying this was a bad idea. Scaling up to
> this size seems inevitable. I was just imagining the team of admins who
> would have to be working non-stop to replace dead processors!
> I wonder what the architecture for this system will be like. I imagine
> it will be built around small multi-socket blades that are hot-swappable
> to handle this.

I think that you just anticipate the failures and deal with them.  It's challenging to write code to do this, but it's certainly a worthy objective. I can easily see a situation where the cost to replace dead units is so high that you just don't bother doing it: it's cheaper to just add more live ones to the "pool".
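
To make the "anticipate the failures and deal with them" idea concrete, here is a rough toy model in Python. Everything in it is my own assumption for illustration, not a description of any real system: the function name run_with_failures is made up, "20% failure rate" is read as each live processor independently dying with probability 0.2 during a round, each live processor handles one quantum per round, a quantum that was on a dead processor simply goes back in the queue, and dead processors are never replaced.

```python
import random


def run_with_failures(n_tasks, n_procs, fail_prob, seed=0, max_rounds=10_000):
    """Toy model: return (rounds, live_procs) needed to finish n_tasks quanta.

    Assumptions (illustrative only): failures are independent, happen once
    per round with probability fail_prob, and dead processors stay dead.
    """
    rng = random.Random(seed)
    pending = n_tasks   # quanta not yet completed
    live = n_procs      # processors still alive
    rounds = 0
    while pending > 0:
        if live == 0 or rounds >= max_rounds:
            raise RuntimeError("pool exhausted before the work finished")
        rounds += 1
        busy = min(pending, live)  # one quantum per live processor per round
        # Each live processor, busy or idle, may die during this round.
        failed_busy = sum(rng.random() < fail_prob for _ in range(busy))
        failed_idle = sum(rng.random() < fail_prob for _ in range(live - busy))
        pending -= busy - failed_busy   # work on dead processors is retried
        live -= failed_busy + failed_idle
    return rounds, live


if __name__ == "__main__":
    rounds, live = run_with_failures(1000, 1500, 0.20)
    print(f"finished in {rounds} rounds with {live} processors still alive")
```

Under these assumptions the expected number of survivors after one round is 1500 * 0.8 = 1200, so there is enough capacity on average, but simple requeueing still needs several rounds to mop up the retries. Getting the job done in unit time (or 1.2*unit time) would take something smarter, such as running redundant copies of some quanta up front, which is where the real difficulty lies.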
