[Beowulf] [tt] One million ARM chips challenge Intel bumblebee

Ellis H. Wilson III ellis at runnersroll.com
Thu Jul 7 12:25:35 PDT 2011


On 07/07/11 14:26, Lux, Jim (337C) wrote:
>>> It's all about ultimate scalability.  Anybody with a moderate
>>> competence (certainly anyone on this list) could devise a scheme to
>>> use 1000 perfect processors that never fail to do 1000 quanta of work
>>> in unit time.  It's substantially more challenging to devise a scheme
>>> to do 1000 quanta of work in unit time on, say, 1500 processors with
>>> a 20% failure rate.  Or even in 1.2*unit time.
>>>
>>
>> Just to be clear - I wasn't saying this was a bad idea. Scaling up to
>> this size seems inevitable. I was just imagining the team of admins who
>> would have to be working non-stop to replace dead processors!
>>
>> I wonder what the architecture for this system will be like. I imagine
>> it will be built around small multi-socket blades that are hot-swappable
>> to handle this.
> 
> I think that you just anticipate the failures and deal with them.  It's challenging to write code to do this, but it's certainly a worthy objective. I can easily see a situation where the cost to replace dead units is so high that you just don't bother doing it: it's cheaper to just add more live ones to the "pool".

Or rather than replace or add to the pool, perhaps just allow the ones
that die to just, well, stay dead.  The issue with things at this scale
is that, unlike for an individual or smallish business, there are very
few good reasons not to upgrade every, say, 3 to 5 years.  The costs of
having spare CPUs sitting around waiting to be swapped in, of having
administrators replace parts, and of any downtime the replacements
require seem at first glance to outweigh the benefit of swapping dead
units at all, which leaves the elegance of "letting nature take its
course" with the supercomputer.

For instance, if Prentice's MTBF of 1 million hours is realistic (I
personally have no idea if it is), then with one million CPUs running
for 43,800 hours (5 years) that's "only" about 43,800 dead CPUs by the
end of year 5.  That's less than 5% of the total capacity - i.e. not a
big deal if this system can truly tolerate and route around failures as
our brains do.  Perhaps they could study old and/or drug abusing bees at
that stage, hehe.
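
For anyone who wants to check the arithmetic, here is a rough sketch in
Python.  It assumes one million CPUs (taken from the subject line), a
constant failure rate, and no replacements; the function name and the
8,760-hour year are my own choices, not anything from the project.  The
linear figure is the back-of-envelope number; the exponential figure
only counts each CPU as dying once, so it comes out slightly lower.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours; leap days ignored (assumption)


def expected_failures(n_cpus, mtbf_hours, years):
    """Return (linear, exponential) estimates of dead CPUs after `years`.

    Both estimates assume a constant failure rate of 1/mtbf_hours per CPU
    and that dead CPUs are never replaced.
    """
    hours = years * HOURS_PER_YEAR
    # Back-of-envelope: failures = CPUs * hours / MTBF
    linear = n_cpus * hours / mtbf_hours
    # Exponential lifetime model: each CPU has failed by time t with
    # probability 1 - exp(-t / MTBF)
    exponential = n_cpus * (1.0 - math.exp(-hours / mtbf_hours))
    return linear, exponential


n_cpus = 1_000_000      # assumed total, from the "one million ARM chips" subject
mtbf_hours = 1_000_000  # the MTBF figure quoted in this thread

linear, exponential = expected_failures(n_cpus, mtbf_hours, years=5)
print(f"linear estimate:      {linear:,.0f} CPUs ({linear / n_cpus:.2%})")
print(f"exponential estimate: {exponential:,.0f} CPUs ({exponential / n_cpus:.2%})")
```

Either way the loss stays under 5% of the machine after five years, as
long as the MTBF figure holds up.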

Just my 2 wampum,

ellis


