SMP robustness
Pierre Brua
brua at paralline.com
Wed Jul 26 15:21:03 PDT 2000
Someone wrote:
> This is not the case with the Origin 2000. The O2000 will shutdown
> problem processors and continue running.
> There are a number of Origin Issues (pricing, performance, etc) but
> reliability and failover have not been problems in my three years of
> Origin Admin.
Maybe we are not talking about the same computers. You are talking about
the Origin2000 from SGI, aren't you? If you have a small one you may not
run into many problems like that, but with big configurations it is
another matter...
Let me give three examples:
* power supply block failure
There is one power supply block for each block of 8 processors. With the
config I had, each 8-CPU block suffered such a failure roughly once per
year. When that happens, the whole supercomputer goes down.
* SCSI controller lock-up
Some problems with backup devices can lock up the O2K SCSI controller
badly. In that case only a reboot can fix it. A reboot of just the
processor controlling the SCSI bus the device is on? No way: a reboot
of the whole supercomputer. You can even read that in black and white
in some O2K docs.
* processor fault
If a processor is faulty or burned out, you have to shut down your
entire supercomputer to replace it.
Beowulf systems, viewed as a lot of small independent pieces of hardware,
are considerably more robust from that point of view. Like a bunch of ants.
That's not to say the O2K is not a good supercomputer, of course, but its
integrated hardware has some unexpected "features" in that area.
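To put rough numbers on the power supply example, here is a small Python
sketch. Only the figure of about one failure per year per 8-CPU block comes
from my experience above. The 128-CPU machine size, the 4 hours of downtime
per incident, and the idea that a cluster built from 8-CPU groups would fail
at the same rate are all assumptions made purely for illustration, and the
function names are hypothetical.

```python
def block_failures_per_year(total_cpus, cpus_per_block=8, rate_per_block=1.0):
    """Expected power supply block failures per year across the machine.

    rate_per_block is failures per block per year (about 1.0 in the
    configuration described above).
    """
    if total_cpus % cpus_per_block:
        raise ValueError("total_cpus must be a multiple of cpus_per_block")
    return (total_cpus // cpus_per_block) * rate_per_block


def lost_cpu_hours_per_year(total_cpus, downtime_hours, whole_machine,
                            cpus_per_block=8, rate_per_block=1.0):
    """CPU-hours lost per year to power supply block failures.

    whole_machine=True  -> every failure stops all CPUs (integrated SMP).
    whole_machine=False -> a failure stops only its own block (assumed
                           behaviour of a cluster of independent nodes).
    """
    failures = block_failures_per_year(total_cpus, cpus_per_block, rate_per_block)
    affected_cpus = total_cpus if whole_machine else cpus_per_block
    return failures * affected_cpus * downtime_hours


if __name__ == "__main__":
    cpus = 128          # assumed machine size
    downtime = 4.0      # assumed hours lost per incident
    print("failures/year:", block_failures_per_year(cpus))
    print("integrated, CPU-hours lost:",
          lost_cpu_hours_per_year(cpus, downtime, whole_machine=True))
    print("cluster, CPU-hours lost:",
          lost_cpu_hours_per_year(cpus, downtime, whole_machine=False))
```

Under these assumptions the number of incidents is the same in both cases.
What changes is how many CPUs each incident takes down, so the gap grows
with the number of blocks in the machine.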
Pierre
--
PARALLINE Pierre BRUA Parallelism & Linux Solutions
71,av. des Vosges Phone:+33 388 141 740 mailto:brua at paralline.com
F-67000 STRASBOURG Fax:+33 388 141 741 http://www.paralline.com