SMP robustness

Pierre Brua brua at paralline.com
Wed Jul 26 15:21:03 PDT 2000


Someone wrote:
> This is not the case with the Origin 2000. The O2000 will shutdown
> problem processors and continue running.
> There are a number of Origin Issues (pricing, performance, etc) but
> reliability and failover have not been problems in my three years of
> Origin Admin.

Maybe we are not talking about the same computers. You are talking about
the Origin 2000 from SGI, aren't you? If you have a small one you may not
experience many problems like that, but with big configurations it's
another matter...

Let me give three examples:

* power supply failure
	There is one power supply for each block of 8 processors. With the
config I had, one failed approximately once per year per 8-CPU block.
When that happens, the whole supercomputer goes down.

* SCSI controller lockup
	Some problems with backup devices can lock up the O2K SCSI controller
badly. In that case only a reboot can fix it. A reboot of just the
processor controlling the SCSI bus the device is on? No way: a reboot of
the whole supercomputer. You can even read that in black and white in
some O2K docs.

* faulty processor
	If a processor is faulty or burned out, you have to shut down your
entire supercomputer to replace it.

	Beowulf systems, viewed as a lot of little independent hardware pieces,
are considerably more robust from that point of view. Like a bunch of ants.
	That's not to say O2Ks are not good supercomputers, of course, but their
integrated hardware has some unexpected "features" in that area.
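
	To put rough numbers on the first example (about one failure per year
per 8-CPU block), here is a minimal Python sketch. It assumes failures
are independent, that outages do not overlap, and that an independent
node fails at the same rate as an 8-CPU block. The 16-block machine size
and the 4-hour repair time are hypothetical values picked for
illustration, not figures from my site. The function names are made up
for this sketch.

```python
HOURS_PER_YEAR = 24 * 365


def capacity_lost_shared(n_blocks, failures_per_block_per_year, repair_hours):
    """Fraction of yearly capacity lost when ANY block failure stops
    the whole machine (the tightly integrated case)."""
    outages_per_year = n_blocks * failures_per_block_per_year
    return outages_per_year * repair_hours / HOURS_PER_YEAR


def capacity_lost_independent(n_blocks, failures_per_block_per_year, repair_hours):
    """Fraction of yearly capacity lost when a failure only stops the
    block it occurs in (the cluster-of-independent-pieces case)."""
    outages_per_year = n_blocks * failures_per_block_per_year
    # Each outage removes only 1/n_blocks of the total capacity.
    return outages_per_year * repair_hours / HOURS_PER_YEAR / n_blocks


if __name__ == "__main__":
    n_blocks = 16        # hypothetical: 16 blocks of 8 CPUs = 128 CPUs
    rate = 1.0           # from the post: ~1 failure per year per block
    repair_hours = 4.0   # hypothetical repair time per incident

    shared = capacity_lost_shared(n_blocks, rate, repair_hours)
    independent = capacity_lost_independent(n_blocks, rate, repair_hours)
    print(f"whole-machine outages per year: {n_blocks * rate:.0f}")
    print(f"capacity lost, shared fate:     {shared:.4%}")
    print(f"capacity lost, independent:     {independent:.4%}")
```

	Under these assumptions the integrated machine loses n_blocks times
more capacity than the independent pieces do, and every user sees every
outage instead of only the jobs on the failed block.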

	Pierre
-- 
PARALLINE         Pierre BRUA    Parallelism & Linux    Solutions
71,av. des Vosges Phone:+33 388 141 740 mailto:brua at paralline.com
F-67000 STRASBOURG  Fax:+33 388 141 741  http://www.paralline.com



