cluster frustrations

Joachim Worringen joachim at lfbs.RWTH-Aachen.DE
Thu Jan 17 00:12:24 PST 2002

Patrick Geoffray wrote:
> Joachim,
> Joachim Worringen wrote:
> > But they don't get it to run reliably with
> > the current Linux/GM/MPICH versions which of course should run faster,
> > better, nicer. I don't blame Linux or Myrinet for these problems -
> Obviously, you do. Inciting another flame war ?

No, I never intend to incite flame wars, only discussions. I can tell you
a lot of stories about malfunctioning self-made SCI clusters, but I
have no hands-on experience with such a cluster being operated in a
similar (production) environment, because such customers usually chose
Scali-made systems. And I prefer to talk about hands-on experience, not
second-hand stories. The Scali-equipped systems I know of run well now,
although this hasn't always been the case (mostly due to bugs/strange
features in the last-generation hardware, LC2). But Scali systems, to
stick with these, are well-defined platforms, running qualified kernels
etc.; not using such a qualified setup is one source of problems.

> So if you really experienced problems with this machine, please
> contact help at, this is the first step toward happiness.

I had reproducible application aborts when running PMB with 32
processes. I informed Ulrich Detert about this, and he confirmed the
problems. Up to now, they stick with 2.2 (which runs stably, but not as
fast as it could), which does *not* mean that such a system wouldn't work
with 2.4 and current GM - it's only that these guys did try to find that
"golden configuration" during their update (or by chance did hit the one
dirty configuration) and didn't succeed.

Once again: I don't doubt that there are Myrinet systems which run
perfectly. There may just be many opportunities (with self-made clusters
in general) to make mistakes that hinder stable operation.

> You cannot compare Crays/SP2 with do-it-yourself Linux clusters. 

Exactly. Paying less money means investing more time. Which may be
equivalent to money.


|  _  RWTH|  Joachim Worringen
|_|_`_    |  Lehrstuhl fuer Betriebssysteme, RWTH Aachen
  | |_)(_`|
    |_)._)|  fon: ++49-241-80.27609 fax: ++49-241-80.22339
