clustering both linux and unix..

Luc Vereecken Luc.Vereecken at
Wed Oct 17 05:16:23 PDT 2001

Hmmm... rgb, you seem to see the difficulties in keeping heterogenous
clusters running, and I the possibilities. Some more thoughts... (the
truth is, as usual, somewhere in the middle)

At 14:17 16/10/01 -0400, you wrote:
>On Tue, 16 Oct 2001, Luc Vereecken wrote:
>> Most Unixes are "immensely stable", not just Linux. In my experience, there
>> are no added instability problems in a heterogenous cluster compared to a
>> homogenous cluster, once you have managed to get your programs working on
>> all OSs involved.
>This is sort of like saying that there are no additional problems once
>you've solved the additional problems. The point is that overall admin
>and applications development effort scales at least linearly --
>different packagings, different maintenance at the OS level, different
>include files and different libraries at the application programming
>level (although e.g. POSIX compliance has to some extent ameliorated the

OK, that didn't come out the way I intended :-) It kind of depends on what
programs you want to run.
Getting your programs working on different OSs need not involve additional
problems. Many commercial and non-commercial packages have been ported
across different operating systems, and installing them on different OSs is
always the same (easy) procedure (I usually write scripts that remote-copy
the tarfile, untar it, configure, make and make install, and iterate over
all OSs). Libraries such as MPICH also seem to work well cross-platform;
compiling a well-written MPI Fortran or C program on different OSs is not
a problem at all; interoperable queueing systems are available for all
OSs, and I have never run into stability problems in heterogenous
calculations.
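
A minimal sketch of such an install loop, in Python. Everything specific
here is an assumption for illustration: the host names, the tarball name,
the /tmp staging directory, and the use of scp/ssh as the remote-copy and
remote-shell commands. It defaults to a dry run that only prints what it
would do.

```python
import shlex
import subprocess

# Hypothetical host list: one machine per OS (names are made up).
HOSTS = ["linux01", "solaris01", "aix01"]


def build_commands(host, tarball, srcdir, prefix="/usr/local"):
    """Return the commands that install one source tarball on one host."""
    remote_tar = "/tmp/" + tarball.rsplit("/", 1)[-1]
    remote_steps = " && ".join([
        "cd /tmp",
        # gzip -dc | tar xf - also works where tar has no -z option
        "gzip -dc {} | tar xf -".format(shlex.quote(remote_tar)),
        "cd {}".format(shlex.quote(srcdir)),
        "./configure --prefix={}".format(shlex.quote(prefix)),
        "make",
        "make install",
    ])
    return [
        ["scp", tarball, "{}:{}".format(host, remote_tar)],
        ["ssh", host, remote_steps],
    ]


def install_everywhere(hosts, tarball, srcdir, dry_run=True):
    """Run (or just print) the install steps on every host; return failures."""
    failed = []
    for host in hosts:
        for cmd in build_commands(host, tarball, srcdir):
            if dry_run:
                print(" ".join(shlex.quote(part) for part in cmd))
                continue
            if subprocess.run(cmd).returncode != 0:
                failed.append(host)
                break  # skip the remaining steps for this host
    return failed


if __name__ == "__main__":
    # Dry run: prints the scp/ssh commands without executing them.
    install_everywhere(HOSTS, "mpich.tar.gz", "mpich")
```

The same sequence of steps runs on every host; only the host name
changes, which is what keeps the per-OS cost low when the package's own
configure script handles the platform differences.
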
Admin and development costs do not scale at least linearly. If running one
OS requires 100%, then adding a second one only adds about 40% additional
effort (mainly because you weren't yet aware that you were using
OS-specific commands/syntax/... and you need to change your way of working
a bit), and the third OS adds even less, say 15%. If you know in advance
that you will be aiming for heterogenous clusters, you tend to use a
different style of admin/development, one that is more portable and relies
more on what all OSs have in common rather than focussing on OS-specific
enhancements.
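
To make the arithmetic concrete, here is a small Python sketch comparing
the estimate above (100%, then +40%, then +15%) with strictly linear
scaling at 100% per OS. The three increments are the rough estimates from
the text, not measurements; the function names are made up for this
example.

```python
# Incremental admin/development effort per added OS, in percent of the
# single-OS effort. These are the rough estimates quoted in the text.
INCREMENTAL_EFFORT = [100, 40, 15]


def total_effort(n_os):
    """Total effort (percent) for the first n_os operating systems."""
    return sum(INCREMENTAL_EFFORT[:n_os])


def linear_effort(n_os):
    """Effort under strictly linear scaling: 100% per OS."""
    return 100 * n_os


for n in range(1, len(INCREMENTAL_EFFORT) + 1):
    print(n, total_effort(n), linear_effort(n))
# prints:
# 1 100 100
# 2 140 200
# 3 155 300
```
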
Also, a large part of maintaining a cluster is related to unique services
that run on only one computer, and that therefore need not be maintained
across different OSs: web server, mail routers, firewalls,
ssh authentication issues, encryption libraries, running X applications,
nifty scripts that automatically generate stat HTML pages, ... These
typically run on one computer only (e.g. the head node), and are therefore
by nature not affected by a heterogenous environment.

>Also, most of the alternative Unices (with the exception of
>FreeBSD) are not open source, as well, which adds its own layers of
>difficulty and instability which have been discussed at length on the
>list.  Open source doesn't mean working and functional, but it at least
>gives you a fighting chance at fixing some of the stuff that doesn't
>work (as work by e.g. Josip Loncaric and others has clearly demonstrated
>in this venue).

How much free time do you have to fix kernel mysteries? I have none, nor
do I want to spend time writing/improving OSs (or applications I didn't
develop, for that matter) at this time. Just because improvement is a
theoretical possibility does not make it practical. For that matter, I run
into more problems with Linux than I do with other OSs. The libc version
problems alone drove me mad at one point, and I never had to tune my TCP
stack parameters in, say, Solaris or AIX. That, however, could also be due
to the fact that they run on more performant hardware than PCs, and the
bottleneck is usually the Linux PCs.

>It is also undeniable that one's risk of jobkilling errors is some
>combinatorial factor higher when running on several OS's rather than
>just one (in fact, this is just a restatement of the previous
>observation -- if you spend less than Nx the effort on average, you run
>greater risks, on average on one of the OS's).  Again, the list has seen
>reports over the years of many problems that affect one particular
>kernel subsystem (such as the TCP stack) in one particular kernel
>flavor, sometimes in just one parallel library.  Those problems can
>sometimes be very time consuming to solve (and may require access to the
>kernel sources to even identify).

Hmmm... I still think that hardware instabilities cause more jobkilling
errors than OS-related problems (bad memory, overheating CPUs, badly
manufactured motherboards, switch problems, ...). Hence, I think that the
risk of jobkilling errors is mainly a combinatorial factor of the number of
machines, not the number of OSs (which will be small compared to the number
of machines: there are only about 6 to 10 decent OSs out there). But
statistically speaking you are right that it will have an effect. I
wouldn't know, because I can't even remember the last time I had a
jobkilling error I couldn't trace back to a hardware problem (switches
behaving weirdly, overly long network delays between the different
buildings (I also use the university's SP2 nodes in my parallel
calculations)).
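
One way to put numbers on this argument is a simple independence model,
sketched below in Python. The model and all values are assumptions for
illustration only: each machine has the same chance p_hw of a hardware
fault during a run, each OS flavour has the same chance p_os of an
OS-level fault, and all faults are independent. Neither poster gives such
figures.

```python
def job_failure_probability(n_machines, p_hw, n_os, p_os):
    """Probability that at least one independent fault kills the job."""
    # The job survives only if no machine and no OS flavour faults.
    survive = (1.0 - p_hw) ** n_machines * (1.0 - p_os) ** n_os
    return 1.0 - survive


# Hypothetical numbers: 64 machines, 1% fault chance per machine and per OS.
one_os = job_failure_probability(64, 0.01, 1, 0.01)
three_os = job_failure_probability(64, 0.01, 3, 0.01)
print(one_os, three_os)
```

Under these assumptions the number of machines enters as a much larger
exponent than the number of OSs, so going from one OS to three raises the
failure probability only slightly, while it is never zero.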

>Still, if one >>can<< reduce the number of OS's supported in any given
>organization, one almost always realizes economies of scale and sees
>improved scaling and stability.  One person can easily run an extremely
>large linux-only network.  If one person CAN easily run an extremely
>large AIX+Linux+Irix+DU+Solaris+... network, my hat is off to them!
>They are clearly in the Unix Super Genius category of human -- I've
>never managed more than 3 Unixoid OS's at once, and one of those was
>pretty poorly run to be frank.  Nowadays I would not willingly handle
>more than one...;-)

Now this I agree with entirely. Running only one brand is always easier,
more economical, and scales better. If given the option to use a homogenous
cluster, take it. Better stability I'm not too sure of: on my systems
instabilities occur so infrequently as to make no difference compared with
a single-OS environment. Maybe I'm just lucky, or just happen to run a mix
of applications that avoids problematic areas. I would not hesitate to go
for another heterogenous cluster.


Luc Vereecken
