clustering both linux and unix..

Tue Oct 16 08:45:07 PDT 2001

At 09:15 16/10/01 -0400, Robert G. Brown wrote:
>On Tue, 16 Oct 2001, Senol Tekdal wrote:
>>
>>   can i set up a cluster that work on both linux and unix?
>
>The answer is sure, easily.  However, the cluster won't be a
>"true beowulf" unless you work very hard writing new glueware, since
>true beowulf software like the Scyld distribution is mostly based on the
>assumption of linux homogeneity.

>You also have to deal with stability problems -- a distributed, tightly
>coupled computation with no checkpointing very likely will have to be
>restarted if any node goes down in mid-calculation.  If you are using
>128 nodes and 64 of them are running WinXX and the computation runs for
>a whole day before it finishes with a lot of memory management going on,
>your odds of EVER completing the work are very slender.  This isn't just
>picking on Windows (however satisfying it is to do so;-), if you split
>it up across ANY two or three or four OS's, you are subjecting yourself
>to weak-link instability across all of the choices.  One bad memory
>manager, one buggy communications stack, and your whole computation goes
>down the tubes.
>A linux-only cluster has the advantage of being immensely stable -- we
>are currently running nodes from OS installation/upgrade to OS
>installation/upgrade with no non-hardware related failures in between,
>even nodes running NIS (which "works" for us as we run mostly EP code on
><100 nodes so far) since we started rebooting the NIS servers
>therapeutically to deal with the NIS memory leak.

Most Unixes are "immensely stable", not just Linux. In my experience, a
heterogeneous cluster has no added instability problems compared to a
homogeneous one, once you have managed to get your programs working on
all OSs involved.
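
To put rough numbers on the weak-link argument quoted above, here is a minimal Python sketch. It assumes node failures are independent and exponentially distributed, so a tightly coupled run with no checkpointing completes only if every node survives. The function name and the MTBF figures are made up for illustration; they are not measurements from either of our clusters.

```python
import math


def completion_probability(node_mtbf_hours, run_hours):
    """Probability that every node survives a run of run_hours.

    Assumes independent, exponentially distributed failures, one
    mean-time-between-failures value (in hours) per node. The job
    fails as soon as any single node fails (no checkpointing).
    """
    if run_hours < 0:
        raise ValueError("run_hours must be non-negative")
    if any(mtbf <= 0 for mtbf in node_mtbf_hours):
        raise ValueError("every MTBF must be positive")
    # Survival probability of one node is exp(-t / MTBF); for the
    # whole cluster the exponents simply add up.
    total_rate = sum(1.0 / mtbf for mtbf in node_mtbf_hours)
    return math.exp(-run_hours * total_rate)


if __name__ == "__main__":
    day = 24.0
    # Hypothetical figures: "stable" nodes fail every 2000 h on
    # average, "unstable" ones every 100 h.
    all_stable = [2000.0] * 128
    half_unstable = [2000.0] * 64 + [100.0] * 64
    print("128 stable nodes:    ", completion_probability(all_stable, day))
    print("64 stable + 64 flaky:", completion_probability(half_unstable, day))
```

Under these assumptions the mix of OSs does not enter the result at all; only the per-node failure rates do. If all the OSs involved have comparable MTBFs, a heterogeneous cluster comes out no worse than a homogeneous one.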

>Glancing at our two clusters, one has been up for 73 days (except for
>one node which I've used to demonstrate kickstart node installs in the
>last week to visitors from a nearby school, and two nodes that were
>"busy" during the last OS upgrade that were upgraded late).  The other
>cluster has been up 21 days, which is also an upgrade epoch (we upgrade
>them separately to new RH kernels as they appear in the RH updates
>directory).  If it weren't for the soon-come RH 7.2 (which we've been
>running in beta on selected hosts and are kickstart-ready to upgrade
>when they finally release it) I'm confident that I could run those nodes
>"indefinitely" -- very possibly until they break or some act of God like
>a power failure or nuclear war takes them down.
>There may be some other Unix variants out there that can boast similar
>stability, but not many.
>    rgb

I think most decent Unix versions can boast similar stability. I have AIX
machines (uptimes all about 330 days now), a Solaris machine (250 days
uptime), and Linux nodes (150 to 200 days uptime). The stability of the BSDs
etc. is also similar to Red Hat's. On most machines (barring OSs like
WinXX), stability is limited by the hardware, not by OS problems,
especially once you have learned which subsystems are less stable, such as
memory-leaking NIS implementations, and are avoiding them as much as
possible (unfortunately you usually only find this type of error after a
while, and usually without a direct replacement).

Luc Vereecken