clustering both linux and unix..
Luc Vereecken Luc.Vereecken at chem.kuleuven.ac.be
Wed Oct 17 05:16:23 PDT 2001
Hmmm... rgb, you seem to see the difficulties in keeping heterogeneous
clusters running, and I the possibilities of keeping them running. Some
more thoughts... (the truth is, as usual, somewhere in the middle)

At 14:17 16/10/01 -0400, you wrote:
>On Tue, 16 Oct 2001, Luc Vereecken wrote:
>
>> Most Unixes are "immensely stable", not just Linux. In my experience, there
>> are no added instability problems in a heterogenous cluster compared to a
>> homogenous cluster, one you have managed to get your programs working on
>> all OSs involved.
>
>This is sort of like saying that there are no additional problems once
>you've solved the additional problems. The point is that overall admin
>and applications development effort scales at least linearly --
>different packagings, different maintenance at the OS level, different
>include files and different libraries at the application programming
>level (although e.g. POSIX compliance has to some extent ameliorated the
>latter).

OK, that didn't come out the way I intended :-) It rather depends on what
programs you want to run. Getting your programs working on different OSs
need not involve additional problems. Many commercial and non-commercial
packages have been ported across different operating systems, and
installing them on different OSs is always the same (easy) procedure (I
usually write scripts that remote-copy the tarfile, untar it, configure,
make and make install, and iterate over all OSs). Libraries such as MPICH
etc. also seem to work well cross-platform; compiling a well-written MPI
Fortran or C program on different OSs is not a problem at all;
interoperable queueing systems are available for all OSs; and I have never
run into stability problems in heterogeneous calculations.

Admin and development costs do not scale at least linearly. If running one
OS requires 100% effort, then adding a second one adds only about 40%
(mainly because you weren't yet aware that you were using OS-specific
commands/syntax/...
and you need to change your way of working a bit), and the third OS adds
even less, say 15%. If you know in advance that you will be aiming for
heterogeneous clusters, you tend to adopt a different style of admin and
development, one that is more portable and relies more on what all OSs
have in common rather than focussing on OS-specific enhancements.

Also, a large part of maintaining a cluster is related to unique services
that run on only one computer, and that therefore need not be maintained
across different OSs: webserver, mail routers, firewalls,
ssh-authentication issues, encryption libraries, running X applications,
nifty scripts that automatically generate stat HTML pages, ... These
typically run on only one computer (e.g. the head node), and are therefore
by nature not affected by a heterogeneous environment.

>Also, most of the alternative Unices (with the exception of
>FreeBSD) are not open source, as well, which adds its own layers of
>difficulty and instability which have been discussed at length on the
>list. Open source doesn't mean working and functional, but it at least
>gives you a fighting chance at fixing some of the stuff that doesn't
>work (as work by e.g. Josip Loncaric and others has clearly demonstrated
>in this venue).

How much free time do you have to fix kernel mysteries? I have none, nor
do I want to spend time writing/improving OSs (or applications I didn't
develop, for that matter) at this time. Just because improvement is a
theoretical possibility does not make it practical. For that matter, I
run into more problems with Linux than I do with other OSs. The libc
version problems alone drove me mad at one point, and I never had to tune
my TCP stack parameters in, say, Solaris or AIX. That, however, could also
be due to the fact that they run on more performant hardware than PCs, and
the bottleneck is usually the Linux PCs.
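
The install procedure described above (remote-copy the tarfile, untar it,
configure, make, make install, repeated for every OS) could be sketched
roughly as follows. This is only an illustration in Python, not the
scripts actually in use: the host names, the /tmp/build directory, the use
of ssh and scp for the remote steps, and the assumption that the tarball
unpacks into a directory named after itself are all made up for the
example.

```python
import shlex
import subprocess

# Hypothetical host names, one per OS in the cluster.
HOSTS = ["linux-node", "solaris-node", "aix-node"]


def install_commands(host, tarball, build_dir="/tmp/build"):
    """Return the argv lists that copy `tarball` to `host` and build it there."""
    name = tarball.rsplit("/", 1)[-1]
    if not name.endswith(".tar.gz"):
        raise ValueError("expected a .tar.gz file: " + tarball)
    # Assumption: the tarball unpacks into a directory named after itself.
    srcdir = name[: -len(".tar.gz")]
    remote_build = " && ".join([
        "cd " + shlex.quote(build_dir),
        # gzip piped into tar, since not every vendor tar understands 'z'.
        "gzip -dc " + shlex.quote(name) + " | tar xf -",
        "cd " + shlex.quote(srcdir),
        "./configure",
        "make",
        "make install",
    ])
    return [
        ["ssh", host, "mkdir -p " + shlex.quote(build_dir)],
        ["scp", tarball, host + ":" + build_dir + "/"],
        ["ssh", host, remote_build],
    ]


def install_everywhere(tarball, hosts=HOSTS, dry_run=True):
    """Run the install on every host; return the hosts where a step failed."""
    failed = []
    for host in hosts:
        for cmd in install_commands(host, tarball):
            if dry_run:
                # Only show what would be executed.
                print(shlex.join(cmd))
                continue
            if subprocess.run(cmd).returncode != 0:
                failed.append(host)
                break  # skip the remaining steps on this host
    return failed
```

Calling install_everywhere("mpich.tar.gz") with the default dry_run=True
only prints the commands, which is a cheap way to check the loop before
letting it touch any machine. The same three steps run unchanged on every
host, which is the point: nothing in the loop is OS-specific.
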

>It is also undeniable that one's risk of jobkilling errors is some
>combinatorial factor higher when running on several OS's rather than
>just one (in fact, this is just a restatement of the previous
>observation -- if you spend less than Nx the effort on average, you run
>greater risks, on average on one of the OS's). Again, the list has seen
>reports over the years of many problems that affect one particular
>kernel subsystem (such as the TCP stack) in one particular kernel
>flavor, sometimes in just one parallel library. Those problems can
>sometimes be very time consuming to solve (and may require access to the
>kernel sources to even identify).

Hmmm... I still think that hardware instabilities cause more jobkilling
errors than OS-related problems (bad memory, overheating CPUs, badly
manufactured mobos, switch problems, ...). Hence, I think that the risk of
jobkilling errors is mainly a combinatorial factor of the number of
machines, not the number of OSs (which will be small compared to the
number of machines: there are only about 6 to 10 decent OSs out there).
But statistically speaking you are right that it will have an effect. I
wouldn't know, because I can't even remember the last time I had a
jobkilling error I couldn't trace back to a hardware problem (switches
behaving weirdly, overly long network delays between the different
buildings (I also use the university's SP2 nodes in my parallel
calculations)).

>Still, if one >>can<< reduce the number of OS's supported in any given
>organization, one almost always realizes economies of scale and sees
>improved scaling and stability. One person can easily run an extremely
>large linux-only network. If one person CAN easily run an extremely
>large AIX+Linux+Irix+DU+Solaris+... network, my hat is off to them!
>They are clearly in the Unix Super Genius category of human -- I've
>never managed more than 3 Unixoid OS's at once, and one of those was
>pretty poorly run to be frank.
>Nowadays I would not willingly handle
>more than one...;-)

Now this I agree with entirely. Running only one brand is always easier,
more economical, and scales better. If given the option to use a
homogeneous cluster, take it. Better stability I'm not too sure of: on my
systems, instabilities occur so infrequently as to make no difference
compared with a single-OS environment. Maybe I'm just lucky, or just
happen to run a mix of applications that avoids the problematic areas. I
would not hesitate to go for another heterogeneous cluster.

Cheers,
Luc Vereecken
More information about the Beowulf mailing list
