[Beowulf] SGI to offer Windows on clusters

Mon Apr 16 11:12:34 PDT 2007

On Mon, 16 Apr 2007, John Hearns wrote:

> Robert G. Brown wrote:
>> On Sun, 15 Apr 2007, John Hearns wrote:
>> 
>>> And re. the future version of Scientific Linux, there has been debate on 
>>> the list re. co-operating with CENTos and essentially using CENTos 
>
>> 
>> IMO, most cluster builders will find it more advantageous to track the
>> FC releases instead of using RHEL or Centos or things derived therefrom.
>> Hardware support is key, and Centos can get long in the tooth pretty
>> quickly in a cluster environment with any sort of annual turnover.
>> 
> Bob,
>   at long last I can take issue with you.

Oh, I expected that -- and maybe even provoked it a wee bit;-).  Indeed,
we should very likely rename the thread since we're off (once again) in
LSB-land, on just what "supported" means for a distro.

> I don't agree re. Fedora.    We as cluster builders have to support machines 
> for at least three years, and are commonly requested to extend support. I 
> don't see how we can support a distribution which has a 'live' lifetime of 
> six months (not sure how long updates are for after that). After three years 
> the distro is far, far out of date.

My point exactly, the other way around.  After three years Centos is
far, far out of date.  Its STABILITY is neither greater nor less than FC
frozen at more or less the same date, since both of them are 99% or
better pure and stable and debugged a year after release.  The
difference is that FC "has" an "easy" upgrade path, if you choose to
take it.  Centos does not.

> Your point re. hardware support of course is correct, and refutes my argument 
> above. We deal with this by backporting up-to-date kernels and drivers, (and 
> other packages such as dhcp server to RH 7.3 recently!)

Right, and IMO backporting kernels and complex packages with significant
dependencies is far, far more difficult than resolving problems with an
FC upgrade once every year or two.  The latter generally doesn't require
code to be touched (programmers competent to backport being "expensive",
time being "expensive", hardware to prototype a free distribution being
"cheap").

You start by rerunning your kickstart script on a test node with the new
FC, maybe four or five months after it is first released (to give early
adopters a chance to do most of the debugging for you).  Be sure to
yum-update when done.  With luck it just works.  If your node package
installation is modest, chances are you'll be lucky, as all you
generally need is a nearly completely standard core and a handful of
specialized libraries that mostly DEPEND only on the standard core.  I
cannot imagine a stripped node installation consisting of the basic
minimal unix (e.g. libc and libm), the GSL, ssh, and maybe lam or pvm
not "just working" even one whole month post-release.  If you move on up
and try to install cernlib or x-based apps or a full workstation install
or open office on each node, well, your stability will diminish although
it is likely to be an APPLICATION level instability and not a kernel
level instability.
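
One way to check a prototype node against the rest of the cluster, as a
rough sketch rather than a recipe: dump the package list on an existing
node and on the freshly installed test node (for example with
rpm -qa --qf '%{NAME} %{VERSION}-%{RELEASE}\n'), then compare the two.
The function names and the sample package versions below are made up
for illustration.

```python
def parse_manifest(text):
    """Parse lines of the form 'name version-release' into a dict.

    Assumption: one package per name. Packages installed in several
    versions (e.g. multiple kernels) collapse to the last one seen.
    """
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            name, version = parts
            packages[name] = version
    return packages


def compare_manifests(old_text, new_text):
    """Report what differs between an existing node and a test node."""
    old = parse_manifest(old_text)
    new = parse_manifest(new_text)
    return {
        # on the old node but absent from the new install
        "missing": sorted(set(old) - set(new)),
        # pulled in by the new install only
        "added": sorted(set(new) - set(old)),
        # present on both, but at a different version
        "changed": sorted(n for n in set(old) & set(new) if old[n] != new[n]),
    }


# Invented sample data, not taken from any real node.
OLD_NODE = "glibc 2.5-3\ngsl 1.8-1\nlam 7.1.2-8\n"
NEW_NODE = "glibc 2.5-10\ngsl 1.8-1\npvm 3.4.5-7\n"

if __name__ == "__main__":
    for kind, names in compare_manifests(OLD_NODE, NEW_NODE).items():
        print(kind, names)
```

Anything under "missing" is the first place to look if an application
stops working on the test node.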

So in maybe 95% of all cases, you boot-install a prototype node or three
and they just work with no visible difference in stability or
performance relative to the other nodes in your cluster.  In 5% of all
cases, you encounter a problem -- lam doesn't work, PVM has somehow
developed a bug, rarely the kernel has a bug (in most cases, the kernel
will work BETTER for FC-anything than it does in two year old
Centos-anything on pretty much all hardware within a month or two after
release, according to my own direct experience).  You then have the
usual choices: wait until the bug is resolved for you and the nodes
'just work' following a transparent yum update one night; get involved
enough to report the bug via bugzilla (and actually be clued in as to
when the bugfix updates occur); or get involved enough to help fix the
bug as well as report it, which is still likely to be no more work than
backporting kernels and far LESS work than backporting e.g. dhcp or (god
help you) NetworkManager.  Backporting NM was basically impossible
across several FC releases (let alone Centos) as the kernel was rapidly
changing the way it managed the hardware abstraction layer (HAL), gnome
was changing the way it managed keys, NM was changing the way it managed
devices and keys, and getting the three layers of dependencies
backported was enough to make ME cringe, at least, and I actually gave
it a try.

Backporting in this sort of environment is working without a net.  If
there is a problem, you or your local coders solve it.  RH probably
won't help you -- their "support" is a standard update stream that
freezes the versions being supported as early as possible to minimize
their cost of providing the support, not aggressively rebuilding GSL or
NM or whatever from source to current and updating from your own
distro-specific RPMs, and there is no community to rely on.

>
>
> If you reply that 'rolling updates' a la Debian would be possible, that would 
> be OK if well engineered (*) on academic sites.
> But on commercial and secure Government sites machines are very often 
> operated on an isolated LAN, and stability (read 'don't change things 
> unnecessarily')  is a key requirement there too.

Sure.  But again, if you freeze FC at the point where its "official
support" ends, you are almost certainly going to be freezing it in a
state that is four or five nines stable on your hardware.  The MOST you
are worrying about is the lack of security bugfixes, which are rare, and
in MOST cases anything that is really important there will still get
backported and redistributed by the entire community OR will be
sufficiently serious that doing a full upgrade to the latest FC will be
warranted, maybe even warranted if you are frozen on a three year old
Centos where the security implications are more widely dispersed and
harder to seal off.

Try installing two year old Centos AT ALL on six-month-old hardware, and
I think that there is a very high probability that it will require a
much larger investment in time backporting kernels and worse.

> (*) Ha. Well engineered?
> Take the recent SuSE update which killed system logging on one of our 
> clusters. SuSE update RPM for syslog-ng now requires that the syslog-ng.conf 
> file is present (not present on the default install).
> Yast quietly updates the RPM during the night last November.
> System is rebooted a couple of weeks ago. We're asked to diagnose a problem - 
> and lo and behold no system logs.
> (The fix is to use SuSE-config to create the syslog-ng.conf file and restart 
> syslog)

But that is on a fully supported commercial system, right?  Again, the
point is that whether or not a distro is commercial and slowly varying,
updates can in fact be DEstabilizing; they are not stabilizing in ALL
cases.  If you
want stable, invoke the "if it ain't broke, don't fix it" rule, freeze
ANY distro snapshot that appears to be stable, and forget it.  Defend
your cluster against security problems either by selected tested updates
or (better) by network isolation and careful monitoring.

I've watched "commercial support" via updates etc on real RHEL systems,
and I'm not impressed.  All that I can see is that it becomes a lot more
costly to fix the same old problems, and that those problems tend to
persist far longer when they are outside the narrow window of what gets
fixed.  The people who install the commercial (or delogo'd commercial)
distros often do so because they don't really understand how things work
all that well in the first place and think that if they install RHEL on
their IBM server with their over-the-counter NAS that they'll get some
sort of help when the NAS doesn't seem to work stably with their basic
install.  (Note well that I except nearly anyone on this list from this
blanket statement, so don't bother excepting yourselves.)

The reality, of course, is that you have to know what you are doing at a
much higher level than most sysadmins that work in the commercial sphere
ever do to deal with real systems problems, no matter what distro you are
using and no matter who supports it or how.  To real experts, the important
thing is more often the support community and how aggressive and
interactive it is and not the "support contract" from RH or any other
linux vendor, that at best fronts the REAL support gotten from the
community, perhaps with the input of some minimal number of work hours
on the part of RH coders for problems that look likely to impact a large
number of users.

I wouldn't hesitate to put FC on cluster nodes on any cluster,
commercial, government or university, as long as there was some sort of
support community I could attach to that also used FC in a similar
context.  I have yet to see >>serious<< (persistent) stability issues
with an FC release, although of course there are stability issues
(isolated or otherwise) for the first few months after a release for any
distro bar none.  After those early issues are resolved, FC will "just
work" for pretty much any reasonable hardware/software configuration and
can be frozen, stable, on a cluster that gets rebooted once a year or so
if that.

And honestly, I rather expect that the same is true of Debian.  The
fundamental point is that for nearly all systems, if you plot INstability
(expected failures, say) as a function of package updates performed, I'd
actually expect to find a MINIMUM in this curve rather than a
monotonically decreasing function in many cases.  Stability increases as
egregious bugs are discovered and corrected until nearly all the common
execution pathways and critical shared packages are debugged to a
functional level, eventually meeting the nonzero risk of INTRODUCING NEW
BUGS associated with any update at all on the far side of things.

At some point your risk (expected "cost" of failure) associated with
introducing new bugs with an update roughly equals the (expected)
benefit.  At some still later point it will greatly exceed it,
especially for updates of critical subsystems that are dependencies of
many packages, as opposed to userspace application updates that aren't
dependencies of other installed packages.  At that point (if not
before!) if it ain't broke -- don't fix it!  Especially things in the
critical core libraries, the kernel, places where if a bug IS introduced
it will have a major impact (a nonlinearly high marginal cost compared
to say a linear marginal benefit).
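
A toy model makes the break-even point concrete.  This is only a sketch
with invented numbers: it assumes each successive update removes a
geometrically shrinking amount of expected failure cost, while every
update carries the same fixed chance of introducing a costly new bug.
All parameter names and values are assumptions, not measurements.

```python
def expected_cost(n_updates, initial_cost=40.0, first_fix=10.0,
                  decay=0.7, p_regression=0.05, regression_cost=20.0):
    """Expected cost of failures after applying n_updates updates.

    first_fix * decay**k is the cost removed by update k+1 (early
    updates fix the egregious bugs, later ones fix less and less).
    p_regression * regression_cost is the expected cost each update
    adds by possibly introducing a new bug.
    """
    fixed = sum(first_fix * decay ** k for k in range(n_updates))
    introduced = n_updates * p_regression * regression_cost
    return initial_cost - fixed + introduced


def best_freeze_point(max_updates=30, **params):
    """Number of updates after which freezing gives the lowest cost."""
    return min(range(max_updates + 1),
               key=lambda n: expected_cost(n, **params))


if __name__ == "__main__":
    for n in range(0, 13, 2):
        print(n, round(expected_cost(n), 2))
    print("freeze after", best_freeze_point(), "updates")
```

With these made-up defaults the expected cost falls for the first few
updates, bottoms out, and then climbs steadily; raising regression_cost
(kernel, core libraries) moves the freeze point earlier.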

I've learned this one the hard way with marginal hardware.  With some
systems we've owned, once you get a kernel that reliably boots the
hardware, you JUST DON'T CHANGE IT.  Unless you want a world of pain all
OVER again.

I would therefore argue that from the point of view of true cost-benefit
(as opposed to CYA behavior or make-the-vendor-happy behavior in the
absence of a binary API, an LSB core, and compliance therewith) the
really important question isn't the length of time that a supported
update stream will be provided by the vendor; it is the time required
for the update stream (however it is provided) to reach this global
minimum and how effectively and intelligently the update process locks
into this minimum to do no harm while cleaning up the tail, however long
or short it proves to be.

     rgb


-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu