[Beowulf] OS for 64 bit AMD

Robert G. Brown rgb at phy.duke.edu
Mon Apr 4 11:06:52 PDT 2005

On Sun, 3 Apr 2005, Joe Landman wrote:

> It is *long term* behavioural, driver, and interface stability. 
> Changing an ABI midway through (4k stacks) is *not* behavioral 
> stability.   You have no real reason to expect a code to work correctly 
> when you alter one of the critical underlying structures that it relies 
> upon.  Many drivers rested on 8k kernel stacks, it was in the ABI as a 
> (defacto) standard.  RHEL3 did not (properly so) change its underlying 
> kernel structures in such a way to render some portions of the system 
> unworkable.  RHEL4 is not likely to change its underlying kernel 
> structures in such a way to render some portions of the system 
> unworkable.  FC-x is likely to (and has) changed its underlying kernel 
> structures.

I don't understand this assertion at all.  Are we talking about the
linux kernel here?  The one that Linus Torvalds and friends are working
on?

I would think that whether or not the kernel stacks change size, and how
drivers insert into the kernel, and oh so much more are utterly out of
the hands of BOTH RH and the FC people (who are, after all, largely the
same people).  If/when the kernel changes stack size and other things
(for example, the layout of /proc) as it has in the past, we will all
have to live with it and yes things will break.  The only way RH could
avoid it is to either run a twinned kernel configuration with the change
backed off or to freeze on a legacy kernel, both of which are such
stellarly bad ideas that I hope there is no need to even discuss them.

So sure, RHEL 3.x may be "frozen" with 8k kernel stacks as long as they
choose to use and support the last kernel release with that set as the
default or where RH is willing to back off a shift, but RHEL in general
is going to break this kind of thing from release/upgrade to
release/upgrade.  Again, one DEFINITION of major release upgrade is that
it is one that breaks binary compatibility of at least something major
(e.g. kernel, libc, x).  Minor releases break or add features in
userspace, updates fix bugs.

So the only thing you're discussing is WHEN FC does it vs WHEN RHEL does
it.  RH will do it more slowly, for sure, but you'll be just as pissed
off about it when it does.  In either case, your biggest problem isn't
with the distro itself (which IS, as Mark has repeatedly pointed out,
beta tested and a real release) -- it is with packages or programs that
are NOT in the distribution and were NOT beta tested -- they de facto go
through a whole alpha/beta/pilot/production cycle from a point that
STARTS in many cases when the new distribution is first released.

For just one example -- how many people would notice it (or DID notice
it) the minute that the kernel started outputting numbers in /proc that
were unsigned long longs (ulls) instead of unsigned longs (uls)?  I
notice it because stuff I write breaks like
all shit.  Most people wouldn't/didn't notice it because they got the
change at the same time they got a fixed procps that managed it.  Only
people like me running software that was coded on the older no longer
valid assumptions see it break.  Then it is usually a straightforward
task to fix it in an open source world where things are well documented.
Even in the closed source world companies like NVIDIA tend to be at
least REASONABLY responsive and eventually fix it if their marketplace
demands that it be fixed.
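
To make the ul-vs-ull breakage concrete, here is a minimal sketch in
Python.  The line format and the counter values are made up for
illustration (this is NOT a real /proc file), and masking to 32 bits is
only one possible way an old "it fits in an unsigned long" assumption
can fail on a 32 bit platform:

```python
# Sketch: why a counter that outgrows 32 bits breaks older parsers.
# The line format below is a hypothetical stand-in, not a real /proc file.

UL_32_MASK = 0xFFFFFFFF  # largest value a 32-bit unsigned long can hold


def parse_counters(line):
    """Split 'name: v1 v2 ...' into (name, [int, ...]) with no width limit."""
    name, _, rest = line.partition(":")
    return name.strip(), [int(tok) for tok in rest.split()]


def as_32bit_ul(value):
    """Emulate keeping only the low 32 bits, as a too-narrow type might."""
    return value & UL_32_MASK


sample = "eth0: 5000000000 1234"  # hypothetical byte and packet counters
name, values = parse_counters(sample)
wrapped = [as_32bit_ul(v) for v in values]
print(name, values, wrapped)
```

The small counter survives, the big one silently turns into a different
number, which is exactly the kind of thing nobody notices until their
own code is the one doing the parsing.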

So be aware that things like this DO change fairly regularly in ALL
distributions, commercial or non.  Most of the time the changes are
hidden, especially for folks that use the properly beta-tested software
that comes "with" the distribution.  Sometimes they are not, and
obviously can strongly affect people whose source is e.g. 32 bit source
but they are suddenly running on a 64 bit platform.

The real question is:  Feature or Bug?  I tend to think of dynamical
evolution as an unquestionable feature.  Who here really WISHES we were
still running the 0.9x kernel?  SLS linux?  X11R5?  Raise those hands
high, folks, can't make them out from here.  Hmmmm, not a whole lot of
hands.

Everything after that is a question of rate and degree.  FC is "fast"
but has a full development cycle and (most important) has engaged
developers with a well-defined, yum/repository based mechanism for
distributing updates from the toplevel repositories to end-user
computers in no more than 1-2 days (faster than that in an emergency).
RHEL is "slow" and has pretty much the same update mechanism (or
up2date, used by an increasingly vanishingly small segment of hapless
humanity given that it scales, um, "poorly" and costs the moon). SuSE is
different, Debian different again.  Viva la choice.

> So you have pre-release testing as "beta-testing" but you deny that 
> "proving ground" is beta-testing?  Seems to be same side of a coin here. 
>   Having a normal release management does not a production quality 
> system make.  It is most definitely one of the requirements for such a 
> system, but it does not, in and of itself, make the OS a production 
> class OS.  A reasonable definition of production class OS will likely 
> incorporate inherent stability of the underlying structures of the 
> system, and a guarantee that they will not change for some fixed 
> interval.  Production specifically implies a repetitive behavior, 
> specifically for HPC, a cycle shop.  If the next incompatible change in 
> FC-x renders your IB drivers unworkable for your cluster, does that in 
> fact make the OS that you have installed on the system production ready 
> or not?  If you have to continuously chase hacks/patches/etc to keep 
> your system operational after every upgrade, does that make your system 
> production ready?

Here is where you keep getting hung up on this whole beta thing.

Look, beta testing (as previously noted) refers to a specific phase of
a commercial-grade software development cycle.

It is fair to apply the term to "Fedora Core X" as a whole, as it goes
through such a cycle.  It is crazy to assert that Fedora Core X is a
"beta" product because e.g. "Matlab" (to pick on one commercial package)
may not run on it the day it comes out the door.

In actual fact, matlab might run on it, or might not.  FC does not
guarantee binary compatibility across major release numbers.  Neither
does RH.  They can't.  The very definition of a major release is one
that shifts at least some ABI's.  In actual fact, the matlab people
technically need to undergo a whole product cycle of their own including
alpha and beta testing ON FC or RHEL's new release to port to it if
necessary and certify the result.

It is entirely possible that RH has a relationship with many vendors and
includes them IN their beta cycle so that those vendors can complete
their own port and betas in time to update their product at the same
time as the new release.  So perhaps RHEL 4 comes out Monday, and by
Monday evening customers can upgrade matlab to run under RHEL 4 for free
or for an additional fee.  If a LOT of their customers use FC, though,
they are also likely to do that port and testing as rapidly as they can
manage it.

SO, please differentiate between:

  * the kernel -- in a space by itself outside ALL distributions.  All
you can choose here is how soon you want your next major release to also
be a major kernel release, but when 2.8 is released in the fullness of
time, all distros will eventually use it even if it breaks the hell out
of every driver in current existence.  Note well that complaining about
"breaking drivers" is really complaining about the kernel and that
Mark's point about closed source binary insertion modules is really well
taken.  This has nothing to do with FC per se, only with their decision
to track the kernel fairly rapidly.  The real issue here is that Nvidia
and others should clearly keep up with the current linux kernels and not
release a product and let it just sit static forever, or they should
release their code so that it can be built into and beta tested WITH the
kernel.

  * the major compilers -- also in a space outside of all distributions,
also a major driver of incompatibility.  A conservative approach to gcc
would have features like SSE still unsupported, which is a bad thing for
HPC systems.  Remember the problems that ensued when the kernel required
a different variant of gcc to build than the production gcc in many
distributions?

  * the major libraries -- libc, libm, and a slew of other core dynamic
libraries ARE the ABI for the "distribution".  The only requirement for
a stable distribution (that I'm aware of) is that the entire
distribution be built self-consistently from the kernel and compilers
through the primary/major libraries down through the applications.  A
binary built for e.g. RH 7.2 is supposed to run on 7.1 or 7.3 but is
absolutely not guaranteed to run on 6.2 or 9.  Nobody would be horribly
surprised if a binary built on 7.2 in fact fails on 7.1 or 7.3, though.

This is what "rpmbuild --rebuild" is for...and why open source portable,
rebuildable packaging of standards compliant sources are a really really
good thing.  It is also one of the things many vendors utterly fail to
cope with -- since their code is often built according to proprietary
shop standards, it ends up being reviewed only by a limited set of
inbred eyes and ends up non-portable, unmaintainable crap.  Since it
costs money (in the eyes of the board) to actually invest in the
development process, underfunded crap at that.  The board would LOVE it
if they could pay just once to have the sources developed and then fire
the whole development team and sell the product forever, as that is the
way THEY view intellectual property -- as a commodity they can purchase
and exploit for a profit and wealth, not as a participatory exercise
from which they happen to earn a well-deserved living.  

So why should we be surprised by vendors that are still trying to sell
software that "only runs on RH 7.3" libraries?  They'd actually have to
assemble a team of competent programmers to redevelop their product again
to get it to work on anything more recent instead of just make money...

  * X -- Again, X is on a separate development cycle outside of most
distributions (as are many other packages, but X is a low-level sine qua
non to many, many applications and hence qualifies as a "distribution"
of sorts in its own right).  In few places are the costs of both sides
of the rapid release coin as visible as with X.  On the one hand, rapid
updates break applications and require admins to learn new configuration
tools and are more likely to have bugs including serious ones.  On the
other hand, everybody
gets pissed off if the brand new bleeding edge video card they got with
their high-end visualization or gaming workstation doesn't work
perfectly with linux.  To add insult to injury, you're getting all
irritated at FC-X for a change made in the KERNEL that broke an X DRIVER
(Nvidia) that is deliberately engineered to live OUTSIDE the X, library,
and kernel cycle.  What do you expect?  Sooner or later, the kernel, a
key library, or X itself was bound to change.  When that happened Nvidia's
driver was BOUND to break.  How could it not break?  Surely you don't
expect Linus Torvalds to freeze the kernel development cycle just so
Nvidia never has to actually WORK on its proprietary driver and can just
keep making money on the basis of its original investment?

  * applications.  There are two parts of application space.  The part
that is "inside" the distribution, and add-ons.  The part that is inside
the distribution is the part that is beta tested as part and parcel of
the whole shebang -- kernel, compiler, libraries, X and applications
(GUI and otherwise).  There are a lot of moving parts, dependencies,
dynamic libraries, and complex interactions.  Incredibly, when any new
distribution is released (after beta testing!), it is a matter of WEEKS
before nearly all of this works on nearly all systems the distro is
installed on.  This is an effin' miracle and a testament to the
incredible strength and robustness of the open source development cycle.
The "gamma testing" in linux is ongoing, but it is also very, very
efficient and rapid because everybody has the sources, and tens of
thousands of competent eyes look at every emerging problem.

The add-on part is NOT beta tested by the distribution developers,
obviously.  How could it be?  Why is this a problem?  How is this an
AVOIDABLE problem?

It is a simple reality of modern software that it is complex and
typically has a complicated dependency tree on many libraries all with
slowly varying ABIs.  The safest way to get software to run perfectly on
any new distribution is to rebuild it (if possible) and test/port/patch
it as needed until it both rebuilds and functions properly.  Binary
compatibility is mostly an illusion, and will become INCREASINGLY
illusory as the systems become still more complex and intertwined in the
future.  However, because of rebuildable, standards compliant packaging
and sources, in MOST cases rebuilding is a matter of entering a simple
command or two, and in cases where this isn't true it is a clear signal
that the product badly needs a major rewrite.
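
The rule of thumb from the RH 7.x example earlier (same major release:
the binary is SUPPOSED to run; different major release: no guarantee)
can be written down as a toy decision function.  This is only a sketch
of that rule; the function names and the advice strings are my own
invention, not anything a distribution ships:

```python
# Sketch of the "same major release" rule of thumb for prebuilt binaries.
# All names here are hypothetical; no distribution tool works this way.


def parse_release(version):
    """Turn '7.2' into (7, 2); a bare '9' becomes (9, 0)."""
    major, _, minor = version.partition(".")
    return int(major), int(minor or 0)


def advice(built_on, running_on):
    """Return what to do with a binary built on one release, run on another."""
    built_major, _ = parse_release(built_on)
    run_major, _ = parse_release(running_on)
    if built_major == run_major:
        # Supposed to work, but nobody is horribly surprised if it doesn't.
        return "try the binary, rebuild if it misbehaves"
    # Crossing a major release: ABIs may have shifted underneath you.
    return "rebuild from source"


for target in ("7.1", "7.3", "6.2", "9"):
    print("7.2 ->", target, ":", advice("7.2", target))
```

Either branch ends in "rebuild" sooner or later, which is the point.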

Perhaps what this discussion should really morph to is one of "standard
linux" -- the Linux Standard Base -- a low level ABI for all major linux
libraries to assure binary compatibility across all flavors of linux.
Naturally, there is www.linuxbase.org and a lot of committed people.
Equally naturally, commercial linux vendors are in no rush to implement
it as far as I can see, and it is by no means clear that the goals of
the project CAN or SHOULD BE accomplished.  Standardization can equal
stagnation, and there is an alternative.

The alternative is that represented by gentoo but really possible within
all packaged linuces.  Non-binary packaging that autobuilds on your
system.  Shrink "linux" to little more than an LSB-standard core plus an
enhancement of any of the packaging schema that permits source packages
with complicated dependencies to be automatically retrieved from a
repository via e.g. yum and built and installed as a part of the system
installation process.
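
The core of any such autobuild scheme is working out a build order from
the dependency tree.  A minimal sketch, assuming Python 3.9 or later for
the standard-library graphlib module; the package names and their
dependencies below are hypothetical, and real tools also have to cope
with versions, conflicts and build failures, none of which appear here:

```python
from graphlib import TopologicalSorter

# Hypothetical source packages mapped to the packages they need built first.
deps = {
    "myapp": {"gsl", "libxml"},
    "gsl": {"libc"},
    "libxml": {"libc"},
    "libc": set(),
}


def build_order(dependencies):
    """Return package names so that each comes after everything it needs."""
    return list(TopologicalSorter(dependencies).static_order())


order = build_order(deps)
print(order)  # libc comes first and myapp last; gsl/libxml order may vary
```

A dependency loop makes TopologicalSorter raise graphlib.CycleError,
which is about the right reaction for a build system to have.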

However, that's not something that I'm pushing -- just noting to
emphasize that there are a number of software distributions competing
here, and the one that is suffering most is the one that relies on the
distribution of generic static binary images of software.  The
fundamental problem is that this is an outmoded paradigm and one that is
likely to disappear altogether in the next few years.  This has nothing
to do with choice of distribution except that some distributions cater
less strongly towards the desire of those software companies that rely
on this scheme to make money without an ongoing maintenance and
development effort, with clear tradeoffs.

> > the existence of commercial products which specify RH-whatever vX.Y
> > does not magically turn FC into a beta-test.  if you redefine words
> > that way, you might as well call all of SunOS a beta for Solaris.
> Er... you are the only one who indicated this, so if you want to argue 
> this, I would suggest you contact the person who generated this idea 
> (that commercial products dependent upon RH make FC a beta test) who can 
> be found at hahn _at_ physics _dot_ mcmaster _dot_ ca.
> I said "My customers care about running on distributions (whoops, there 
> we go with that word again) on which their apps are supported.  I am not 
> aware of active support for FC-x for applications from commercial 
> program providers.  If I am incorrect about this, please let me know 
> (seriously, as FC-3++ looks to be pretty good)."   Prior to this I said 
> "It is by Redhat's definition, a rolling beta (proving ground)."  The 
> two are specifically independent ideas.  I know of few commercially 
> supported applications that will accept support calls from FC-x running 
> users.
> Note:  Debian has very little in the way of commercial support (none 
> from the distributer).  It is most definitely not a beta.  You can use 
> the beta version in unstable.  This is analogous to Fedora.
> What makes FC a beta is that Redhat specifically is note that, and is 
> using Fedora as a "proving ground"  (c.f. 
> http://dictionary.reference.com/search?q=proving+ground ) as in "It is 
> also a proving ground for new technology that may eventually make its 
> way into Red Hat products." (from http://fedora.redhat.com/ )  From the 
> reference.com site "prov·ing ground (prvng) n.   A place for testing new 
> devices, weapons, or theories."  Would you call a system that is defined 
> by its maker to be a proving ground to be a production environment (e.g. 
> stable, unchanging) ?

As I repeatedly say (as self-appointed referee) here is where you guys
are REALLY fighting -- about nothing.

FC is not a beta for RH any more than Debian is.  RH is a collection of
packages.  So is FC.  So is Debian.  So is linux itself.  Those packages
include things as diverse as kernels, compilers, libraries, x, and
applications of all sorts.  MOST of those packages are NOT maintained by
RH per se, and the ones that are are often "co-maintained" by RH and
SuSE and Gnu and the kernel team and xorg and a host of contributing
programmers from all over the world.  ALL of them are a "beta" for RH,
and for FC, and for Debian, and for SuSE, by this sloppy a definition.

I reiterate -- the fundamental problem is that alpha/beta testing refer
to specific phases present in the development cycle for most
commercial-grade (monolithically supported) software.  They do not fit
comfortably into the open source development process where there often
isn't a single well-defined "team" with "responsibilities".  In lots and
lots of cases, the "team" is a single person who wrote and accepts full
responsibility for a product, and an associated "list" of active users
that serve as some mix of co-developer (as they contribute working
codons back to the project memetic source code set), alpha tester as
they implement new snapshots, beta tester as they implement new
snapshots (oops, failed to see much difference there) and user as the
new snapshots they implemented prove to be stable and are put into
production.  Then there is a whole NEW cycle when the product is
distributed outside of that group, and there may be several such cycles
in parallel as the product goes into several distributions.

"FC-X" is most definitely NOT pre-RHEL in anything like the sense that
rawhide was pre RH.  Sure, packages there migrate eventually into RHEL
-- how could they not?  So do packages that are actively developed under
Debian.  In fact, a lot of those packages "get into" FC-x as well.
Other packages (e.g. PVM or LAM) might be packaged by groups that have
nothing to do with any linux distro.  Also, FC goes through multiple
cycles of development between RHEL upgrades.  How specifically is FC-1
contributing to RHEL4?  FC-2?  FC-3?  All three releases have overlapped
with the RHEL 3 timeframe -- so how is a package that appears and is
distributed, stable, for an entire FC release, then re-released and
distributed, stable, for an entire FC release, then re-released somehow
"being tested for RHEL" in two of those three releases?

It just isn't.  FC is a release in and of itself, and has a lot of
energy going into it for its own sake.  Sure, it is important to RH as a
"proving ground for new software" because RH has to be (or chooses to
be) conservative to the edge of insanity within the EL distribution.
Remember, now that they are selling it for so much money, customer
tolerance to "gamma" releases in the best of open source traditions is
doubtless greatly reduced.

Note well that RHEL is frozen WAY out to the point of insanity.  I can't
get packages I'm actively working on under FC-2 to BUILD under Centos
3.x because -- guess what -- they froze the GSL along with everything
else.  This is doubtless very comforting to somebody that built
something to work with just that GSL snapshot, but that snapshot was
broken in many ways and is missing all sorts of new features.

So here's an out for the two of you.  Call FC-X a "gamma release", not
of RHEL (it isn't) but of itself.  That's ok since gamma release is a
joke anyway, but to the extent that it means anything it is probably
accurate because ALL linux distributions are "gamma" releases.  Accept
the fact that FC-X is beta tested and quality assured before release,
and that the testing and assurance are likely lower than they are for RH
because they cost money and RH is trying to minimize investment here, so
it probably (to be fair) DOES rely more on the gamma phase for end-stage
debugging, which so far, for a community-based linux, seems to work just
gangbusters well.

Leave the entire issue of commercial software and library compatibility
alone, as it is something you don't NEED to agree on.  It is what it is.
Mark's environment uses FC (as does mine) because we rely very, very
little on commercial code and are comfortable telling consumers of our
resource that their commercial products need to run on top of FC or
don't bother.  We both have Centos as an alternative or can always pay
for RHEL or SuSE if we need more.  You have particular customers with
particular needs that are orthogonal (in some cases) to FC -- that's
fine too!  I think we'd all agree that this is FUNDAMENTALLY not FC's
problem -- it's an open source/closed source issue, and at its ROOT is
probably due to inadequate investment in the software development
process in the owning corporation and poor methodology, with some
obvious exceptions -- but that doesn't make it less validly a problem
with your customers.  It does make it wrong to give as knee-jerk advice
"never install FC on clusters" as that's just plain silly, even as it
makes it perfectly reasonable to say "don't install FC on your cluster
if you want to run product X, as it may have binary/library
compatibility issues".

> > the customer needs to evaluate how fragile a commercial product is:
> > how well it conforms to the ABI.  NVidia is a great example of 
> > an attractive product which is inherently fragile since NVidia 
> > chooses to hide trade secrets in a binary-only, kernel-mode driver
> > which (by definition and example) depends on undefined behavior.  
> > VMWare is another good (flawed) example.
> Hmmm.  I hear this argument time and again from people about the closed 
> source nature of nVidia's drivers.  nVidia does not (as far as I know) 
> own all the intellectual property in their driver, and they do not have 
> the right to give that IP away via GPL or any other mechanism.  The 
> fundamental flaw in the arguments against the nVidia driver are an 
> inherent presumption that nVidia is hiding trade secrets in order to make 
> its life better and get end user lock-in.  The behavior it (the driver) 
> depends upon has been built into the kernel, and when that behavior 
> suddenly changed, nVidia wasn't the only driver affected.  Many open 
> source drivers were impacted.   Are you going to argue that this makes 
> them (the open source drivers) inherently fragile?  This is a natural 
> extension and simple application of your argument.  This is a weak 
> argument at best, and some of its fundamental premises are fatally 
> flawed.  If nVidia owned all the IP in everything they released, and 
> chose simply to release binary only drivers, that would be a completely 
> different case.  Unfortunately, a fair amount of the IP in OpenGL and 
> other related standards is owned by companies that have no interest in 
> open source other than demolishing it.  SGI sold off most of its IP in 
> OpenGL to some other outfit.

Hmmm, I'm skeptical about this specific argument.  Cynical might be a
better word.  After all, there exist nvidia drivers in the open source
world -- see "nv" in xorg.  They simply don't work as well.  I find it
very, very difficult to believe that -- with access to the internals --
nvidia couldn't write an open source driver that worked as well as their
"proprietary" driver.

Of course I'm also a radical person who doesn't believe that there can
or should be "IP" in a software driver.  Their product is a hardware
device; it has an ABI, whether or not they choose to publish it.  It
isn't easy for me to see how that ABI is IP.  If they published the ABI
with full documentation, open source people could write as good a driver
as they could without their help.

One encounters similar things for everything from palm pilots to NICs.
There is always some argument for a hardware vendor protecting their
"trade secrets" in their drivers, where the real trade secret may be
that they're using some standard chipset with a tiny bit of nonstandard
glue and their board is really a piece of crap.  In linux, it is
evolution in action -- unsupported boards aren't purchased by linux
users, and that is finally adding up to something pretty significant in
the marketplace.  Nvidia is something of an exception because they are
very popular with gamers and visualization labs and because there is
SUCH a big difference between nvidia's driver and the nv driver.
However, this may not last.  

Note also that the kernel has NEVER guaranteed that drivers built for
one revision will remain valid for all revisions.  How could it?  Will
drivers that aren't "part" of the kernel break when the kernel changes
something major (the ones that are "part" of the kernel are again beta
tested in situ and tend not to break -- as much -- so we're obviously
talking about add-ons)?  Sure, how could they not?  Either participate
in the kernel development process and work all these kinks out snapshot
to snapshot or accept that you'll have to port/debug every time the
kernel changes enough to break your driver.  

Just don't complain.  There's nothing to complain about.  It's like
complaining that those consarned rattletraps with four wheels and a
stinkin' engine and horn scare the horses and should be banned from the
road.

> > "supported configuration" is nothing more or less than a way to 
> > "download" support costs to the platform vendor (PV).  it's a lever,
> > acting on the customer as a pivot, to force the PV to avoid changes
> > of any sort, since its impossible to tell what internals the proprietary
> > product depends on.
> Uh....  I think we disagree again.  A supported configuration is 
> something that a customer, an end user, a developer should have a 
> reasonable and fighting chance of having it work right.  This means that 
> the internals that are exposed to developers will not change (including 
> driver developers).  This means that end users and customers have a 
> reasonable expectation that their configuration on the supported list 
> should work, and the onus is on the platform vendor (nice to see you 
> switched to the definition of platform that I was using BTW) to make it 
> work without breaking other stuff.

Oh, but that's easy.  Just lock into Red Hat 5.2 and tell customers that
any hardware older than a Pentium is out of bounds.

What, you meant that you wanted CONTEMPORARY hardware configurations to
work?  But DON'T want bleeding edge kernels (most likely to have the
requisite drivers), modern libraries (most likely to be posix compliant,
most likely to have useful features), modern compilers (most likely to
have e.g. SSE support and other bells and whistles) and modern
applications to help you run, configure, and otherwise support the
systems in question?

Don't want MUCH, do you, but your cake and its consumption all at the
same time.  FC-x is far more likely than RHEL to have:

  contemporary kernel, good 64 bit support, large list of supported
  up to date compilers
  up to date libraries (e.g. GSL given above)
  lots of nifty -- and new -- applications, and relatively bug-free and
feature-rich versions of older ones e.g. Open Office that might have been
under active development when RHEL was "frozen".

So I don't think you mean this.  I think what you mean is that you wish
that VENDORS rewrote and updated their SOFTWARE PRODUCTS so they would
REBUILD on FC-x with less pain than months of work porting and debugging
and testing.  Since they won't do this and are instead insisting that
YOU use the ancient kernel and libc where their last build still worked,
you wish that somehow that ancient kernel could use newer drivers for
modern networks and busses, that libxml had been rebuilt for the older
libc (presuming that it COULD be rebuilt for it, by no means a given),
that xorg had bothered to port their latest set of devices and
applications back to that ancient OS release so that your customer could
still use their nifty new monitor and graphics card.

I'm not trying to be cruel here -- I'm just pointing out that there are
some fairly fundamental conflicts here that cannot easily be resolved by
YOU or YOUR CUSTOMER.  The only way to fix this problem properly is
(perhaps) the LSB, and even then it would only work if the SOFTWARE
PROVIDER learned how to write portable and cheaply rebuildable software
and made a significant and revolving investment in development and
maintenance of same.  In the meantime you literally cannot have what you
want -- you can only choose the place where you make the compromises
that let you make things work well enough to get by.

> > drastistically
> > similarly, SOP in the Fibrechannel world is to provide only negative
> > definitions of support (nothing but HP disks in HP SANs.)  this can be 
> > seen as a flaw in standard-defining, since Ethernet provides a fairly
> > decent counterexample where interoperability is the norm because 
> > products need to conform, not "qualify".
> A standard is only useful if people pay attention to it, and 
> engineer/design/build to it.  Standards are very useful to developers, 
> in that if they code in a particular manner that adheres to the 
> standard, they have a fighting chance of developing something that will 
> work.  If the standard suddenly changes on them, and their stuff breaks, 
> who do they turn to?  If the target is moving, how much time/effort will 
> they expend to chase it?

Sigh.  Somehow there are (how many?  lots!) mountains of linux packages
-- hundreds and hundreds -- that are written so that they will work.
Not only work, but keep working distribution release to distribution
release, often with nothing more than that aforementioned "rpmbuild
--rebuild".  It isn't an issue with "creeping standards".  The most
common problem is "failure to code to standards", followed by "failure
to invest in maintaining the code".

Most real standards have consortia associated with them.  Some are
defined by IEEE docs (a process that I abjure, because it is both
elitist and non-open).  I prefer the RFC-defined standards as exemplary
of the open standard development process.  In such a process there is
really never any reason for a vendor or developer to be caught by
surprise.  Surprises are more likely to occur when vendors are heavily
involved in writing the "standard" and yank it around for customer
lock-in and commercial advantage.  M$ being past masters at the game,
but they are far from the only ones.

> In some cases (development tools) it makes sense to chase some specific 
> moving targets (though it costs time/effort and therefore real money). 
> In other cases it makes sense to wait for stable releases where things 
> will not change, so your customers/end users can get your stuff and make 
> it work, because you have a fighting chance at making it work.
> Greg's company (and the folks at the Portland Group) have to chase these 
> targets... many of their customers are there (I'd bet that a small 
> fraction of their collective total customer base are using the 
> development tools to generate commercial code, most are using the tools 
> for their research/development tasks).

These are a notable exception to many of the things I say above, but
then, no compiler company survives for long without a sustained
investment in the world's MOST serious programmers who know MORE about
product development cycles and hardware features and interoperability
and maintenance and all that than nearly anybody.  However, I know of
numerous projects that are developed (one can WATCH them being
developed) and then most of the development staff goes away, leaving only
a small skeleton crew hanging around to debug and solve those "gamma
release" problems.  This lets the company make a lot of money without
having to pay a full development team of pesky programmers.  It also
means that if they wrote a Windows version of the application and you
want a linux version you're SOL.  If you want a new feature you're SOL.
If you want new hardware to be supported, well, their two remaining
programmers will get around to new hardware in roughly 13 months, once
maintenance issues and fire-fighting die down.  In fact, if you just
want them to fix a damn bug, get in line -- those same two guys are
booked up for the next nine weeks fixing bugs that have already been
reported, but they'll try really hard to do yours then.

I'm not criticizing this -- it may be that this is the only way for them
to maintain any sort of profitability at all.  Or maybe, just maybe, the
VC guys want to maximize profit for long enough to sell their stock
and/or make their investment x 20 back some other way, and don't give a
goddamn if the company still exists two years from now.  I've seen both
things openly expressed in board rooms, with the latter usually NOT said
but perfectly obvious.  Rape, pillage and burn, right?  Take no prisoners.

> Yeah, there are significant interoperability problems in things like SAN 
> and what-not-else.  These are unfortunate.  This is part of the reason 
> why I try to avoid such things (I don't like vendors locking me in, and 
> I know my customers don't like being locked in, so I don't waste my 
> company's time trying to figure out how to do this).  Don't assume that 
> a company's / end user's misapplication of a standard, hijacking of a 
> standard, or abuse of a standard somehow makes all standards bad.  They 
> are not.  Standards are sometimes the only lever you have in a 
> commercial closed source context... demanding that a company adhere to 
> what it claims to sell is sometimes a necessary path.  Interoperability 
> means that when people interpret the standards, that all parties agree 
> on the definitions, and that they guarantee that their products will in 
> fact conform to the standard, and that there will be tests of the 
> standard compliance, and out-of-compliance systems will be adjusted to be 
> in-compliance, and that interoperability with other standards will be 
> guaranteed.  This is why IDE, SCSI, and Ethernet work so well.  This is 
> why some others do not.  IB is likely to work quite well going forward. 
>   This is why the SAMBA folks are chasing a moving target, as the CIFS 
> "standard" is a moving one (just go ahead and update that XP with a 
> SAMBA server around .... grrrrr).
>
> I like and use FC-x, we run FC-2 and FC-3 on various machines (AMD64, my 
> laptop as part of a triple boot, and x86).  I make sure our software 
> runs on this, we compile and test on FC as well as for others 
> (RH/Centos, SuSE, looking at Ubuntu/Debian).  I am happy that our 
> binary packages seem to work nicely across multiple distributions 
> (though we usually bring the source along to be sure), and our large 
> systems are built from source, so they should work (as long as the 
> underlying technology works).  Our software works at a high level, and 
> depends upon lower level bits.  I don't see the effect of the OS changes 
> as much as the tool/hardware vendors do, though every now and then 
> something breaks a driver.  But, and this is the critical point for us, 
> if our software breaks at our customers site, we own the fixes, it is 
> our job to make them happen.  More importantly, if something breaks in 
> the chain of software (whether we own it or not), we try to help, as it 
> is critical to make sure that failure modes are understood, and problems 
> are resolved.  We have been and will be helping our customers resolve 
> problems with third party software, commercial and otherwise.   If our 
> target platform were moving, so that the C compiler structures were 
> changing, and we had to rebuild time and time again with each OS update, 
> I would wait until we saw this settle out.  Otherwise we are spinning 
> our wheels, as each change is more work, and in the end, it should 
> converge to a final state.  It is the final state that is worth 
> targeting (for us, for others such as PathScale, they have to follow 
> what their customers use).

It sounds like you too take the software end seriously.  This is just
what you need to do to maintain a viable and interoperable product.
Good job.

> The issue in FC-x is that it is open to internals changing.  I think 
> this is a good thing.   It is doing what it was intended to do, and I 
> like seeing the directions I need to worry about going forward.  I will 
> not likely deploy this as an OS for a cluster customer without the 
> customer understanding exactly what they are getting, and making sure 
> they understand what is needed to support this.  If they really want a 
> cheap RH, they can get Centos/Tao.  If they want internal structural 
> stability, and support from commercial vendors for their commercial 
> codes, they will have to run something that the commercial vendors will 
> support.   PathScale and possibly the Portland group (and I am going to 
> guess Etnus and a few others) do or will likely support it.  LSTC, MSC, 
> Accelrys, Tripos, Oracle, ... will likely not (though it will probably 
> run fine with no issues).

This is all very reasonable, and Mark would probably even agree.  Run FC
if it works and your software base permits it, use Centos or buy
whatever you like if it doesn't.  Don't expect apples to become oranges,
remember that tanstaafl, and make the best compromises you can to get
things to work (the important thing).

And DON'T WORRY about what is a "beta" or what isn't.  It isn't
relevant.  FC is without question more dynamic.  It is without question
already through a real beta before release.  It is without question more
"daring" as it evolves more quickly and will break things more often.
It will also GENERALLY be a lot more functional, as breakage of
non-commercial things is confined to a relatively short phase right
after release, and with yum, fixes are RAPIDLY deployed.


Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
