[Beowulf] OS for 64 bit AMD
landman at scalableinformatics.com
Sun Apr 3 17:55:06 PDT 2005
Mark Hahn wrote:
> this is utterly pointless, since we seem to disagree on axioms:
Not really. We disagree on basic definitions, not axioms; axioms are accepted as given, and definitions are what we are arguing over.
> correct code conforms to the standard; it is buggy if it depends
> on undefined (outside-the-standard) behavior.
We agree on this.
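For concreteness, here is the classic one-liner (my illustration, not Mark's) of code that depends on undefined behavior: it compiles everywhere, may even appear to work, and is still buggy, because a conforming compiler is free to do anything with it:

    #include <stdio.h>

    int main(void)
    {
        int i = 0;
        /* Undefined behavior: i is modified twice without an
           intervening sequence point.  Two compilers may print
           different values, and both are "right" -- the bug is
           in the code, not the compiler. */
        printf("%d\n", i++ + i++);
        return 0;
    }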
> the platform is the ABI, not the distribution. if you believe that
> the ABI doesn't cover enough, talk to the organization that manages it.
We disagree on this. This is not an axiom. RH9 is the prototypical case: it changed an existing, functional ABI in an incompatible manner. At that point, the platform became the distribution, as commercial vendors target platforms (specifically RH) with the largest installed base. If the Linux platform were truly distribution independent, then it would not matter what a binary was compiled for, and frankly vendors would not need to QA against multiple distributions, as they would have the ABI. Unfortunately this is not how it works. I would like it to work like this. It would be great if the LSB did in fact require certification, and application vendors were required to code to certification levels (I have been arguing this for years). Not likely to happen, but it would be nice.
> productionworthiness (PW) is behavioral stability, not some vendor's
> assertion about "support".
It is *long term* behavioural, driver, and interface stability. Changing an ABI midway through (4k stacks) is *not* behavioral stability. You have no real reason to expect code to work correctly when you alter one of the critical underlying structures it relies upon. Many drivers rested on 8k kernel stacks; it was in the ABI as a (de facto) standard. RHEL3 did not (properly so) change its underlying kernel structures in such a way as to render some portions of the system unworkable. RHEL4 is not likely to change its underlying kernel structures in such a way as to render some portions of the system unworkable. FC-x is likely to (and has) changed its underlying kernel structures in exactly such a way.
> there is no data to suggest that a "supported" configuration
> is actually more stable - support is a matter of CYA and risk aversion.
> (not the actual risk; PW is the actual risk (well, inverse of it).)
It may in fact be less stable, though the likelihood is that it is more conservative (which makes support easier), so the implication is that it is more stable. The fact is that supported configurations are fundamentally averse to changing the underlying internals. This is not the case in FC-x (nor should it be, given its purpose).
> Fedora has normal release management, with pre-release testing
> as well as post-release updates. the pre-release testing is
> also known as "beta-testing".
So you accept pre-release testing as "beta-testing", but you deny that a "proving ground" is beta-testing? Those seem to be the same side of the same coin. Having normal release management does not a production quality system make. It is most definitely one of the requirements for such a system, but it does not, in and of itself, make the OS a production class OS. A reasonable definition of a production class OS will likely incorporate inherent stability of the underlying structures of the system, and a guarantee that they will not change for some fixed interval. Production specifically implies repetitive behavior: for HPC, a cycle shop. If the next incompatible change in FC-x renders your IB drivers unworkable for your cluster, does that in fact make the OS that you have installed on the system production ready or not? If you have to continuously chase hacks/patches/etc. to keep your system operational after every upgrade, does that make your system production ready?
> the existence of commercial products which specify RH-whatever vX.Y
> does not magically turn FC into a beta-test. if you redefine words
> that way, you might as well call all of SunOS a beta for Solaris.
Er... you are the only one who suggested this, so if you want to argue it, I would suggest you contact the person who generated the idea (that commercial products dependent upon RH make FC a beta test), who can be found at hahn _at_ physics _dot_ mcmaster _dot_ ca.
I said "My customers care about running on distributions (whoops, there
we go with that word again) on which their apps are supported. I am not
aware of active support for FC-x for applications from commercial
program providers. If I am incorrect about this, please let me know
(seriously, as FC-3++ looks to be pretty good)." Prior to this I said
"It is by Redhat's definition, a rolling beta (proving ground)." The
two are specifically independent ideas. I know of few commercially supported applications whose vendors will accept support calls from systems running FC-x.
Note: Debian has very little in the way of commercial support (none from the distributor). It is most definitely not a beta. You can use the beta version, unstable. That is what is analogous to Fedora.
What makes FC a beta is that Red Hat itself says as much, and is using Fedora as a "proving ground" (c.f. http://dictionary.reference.com/search?q=proving+ground ), as in "It is also a proving ground for new technology that may eventually make its way into Red Hat products." (from http://fedora.redhat.com/ ) From the reference.com site: "proving ground n. A place for testing new devices, weapons, or theories." Would you call a system that is defined by its maker to be a proving ground a production environment (i.e. stable, unchanging)?
> the customer needs to evaluate how fragile a commercial product is:
> how well it conforms to the ABI. NVidia is a great example of
> an attractive product which is inherently fragile since NVidia
> chooses to hide trade secrets in a binary-only, kernel-mode driver
> which (by definition and example) depends on undefined behavior.
> VMWare is another good (flawed) example.
Hmmm. I hear this argument time and again from people about the closed
source nature of nVidia's drivers. nVidia does not (as far as I know)
own all the intellectual property in their driver, and they do not have
the right to give that IP away via GPL or any other mechanism. The
fundamental flaw in the arguments against the nVidia driver is an inherent presumption that nVidia is hiding trade secrets in order to make its life better and get end user lock-in. The behavior it (the driver) depends upon had been built into the kernel, and when that behavior suddenly changed, nVidia's wasn't the only driver affected. Many open source drivers were impacted. Are you going to argue that this makes
them (the open source drivers) inherently fragile? This is a natural
extension and simple application of your argument. This is a weak
argument at best, and some of its fundamental premises are fatally
flawed. If nVidia owned all the IP in everything they released, and
chose simply to release binary only drivers, that would be a completely
different case. Unfortunately, a fair amount of the IP in OpenGL and
other related standards is owned by companies that have no interest in
open source other than demolishing it. SGI sold off most of its IP in
OpenGL to some other outfit.
> "supported configuration" is nothing more or less than a way to
> "download" support costs to the platform vendor (PV). it's a lever,
> acting on the customer as a pivot, to force the PV to avoid changes
> of any sort, since its impossible to tell what internals the proprietary
> product depends on.
Uh... I think we disagree again. A supported configuration is one that a customer, an end user, or a developer has a reasonable, fighting chance of getting to work right. This means that the internals exposed to developers (including driver developers) will not change. This means that end users and customers have a reasonable expectation that a configuration on the supported list should work, and the onus is on the platform vendor (nice to see you switched to the definition of platform that I was using, BTW) to make it work without breaking other stuff.
> similarly, SOP in the Fibrechannel world is to provide only negative
> definitions of support (nothing but HP disks in HP SANs.) this can be
> seen as a flaw in standard-defining, since Ethernet provides a fairly
> decent counterexample where interoperability is the norm because
> products need to conform, not "qualify".
A standard is only useful if people pay attention to it, and
engineer/design/build to it. Standards are very useful to developers,
in that if they code in a particular manner that adheres to the
standard, they have a fighting chance of developing something that will
work. If the standard suddenly changes on them, and their stuff breaks,
who do they turn to? If the target is moving, how much time/effort will
they expend to chase it?
In some cases (development tools) it makes sense to chase some specific
moving targets (though it costs time/effort and therefore real money).
In other cases it makes sense to wait for stable releases where things
will not change, so your customers/end users can get your stuff and make
it work, because you have a fighting chance at making it work.
Greg's company (and the folks at the Portland Group) have to chase these
targets... many of their customers are there (I'd bet that a small
fraction of their collective total customer base is using the development tools to generate commercial code; most are using the tools for their research/development tasks).
Yeah, there are significant interoperability problems in things like SAN and what-not-else. These are unfortunate. This is part of the reason why I try to avoid such things (I don't like vendors locking me in, and I know my customers don't like being locked in, so I don't waste my company's time trying to figure out how to do this). Don't assume that a company's or end user's misapplication of a standard, hijacking of a standard, or abuse of a standard somehow makes all standards bad. They are not. Standards are sometimes the only lever you have in a commercial closed source context... demanding that a company adhere to what it claims to sell is sometimes a necessary path. Interoperability means that when people interpret the standards, all parties agree on the definitions; that they guarantee their products will in fact conform to the standard; that there will be tests of standard compliance, and out-of-compliance systems will be adjusted into compliance; and that interoperability with other standards will be guaranteed. This is why IDE, SCSI, and Ethernet work so well. This is why some others do not. IB is likely to work quite well going forward. This is why the Samba folks are chasing a moving target, as the CIFS "standard" is a moving one (just go ahead and update that XP box with a Samba server around... grrrrr).
I like and use FC-x; we run FC-2 and FC-3 on various machines (AMD64, my laptop as part of a triple boot, and x86). I make sure our software runs on these; we compile and test on FC as well as on others (RH/Centos, SuSE, looking at Ubuntu/Debian). I am happy that our
binary packages seem to work nicely across multiple distributions
(though we usually bring the source along to be sure), and our large
systems are built from source, so they should work (as long as the
underlying technology works). Our software works at a high level, and
depends upon lower level bits. I don't see the effect of the OS changes
as much as the tool/hardware vendors do, though every now and then
something breaks a driver. But, and this is the critical point for us,
if our software breaks at a customer's site, we own the fixes; it is our job to make them happen. More importantly, if something breaks in
the chain of software (whether we own it or not), we try to help, as it
is critical to make sure that failure modes are understood, and problems
are resolved. We have been and will be helping our customers resolve
problems with third party software, commercial and otherwise. If our
target platform were moving, so that the C compiler structures were
changing, and we had to rebuild time and time again with each OS update,
I would wait until we saw this settle out. Otherwise we are spinning
our wheels, as each change is more work, and in the end, it should
converge to a final state. It is the final state that is worth targeting (for us; others such as PathScale have to follow what their customers use).
The issue in FC-x is that it is open to internals changing. I think
this is a good thing. It is doing what it was intended to do, and I
like seeing the directions I need to worry about going forward. I will
not likely deploy this as an OS for a cluster customer without the
customer understanding exactly what they are getting, and making sure
they understand what is needed to support this. If they really want a
cheap RH, they can get Centos/Tao. If they want internal structural
stability, and support from commercial vendors for their commercial
codes, they will have to run something that the commercial vendors will
support. PathScale and possibly the Portland Group (and I am going to
guess Etnus and a few others) do or will likely support it. LSTC, MSC,
Accelrys, Tripos, Oracle, ... will likely not (though it will probably
run fine with no issues).
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615