[Beowulf] Which distro for the cluster?

Fri Dec 29 09:24:14 PST 2006

On Fri, 29 Dec 2006, Geoff Jacobs wrote:

> I'd rather have volatile user-level libraries and stable system level
> software than vice versa. Centos users need to be introduced to the
> lovely concept of backporting.

The problem (one of many) is with operations like banks.  In order for a
bank to use a distro at all, it has to be audited for security at a
horrendous level.  If you change a single library, they have to audit
the whole thing all over again.  Costly and annoying, so RHEL "freezes"
except for bugfixes because for companies like banks and other large
operations, any change at all costs money.  You can see the mentality
running wild in lots of other places -- most "big iron" machine rooms
were rife with it for a couple of decades, and even though I've been in
this business one way or another for most of my professional life I
>>still<< underestimate the length of time it will take for really
beneficial changes to permeate the computing community.  By years if not
decades.  I fully expected MS to be on the ropes at this point, being
truly hammered by linux on all fronts, for example -- but linux keeps
"missing" the mass desktop market by ever smaller increments even as it
has finally produced systems that do pretty damn well on office
desktops.  I still view Linus's dream of world domination as a
historical inevitability, mind you, I just no longer think that it will
happen quite as catastrophically suddenly.

Centos, of course, won't alter this pattern because diverging from RHEL
also costs money and obviates the point for Centos users, who want the
conservatism and update stream without the silly cost scaling or largely
useless support.  However, Centos "users" are largely sysadmins, not end
users per se, and lots of them DO backport really important updates on
an as needed basis.  Fortunately, in many cases an FC6 src rpm will
build just fine on a Centos 4 system, and rpmbuild --rebuild takes a few
seconds to execute and drop the result into a yum-driven local "updates"
repo.  So I'd say most pro-grade shops already do this as needed.
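
As a rough sketch of that rebuild-and-publish cycle (not necessarily
how any particular shop does it), the Python below only assembles the
command lines and prints them; nothing is executed.  The package name,
the built-RPM path and the repo directory are hypothetical
placeholders, and it assumes createrepo is the tool used to regenerate
the yum metadata for the local repo.

```python
import shlex


def backport_commands(src_rpm, built_rpms, repo_dir):
    """Return shell command lines for a rebuild-and-publish cycle.

    Nothing is run here; the caller decides whether to execute them.
    """
    commands = [
        # Rebuild the binary RPM(s) from the source RPM.
        ["rpmbuild", "--rebuild", src_rpm],
        # Copy the freshly built RPM(s) into the local repo directory.
        ["cp", *built_rpms, repo_dir],
        # Regenerate the repo metadata so yum clients see the update.
        ["createrepo", repo_dir],
    ]
    return [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]


# Hypothetical names and paths, for illustration only.
for line in backport_commands(
    "foo-1.0-1.fc6.src.rpm",
    ["/usr/src/redhat/RPMS/i386/foo-1.0-1.i386.rpm"],
    "/var/www/html/local-updates",
):
    print(line)
```

Clients would then pick the package up on their next yum update,
provided the local repo is listed in their yum configuration.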

My problem with being conservative with a cluster distro is that it
requires impeccable timing.  If you happen to build your cluster right
when the next release of RHEL happens to correspond with the next
release of FC, it is auspicious.  In that case both distros are up to
date on the available kernel drivers and patches for your (presumably
new and cutting edge) hardware, with the highest probability of a
fortunate outcome.  However, if you try to build a cluster with e.g.
AMD64 nodes and the "wrong motherboard" on top of Centos/RHEL 4, all
frozen and everything back when the motherboard and CPU itself really
didn't exist, you have an excellent chance of discovering that the
distro won't even complete an install, at least not with x86_64
binaries.  Or it will install but its built in graphics adapter won't
work.  Or its sound card (which may not matter, but the point is clear).

Then you've got a DOUBLE problem -- to use Centos you have to somehow
backport regularly from a dynamically maintained kernel stream, or else
avoid a potentially cost-efficient node architecture altogether, or else
-- abandon Centos.  The stars just aren't right for the conservative
streams for something like the last year of each release if you are
interested in running non-conservative hardware.

The problem is REALLY evident for laptops -- there are really major
changes in the way the kernel, rootspace, and userspace manage devices,
changes that are absolutely necessary for us to be able to plug cameras,
memory sticks, MP3 players, printers, bluetooth devices, and all of that
right into the laptop and have it "just work".  NetworkManager simply
doesn't work for most laptops and wireless devices before FC5, and it
doesn't really work "right" until you get to FC6 AND update to at least
0.6.4. On RHEL/Centos 4 (FC4 frozen, basically), well...

One of the major disadvantages linux has had relative to WinXX over the
years has been hardware support that lags, often by years, behind the
WinXX standard.  Because of the way linux is developed, the ONLY way one
can fix this is to ride a horse that is rapidly co-developed as new
hardware is released, and pray for ABI and API level standards in the
hardware industry in front of your favorite brazen idol every night
(something that is unlikely to work but might make you feel better:-).

The fundamental "advantage" of FC6 is that its release timing actually
matches up pretty well against the frenetic pace of new hardware
development -- six to twelve month granularity means that you can
"usually" buy an off-the-shelf laptop or computer and have a pretty good
chance of it either being fully supported right away (if it is older
than six months) or being fully supported within weeks to months --
maybe before you smash it with a sledgehammer out of sheer frustration.
From what I've seen, ubuntu/debian has a somewhat similar aspect, user
driven to get that new hardware running even more aggressively than with
FC (and with a lot of synergy, of course, even though the two
communities in some respects resemble Sunnis vs the Shiites in Iraq:-).
SINCE they are user driven, they also tend to have lots of nifty
userspace apps, and since we have entered the age of the massive, fully
compatible, contributed package repo I expect FC7 to provide something
on the order of 10K packages, maybe 70% of them square in userspace (and
the rest libraries etc).

This might even be the "nextgen" revolution -- Windows cannot yet
provide fully transparent application installation (for money or not)
over the network -- they have security issues, payment issues,
installshield/automation issues, permission issues, and
compatibility/library issues all to resolve before they get anywhere
close to what yum and friends (or debian's older and also highly
functional equivalents) can do already for linux.  What the software
companies that are stuck in the "RHEL groove" don't realize is that RPMs,
yum and the idea of a repo enable them to set up a completely different
software distribution paradigm, one that can in fact be built for and
run on all the major RPM distros with minimal investment or risk on
their part.  They don't "get it" yet.  When they do, there could be an
explosion in commercial grade, web-purchased linux software and
something of a revolution in software distribution and maintenance (as
this would obviously drive WinXX to clone/copy).  Or not.

Future cloudy, try again later.

> Call me paranoid, but I don't like the idea of a Cadbury Cream Egg
> security model (hard outer shell, soft gooey center). I won't say more,
> 'cuz I feel like I've had this discussion before.

Ooo, then you really don't like pretty much ANY of the traditional "true
beowulf" designs.  They are all pretty much cream eggs.  Hell, lots of
them use rsh without passwords, or open sockets with nothing like a
serious handshaking layer to do things like distribute binary
applications and data between nodes.  Grid designs, of course, are
another matter -- they tend to use e.g. ssh and so on but they have to
because nodes are ultimately exposed to users, probably not in a chroot
jail.  Even so, has anyone really done a proper security audit of e.g.
pvm or mpi?  How difficult is it to take over a PVM virtual machine and
insert your own binary?  I suspect that it isn't that difficult, but I
don't really know.  Any comments, any experts out there?

In the specific case of my house, anybody who gets to where they can
actually bounce a packet off of my server is either inside its walls and
hence has cracked e.g. WPA or my DSL firewall or one of my personal
accounts elsewhere that hits the single (ssh) passthrough port.  In all
of these cases the battle is lost already, as I am God on my LAN of
course, so a trivial password trap on my personal account would give
them root everywhere in zero time.  In fact, being a truly lazy
individual who doesn't mind exposing his soft belly to the world, if
they get root anywhere they've GOT it everywhere -- I have root set up
to permit free ssh between all client/nodes so that I have to type a
root password only once and can then run commands as root on any node
from an xterm as one-liners.
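
A minimal sketch of what such one-liners amount to, assuming key-based
root ssh is already set up between the machines.  The node names are
hypothetical, and the function only builds the ssh command lines rather
than running them.

```python
def node_commands(nodes, remote_command, user="root"):
    """Build one ssh invocation per node; nothing is executed here."""
    return [["ssh", f"{user}@{node}", remote_command] for node in nodes]


# Hypothetical node names, for illustration only.
for cmd in node_commands(["node01", "node02"], "uptime"):
    print(" ".join(cmd))
```

Each list could be handed to subprocess.run() as-is, which is exactly
why root on one machine means root on all of them.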

This security model is backed up by a threat of physical violence
against my sons and their friends, who have carefully avoided learning
linux at anything like the required level for cracking because they know
I'd like them to, and the certain knowledge that my wife is doing very
well if she can manage to crank up a web browser and read her mail
without forgetting something and making me get up out of bed to help her
at 5:30 am.  So while I do appreciate your point on a
production/professional network level, it really is irrelevant here.

> Upgrade it, man. Once, when I was bored, I installed apt-rpm on a RH8
> machine to see what dist-upgrade looked like in the land of the Red Hat.
> Interesting experience, and it worked just fine.

There are three reasons I haven't upgraded it.  One is sheer bandwidth.
It takes three days or so to push FCX through my DSL link, and while I'm
doing it all of my sons and wife and me myself scream because there
ain't no bandwidth leftover for things like WoW and reading mail and
working.  This can be solved with a backpack disk and my laptop -- I can
take my laptop into Duke and rsync mirror a primary mirror, current
snapshot, with at worst a 100 Mbps network bottleneck (I actually think
that the disk bottleneck might be slower, but it is still way faster
than 384 kbps or thereabouts:-).
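
A back-of-envelope check on those numbers, as a sketch only: it takes
the "three days" and "384 kbps" figures at face value, assumes the link
is fully saturated the whole time, and ignores protocol overhead and
the disk bottleneck mentioned above.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time: payload bits divided by raw link rate."""
    return size_bytes * 8 / link_bits_per_second


DSL_BPS = 384_000        # ~384 kbps DSL link (figure from the text)
LAN_BPS = 100_000_000    # 100 Mbps campus network (figure from the text)
THREE_DAYS = 3 * 24 * 3600

# Data volume implied by three days of a saturated DSL link.
size_bytes = DSL_BPS * THREE_DAYS / 8
lan_minutes = transfer_seconds(size_bytes, LAN_BPS) / 60

print(f"implied mirror size: {size_bytes / 1e9:.1f} GB")
print(f"same data at 100 Mbps: {lan_minutes:.1f} minutes")
```

So roughly 12 GB that takes three days at home would move in well
under half an hour at Duke, network permitting.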

The second is the bootstrapping problem.  The system in question is my
internal PXE/install server, a printer server, and an md raid
fileserver.  I really don't feel comfortable trying an RH9 -> FC6
"upgrade" in a single jump, and a clean reinstall requires that I
preserve all the critical server information and restore it post
upgrade.  At the same time it would be truly lovely to rebuild the MD
partitions from scratch, as I believe that MD has moved along a bit in
the meantime.

This is the third problem -- I need to construct a full backup of the
/home partition, at least, which is around 100 GB and almost full.
Hmmm, it might be nice to upgrade the RAID disks from 80 GB to 160's or
250's and get some breathing room at the same time, which requires a
small capital investment -- say $300 or thereabouts.  Fortunately I do
have a SECOND backpack disk with 160 GB of capacity that I use as a
backup, so I can do an rsync mirror to that of /home while I do the
reinstall shuffle, with a bit of effort.
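
The rsync mirror step could look something like the sketch below.  The
mount point for the backpack disk is a hypothetical placeholder, and the
function defaults to rsync's dry-run mode so that nothing is copied or
deleted until the file list has been checked.

```python
def home_backup_command(source="/home/", dest="/mnt/backpack/home/",
                        dry_run=True):
    """Build an rsync command line that mirrors source onto dest."""
    cmd = [
        "rsync",
        "-aH",        # archive mode, and preserve hard links
        "--delete",   # remove files on dest that no longer exist on source
    ]
    if dry_run:
        cmd.append("--dry-run")   # only report what would be done
    # Trailing slash on source: copy the contents, not the directory itself.
    return cmd + [source, dest]


print(" ".join(home_backup_command()))
print(" ".join(home_backup_command(dry_run=False)))
```

With --delete in play the direction matters: source and dest swapped
would wipe /home instead of backing it up, hence the dry run first.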

All of this takes time, time, time.  And I cannot begin to describe my
life to you, but time is what I just don't got to spare unless my life
depends on it.  That's the level of triage here -- staunch the spurting
arteries first and apply CPR as necessary -- the mere compound fractures
and contusions have to wait.  You might have noticed I've been strangely
quiet on-list for the last six months or so... there is a reason:-)

At the moment, evidently, I do have some time and am kind of catching
up.  Next week I might have even more time -- perhaps even the full day
and change the upgrade will take.  I actually do really want to do it --
both because I do want it to be nice and current and secure and because
there are LOTS OF IMPROVEMENTS at the server level in the meantime --
managing e.g. printers with RH9 tools sucks for example, USB support is
trans-dubious, md is iffy, and I'd like to be able to test out all sorts
of things like the current version of samba, a radius server to be able
to drop using PSK in WPA, and so on.  So sure, I'll take your advice
"any day now", but it isn't that simple a matter.

>> within its supported year+, then just freeze it.  Or freeze it until
>> there is a strong REASON to upgrade it -- a miraculously improved
>> libc, a new GSL that has routines and bugfixes you really need,
>> superyum, bproc as a standard option, cernlib in extras (the latter a
>>  really good reason for at least SOME people to upgrade to FC6:-).
> Or use a distro that backports security fixes into affected packages
> while maintaining ABI and API stability. Gives you a frozen target for
> your users and more peace of mind.

No arguments.  But remember, you say "users" because you're looking at
topdown managed clusters with many users.  There are lots of people with
self-managed clusters with just a very few.  And honestly,
straightforward numerical code is generally cosmically portable -- I
almost never even have to do a recompile to get it to work perfectly
across upgrades.  So YMMV as far as how important that stability is to
users of any given cluster.  There is a whole spectrum here, no simple
or universal answers.

>> Honestly, with a kickstart-based cluster, reinstalling a thousand
>> nodes is a matter of preparing the (new) repo -- usually by rsync'ing
>>  one of the toplevel mirrors -- and debugging the old install on a
>> single node until satisfied.  One then has a choice between a yum
>> upgrade or (I'd recommend instead) yum-distributing an "upgrade"
>> package that sets up e.g.  grub to do a new, clean, kickstart
>> reinstall, and then triggers it.  You could package the whole thing
>> to go off automagically overnight and not even be present -- the next
>>  day you come in, your nodes are all upgraded.
> Isn't automatic package management great. Like crack on gasoline.

Truthfully, it is trans great.  I started doing Unix admin in 1986, and
have used just about every clumsy horrible scheme you can imagine to
handle add-on open source packages without which Unix (of whatever
vendor-supplied flavor) was pretty damn useless even way back then.
They still don't have things QUITE as simple as they could be -- setting
up a diskless boot network for pxe installs or standalone operation is
still an expert-friendly sort of thing and not for the faint of heart or
tyro -- but it is down to where a single relatively simple HOWTO or set
of READMEs can guide a moderately talented sysadmin type through the
process.

With these tools, you can administer at the theoretical/practical limit
of scalability.  One person can take care of literally hundreds of
machines, either nodes or LAN clients, limited only by the need to
provide USER support and by the rate of hardware failure.  I could see a
single person taking care of over a thousand nodes for a small and
undemanding user community, with onsite service on all node hardware.  I
think Mark Hahn pushes this limit, as do various others on list.  That's
just awesome.  If EVER corporate america twigs to the cost advantages of
this sort of management scalability on TOP of free as in beer software
for all standard needs in the office workplace... well, one day it will.
Too much money involved for it not to.

>> I used to include a "node install" in my standard dog and pony show
>> for people who come to visit our cluster -- I'd walk up to an idle node,
>> reboot it into the PXE kickstart image, and talk about the fact that
>> I was reinstalling it.  We had a fast enough network and tight enough
>>  node image that usually the reinstall would finish about the same
>> time that my spiel was finished.  It was then immediately available
>> for more work. Upgrades are just that easy.  That's scalability.
>>
>> Warewulf makes it even easier -- build your new image, change a
>> single pointer on the master/server, reboot the cluster.
>>
>> I wouldn't advise either running upgrades or freezes of FC for all
>> cluster environments, but they certainly are reasonable alternatives
>> for at least some.  FC is far from laughable as a cluster distro.
> What I'd like to see is an interested party which would implement a
> good, long term security management program for FC(2n+b) releases. RH
> obviously won't do this.

I thought there was such a party, but I'm too lazy to google for it.  I
think Seth mentioned it on the yum or dulug list.  It's the kind of
thing a lot of people would pay for, actually.

> Do _not_ start a contest like this with the Debian people. You _will_ lose.

And I _won't_ care...;-)

It took me two days to wade through extras in FC6, "shopping", and now
there are another 500 packages I haven't even looked at a single time.
The list of games on my laptop is something like three screenfuls long,
and it would take me weeks to just explore the new applications I did
install.  And truthfully, the only reason I push FC is because (as noted
above) it a) meets my needs pretty well; and b) has extremely scalable
installation and maintenance; and c) (most important) I know how to
install and manage it.  I could probably manage debian as well, or
mandriva, or SuSE, or Gentoo -- one advantage of being a 20 year
administrator is I do know how everything works and where everything
lives at the etc level beneath all GUI management tool gorp layers
shovelled on top by a given distro -- but I'm lazy.  Why learn YALD?
One can be a master of one distro, or mediocre at several...

> I haven't used a RH based machine which regularly synced against a
> fast-moving package repository, so I can't really compare. :)

Pretty much all of the current generation do this.  Yum yum.

Where one is welcome to argue about what constitutes a "fast-moving"
repository.  yum doesn't care, really.  Everything else is up to the
conservative versus experimental inclinations of the admin.

> I personally believe more configuration is done on Debian systems in
> package configuration than in the installer as compared with RH, but I
> do agree with you mainly. It's way short of what FAI, replicator, and
> system imager do too.

The last time I looked at FAI it was Not Ready For Prime Time and
languishing unloved.  Of course this was a long time ago.  I'm actually
glad that it is loved.  The same is true of replicators and system
imagers -- I've written them myself (many years ago) and found them to
be a royal PITA to maintain as things evolve, but at this point they
SHOULD be pretty stable and functional.  One day I'll play with them, as
I'd really like to keep a standard network bootable image around to
manage disk crashes on my personal systems, where I can't quite boot to
get to a local disk to recover any data that might be still accessible.
Yes there are lots of ways to do this and I do have several handy but a
pure PXE boot target is very appealing.

>> Yes, one can (re)invent many wheels to make all this happen --
>> package up stuff, rsync stuff, use cfengine (in FC6 extras:-), write
>> bash or python scripts.  Sheer torture.  Been there, done that, long
>> ago and never again.
> Hey, some people like this. Some people compete in Japanese game shows.

Yes, but from the point of view of perfect scaling theory, heterogeneity
and nonstandard anything is all dark evil.  Yes, many people like to
lose themselves in customization hell, but there is a certain zen
element here and Enlightenment consists of realizing that all of this is
Illusion and that there is a great Satori to be gained by following the
right path....

OK, enough system admysticstration...;-)

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu