[Beowulf] Clusters and Distro Lifespans
landman at scalableinformatics.com
Wed Jul 19 10:03:13 PDT 2006
Robert G. Brown wrote:
> On Wed, 19 Jul 2006, Stu Midgley wrote:
>> We also have our install process configured to allow booting different
>> distros/images, which is useful to boot diagnostic cd images etc.
> Good point and one I'd forgotten to mention. It is really lovely to
> keep a PXE boot image pointed at tools like memtest86, a freedos image
> that can e.g. flash bios or do other stuff that expects an environment
> that can execute a MS .exe, boot into a diskless config for repair
> purposes (or to bring up a node diskless while waiting for a replacement
The tools we set up do all of this, and for those who are brave (or
foolish, not sure which) we also have dban ... . Still working on
getting Knoppix to do this; I know it's possible, but I haven't seen
docs on how to do it.
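For anyone who wants a starting point, here is a rough sketch (not our
actual tooling) of generating a PXELINUX menu with a couple of utility
entries. The labels, file paths, and the render_pxe_menu helper are all
made up for illustration; the only things assumed from syslinux itself
are the DEFAULT/PROMPT/TIMEOUT/LABEL/KERNEL/APPEND keywords and the
memdisk loader for booting a floppy image such as FreeDOS.

```python
# Hypothetical sketch: render a PXELINUX config offering utility images.
# Every label and path below is an illustrative assumption, not a real layout.

def render_pxe_menu(entries, default, timeout_ds=100):
    """Return pxelinux.cfg text for (label, kernel, append) tuples.

    timeout_ds is in tenths of a second, as PXELINUX's TIMEOUT expects.
    """
    labels = [label for label, _, _ in entries]
    if default not in labels:
        raise ValueError(f"default {default!r} is not one of {labels}")

    lines = [f"DEFAULT {default}", "PROMPT 1", f"TIMEOUT {timeout_ds}", ""]
    for label, kernel, append in entries:
        lines.append(f"LABEL {label}")
        lines.append(f"  KERNEL {kernel}")
        if append:  # APPEND is optional; skip it when there are no arguments
            lines.append(f"  APPEND {append}")
        lines.append("")
    return "\n".join(lines)


# Assumed TFTP-relative paths; adjust to wherever your images actually live.
ENTRIES = [
    ("memtest", "images/memtest", None),
    ("freedos", "memdisk", "initrd=images/freedos.img"),
]

if __name__ == "__main__":
    print(render_pxe_menu(ENTRIES, default="memtest"))
```

The output would go in something like pxelinux.cfg/default on the TFTP
server. Keeping the entries in one list makes it easy to add a diskless
repair image later without hand-editing the menu.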
> Honestly, for MOST work people do with clusters, running pretty much the
> (PXE-installable) distro of your choice will almost certainly work. I
> tend to use FC-even or Centos (a.k.a. FC-even-frozen) on cluster nodes
> simply because we have long since gotten to where we can make RH-derived
> distributions jump through hoops. With Seth Vidal in charge of the core
> mirrors and repos, Duke is "Repo World" not just to campus but to much
> of the world. Heck, I PXE-boot and kickstart install my systems at
> HOME using mirrors of the duke repos, and if I ever bothered to figure
> out Icon's toolset for customizing kickstart boots per system (using
> some very clever CGI scripts and a bit of XML) it would make those
> installs even easier than they are now.
Sadly, not all distros do yum, nor do all distros have sensible
dependency trees, nor even sane/common naming.
SuSE as of 10.0 can work with yum. We have/host a repo for
ourselves/customers. The problem is that yum is not a first-class
system tool on SuSE the way rug or zmd or whatever is, which means that
there are things that break yum under SuSE that don't break
Yast/zmd/rug. Grrrrr. (If anyone from SuSE is reading: this was a
really bad idea. Go to yum; your life, my life, and your customers' lives
will be *much* easier.) Well, there is that, and yum on 10.1 is slightly
>>> > iii) Do people regularly upgrade their clusters in relation to
>>> > distros? I guess this is like asking how long is a piece of string
>>> > because everyone's needs are different.
>>> Cluster upgrades are rare unless you are missing functionality or
>>> something is broken. That is of course one opinion, some here do
>>> upgrades nightly. From a purely production oriented viewpoint, where
>>> downtime == lost money for our customers, we usually advise against
>> I think rare is a strong word. Infrequent may be better. We
>> regularly apply patches and upgrades to the front end nodes (globally
>> connected) and infrequently (~ every 6 months) upgrade all the cluster
>> nodes in the rolling fashion mentioned above.
I assume that rare == infrequent. Basically, the argument for production
cycle shops is that you don't upgrade unless there is a need to. That
is, stuff could/does break with upgrades, and you have to be really
careful. Test test test. If you need a security patch, I am not sure
any production cycle shop considers this an upgrade, but again, test
test test. The rule of thumb that I see followed is "if it ain't
broke, don't fix it".
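To make the "test test test" point concrete for rolling upgrades, here
is a minimal sketch of one way to do it: upgrade a small canary batch
first, check it, and stop the moment anything looks unhealthy. This is
an illustration under assumptions, not anyone's production procedure.
The upgrade and healthy callables, the batch sizes, and the node names
are all hypothetical placeholders for whatever your site really uses.

```python
# Hypothetical sketch of a canary-first rolling upgrade.
# upgrade(node) and healthy(node) are site-specific callables you supply.

def rolling_batches(nodes, canary=1, batch_size=4):
    """Split nodes into one small canary batch, then fixed-size batches."""
    if canary < 0 or batch_size < 1:
        raise ValueError("canary must be >= 0 and batch_size >= 1")
    nodes = list(nodes)
    batches = []
    if canary and nodes:
        batches.append(nodes[:canary])
    rest = nodes[canary:]
    for i in range(0, len(rest), batch_size):
        batches.append(rest[i:i + batch_size])
    return batches


def rolling_upgrade(nodes, upgrade, healthy, canary=1, batch_size=4):
    """Upgrade batch by batch; stop after the first batch with a failure.

    Returns (upgraded_ok, failed). Nodes in later batches are left untouched.
    """
    upgraded_ok = []
    for batch in rolling_batches(nodes, canary, batch_size):
        for node in batch:
            upgrade(node)
        failed = [n for n in batch if not healthy(n)]
        upgraded_ok.extend(n for n in batch if n not in failed)
        if failed:
            return upgraded_ok, failed
    return upgraded_ok, []
```

With seven nodes, canary=1 and batch_size=3, the batches come out as one
node, then three, then three. If the health check fails in the second
batch, the last three nodes never get touched, which is the whole point.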
If you install new hardware, you likely need newer kernels and drivers
to deal with it (say, SATA on RHEL4 before U1).
>> You can even do kernel upgrades to the file servers/front end nodes
>> (which requires a reboot) without killing or disrupting jobs. Having
>> complete control has a lot of benefits.
It does, and you often need a fairly competent staff around to make this
work. There is a shortage of Mark Hahns in the world, so not every
site can do the stuff he does. Similarly for other sites.
> On the whole, though, updates are there for a reason and STABILIZE
> systems more often than they DESTABILIZE them.
The last Centos 4.3 x86_64 kernel update almost nuked one of our very
important servers. I had to back it out, and thankfully I had backups of
the affected files. Updates are *supposed* to increase stability. They
don't always do that. Remember that an update is brain surgery; if you
treat it as anything less than that, you are going to get burned someday.
The folks advising caution are not advising it because they like to be
cautious, but because they have been burned before, and they don't want
to see others fall into the same behavior that burned them.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615