Copying RedHat Install Wizard

Robert G. Brown rgb at phy.duke.edu
Tue Feb 18 09:04:07 PST 2003


On Tue, 18 Feb 2003, Roberto F. Brandão wrote:

> Hi,
> 
>    Thanks for all the messages.  They've helped a lot.
> There is another question: is it possible to copy
> the RedHat boot disk (floppy) onto a HD?  If the
> answer is yes, is it a good idea to boot every
> Beowulf node from its HD and configure Linux
> on each one, choosing "Network Install"?
> What do you think about that?

If I understand you correctly, you mean "boot the kernel image in
bootnet.img" from your local hard disk to facilitate a reinstall.

This isn't quite the way to go about it, although you're on the right
track.  Here is a little script we use to set up a reinstall:

#!/bin/bash

# Copy the "everything" install kernel and initrd from the ks area into /boot
cp /var/phy/xtmp/ks/initrd-everything.img /boot/initrd-everything.img
cp /var/phy/xtmp/ks/vmlinuz-install /boot/vmlinuz-install

# Add a "reinstall" entry to /boot/grub/grub.conf and make it the default,
# handing the installer kernel the kickstart URL and a big enough ramdisk
grubby --add-kernel=/boot/vmlinuz-install \
       --args="ks=http://install.phy.duke.edu/kscgi/kscgi lang= devfs=nomount ramdisk_size=8192" \
       --make-default --title="reinstall" --initrd=/boot/initrd-everything.img

The initrd image contains drivers for "everything", especially all the
network drivers.  The kernel is the one associated with this image (and
is otherwise fairly general).  The grubby program (part of the mkinitrd
package) adds a new default boot target to /boot/grub/grub.conf.
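
If you want to eyeball what grubby actually wrote before rebooting,
something like the following shows the new stanza (assuming the stock
grub.conf location; the grubby --info form should work too, but check
your version):

grep -A 4 'title reinstall' /boot/grub/grub.conf
grubby --info=/boot/vmlinuz-install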

The script thus installs a kernel and initrd image in /boot and sets
grub.conf to boot them on the next reboot, running kickstart from the
kickstart file given.  The kickstart file drives the reinstallation of
the node (the kscgi figures out WHICH kickstart file to return to THIS
node, but of course you could point at an explicit ks file as well).
The install, of course, replaces /boot/grub/grub.conf with the correct
one for the new installation and reboots into it as its last step.

Total automagic.  Log in to a node over the network, run this script,
reboot, and five minutes later you have a spanking new node.  For that
matter, the same trick works for a LAN workstation or a server: log in,
run the script, reboot, five minutes later done.

A PXE-based install is almost identical, except that it loads the kernel
via PXE and gets the ks information to the kernel via dhcp.
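
For the curious, the PXE variant amounts to putting the same
kernel/initrd pair in your tftp root and handing pxelinux the same ks
URL.  A rough sketch only -- the tftp root, the dhcpd snippet and the
addresses below are assumptions you'd adapt to your own site:

mkdir -p /tftpboot/pxelinux.cfg
cp /usr/lib/syslinux/pxelinux.0 /tftpboot/
cp /var/phy/xtmp/ks/vmlinuz-install /var/phy/xtmp/ks/initrd-everything.img /tftpboot/
cat > /tftpboot/pxelinux.cfg/default <<'EOF'
default reinstall
label reinstall
  kernel vmlinuz-install
  append initrd=initrd-everything.img ks=http://install.phy.duke.edu/kscgi/kscgi lang= devfs=nomount ramdisk_size=8192
EOF

plus something along these lines in dhcpd.conf so the nodes can find
the tftp server:

  next-server 192.168.1.1;
  filename "pxelinux.0";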

People don't realize just how scalable this sort of thing (along with
some other tools, like yum) makes linux management.  Even the fanciest
and newest features of e.g. Windows 2000 network installation pale in
comparison.  A kickstart file for each KIND of entity is quite simple
to create and clone; the most that one typically needs to edit for a
particular machine is the disk layout and/or the video configuration
and/or the network configuration for systems with multiple NICs.  All
of this can be managed automagically, but with a decent chance that
some tweaking will be required to get things just right for certain
hardware.
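
For flavor, here is a trimmed kickstart sketch (NOT one of our real
files -- the url, partitioning, package group and password below are
all placeholders):

install
url --url http://install.phy.duke.edu/redhat/i386
lang en_US
keyboard us
network --bootproto dhcp
rootpw changeme
authconfig --enableshadow --enablemd5
timezone America/New_York
skipx
bootloader --location=mbr
# the disk layout and network lines are the bits you typically tweak
# per KIND of machine
clearpart --all --initlabel
part /boot --size 64
part swap --size 1024
part / --size 1 --grow
reboot
%packages
@ Base
%post
# site-specific glue (NIS, automounter, whatever) goes here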

Armed with a script to run the required commands across a LAN or cluster
from a single instantiation, a systems person can reinstall or upgrade a
cluster of 128 nodes with a few minutes of human effort plus some wait
time spent playing video games while the unattended network installs
complete, in parallel, at a rate limited only by server and network
capacity.  *Nothing on the planet* can do any better than this, at
least that I know of.  Nothing even comes close.
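
The "script to run the required commands" can be as dumb as a for loop
over ssh.  A sketch, assuming the grubby script above is installed as
/usr/local/sbin/setup-reinstall on every node (that name is made up)
and root ssh keys are in place:

#!/bin/bash
# kick off reinstalls on nodes b001..b128 -- adjust names to your cluster
for n in $(seq -w 1 128); do
    ssh b$n '/usr/local/sbin/setup-reinstall && /sbin/shutdown -r now'
done

The loop itself runs in seconds; the installs then grind away in
parallel on their own.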

Again, PXE makes the original install almost as easy, except that one
probably does it at the console of each machine as one brings them up
in the rack (so that one can match node names to the labels in the
rack and assign static IP numbers).  Even a floppy-based install is
hardly any more difficult -- put the ks-defaulted floppy into the
drive and reboot, no console interaction necessary.  Remove the floppy
when it stops spinning and move on to the next node.
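
By "ks-defaulted" I just mean a bootnet-style floppy whose syslinux.cfg
has been edited to boot straight into kickstart with no prompting --
roughly like this (a sketch; check the kernel/initrd file names against
what is actually on your boot floppy):

mount /dev/fd0 /mnt/floppy
cat > /mnt/floppy/syslinux.cfg <<'EOF'
default ks
prompt 0
label ks
  kernel vmlinuz
  append initrd=initrd.img ks=http://install.phy.duke.edu/kscgi/kscgi lang= devfs=nomount ramdisk_size=8192
EOF
umount /mnt/floppy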

I'm estimating that *after* all of this is set up once and for all
for an institution (which requires real FTE time and expertise) the
scaled time required to install and manage a node over its entire
lifetime using methodology such as this can be as low as one hour.
Literally a few minutes to rack the node and install it, a few minutes a
year to reinstall it as necessary for a major upgrade -- minor updates
handled completely automagically.  By far the BIGGEST FTE human cost per
node is likely to be screwing around with hardware failure!  Fixing a
single component failure per node over its lifetime could cost several
times as much human time (and hence money) as installation and
management, making on-site service contracts look very good indeed
unless one has ample opportunity cost labor, a.k.a. graduate students
or partly idle systems people handy ;-)

We've found in our otherwise excruciatingly efficient and scalable linux
environment (cluster and LAN both) that hardware failure rates are
actually the thing that limits the number of hosts and nodes per systems
administrator (to a ballpark of hundreds each).  At that point they are
dealing with hardware failures for hours every day, with hardware
failure rates on the order of tenths of a percent per day.  Providing
human support of all sorts is also an hours-a-day limiting factor that
scales with both number of machines and number of users.  Providing
software-level systems (installation and update) support is a negligible
fraction of their time, per system -- minutes per day across hundreds of
machines.

  rgb

> 
> Hardware information:
> - 40-node Beowulf
> - Nodes: Dual Xeon 2.4, 4 GB RAM, 
>               40GB IDE HD, 2 Gigabit Ethernet Ports
> 
> Thanks 
> 
> Roberto
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





