[Beowulf] Help req: Building a disked Beowulf
Robert G. Brown
rgb at phy.duke.edu
Thu Aug 25 03:59:59 PDT 2005
Chaitanya Krishna writes:
> Hi,
>
> I am posting to the list for the first time.
>
> First a little background. Please bear with me.
>
> I am doing my research in Molecular Dynamics and we have a very badly
> running Beowulf with 10 nodes in our lab. The current state of the cluster
> is that the Master is able to connect to the outside world (it has two
> network cards) and we can access all the nodes (each has a network
> card) from the master which are all connected through a switch. We are
> able to run serial jobs on all the nodes but not parallel jobs. All of
> them have SuSE 9.1 installed on them, but not exactly with the same
> partitions and the same software, as the last few nodes were added by
> a person different from the one who originally built the
> cluster.
>
> I have been entrusted with the responsibility of getting the cluster
> to run parallel jobs as I am considered to be a computer geek here
> (which, as you will know, I am not). Hence the request for this help.
> You can consider that I just do not know anything. I have already
> Googled and found and read some interesting stuff on the net about
> building a cluster. But I am writing to get the views of you
> experienced guys out there in the cyber space.
>
> Well, the resources that I have are these:
>
> 1. Intel Pentium 4 3 GHz processors -- 10
> 2. Intel motherboards -- 10
> 3. 200 GB SATA hard disks -- 10
> 4. 120 GB IDE hard disks -- 10
> 5. Network cards -- 10 + 1 (1 extra for master)
> 6. Some already-present switches
>
> All the IDE drives will be primary (the OS will reside on these) and
> the SATA drives will be used as secondary drives for storage.
>
> My plan (and requirement) is the following:
>
> 1 To get the cluster up and running parallel jobs.
> 2 The way I intend to do 1 is this. Install the OS (SuSE 9.3 Pro) on
> the master and install a barebones system (I am not sure, but maybe
> something like kernel, NFS and/or NIS, SSH, etc.) on the rest of the nodes so
> that I am able to run parallel jobs as well as serial jobs on the
> nodes. Will require help on this.
Your hardware looks perfectly reasonable for a small cluster. Let's
hope that your NICs and switches "match" in some way -- enough ports,
gigabit ports and gigabit cards, whatever. One has to wonder a bit
about why the nodes have both a 120 GB IDE and 200 GB SATA drive instead
of e.g. 2x[120,200] GB SATA only. I've never mixed drives like this and
would expect that it works but would worry that it might do something to
performance (Mark Hahn usually is the answer man as far as the overall
IDE drive subsystem is concerned:-).
I'm assuming that the NICs are PXE-capable and that you've got a KVM
setup that you can move from machine to machine somehow to set the BIOS
and manage at least the initial install.
In that case you will want to master PXE (network) installs more or
less immediately. If you are going to install SuSE on the nodes I
cannot help you directly as I have very little experience with it, but
if you were using a RH-derived RPM-based distro (RHEL, Centos, Fedora
Core, ...) then it would go something like this:
1) Install master from whatever means you have at hand. I'd generally
suggest directly over the network from a reliable repository or mirror
simply because most network installs of e.g. FC4 will also set up yum
for you for automated updates. If you are affiliated with an
institution that has an FC mirror already it may even have updates and
various extensions (e.g. "extras"). The master should likely have both
"workstation" and "server" packages installed. Be generous in your
package selection at first as you have plenty of disk and you can tune
up your package selection later -- for now it will just cost you time to
have to constantly grab packages that are missing and install them to
get something to work. OTOH don't install "everything" per se if there
are clearly packages that you Will Not Need on the master. Definitely
install apache httpd and tftpd and dhcpd (in one flavor or another).
NIS is useful although there are some alternatives you CAN consider.
NFS is essential.
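For what it's worth, on an FC-era master grabbing the server-side pieces
is roughly the following (package and service names are from memory and
may differ slightly on your distro):

  yum install httpd tftp-server dhcp ypserv nfs-utils
  chkconfig httpd on
  chkconfig dhcpd on
  chkconfig tftp on     # tftp is an xinetd service, so xinetd must be on too
  chkconfig xinetd on
  chkconfig ypserv on
  chkconfig nfs on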
2) On the master, use rsync (usually) and install a mirror of the
repository of your choice in e.g. /var/www/fedora -- it will probably
end up looking like
/var/www/fedora/base/4/i386
/var/www/fedora/updates/4/i386
/var/www/fedora/extras/4/i386
and go ahead and create an extra (empty) repo for your own local builds:
/var/www/fedora/local/4/i386
Set up some sort of automated update system so that your repo stays
sync'd with the FC master (probably via one of the primary or secondary
mirrors). Every few days is probably acceptable, or every night if it
is allowed by your mirror host (you should probably ask if there is any
question of too much load on the mirror or use a less loaded mirror).
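The mirroring itself is just rsync plus a cron job; a sketch (the mirror
hostname and module path below are invented, substitute whatever mirror
you actually have access to):

  rsync -av --delete rsync://mirror.example.edu/fedora/core/4/i386/os/ \
        /var/www/fedora/base/4/i386/
  rsync -av --delete rsync://mirror.example.edu/fedora/core/updates/4/i386/ \
        /var/www/fedora/updates/4/i386/
  rsync -av --delete rsync://mirror.example.edu/fedora/extras/4/i386/ \
        /var/www/fedora/extras/4/i386/

Drop those into a script, call it from cron every night or two, and
you're done.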
3) Set up /tftpboot according to directions available in several
HOWTOs to support remote/diskless booting. On my install server at home
(blush still running RH9 behind a firewall blush even as it supports FC4
clients:-) the path is /tftpboot/X86PC/UNDI/linux-install/ where one
puts the kernel and initrd from e.g.
/var/www/fedora/base/4/i386/images/pxeboot
into e.g.
/tftpboot/X86PC/UNDI/linux-install/fc4
and edits
/tftpboot/X86PC/UNDI/linux-install/pxelinux.cfg/default
to add targets like:
default localboot
timeout 100
prompt 1
display msgs/boot.msg

label localboot
        localboot 0

label fc4-i386
        kernel fc4/i386/vmlinuz
        append initrd=fc4/i386/initrd.img ks=http://192.168.1.2/dulug/ks/def-fc4-i386 lang= devfs=nomount ramdisk_size=8192

label fc4-i386-manual
        kernel fc4/i386/vmlinuz
        append initrd=fc4/i386/initrd.img lang= devfs=nomount ramdisk_size=8192
Note that these lines define what happens when a PXE boot times out
(after 100 tenths of a second, i.e. 10 seconds), where to find the
message it prints to prompt a user to do something
(/tftpboot/X86PC/UNDI/linux-install/msgs/boot.msg), the names of a
couple of things one can boot, and what/how to boot the
install process. The first fc4 target starts a kickstart install,
presuming that the server's internal network address is 192.168.1.2 and
that you've put an appropriate kickstart script on the http path
indicated. This use of http is one reason you installed apache on the
server as it gives you anonymous, reasonably secure file transfer
capabilities even to barebones initrd ramdisk hosts (install image).
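To make step 3 concrete, the file shuffling amounts to something like
this (the pxelinux.0 bootloader comes with the syslinux package; exact
source paths may differ on your distro):

  mkdir -p /tftpboot/X86PC/UNDI/linux-install/fc4/i386
  cp /usr/lib/syslinux/pxelinux.0 /tftpboot/X86PC/UNDI/linux-install/
  cp /var/www/fedora/base/4/i386/images/pxeboot/vmlinuz \
     /tftpboot/X86PC/UNDI/linux-install/fc4/i386/
  cp /var/www/fedora/base/4/i386/images/pxeboot/initrd.img \
     /tftpboot/X86PC/UNDI/linux-install/fc4/i386/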
Finally, install dhcpd to hand out node IP numbers AND to tell the nodes
to look to the PXE boot server when they boot. I offer up a copy of
mine not because it is perfect (it probably isn't) but because it is
functional. Obviously change names and IP numbers to match your own
situation. The important lines from the point of view of installs are
the booting/bootp and the stuff at the top of the next section.
##############################################################################
#
# /etc/dhcpd.conf - configuration file for our DHCP/BOOTP server
#
###########################################################
# Global Parameters
###########################################################
option domain-name "rgb.private.net";
option domain-name-servers 152.3.250.1, 209.42.192.253, 209.42.192.252;
option subnet-mask 255.255.255.0;
option routers 192.168.1.1;
option broadcast-address 192.168.1.255;
ddns-update-style ad-hoc;
# option nis-domain "rgb.private.net";
# option ntp-servers 192.168.1.2;
use-host-decl-names on;
allow booting;
allow bootp;
###########################################################
# Subnets
###########################################################
shared-network RGB {
    subnet 192.168.1.0 netmask 255.255.255.0 {
        range 192.168.1.192 192.168.1.224;
        default-lease-time 43200;
        max-lease-time 86400;
        option routers 192.168.1.1;
        option domain-name "rgb.private.net";
        option domain-name-servers 209.42.192.253, 209.42.192.252, 152.3.250.1;
        option broadcast-address 192.168.1.255;
        option subnet-mask 255.255.255.0;
    }
}

group {
    next-server 192.168.1.2;                         # name/ip of TFTP/http server
    filename "X86PC/UNDI/linux-install/pxelinux.0";  # name of the bootloader program

    host node1 {
        hardware ethernet 00:07:E9:80:00:AC;
        fixed-address 192.168.1.32;
        option host-name "node1";
    }
    host node2 {
        hardware ethernet 00:12:F0:74:06:C6;
        fixed-address 192.168.1.33;
        option host-name "node2";
    }
    .... (and so on)
}
This section tells the booting nodes who has the tftp server and where
to look to get pxe boot configuration via the bootloader. Each node's
MAC address should be entered into its own little box so that the nodes
all have static IPs and maintain identity from boot to boot.
Installing a node is now incredibly simple, although you'll have to work
pretty hard the first few times to get it all to work and then to tune
it up. Build a node kickstart file (there are tools to help, howtos to
help, examples on the web to help) and put it on a path like the one
above (doesn't have to be the same, just web-retrievable from the
server by the nodes). Turn on the node. Enter its BIOS (if necessary)
and tell it to boot FIRST from its PXE-enabled NIC. Reboot. Look in
the server logs after it has initialized its network (or look at the
screen as it boots) and record its MAC address. Add it to the
dhcpd.conf on the master and restart dhcpd. Reboot the node. When the
PXE prompt appears with your message saying something like "type
fc4-i386 (kickstart) or fc4-i386-manual (for manual install) or will
boot from local disk in 10 seconds" enter "fc4-i386".
If you are lucky and have done everything right, zowee, the node goes
and just plain installs itself! If you've cleverly told it to ignore X
and video configuration and built a nifty post script (grabbed from the
master via the web in the final stages of the ks install and executed to
e.g. set up fstab, passwd, and so on), you can actually move on to the
next node while it installs and have IT installing in parallel two
minutes later (your node won't need a monitor or keyboard any more).
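To give you an idea of the scale of a node kickstart, here is a minimal
sketch; the partitioning, the (bogus) password hash, and the post-script
URL are illustrative placeholders only, and the url line assumes your
/var/www/fedora tree is actually reachable via apache at that path:

  # def-fc4-i386 -- minimal illustrative node kickstart
  install
  url --url http://192.168.1.2/fedora/base/4/i386
  lang en_US.UTF-8
  keyboard us
  network --bootproto=dhcp
  rootpw --iscrypted $1$REPLACE$WithARealHashHere
  firewall --disabled
  skipx
  text
  reboot
  bootloader --location=mbr
  clearpart --all --initlabel
  part /boot --size=100 --ondisk=hda
  part swap  --size=2048 --ondisk=hda
  part /     --size=1 --grow --ondisk=hda
  %packages
  @ base
  openssh-server
  wget
  %post
  # grab and run a site post script from the master (hypothetical path)
  wget -O /tmp/post.sh http://192.168.1.2/ks/post.sh && sh /tmp/post.sh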
Once you've done this ONE TIME for every node, you can reinstall your
entire cluster by basically just -- rebooting it, even remotely from far
away, after making a tiny change to each node's /boot/grub/grub.conf
telling it to boot into kickstart/install instead of the local
kernel image on the next boot.
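The change itself is tiny; a sketch of the idea, where the install
kernel/initrd names and the root device are assumptions (copy the same
pxeboot vmlinuz/initrd.img used above into the node's /boot under
whatever names you like):

  # /boot/grub/grub.conf fragment: default=0 reinstalls on the next boot,
  # default=1 boots the node normally
  default=1
  timeout=5
  title Reinstall via kickstart
          root (hd0,0)
          kernel /vmlinuz-install ks=http://192.168.1.2/dulug/ks/def-fc4-i386 ramdisk_size=8192
          initrd /initrd-install.img
  title Fedora Core (normal boot)
          root (hd0,0)
          kernel /vmlinuz-<installed version> ro root=/dev/hda3
          initrd /initrd-<installed version>.img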
Which is good, because now you get to login to the nodes from the master
and tinker with their kickstart files and post scripts to get this all
right, because it won't be at first. But NOW every bit of effort you
expend isn't wasted and every bit of progress you make has to be made
just once. When the kickstart image and post are "perfect", the server
is set up and happy, and you've learned all this stuff, then life is as
good as it gets as far as scalable automation is concerned.
Finish off the nodes with a yum update, ideally modifying their repo
lists so that your OWN server is their default yum repo and it falls
back on remotes only when for some reason they can't reach your master,
which should be "never".
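On the nodes that boils down to a repo file something like the sketch
below (FC4-era syntax; the baseurls again assume apache is serving your
mirror tree, the remote URL is a made-up example, and I believe yum will
simply fail over to the second baseurl if the first is unreachable):

  # /etc/yum.repos.d/local.repo on each node
  [base]
  name=Fedora Core 4 - local mirror
  baseurl=http://192.168.1.2/fedora/base/4/i386/
          http://some.remote.mirror.example/fedora/core/4/i386/os/
  gpgcheck=1
  enabled=1

  [updates]
  name=Fedora Core 4 updates - local mirror
  baseurl=http://192.168.1.2/fedora/updates/4/i386/
  gpgcheck=1
  enabled=1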
This isn't an endorsement of FCX by the way -- the same methodology will
work for LOTS of distros including probably SuSE but I just don't know.
A similar methodology also supports booting the nodes fully diskless,
but with all that node disk why bother?
Put the install on the 120 GB disks. Take the 200's out and give them
to the head node (see below) to build your terabyte RAID.
BTW, if you use ssh as a node access protocol (recommended unless you
have some serious reason not to) you'll likely want to back up the node
/etc/ssh directories after the first install and store them where they
can be put BACK on the nodes as a step in the post script. Be careful
how you do this -- the private keys are not something to leave on an
open website. I usually set up root for passwordless ssh across the
cluster to facilitate administration (if they crack root on the master
they own the cluster anyway, right?) but be careful how you distribute
/root/.ssh across the cluster as well; those private keys are REALLY
not something to be left lying around.
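A sketch of the key handling, with node names and the holding directory
as assumptions:

  # after the FIRST install of a node, squirrel away its host keys
  # somewhere private (NOT under the web tree):
  rsync -a node1:/etc/ssh/ /root/node-keys/node1/
  # ...and put them back as a step in the kickstart post (or by hand):
  rsync -a /root/node-keys/node1/ node1:/etc/ssh/

  # passwordless root ssh: make one key pair on the master...
  ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
  # ...and append the PUBLIC half to each node's authorized_keys
  cat /root/.ssh/id_rsa.pub | \
      ssh node1 "mkdir -p /root/.ssh && cat >> /root/.ssh/authorized_keys"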
> 3 Whatever software I install on the master should be available on the
> nodes too (I guess this is possible either with NIS or NFS). Here too
> some help!
NFS export user workspace from the master. In fact, consider snitching
three SATA drives from nodes (that don't really need them, probably) and
putting them onto the master, and setting up the master as an md raid
host serving 600 GB raw (400 GB usable) in RAID 5. That way you'll get
decent reliability AND a big chunk of workspace on every node. Stack up
the other node SATA disks as spares. Install on the smaller 120 GB
IDEs, if you don't really "need"
lots of space on the nodes. If you have or can buy a BIG enclosure and
another interface for the server, you could even put more drives into it
and really run a big RAID indeed. You've got a potential terabyte-sized
RAID 5 disk server there WITH enough identical spares that you'll
probably never run out during the lifetime of the cluster!
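A rough sketch of the server-side setup, with device names and the mount
point purely illustrative:

  # build the RAID 5 set from three 200 GB SATA drives (device names assumed)
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
  mkfs.ext3 /dev/md0
  mkdir -p /export/home
  mount /dev/md0 /export/home     # and add an entry to /etc/fstab

  # export it to the cluster's private network, then run "exportfs -ra";
  # the line for /etc/exports:
  #   /export/home  192.168.1.0/255.255.255.0(rw,sync)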
NIS is (as noted above) very probably a good idea. It adds a bit of
overhead to certain classes of parallel problems (it costs an NIS hit to
stat a file, so file-stat-intensive activity and certain other kinds of
activity can be relatively inefficient if you use NIS unless you tune).
For so small a cluster, if you have a small, static user base you can
also just rsync /etc/[passwd, shadow, group, hosts] to the nodes.
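For example, something as crude as this (node names assumed) is
perfectly adequate at this scale:

  for f in passwd shadow group hosts; do
      for n in node1 node2 node3; do    # ...through node10
          rsync -a /etc/$f $n:/etc/$f
      done
  done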
> 4 I should have no need to propagate my executable to all the nodes
> manually to run a parallel job. I guess it should be possible if 3 is
> possible.
Trivial and there are lots of ways to do it including plain old scripts.
Look into e.g. SGE. Also look into warewulf, which should load on top
of just about any distro you install when you figure it out and which
has lots of cluster packages prebuilt and ready to fly. Or other
cluster distributions. They'll install similarly but automate some
parts of the above for you. Even plain old FC4, though, contains PVM
and LAM MPI out of the box (and obviously supports ordinary networking).
That more than suffices for a whole lot of cluster work, especially on a
"starter" cluster. You can tune, add different MPIs, try different
schedulers and monitoring tools and so on later as required.
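With stock LAM, for instance, the whole cycle looks roughly like this
(assuming a file "lamhosts" listing your node names one per line, and
any little MPI hello-world program of your own to hand):

  lamboot -v lamhosts      # start the LAM daemons on the listed nodes
  mpicc -o hello hello.c   # compile your test program with the MPI wrapper
  mpirun -np 10 ./hello    # run ten copies across the cluster
  lamhalt                  # shut the daemons down when you're done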
> 5 All the nodes should be able to store data on the drives attached to
> them. Storage is very important.
Well, ok. Consider the RAID solution outlined above, though. If
storage is important, so is backup (usually) and you don't have any.
RAID on the server is at least one level of redundancy; you need to add
a tape to the server and you have a second level. If storage is
important, you really need to think about it BEFORE you buy your
hardware and think hard about how to set it up for reliability and
backup and speed. Node local disk is fine, though, if you do need it.
You could even make little md RAID 0 or 1 partitions out of the 120 GB
disks and 120 GB of the 200's for SOME degree of fault tolerance. I
think that two SATA's would be better than a SATA and an IDE though...
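If you do try that, it is the same mdadm recipe as sketched above for
the server, only with --level=1 and one partition from each of the
node's two disks (device names below are assumptions):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda2 /dev/sda1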
Hope this helps. There's stuff in the archives, stuff on the brahma
site:
http://www.phy.duke.edu/brahma
and stuff on my personal website:
http://www.phy.duke.edu/~rgb
These have links to a lot more places, and Google Is Your Friend.
Hope this helps...
rgb
> I haven't yet checked out the archives of the Beowulf list, but it
> would be very helpful if someone can tell me if all or some of the
> above are possible and some pointers as to where I can go next for
> some more information.
>
> Regards,
> Chaitanya.
>
> Indian Institute of Science.
> Bangalore.
> India.
>
> --
> To err is human, but to really screw up you need a computer.
>