Newbie who needs help!

Fri Jul 6 12:15:00 PDT 2001

On the subject of installation and disk cloning on cluster systems, I've
written a small article, which can be reached here:
http://planetcluster.org/sections.php?op=viewarticle&artid=3
It's basically a compilation of my own experiences installing medium-sized
clusters, the common errors I've found, and the strategies I've used.

Hope you'll find it interesting

Regards
Pedro


On Friday 06 July 2001 14:39, Robert G. Brown wrote:
> On Thu, 5 Jul 2001, Eric Linenberg wrote:
> > Ok, this is going to be kind of long, but I figured there are people
> > out there with more experience than me, and I don't have the option to
> > mess up as I have to finish this project by AUG. 7th!
> >
> > I am working as a research assistant, and my task is to build an 8
> > node Beowulf cluster to run LS-DYNA (the world's most advanced
> > general purpose nonlinear finite element program (from their page))
> > (lstc.com)  My budget is $25,000 and I just want general help with
>
> A pretty generous budget for an eight node operation based on Intel or
> AMD, depending on what kind of networking the application needs.  I
> spent only $15K on a 16 node beowulf equipped with 1.33 GHz cpus
> (including the Home Depot heavy duty shelf:-).  Duals are actually
> generally cheaper on a per-CPU basis, although if you get large memory
> systems the cost of memory goes up very quickly.
>
> > where I should begin and what should be done to maximize the
> > upgradability (would it be possible to just image the disk -- change
> > the IP settings, maybe update a boot script to add another node to the
> > cluster?) and to maximize the performance (what are the benefits of
> > dual-processor machines -- what about gigabit network cards?)
>
> Any of the 15 "slave" nodes (not the server node, which is of course
> more complicated) can be reinstalled from scratch in between five and
> six minutes flat by simply booting the node with a kickstart-default
> floppy (no keyboard, monitor, or mouse required at any time).  I've
> reinstalled nodes in just this way to demonstrate to visitors just how
> easy it is to administer and maintain a decent beowulf design -- just
> pop the floppy in, press the reset button, and by the time I'm finished
> giving them a tour of the hardware layout the node reboots itself back
> into the state it was in when I pressed reset, but with a brand new disk
> image.
>
> Then there is Scyld, which is even more transparently scalable but
> requires that you adopt the Scyld view and build a "true beowulf"
> architected cluster and think about the cluster a bit differently than
> just a "pile of PC's" with a distributed application (dedicated or not).
> Not being able to log in to nodes, NFS mount from nodes, use nodes like
> networked workstations (headless or not) works for folks used to e.g.
> SP3's but isn't as mentally comfortable to folks used to using a
> departmental LAN as a distributed computing resource.
>
> Finally, as has been discussed on the list a number of times, yes, you
> can maintain "preinstalled" disk images and write post-install scripts
> to transform a disk image into a node with a particular name and IP
> number.  Although I've written at least four generations worth of
> scripts to do just this over the last 14 years or so (some dated back to
> SunOS boxes) I have to say that I think that this is the worst possible
> solution to this particular problem.  Perhaps it is >>because<< I've
> invested so much energy in it for so long that I dislike this approach
> -- I know from personal experience that although it scales better than a
> one-at-a-time installation/maintenance approach, it sucks down immense
> amounts of personal energy to write and/or tune the scripts used and it
> is very difficult and clumsy to maintain.
>
> For example, if your cluster paradigm is a NOW/COW arrangement and the
> nodes aren't on a fully protected private network (with a true
> firewall/gateway/head node between them and the nasty old Internet with
> all of its darkness and impurity and evil:-) then you will (if you are a
> sane person, and you seem sensible enough) want to religiously and
> regularly install updates on whatever distribution you put on the nodes.
> After all, there have been wide open holes in every release of every
> networked operating system (that claimed to have a security system in
> the first place) ever made.  If you don't patch these holes as they are
> discovered in a timely way, you are inviting some pimple-faced kid in
> Arizona or some juvenile entrepreneur in Singapore to put an IRC server
> or a SPAM-forwarder onto your systems.  If you use the image-based
> approach, you will have to FIRST upgrade your image, THEN recopy it to
> all of your nodes and run your post-install script.  If any of the
> software packages in your upgrade/update that interact with your
> post-install script have changed, you'll have to hand edit and test your
> post-install script.  Even so, there is a good chance that you'll have
> to reinstall all the nodes more than once to get it all right.
>
> Sure, there are alternatives.  You can maintain a single node as a
> template (you'll have to anyway) and then write a fairly detailed
> rsync-based script to synchronize the images of your template and your
> nodes, but not >>this<< file or >>that<< file, and even so if you do a
> major distribution upgrade you'll simply have to reinstall from the bare
> image as replacing e.g. glibc on a running system is probably not a good
> idea.  No matter how you cut it, you'll end up doing a fair amount of
> work to keep your systems sync'd and current and quite a lot of work for
> a full upgrade.
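
A minimal sketch of the rsync-based synchronization described above, assuming
a template tree under a hypothetical /export/template directory and a
hypothetical list of per-node files to leave alone.  It only builds the
command line; the exclude list is illustrative and would need to be much
longer on a real system.

```python
import shlex


def build_rsync_command(template_root, node, excludes):
    """Return an rsync argv that mirrors template_root onto node's root.

    Paths listed in `excludes` are skipped, so per-node files survive.
    """
    cmd = ["rsync", "-a", "--delete"]
    for path in excludes:
        cmd.append(f"--exclude={path}")
    # Trailing slash: copy the contents of template_root, not the directory.
    cmd += [template_root.rstrip("/") + "/", f"{node}:/"]
    return cmd


# Hypothetical per-node files ("not this file or that file").
PER_NODE_FILES = ["/etc/fstab", "/etc/sysconfig/network"]

if __name__ == "__main__":
    for node in ["node01", "node02"]:  # hypothetical host names
        print(shlex.join(build_rsync_command("/export/template", node,
                                             PER_NODE_FILES)))
```

Printing the commands before running them makes it easy to review the
exclude list first, which is where most of the maintenance effort mentioned
above tends to go.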
>
> Compare with the kickstart alternative above.  The ONLY work is in
> building the kickstart file for "a node", which is mostly a matter of
> selecting packages for the install and yes, writing a post-install
> script to handle any site-specific customization.  The post-install
> script will generally NOT have to mess with specific packages, though,
> since their RPMs already contain the post-install instructions
> appropriate for seamless installation as a general rule.  At most it
> will have to install the right fstab, set up NIS or install the default
> /etc/passwd and so forth -- the things that have to be done regardless
> of the (non-Scyld) node install methodology.
>
> Regarding the scaling to more nodes -- the work is truly not significant
> and how much there is depends on how much you care that a particular
> node retains its identity.  I tend to boot each new node twice -- once
> to get its ethernet number for the dhcpd.conf table entry that assigns
> it its own particular IP number, and once to do the actual install.
> This is laziness on my part -- if I were more energetic (or had 256 nodes
> to manage and were PROPERLY lazy:-) I'd invest the energy in developing
> a schema whereby nodes were booted with an IP number from a pool during
> the install while gleaning their ethernet addresses and then e.g. run a
> secondary script on the dhcpd server to install all the gleaned
> addresses with hard IP numbers and do a massive reboot of the newly
> installed nodes.  Or something -- there are several other ways to
> proceed.  However, with only 8-16 nodes it is hardly worth it to mess
> with this as it takes only twenty seconds to do a block copy in the
> dhcpd.conf and edit the ethernet numbers to correspond to what you pull
> from the logs -- maybe five minutes total for 8 nodes, and even the
> simplest script would take a few hours to write and test.  For 128 nodes
> it is worth it, of course.
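
For the larger case, the "block copy and edit" step could be scripted roughly
as follows.  This is only a sketch: the node naming scheme, the 192.168.1.x
addresses and the starting host number are assumptions, and the ethernet
addresses would come from your own dhcpd logs.

```python
def dhcpd_host_entries(macs, subnet_prefix="192.168.1.", first_host=101,
                       name_prefix="node"):
    """Build dhcpd.conf host declarations, one fixed address per MAC."""
    entries = []
    for offset, mac in enumerate(macs):
        name = f"{name_prefix}{offset + 1:02d}"       # node01, node02, ...
        ip = f"{subnet_prefix}{first_host + offset}"  # assumed addressing plan
        entries.append(
            f"host {name} {{\n"
            f"    hardware ethernet {mac.lower()};\n"
            f"    fixed-address {ip};\n"
            f"}}"
        )
    return "\n\n".join(entries)


if __name__ == "__main__":
    # Made-up ethernet addresses, standing in for the ones pulled from logs.
    print(dhcpd_host_entries(["00:A0:C9:12:34:56", "00:A0:C9:12:34:57"]))
```

The output is meant to be pasted into (or included from) dhcpd.conf, after
which dhcpd has to be restarted to pick up the new entries.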
>
> > Another concern here is actual floor space.  We have about a 6ft x 3ft
> > area for the computers, so I think I am just going to be putting them
> > onto a Home Depot industrial shelving system or something similar, so
> > dual processor systems may be much better for me.  Cooling and
> > electricity have both already been taken care of.
>
> With only 8 nodes the space is adequate and they should fairly easily
> run on a single 20 Amp circuit.  You're right at the margin where
> cooling becomes an issue -- you'll be burning between one and two
> kilowatts, sustained, with most node designs -- depending mostly on
> whether they are single or dual processor nodes.
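
A rough check of the power figures above, assuming a 120 V circuit and the
common rule of thumb of loading a circuit to no more than 80% for sustained
use.  Both are assumptions about the site, not figures from this thread, and
the per-node wattages are simply the 1-2 kW range divided by 8 nodes.

```python
def circuit_budget_watts(volts=120.0, amps=20.0, continuous_derating=0.8):
    """Sustained power available from one circuit (assumed derating)."""
    return volts * amps * continuous_derating


def nodes_fit(n_nodes, watts_per_node, budget_watts):
    """True if n_nodes drawing watts_per_node each stay within the budget."""
    return n_nodes * watts_per_node <= budget_watts


def heat_btu_per_hour(watts):
    """Heat the cooling has to remove; 1 W is about 3.412 BTU/h."""
    return watts * 3.412


if __name__ == "__main__":
    budget = circuit_budget_watts()
    for per_node in (125, 250):  # 1 kW and 2 kW totals for 8 nodes
        total = 8 * per_node
        print(per_node, nodes_fit(8, per_node, budget),
              round(heat_btu_per_hour(total)))
```

Under these assumptions the low end of the range fits with plenty of room,
while the high end sits right at the limit, so it is worth measuring a real
node under load before committing to a single circuit.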
>
> > I appreciate any help that is provided as I know someone out there has
> > had similar experiences (possibly with this software package)
>
> I'm afraid I cannot help you with the software package, but I can still
> give you some generic advice -- in one sense you MAY be heavily
> overbudgeted for only 8 nodes.  I actually favor answering all the
> architectural questions before setting the budget and not afterwards,
> but I'm also fully aware that this isn't always how things work in the
> real world.
>
> What you need to do (possibly by checking with the authors of the
> package itself) is to figure out what is likely to bottleneck its
> operation at the scales you wish to apply it.  Is it CPU bound (a good
> thing, if so)?  Then use your budget to get as many CPU cycles as
> possible per dollar and minimize your networking and memory expenses
> (e.g. cheap switched 100BT and just get the smallest memory that will
> comfortably hold your application, or at least 256 MB, whichever is
> larger).  Is it memory I/O bound (lots of vector operations, stream-like
> performance)?  Then investing in DDR-equipped Athlons or perhaps a P4
> with a wider memory path may make sense.  Look carefully at stream or
> cpu-rate benchmarks and optimize cost benefit in the occupied memory
> size regime you expect to run the application at.  Is it a "real
> parallel application" that has moderate-to-fine granularity, may be
> synchronous (has barriers where >>all<< the nodes have to complete a
> subtask before computation proceeds on >>any<< node)?  In that case your
> budget may be about right for eight nodes as you'll need to invest in a
> high-end network like myrinet or possibly gigabit ethernet.  In either
> case you may find yourself actually spending MORE on the networking per
> node than you do on the nodes themselves.
>
> Finally, you need to think carefully about the single vs dual processor
> node alternatives.  If the package is memory I/O bound AT ALL on a single
> CPU it is a BAD IDEA to get a dual as you'll simply ensure that
> one CPU is often waiting for the other CPU to finish using memory so it
> can use memory.  You can easily end up paying for two processors in one
> node and getting only 1.3-1.4x as much work done as you would with two
> processors in two nodes.  You also have to think carefully about duals if
> you are network bound -- remember, both CPUs in a dual will be sharing a
> single bus structure and quite possibly sharing a single NIC (or bonded
> channel).  Again, if your computation involves lots of communication
> between nodes, one CPU can often be waiting for the other to finish
> using the sole IPC channel so it can proceed.  Waiting is "bad".  We
> hate waiting.  Waiting wastes money and our time.
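
The 1.3-1.4x figure above translates into a simple break-even price.  The
sketch below takes a single-CPU node as one unit of throughput; the $900
node price is borrowed from the budget later in this message purely as an
illustration, and the function names are my own.

```python
def cost_per_work_unit(node_price, relative_throughput):
    """Dollars per unit of work, with a single-CPU node defined as 1.0."""
    return node_price / relative_throughput


def dual_break_even_price(single_node_price, dual_throughput):
    """Highest dual-node price that still matches singles on cost per work."""
    return single_node_price * dual_throughput


if __name__ == "__main__":
    single_price = 900.0  # illustrative single-CPU node price
    for throughput in (1.3, 1.4, 2.0):  # memory bound ... ideal CPU bound
        limit = dual_break_even_price(single_price, throughput)
        print(f"dual at {throughput}x pays off below ${limit:.0f}")
```

In other words, a memory-bound job only justifies a dual if the dual node
costs less than about 1.3-1.4 single nodes, whereas a strictly CPU-bound job
justifies it at anything under two.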
>
> Generally, duals make good economic sense for strictly CPU bound tasks
> and "can" make decent sense for certain parallel computation models
> where the two CPUs can sanely share the communications resource or where
> one CPU manages net traffic while the other does computations.  The
> latter can often be accomplished just as well with better/higher end
> communications channels, though -- you have to look at the economics and
> scaling.
>
> Given a choice between myrinet and gigabit ethernet, my impressions from
> being on the list a long time and listening are that myrinet is pretty
> much "the best" IPC channel for parallel computations.  It is very low
> latency, very high bandwidth, and puts a minimal burden on the CPU when
> operating.  Good drivers exist for the major parallel computation
> libraries e.g. MPI.  Check to make sure your application supports its
> use if it is a real parallel app.  It may be that gigabit ethernet is
> finally coming into its own -- I personally have no direct experience
> with either one as my own tasks are generally moderately coarse grained
> to embarrassingly parallel and I don't need high speed networking.
>
> Hope some of this helps.  If you are very fortunate and your task is CPU
> bound (or only weakly memory bound) and coarse grained to EP and will
> fit comfortably in 512-768 MB of memory, you can probably skip the
> eight-node-cluster stage altogether.  If you build a "standard" beowulf
> with switched 100BT and nodes with minimal gorp (a floppy and HD,
> memory, a decent NIC, perhaps a cheap video card) you can get 512 MB
> DDR-equipped bleeding edge (1.4 GHz) Athlon nodes for perhaps $850
> apiece.  (Cheap) switched 100Base ports cost anywhere from $10 each to
> perhaps $30 each in units from 8 to 40 ports.  You can easily do
> something like:
>
>   23 $900 nodes = $20700
>    1 $2000 "head node" with lotsa disk and maybe a Gbps ethernet NIC
>    1 <$1000 24 port 100BT switch with a gigabit port/uplink for your
>      head node
>      $500 for shelving etc.
>
> you could build a 24 node 'wulf easily for your $25K budget.  Even if
> you have to get myrinet for each node (and hence spend $2000/node) you
> can probably afford 12 nodes, one equipped as a head node.
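
The arithmetic behind both configurations, using only the prices quoted
above (the switch is taken at its $1000 upper bound):

```python
BUDGET = 25_000

# 100BT configuration: 23 compute nodes plus one head node.
commodity = {
    "compute nodes (23 x $900)": 23 * 900,
    "head node": 2000,
    "24 port 100BT switch (upper bound)": 1000,
    "shelving etc.": 500,
}
commodity_total = sum(commodity.values())

# Myrinet configuration: 12 nodes at roughly $2000 each, network included.
myrinet_total = 12 * 2000

if __name__ == "__main__":
    for name, total in (("100BT", commodity_total),
                        ("myrinet", myrinet_total)):
        print(f"{name}: ${total} spent, ${BUDGET - total} left")
```

Both totals come in under the $25K budget, with a little left over for
cables and spares.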
>
> Good luck.
>
>     rgb

-- 

  __________________________________________________
 /                                                  \
 | Pedro Diaz Jimenez                               |
 |                                                  |
 | pdiaz88 at terra.es      pdiaz at acm.asoc.fi.upm.es   |
 |                                                  |
 |                                                  |
 | http://planetcluster.org                         |
 | Clustering & H.P.C. news and documentation       |
 |                                                  |
 | There are no stupid questions, but there's a lot |
 | of inquisitive idiots                            |
 |        Anonymous                                 |
 \__________________________________________________/
