[Beowulf] best architecture / tradeoffs

Robert G. Brown rgb at phy.duke.edu
Sun Aug 28 07:55:20 PDT 2005


Joe Landman writes:

>> 
>> you need to seriously rethink such jobs, since actually *using* swap is 
>> pretty much a non-fatal error condition these days.
> 
> I disagree with this classification (specifically the label you 
> applied).  Using swap means IMO that you need to buy more ram for your 
> machines.  There is no excuse to skimp on ram, as it is generally 
> inexpensive (up to a point, just try to buy reasonably priced 4GB sticks 
> of DDR single/dual ranked memory).

And there is also the "dancing bear" problem.  In some very large
problems, the amazing thing is not that it runs particularly well or
fast, but that you can run it at all.  Some jobs are parallelized IN
ORDER TO run something too big to fit into physical memory, and some
task partitionings put the job itself on a single node and use the rest
to provide some sort of extended memory to that node.  This is one of
the points (IIRC) of the trapeze project at Duke, and is one reason that
one might well consider e.g. swapping on a remote ramdisk as a
poor-man's way to access a much larger memory space than is currently
possible.  The richer-man's versions involve more efficient ways of
moving the memory back and forth over the network.

So it isn't ALWAYS a non-fatal error condition, but it should always be
done deliberately, because if an ordinary task swaps you start getting
that nasty old several-orders-of-magnitude slowdown...;-)

>> actually, swap over the network *could* make excellent sense, since,
>> for instance, gigabit transfers a page about 200x faster than a disk
>> can seek.  (I'm assuming that the "swap server" has a lot of ram ;)

Sure.  Ideally, swapping to remote ramdisk.  In fact, in some cases
configuring remote nodes so ALL they are is one big ramdisk to swap on
(or otherwise serve as an extension of memory for a single-threaded
task).
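
As a rough sanity check on the "about 200x" figure quoted above, here is
a back-of-the-envelope sketch.  The page size (4 KiB), the raw gigabit
line rate with no protocol overhead or latency, and the 6.5 ms average
seek time are all assumptions of mine, not numbers from the original
post; real results will vary with the hardware and the network stack.

```python
# Back-of-the-envelope comparison: one page over gigabit vs. one disk seek.
# All three constants are assumptions chosen for illustration.
PAGE_BYTES = 4096          # assumed page size (4 KiB)
LINK_BITS_PER_S = 1e9      # raw gigabit line rate, ignoring protocol overhead
DISK_SEEK_S = 6.5e-3       # assumed average disk seek time


def page_transfer_seconds(page_bytes=PAGE_BYTES, link_bits_per_s=LINK_BITS_PER_S):
    """Time to push one page onto the wire at the raw line rate."""
    return page_bytes * 8 / link_bits_per_s


def speedup_over_seek(seek_s=DISK_SEEK_S, **link_args):
    """How many page transfers fit into the time of a single seek."""
    return seek_s / page_transfer_seconds(**link_args)


if __name__ == "__main__":
    print(f"page transfer: {page_transfer_seconds() * 1e6:.1f} us")
    print(f"ratio vs seek: {speedup_over_seek():.0f}x")
```

With these assumed numbers the ratio comes out close to 200, so the
quoted estimate looks about right as long as the swap server really is
serving from RAM and the network latency stays small next to a seek.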

> I usually classify this in the "local disk is almost always fastest" 
> rule (which some folks disagree with, but never indicate data to the 
> contrary).
> 
> The take home messages are a) avoid swap if possible and b) if you 
> cannot, swap at the fastest possible speed (e.g. locally).

Agreed, where local may or may not be fastest, but where it will likely
be faster than swapping to remote DISK, depending on the access pattern
required, speed of the network, etc.  And to get the best possible
speed, you may not want to use the VM subsystem to extend memory in this
way -- I really don't know its relative efficiency compared to e.g.
message passing used to load memory blocks over the net on demand, but
would guess that it is slower, if only because different assumptions are
made in the design of the subsystem(s).  This is complicated still
further by the advent of RDMA NICs, which can bypass a lot of the OS/CPU
overhead and overlap the data transfer with execution on a good day.
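
To make "load memory blocks over the net on demand" a bit more
concrete, here is a minimal user-space sketch.  It is NOT how the VM
subsystem or any particular project does it; the class name
RemoteBlockCache and the fetch/store callables are hypothetical
stand-ins for whatever message passing calls one would really use, and
a plain dict plays the part of the remote node so the example runs on
one machine.  It keeps a small number of blocks locally and writes the
least recently used one back when it needs room.

```python
from collections import OrderedDict


class RemoteBlockCache:
    """Hold at most `capacity` blocks locally; fetch the rest on demand."""

    def __init__(self, fetch, store, capacity):
        self.fetch = fetch            # hypothetical: block_id -> bytes
        self.store = store            # hypothetical: (block_id, bytes) -> None
        self.capacity = capacity
        self.blocks = OrderedDict()   # block_id -> bytearray, oldest first
        self.misses = 0

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            # Evict the least recently used block and write it back.
            victim, data = self.blocks.popitem(last=False)
            self.store(victim, bytes(data))
        block = bytearray(self.fetch(block_id))
        self.blocks[block_id] = block
        return block


# Stand-in for the remote memory node: a dict instead of a network peer.
remote = {i: bytes(16) for i in range(8)}
cache = RemoteBlockCache(fetch=remote.__getitem__,
                         store=remote.__setitem__,
                         capacity=2)

cache.get(0)[0] = 42   # miss, then modify the local copy
cache.get(1)           # miss
cache.get(0)           # hit
cache.get(2)           # miss, evicts block 1 (least recently used)
cache.get(3)           # miss, evicts block 0 and writes the change back
```

The point of doing this in the application rather than in the VM is
that the program knows its own access pattern and can pick the block
size and eviction policy to suit; whether that actually beats swap over
the network is something one would have to measure.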

   rgb

>  From the commercial view of this, most end users just want a simple to 
> maintain machine (they view a cluster as a single machine for the most 
> part) that runs, with no surprises, and just works.  I am not aware of 
> any of the canned systems that do this while also meeting the criteria 
> that they require in terms of flexibility of distribution choice (some 
> people have distribution constraints based upon their purchased software 
> support requirements), breadth of hardware support, support for a wide 
> array of infrastructure elements...

I don't know if warewulf is quite there, but it is a damned good try.
You install any distro you like (if it is far from any beaten path
expect to do a bit of work).  You layer on their 3 required packages
(rebuilding from source as needed).  You customize (part of the work:-)
and run a script to build exportable vnfs chroot roots, using whatever
methodology makes sense to your KIND of (hopefully package supporting)
distro.  Or roll your own script from scratch.  The rest of the setup --
dhcp, tftp -- is managed semi-automagically for you.  That part is
actually not THAT hard to learn, but it is really useful to have
something to generate a working configuration for that first time.

Now one thing I'm still working on figuring out is just what warewulf
will do when confronted by heterogeneous node hardware/infrastructure
etc.  One doesn't really want e.g. kudzu to redetect hardware on each
reboot, for example.  Not really a warewulf issue per se, just one of
the many things that have to be resolved setting up a default node
configuration in an actual cluster with particular components.  I
suspect that at that point automagic fails and one has to start to
customize... although doubtless Tim will let us know if this is
incorrect (I'm still learning warewulf by playing with it).  I also have
yet to see if a single arch server (e.g. i386) can comfortably serve a
different arch (e.g. x86_64) since I have both in my home/test/play
cluster.  I also have some "grumbles" about its marginally inadequate
and incomplete documentation and the lack of a yum repo tree or the
placement of its core packages in an existing extras tree such as livna,
but if I ever DO figure it all out and end UP fully embracing it this
may be something I end up contributing back to the project.:-)

Anyway, that's why I >>like<< warewulf as a philosophical approach (at
least) over some of the other choices.  It divorces the support of the
minimal "cluster" core from the choice of OS, from its natural
update/upgrade process, and so on, and maximally leverages the particular
tools (e.g.  yum) that make managing/selecting packages easy.  The
clusters you end up with are close to what you'd get if you rolled your
own on top of your own distro (diskless, yet:-) but a whole lot easier
than roll-your-own-from-scratch.  Agnostic is good.  Automagic agnostic
is better (though harder -- requires a broad developer/participant
base).  Tools written/maintained by folks that eat their own dog food are
best.  warewulf looks like it is on the generally correct track.

   rgb

> 
> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
> phone: +1 734 786 8423
> fax  : +1 734 786 8452
> cell : +1 734 612 4615
> 