[Beowulf] best architecture / tradeoffs

Sat Aug 27 16:22:08 PDT 2005

Mark Hahn wrote:

> swap across the network is asking for trouble.

Yes it is.  Especially if you are swapping 4k pages :(

>  you should evaluate whether
> you actually need swap at all.  I advocate having a disk in nodes to handle
> swap, actually, even though I'd rather *boot* as if diskless.

Swap is not something you should normally touch during a run, unless 
your runs have grown larger than ram.  More in a second.

> 
> 
>>>the nfs drive or is it just back into memory? What is the best ( fastest 
>>>) way to handle swap on diskless nodes that might sometimes be 
>>>processing jobs using more than the physical RAM?
> 
> 
> you need to seriously rethink such jobs, since actually *using* swap is 
> pretty much a non-fatal error condition these days.

I disagree with this classification (specifically the label you 
applied).  Using swap means, IMO, that you need to buy more RAM for your 
machines.  There is no excuse to skimp on RAM, as it is generally 
inexpensive (up to a point: just try to buy reasonably priced 4GB sticks 
of single- or dual-ranked DDR memory).

You could argue that a memory leak is to blame, but every so often I 
have a customer call me up to tell me how slow a machine got when they 
overcommitted memory while running a huge job (10x their old jobs).  At 
100 MB/s of bandwidth for swap versus 3000 MB/s for RAM, and a latency 
that is 4 orders of magnitude higher, swap is definitely not the place 
to go if you can avoid it.  But turning it off completely could create 
some other rather exciting problems.
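
A rough sketch of the bandwidth part of that comparison, using the round 
figures quoted above.  The 2 GB working set is a made-up example value, 
and latency is ignored entirely here, so the real penalty is worse.

```python
# Back-of-the-envelope: time to stream a working set once through RAM
# versus once through swap.  Bandwidths are the round numbers from the
# text; the working-set size is a hypothetical example value.

def stream_time_s(size_mb, bandwidth_mb_s):
    """Seconds to move size_mb megabytes at a sustained bandwidth."""
    return size_mb / bandwidth_mb_s

ram_bw_mb_s = 3000.0     # RAM bandwidth quoted in the text
swap_bw_mb_s = 100.0     # swap bandwidth quoted in the text
working_set_mb = 2048.0  # assumption: a 2 GB working set

ram_t = stream_time_s(working_set_mb, ram_bw_mb_s)
swap_t = stream_time_s(working_set_mb, swap_bw_mb_s)

print(f"RAM : {ram_t:.2f} s")
print(f"swap: {swap_t:.2f} s")
print(f"slowdown from bandwidth alone: {swap_t / ram_t:.0f}x")
```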

> 
> 
>>conditions.  Networked remote disk even more so, if you manage to work
>>this out. 
> 
> 
> actually, swap over the network *could* make excellent sense, since,
> for instance, gigabit transfers a page about 200x faster than a disk
> can seek.  (I'm assuming that the "swap server" has a lot of ram ;)

The disk seek time is on the order of 8 ms, and the bandwidth is on the 
order of 60+ MB/s per disk.  Gigabit has a "seek time" that is about the 
same (whether you are swapping to a local or remote file system or disk, 
you still need to pay the seek time unless you are running in 
asynchronous mode), and its bandwidth is on the order of 20-90 MB/s. 
Plus you get to pay some additional bonus latencies.

Add to this that it is very easy to stripe local swap across 2 disks to 
get > 100 MB/s swap transfers at the same latency as a single disk.

I usually file this under the "local disk is almost always fastest" 
rule (which some folks disagree with, though they never show data to 
the contrary).
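
A crude per-page cost model for the numbers in this exchange, as a 
sketch only.  It treats one page fetch as a fixed latency plus 
size / bandwidth.  The 8 ms seek, 60 MB/s disk and 90 MB/s gigabit 
figures come from the text above; the 0.1 ms of extra network latency 
is a hypothetical placeholder, not a measurement.

```python
# Per-page swap cost: fixed latency plus transfer time.
# Figures are the rough ones from this thread, not measurements.

PAGE_4K = 4096  # bytes

def page_time_ms(page_bytes, latency_ms, bandwidth_mb_s):
    """Milliseconds to move one page: latency + size / bandwidth."""
    transfer_ms = page_bytes / (bandwidth_mb_s * 1e6) * 1e3
    return latency_ms + transfer_ms

SEEK_MS = 8.0          # disk seek, from the text
DISK_MB_S = 60.0       # single-disk bandwidth, from the text
GIGE_MB_S = 90.0       # best-case gigabit bandwidth, from the text
NET_EXTRA_MS = 0.1     # assumption: additional network latency

# Local disk: one seek plus the disk transfer.
local = page_time_ms(PAGE_4K, SEEK_MS, DISK_MB_S)

# Wire time alone (this is what the "200x faster than a seek" claim compares).
wire_only = page_time_ms(PAGE_4K, 0.0, GIGE_MB_S)

# Remote swap backed by a disk: remote seek + disk transfer + network.
remote_disk = page_time_ms(PAGE_4K, SEEK_MS + NET_EXTRA_MS, DISK_MB_S) + wire_only

print(f"local disk        : {local:.3f} ms/page")
print(f"gigabit wire only : {wire_only:.3f} ms/page")
print(f"remote disk       : {remote_disk:.3f} ms/page")
print(f"seek / wire ratio : {SEEK_MS / wire_only:.0f}x")
```

Under these assumptions the network only wins when the swap server can 
answer from RAM; once the remote side has to seek, the seek dominates 
both cases and the network hop is pure overhead.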

The take-home messages are a) avoid swap if possible, and b) if you 
cannot, swap at the fastest possible speed (e.g. locally).

Now if we could get us some nice 4MB size pages ....
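
To illustrate why larger pages would help, here is a sketch of the 
effective swap throughput when every page costs one seek.  It reuses 
the 8 ms seek and 60 MB/s figures from above and assumes, 
pessimistically, that no two page fetches share a seek.

```python
# Effective throughput when each page fetch pays one full seek.
# Seek time and bandwidth are the rough figures from the text; the
# one-seek-per-page behaviour is a simplifying assumption.

SEEK_S = 0.008          # 8 ms seek
DISK_BYTES_S = 60e6     # 60 MB/s sustained transfer

def effective_mb_s(page_bytes):
    """MB/s actually delivered when each page costs seek + transfer."""
    seconds = SEEK_S + page_bytes / DISK_BYTES_S
    return page_bytes / seconds / 1e6

small = effective_mb_s(4 * 1024)          # 4k pages
large = effective_mb_s(4 * 1024 * 1024)   # 4MB pages

print(f"4k pages : {small:.2f} MB/s")
print(f"4MB pages: {large:.2f} MB/s")
```

With 4k pages the seek swamps the transfer, so the disk delivers only a 
small fraction of its rated bandwidth; with 4MB pages the seek is 
amortized and throughput approaches the sustained rate.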

>>>Also, is it really true you need a separate copy of the root nfs drive 
>>>for every node? I don't see why this is. I have it working with just one 
> 
> 
> certainly not!  in fact, it's basically stupid to do that.  my diskless
> clusters do not have any per-node shares, though doing so would simplify
> certain things (/var mainly).

You might want to clarify this a bit, because this is an important 
point.  That is, for the N machines you install, some directories are 
going to be identical across machines with the same ABI (/bin, /sbin, 
/lib, /usr/, ...), while there may be minimal variations in others 
(/etc, /var, ...).
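
One way to picture that split, as a sketch only: a shared read-only 
tree for the identical directories and a small writable per-node tree 
for the rest.  The server name "head", the paths under /export, and 
the node name are all hypothetical, and this is not a description of 
how any particular cluster in this thread is set up.

```python
# Sketch: which directories come from a shared read-only tree and which
# from a per-node tree.  Server name and export paths are hypothetical.

SHARED = ["/bin", "/sbin", "/lib", "/usr"]   # identical across same-ABI nodes
PER_NODE = ["/etc", "/var"]                  # may vary slightly per node

def export_plan(node, server="head",
                shared_root="/export/root", node_root="/export/nodes"):
    """Return (mount point, source, mode) tuples for one node."""
    plan = []
    for d in SHARED:
        plan.append((d, f"{server}:{shared_root}{d}", "ro"))
    for d in PER_NODE:
        plan.append((d, f"{server}:{node_root}/{node}{d}", "rw"))
    return plan

for mount_point, source, mode in export_plan("n01"):
    print(f"{mount_point:6s} <- {source} ({mode})")
```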

[...]

>>system just wrote.  So rolling your own single-exported-root cluster can
>>work, or can appear to work, or can work for a while and then
>>spectacularly fail, depending on just what you run on the nodes and how
>>they are configured.
> 
> 
> sorry Robert, but this is FUD.  a cluster of diskless nodes each mounting
> a single shared root filesystem (readonly) is really quite nice, robust, etc.

http://onesis.org

>>There are, however, ways around most of the problems, and there are at
>>this point "canned" diskless cluster installs out there where you just
>>install a few packages, run some utilities, and poof it builds you a
>>chroot vnfs that exports to a cluster while managing identity issues for
> 
> 
> canned is great if it does exactly what you want and you don't care to 
> know what it's doing.  but the existence of canned systems does NOT mean
> that it's hard!

The vast majority of the canned systems adhere to particular 
philosophical tenets.  Some insist upon a RedHat-like OS, so anything 
not supported by this is simply unsupported (such as SATA, Firewire, 
XFS, .... (long list of good technology) ).  Some insist upon other 
things which range from neat ideas to highly questionable ones.  Some 
require significant kernel/glibc changes which render them slightly 
incompatible at the binary level.

From the commercial view of this, most end users just want a 
simple-to-maintain machine (they view a cluster as a single machine for 
the most part) that runs, with no surprises, and just works.  I am not 
aware of any of the canned systems that do this while also meeting the 
criteria that those users require in terms of flexibility of 
distribution choice (some people have distribution constraints based 
upon their purchased software support requirements), breadth of 
hardware support, support for a wide array of infrastructure elements...

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615