[Beowulf] best archetecture / tradeoffs

Michael Will mwill at penguincomputing.com
Tue Sep 6 09:59:07 PDT 2005

Looks like you inspired quite a technical discussion. Maybe I can answer 
of the questions:

> My requirements are easy, I think, since my program is already broken 
> up into a lot of different programs communicating via STDIN/OUT. I 
> benchmarked and found my problem is CPU intensive. The overall data 
> transfer is small, but all the different parts need to be assembled 
> before the final pass on the data. The final pass cannot be broken up, 
> but the final pass is fast. So my model is  input data -> break up 
> into N workers -> assemble results -> process -> done.

Perfect. You just have to make sure that all your N workers take about 
the same time to complete,
and that your compute nodes have about the same peformance 
characteristics because otherwise
a job will have to wait for the slowest node/subjob before assembly.

> 1) diskless vs disk
> I am thinking diskless is better. I don't worry about network traffic 
> as much as power consumption, overall node cost, reliability, and ease 
> of management. My nodes are all identical, so I figure diskless, 
> right?  Well I am having a few problems...
> I still don't know exactly about swap. One of the clusters I set up 
> was an NFS mounted root file system that did something with swap to 
> /dev/loop0, but I don't really understand that, is the swap going onto 
> the nfs drive or is it just back into memory? What is the best ( 
> fastest ) way to handle swap on diskless nodes that might sometimes be 
> processing jobs using more than the physical RAM?

To find out what it did, go to a node that has something mounted on 
/dev/loop0 and run 'mount|grep loop'
which should show you what it has mounted where via /dev/loop0.
If it was a ramdisk, that would be silly since it could have been used
to avoid swap in the first place. If it is a remote network device, that 
is already better. Maybe
a central memory server somewhere that offers swappartitions via nbd to 
write to directly rather than
going over say NFS with all its additional swap-irrelevant overhead?

Generally you want to avoid swap because it slows you down magnitutes. 
It might be more advisable to
run two subsets of smaller partial problems in sequence instead before 
assembling the results.

Some of our customers buy large enough clusters to fit their problem 
sizes into the caches so they even
avoid too many RAM accesses and get an overlinear speedup at that point.

> Also, is it really true you need a separate copy of the root nfs drive 
> for every node? I don't see why this is. I have it working with just 
> one for all nodes, but am I missing something here?

Scyld Beowulf (commercial product) does have an initial ramdisk for the 
root filesystem of a compute node
and does caching to be able to be diskless without the rootnfs 
performance penalty. It takes up very little space.
I just executed 'free' on an idle compute node that has 2G of RAM, and 
it reports that 53M are used.
At boot time a compute node uses PXE to boot a small kernel off of the 
master node and only starts
two processes: one to report back status and one to accept new jobs. It 
does not load all the workstation/server
related daemons and mechanisms that are only in the way of a compute 
nodes job.

> 2) message passing vs roll yer own
> I have played with a few different packages, and written a bunch of 
> perl networking code, and read a bunch and I am still not sure what is 
> better. Please chime in:

Generally: If you are the only developer and have tight control over 
your code then roll-your-own will
be more tailored to your problem and therefore more efficient. If you 
plan to throw other developers
at it, you will have to teach them your methodology though.

If you use a standard library like MPI or PVM (mpi has better 
performance) then you might find
developers that already know a lot about it and get up and running quicker.

>    - what is the fastest way to run perl on worker nodes. Remember I 
> don't need to do anything too fancy, just grab a bunch of workers, 
> send jobs to them, assemble the results, send the results to another 
> worker, etc. I don't need to broadcast to all nodes or anything else.

I have not looked at perl for beowulf computing, but I assume they also 
have some ::MPI implementation that you could toy with.

However easiest would be to  have one process write N data files with 
the input for the compute jobs and then
queue up N jobs that are calling perl with the script and input 
parameters that tells it where to find its input data.
Each job will transform their data file into an output data file and 
then remove (or rename) the input file to flag
that it is done.
The splitter program could then hang around checking once every 5 
seconds or so if all those input data files are
gone and then collect all the output data files and assemble them.

The load balancing only happens in terms of finding idle nodes to run 
new jobs on, and can be done by the
queueing system / scheduler. The cool thing is that your master process 
can queue up say 20 jobs, and your
7 compute nodes are being used with one (or two for SMP) job at a time 
until they are all processed, and
then your master node collects the results.

In scyld beowulf that scheduler would be BBQ or with an add-on moab, 
torque or sge, other beowulf
distributions might have torque and sge as well. BBQ would be good for 
you since it is so simple and does
not require any complicated setup for a simple task as yours.

>    - what is the easiest way to do it. I wrote the whole thing in perl 
> already, and I was not really impressed with the speed or reliability. 
> Certainly this was at least partially programmer error, but my 
> question stands, what is the easiest way to reliably control a cluster 
> of worker nodes running different perl programs, and assembling the 
> data. This includes load balancing.

In order to be fast and efficient I would recommend to at least program 
in c/c++. Perl is great for prototyping, but once you want to
go into production I would reimplement it in C.

>    - I saw some information on clusters that were linked in the kernel 
> and acted as a single machine. Is this a working reality? How does the 
> performance of such a system compare with message passing for 
> dedicated processing such as my own.

Scyld Beowulf does that to a certain extend but still relies on message 
passing. So you have a single process table
on the headnode, if you type 'ps' you see all the processes running on 
the cluster without having to log into
a compute node. However there is NO shared memory and you still have to 
use message passing because
the programmer will always know better what data needs to be 
communicated between processes than
an automated system could.

openmosix does dynamic process migration, and is very interesting 
but I would never use it for getting actual work done. It does automatic 
process migration, but needs
to communicate back through the network to where it had files open thus 
wasting bandwidth on the
network. AFAIK it does not have shared memory across nodes either, which 
you mostly need a fast
low latency interconnect for like quadrics.

>    - I was playing with MPICH-2, is this better than LAM? What about 
> other message passing libraries what is the best one?  any with direct 
> hooks into perl?

With the simple things you want to do even mpich-1 can do the job.

>    - how fast is NFS and RSH. If I were to change the code so it works 
> with a NFS mounted file instead of STDIN/OUT and I use RSH to 
> communicate how would the speed compare with message passing? with 
> direct perl networking?

You should try it out. Scyld beowulf uses bproc instead of RSH, so it's 
a system-call away instead of invoking an rsh login procedure. That makes
it more efficient.

> 3) Distribution and kernel
> I create my NFS system by copying directories off my RH9 distribution. 
> I had lots of problems and could never get everything working. I think 
> it would be loads easier if I could find a standard distribution image 
> already constructed somewhere out there... I don't really care what 
> distribution as long as I can run perl.

If you want something that is free and unsupported, go with Rocks. It 
does not have diskless booting,
bproc or a single process table as Scyld would have, but it does come 
with schedulers and perl and
is a complete cluster distribution.

If you want to try out scyld beowulf and experiment for a week, I can 
give you an account to
one of our 4-compute node demo clusters (dual opteron) so that is 8 cpus 
for your partial jobs.

> I keep seeing people advising against the NFS root option and 
> advocating ram disk images. Opinions here? Where can I get ram disk 
> images? I would be nice to download a basic complete ram disk image, 
> that boots with root rsh working already.

Use a distribution that has gotten that all right already. Rocks does 
not do the diskless stuff but still
uses NFS for your home directories.

> Well I guess that is enough for one day. Thank you for taking the time 
> to read this email. If you have the time please send me your opinions 
> on this stuff.

I hope this helped some.

Note I work for Penguin Computing with the subsidiary Scyld and so even 
through I try to be
objective my experience comes from that angle.


Michael Will
Penguin Computing Corp.
Sales Engineer
415-954-2899 fx
mwill at penguincomputing.com 

More information about the Beowulf mailing list