[Beowulf] best architecture / tradeoffs

Sun Aug 28 06:49:44 PDT 2005

Mark Hahn writes:

(That my answer was off the Mark -- and sure, he's probably right...;-)

>> There are, however, ways around most of the problems, and there are at
>> this point "canned" diskless cluster installs out there where you just
>> install a few packages, run some utilities, and poof it builds you a
>> chroot vnfs that exports to a cluster while managing identity issues for
> 
> canned is great if it does exactly what you want and you don't care to 
> know what it's doing.  but the existence of canned systems does NOT mean
> that it's hard!
...
>> Seriously, if you want these features (especially in perl) you have to
>> program them all in, and none of them are terribly easy to code.  The
> 
> I would have to say that the difficulty of this stuff is overblown.
> getting a node to net-boot onto a RO nfs root is pretty trivial if you
> know what you're doing, *or* if you look at a working example (LTSP, etc).
> writing a daemon to field job requests is basically cut-and-paste.
> combining that with a bit of SQL code is surprisingly simple.
> every design decision has its drawbacks (including using SQL), but you just
> have to try it to know whether it is worth it in the end.  for instance,
> I chose SQL in part because it let me ignore challenges of a reliable 
> data store, let me scale extremely well, and ultimately, all the job info
> had to be in a DB eventually any way...

Hmmm.  I would personally invert this.  Canned is great if you have no
idea what you are doing and would like to learn without infinite pain
(as are working examples:-).  Doing it by reading man pages, HOWTOs, and
guessing is (to mix up a tasty metaphor) sort of like learning to remove
your own appendix using a steak knife and a mirror from Amazon prequels
of medical textbooks..

I only say this because I've been running diskless or partly diskless
systems off and on for oh, going on 20 years now and am horribly scarred
by some of the experiences had along the way:-) What I think IS a safe
statement is that it has never been EASIER to run linux diskless than it
is right now, particularly with one of the (e.g. warewulf) kits.  As in
with yum and a bit of chroot action, one can now build an exportable
root in a minute or two from a script, and I can't BEGIN to tell you how
painful this used to be back in pre-yum days.  I did it the hard way as
recently as maybe six years ago, and it was still a bloody PITA.  I think
it is a lot easier now and certainly is much better documented, but it
is also certainly not trivial -- unless you use a kit/package, in which
case you can skip a whole lot of the learning curve in the SHORT run
because it is all encapsulated so (if nothing else) you can learn it
quickly, efficiently, and in a functional context.

Remember, Mark, you're a bit of an uberprogrammer.  Not everybody would
find mixing a task-forking perl daemon with a "bit of SQL code"
surprisingly simple...the kind of project one could knock off in a
weekend;-) Not everybody would just say "screw all the nasty old
crack-ridden schedulers out there, gonna write my own." and actually do
it.  I started to do it myself once upon a time, and simply didn't have
time to finish it even without SQL -- and transitioned to C part way
through because the perl started to get a bit "messy".

In fact, a lot of people who administer very successful clusters and
even code a lot don't even KNOW SQL, and learning SQL and a particular
interface (e.g. mysql) well enough to usefully code it into even a
simple application is a task that (I recall) optimistically takes a week
or two right there -- if you are an uberprogrammer, of course.  Normal
humans sometimes take semester-long courses in it to get to where they
can use it with any sort of proficiency.  Or even two or three semester
long courses.

So I think your "simple" really translates to "in a few months of
learning and steady work" for most normal admin/programmers... unless
they use one of the cluster packages to short circuit the process and
leverage their own work and contributions for the greater good, in which
case they could likely be functional in a week and pretty well tuned in
two, focussing on the application instead of the cluster.

>> Reliability of ANY sort of networking task (run through perl or not) --
>> well, there are whole books devoted to this sort of thing and it is Not
>> Easy.  Not easy to tell if a node crashes, not easy to tell if a network
>> connection is down or just slow, not easy to restart if it IS down,
>> especially if the node drops the connection because of some quirk in its
>> internal state that -- being remote -- you can't see.  Not Easy.
> 
> depends.  I typically make sure my clusters have pretty reliable nodes 
> and networking - it just doesn't seem to make sense to patch around 
> unreliability.  given that, I don't think this is hard at all.
> 
>> OTOH, if you're using perl as a job distribution mechanism, you have
>> lots of room to improve without resorting to scyld, and you can always
>> try bproc (which is a GPL thing that is at the heart of scyld, or used
> 
> actually, a stripped down rsh seems to be about the same speed as bproc.

OK, didn't know that (I've only benchmarked the out-of-the-box rsh,
which is fast but not THAT fast), but it makes reasonable sense.
However, I was referring him to bproc, scyld, warewulf, etc thinking of
the higher level job control and cluster information that the "canned"
solutions provide, not just the distribution interface.  Really, as I
pointed out, for most sensible task partitionings with a good compute to
startup/communications ratio, choice of remote shell is irrelevant
anyway -- out at the <1% level in terms of its impact on efficiency.
Only when task startup takes as long as the task are you in trouble, and
then the problem is obvious and the solution usually isn't to do more
efficient startup, it is to reorganize the task to do more work PER node
startup.  A "worker daemon" approach can pay node startup costs a single
time per GLOBAL task startup and thereafter just manage the IPCs
required to initiate the next work unit and return the results -- no new
shells, forks, stats, everything already resident in memory and just
awaiting the next unit.
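
To put rough numbers on that ratio, here is a minimal sketch of the
cost model.  The function name and the timings (0.5 s to start a
remote shell, 60 s or 0.5 s of work per unit) are made up for
illustration, not measurements of rsh or bproc.

```python
def efficiency(compute_s, startup_s, units, startup_per_unit=True):
    """Fraction of one node's wall-clock time spent on useful work.

    compute_s        -- seconds of real computation per work unit
    startup_s        -- seconds to start a task (remote shell, fork, ...)
    startup_per_unit -- True: pay startup for every unit (one shell per
                        unit); False: pay it once (resident worker daemon)
    """
    startups = units if startup_per_unit else 1
    useful = compute_s * units
    return useful / (useful + startup_s * startups)


if __name__ == "__main__":
    # Hypothetical timings, chosen only to show the two regimes.
    for compute_s in (60.0, 0.5):
        shell = efficiency(compute_s, 0.5, 100, startup_per_unit=True)
        daemon = efficiency(compute_s, 0.5, 100, startup_per_unit=False)
        print(f"{compute_s:5.1f} s/unit: per-unit shell {shell:.3f}, "
              f"worker daemon {daemon:.3f}")
```

With 60 s units the per-unit startup costs under 1% either way; with
0.5 s units the shell-per-unit approach wastes half the time, and the
resident worker gets nearly all of it back.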

As far as load balancing and reliability go -- remember that there do
exist parallelized tasks that aren't fault tolerant so that the failure
of any node causes global failure of the job.  There also exist task
distribution schemes that (especially on heterogeneous clusters) will
leave nodes idle.  Avoiding either or both takes an application written
carefully to prevent them, if it CAN be so written.  In some of these
cases fixing everything up edges off into computer science and a one-off
solution based on the particular task at hand, which is all I was saying.
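
A toy illustration of the idle-node problem, as a sketch only: two
hypothetical schedulers on a simulated heterogeneous cluster, one
dealing work units out up front and one letting each node pull the
next unit when it goes idle.  The node speeds and unit costs are
invented, and nothing here talks to a real network.

```python
import heapq


def static_makespan(unit_costs, speeds):
    """Deal units round-robin to nodes up front.

    Returns the time at which the last node finishes its share."""
    finish = [0.0] * len(speeds)
    for i, cost in enumerate(unit_costs):
        node = i % len(speeds)
        finish[node] += cost / speeds[node]
    return max(finish)


def dynamic_makespan(unit_costs, speeds):
    """Give each unit to whichever node becomes free first.

    Returns the time at which the last unit completes."""
    free_at = [(0.0, node) for node in range(len(speeds))]
    heapq.heapify(free_at)
    end = 0.0
    for cost in unit_costs:
        t, node = heapq.heappop(free_at)   # earliest idle node
        t += cost / speeds[node]
        end = max(end, t)
        heapq.heappush(free_at, (t, node))
    return end


if __name__ == "__main__":
    speeds = [1.0, 4.0]      # hypothetical: one slow node, one fast node
    units = [1.0] * 10       # ten equal work units
    print("static :", static_makespan(units, speeds))
    print("dynamic:", dynamic_makespan(units, speeds))
```

In the static case the fast node finishes early and then sits idle
waiting for the slow one; in the pull-based case both stay busy until
the work runs out.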

Still, I think you're right, I was a bit too casual with a lot of my
remarks in my reply.  I really wasn't trying to dis diskless (I actually
like it too and have used diskless systems for a long time now when
appropriate or necessary).  I was just trying to inject a note of
caution that "doing it yourself" is almost certainly a bit MORE
difficult than just doing a kickstart install to a diskful system
(which actually does so WITH diskless configuration, but using canned
tools), and that there were certain DIY issues with RO vs RW -- /var,
/var/tmp, /tmp if shared RO root and things you couldn't comfortably do
on the diskless systems that would bite you if you try shared RW root.
These things have been there a long time -- back with Sun SLC's and
ELC's, back with lots of early ways to boot up linux diskless -- and YES
there are solutions, but one should try hard not to reinvent these
wheels because the process can be somewhat painful.  There are now ways
around most/all of the problems, and as you correctly note if he looks
around at some examples he can make it work WITHOUT having to figure it
out "the hard way".

Finally, what I should have said in my discussion of reliability was
much more what you focussed on.  If his task is really CPU intensive
(bound) then there "should" exist master/slave task organizations that
are more or less automagically load balancing (by simply keeping each
node working at 100% all of the time) and "reliable" (where one has to
do a bit of work here, but one can e.g. preserve job data sent to any
node until it returns successfully and if the node goes down, redirect
it to the next idle node and so on).  This is still as task organization
permits -- a BIT tricky if the work units all have to be done in some
particular order or form some sort of transformation chain, but probably
doable, using any of the example tasks that parallelize Mandelbrot set
generation (e.g.) as templates.
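
The "hold the job data until it comes back" idea can be sketched as
follows.  This is a single-process simulation under stated
assumptions, not a real master/slave implementation: run_master, the
node names, and the fails_on set that fakes a node dying mid-unit are
all hypothetical, and the work units are assumed independent.

```python
from collections import deque


def run_master(units, nodes, work, fails_on=frozenset()):
    """Hand units to nodes in turn, requeueing any unit whose node dies.

    units    -- iterable of independent work units
    nodes    -- list of node names (simulated, no networking)
    work     -- function computing the result for one unit
    fails_on -- set of (node, unit) pairs simulating a crash mid-unit
    Returns (results dict, list of nodes still alive)."""
    pending = deque(units)      # the master keeps every unit until done
    alive = list(nodes)
    results = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("all nodes are down")
        unit = pending.popleft()
        node = alive[turn % len(alive)]
        turn += 1
        if (node, unit) in fails_on:
            alive.remove(node)      # stop sending work to the dead node
            pending.append(unit)    # unit was never lost: redirect it
            continue
        results[unit] = work(unit)  # result returned successfully
    return results, alive


if __name__ == "__main__":
    results, alive = run_master(
        range(6), ["n0", "n1", "n2"], lambda u: u * u,
        fails_on={("n1", 1)},
    )
    print(results, alive)
```

If the units had to be done in a particular order, the requeue step
is where it would get tricky; this sketch simply appends the failed
unit to the end of the queue.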

So I apologize.  Not a terribly good answer last time.

   rgb

(P.S. -- my replies will probably only get wonkier with time over the
next week.  I start teaching again on Monday...;-)
