[Beowulf] Newbie

Fri Jan 6 08:56:51 PST 2006

On Thu, 5 Jan 2006, Dan Stromberg wrote:

> On Thu, 2006-01-05 at 09:30 -0500, Robert G. Brown wrote:
>
>>
>> SSH per se greatly increases security and (IMHO) should be used in all
>> cases where an analysis of its expected overhead shows that it is in the
>> irrelevant (<1%) range, which is in nearly all cases -- a fraction of a
>> second per transaction (for just one or two transactions) to start up a
>> job against thousands to millions of seconds of runtime, per node, for
>> example.
>
> Actually, on gigabit networks (and I assume on 10 gigabit nets too), ssh
> overhead is often significant.

One would usually assume that the overhead would DECREASE (or at worst
remain level) as the network gets faster.

This discussion has occurred many times on the list.  To recap it, if
your cluster is single-headed and behind a firewall, use ssh TO the head
node and whatever you like inside -- if your head node is compromised you're
screwed anyway on the nodes, and since nodes are typically boilerplate
and either diskless or e.g. kickstart reinstallable (and besides are
generally isolated from the WAN and hence "uninteresting" to most
crackers) nobody cares.

If your cluster is more of a NOW architecture, with either WAN access to
the nodes or WAN access to LAN workstations that are flat to the nodes,
use ssh exclusively unless you like pain (as the nodes CAN then turn
into viral spam engines and ARE of interest to crackers).  If your user
base is not local to your administrative domain, or has a good chance
of including would-be-crackers all by itself (e.g. -- a cluster open to
all University undergrads) use ssh exclusively.

In these latter cases, if ssh won't do in terms of efficiency, build a
firewall and isolate the nodes from the LAN.  Purchase a genuine wooden
Louisville Slugger and use a woodburner to etch "For Cracking Heads of
Students Who Crack" onto it.  Display it prominently in your office.  Be
observed taking a few practice swings with it by enough of your user
base that word gets around.  Then grit your teeth and install rsh on the
nodes.
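The rule of thumb in the three paragraphs above amounts to a small decision procedure. Here it is as a hedged Python sketch, not anyone's official policy: the `ClusterProfile` fields and the returned strings are hypothetical labels, and ssh TO the head node is assumed in every case.

```python
from dataclasses import dataclass


@dataclass
class ClusterProfile:
    """Hypothetical summary of the facts that drive the ssh/rsh choice inside a cluster."""
    single_headed_behind_firewall: bool      # only the head node is reachable from outside
    nodes_reachable_from_wan: bool           # NOW-style: nodes (or LAN hosts flat to them) see the WAN
    untrusted_user_base: bool                # users outside your admin domain, open undergrad access, ...
    ssh_overhead_unacceptable: bool = False  # measured ssh startup cost really matters for the workload


def intra_cluster_shell(p: ClusterProfile) -> str:
    """Suggest a remote shell for use inside the cluster; ssh TO the head node is assumed regardless."""
    exposed = p.nodes_reachable_from_wan or p.untrusted_user_base
    if exposed:
        # Nodes are of interest to crackers (or users may be crackers): ssh everywhere,
        # unless the overhead is truly unaffordable, in which case isolate the nodes first.
        if p.ssh_overhead_unacceptable:
            return "isolate nodes behind a firewall, then rsh"
        return "ssh"
    if p.single_headed_behind_firewall:
        # A compromised head node already owns the nodes, so rsh inside adds little risk.
        return "rsh acceptable"
    # Anything not covered above: fall back to ssh as the conservative default.
    return "ssh"


print(intra_cluster_shell(ClusterProfile(True, False, False)))  # rsh acceptable
```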

The basic efficiency issue itself is pretty straightforward though.
From my own measurements, the marginal overhead of starting up a job on
a remote node via ssh vs rsh is on the order of 0.25 seconds (depending
mostly on CPU speed, but maybe a bit on the network as well, where
faster in either one would be LESS).  So it might take as long as four
or five minutes longer to start up a task running across 1000 nodes.
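The four-or-five-minute figure is just per-node overhead times node count. A minimal sketch, assuming the launcher contacts the nodes one at a time (no parallel fan-out) and using the 0.25 s measurement quoted above; the helper name is made up for illustration:

```python
def serial_startup_seconds(nodes: int, per_node_overhead_s: float) -> float:
    """Extra wall-clock time if the launcher contacts each node in turn."""
    return nodes * per_node_overhead_s


extra_s = serial_startup_seconds(1000, 0.25)  # 0.25 s: measured ssh-vs-rsh difference per node
print(f"{extra_s:.0f} s = {extra_s / 60:.1f} min")  # 250 s = 4.2 min
```

A launcher that starts nodes in parallel would cut this sharply, so treat the serial figure as an upper bound.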

If that job is going to run in four or five minutes (and then a new one
will be started) then sure, you're in trouble.  With rsh it might take a
minute to start per five minutes of run, vs. five minutes to start and
five to run with ssh.
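The five-minute case is easiest to see as a duty cycle (useful run time over total wall-clock time). This sketch just plugs in the rounded figures from the paragraph above, about one minute of startup with rsh and five with ssh; `duty_cycle` is a hypothetical helper:

```python
def duty_cycle(run_s: float, startup_s: float) -> float:
    """Fraction of wall-clock time spent doing useful work."""
    return run_s / (run_s + startup_s)


FIVE_MIN_S = 5 * 60
rsh = duty_cycle(FIVE_MIN_S, startup_s=60)          # ~1 minute to start with rsh
ssh = duty_cycle(FIVE_MIN_S, startup_s=FIVE_MIN_S)  # ~5 minutes to start with ssh
print(f"rsh: {rsh:.0%}  ssh: {ssh:.0%}")  # rsh: 83%  ssh: 50%
```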

If that job is going to run for a day or more, well, the overhead of
startup costs you <1% of your potential efficiency.  This is on the
order of other sources of inefficiency you're almost certainly ignoring,
and you'd be better off worrying about your communication algorithms or
installing ATLAS on the nodes.  Then you really have to do the
risk-benefit analysis -- is it better to pay the (small) "tax" in
potential duty cycle and be relatively secure, or pay the cost of
rearranging your network and the POTENTIALLY very GREAT cost of dealing
with a cracking incident enabled by the use of rsh instead of ssh.

Note well that for smaller clusters, the equation shifts rapidly
towards the use of ssh as a knee-jerk standard of practice.  For a 100-node
cluster you're talking about startup times of <1 minute regardless,
less than 0.1% overhead on jobs that run a day and pretty ignorable even
on jobs that run in two hours.  Note ALSO that workloads that DO take
only five minutes to run can almost certainly be reorganized so that
they don't require a new shell per subtask startup...
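The overhead claims in the last two paragraphs are easy to check. A minimal sketch, assuming the 0.25 s per-node figure and a launcher that starts nodes one after another; "overhead" here means startup time as a share of total wall-clock time, and `startup_fraction` is a hypothetical helper:

```python
DAY_S = 24 * 3600
TWO_HOURS_S = 2 * 3600
PER_NODE_S = 0.25  # measured ssh-vs-rsh startup difference quoted above


def startup_fraction(nodes: int, run_s: float, per_node_s: float = PER_NODE_S) -> float:
    """Share of total wall-clock time spent on serial remote-shell startup."""
    startup_s = nodes * per_node_s
    return startup_s / (startup_s + run_s)


cases = [(1000, "one day", DAY_S), (100, "one day", DAY_S), (100, "two hours", TWO_HOURS_S)]
for nodes, label, run_s in cases:
    print(f"{nodes:>4} nodes, {label}: {startup_fraction(nodes, run_s):.3%}")
# 1000 nodes, one day: 0.289%
#  100 nodes, one day: 0.029%
#  100 nodes, two hours: 0.346%
```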

The final note is that this presumes that people aren't writing parallel
tasks that use ssh (or rsh, really) as the communications channel.
If they were, encryption certainly WOULD significantly increase overhead.
PVM, MPI and socket-based IPC channels, OTOH, are standard of practice
and totally unaffected by the choice of ssh vs rsh.

>> However, if any account is compromised by any means whatsoever, you're
>> equally screwed regardless of how you authenticate at the shell level.
>> I personally don't use ssh passwords EXCEPT for root accounts and on
>> servers and on relatively untrusted hosts, and in the latter case it is
>> more to give me a small chance of detecting an intrusion before it
>> spreads between networks.
>>
>> It is an exercise for the studious reader to contemplate methodologies for
>> getting passwords, ssh keys, and pretty much anything else you want from
>> most users' accounts once you have access to them without their
>> knowledge.
>
> Yes, but at least it's an extra step, particularly if there's some
> decent cryptography going on in the filesystem.  Yes, once you have
> root, all bets are off to an extent, but few users have the
> sophistication to grab a private key out of core until someone writes a
> program to do it for them.

What is complex about reading ~/.ssh/* (as the user)?  Or installing a
~/.../ssh and modifying your default path or installing an alias so that
every time you use ssh your password for the target system is
compromised?  The user's ssh keys aren't in "core"; they sit in a
flat file in plain ASCII, and their gpg keys are a command away.

The argument that it is an extra step isn't invalid, of course.  It does
make it a (tiny) bit harder and slows the cracker by minutes to hours,
slightly increasing the chance of detecting them before they
metastasize.  However, you have to weigh this benefit against the
costs.  Requiring a password to be typed for every LAN/cluster internal
transaction (or for any internal transactions) adds seconds of human
overhead per task, every day.  If you have hundreds of users executing
thousands of password-authenticated tasks per day, you start wasting
a whole person-day or more of potential productivity EVERY DAY.  This
loss of productivity has to be balanced against the 1-4 days that it
might take to put the MARGINAL cracking incident right:  the one
that happened with passwordless ssh logins that WOULDN'T have happened
with a password-based login.
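To make the productivity side concrete, a rough sketch. The inputs (3000 password-authenticated tasks per day across all users, ~10 s per password entry, an 8-hour workday, 250 workdays per year) are assumptions chosen for illustration, not measurements:

```python
WORKDAY_S = 8 * 3600     # assumption: an 8-hour working day
WORKDAYS_PER_YEAR = 250  # assumption


def person_days_lost_per_day(tasks_per_day: int, seconds_per_password: float) -> float:
    """Person-days spent typing passwords per workday, summed over all users."""
    return tasks_per_day * seconds_per_password / WORKDAY_S


# Hypothetical site: 3000 password-authenticated tasks per day in total, ~10 s each.
lost_per_day = person_days_lost_per_day(3000, 10)
lost_per_year = lost_per_day * WORKDAYS_PER_YEAR

# Marginal cracks per year needed before passwords pay for themselves,
# using the pessimistic end (4 days) of the 1-4 day cleanup estimate.
breakeven_cracks = lost_per_year / 4

print(f"{lost_per_day:.2f} person-days/day, {lost_per_year:.0f}/year, "
      f"break-even at ~{breakeven_cracks:.0f} marginal cracks/year")
# 1.04 person-days/day, 260/year, break-even at ~65 marginal cracks/year
```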

For most people in most cluster/LAN environments, a successful crack of a
user account is a rare event, and the marginal risk of doing without
passwords is small.  In >>my<< opinion, unless there is something
particularly (nonlinearly) "valuable" about the data and compute
resources you are protecting, fascist-level management decisions are
very likely to be penny wise, pound foolish in terms of pure
cost-benefit analysis.  They cost more "work" every day than they save
in security costs.  However, this is very much a local decision -- if
the application has e.g. HIPAA issues or is a banking application and
the costs of a cracking incident are very high, it may be worth being
fascist.

Similar analysis should really be done for ssh vs rsh (with or without
passwords).  By my estimate, it takes on the order of one marginal
(rsh-enabled) crack per LAN per year for the cost of managing the crack
to balance the ssh "cost" (order of a couple of days of potential
runtime over a year), before throwing in a bit for the (many) other
benefits of ssh: port forwarding, X forwarding, environment passing.
I admit that I DO wish that OpenSSH still permitted the no-encryption
option -- but not enough to either hack it back in or fight with the
ravingly paranoid developers...;-)
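The ssh-vs-rsh balance can be sketched as an expected-cost comparison in days per year. The 0.5% overhead and the 2-day cleanup cost are hypothetical inputs; the one-crack-per-year rate is the order-of-magnitude guess above. The sketch also glosses over the fact that runtime-days and admin-days aren't quite the same currency:

```python
def ssh_tax_days_per_year(overhead_fraction: float, days_in_service: float = 365.0) -> float:
    """Potential runtime (in days) given up each year to ssh startup overhead."""
    return overhead_fraction * days_in_service


def expected_crack_days_per_year(cracks_per_year: float, cleanup_days: float) -> float:
    """Expected days per year spent cleaning up marginal, rsh-enabled cracks."""
    return cracks_per_year * cleanup_days


tax = ssh_tax_days_per_year(0.005)             # hypothetical 0.5% startup overhead
risk = expected_crack_days_per_year(1.0, 2.0)  # ~1 marginal crack/year, ~2 days to clean up
print(f"ssh tax {tax:.1f} d/yr vs expected rsh crack cost {risk:.1f} d/yr")
# ssh tax 1.8 d/yr vs expected rsh crack cost 2.0 d/yr
```

With these inputs the two sides are roughly even, which is why the other benefits of ssh tip the balance.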

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu