[Beowulf] Why I want a microsoft cluster...

Robert G. Brown rgb at phy.duke.edu
Mon Nov 28 08:11:53 PST 2005


On Wed, 23 Nov 2005, Jim Lux wrote:

> At 05:15 PM 11/23/2005, Joe Landman wrote:
>
>
>> Jim Lux wrote:
>>> At 01:30 PM 11/23/2005, Joe Landman wrote:
>> 
>> [...]
>> 
>>> This is particularly pernicious for documents that get viewed with a 
>>> projector, and then get zoomed to look at the details.
>> 
>> Agreed.  This is a good reason to use vector formats in general whenever 
>> possible.
> <gigantic snip>
>
> Gosh.. the devil's gonna die.. I did my best to advocate.
>
> Ultimately though, to update the aphorism: "nobody ever got fired for 
> recommending MS"
>
> Many similarities between "big blue" a few decades ago and big whatever color 
> they are (well, it *was* a sort of forest green back in the 80s) these days, 
> not the least of which is marketing strategies.
>
> Enjoy your holidays everyone.. (those of you in the U.S., anyway.. the rest 
> of you will have to toil while we engage in the national festival of 
> overeating)

Uhhhhh, can't believe I ate all that...

OK, I'm back now, and it looks like Joe did a pretty good job of saying
everything I might have said and then some.  A very few late addenda to
further beat this dead mule:

   a) As Joe pointed out, there is no substitute for competence.  A
number of issues raised in support of Windows clusters were (in essence)
"Windows systems managers are incompetent klutzes who would do something
silly like run virus checkers, per node, on an NFS mount" or "Windows
systems managers would blindly implement Windows-centric security
policies on the linux system".  While this is possibly true, it isn't
really relevant except from the MS marketing point of view, and, as Joe
pointed out, that sort of incompetence would screw up a WinXX cluster
as easily as anything else.

I think a safer assumption to make is one of presumed competence in
whatever shop is considering winsux vs linux clusters.  In which case
they'd presumably know (or be smart enough to learn or be deep-pocketed
enough to buy the knowledge from folks like Joe) enough to configure a
"sensible" cluster as opposed to a silly cluster -- one with the cluster
nodes e.g. behind a firewall, on a 192.168 private internal network but
otherwise flat within the organization, etc.  They would also know or
would learn quickly when considering the issue that many of the linux
cluster configurations they might consider basically boot a single,
stripped cluster image giving them a SINGLE SYSTEM to secure.  The nodes
aren't really individual security risks unless you're running a NOW
type cluster (a possibility that this list tends to be a little bit
blind to).

If one DOES consider a NOW-type cluster then a whole RAFT of security
issues exist for WinXX, but they are ones you have to handle anyway.  There
are fewer issues for linux -- see below -- but...

   b) a WinXX NOW cluster is a possibility that VERY DEFINITELY exists
and is potentially profitable to a WinXX shop, BTW.  To help out your
diabolical advocacy, consider the following.

A mythical organization has 1000 mythical WinXX desktops running email
clients, screen savers, Office tools, and a browser.  These systems are
already installed, managed (well or badly), secured (well or badly),
patched (ditto), and are effectively idle nearly all of the time even
when somebody is sitting at their console and typing furiously.  For
most users, each successive boost in CPU speed just increases the
already astronomical number of NoP cycles the system spends per cycle of
actual work done processing a keystroke or mouse click.

This organization therefore very likely has 0.95 x 1000 machines' worth
of free cycles already available, or doing thumb-twiddling crap like
making WinXX logos
fly around in 3d.  These cycles "could" be doing useful work, but WinXX
is if anything anti-engineered for this sort of process -- it is weak
on backgrounded tasks in general, scheduling, VM (especially VM that
doesn't leak), and network-driven task execution.  However, it may be
>>good enough<< at multitasking, and network-driven task execution is
fundamentally a pretty straightforward problem to solve, especially if
you set your sights on low-hanging fruit.

That is, MS "could" sell a "cluster tool" that is basically nothing but
an integrated, policy-driven job distribution tool so that a user on any
(authenticated, permitted) one of these 1000 systems on a standard LAN
can submit a job stream and have it farmed out to the "free" cluster of
idle desktops according to institutional policy.  A nice little cluster
management tool would let top level managers set that policy and give
them that warm fuzzy feeling of control.
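
Just to underline how straightforward the job-distribution half of this
actually is, here is a minimal sketch -- in Python, and purely
hypothetical; nothing like this ships from MS or anyone else in exactly
this form -- of the core farm-out loop such a tool would need.  The
hostnames, the policy rule, and remote_run() are invented placeholders;
a real product would hang authentication, idle detection, scheduling,
and a transport off each of them.

#!/usr/bin/env python
# Hypothetical sketch of a policy-driven "farm EP work out to idle
# desktops" dispatcher: enumerate candidate hosts, filter them through
# site policy, and hand each survivor the next chunk of the job stream.
# Hostnames, the policy rule, and remote_run() are all made up here.

import time

DESKTOPS = ["desk%03d" % i for i in range(1, 11)]   # stand-in for 1000 hosts

def allowed_by_policy(host, hour):
    """Institutional policy, e.g. only borrow desktops outside 8am-6pm."""
    return hour < 8 or hour >= 18

def remote_run(host, task):
    """Placeholder for whatever remote-execution transport the tool ships."""
    print("would run %r on %s" % (task, host))

def farm_out(tasks, hour=None):
    if hour is None:
        hour = time.localtime().tm_hour
    workers = [h for h in DESKTOPS if allowed_by_policy(h, hour)]
    if not workers:
        raise RuntimeError("policy leaves no desktops available right now")
    for i, task in enumerate(tasks):
        remote_run(workers[i % len(workers)], task)  # round-robin the stream

if __name__ == "__main__":
    # pretend it is 10 pm so the policy filter lets the run through
    farm_out(["process chunk %d" % n for n in range(25)], hour=22)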

Given Windows' security track record, of course, I rather expect that
most systems managers would be a pretty tough sell on this, at least
right at the moment.  It's one thing for a single corporate system to
get a virus.  It's another for the entire corporate LAN to get a virus
without any of the tedium or delay of having to rely on social
engineering for transmission.  Building a sandbox whereby submitted
Tasks of Evil don't turn an entire corporation into Hell would be a bit
of a challenge.

I also don't know how well WinXX would function on nodes with a full
time CPU-sucking background task running -- historically this has proven
difficult even for a number of Unix schedulers and VM managers (mid 90's
Solaris, anyone?) and my direct experience of this on the one gaming
system I run Windows on (where games are, in a sense, HPC applications
BTW) is that this will be a really serious problem for the current
generation Windows kernels as well.  I have never been impressed with
WinXX's ability to multitask, but it wouldn't have to multitask WELL as
long as they were able to tune it so that desktop application
performance didn't suffer.  This could be done at the REALLY
coarse-grained task level and still win -- as in, run the BG application
instead of a CPU-sucking screensaver, or run it OUT of the screensaver
manager using the same exact controls.

This >>would<< really be amazingly simple to code, and with an
integrated front end a WinXX NOW cluster that can do "mosix"-like
embarrassingly parallel job redistribution -- at a cost of (say)
$100/node/year for the client and job management daemon -- would even
make economic sense for at least some shops.  MS makes $100K.  The
organization recovers the equivalent of a 950 node cluster for roughly
10% of the cost of a dedicated-function cluster of the same size, far
less than that if infrastructure requirements and scaling are taken into
account.
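
For what it's worth, the arithmetic behind "roughly 10%" goes like this
(the $1000/node price for a dedicated cluster node is my own
illustrative assumption; the license price and the 95% idle fraction
are the numbers above):

# Back-of-the-envelope for the WinXX NOW numbers above.
desktops         = 1000
license_per_node = 100       # $/node/year, the hypothetical MS price
idle_fraction    = 0.95
dedicated_node   = 1000      # $/node, ASSUMED cost of a real cluster node

ms_revenue      = desktops * license_per_node        # $100,000 to MS
recovered_nodes = int(desktops * idle_fraction)      # ~950 "free" nodes
dedicated_cost  = recovered_nodes * dedicated_node   # ~$950,000

print("MS makes $%d" % ms_revenue)
print("organization recovers the equivalent of %d nodes" % recovered_nodes)
print("NOW costs roughly %.1f%% of a dedicated cluster"
      % (100.0 * ms_revenue / dedicated_cost))       # ~10.5%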

IF MS makes things work so you can recover 95% of the CPU and not impact
desktop performance, this is a huge win, and gives MS a bit of leverage
and experience to make their tool (or a brand new parallel programming
suite tightly coupled to their programming tools) work for e.g. MPI apps
or other real parallel apps.  Even 80% recovery of CPU would be a solid
win -- the fact that linux permits more like 98% recovery without
impacting desktop usage (given sufficient node memory) is irrelevant.

   c) You were really unfair to linux on the security side.  Windows
managers all KNOW that linux is secure and windows is not -- not
absolutely of course in either direction, but sufficiently that I'm
pretty safe making the absolute statement anyway.  Windows managers
tend (if anything) to be jealous of linux managers on this very issue.
This (and scaling) is one of the major reasons that many places have
linux servers, whatever they run on the desktops.

At Duke our campus IT security person is just happy as a clam about
linux because linux at Duke installs itself in an auto-updating pull
mode that yum resyncs to the campus repository(ies) every night.  Linux
boxes on campus therefore get security updates even if the owner knows
"nothing" about security, and toplevel management only has to control
and defend a single set of toplevel servers to keep it that way.
NOBODY is happy about WinXX from a security point of view.  Updating
isn't done nightly and transparently the way it is in linux, where most
users are never even aware that their system has been updated and
patched, or that the application they run today isn't the same as the
one they ran yesterday because a bug they had never encountered is now
fixed.
Updating Windows is done rarely, after testing, and with great
trepidation because it can do anything from breaking nothing to breaking
everything to breaking SOME things.  Nightmarish is a reasonable term
for it.
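
To make "auto-updating pull mode" concrete: the entire client side of
the mechanism is a line or two of configuration per box.  Something
like the following (the path and timing are obviously site-specific --
this is just the shape of the thing, not Duke's actual files):

#!/bin/sh
# /etc/cron.daily/yum-update -- pull and apply anything new from the
# campus repository every night, no questions asked
/usr/bin/yum -y update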

It is also trivial to install linux so that it is "identical" desktop to
desktop across an organization.  This can be done with winsux, but it
often ISN'T done because it isn't quite as simple.  However, this is
really a competence issue, so let's just assume that everybody is
competent and that it is done.  The point is still that linux right out of the
metaphorical box is far more secure than WinXX is after investing quite
a bit of effort.  Linux competently installed on top of e.g. kickstart
files from a well-maintained yum-driven repo that mirrors the security
updates streams for the distro in question is very, very secure AT THE
DESKTOP, and still more secure (depending on cluster architecture) at
the cluster level.
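
The repo half of that is equally dull: each kickstarted client just
carries a stanza pointing yum at the local mirror of the distro's
update stream, dropped in place by %post or baked into the image.
Something on the order of (mirror.example.edu being a placeholder, of
course):

# /etc/yum.repos.d/campus-updates.repo
[campus-updates]
name=Campus mirror of the distro security/update stream
baseurl=http://mirror.example.edu/linux/updates/$releasever/$basearch/
enabled=1
gpgcheck=1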

   d) It is also important to be fair on the management scaling side.
Linux scales at the theoretical limit of management scalability.  One
(single) person can manage the install/update repo for an organization,
and yes, a COMPETENT organization will restrict all users to use the one
(or one of the) supported distribution(s), just like they wouldn't let
users run win95, win32, winXP, winme, winnt all at the same time on
different desktops (unless there were cost-beneficial reasons to do so).
Given this person, at the departmental LAN or cluster level the number
of systems a person can care for is almost completely independent of the
software.  It is limited by the frequency with which the hardware breaks
plus the number of requests for e.g. training or user-level software
support, per system per user per day.

If all hardware is tier 2 or better -- 3 year onsite service, competent
design, reliable choices -- one person can care for from 100s to as many
as 1000 linux systems from the hardware point of view on a 24-48 hour
service basis (where you don't need "overnight call" or coverage).
Furthermore, this service can be done just as easily by WinXX trained
staff as linux trained staff.  Hardware is hardware; the only issue is
having ONE person in your organization who sets policy as to what
hardware you get on the linux side to avoid potential device driver
issues.

User support issues vary wildly per organization and are difficult to
categorize in any simple way.  A single user can (as sysadmins on list
can well attest) suck down inordinate amounts of support REGARDLESS of
the operating system they use, and you might be supporting dozens of
these incompetent, personality disordered, life-sucking weasels who call
you up in the middle of the night and blame you personally if their home
ISP is for some reason slow or they found a website that dumps code that
freezes their browser and ultimately their interface (where none of your
OTHER 300 users has ever had a problem).  Or you can have hundreds of
highly competent users that never need to be taught that to print a
document you click these little buttons and look for the printer down
the hall and be sure not to pick one from three buildings over that
HAPPENS to appear on the list due to the miracle of printer sharing over
the network and that promiscuously accepts print jobs from anybody.

However, >>working<< at an institution with a wide range indeed of mixed
Win/Lin LAN configurations, I know of no reason to believe that WinXX
user level support is likely to be cheaper, ever, once you've hired the
minimum 1-2 linux people required for a minimum buy-in to linux (one for
small, two for large).  There is a nontrivial startup cost, sure, but
from what I've seen HERE, at least, if anything linux support costs
scale better than winsux support costs across the board.  It is cheaper
at the server level (by far).  It is cheaper to install (and not just
because of free software -- it is cheaper in HUMAN terms to install).
It is identical to support at the hardware level, EXCEPT for device
selection -- you have to be more careful to validate any given hardware
arrangement for linux, but once validated it tends to be identical.  It
is anecdotally somewhat better to support linux at the user level,
certainly in homogeneous environments (all lin vs all win) but still
largely true in a mixed environment.  We have a relatively small number
of WinXX boxes but manage to get support requests from their users at
almost the same rate we get them from linux users, possibly because of
their relative competence (lin tends to be used by e.g. faculty and
students, win by secretarial staff).  However, we also have win-only
labs that have a crisis a week, it seems like.

This is the point I was making last week -- ONCE AN ORGANIZATION PAYS
THIS BUY-IN COST (1-2 competent linux sysadmins) the marginal cost per
additional linux seat, be it desktop or cluster node, is strictly less
than that of an additional winsux seat, with the sole exception of
interoperability costs -- integrating OOffice desktops with MS Office
desktops.  This actually (as Joe has pointed out) "works" pretty well
these days for most things, but there are enough things for which it
doesn't work that it can create problems or additional work or some
restrictions on usage.  This gets back to competence and cost/benefit
again, as one can ARGUE that using MS Office at all is a fundamentally
incompetent thing to do in any institution that wishes to archive the
documents produced by its office suite tools so that they are
recoverable ten years from now.

Word's .doc format is not, actually, terribly standard or portable, as
anyone who has tried to reopen an old archived Word document has
doubtless already learned the hard way.  Document management is an issue
that many organizations, INCLUDING ones that are otherwise competently
run, handle very, very poorly, literally gambling that whatever document
format they are saving into archives today will be recoverable in a
decade.  The ability to actually file those documents in a
cross-referenced, keyword- or string-searchable format is similarly lacking.
The fact that the documents tend to be scattered all over an
organization's mounted filespace is another problem.  Windows is far
from homogeneous here and is notorious for its lack of backwards
compatibility; it is ultimately this that "forces" an organization to
update WinXX, including Office, across the institution.

You have just as much difficulty between Tommy in accounting, using Win98
and an old version of Office, and Sally in management, using WinXP Pro
with a sparkling modern version of Office.  Sally writes a memo to
Tommy and Tommy cannot read it or respond.  Open Office would do (if
anything) BETTER.  The difference is that updating Tommy's desktop to
the latest greatest WinXX and Office will cost (very likely) $100's in
software, hours of sysadmin time, and a bit of training.  Updating
Tommy's desktop to e.g. gnome and open office would take $0 in
software, ten minutes of sysadmin time (long enough to initiate a
pxe-driven boot), and somewhat longer in training.

Again, this is competence -- your argument is that homogeneity is cheaper
than heterogeneity, and I learned that the hard way back in the mid-80's
so I can hardly disagree now.  However, inhomogeneity can have benefits
as well, so competently determining the correct degree of INhomogeneity
an institution should seek requires a cost-benefit analysis.  Is it
cheaper in the long run for the institution to invest in the two people
required to get linux started so that it CAN update Tommy to linux AND
pay the training costs to get Tommy up to speed on linux-based
replacements for his standard tools?  Not a simple question to answer,
and no SINGLE answer will be universal.  However, it is undeniable that
it is a lot easier to make the move if the organization already has a
couple or three linux jocks on its IT staff, perhaps to run servers,
perhaps to run dedicated function linux clusters, perhaps because
engineering insists on using linux regardless of what accounting wants
to run.  This reduces the MARGINAL cost of linux still further and
provides that dangerous pathway towards a phase transition.

The "phase transition" approach is one that paradoxically works best in
tightly controlled topdown fascist management schemes.  In order to
achieve clearly visible CBA wins and achieve corporate goals, a
corporation bites the bullet and installs some linux systems -- probably
a linux cluster of one sort (HA) or another (HPC), maybe a few desktops
in technical departments.  They hire 5 linux superheroes to run all of
this.  A year later they notice that those superheroes are mostly playing
video games because yum is doing all of the nightly maintenance, the
hardware and software profiles of the systems they manage rarely change,
the software is stable and works well, and once their users got weaned
from winsux and retrained in linux, they seemed happy enough.

The IT person then asks "could we become a linux-only shop"?  Next thing
you see, it's the "Burlington Coat Factory" story -- 5000 linux POS
systems, 2000 servers, and linux desktops everywhere possible, at a
savings of millions up front plus millions more in operational
efficiencies.

This is the MS nightmare -- so far there haven't been THAT many total
conversions, but EVERY total conversion is a template for ten more.  Up
until the last two years, MS was reluctantly conceding parts of the
server marketplace, arguing correctly that linux was growing more at the
expense of Unix than of MS.  Over the last couple of years, though, the
linux desktop has significantly improved and (as Joe has noted) linux
has proven capable of integrating more or less seamlessly into a
win/lin mixed environment.

The remaining strikes against linux continue to be hardware (which tends
to be a lot more differentiated at the desktop level, and where linux
does poorly with features like e.g. dvd support, printer support, camera
support that are likely to be randomly requested by various groups or
individuals) and midlevel business applications.  THIS, not the desktop,
is what I personally have seen as the last bastion of resistance in at
least the company I sit on the board of.  They converted to linux
servers and are happy as clams.  They converted to linux POS and linux
desktops and are happy enough -- their employees needed at most a few
extra days of training, since if you can use Explorer you can use Mozilla
or galeon or firefox or netscape, and if you can use MS Office you can use
Open Office for just about anything anybody is likely to need to do in
most organizations (a memo or letter being pretty easy in ANY WP).

The killer is middleware -- accounting applications, office (non-Office)
applications, personnel management applications, database applications,
integrated applications.  There are CHOICES out there for Windows --
many of them quite expensive, of course, but they are there.  There
aren't a lot of choices there for linux.  Either there is an open source
effort or there is nothing.  If there IS an OS effort, either it works
and is supported pretty well and can be implemented without a lot of
hackery or (for most organizations) there is nothing.  This is a bigger
issue for small and midsized corps than it is for large ones -- the
large ones have the opportunity-cost systems programmer time required to
do the hackery/glue to make the OS solutions work, the smaller ones need
shrinkwrap solutions.  Either one can use consultants to make up the
difference, but there is a cost here as well.

If/when linux solves the device driver issue, one major barrier to
linux-only or linux growth at the direct expense of winsux in mixed
environments goes away.  If/when consultancies like Joe's branch out
into the corporate middleware market and/or start to market software
(maybe even CLOSED source software) for linux at the corporate level,
another one goes away.  Then we'll see what happens.

In the meantime, look out for a task-distribution, mosix-like addition
to Windows.  Turn your LAN into a NOW, at only $100/seat and with no
impact to your existing utilization!

That's been one of the major advantages of Unix from way back in the
late 80's and early 90's when I was routinely doing this across Unix
LANs.  One of the major DISadvantages of MS-based systems is that they
have NEVER been able to do this.  Adding this feature to Windows won't
even be a serious programming challenge, and they can probably arrange
for it to happen using tools and libraries that they fully own and
control so that applications written to use it (as opposed to being run EP
from the interface) are non-portable.

That should make things "interesting", don't you think?

     rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




