[Beowulf] cloud: ho hum?

Wed Feb 1 08:23:16 PST 2012

On 2/1/12 7:59 AM, "Mark Hahn" <hahn at mcmaster.ca> wrote:
>
>> - Deployment speed. We have customers who wait weeks after making an IT
>> helpdesk request for a new VM to be created. Other customers take 1+
>
>no.  there's nothing technical here: dysfunctional IT orgs should simply
>be fixed.  outsourcing as a workaround for BOFHishness is stupid...
>

The IT org in this situation isn't necessarily dysfunctional.  Say you're
an R&D group of 100 people in a company with 200k employees. The company's
IT org is optimized for the 200k, not for the 100.

Outsourcing is a logical choice here.

(This is the usual specialization, vertical vs. horizontal integration,
etc. discussion.)

Yes, there are inefficient service organizations everywhere, and there
always will be.  The hardest thing for project managers to learn is that
you MUST plan for average, not above average, performance.  The fact that
sometimes you get above average helps counteract the unknowable problems
that result in below average.

Example from NASA: Pathfinder put a rover on Mars for (ostensibly) $25M
and set a mind-bendingly low cost bar for future missions.  That's
not because Pathfinder was particularly well managed (it was well managed,
but that's not why the cost was low).  It's more because of a happy
coincidence of circumstances that made something that realistically
should have cost around $150M cost 1/6 of that.  They got lucky with the
people who worked on it, they got lucky with spare parts from other
missions, and they got lucky in being small, which avoided a lot of
oversight costs.

Next Mars missions in 1998: "Hey, Faster, Better, Cheaper, we can do it
again. We'll put TWO probes at Mars for the cost of one $100M mission."
Oops, one crashed into the surface, the other missed orbit injection and
probably burned up.  Much soul searching and reflection.

Next Mars mission (MER 2003) costs over $1B for two rovers (and you can
bet there were a LOT more reviews and oversight).  MER got unlucky in a lot
of ways. Original estimates of costs (from Pathfinder) turned out to be
inappropriate (some examples below). But the real story is that Pathfinder
happened to be out on the tail of the probability distribution of cost,
and MER was more in the middle.  Pathfinder's probability of failure was
MUCH higher than MER's.

- You can't just scale up airbags and parachutes
- The fast, low documentation approach of Pathfinder means you don't
actually have drawings from which you can build stuff with no changes.
- Parts that survived on Pathfinder turned out, when actually tested
against the environments, to have a high probability of failing.
Pathfinder "got lucky", and the parts had to be redesigned.
- MER was a lot bigger, so the performance of the team inevitably came
out closer to "average" (the central limit theorem at work).
- MER was a lot bigger, so communications costs, which go as N^k with
k>1, rose faster than the job size.
- As the job costs more, it gets more attention, so more management
controls and reviews are put into place. There's a big difference between
the failure of a mission flying one or two instruments on a cheap and
cheerful rover assembled from commercial parts and the failure of one
flying a dozen instruments on a $400M rover.
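
To put rough numbers on the two "MER was a lot bigger" points, here is a
small Python sketch. It is only an illustration under assumptions that are
not in the post above: N is taken to be team size, the exponent k = 1.5 is
made up, team members are treated as independent with the same spread in
performance, and the function names are hypothetical.

```python
import math


def comm_cost(n_people: int, k: float = 1.5) -> float:
    """Hypothetical coordination cost for a team of n_people, going as N^k.

    k = 1.5 is an arbitrary illustrative value; the only requirement is k > 1.
    """
    return n_people ** k


def spread_of_team_average(n_people: int, individual_sd: float = 1.0) -> float:
    """Standard deviation of the mean of n_people independent performers.

    Assumes independent individuals with equal spread; the average of a
    large team sits much closer to "average" than that of a small one.
    """
    return individual_sd / math.sqrt(n_people)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # With k > 1 the per-person overhead keeps growing with team size.
        per_person = comm_cost(n) / n
        print(f"N={n:5d}  comm cost per person={per_person:7.2f}  "
              f"spread of team average={spread_of_team_average(n):.3f}")
```

With these assumptions the per-person communications overhead grows as the
team grows, while the chance of the whole team landing far above average
shrinks: a small project can get lucky, a big one mostly cannot.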
