[Beowulf] how Google warps your brain

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Thu Oct 21 07:13:48 PDT 2010


Comment inserted below


On 10/21/10 3:43 AM, "Eugen Leitl" <eugen at leitl.org> wrote:


> 
> In contrast, back at Harvard, there are discussions going on about building
> up new resources for scientific computing, and talk of converting precious
> office and lab space on campus (where space is extremely scarce) into machine
> rooms. I find this idea fairly misdirected, given that we should be able to
> either leverage a third-party cloud infrastructure for most of this, or at
> least host the machines somewhere off-campus (where it would be cheaper to
> get space anyway). There is rarely a need for the users of the machines to be
> anywhere physically close to them anymore.

There *is* a political reason and a funding-stream reason.  When you use a
remote resource, someone is measuring the use of that resource, and
typically you have a budget allocated for it.  Perhaps at Google, computing
resources are free, but that's not the case at most places.  So someone who
has been given X amount of resources to do task Y can't, on the spur of the
moment, use some fraction of that to do task Z (and, in fact, if you're
consuming government funds, using resources allocated for Y to do Z is
illegal).

However, if you've used the dollars to buy a local computer, the "accounting
for use" typically stops at that point, and nobody much cares what you use
that computer for, as long as Y gets done.  In the long term, yes, there
will be an evaluation of whether you bought too much or too little for the X
amount of resources, but in the short run you've got some potentially "free"
excess resources.

This is a bigger deal than you might think. Let's take a real-life example.
You have a small project, funded at, say, $150k a year (enough to support a
person working maybe 1/3 time, plus resources) for a couple of years.  You
decide to use an institutionally provided desktop computer, store all your
terabytes of data on an institutional server, and pay the nominal $500/month
(which covers backups, etc., and all the admin stuff you shouldn't really be
fooling with anyway). You toil happily for the year (spending around $6k of
your budget on computing resources), and then the funding runs out a little
earlier than you had hoped (oops, the institution decided to retroactively
change the chargeback rates, so now that monthly charge is $550).  And
someone comes to you and says: hey, you're out of money, we're deleting the
data you have stored in the cloud, and by the way, give back that computer
on your desk.
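
To put rough numbers on it, here's a quick Python sketch using only the
figures from the example above (nothing here beyond the example's own
numbers):

    # Rough budget arithmetic for the example above.  The $150k/year and the
    # $500 -> $550/month storage charge come from the example itself.
    annual_budget = 150_000          # per year, per the example
    monthly_planned = 500            # institutional chargeback as planned
    monthly_actual = 550             # after the retroactive rate change

    planned = 12 * monthly_planned   # ~$6k/year, as in the example
    actual = 12 * monthly_actual     # $6,600/year after the change

    print(f"computing is ${planned:,}/year, about "
          f"{planned / annual_budget:.0%} of the budget")
    print(f"the retroactive change adds ${actual - planned:,}/year "
          f"that was never budgeted for")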

You're going to need to restart your work next year, when next year's money
arrives (depending on the funding agency's grant cycle, there is a random
delay in this: maybe they're waiting for Congress to pass a continuing
resolution, or the California Legislature to pass the budget, or whatever),
but in the meantime you're out of luck.  And yes, a wise project manager
(even for this $300k task) would have set aside some reserves, etc.  But
that doesn't always happen.

At least if you OWN the computing resources, you have the option of
mothballing, deferring maintenance, etc. to ride through a funding-stream
hiccup.


> Unless you really don't believe in
> remote management tools, the idea that we're going to displace students or
> faculty lab space to host machines that don't need to be on campus makes no
> sense to me.
> 
> The tools are surprisingly good.
> Log first, ask questions later. It should come as no surprise that debugging
> a large parallel job running on thousands of remote processors is not easy.
> So, printf() is your friend.

This works in a "resources are free" environment.  But what if you are
paying for every byte of storage for all those log messages? What if you're
paying for compute cycles to scan those logs?
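
As a back-of-the-envelope sketch (every rate below is an assumption for
illustration, not anyone's actual pricing), "log everything" adds up fast
once storage is metered:

    # Back-of-the-envelope cost of "log everything" when storage is metered.
    # Every number here is an assumption, purely for illustration.
    nodes = 1000                # remote processors in the job (assumed)
    bytes_per_sec = 10_000      # log output per node, ~10 KB/s (assumed)
    seconds_per_day = 86_400
    price_per_gb_month = 0.10   # assumed $/GB-month for stored logs

    gb_per_day = nodes * bytes_per_sec * seconds_per_day / 1e9
    monthly_cost = gb_per_day * 30 * price_per_gb_month   # keep ~30 days

    print(f"{gb_per_day:.0f} GB of logs per day, "
          f"roughly ${monthly_cost:,.0f}/month just to keep a month's worth")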

Remote computing on a large scale works *great* if the only cost is a
"connectivity endpoint".

Look at the change in phone costs over the past few decades.  Back in the
70s, phone call and data line costs were (roughly) proportional to distance,
because you were essentially paying for a share of the physical copper (or
equivalent) along the way. As soon as there was substantial fiber available,
though, there was a huge oversupply of capacity, so the pricing model
changed to "pay for the endpoint" (or "point of presence"/POP), leading to
"5c/min long distance anywhere in the world".  I was at a talk by a guy from
AT&T in 1993, and he mentioned that the new fiber link across the Atlantic
cost about $3/phone line (in terms of equivalent capacity, e.g. 64 kbps),
and that was the total lifecycle cost.  The financial model was: if you paid
$3, you'd have 64 kbps across the Atlantic in perpetuity, or close to it.
Once you'd paid your $3, nobody cared whether the circuit was busy or idle,
so the incremental cost to use it was essentially zero.  Compare this to the
incredibly expensive physical copper wires with limited bandwidth, where
they could charge dollars/minute, which was pretty close to the actual cost
to provide the service.
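
Spelling out that arithmetic (the $3 is the figure from the talk as I recall
it; the dollar-a-minute copper rate is just an assumed illustrative number):

    # Metered copper vs. paid-up fiber, per 64 kbps transatlantic circuit.
    # $3 is the quoted lifecycle figure; $1/minute is an assumed copper rate.
    fiber_lifecycle = 3.00     # dollars, once, then the circuit is yours
    copper_per_min = 1.00      # assumed dollars per minute, metered

    for minutes in (1, 3, 60, 10_000):
        print(f"{minutes:>6} min: copper ${minutes * copper_per_min:>9,.2f}, "
              f"fiber ${fiber_lifecycle:.2f} (incremental cost ~0)")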

If you go back through the archives of this list, this kind of "step
function in costs" has been discussed a lot.  You've already got someone
sysadmin'ing a 32-node cluster that isn't fully busy, so adding another
cluster only increases your costs by the hardware purchase (since the
electrical and HVAC costs are covered by overheads).

But the approach of "low incremental cost to consume excess capacity" only
lasts so long: when you get to sufficiently large scales, there is *no*
excess capacity, because you're able to spend your money in sufficiently
small granules (compared to overall size).  Or, returning to my original
point, the person giving you your money is able to account for your usage in
sufficiently small granules that you have no "hidden excess" to "play with".
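
As a rough model (Python, with invented prices and cluster sizes), the
contrast between the owned-cluster step function and fine-grained metered
billing looks something like this:

    # Step-function cost of owned clusters vs. fine-grained metered billing.
    # Every price and size here is invented, purely for illustration.
    import math

    def owned_cost(node_hours, nodes_per_cluster=32, cluster_price=100_000):
        # You buy whole clusters; sysadmin, power and HVAC sit in overheads,
        # so the marginal cost of more capacity is the next hardware buy.
        capacity = nodes_per_cluster * 8760   # node-hours per cluster-year
        return math.ceil(node_hours / capacity) * cluster_price

    def metered_cost(node_hours, per_node_hour=0.50):
        # Remote metered resources bill in small granules: no step, no slack.
        return node_hours * per_node_hour

    for need in (50_000, 280_000, 300_000):
        print(f"{need:>7} node-hours: owned ${owned_cost(need):>7,}, "
              f"metered ${metered_cost(need):>10,.0f}")

Below a step boundary the owned cluster has idle headroom that costs nothing
extra to use; the metered model never does.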

Rare is the cost-sensitive organization that voluntarily allocates resources
to unconstrained "fooling around".  Basically, it's the province of
patronage.



> Log everything your program does, and if
> something seems to go wrong, scour the logs to figure it out. Disk is cheap,
> so better to just log everything and sort it out later if something seems to
> be broken. There's little hope of doing real interactive debugging in this
> kind of environment, and most developers don't get shell access to the
> machines they are running on anyway. For the same reason I am now a huge
> believer in unit tests -- before launching that job all over the planet, it's
> really nice to see all of the test lights go green.




