[Beowulf] how Google warps your brain
eugen at leitl.org
Thu Oct 21 03:43:23 PDT 2010
Tuesday, October 19, 2010
Computing at scale, or, how Google has warped my brain
A number of people at Google have stickers on their laptops that read "my
other computer is a data center." Having been at Google for almost four
months, I realize now that my whole concept of computing has radically
changed since I started working here. I now take it for granted that I'll be
able to run jobs on thousands of machines, with reliable job control and
sophisticated distributed storage readily available.
Most of the code I'm writing is in Python, but makes heavy use of Google
technologies such as MapReduce, BigTable, GFS, Sawzall, and a bunch of other
things that I'm not at liberty to discuss in public. Within about a week of
starting at Google, I had code running on thousands of machines all over the
planet, with surprisingly little overhead.
As an academic, I have spent a lot of time thinking about and designing
"large scale systems", though before coming to Google I rarely had a chance
to actually work on them. At Berkeley, I worked on the 200-odd node NOW and
Millennium clusters, which were great projects, but pale in comparison to the
scale of the systems I use at Google every day.
A few lessons and takeaways from my experience so far...
The cloud is real. The idea that you need a physical machine close by to get
any work done is completely out the window at this point. My only machine at
Google is a Mac laptop (with a big honking monitor and wireless keyboard and
trackpad when I am at my desk). I do all of my development work on a virtual
Linux machine running in a datacenter somewhere -- I am not sure exactly
where, not that it matters. I ssh into the virtual machine to do pretty much
everything: edit code, fire off builds, run tests, etc. The systems I build
are running in various datacenters and I rarely notice or care where they are
physically located. Wide-area network latencies are low enough that this
works fine for interactive use, even when I'm at home on my cable modem.
In contrast, back at Harvard, there are discussions going on about building
up new resources for scientific computing, and talk of converting precious
office and lab space on campus (where space is extremely scarce) into machine
rooms. I find this idea fairly misdirected, given that we should be able to
either leverage a third-party cloud infrastructure for most of this, or at
least host the machines somewhere off-campus (where it would be cheaper to
get space anyway). There is rarely a need for the users of the machines to be
anywhere physically close to them anymore. Unless you really don't believe in
remote management tools, the idea that we're going to displace student or
faculty lab space to host machines that don't need to be on campus makes no
sense to me.
The tools are surprisingly good. It is amazing how easy it is to run large
parallel jobs on massive datasets when you have a simple interface like
MapReduce at your disposal. Forget about complex shared-memory or
message-passing architectures: that stuff doesn't scale, and is so incredibly brittle
anyway (think about what happens to an MPI program if one core goes offline).
The other Google technologies, like GFS and BigTable, make large-scale
storage essentially a non-issue for the developer. Yes, there are tradeoffs:
you don't get the same guarantees as a traditional database, but on the other
hand you can get something up and running in a matter of hours, rather than
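To make the shape of that interface concrete, here is a minimal single-process
sketch of the map / shuffle / reduce pattern in Python. It is not Google's
MapReduce API: the names run_job, map_phase, shuffle, reduce_phase,
word_count_mapper and sum_reducer are all hypothetical, chosen only for
illustration. A real system distributes these phases across many machines and
handles failures and storage; the point here is only that the developer
supplies two small functions.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the user-supplied mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key (a real system does this across machines)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the user-supplied reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Hypothetical driver: map, then shuffle, then reduce, all in one process."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# The only two functions a job author would write in this toy model.
def word_count_mapper(line):
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    return sum(values)


if __name__ == "__main__":
    lines = ["the cloud is real", "The tools are surprisingly good"]
    print(run_job(lines, word_count_mapper, sum_reducer))
```

Because the mapper and reducer are plain functions with no shared state, each
can be exercised on a laptop with a handful of records before anything is
launched at scale.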
Log first, ask questions later. It should come as no surprise that debugging
a large parallel job running on thousands of remote processors is not easy.
So, printf() is your friend. Log everything your program does, and if
something seems to go wrong, scour the logs to figure it out. Disk is cheap,
so better to just log everything and sort it out later if something seems to
be broken. There's little hope of doing real interactive debugging in this
kind of environment, and most developers don't get shell access to the
machines they are running on anyway. For the same reason I am now a huge
believer in unit tests -- before launching that job all over the planet, it's
really nice to see all of the test lights go green.
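As a rough illustration of both habits, the sketch below uses Python's
standard logging and unittest modules. The worker function process_shard, its
arguments, and the logger name "worker" are assumptions made up for this
example, not anything from Google's tooling. The function logs what it starts,
what it skips, and what it finishes with, and the test checks both the result
and the fact that the bad record was logged.

```python
import logging
import unittest

log = logging.getLogger("worker")  # hypothetical logger name


def process_shard(shard_id, records):
    """Sum the integer records in one shard, logging every decision made."""
    log.info("shard %s: starting, %d records", shard_id, len(records))
    total = 0
    skipped = 0
    for index, record in enumerate(records):
        try:
            total += int(record)
        except ValueError:
            # Log and move on rather than crash: the log is the only way to
            # find out later what happened on a machine with no shell access.
            skipped += 1
            log.warning("shard %s: skipping bad record %d: %r",
                        shard_id, index, record)
    log.info("shard %s: done, total=%d skipped=%d", shard_id, total, skipped)
    return total, skipped


class ProcessShardTest(unittest.TestCase):
    def test_skips_bad_records_and_logs_them(self):
        # assertLogs captures records at WARNING level and above.
        with self.assertLogs("worker", level="WARNING") as captured:
            result = process_shard(7, ["1", "2", "oops", "3"])
        self.assertEqual(result, (6, 1))
        self.assertEqual(len(captured.output), 1)
        self.assertIn("oops", captured.output[0])
```

Saved to a file, the test can be run locally with "python -m unittest" before
the job goes anywhere near a cluster.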