[Beowulf] Spanner: synchronizing the largest global database

Eugen Leitl eugen at leitl.org
Tue Nov 27 05:25:12 PST 2012


(it's about time)

Exclusive: Inside Google Spanner, the Largest Single Database on Earth

By Cade Metz
6:30 AM

Each morning, when Andrew Fikes sat down at his desk inside Google headquarters in Mountain View, California, he turned on the “VC” link to New York.

VC is Google shorthand for video conference. Looking up at the screen on his desk, Fikes could see Wilson Hsieh sitting inside a Google office in Manhattan, and Hsieh could see him. They also ran VC links to a Google office in Kirkland, Washington, near Seattle. Their engineering team spanned three offices in three different parts of the country, but everyone could still chat and brainstorm and troubleshoot without a moment’s delay, and this is how Google built Spanner.

“You walk into our cubes, and we’ve got VC on — all the time,” says Fikes, who joined Google in 2001 and now ranks among the company’s distinguished software engineers. “We’ve been doing this for years. It lowers all the barriers to communication that you typically have.”

‘As a distributed-systems developer, you’re taught from — I want to say childhood — not to trust time. What we did is find a way that we could trust time — and understand what it meant to trust time.’

— Andrew Fikes

The arrangement is only appropriate. Much like the engineering team that created it, Spanner is something that stretches across the globe while behaving as if it’s all in one place. Unveiled this fall after years of hints and rumors, it’s the first worldwide database worthy of the name — a database designed to seamlessly operate across hundreds of data centers and millions of machines and trillions of rows of information.

Spanner is a creation so large, some have trouble wrapping their heads around it. But the end result is easily explained: With Spanner, Google can offer a web service to a worldwide audience, but still ensure that something happening on the service in one part of the world doesn’t contradict what’s happening in another.

Google’s new-age database is already part of the company’s online ad system — the system that makes its millions — and it could signal where the rest of the web is going. Google caused a stir when it published a research paper detailing Spanner in mid-September, and the buzz was palpable among the hard-core computer systems engineers when Wilson Hsieh presented the paper at a conference in Hollywood, California, a few weeks later.

“It’s definitely interesting,” says Raghu Murty, one of the chief engineers working on the massive software platform that underpins Facebook — though he adds that Facebook has yet to explore the possibility of actually building something similar.

Google’s web operation is significantly more complex than most, and it’s forced to build custom software that’s well beyond the scope of most online outfits. But as the web grows, its creations so often trickle down to the rest of the world.

Before Spanner was revealed, many didn’t even think it was possible. Yes, we had “NoSQL” databases capable of storing information across multiple data centers, but they couldn’t do so while keeping that information “consistent” — meaning that someone looking at the data on one side of the world sees the same thing as someone on the other side. The assumption was that consistency was barred by the inherent delays that come when sending information between data centers.

But in building a database that was both global and consistent, Google’s Spanner engineers did something completely unexpected. They have a history of doing the unexpected. The team includes not only Fikes and Hsieh, who oversaw the development of BigTable, Google’s seminal NoSQL database, but also legendary Googlers Jeff Dean and Sanjay Ghemawat and a long list of other engineers who worked on such groundbreaking data-center platforms as Megastore and Dremel.

This time around, they found a new way of keeping time.

“As a distributed systems developer, you’re taught from — I want to say childhood — not to trust time,” says Fikes. “What we did is find a way that we could trust time — and understand what it meant to trust time.”

Time Is of the Essence

On the net, time is of the essence. Yes, in running a massive web service, you need things to happen quickly. But you also need a means of accurately keeping track of time across the many machines that underpin your service. You have to synchronize the many processes running on each server, and you have to synchronize the servers themselves, so that they too can work in tandem. And that’s easier said than done.

Typically, data-center operators keep their servers in sync using what’s called the Network Time Protocol, or NTP. This is essentially an online service that connects machines to the official atomic clocks that keep time for organizations across the world. But because it takes time to move information across a network, this method is never completely accurate, and sometimes, it breaks altogether. In July, several big-name web operations experienced problems — including Reddit, Gawker, and Mozilla — because their software wasn’t prepared to handle a “leap second” that was added to the world’s atomic clocks.

‘We wanted something that we were confident in. It’s a time reference that’s owned by Google.’

— Andrew Fikes

But with Spanner, Google discarded the NTP in favor of its own time-keeping mechanism. It’s called the TrueTime API. “We wanted something that we were confident in,” Fikes says. “It’s a time reference that’s owned by Google.”

Rather than rely on outside clocks, Google equips its Spannerized data centers with its own atomic clocks and GPS (global positioning system) receivers, not unlike the one in your iPhone. Tapping into a network of satellites orbiting the Earth, a GPS receiver can pinpoint your location, but it can also tell time.

These time-keeping devices connect to a certain number of master servers, and the master servers shuttle time readings to other machines running across the Google network. Basically, each machine on the network runs a daemon — a background software process — that is constantly checking with masters in the same data center and in other Google data centers, trying to reach a consensus on what time it is. In this way, machines across the Google network can come pretty close to running a common clock.

‘The System Responds — And Not a Human’

How does this bootstrap a worldwide database? Thanks to the TrueTime service, Google can keep its many machines in sync — even when they span multiple data centers — and this means they can quickly store and retrieve data without stepping on each other’s toes.

“We can commit data at two different locations — say the West Coast [of the United States] and Europe — and still have some agreed upon ordering between them,” Fikes says, “So, if the West Coast write happens first and then the one in Europe happens, the whole system knows that — and there’s no possibility of them being viewed in a different order.”

‘By using highly accurate clocks and a very clever time API, Spanner allows server nodes to coordinate without a whole lot of communication.’

— Andy Gross

According to Andy Gross — the principal architect at Basho, an outfit that builds an open source database called Riak that’s designed to run across thousands of servers — database designers typically seek to synchronize information across machines by having them talk to each other. “You have to a do a whole lot of communication to decide the correct order for all the transactions,” he told us this fall, when Spanner was first revealed.

The problem is that this communication can bog down the network — and the database. As Max Schireson — the president of 10gen, maker of the NoSQL database MongoDB — told us: “If you have large numbers of people accessing large numbers of systems that are globally distributed so that the delay in communications between them is relatively long, it becomes very hard to keep everything synchronized. If you increase those factors, it gets even harder.”

So Google took a completely different tack. Rather than struggle to improve communication between servers, it gave them a new way to tell time. “That was probably the coolest thing about the paper: using atomic clocks and GPS to provide a time API,” says Facebook’s Raghu Murty.

In harnessing time, Google can build a database that’s both global and consistent, but it can also make its services more resistant in the face of network delays, data-center outages, and other software and hardware snafus. Basically, Google uses Spanner to accurately replicate its data across multiple data centers — and quickly move between replicas as need be. In other words, the replicas are consistent too.

When one replica is unavailable, Spanner can rapidly shift to another. But it will also move between replicas simply to improve performance. “If you have one replica and it gets busy, your latency is going to be high. But if you have four other replicas, you can choose to go to a different one, and trim that latency,” Fikes says.

One effect, Fikes explains, is that Google spends less money managing its system. “When there are outages, things just sort of flip — client machines access other servers in the system,” he says. “It’s a much easier service story…. The system responds — and not a human.”

Spanning Google’s Footsteps

Some have questioned whether others can follow in Google’s footsteps — and whether they would even want to. When we spoke to Andy Gross, he guessed that even Google’s atomic clocks and GPS receivers would be prohibitively expensive for most operations.

Yes, rebuilding the platform would be a massive undertaking. Google has already spent four and half years on the project, and Fikes — who helped build Google’s web history tool, its first product search service, and Google Answers, as well as BigTable — calls Spanner the most difficult thing he has ever worked on. What’s more, there are countless logistical issues that need dealing with.

‘The important thing to think about is that this is a service that is provided to the data center. The costs of that are amortized across all the servers in your fleet. The cost per server is some incremental amount — and you weigh that against the types of things we can do for that.’

— Andrew Fikes

As Fikes points out, Google had to install GPS antennas on the roofs of its data centers and connect them to the hardware below. And, yes, you do need two separate types of time keepers. Hardware always fails, and your time keepers must fail at, well, different times. “The atomic clocks provide stability if there is a GPS issue,” he says.

But according to Fikes, these are relatively inexpensive devices. The GPS units aren’t as cheap as those in your iPhone, but like Google’s atomic clocks, they cost no more than a few thousand dollars apiece. “They’re sort of in the order of the cost of an enterprise server,” he says, “and there are a lot of different vendors of these devices.” When we discussed the matter with Jeff Dean — one of Google primary infrastructure engineers and another name on the Spanner paper — he indicated much the same.

Fikes also makes a point of saying that the TrueTime service does not require specialized servers. The time keepers are kept in racks onside the servers, and again, they need only connect to some machines in the data center.

“You can think of it as only a handful of these devices being in each data center. They’re boxes. You buy them. You plug them into your rack. And you’re going to connect to them over Ethernet,” Fikes says. “The important thing to think about is that this is a service that is provided to the data center. The costs of that are amortized across all the servers in your fleet. The cost per server is some incremental amount — and you weigh that against the types of things we can do for that.”

No, Spanner isn’t something every website needs today. But the world is moving in its general direction. Though Facebook has yet to explore something like Spanner, it is building a software platform called Prism that will run the company’s massive number crunching tasks across multiple data centers.

Yes, Google’s ad system is enormous, but it benefits from Spanner in ways that could benefit so many other web services. The Google ad system is an online auction — where advertisers bid to have their ads displayed as someone searches for a particular item or visits particular websites — and the appearance of each ad depends on data describing the behavior of countless advertisers and web surfers across the net. With Spanner, Google can juggle this data on a global scale, and it can still keep the whole system in sync.

As Fikes put it, Spanner is just the first example of Google taking advantage of its new hold on time. “I expect there will be many others,” he says. He means other Google services, but there’s a reason the company has now shared its Spanner paper with the rest of the world.

Illustration by Ross Patton

More information about the Beowulf mailing list