[Beowulf] One big cluster: How CloudFlare launched 10 data centers in 30 days

Wed Oct 24 01:04:53 PDT 2012

http://arstechnica.com/information-technology/2012/10/one-big-cluster-how-cloudflare-launched-10-data-centers-in-30-days/

One big cluster: How CloudFlare launched 10 data centers in 30 days

With high-performance computing, "pixie-booting" servers a half-world away.

by Sean Gallagher - Oct 19 2012, 0:13am -200

Big Data Cloud Open Source Supercomputing The Web

The inside of Equinix's co-location facility in San Jose—the home of
CloudFlare's primary data center.

Photo: Peter McCollough/Wired.com

On August 22, CloudFlare, a content delivery network, turned on a brand new
data center in Seoul, Korea—the last of ten new facilities started across
four continents in a span of thirty days. The Seoul data center brought
CloudFlare's number of data centers up to 23, nearly doubling the company's
global reach—a significant feat in itself for a company of just 32 employees.

But there was something else relatively significant about the Seoul data
center and the other 9 facilities set up this summer: despite the fact that
the company owned every router and every server in their racks, and each had
been configured with great care to handle the demands of CloudFlare's CDN and
security services, no one from CloudFlare had ever set foot in them. All that
came from CloudFlare directly was a six-page manual instructing facility
managers and local suppliers on how to rack and plug in the boxes shipped to
them.

"We have nobody stationed in Stockholm or Seoul or Sydney, or a lot of the
places that we put these new data centers," CloudFlare CEO Matthew Prince
told Ars. "In fact, no CloudFlare employees have stepped foot in half of the
facilities where we've launched." The totally remote-controlled data center
approach used by the company is one of the reasons that CloudFlare can afford
to provide its services for free to most of its customers—and still make a 75
percent profit margin.

In the two years since its launch, the content delivery network and
denial-of-service protection company has helped keep all sorts of sites
online during global attacks, both famous and infamous—including recognition
from both Davos and LulzSec. And all that attention has amounted to
Yahoo-sized traffic—the CloudFlare service has handled over 581 billion
pageviews since its launch.

Yet CloudFlare does all this without the sort of Domain Name Service "black
magic" that Akamai and other content delivery networks use to
forward-position content—and with only 32 employees. To reach that level of
efficiency, CloudFlare has done some black magic of a different sort, relying
on open-source software from the realm of high-performance computing, storage
tricks from the world of "big data," a bit of network peering arbitrage and
clever use of a core Internet routing technology.

In the process, it has created an ever-expanding army of remote-controlled
service points around the globe that can eat 60-gigabit-per-second
distributed denial of service attacks for breakfast.

Routing with Anycast

CloudFlare's CDN is based on Anycast, a standard defined in the Border
Gateway Protocol—the routing protocol that's at the center of how the
Internet directs traffic. Anycast is part of how BGP supports the
multi-homing of IP addresses, in which multiple routers connect a network to
the Internet; through the broadcasts of IP addresses available through a
router, other routers determine the shortest path for network traffic to take
to reach that destination.

Using Anycast means that CloudFlare makes the servers it fronts appear to be
in many places, while only using one IP address. "If you do a traceroute to
Metallica.com (a CloudFlare customer), depending on where you are in the
world, you would hit a different data center," Prince said. "But you're
getting back the same IP address."

That means that as CloudFlare adds more data centers, and those data centers
advertise the IP addresses of the websites that are fronted by the service,
the Internet's core routers automatically re-map the routes to the IP
addresses of the sites. There's no need to do anything special with the
Domain Name Service to handle load-balancing of network traffic to sites
other than point the hostname for a site at CloudFlare's IP address. It also
means that when a specific data center needs to be taken down for an upgrade
or maintenance (or gets knocked offline for some other reason), the routes
can be adjusted on the fly.

That makes it much harder for distributed denial of service attacks to go
after servers behind CloudFlare's CDN network; if they're geographically
widespread, the traffic they generate gets spread across all of CloudFlare's
data centers—as long as the network connections at each site aren't overcome.

In September, Prince said, "there was a brand new botnet out there launching
big attacks, and it targeted one of our customers. It generated 65 gigabits
per second of traffic hitting our network. But none of that traffic was
focused in one place—it was split fairly evenly across our 23 data centers,
so each of those facilities only had to deal with about 3 gigs of traffic.
That's much more manageable."

Net-rich, power-poor

Making CloudFlare's approach work requires that it put its networks as close
as possible to the core routers of the Internet—at least in terms of network
hops. While companies like Google, Facebook, Microsoft, and Yahoo have gone
to great lengths to build their own custom data centers in places where power
is cheap and where they can take advantage of the economies of scale,
CloudFlare looks to use existing facilities that "your network traffic would
be passing through even if you weren't using our service," Prince said.

As a result, the company's "data centers" are usually at most a few racks of
hardware, installed at co-location facilities that are major network exchange
points. Prince said that most of his company's data centers are set up at
Equinix IBX co-location facilities in the US, including CloudFlare's primary
facility in San Jose—a facility also used by Google and other major cloud
players as an on-ramp to the Internet.

CloudFlare looks for co-location facilities with the same sort of
capabilities wherever it can. But these sorts of facilities tend to be older,
without the kind of power distribution density that a custom-built data
center would have. "That means that to get as much compute power as possible
into any given rack, we're spending a lot of time paying attention to what
power decisions we make," Prince said.

The other factor driving what goes into those racks is the need to maximize
the utilization of CloudFlare's outbound Internet connections. CloudFlare
buys its bandwidth wholesale from network transit providers, committing to a
certain level of service. "We're paying for that no matter what," Prince
said, "so it's optimal to fill that pipe up."

That means that the computing power of CloudFlare's servers is less of a
priority than networking and cache input/output and power consumption. And
since CloudFlare depends heavily on the facility providers overseas or other
partners to do hardware installations and swap-outs, the company needed to
make its servers as simple as possible to install—bringing it down to that
six-page manual. To make that possible, CloudFlare's engineering team drew on
experience and technology from the high-performance computing world.

The magical pixie-booted data center

"A lot of our team comes from the HPC space," Prince said. "They include
people who built HPC networks for the Department of Energy, where they have
an 80 thousand node cluster, and had to figure out how to get 80,000
computers, fit them into one space, cable them in a really reliable way, and
make sure that you can manage them from a single location."

One of the things that CloudFlare brought over from the team's DoE experience
was the Perceus Provisioning System, an open-source provisioning system for
Linux used by DoE for its HPC environments. All of CloudFlare's servers are
"pixie-booted"  (using a Preboot eXecution Environment, or PXE) across a
virtual private network between data centers; servers are delivered with no
operating system or configuration whatsoever, other than a bootloader that
calls back to Perceus for provisioning. "The servers come from whatever
equipment vendor we buy them from completely bare," Prince said. "All we get
from them is the MAC address."

CloudFlare's servers run on a custom-built Linux distribution based on
Debian. For security purposes, the servers are "statelessly" provisioned with
Perceus—that is, the operating system is loaded completely in RAM. The mass
storage on CloudFlare servers (which is universally based on SSD drives) is
used exclusively for caching data from clients' sites.

The gear deployed to data centers that gets significant pre-installation
attention from CloudFlare's engineers is the routers—primarily supplied by
Juniper Networks, which works with CloudFlare to preconfigure them before
being shipped to new data centers. Part of the configuration is to create
virtual network connections over the Internet to the other CloudFlare data
centers, which allows each data center to use its nearest peer to pull
software from during provisioning and updating.

"When we booted up Vienna, for example," said Prince, "the closest data
center was Frankfurt, so we used the Frankfurt facility to boot the new
Vienna facility." One server in Vienna was booted first as the "master node,"
with provisioning instructions for each of the other machines. Once the
servers are all provisioned and loaded, "they call back to our central
facility (in San Jose) and say, 'Here are our MAC addresses, what do you need
us to do?'"

Once the machines have passed a final set of tests, each gets designated with
an operational responsibility: acting as a proxy for Web requests to clients'
servers, managing the cache of content to speed responses, DNS and logging
services. Each of those services can be run on any server in the stack and
step up to take over a service if one of its comrades fails.

Caching and hashing

Caching is part of the job for every server in each CloudFlare facility, and
being able to scale up the size of the cache is another reason for the
modular nature of how the company thinks of servers. Rather than storing
cached webpage objects in a traditional database or file system, CloudFlare
uses a hash-based database that works in a fashion similar to "NoSQL"
databases like 10gen's MongoDB and Amazon's Dynamo storage system.

When a request for a webpage comes in for the first time, CloudFlare
retrieves the site contents. A consistent hashing algorithm in CloudFlare's
caching engine then converts the URL used to call each element into a value,
which is used as the key under which the content is stored locally at each
data center. Each server in the stack is assigned a range of hashes to store
content for, and subsequent requests for the content are routed to the
appropriate server for that cache.

Unlike most database applications, the cache stored at each CloudFlare
facility has an undefined expiration date—and because of the nature of those
facilities, it isn't a simple matter to add more storage. To keep the
utilization level of installed storage high, the cache system simply purges
older cache data when it needs to store new content.

The downside of the hash-based cache's simplicity is that it has no built-in
logging system to track content. CloudFlare can't tell customers which data
centers have copies of which content they've posted. "A customer will ask me,
'Tell me all of the files you have in cache,'" Prince said. "For us, all we
know is there are a whole bunch of hashes sitting on a disk somewhere—we
don't keep track of which object belongs to what site."

The upside, however, is that the system has a very low overhead as a result
and can retrieve site content quickly and keep those outbound pipes full. And
when you're scaling a 32-person company to fight the speed of light
worldwide, it helps to keep things as simple as possible.