Archive for the 'cloud' Category

PortLand Code Release

The amount of interest in data centers and data center networking continues to grow.  For the past decade plus, the most savvy Internet companies have been focusing on infrastructure.  Essentially, planetary scale services such as search, social networking, and e-commerce require a tremendous amount of computation and storage.  When operating at the scale of tens of thousands of computers and petabytes of storage, small gains in efficiency can result in millions of dollars of annual savings.  On the other extreme, efficient access to tremendous amounts of computation can enable companies to deliver more valuable content.  For example, Amazon is famous for tailoring web page contents to individual customers based on both their history and potentially the history of similar users.  Doing so while maintaining interactive response times (typically responding in less than 300 ms) requires fast, parallel access to data potentially spread across hundreds or even thousands of computers.  In an earlier post, I described the Facebook architecture and its reliance on clustering for delivering social networking content.

Over the last few years, academia has become increasingly interested in data centers and cloud computing. One reason is the opportunity for impact; the entire computing industry is undergoing another paradigm shift, and five years from now the way we build out computing and storage infrastructures will be radically different.  Another allure of the data center is the fact that it is possible to do “clean slate” research and deployment.  One frustration of the networking research community has been the inability to deploy novel architectures and protocols because of the need to be backward compatible and friendly to legacy systems.  Check out this paper for an excellent discussion. In the data center, it is at least possible to deploy entirely new architectures without the need to be compatible with every protocol developed over the years.

Of course, there are difficulties with performing data center research as well.  One is having access to the necessary infrastructure to perform research at scale.  With companies deploying data centers at the scale of tens of thousands of computers, it is difficult for most universities and even research labs to have access to the necessary infrastructure.  In our own experience, we have found that it is possible to consider problems of scale even with a relatively modest number of machines.  Research infrastructures such as Emulab and OpenCirrus are open compute platforms that provide a significant amount of computing infrastructure to the research community.

Another challenge is the lack of software infrastructure for performing data center research, particularly in networking.  Eucalyptus provides an EC2-compatible environment for cloud computing.  However, there is a relative void of available software for networking research.  Rebuilding every aspect of the protocol stack before investigating fundamental algorithms and protocols is a challenge.

To partially address this shortcoming, we are releasing an alpha version of our PortLand protocol.  This work was published in SIGCOMM 2009 and targets delivering a unified Layer 2 environment for easier management and support for basic functionality such as virtual machine migration.  I discussed our work on PortLand in an earlier post here and some of the issues of Layer 2 versus Layer 3 deployment here.

The page for downloading PortLand is now up.  It reflects the hard work of two graduate students in my group, Sambit Das and Malveeka Tewari, who took our research code and ported it to HP ProCurve switches running OpenFlow.  The same codebase runs on NetFPGA switches as well.  We hope the community can confirm that the same code runs on a variety of other OpenFlow-enabled switches.  Our goal is for PortLand to be a piece of the puzzle for a software environment for performing research in data center networking.  We encourage you to try it out and give us feedback.  In the meantime, Sambit and Malveeka are hard at work adding Hedera functionality for flow scheduling to our next code release.

Presentation Summary “High Performance at Massive Scale: Lessons Learned at Facebook”

Recently, we were fortunate to host Jeff Rothschild, the Vice President of Technology at Facebook, for a visit for the CNS lecture series.  Jeff’s talk, “High Performance at Massive Scale: Lessons Learned at Facebook” was highly detailed, providing real insights into the Facebook architecture. Jeff spoke to a packed house of faculty, staff, and students interested in the technology and research challenges associated with running an Internet service at scale.  The talk is archived here as part of the CNS lecture series.  I encourage you to check it out; below are my notes on the presentation.
Site Statistics:
  • Facebook is the #2 property on the Internet as measured by the time users spend on the site.
  • Over 200 billion monthly page views.
  • >3.9 trillion feed actions processed per day.
  • Over 15,000 websites use Facebook content
  • In 2004, the curve plotting user population as a function of time showed exponential growth to 2M users.  Five years later, they have stayed on the same exponential curve, with >300M users.
  • Facebook is a global site, with 70% of users outside of the US.
  • Today, there are 1.3B people in the world who have quality Internet connectivity, so there is at least another factor of 4 growth that Facebook is going after. Jeff presented statistics for the number of users that each engineer supports at a variety of high-profile Internet companies: 1.1M for Facebook, 190,000 for Google, 94,000 for Amazon, and 75,000 for Microsoft.
Photo sharing on Facebook:
  • Facebook stores 20 billion photos in 4 resolutions
  • 2-3 billion new photos uploaded every month
  • Originally provisioned photo storage for 6 months, but blew through available storage in 1.5 weeks.
  • Facebook serves 600k photos/second; serving them is more difficult than storing them.
Scaling photos, first the easy way:
  • Upload tier: handles uploads, scales the images, stores them on the NFS tier
  • Serving tier: Images are served from NFS via HTTP
  • NFS Storage tier built from commercial products
  • Filesystems aren’t really good at supporting large numbers of files
Scaling photos, 2nd generation:
  • Cachr: cache the high volume smaller images to offload the main storage systems.
  • Only 300M images in 3 resolutions
  • Distribute these through a CDN to reduce network latency.
  • Cache them in memory.
Scaling photos, 3rd Generation System: Haystack
  • How many I/Os do you need to serve an image?  Originally, 10 I/Os at Facebook because of the complex directory structure.
  • Optimizations got it down to 2-4 I/Os per file served
  • Facebook built a better version called Haystack by merging multiple files into a single large file. In the common case, serving a photo now requires 1 I/O operation.  Haystack is available as open source.
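The core Haystack idea above can be sketched in a few lines: append every photo to one large file and keep an in-memory index mapping a photo id to its byte offset, so a read costs a single seek and read rather than several directory and inode I/Os. This is a toy illustration with invented names, not the real Haystack implementation:

```python
import os

class HaystackStore:
    """Toy sketch of the Haystack idea: many photos appended to one large
    file, with an in-memory index of photo_id -> (offset, length)."""

    def __init__(self, path):
        self.path = path
        self.index = {}           # photo_id -> (offset, length)
        open(path, "ab").close()  # create the store file if needed

    def put(self, photo_id, data):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(data)
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        # One seek + one read in the common case.
        offset, length = self.index[photo_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```

The real system also stores the index persistently and packs per-photo metadata into the file itself, but the single-I/O read path is the essential win.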
Facebook architecture consists of:
  • Load balancers at the front end; requests are distributed to web servers, which retrieve the actual content from a large memcached layer because of the latency requirements for individual requests.
  • Presentation Layer employs PHP
  • Simple to learn: small set of expressions and statements
  • Simple to write: loose typing and universal “array”
  • Simple to read
But this comes at a cost:
  • High CPU and memory consumption.
  • C++ Interoperability Challenging.
  • PHP does not encourage good programming in the large (at 3M lines of code it is a significant organizational challenge).
  • Initialization cost of each page scales with size of code base
Thus Facebook engineers undertook implementing optimizations to PHP:
  • Lazy loading
  • Cache priming
  • More efficient locking semantics for variable cache
  • Memcache client extension
  • Asynchronous event-handling
Back-end services that require high performance are implemented in C++. Services Philosophy:
  • Create a service iff required.
  • Real overhead for deployment, maintenance, separate code base.
  • Another failure point.
  • Create a common framework and toolset that will allow for easier creation of services: Thrift (open source).
A number of things break at scale, one example: syslog
  • Became impossible to push large amounts of data through the logging infrastructure.
  • Implemented Scribe for logging.
  • Today, Scribe processes 25TB of messages/day.
Site Architecture
Overall, Facebook currently runs approximately 30k servers, with the bulk of them acting as web servers.
The Facebook Web Server, running PHP, is responsible for retrieving all of the data required to compose the web page.  The data itself is stored authoritatively in a large cluster of MySQL servers.  However, to hit performance targets, most of the data is also stored in memory across an array of memcached servers. For traditional websites, each user interacts with his or her own data.  And for most web sites, only 1-2% of registered users concurrently access the site at any given time.  Thus, the site only needs to cache 1-2% of all data in RAM.  However, data at Facebook is deeply interconnected; each user is interested in the state of hundreds of other users.  Hence, even with only 1-2% of the user population at any given time, virtually all data must still be available in RAM.
Data partitioning was easy when Facebook was a college web site: simply partition data at the level of individual colleges.  After considering a variety of data clustering algorithms, the engineers found that there was very little win for the additional complexity of clustering.  So at Facebook, user data is randomly partitioned across individual databases and machines across the cluster.  Hence, each user access requires retrieving data corresponding to user state spread across hundreds of machines.  Intra-cluster network performance is hence critical to site performance. Facebook employs memcache to store the vast majority of user data in memory spread across thousands of machines in the cluster.  In essence, nodes maintain a distributed hash table to determine the machine responsible for a particular user’s data.  Hot data from MySQL is stored in the cache.  The cache supports get/set/incr/decr and multiget/multiset operations.
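The client-side partitioning described above can be sketched as follows: each key hashes to one cache server, and multiget batches lookups per server to reduce round trips. Plain dicts stand in for real memcached nodes here; this is illustrative, not Facebook's actual client:

```python
import hashlib

class McClient:
    """Sketch of client-side key partitioning over a set of cache
    servers, with a batched multiget. Servers are dict-like stand-ins."""

    def __init__(self, servers):
        self.servers = servers  # list of dict-like "servers"

    def _index(self, key):
        # Hash the key to pick the responsible server.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % len(self.servers)

    def set(self, key, value):
        self.servers[self._index(key)][key] = value

    def get(self, key):
        return self.servers[self._index(key)].get(key)

    def multiget(self, keys):
        # Group keys by server so each server is contacted only once.
        by_server = {}
        for k in keys:
            by_server.setdefault(self._index(k), []).append(k)
        out = {}
        for i, ks in by_server.items():
            server = self.servers[i]
            out.update({k: server[k] for k in ks if k in server})
        return out
```

A production client would use consistent hashing so that adding or removing a server remaps only a small fraction of keys, but the request-batching structure is the same.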
Initially, the architecture needed to support 15-20k requests/sec/machine, but that number has scaled to approximately 250k requests/sec/machine today.  Servers have gotten faster to keep up to some extent, but Facebook engineers also had to perform some fundamental re-engineering of memcached to improve its performance.  System performance improved from 50k requests/sec/machine to 150k, to 200k, and finally to 250k by adding multithreading, polling device drivers, stats locking, and batched packet handling, respectively. In aggregate, memcache at Facebook processes 120M requests/sec.
One networking challenge with memcached was so-called Network Incast. A front-end web server would collect responses from hundreds of memcache machines in parallel to compose an individual HTTP response. All responses would come back within the same approximately 40 microsecond window.  Hence, while overall network utilization was low at Facebook, even at short time scales, there were significant, correlated packet losses at very fine timescales.  These microbursts overflowed the limited packet buffering in commodity switches (see my earlier post for more discussion on this issue).
To deal with the significant slowdown that resulted from synchronized loss in relatively small TCP windows, Facebook built a custom congestion-aware UDP-based transport that manages congestion across multiple requests rather than within a single connection. This optimization allowed Facebook to avoid, for example, the 200 ms timeouts associated with the loss of an entire window’s worth of data in TCP.
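One simple way to see the incast mitigation idea (this is a sketch of the general technique, not Facebook's actual transport) is to bound the number of requests in flight across the whole fan-out, so replies trickle back over time instead of hundreds arriving within the same ~40 microsecond burst and overflowing shallow commodity switch buffers:

```python
from concurrent.futures import ThreadPoolExecutor

def windowed_fetch(request_fn, servers, window=16):
    """Fan out request_fn to every server, but cap the number of
    outstanding requests at `window` so replies cannot all arrive in
    one synchronized burst. Returns {server: response}."""
    with ThreadPoolExecutor(max_workers=window) as pool:
        return dict(zip(servers, pool.map(request_fn, servers)))
```

The real system additionally paces and retransmits at the granularity of individual UDP datagrams, but the key design choice is the same: congestion is managed across the aggregate of requests rather than per connection.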
Authoritative Storage
Authoritative Facebook data is stored in a pool of MySQL servers. The overall experience with MySQL has been very positive at Facebook, with thousands of MySQL servers in multiple datacenters.  It is simple, fast, and reliable.  Facebook currently has 8,000 server-years of runtime experience without data loss or corruption.
Facebook has learned a number of lessons about data management:
  • Shared architecture should be avoided; there are no joins in the code.
  • Storing dynamically changing data in a central database should be avoided.
  • Similarly, heavily-referenced static data should not be stored in a central database.
There are a number of challenges with MySQL as well, including:
  • Logical migration of data is very difficult.
  • Create a large number of logical databases and load balance them over a varying number of physical nodes.
  • Easier to scale CPU on web tier than on the DB tier.
  • Data driven schemas make for happy programmers and difficult operations.
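The logical-database pattern in the list above can be sketched as follows: users hash onto many logical databases, and logical databases map onto however many physical nodes exist. Rebalancing then moves whole logical databases between machines without changing which logical database any user hashes to. This is an illustrative sketch, not Facebook's implementation:

```python
import zlib

def shard_map(num_logical, physical_nodes):
    """Map many logical dbs onto a (possibly smaller, varying) set of
    physical nodes. Moving a logical db only edits this mapping."""
    return {ldb: physical_nodes[ldb % len(physical_nodes)]
            for ldb in range(num_logical)}

def node_for_user(user_id, mapping, num_logical):
    """A user's key always hashes to the same logical db, regardless of
    how logical dbs are currently placed on physical machines."""
    ldb = zlib.crc32(user_id.encode()) % num_logical
    return mapping[ldb]
```

The indirection through the logical layer is what makes it "easier to scale CPU on the web tier than the DB tier" tolerable: capacity changes touch the mapping, not the key-to-shard function.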

Lots of examples of Facebook’s contribution back to open source here.

Given its global user population, Facebook eventually had to move to replicating its content across multiple data centers.  Facebook now runs two large data centers, one on the West coast of the US and one on the East coast.  However, this introduces the age-old problem of data consistency. Facebook adopts a primary/slave replication scheme where the West coast MySQL replicas are the authoritative stores for data.  All updates are applied to these master replicas and asynchronously replicated to the slaves on the East coast.  However, without synchronous updates, consecutive requests to the same data item from the same user can return inconsistent or stale results.
The approach taken at Facebook is to set a cookie on user update requests that redirects all subsequent requests from that user to the West coast master for some configurable time period, ensuring that read operations do not return inconsistent results.  More details on this approach are available on the Facebook blog.
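The cookie-based routing decision described above can be sketched in a few lines (names and the sticky window are invented for illustration; the real value is configurable and tied to replication lag):

```python
import time

STICKY_SECONDS = 20  # assumed window; in practice set >= replication lag

def handle_write(cookies):
    """On any update, stamp a cookie so this user's reads are pinned to
    the master until replication has likely caught up."""
    cookies["last_write"] = time.time()
    return "master"

def route_read(cookies):
    """Route reads to the local replica unless the user wrote recently,
    in which case go to the master to avoid stale results."""
    if time.time() - cookies.get("last_write", 0) < STICKY_SECONDS:
        return "master"
    return "local_replica"
```

This trades a little extra cross-country latency for recent writers in exchange for read-your-writes consistency without synchronous replication.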
Areas for future research at Facebook:
  • Load balancing
  • Middle tier: balance between programmer productivity and machine efficiency
  • Graph-based caching and storage systems
  • Search relevance via the social graph
  • Object discovery and ranking
  • Storage systems
  • Personalization
Jeff also relayed an interesting philosophy from Mark Zuckerberg: “Work fast and don’t be afraid to break things.”  Overall, the idea is to avoid working cautiously the entire year, delivering rock-solid code but not much of it.  A corollary: if you take the entire site down, it’s not the end of your career.

The Ever Changing Face of the Internet

Craig Labovitz made a very interesting presentation at the recent NANOG meeting on the most recent measurements from Arbor’s ATLAS Internet observatory.  ATLAS takes real-time Internet traffic measurements from 110+ ISPs, with real-time visibility into more than 14 Tbps of Internet traffic.  One of the things that makes working in and around Internet research so interesting (and gratifying) is that the set of problems are constantly changing because the way that we use the Internet and the requirements of the applications that we run on the Internet are constantly evolving.  The rate of evolution has thus far been so rapid that we constantly seem to be hitting new tipping points in the set of “burning” problems that we need to address.

Craig, currently Chief Scientist at Arbor Networks, has long been at the forefront of identifying important architectural challenges in the Internet.  His modus operandi has been to conduct measurement studies at a scale far beyond what might have been considered feasible at any particular point in time.  His paper on Delayed Internet Routing Convergence from SIGCOMM 2000 is a classic, among the first to demonstrate the problems with wide-area Internet routing using a 2-year study of the effects of simulated failure and repair events injected from a “dummy” ISP and the many peering relationships that MERIT enjoyed with TIER-1 ISPs.  The paper showed that Internet routing, previously thought to be robust to failure, would often take minutes to converge after a failure event as a result of shortcomings of BGP and the way that ISPs typically configured their border routers.  This paper spawned a whole cottage industry on research into improved inter-domain routing protocols.

This presentation had three high level findings on Internet traffic:

  • Consolidation of Content Contributors: 50% of Internet traffic now originates from just 150 Autonomous Systems (down from thousands just two years ago).  More and more content is being aggregated through big players and content distribution networks.  As a group, CDNs account for approximately 10% of Internet traffic.
  • Consolidation of Applications: The browser is increasingly running applications.  HTTP and Flash are the predominant protocols for application delivery.  One of the most interesting findings from the presentation is that P2P traffic as a category is declining fairly rapidly.  As a result of efforts by ISPs and others to rate-limit P2P traffic, in a strict “classifiable” sense (by port number), P2P traffic accounts for less than 1% of Internet traffic in 2009.  However, the actual number is likely closer to 18% when accounting for various obfuscation techniques.  Still, this is down significantly from estimates just a few years ago that 40-50% of Internet traffic consisted of P2P downloads.  Today, with a number of sites providing both paid and advertiser-supported audio and video content, the fraction of users turning to P2P for their content is declining rapidly.  Instead, streaming of audio and video over Flash/HTTP is one of the fastest growing application segments on the Internet.
  • Evolution of Internet Core: Increasingly, content is being delivered directly from providers to consumers without going through traditional ISPs.  Anecdotally, content providers such as Google, Microsoft, Yahoo!, etc. are peering directly with thousands of Autonomous Systems so that web content from these companies to consumers skips any intermediary tier-X ISPs in going from source to destination.
    When ranking ASes by the total amount of data either originated or transited, Google ranked 3rd and Comcast 6th in 2009, meaning that for the first time, a non-ISP ranked in the top 10.  Google accounts for 6% of Internet traffic, driven largely by YouTube videos.

Measurements are valuable in providing insight into what is happening in the network but also suggest interesting future directions.  I outline a few of the potential implications below:

  • Internet routing: with content providers taking on an ever larger presence in the Internet topology, one important question is the resiliency of the Internet routing infrastructure.  In the past, domains that wished to remain resilient to individual link and router failures would “multi-home” by connecting to two or more ISPs.  Content providers such as Google would similarly receive transit from multiple ISPs, typically at multiple points in the network.  However, with an increasing fraction of Internet content and “critical” services provided by an ever-smaller number of Internet sites and with these content providers directly peering with end customers rather than going through ISPs, there is the potential for reduced fault tolerance for the network as a whole.  While it is now possible for clients to receive better quality of service with direct connections to content providers, a single failure or perhaps a small number of correlated failures can potentially have much more impact on the resiliency of network services.
  • CDN architecture: The above trend can be even more worrisome if the cloud computing vision becomes reality and content providers begin to run on a small number of infrastructure providers.  Companies such as Google and Amazon are already operating their own content distribution networks to some extent and clearly they and others will be significant players in future cloud hosting services.  It will be interesting to consider the architectural challenges of a combined CDN and cloud hosting infrastructure.
  • Video is king: with an increasing fraction of Internet traffic devoted to video, there is significant opportunity in improved video and audio codecs, caching, and perhaps the adaptation of peer-to-peer protocols for fixed infrastructure settings.


The Blurring of Layer 2 and Layer 3

Back when I took my graduate course on computer networks (from the tremendous Domenico Ferrari at UC Berkeley), the material was still taught strictly based on the seven-layer OSI protocol stack.  Essentially, our textbook had one chapter for each of the seven layers.  The running joke about the OSI model is that no one understands exactly what layers 5 (the session layer) and 6 (the presentation layer) were all about.  In networking, we spend lots of time talking about layers 1, 2, 3, 4, and 7, but almost none about layers 5 and 6.  Recently, people have even started talking about layer 0, e.g., material scientists that create some of the physical substrates that support high levels of bandwidth on optical networks, and layer 8, the higher-level meaning that might be extracted from collections of applications and data, e.g., the Semantic Web.

What I have found interesting as of late however is that the line between two of the more well-defined layers, layer 2, the data link layer, and layer 3, the network (or internetwork) layer, has become increasingly blurred.  In fact, I would argue that much of the functionality that was traditionally relegated to either layer 2 or layer 3 has become blurred.  In the past, layer 2 was about getting data to/from hosts on the same physical network.  Layer 3 was about getting data among hosts on different physical networks.  Presumably, delivering data for hosts on, for instance, the same LAN segment should allow for simplifying assumptions relative to delivering data between networks.

However, technology forces have pushed us to a point where everything is about “inter-networking”.  A single physical LAN in isolation is just not interesting.  One would think that this would mean that layer 2 protocols would become increasingly marginalized and less important.  All the action should be at layer 3, because inter-networking is where all the action is.

However, just the opposite is in fact happening.  Just about all traditional layer 3/inter-networking functionality is migrating to layer 2 protocols.  So if one were to squint just a little bit, functionality at layer 2 and layer 3 is virtually indistinguishable and often duplicated.  Just as interesting perhaps is that layer 2 may in fact be the place where inter-networking takes place by default, at least within the campus, the enterprise, and the data center.  It would be too radical (for now) for me to make claims about it extending to the Internet as a whole, though a number of projects, including the 100×100 effort, have considered this very position.

Here, I will consider some of the reasons why inter-networking is migrating to layer 2.  There are at least two major forces at work here.

  • The first issue goes back to the original design of the Internet and its protocol suite.  The designers of the Internet made a crucial, and at the time entirely justified, design decision/optimization.  They used a host’s IP address to encode both its globally unique address and its hierarchical position in the global network.  That is, a host’s 32-bit IP address would be both the guaranteed unique handle for all potential senders and the basis for scalable routing/forwarding in Internet routers.  I recently heard a talk from Vint Cerf where he said that this was the one decision that he most wishes he could revisit. This design point was perfectly reasonable, and in fact a very nice optimization, as long as Internet hosts never, or at least very rarely, changed locations in the network.  As soon as hosts could move from physical network to physical network with some frequency, then conflating host location with host identity introduces a number of challenges.  And of course today, we have exactly this situation with WiFi, smart phones, and virtual machine migration.  The problem stems from the fact that scalable Internet routing relies on hierarchically encoding IP addresses.  All hosts on the same LAN share the same prefix in their IP address; all hosts in the same organization share the same (typically shorter) prefix; etc.

    When a host moves from one layer 2 domain (previously one physical network) to another layer 2 domain, it must change its IP address (or use fairly clumsy forwarding schemes originally developed to support IP mobility with home agents, etc.).  Changing a host’s IP address breaks all outstanding TCP connections to that host and of course invalidates all network state that remote hosts were maintaining regarding a supposedly globally unique name.  Of course, it is worth noting that when the Internet protocols were being designed in the 70’s, an optimization targeting the case where host mobility was considered to be rare was entirely justified and even very clever!

  • The second major force at work in pushing inter-networking functionality into layer 2 is the relative difficulty of managing large layer-3 networks.  Essentially, because of the hierarchy imposed on the IP address name space, layer 3 devices in enterprise settings have to be configured with the unique subnet number corresponding to the prefix the switches are uniquely responsible for.  Similarly, end hosts must be configured through DHCP to receive an IP address corresponding to the first hop switch they connect to.

It is for these reasons that network designers and administrators became interested in managing multiple physical networks as a single layer 2 domain, even going back to some of the original work on layer 2 bridging and spanning tree protocols. In an extended LAN, any host could be assigned any IP address and it could maintain its IP address as it moved from switch to switch.  For instance, consider a campus WiFi network.  Technically, each WiFi base station forms its own distinct physical network.  If each base station were to be managed as a separate LAN, then hosts moving from one base station to another would need to be assigned a new IP address corresponding to the new subnet.  Similarly, with the advent of virtualization in the enterprise and data center, it is no longer necessary for a host to physically migrate from one network to another.  For load balancing, planned upgrades, and thermal management, it is desirable to migrate virtual machines from one physical host to another.  Once again, migrating a virtual machine should not necessitate resetting the machine’s globally unique name.

Of course, putting inter-networking functionality into layer 2 comes with significant challenges, especially when considering “textbook” Ethernet, perhaps the most popular layer 2 network protocol:

  • Forwarding across LANs at layer 2 involves a single spanning tree that may result in sub-optimal routes and, worse, admits only a single path between each source and destination.
  • A number of support protocols, such as ARP, require broadcasting to the entire layer 2 domain, potentially limiting overall scalability.
  • Aggregation of forwarding entries becomes difficult/impossible because of flat MAC addresses increasing the amount of state in forwarding tables.  An earlier post discusses the memory limitations in modern switch hardware that makes this issue a significant challenge.
  • Forwarding loops can go on forever since layer 2 protocols do not have a TTL or Hop Count field in the header to enable looping packets to eventually be discarded.  This is especially problematic for broadcast packets.
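The aggregation problem in the list above is worth making concrete. With hierarchical layer 3 addresses, one longest-prefix-match entry covers every host in a subnet, so forwarding tables stay small; with flat 48-bit MAC addresses there is no exploitable structure, and a switch needs one entry per host. A minimal sketch (tables and ports are invented for illustration):

```python
import ipaddress

# One entry per subnet suffices in a hierarchical (layer 3) table...
L3_TABLE = [("10.1.0.0/16", "port1"), ("10.2.0.0/16", "port2")]
# ...while a flat layer 2 table needs one entry per MAC address:
L2_TABLE = {"02:00:00:00:00:01": "port1", "02:00:00:00:00:02": "port2"}

def longest_prefix_match(table, dst):
    """Return the output port for the most specific matching prefix,
    or None if no prefix covers the destination."""
    dst = ipaddress.ip_address(dst)
    best = None
    for prefix, port in table:
        net = ipaddress.ip_network(prefix)
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, port)
    return best[1] if best else None
```

Two /16 entries here stand in for up to 2^17 hosts; the equivalent flat MAC table would need 2^17 entries, which is exactly the switch-memory pressure described above.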

In a subsequent post, I will discuss some of the techniques being explored to address these challenges.

David vs. Goliath, UCSD vs. Microsoft?

In high school, I took journalism and worked on the school newspaper.  This means that I know the value of a good headline, that the headline may not reflect reality, and that the author of an article often does not write the headline.  In fact, the headline may not even reflect the contents of the article.  Still, I was surprised to see the headline to a recent Network World article, It’s Microsoft vs. the professors with competing data center architectures.  The article invokes an image of one side or the other throwing down the gauntlet and declaring war. In fact, nothing could be further from the truth.  If it were true, I would definitely feel worried about taking on Microsoft.

My group has been active in data center research.  The article does a very nice job of describing the architecture of PortLand, our recent work on a layer 2 network fabric designed to scale to very large data centers.  The article also describes recent work on VL2 from my colleagues at Microsoft Research.  Both papers appeared at SIGCOMM 2009 and were presented back-to-back in the same session at the conference.

I will leave detailed comparison between the two approaches to the papers themselves.  However, at a high level, both efforts start with a similar premise: the data center networking fabric, at the scale of 10k-100k ports, should be managed as a single network fabric.  One desirable goal here is to manage the network as a single layer 2 domain.  However, conventional wisdom dictates that you cannot go beyond a few hundred ports for a layer 2 domain because of scalability and performance problems with traditional layer 2 protocols.  I described one such scalability limitation, limited switch state for forwarding tables, in an earlier post.  There are other challenges, including spanning tree protocols and the broadcast overhead of ARP.

So the main takeaway is that we cannot scale a layer 2 network to target levels without changing some of the underlying protocols, at least a bit.  With perfect hindsight, the key difference between PortLand and VL2 is one of philosophy.  Both groups agree that the network should consist of unmodified switch hardware.  However, we believe that the end hosts should also remain unmodified, instead implementing new functionality by modifying switch software.  All switch hardware vendors export some API for programming switch forwarding tables, and recently systems such as OpenFlow have begun to standardize these APIs.  In fact, we implemented our prototype of PortLand using OpenFlow with the goal of maintaining the boundary between system and network administration.  VL2, on the other hand, prefers to leave the switch software unmodified and instead introduces its new functionality by modifying the end hosts themselves.  This leads to different architectural techniques and different designs.

One of our overriding goals is to reduce management burden, so we further introduce a decentralized Location Discovery Protocol (LDP) to automatically assign hierarchical prefixes to switches and end hosts.  These prefixes are the basis for compact forwarding tables in intermediate switches.  Both VL2 and PortLand leverage a directory service to essentially find an efficient path between a source and destination without resorting to broadcast (as would be required by default with ARP).
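The hierarchical prefixes that LDP assigns can be illustrated with a PortLand-style pseudo MAC (PMAC) encoding. The field layout below (16-bit pod, 8-bit position, 8-bit port, 16-bit vmid, packed into 48 bits) follows the SIGCOMM 2009 paper; the helper names are mine. Because every host in a pod shares the pod field, a core switch can forward on that prefix alone rather than storing per-host entries:

```python
def encode_pmac(pod, position, port, vmid):
    """Pack pod.position.port.vmid into a 48-bit hierarchical
    pseudo MAC address (PortLand-style layout)."""
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def pod_of(pmac):
    """Core switches need only this prefix to pick an output port."""
    return pmac >> 32
```

The fabric rewrites between a host's actual MAC and its PMAC at the edge, which is what lets unmodified end hosts keep their real addresses while the fabric forwards on compact hierarchical state.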

I consider the VL2 paper to be excellent.  I certainly learned a lot from reading the paper.  Perhaps the ultimate compliment I can give is that I plan to assign it to my class in the spring when I teach graduate computer networks again.

Still, it is true that one of the best things about research is that we live in a marketplace of ideas and hence there must be some implicit competition.  We can only get better knowing that the folks at Microsoft are working on similar problems and certainly the “truth” as ascertained with 20/20 hindsight in 5-10 years will consist of some mixture of the competing techniques.  That way, everyone can declare victory.

Yahoo!’s Geo-Replication Service, PNUTS

A few weeks ago, I had the chance to visit Yahoo! Research.  I had nice conversations with Brian Cooper and Raghu Ramakrishnan regarding their new storage infrastructure, PNUTS.  I had a great time during my visit and wanted to write a bit about PNUTS after going through their paper in more detail.  Their work is addressing what I consider to be an increasingly important problem, delivering applications to a global audience from data centers spread all across the planet.

Such geo-replication of application data is required because no single data center can provide the requisite levels of availability to clients, and because speed-of-light delays and wide-area network congestion make it impossible to deliver interactive response times to clients potentially halfway across the planet.

PNUTS’s goal is to provide a hosted storage infrastructure exporting a record-based API.  Clients may insert records into tables following a loose schema (not all columns have to be specified for all records).  Each record has a primary key and an assigned owner, used to deliver PNUTS’s consistency guarantees.  A table’s primary keys may be ordered or hashed, with ordering more naturally supporting range queries and hashing lending itself to load balancing.
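A toy sketch of why the ordered/hashed choice matters (this is not the PNUTS API; the function names and partition boundaries are invented for illustration): ordered keys keep adjacent records on the same partition, while hashing scatters them evenly.

```python
import bisect
import hashlib

# Ordered table: partitions cover contiguous key ranges, so a range query
# touches few partitions.
BOUNDARIES = ["g", "n", "t"]  # splits the key space into 4 partitions

def partition_ordered(key):
    return bisect.bisect_right(BOUNDARIES, key)

# Hashed table: keys spread uniformly, which balances load but scatters
# any key range across all partitions.
def partition_hashed(key, n_partitions=4):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_partitions

print(partition_ordered("alice"), partition_ordered("bob"))  # -> 0 0
```

Here a scan over keys "alice" through "bob" stays on one partition in the ordered layout, while the hashed layout would spread those same keys across partitions.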

Perhaps the primary question for any wide-area replication service is the consistency model.  Because the Yahoo! services leveraging PNUTS have strict performance requirements, the PNUTS designers deemed the overhead of providing strong consistency to be too high.  Instead, individual records export “timeline consistency.”  Essentially, all updates are forwarded to a per-record master.  Once the write is applied at the master, the synchronous portion of the write completes and success is returned to the client.  PNUTS then propagates the write asynchronously to the other data centers replicating the record.  While reads at remote data centers may return stale data, updates are ordered at the master (hence no conflicts) and pushed to remote replicas in order.
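The write path can be sketched in a few lines of Python (purely illustrative; the class names are mine, and the asynchronous push is modeled as an in-order, possibly delayed call):

```python
# Sketch of per-record timeline consistency: all writes serialize at the
# record's master, then propagate asynchronously, in order, to replicas.

class Replica:
    def __init__(self):
        self.version, self.value = 0, None

    def apply(self, version, value):
        if version > self.version:  # in-order delivery: no conflicts
            self.version, self.value = version, value

class RecordMaster:
    def __init__(self, replicas):
        self.version, self.value = 0, None
        self.replicas = replicas  # copies in remote data centers

    def write(self, value):
        # Synchronous portion: apply at the master, return success.
        self.version += 1
        self.value = value
        # Asynchronous portion (modeled inline here): push in version
        # order, so replicas see the same timeline, possibly delayed.
        for r in self.replicas:
            r.apply(self.version, value)
        return self.version

r1, r2 = Replica(), Replica()
master = RecordMaster([r1, r2])
master.write("v1")
master.write("v2")
print(r1.value, r1.version)  # -> v2 2
```

A read at a remote replica between the two pushes would have returned the stale "v1", but never an out-of-order or conflicting value.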

PNUTS aims to scale to ten+ wide-area data centers, each with 1,000 storage machines (petabyte-scale storage).  PNUTS targets record-based storage for online access and hence is complementary to storage systems such as HDFS that target batch-based analysis or other storage systems that target large “blob” storage (video, audio, etc.).

I find this space to be extremely interesting and really in its infancy.  Kudos to the folks at Yahoo! for being among the first to tackle this important space. There are a number of alternative techniques.  Dynamo from Amazon targets single data-center storage and exports an eventual consistency model that may leave updates applied “out of order” from the perspective of a client.  BigTable/GFS provide stronger consistency guarantees but synchronously apply updates to multiple replicas within the data center, making them less appropriate for geo-replication.

In my own group, we are also building a system targeting geo-replication across multiple wide-area data centers.  Our goal is to quantify the exact cost of strong consistency for web services leveraging data replicated across multiple data centers.  We feel that there are applications that would benefit from strong consistency, and that sacrificing it imposes significant additional complexity on those applications.  From Yahoo!’s internal measurements in the paper, we see that 85% of writes to a record originate from the same data center.  This certainly justifies locating the master at that data center.  However, with timeline consistency, the remaining writes must cross the wide area to reach the master anyway, making it difficult to enforce SLAs that are often set at the 99th or even 99.9th percentile.  Further, unavailability of the master either makes a record unavailable or imposes the need to “fork” the timeline, requiring the client application to potentially reconcile conflicting updates.

Can we architect a system that enforces strong consistency, delivers acceptable performance, and maintains availability in the face of any single replica failure?  Clearly, there will be cases where the answer will be a resounding “No!”  We want to understand the scenarios where achieving these properties is possible and, with the appropriate architecture and design, expand the space.

Update: I recently found another nice writeup on PNUTS here.

The Push Toward Cloud Computing and Mega Data Centers

The computing industry is evolving toward a world where an increasing fraction of global computation and storage will be delivered from a relatively small number of dense data centers.  I think this is yet another very exciting time to be in computing, as we are once again in the process of reinventing what the computing model will look like ten years out.  Here I will talk about some of the drivers behind this trend.  A comprehensive discussion of these issues and more can be found in James Hamilton’s excellent Perspectives.

The driving forces behind this evolution are economics and convenience.  From an economic perspective, energy costs, rather than initial capital outlay, can dominate the cost of operating computing and storage equipment.  In this environment, operating large numbers of computers in regions of the world with cheap and clean power can incur an order of magnitude less cost while simultaneously making computing more environmentally friendly. Similarly, a well-run, dense data center may require an order of magnitude fewer people to operate than similar amounts of computing spread across multiple organizations and physical locations.

Finally, most computers operate at less than 10% (and often less than 1%) overall utilization when measured over long time scales. The advent of virtualization technologies allows for the multiplexing of multiple logical virtual machines onto individual physical machines, allowing for more efficient hardware utilization and also enabling end users and organizations to pay only for the resources that they actually require, at fine granularity.  Using 100 computers for 1 hour costs the same as using 1 computer for 100 hours, without the need to procure and manage 100 computers for the long term.
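The arithmetic behind that last claim, under a hypothetical per-machine-hour price (the rate below is invented purely for illustration):

```python
# Pay-per-use sketch with an invented price: cost depends only on the
# machine-hours consumed, not on how they are spread across machines.

RATE_CENTS_PER_MACHINE_HOUR = 10  # hypothetical price

def cost_cents(machines, hours):
    return machines * hours * RATE_CENTS_PER_MACHINE_HOUR

print(cost_cents(100, 1), cost_cents(1, 100))  # -> 1000 1000
```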

Much of the computation running in data centers today runs in parallel on hundreds, thousands, or even tens of thousands of processors.  A simple search request to Google, Yahoo!, etc. runs in parallel across a multi-petabyte dataset on thousands of computers.  Results must be returned interactively (e.g., in less than 300 ms) for queries that require significant computation and communication.  At the other end of the interactivity spectrum, companies wish to run very large-scale data processing and data mining on their own petabyte-scale datasets.  For example, consider queries running over all the items stocked and sold by Walmart.

My group has become very interested in some of the issues in:

  • building out large-scale data centers, particularly the data center network infrastructure;
  • programming and managing applications running across multiple wide area data centers.

I will summarize some of our work in these areas in subsequent posts.

Amin Vahdat is a Professor in Computer Science and Engineering at UC San Diego.
