Yahoo!’s Geo-Replication Service, PNUTS

A few weeks ago, I had the chance to visit Yahoo! Research.  I had nice conversations with Brian Cooper and Raghu Ramakrishnan regarding their new storage infrastructure, PNUTS.  I had a great time during my visit and wanted to write a bit about PNUTS after going through their paper in more detail.  Their work is addressing what I consider to be an increasingly important problem, delivering applications to a global audience from data centers spread all across the planet.

Such geo-replication of application data is required because no single data center can provide the requisite levels of availability to clients, and because speed-of-light delays and wide-area network congestion make it impossible to deliver interactive response times to clients potentially halfway across the planet.

PNUTS’s goal is to provide a hosted storage infrastructure exporting a record-based API.  Clients may insert records into tables following a loose schema (not all columns have to be specified for all records).  Each record has a primary key and an assigned owner, which are used to deliver PNUTS’s consistency guarantees.  A table’s primary keys may be ordered or hashed, with ordering more naturally supporting range queries and hashing lending itself to load balancing.
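
To make the API concrete, here is a minimal sketch of what a loosely schematized, record-based table with ordered or hashed primary keys might look like.  The class and method names are hypothetical illustrations, not the actual PNUTS client interface.

```python
# Hypothetical sketch of a PNUTS-style record API (illustrative names only).
import bisect
import hashlib

class Table:
    """A loosely schematized table: records are dicts keyed by a primary key."""

    def __init__(self, name, key_layout="hashed", num_partitions=16):
        self.name = name
        self.key_layout = key_layout        # "hashed" or "ordered"
        self.num_partitions = num_partitions
        self.records = {}                   # primary key -> record (dict)

    def put(self, key, **columns):
        # Not all columns need to be present for every record (loose schema).
        self.records.setdefault(key, {}).update(columns)

    def get(self, key):
        return self.records.get(key)

    def partition_of(self, key):
        if self.key_layout == "hashed":
            # Hashing spreads keys evenly, which helps load balancing.
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            return h % self.num_partitions
        # An ordered layout keeps adjacent keys together, so range scans
        # touch only a few partitions.
        boundaries = [chr(ord("a") + i) for i in range(1, self.num_partitions)]
        return bisect.bisect_right(boundaries, key[:1].lower())

    def range_scan(self, lo, hi):
        # Natural for ordered tables; expensive for hashed ones.
        return {k: v for k, v in sorted(self.records.items()) if lo <= k < hi}

users = Table("users", key_layout="ordered")
users.put("alice", home_dc="west", email="alice@example.com")
users.put("bob", home_dc="east")            # missing columns are fine
print(users.range_scan("a", "c"))
```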

Perhaps the primary question for any wide-area replication service is the consistency model.  Because the Yahoo! services leveraging PNUTS have strict performance requirements, the PNUTS designers deemed the overhead of providing strong consistency to be too high.  Instead, individual records export “timeline consistency.”  Essentially, all updates are forwarded to a per-record master.  Once the write is applied at the master, the synchronous portion of the write completes and success is returned to the client.  PNUTS then propagates the write asynchronously to the other data centers replicating the record.  While reads at remote data centers may return stale data, updates are ordered at the master (hence no conflicts) and pushed to remote replicas in order.
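
The following is a minimal sketch of that write path, assuming a single master replica per record and an in-memory replication queue; PNUTS itself propagates updates through a pub/sub message broker, so everything below is illustrative rather than a description of the real implementation.

```python
# Sketch of per-record timeline consistency (illustrative only).
from collections import defaultdict

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}                     # key -> (version, value); may lag
        self.versions = defaultdict(int)   # per-key version counters

    def read_any(self, key):
        # May return stale data, but never an out-of-order version.
        return self.data.get(key)

class RecordMaster:
    """All writes for a record funnel through its master replica."""

    def __init__(self, master, others):
        self.master = master
        self.others = others
        self.pending = []                  # async replication queue, FIFO

    def write(self, key, value):
        # Synchronous part: apply at the master, then return to the client.
        self.master.versions[key] += 1
        version = self.master.versions[key]
        self.master.data[key] = (version, value)
        self.pending.append((key, version, value))
        return version

    def propagate(self):
        # Asynchronous part: push updates to remote replicas in order,
        # so every replica sees the same per-record timeline of versions.
        while self.pending:
            key, version, value = self.pending.pop(0)
            for replica in self.others:
                replica.data[key] = (version, value)

west, east = Replica("west"), Replica("east")
record = RecordMaster(master=west, others=[east])
record.write("alice", {"status": "hello"})
print(east.read_any("alice"))   # None (stale) until propagation runs
record.propagate()
print(east.read_any("alice"))   # now (1, {'status': 'hello'})
```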

PNUTS aims to scale to ten+ wide-area data centers, each with 1,000 storage machines (petabyte-scale storage).  PNUTS targets record-based storage for online access and hence is complementary to storage systems such as HDFS that target batch-based analysis or other storage systems that target large “blob” storage (video, audio, etc.).

I find this space to be extremely interesting and really in its infancy.  Kudos to the folks at Yahoo! for being among the first to tackle this important space. There are a number of alternative techniques.  Dynamo from Amazon targets single data-center storage and exports an eventual consistency model that may leave updates applied “out of order” from the perspective of a client.  BigTable/GFS provide stronger consistency guarantees but synchronously apply updates to multiple replicas within the data center, making them less appropriate for geo-replication.

In my own group, we are also building a system targeting geo-replication across multiple wide-area data centers.  Our goal is to quantify the exact cost of strong consistency for web services leveraging data replicated across multiple data centers.  We feel that there are applications that would benefit from strong consistency, and that giving it up comes at the price of significant additional complexity.  From Yahoo!’s internal measurements reported in the paper, we see that 85% of writes to a record originate from the same data center.  This certainly justifies locating the master at that location.  However, with timeline consistency, the remaining writes must cross the wide area to reach the master anyway, making it difficult to enforce SLAs that are often set at the 99th or even 99.9th percentile.  Further, unavailability of the master either makes a record unavailable or imposes the need to “fork” the timeline, potentially requiring the client application to reconcile conflicting updates.
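
As a back-of-the-envelope illustration of the percentile argument (the latency numbers below are my own assumptions, not measurements from the paper): if roughly 15% of a record’s writes must cross the wide area to reach its master, the tail of the write-latency distribution is set by the wide-area round trip.

```python
# Back-of-the-envelope sketch: assumed latencies, not measured PNUTS numbers.
import random

LOCAL_MS, WIDE_AREA_MS = 5, 100      # assumed local vs. wide-area write latency
REMOTE_WRITE_FRACTION = 0.15         # ~15% of writes originate remotely (per the paper)

def simulated_write_latency():
    # Writes from the master's data center stay local; the rest cross the WAN.
    if random.random() < REMOTE_WRITE_FRACTION:
        return WIDE_AREA_MS
    return LOCAL_MS

samples = sorted(simulated_write_latency() for _ in range(100_000))
p99 = samples[int(0.99 * len(samples))]
print(f"99th-percentile write latency: {p99} ms")   # dominated by the WAN hop
```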

Can we architect a system that enforces strong consistency, delivers acceptable performance, and maintains availability in the face of any single replica failure?  Clearly, there will be cases where the answer will be a resounding “No!”  We want to understand the scenarios where achieving these properties is possible and, with the appropriate architecture and design, expand the space.

Update: I recently found another nice writeup on PNUTS here.

5 Responses to “Yahoo!’s Geo-Replication Service, PNUTS”


  1. rick parker, August 18, 2009 at 5:46 pm

    An alternative consideration is: what is the best size for a datacenter in number of servers? Of course the answer is that it depends on the application, but in most cases I propose that a larger number of smaller datacenters is better than a smaller number of larger datacenters, i.e., a primary and backup / RAID 1. (RAID 1 is a level of redundancy for a Redundant Array of Inexpensive Disks.) What I am proposing is RAIC at the datacenter level (Redundant Array of Inexpensive Clouds / Datacenters) that supports geographic redundancy. I think this can be done by combining WAN optimizers (Riverbed) and virtual servers.

    Please let me know what you think

    • aminvahdat, August 18, 2009 at 6:42 pm

      Thanks for your comment Rick. I think the question of “right-sizing” data centers is an important one, and one that likely deserves its own post. Individual applications are unlikely to require more than a few thousand nodes at any given time today. However, it is likely that individual data centers will want to run multiple applications to leverage statistical multiplexing and the associated efficiencies. Right now, there is a push toward increasing consolidation in “mega” data centers because of the increased efficiencies of scale. At the very least, we will want multiple data centers and geo-replication to deal with individual data center failures or network failures that prevent a fraction of the global client population from accessing a given data center. Put another way, geo-replication is required because even if someone manages to build a data center with infinite computation/storage and an infinitely fast connection to the rest of the Internet, it will still not provide the requisite levels of performance and availability. Latency will still be a limiting factor, as will the increasing probability of encountering congestion as a global client base attempts to access the centralized service.

      Now, the question is: what is “smaller” and what is “larger” in terms of right-sizing individual data centers? My sense is that, right now, it certainly makes sense to target individual data centers with 10k-100k servers and to perform geo-replication across multiple sites. There are some scenarios (video distribution, online gaming, etc.) where even wider distribution of smaller data centers may make sense, since latency and low jitter are even more important there.

      • rick parker, August 18, 2009 at 9:58 pm

        What I am interested in is virtualizing everything: firewalls, routers, switches, tape libraries, SSL VPN systems. I have found there are models of all these products that already support or can be virtualized. What this means is being able to divide up mega physical datacenters into virtual datacenters. I think 200-300 virtual servers is about the maximum for a virtual datacenter to be able to support relatively fast failover.

  2. Rajaram Satyanarayanan, November 11, 2009 at 12:26 pm

    Hi,
    For streaming of large blob data or DW/analytical queries, data could be searched or retrieved in chunks.
    A better approach may be to use Bloom filters together with historical and current network/system response-latency attributes. This statistically computed value could be fed to a run-time optimizer to shard queries, creating a “best fit” sharded query for execution on a geo-located, distributed data center database. Other attributes like cache hit/miss ratio could also be factored in if available and not too stale, or a data reduction technique could be used to average out certain parameters that could affect the QoS requirement for query response or throughput.
    Just some thoughts, feedback is appreciated
    Raj


  1. Trackback: Reading diary, 29 August 2012 | skipperkongen.dk, August 29, 2012 at 3:30 am



