A few weeks ago, I had the chance to visit Yahoo! Research. I had nice conversations with Brian Cooper and Raghu Ramakrishnan regarding their new storage infrastructure, PNUTS. I had a great time during my visit and wanted to write a bit about PNUTS after going through their paper in more detail. Their work is addressing what I consider to be an increasingly important problem, delivering applications to a global audience from data centers spread all across the planet.
Such geo-replication of application data is required because no single data center can provide requisite levels of availability to clients and because speed of light delays and wide-area network congestion make it impossible to deliver interactive response times for clients potentially half ways across the planet.
PNUTS goal is to provide a hosted storage infrastructure exporting a record-based API. Clients may insert records into tables following a loose scheme (not all columns have to be specified for all records). Each record has a primary key and an assigned owner, used to deliver PNUTS’s consistency guarantees. A table’s primary keys may be ordered or hashed, with ordering more naturally supporting range queries and hashing lending itself to load balancing.
Perhaps the primary question for any wide-area replication service is the consistency model. Because the Yahoo! services leveraging PNUTS have strict performance requirements, the PNUTS designers deemed the overhead of providing strong consistency to be too high. Instead, individual individual records export “timeline consistency.” Essentially, all updates are forwarded to a per-record master. Once the write is applied at the master, the synchronous portion of a write completes and success is returned to the client. PNUTS then propagates the writes asynchronously to the other data centers replicating the record. While reads to remote data centers may return stale data, updates will be ordered at the master (hence no conflicts) and pushed to remote replicas in order.
PNUTS aims to scale to ten+ wide-area data centers, each with 1,000 storage machines (petabyte-scale storage). PNUTS targets record-based storage for online access and hence is complementary to storage systems such as HDFS that target batch-based analysis or other storage systems that target large “blob” storage (video, audio, etc.).
I find this space to be extremely interesting and really in its infancy. Kudos to the folks at Yahoo! for being among the first to tackle this important space. There are a number of alternative techniques. Dynamo from Amazon targets single data-center storage and exports an eventual consistency model that may leave updates applied “out of order” from the perspective of a client. BigTable/GFS provide stronger consistency guarantees but synchronously apply updates to multiple replicas within the data center, making them less appropriate for geo-replication.
In my own group, we are also building a system targeting geo-replication across multiple wide-area data centers. Our goal is to quantify the exact costs of strong consistency is for web services leveraging data replicated across multiple data centers. We feel that there are applications that would benefit from strong consistency and sacrifice it in terms of significant additional complexity. From Yahoo!’s internal measurements from the paper, we see that 85% of writes to a record originate from the same data center. This certainly justifies locating the master at this location. However, with timeline consistency, the remaining writes must go across the wide-area anyway to the master, making it difficult to enforce SLAs that are often set at the 99 or even 99.9%. Further, unavailability of the master makes a record either unavailable or imposes the need to “fork” the timeline, requiring the client application to potentially reconcile conflicting updates.
Can we architect a system that enforces strong consistency, delivers acceptable performance, and maintains availability in the face of any single replica failure? Clearly, there will be cases where the answer will be a resounding “No!” We want to understand the scenarios where achieving these properties is possible and, with the appropriate architecture and design, expand the space.
Update: I recently found another nice writeup on PNUTS here.