Archive for the 'storage' Category

Gray Sort: The Most Fun I’ve Ever Had with (a few racks of) Computers

Things have been quiet on the blog, but not because there has not been a lot to say.  In fact, so much has been happening that I have not had the idle cycles to write about it.  However, I do want to highlight some of the interesting things that have taken place over the past few months.

There has been significant recent interest in large-scale data processing.  Many would snicker that this is far from a new problem; indeed, the database community has been pioneering this space for decades.  However, I believe there has been a real uptick in commercial interest, for example to index and analyze the wealth of information available on the Internet or to process the billions of requests per day made to popular web services.  MapReduce and open source tools like Hadoop have significantly intensified the debate over the right way to perform large-scale data processing (see my earlier post on this topic).

This recent trend, combined with my group’s focus on data center networking (and a healthy dose of naivete), led us to go after the world record in data sorting.  The team recently set the records for both Gray Sort (fastest time to sort 100 TB of data) and Minute Sort (most data sorted in one minute) in the “Indy” category.  See the sort benchmark page for details.  This has been one of the most gratifying projects I have ever been involved with.  The work was of course really interesting, but the best part was seeing the team (Alex Rasmussen, Radhika Niranjan Mysore, Harsha V. Madhyastha, Alexander Pucher, Michael Conley, and George Porter) go after a really challenging problem.  While some members of the team would disagree, it was also at least interesting to set the records with just minutes to spare before the 2010 deadline.

Our focus in this work was not so much to set the record (though we are happy to have done so) but to achieve high levels of efficiency while operating at scale.  Recently, setting the sort record has largely been a test of how much computing hardware an organization could throw at the problem, often sacrificing per-server efficiency.  For example, Yahoo’s record for Gray Sort used an impressive 3452 servers to sort 100 TB of data in less than 3 hours.  However, per-server throughput worked out to less than 3 MB/s, a factor of 30 less bandwidth than is available from even a single disk.  Large-scale data sorting involves carefully balancing all per-server resources (CPU, memory capacity, disk capacity, disk I/O, and network I/O), all while maintaining overall system scale.  We wanted to determine the limits of a scalable and efficient data processing system.  Given current commodity server capacity, is it feasible to run at 30 MB/s or 300 MB/s per server?  That is, could we reduce the required number of machines for sorting 100 TB of data by a factor of 10 or even 100?
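
To make those per-server numbers concrete, here is a quick back-of-the-envelope calculation in Python.  The ~173-minute runtime for the prior record run is my own assumption (the text above only says “less than 3 hours”), so treat the output as illustrative rather than exact.

    # Rough per-server throughput for the prior Gray Sort record (illustrative).
    MB_PER_TB = 10**6            # decimal TB, as used informally in this post

    servers = 3452
    data_mb = 100 * MB_PER_TB    # 100 TB expressed in MB
    runtime_s = 173 * 60         # assumed runtime of ~173 minutes ("less than 3 hours")

    per_server = data_mb / (servers * runtime_s)
    print(f"~{per_server:.1f} MB/s per server")               # roughly 2.8 MB/s

    # Servers needed for the same job at higher per-server rates.
    for rate in (30, 300):                                    # MB/s per server
        print(f"at {rate} MB/s/server: ~{data_mb / (rate * runtime_s):.0f} servers")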

The interesting thing about large-scale data sorting is that it exercises all aspects of the computer system:

  • CPU: sorting requires the O(n log n) comparison work.  While not the most compute-intensive application, CPU requirements nonetheless cannot be ignored.
  • Disk bandwidth: earlier work proves that an external memory sort (the case where the data set is larger than aggregate physical memory) requires at least two reads and two writes of the data.  One of the banes of system efficiency is the orders-of-magnitude difference in I/O performance between sequential and random disk I/O.  A key requirement for high-performance sort is ensuring that the disks are performing sequential I/O (either read or write) nearly continuously; a minimal sketch of this two-pass structure appears after this list.
  • Disk capacity: sorting 100 TB of data requires at least 200 TB of storage, or 300 TB if the input data cannot be erased.  While not an enormous amount of data by modern standards, simply storing this much data is an interesting systems challenge in its own right.
  • Memory capacity: certainly in our architecture, and perhaps fundamentally, ensuring streaming I/O while simultaneously limiting the number of disk operations to 2 reads and 2 writes per tuple requires a substantial amount of memory and careful memory management to buffer data in preparation for large, contiguous writes to disk.
  • Network bandwidth: in a parallel sort system, data must be shuffled in an all-to-all manner across all servers.  Saturating the available per-server CPU and storage capacity requires significant network bandwidth, approaching 10 Gb/s of sustained network throughput per server in our configuration.
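
To make the two-pass structure mentioned in the disk bandwidth bullet concrete, here is a minimal single-machine sketch in Python: pass one reads the input once and writes sorted runs sequentially, pass two reads every run once and writes the merged output once.  It is only a toy; the real system overlaps these phases with the network shuffle, manages buffers explicitly, and spreads I/O across many disks, none of which appears here.

    import heapq, os, tempfile

    def external_sort(input_path, output_path, chunk_bytes=64 * 2**20):
        """Toy two-pass external sort over newline-delimited records: read and
        write the data exactly twice, using only sequential I/O in each pass."""
        run_paths = []

        # Pass 1: read the input sequentially, writing sorted runs sequentially.
        with open(input_path, "rb") as f:
            while True:
                chunk = f.readlines(chunk_bytes)      # up to ~chunk_bytes of records
                if not chunk:
                    break
                chunk.sort()
                fd, path = tempfile.mkstemp()
                with os.fdopen(fd, "wb") as run:
                    run.writelines(chunk)
                run_paths.append(path)

        # Pass 2: stream-merge the sorted runs into the final, fully sorted output.
        runs = [open(p, "rb") for p in run_paths]
        try:
            with open(output_path, "wb") as out:
                out.writelines(heapq.merge(*runs))
        finally:
            for r in runs:
                r.close()
            for p in run_paths:
                os.remove(p)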

Managing the interaction of these disparate resources, along with parallelism both within a single server and across a cluster of machines, was far more challenging than we anticipated.  Our goal was to use commodity servers to break the sort record while focusing on high efficiency.  We constructed a cluster of servers with dual-socket, four-core Intel processors, initially 12 GB RAM (later upgraded to 24 GB RAM once we realized we could not maintain sequential I/O with just 12 GB per server), a 2x10GE NIC (only one port active for the experiment), and sixteen 500 GB drives.  The number of hard drives per server was key to delivering high levels of performance.  Each of our drives could sustain approximately 100 MB/s of sequential read or write throughput.  We knew that, in the optimal case (see this paper), we would read and write the data twice in two discrete phases separated by a barrier.  So, if we managed everything perfectly, in the first phase we would read data from 8 drives at an aggregate rate of 800 MB/s (8*100 MB/s) while simultaneously writing it out to the remaining 8 disks at an identical rate.  In the second phase, we would similarly read the data at 800 MB/s while writing the fully-sorted data out at 800 MB/s.  Once again, in the best case, we would average 400 MB/s of sorting per server.

Interestingly, the continuing chasm between CPU performance and disk I/O (even in the streaming case) means that building a “balanced” data-intensive processing cluster requires a large number of drives per server to maintain overall system balance. While 16 disks per server seems large, one conclusion of our work is that servers dedicated to large-scale data processing should likely have even more disks.  At the same time, significant work needs to be done in the operating system and disk controllers to harness the I/O bandwidth available from such large disk arrays in a scalable fashion.

Our initial goal was to break the record with just 30 servers.  This would correspond to 720 GB/min assuming 400 MB/s/server, allowing us to sort 100 TB of data in ~138 minutes. We did not quite get there (yet); our record-setting runs were on a 48-server configuration. For our “certified” record-setting run, we ran at 582 GB/min on 48 servers, or 200 MB/s/server.  This corresponds to 50% of the maximum efficiency/capacity of our underlying hardware.  Since the certified experiments, we have further tuned our code to sort at ~780 GB/min aggregate or 267 MB/s/server. These newest runs correspond to ~67% efficiency.  Now obsessed with squeezing the last ounce of efficiency from the system, we continue to target >90% efficiency or more than 1 TB/min of sorting on 48 machines.
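
For reference, the efficiency percentages above follow directly from the aggregate rates (the small mismatches are just rounding in the figures quoted above):

    # Per-server rate and efficiency implied by the aggregate sort rates above.
    servers = 48
    ceiling_mbps = 400          # best-case MB/s per server from the disk analysis above

    for aggregate_gb_per_min in (582, 780):
        per_server = aggregate_gb_per_min * 1000 / 60 / servers      # MB/s per server
        print(f"{aggregate_gb_per_min} GB/min -> {per_server:.0f} MB/s/server, "
              f"{per_server / ceiling_mbps:.0%} of the 400 MB/s ceiling")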

While beyond the scope of this post, it has been very interesting just how much we had to do for even this level of performance.  In no particular order:

  • We had to revise, redesign, and fine-tune both our architecture and implementation multiple times.  There is no single right architecture, because the right technique varies with evolving hardware capabilities and balance.
  • We had to experiment with multiple file systems and file system configurations before settling on ext4.
  • We were bitten multiple times by the performance and caching behavior of our hardware RAID controllers.
  • While our job overall is not CPU bound, thread scheduling and core contention became a significant issue.  In the end, we had to come up with our own custom core allocation, bypassing the Linux kernel’s default approach (a minimal affinity sketch appears after this list).  One interesting requirement was avoiding the core that, by default, performed most of the in-kernel system call work.
  • Performing all-to-all communication at near 10 Gb/s, even among 48 hosts on a single switch, is an unsolved challenge to the best of our knowledge.  We had to resort to brittle and arcane socket configuration to sustain even ~5 Gb/s.
  • We had to run with virtual memory disabled because the operating system’s memory management behaved in unexpected ways close to capacity.  Of course, with virtual memory disabled, we had to tolerate kernel panics if we were not careful about memory allocation.
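
As a small illustration of the core-allocation point above, the sketch below pins worker processes to explicit cores and keeps them off one reserved core.  This is a generic sched_setaffinity sketch under my own assumptions (including the choice of core 0 as the reserved core), not our actual allocator.

    import os
    import multiprocessing as mp

    RESERVED_CORE = 0     # illustrative: keep workers off the core handling kernel work

    def worker(core_id):
        # Pin this process to a single core rather than letting the scheduler migrate it.
        os.sched_setaffinity(0, {core_id})
        # ... per-core sorting / I/O work would go here ...

    if __name__ == "__main__":
        cores = [c for c in range(os.cpu_count()) if c != RESERVED_CORE]
        workers = [mp.Process(target=worker, args=(c,)) for c in cores]
        for w in workers:
            w.start()
        for w in workers:
            w.join()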

In the end, simultaneously addressing these challenges turned out to be a lot of fun, especially with a great group of people working on the project.  Large-scale sort exercises many aspects of the operating system, the network protocol stack, and distributed systems.  It is far from trivial, but it is also simple enough to (mostly) keep in your head at once. In addition to improving the efficiency of our system, we are also working to generalize our infrastructure to arbitrary MapReduce-style computation. Fundamentally, we are interested to determine how much efficiency and scale we can maintain in a general-purpose data processing infrastructure.

Presentation Summary “High Performance at Massive Scale: Lessons Learned at Facebook”

Recently, we were fortunate to host Jeff Rothschild, the Vice President of Technology at Facebook, for a visit for the CNS lecture series.  Jeff’s talk, “High Performance at Massive Scale: Lessons Learned at Facebook,” was highly detailed, providing real insights into the Facebook architecture.  Jeff spoke to a packed house of faculty, staff, and students interested in the technology and research challenges associated with running an Internet service at scale.  The talk is archived here as part of the CNS lecture series.  I encourage you to check it out; below are my notes on the presentation.
Site Statistics:
  • Facebook is the #2 property on the Internet as measured by the time users spend on the site.
  • Over 200 billion monthly page views.
  • >3.9 trillion feed actions processed per day.
  • Over 15,000 websites use Facebook content
  • In 2004, the curve of user population as a function of time showed exponential growth to 2M users.  Five years later, Facebook has stayed on the same exponential curve, now with >300M users.
  • Facebook is a global site, with 70% of users outside of the US.
  • Today, there are 1.3B people in the world with quality Internet connectivity, so there is at least another factor of 4 growth that Facebook is going after.  Jeff presented statistics for the number of users each engineer supports at a variety of high-profile Internet companies: 1.1M for Facebook, 190,000 for Google, 94,000 for Amazon, and 75,000 for Microsoft.
Photo sharing on Facebook:
  • Facebook stores 20 billion photos in 4 resolutions
  • 2-3 billion new photos uploaded every month
  • Originally provisioned photo storage for 6 months, but blew through available storage in 1.5 weeks.
  • Facebook serves 600k photos/second; serving them is more difficult than storing them.
Scaling photos, first the easy way:
  • Upload tier: handles uploads, scales the images, and stores them on the NFS tier.
  • Serving tier: Images are served from NFS via HTTP
  • NFS Storage tier built from commercial products
  • Filesystems aren’t really good at supporting large numbers of files
Scaling photos, 2nd generation:
  • Cachr: cache the high volume smaller images to offload the main storage systems.
  • Only 300M images in 3 resolutions
  • Distribute these through a CDN to reduce network latency.
  • Cache them in memory.
Scaling photos, 3rd Generation System: Haystack
  • How many I/Os do you need to serve an image?  Originally, 10 I/Os at Facebook because of the complex directory structure.
  • Optimizations got it down to 2-4 I/Os per file served.
  • Facebook built a better version called Haystack by merging multiple files into a single large file. In the common case, serving a photo now requires 1 I/O operation.  Haystack is available as open source.
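The core idea can be sketched roughly as follows: append many photos into one large file and keep a compact in-memory index of (offset, size), so a read needs only a single seek and read.  This is just a toy illustration of the technique as I understood it, not Haystack’s actual on-disk format or API.

    class PhotoStore:
        """Toy haystack-style store: one big append-only file plus an in-memory
        index, so serving a photo costs a single seek + read."""

        def __init__(self, path):
            self.f = open(path, "a+b")
            self.index = {}                          # photo_id -> (offset, size)

        def put(self, photo_id, data):
            self.f.seek(0, 2)                        # append at the end of the big file
            offset = self.f.tell()
            self.f.write(data)
            self.index[photo_id] = (offset, len(data))

        def get(self, photo_id):
            offset, size = self.index[photo_id]      # no directory walk, no metadata I/O
            self.f.seek(offset)
            return self.f.read(size)                 # one I/O in the common case

    store = PhotoStore("photos.dat")
    store.put("p1", b"...jpeg bytes...")
    print(store.get("p1"))
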
Facebook architecture consists of:
  • Load balancers sit at the front end; requests are distributed to web servers, which retrieve the actual content from a large memcached layer because of the latency requirements for individual requests.
  • Presentation Layer employs PHP
  • Simple to learn: small set of expressions and statements
  • Simple to write: loose typing and universal “array”
  • Simple to read
But this comes at a cost:
  • High CPU and memory consumption.
  • C++ interoperability is challenging.
  • PHP does not encourage good programming in the large (at 3M lines of code it is a significant organizational challenge).
  • Initialization cost of each page scales with size of code base
Thus, Facebook engineers implemented a number of optimizations to PHP:
  • Lazy loading
  • Cache priming
  • More efficient locking semantics for variable cache
  • Memcache client extension
  • Asynchronous event-handling
Back-end services that require the performance are implemented in C++.  Services philosophy:
  • Create a service iff required.
  • Real overhead for deployment, maintenance, separate code base.
  • Another failure point.
  • Create a common framework and toolset that will allow for easier creation of services: Thrift (open source).
A number of things break at scale, one example: syslog
  • Became impossible to push large amounts of data through the logging infrastructure.
  • Implemented Scribe for logging.
  • Today, Scribe processes 25TB of messages/day.
Site Architecture
Overall, Facebook currently runs approximately 30k servers, with the bulk of them acting as web servers.
The Facebook web server, running PHP, is responsible for retrieving all of the data required to compose the web page.  The data itself is stored authoritatively in a large cluster of MySQL servers.  However, to hit performance targets, most of the data is also stored in memory across an array of memcached servers.  For traditional websites, each user interacts with his or her own data, and for most sites only 1-2% of registered users concurrently access the site at any given time.  Thus, the site only needs to cache 1-2% of all data in RAM.  However, data at Facebook is deeply interconnected; each user is interested in the state of hundreds of other users.  Hence, even with only 1-2% of the user population active at any given time, virtually all data must still be available in RAM.
Memcache
Data partitioning was easy when Facebook was a college web site: simply partition data at the level of individual colleges.  After considering a variety of data clustering algorithms, the engineers found that there was very little win for the additional complexity of clustering.  So at Facebook, user data is randomly partitioned across individual databases and machines in the cluster.  Hence, each user access requires retrieving data corresponding to user state spread across hundreds of machines.  Intra-cluster network performance is therefore critical to site performance.  Facebook employs memcache to store the vast majority of user data in memory, spread across thousands of machines in the cluster.  In essence, the nodes maintain a distributed hash table to determine the machine responsible for a particular user’s data.  Hot data from MySQL is stored in the cache.  The cache supports get/set/incr/decr and multiget/multiset operations.
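A minimal sketch of the client-side idea: hash each key to pick the responsible cache server, and batch lookups per server with a multiget.  The server names and hashing scheme here are purely illustrative, not Facebook’s client.

    import hashlib
    from collections import defaultdict

    SERVERS = ["mc001", "mc002", "mc003"]            # hypothetical memcache hosts

    def server_for(key):
        # Hash the key to determine the machine responsible for it.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return SERVERS[h % len(SERVERS)]

    def multiget(keys):
        # Group keys by server so each server receives one batched request.
        by_server = defaultdict(list)
        for k in keys:
            by_server[server_for(k)].append(k)
        results = {}
        for server, ks in by_server.items():
            # A real client would issue a single get_multi RPC to `server` here.
            results.update({k: f"<value of {k} from {server}>" for k in ks})
        return results

    print(multiget([f"user:{i}:profile" for i in range(5)]))
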
Initially, the architecture needed to support 15-20k requests/sec/machine, but that number has scaled to approximately 250k requests/sec/machine today.  Servers have gotten faster, which helps to some extent, but Facebook engineers also had to perform some fundamental re-engineering of memcached to improve its performance.  System performance improved from 50k requests/sec/machine to 150k, to 200k, and then to 250k by adding multithreading, polling device drivers, stats locking, and batched packet handling, respectively.  In aggregate, memcache at Facebook processes 120M requests/sec.
Incast
One networking challenge with memcached was so-called network incast.  A front-end web server collects responses from hundreds of memcache machines in parallel to compose an individual HTTP response, and all of those responses come back within the same window of approximately 40 microseconds.  Hence, while overall network utilization at Facebook was low, there were significant, correlated packet losses at very fine timescales.  These microbursts overflowed the limited packet buffering in commodity switches (see my earlier post for more discussion on this issue).
To deal with the significant slowdown that resulted from synchronized losses in relatively small TCP windows, Facebook built a custom congestion-aware UDP-based transport that manages congestion across multiple requests rather than within a single connection.  This optimization allowed Facebook to avoid, for example, the 200 ms retransmission timeouts associated with losing an entire window’s worth of data in TCP.
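To illustrate the incast mechanics (though this is not Facebook’s UDP transport), a much simpler application-level mitigation is to cap the number of simultaneously outstanding requests so that a burst of replies cannot overflow the switch buffers.  A toy sketch, with made-up server names and limits:

    import asyncio

    MAX_IN_FLIGHT = 32        # illustrative cap on concurrent requests per web server

    async def fetch(server, key, sem):
        async with sem:                          # pace the fan-out to limit reply bursts
            await asyncio.sleep(0.001)           # stand-in for a real memcache RPC
            return key, f"<value from {server}>"

    async def fan_out(keys):
        sem = asyncio.Semaphore(MAX_IN_FLIGHT)
        tasks = [fetch(f"mc{hash(k) % 100:03d}", k, sem) for k in keys]
        return dict(await asyncio.gather(*tasks))

    results = asyncio.run(fan_out([f"user:{i}" for i in range(200)]))
    print(len(results), "responses collected")
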
Authoritative Storage
Authoritative Facebook data is stored in a pool of MySQL servers.  The overall experience with MySQL at Facebook has been very positive, with thousands of MySQL servers in multiple data centers.  It is simple, fast, and reliable.  Facebook currently has 8,000 server-years of runtime experience without data loss or corruption.
Facebook has learned a number of lessons about data management:
  • Shared architecture should be avoided; there are no joins in the code.
  • Storing dynamically changing data in a central database should be avoided.
  • Similarly, heavily-referenced static data should not be stored in a central database.
There are a number of challenges with MySQL as well, including:
  • Logical migration of data is very difficult.
  • Creating a large number of logical DBs and load balancing them over a varying number of physical nodes.
  • Easier to scale CPU on web tier than on the DB tier.
  • Data-driven schemas make for happy programmers and difficult operations.

Lots of examples of Facebook’s contribution back to open source here.

Given its global user population, Facebook eventually had to move to replicating its content across multiple data centers.  Facebook now runs two large data centers, one on the West coast of the US and one on the East coast.  However, this introduces the age-old problem of data consistency. Facebook adopts a primary/slave replication scheme where the West coast MySQL replicas are the authoritative stores for data.  All updates are applied to these master replicas and asynchronously replicated to the slaves on the East coast.  However, without synchronous updates, consecutive requests to the same data item from the same user can return inconsistent or stale results.
The approach taken at Facebook is to set a cookie on user update requests that redirects all subsequent requests from that user to the West coast master for some configurable time period, ensuring that read operations do not return inconsistent results.  This approach is described in more detail on the Facebook blog.
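A rough sketch of that cookie mechanism, under my own assumptions (the 20-second window, the dictionaries standing in for databases, and the function names are all illustrative, not Facebook’s actual values or code):

    import time

    REPLICATION_LAG_WINDOW = 20.0        # seconds; an illustrative choice

    def handle_write(cookies, key, value, master_db):
        master_db[key] = value                       # all writes go to the West coast master
        cookies["recently_wrote"] = time.time()      # remember that this user just wrote

    def handle_read(cookies, key, master_db, local_replica):
        stamp = cookies.get("recently_wrote", 0.0)
        if time.time() - stamp < REPLICATION_LAG_WINDOW:
            return master_db.get(key)                # read-your-writes: route to the master
        return local_replica.get(key)                # otherwise read from the nearby replica

    # Toy usage: the East coast replica lags behind the master.
    master, east_replica, cookies = {}, {}, {}
    handle_write(cookies, "status:42", "updated", master)
    print(handle_read(cookies, "status:42", master, east_replica))   # "updated", from master
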
Areas for future research at Facebook:
  • Load balancing
  • Middle tier: balance between programmer productivity and machine efficiency
  • Graph-based caching and storage systems
  • Search relevance via the social graph
  • Object discovery and ranking
  • Storage systems
  • Personalization
Jeff also relayed an interesting philosophy from Mark Zuckerberg: “Work fast and don’t be afraid to break things.”  The overall idea is to avoid spending the entire year working cautiously, delivering rock-solid code but not much of it.  A corollary: if you take the entire site down, it’s not the end of your career.

Yahoo!’s Geo-Replication Service, PNUTS

A few weeks ago, I had the chance to visit Yahoo! Research.  I had nice conversations with Brian Cooper and Raghu Ramakrishnan regarding their new storage infrastructure, PNUTS.  I had a great time during my visit and wanted to write a bit about PNUTS after going through their paper in more detail.  Their work is addressing what I consider to be an increasingly important problem, delivering applications to a global audience from data centers spread all across the planet.

Such geo-replication of application data is required because no single data center can provide the requisite levels of availability to clients, and because speed-of-light delays and wide-area network congestion make it impossible to deliver interactive response times to clients potentially halfway across the planet.

The goal of PNUTS is to provide a hosted storage infrastructure exporting a record-based API.  Clients may insert records into tables following a loose schema (not all columns have to be specified for all records).  Each record has a primary key and an assigned owner, which are used to deliver PNUTS’s consistency guarantees.  A table’s primary keys may be ordered or hashed, with ordering more naturally supporting range queries and hashing lending itself to load balancing.

Perhaps the primary question for any wide-area replication service is the consistency model.  Because the Yahoo! services leveraging PNUTS have strict performance requirements, the PNUTS designers deemed the overhead of providing strong consistency to be too high.  Instead, individual records export “timeline consistency.”  Essentially, all updates are forwarded to a per-record master.  Once a write is applied at the master, the synchronous portion of the write completes and success is returned to the client.  PNUTS then propagates the write asynchronously to the other data centers replicating the record.  While reads at remote data centers may return stale data, updates are ordered at the master (hence there are no conflicts) and pushed to remote replicas in order.
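
A toy sketch of the idea (illustrative names only, not the PNUTS API): a per-record master assigns versions and streams updates to replicas in order, so remote reads may be stale but never out of order.

    import queue, threading

    class RecordMaster:
        """Per-record master: orders writes and streams them, in order, to replicas."""
        def __init__(self, replica_queues):
            self.version, self.value = 0, None
            self.replica_queues = replica_queues      # one queue per remote data center

        def write(self, value):
            self.version += 1                         # synchronous part: apply at the master
            self.value = value
            for q in self.replica_queues:
                q.put((self.version, value))          # asynchronous, in-order propagation
            return self.version                       # success returned to the client here

    class Replica:
        def __init__(self, q):
            self.version, self.value = 0, None
            threading.Thread(target=self._apply, args=(q,), daemon=True).start()

        def _apply(self, q):
            while True:
                self.version, self.value = q.get()    # updates arrive in timeline order

        def read(self):
            return self.version, self.value           # possibly stale, never out of order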

PNUTS aims to scale to ten+ wide-area data centers, each with 1,000 storage machines (petabyte-scale storage).  PNUTS targets record-based storage for online access and hence is complementary to storage systems such as HDFS that target batch-based analysis or other storage systems that target large “blob” storage (video, audio, etc.).

I find this space to be extremely interesting and really in its infancy.  Kudos to the folks at Yahoo! for being among the first to tackle this important space. There are a number of alternative techniques.  Dynamo from Amazon targets single data-center storage and exports an eventual consistency model that may leave updates applied “out of order” from the perspective of a client.  BigTable/GFS provide stronger consistency guarantees but synchronously apply updates to multiple replicas within the data center, making them less appropriate for geo-replication.

In my own group, we are also building a system targeting geo-replication across multiple wide-area data centers.  Our goal is to quantify the exact cost of strong consistency for web services leveraging data replicated across multiple data centers.  We feel that there are applications that would benefit from strong consistency but sacrifice it, paying the price in significant additional application complexity.  From Yahoo!’s internal measurements reported in the paper, we see that 85% of writes to a record originate from the same data center.  This certainly justifies locating the master at that data center.  However, with timeline consistency, the remaining writes must still cross the wide area to reach the master, making it difficult to enforce SLAs that are often set at the 99th or even 99.9th percentile.  Further, unavailability of the master either makes a record unavailable or imposes the need to “fork” the timeline, requiring the client application to potentially reconcile conflicting updates.

Can we architect a system that enforces strong consistency, delivers acceptable performance, and maintains availability in the face of any single replica failure?  Clearly, there will be cases where the answer will be a resounding “No!”  We want to understand the scenarios where achieving these properties is possible and, with the appropriate architecture and design, expand the space.

Update: I recently found another nice writeup on PNUTS here.