Archive for the 'research' Category

Gray Sort: The Most Fun I’ve Ever Had with (a few racks of) Computers

Things have been quiet on the blog, but not because there has not been a lot to say.  In fact, there has been so much happening that I have not had the idle cycles to write about it.  However, I do want to highlight some of the interesting things that have taken place over the past few months.

There has been significant recent interest in large-scale data processing.  Many would snicker that this is far from a new problem and indeed the database community has been pioneering in this space for decades.  However, I believe it is the case that there has been an uptick in commercial interest in this space, for example to index and analyze the wealth of information available on the Internet or to process the multiple billions of requests per day made to popular web services.  MapReduce and open source tools like Hadoop have significantly intensified the debate over the right way to perform large-scale data processing (see my earlier post on this topic).

Observing this recent trend along with my group’s recent focus on data center networking (along with a healthy dose of naivete) led us to go after the world record in data sorting.  The team recently set the records for both Gray Sort (fastest time to sort 100 TB of data) and Minute Sort (most data sorted in one minute) in the “Indy” category. See the sort benchmark page for details. This has been one of the most gratifying projects I have ever been involved with.  The work was of course really interesting but the best part was seeing the team (Alex Rasmussen, Radhika Niranjan Mysore, Harsha V. Madhyastha, Alexander Pucher, Michael Conley, and George Porter) go after a really challenging problem. While some members of the team would disagree, it was also at least interesting to set the records with just minutes to spare before the 2010 deadline.

Our focus in this work was not so much to set the record (though we are happy to have done so) but to go after high levels of efficiency while operating at scale. Recently, setting the sort record has largely been a test of how many computing resources an organization could throw at the problem, often sacrificing per-server efficiency. For example, Yahoo’s record for Gray sort used an impressive 3452 servers to sort 100 TB of data in less than 3 hours.  However, per-server throughput worked out to less than 3 MB/s, a factor of 30 less bandwidth than available even from a single disk.  Large-scale data sorting involves carefully balancing all per-server resources (CPU, memory capacity, disk capacity, disk I/O, and network I/O), all while maintaining overall system scale.  We wanted to determine the limits of a scalable and efficient data processing system. Given current commodity server capacity, is it feasible to run at 30 MB/s or 300 MB/s per server?  That is, could we reduce the required number of machines for sorting 100 TB of data by a factor of 10 or even 100?
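
As a rough check on the per-server figure above, here is a back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes) and a run time of exactly 3 hours, which is only the upper bound given in the text, so the result is a lower bound on the real number. The function name is purely illustrative.

```python
def per_server_throughput_mb_s(total_bytes, servers, seconds):
    """Average sort throughput per server, in decimal MB/s."""
    return total_bytes / servers / seconds / 1e6

TB = 10**12

# 100 TB on 3452 servers, assuming the run took the full 3 hours.
yahoo = per_server_throughput_mb_s(100 * TB, servers=3452, seconds=3 * 3600)
print(f"{yahoo:.2f} MB/s per server")  # 2.68 MB/s per server
```

Against a drive that streams on the order of 100 MB/s, that leaves well over an order of magnitude of per-server headroom.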

The interesting thing about large-scale data sorting is that it exercises all aspects of the computer system.

  • CPU is required to perform the O(n log n) operation to sort the data.  While not the most compute-intensive application, CPU requirements nonetheless cannot be ignored.
  • Disk Bandwidth: earlier work proves that external memory sort (the case where the data set size is larger than aggregate physical memory) requires at least two reads of the data and two writes of the data.  One of the banes of system efficiency is the orders of magnitude difference in I/O performance for sequential versus random disk I/O.  A key requirement for high-performance sort is ensuring that disks are performing sequential I/O (either read or write) near continuously.
  • Disk capacity: Sorting 100 TB of data requires at least 200 TB of storage, 300 TB if the input data cannot be erased.  While not an enormous amount of data by modern standards, simply storing this amount of data amounts to an interesting systems challenge.
  • Memory capacity: certainly in our architecture, and perhaps fundamentally, ensuring streaming I/O while simultaneously limiting the number of disk operations to 2 reads and 2 writes per tuple requires a substantial amount of memory and careful memory management to buffer data in preparation for large, contiguous writes to disk.
  • Network bandwidth: in a parallel sort system, data must be shuffled in an all-to-all manner across all servers.  Saturating available per-server CPU and storage capacity requires significant network bandwidth, approaching 10 Gb/s of sustained network throughput per server in our configuration.

Managing the interaction of these disparate resources along with parallelism both within a single server and across a cluster of machines was far more challenging than we anticipated.  Our goal was to use commodity servers to break the sort record while focusing on high efficiency. We constructed a cluster with dual-socket, four-core Intel processors, initially 12GB RAM (later upgraded to 24GB RAM once we realized we could not maintain sequential I/O with just 12GB RAM/server), 2x10GE NIC (only one port active for the experiment), and 16 500GB drives.  The number of hard drives per server was key to delivering high levels of performance.  Each of our drives could sustain approximately 100 MB/s of sequential read or write throughput.  We knew that, in the optimal case (see this paper), we would read and write the data twice in two discrete phases separated by a barrier.  So, if we managed everything perfectly, in the first phase, we would read data from 8 drives at an aggregate rate of 800 MB/s (8*100 MB/s) while simultaneously writing it out to the remaining 8 disks at an identical rate.  In the second phase, we would similarly read the data at 800 MB/s while writing the fully-sorted data out at 800 MB/s.  Once again, in the best case, we would average 400 MB/s of sorting per server.

Interestingly, the continuing chasm between CPU performance and disk I/O (even in the streaming case) means that building a “balanced” data-intensive processing cluster requires a large number of drives per server to maintain overall system balance. While 16 disks per server seems large, one conclusion of our work is that servers dedicated to large-scale data processing should likely have even more disks.  At the same time, significant work needs to be done in the operating system and disk controllers to harness the I/O bandwidth available from such large disk arrays in a scalable fashion.

Our initial goal was to break the record with just 30 servers.  This would correspond to 720 GB/min assuming 400 MB/s/server, allowing us to sort 100 TB of data in ~138 minutes. We did not quite get there (yet); our record-setting runs were on a 48-server configuration. For our “certified” record-setting run, we ran at 582 GB/min on 48 servers, or 200 MB/s/server.  This corresponds to 50% of the maximum efficiency/capacity of our underlying hardware.  Since the certified experiments, we have further tuned our code to sort at ~780 GB/min aggregate or 267 MB/s/server. These newest runs correspond to ~67% efficiency.  Now obsessed with squeezing the last ounce of efficiency from the system, we continue to target >90% efficiency or more than 1 TB/min of sorting on 48 machines.
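
The throughput and efficiency figures above follow from simple arithmetic on the disk configuration. The sketch below reproduces them under the stated assumptions (16 drives per server, roughly 100 MB/s per drive, two phases, decimal units). The function and constant names are illustrative and not part of our code.

```python
DISKS_PER_SERVER = 16
DISK_MB_S = 100   # approximate sequential throughput per drive
PHASES = 2        # the data is read and written once in each phase

def peak_sort_rate_mb_s(disks=DISKS_PER_SERVER, disk_mb_s=DISK_MB_S, phases=PHASES):
    """Best-case per-server sort rate: half the disks read while the other
    half write, and all data passes through that pipeline once per phase."""
    per_phase = (disks // 2) * disk_mb_s
    return per_phase / phases

def cluster_gb_per_min(servers, mb_s_per_server):
    """Aggregate sort rate of the cluster in GB/min."""
    return servers * mb_s_per_server * 60 / 1000

def efficiency(measured_gb_per_min, servers):
    """Measured aggregate rate as a fraction of the hardware's best case."""
    return measured_gb_per_min / cluster_gb_per_min(servers, peak_sort_rate_mb_s())

peak = peak_sort_rate_mb_s()
print(peak)                                           # 400.0
print(cluster_gb_per_min(30, peak))                   # 720.0
print(round(100_000 / cluster_gb_per_min(30, peak)))  # minutes to sort 100 TB
print(round(efficiency(582, 48), 2), round(efficiency(780, 48), 2))
```

The two efficiency values come out at roughly one half and two thirds, in line with the rounded percentages quoted above.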

While beyond the scope of this post, it has been very interesting just how much we had to do for even this level of performance.  In no particular order:

  • We had to revise, redesign, and fine tune both our architecture and implementation multiple times. There is no one right architecture because the right technique varies with evolving hardware capabilities and balance.
  • We had to experiment with multiple file systems and file system configurations before settling on ext4.
  • We were bitten multiple times by the performance and caching behavior of our hardware RAID controllers.
  • While our job overall is not CPU bound, thread scheduling and core contention became a significant issue.  In the end, we had to come up with our own custom core allocation bypassing the Linux kernel’s own approach.  One interesting requirement was avoiding the core that by default performed most of the in-kernel system call work.
  • Performing all-to-all communication at near 10 Gb/s, even among 48 hosts on a single switch, is an unsolved challenge to the best of our knowledge.  We had to resort to brittle and arcane socket configuration to sustain even ~5Gb/s.
  • We had to run with virtual memory disabled because the operating system’s memory management behaved in unexpected ways close to capacity.  Of course, with virtual memory disabled, we had to tolerate kernel panics if we were not careful about memory allocation.
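
To make the core allocation point concrete, here is a minimal sketch of pinning work to an explicit set of cores while avoiding one reserved core. It is not our allocator: the 8-core layout and the choice of core 0 as the reserved core are assumptions for illustration, and `os.sched_setaffinity` is only available on Linux.

```python
import os

def pick_worker_cores(all_cores, reserved):
    """Return the cores a worker may use, skipping any reserved cores."""
    return sorted(set(all_cores) - set(reserved))

def pin_current_process(cores):
    """Pin the calling process to `cores` where the platform supports it."""
    if hasattr(os, "sched_setaffinity"):  # Linux-only API
        os.sched_setaffinity(0, cores)    # pid 0 means the calling process
        return True
    return False

# Hypothetical 8-core server where core 0 is assumed to do most in-kernel work.
workers = pick_worker_cores(range(8), reserved=[0])
print(workers)  # [1, 2, 3, 4, 5, 6, 7]
```

Calling `pin_current_process(workers)` would then apply the mask, provided the machine actually has those cores.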

In the end, simultaneously addressing these challenges turned out to be a lot of fun, especially with a great group of people working on the project.  Large-scale sort exercises many aspects of the operating system, the network protocol stack, and distributed systems.  It is far from trivial, but it is also simple enough to (mostly) keep in your head at once. In addition to improving the efficiency of our system, we are also working to generalize our infrastructure to arbitrary MapReduce-style computation. Fundamentally, we are interested in determining how much efficiency and scale we can maintain in a general-purpose data processing infrastructure.

Doing “Big Science” In Academia

Recently, there has been a lot of handwringing in the systems community about the work that we can do in the age of mega-scale data centers and cloud computing.  The worry is that the really interesting systems today consist of tens of thousands of machines interconnected both within data centers and across the wide area.  Further, appropriate system architectures are heavily dependent on the workloads imposed by millions of users on particular software architectures.  The worry is that  we in academia cannot perform good research because we do not have access to either systems of the appropriate scale or application workloads to inform appropriate system architectures.

The concern further goes that systems research is increasingly being co-opted by industry, with many (sometimes most) of the papers in top systems and networking conferences being written by our colleagues in industry.

One of my colleagues hypothesized that perhaps the void in the systems community was partially caused by the void in “big funding” that was historically available to the academic systems community from DARPA. Starting in about 2000, DARPA moved to more focused funding for efforts likely to have direct impact in the near term.  Though it looks as if this policy is changing under new DARPA leadership, the effects in the academic community have yet to be felt.

My feeling is that all this worry is entirely misplaced.  I will outline some of the opportunities that go along with the challenges that we currently face in academic research.

First, for me, this may in fact be another golden age in systems research, born out of tremendous opportunity to address a whole new scale of problems collaboratively between industry and academia. Personally, I find interactions with my colleagues in industry to be a terrific source of concrete problems to work on.  For example, our recent work on data center networking could never have happened without detailed understanding of the real problems faced in large-scale network deployments.  While we had to carry out a significant systems building effort as part of the work, we did not need to build a 10,000-node network to carry out interesting work in this space.  Even the terrific work coming out of Microsoft Research on related efforts such as VL2, DCell, and BCube typically employs relatively modest-sized system implementations as proofs of concept of their designs.

A related approach is to draw inspiration from a famous baseball quote by Willie Keeler, “I keep my eyes clear and I hit ’em where they ain’t.” The analog in systems research is to focus on topics that may not currently be addressed by industry.  For example, while there has been tremendous interest and effort in building systems that scale seemingly arbitrarily, there has been relatively little focus on per-node efficiency.  So a recent focus of my group has been on building scalable systems that do not necessarily sacrifice efficiency.  More on this in a subsequent post.

The last, and perhaps best, strategy is to actively seek out collaborations with industry to increase overall impact on both sides. One of the best papers I read in the set of submissions to SIGCOMM 2010 was on DCTCP, a variant of TCP targeting the data center.  This work was a collaboration between Microsoft Research and Stanford with the protocol deployed live on a cluster consisting of thousands of machines.  The best paper award from IMC 2009 was on a system called WhyHigh, a system for diagnosing performance problems in Google’s Content Distribution Network.  This was a multi-way collaboration between Google, UC San Diego, University of Washington, and Stony Brook.  Such examples of fruitful collaborations abound.  Companies like Akamai and AT&T are famous for multiple very successful academic collaborations with actual impact on business operations.  I have personally benefitted from insights and collaborations with HP Labs on topics such as virtualization and system performance debugging.

I think the big thing to note is that industry and academia have long lived in a symbiotic relationship. When I was a PhD student at Berkeley, many of the must-read systems papers came out of industry: the Alto, Grapevine, RPC, NFS, Firefly, Logic of Authentication, Pilot, etc., just as systems such as GFS, MapReduce, Dynamo, PNUTS, and Dryad are heavily influencing academic research today.  At the same time, GFS likely could not have happened without the lineage of academic file systems research, from AFS, Coda, LFS, and Zebra to xFS.  Similarly, Dynamo would not have been as straightforward if it had not been informed by Chord, Pastry, Tapestry, CAN, and all the peer-to-peer systems that came afterward.  The novel consistency model in PNUTS that enables its scalability was informed by decades of research in strong and weak data consistency models.

Sometimes things go entirely full circle multiple times between industry and academia.  IBM’s seminal work on virtual machines in the 1960’s lay dormant for a few decades before inspiring some of the top academic work of the 1990’s, SimOS and DISCO.  This work in turn led to the founding of VMware, perhaps one of the most influential companies to directly come out of the systems community.  And of course, VMware has helped define part of the research agenda for the systems community in the past decade, through academic efforts like Xen.  Interestingly, academic work on Xen led to a second high-profile company, XenSource.

This is all to say that I believe that the symbiotic relationship between industry and academia in systems and networking will continue.  We in academia do not need a 100,000-node data center to do good research, especially by focusing on direct collaboration with industry where it makes sense and otherwise on topics that may not be directly addressed by industry.  And the fact that there are so many great systems and networking papers from industry in top conferences should only serve as inspiration, both to define important areas for further research and to set the bar higher for the quality of our own work in academia.

Finally, and only partially in jest, all the fundamental work in industrial research is perhaps further affirmation of the important role that academia plays, since many of the people carrying out the work were MS and PhD students in academia not so long ago.

SIGCOMM 2010 Travel Grants and VISA Workshop

This year, I had the pleasure of serving on the SIGCOMM 2010 program committee.  I may write more about the experience later, but the short version is that I really enjoyed reading the papers and was particularly impressed by the deep discussions at the two-day program committee last month.  K.K. Ramakrishnan and Geoff Voelker did a terrific job as co-chairs and I believe their efforts are well reflected in a very strong program.

The conference will be held in New Delhi this year and the organizing committee has been fortunate to secure some generous support for travel grants.  This year, grants will be available not just for students, but also for post docs and junior faculty.  The deadline for application has been extended to June 12, 2010.  Full details are available here.  On behalf of the SIGCOMM organizing committee, I encourage everyone interested to apply.

If you do attend SIGCOMM, let me also put in a plug for the VISA workshop.  This is the second workshop on Virtualized Infrastructure Systems and Architecture, building on the successful program we had last year.  I was the co-program chair this year with Guru Parulkar and Cedric Westphal.  Virtualization remains an important topic and VISA is playing an important role for discussion of important problems across systems and networks.

PortLand Code Release

The amount of interest in data centers and data center networking continues to grow.  For the past decade plus, the most savvy Internet companies have been focusing on infrastructure.  Essentially, planetary scale services such as search, social networking, and e-commerce require a tremendous amount of computation and storage.  When operating at the scale of tens of thousands of computers and petabytes of storage, small gains in efficiency can result in millions of dollars of annual savings.  On the other extreme, efficient access to tremendous amounts of computation can enable companies to deliver more valuable content.  For example, Amazon is famous for tailoring web page contents to individual customers based on both their history and potentially the history of similar users.  Doing so while maintaining interactive response times (typically responding in less than 300 ms) requires fast, parallel access to data potentially spread across hundreds or even thousands of computers.  In an earlier post, I described the Facebook architecture and its reliance on clustering for delivering social networking content.

Over the last few years, academia has become increasingly interested in data centers and cloud computing. One reason is the opportunity for impact; it is clear that the entire computing industry is undergoing another paradigm shift.  Five years from now, it is clear that the way we build out computing and storage infrastructures will be radically different.  Another allure of the data center is the fact that it is possible to do “clean slate” research and deployment.  One frustration of the networking research community has been the inability to deploy novel architectures and protocols because of the need to be backward compatible and friendly to legacy systems.  Check out this paper for an excellent discussion. In the data center, it is at least possible to deploy entirely new architectures without the need to be compatible with every protocol developed over the years.

Of course, there are difficulties with performing data center research as well.  One is having access to the necessary infrastructure to perform research at scale.  With companies deploying data centers at the scale of tens of thousands of computers, it is difficult for most universities and even research labs to have access to the necessary infrastructure.  In our own experience, we have found that it is possible to consider problems of scale even with a relatively modest number of machines.  Research infrastructures such as Emulab and OpenCirrus are open compute platforms that provide a significant amount of computing infrastructure to the research community.

Another challenge is the lack of software infrastructure for performing data center research, particularly in networking.  Eucalyptus provides an EC2-compatible environment for cloud computing.  However, there is a relative void of available research software for research in networking.  Rebuilding every aspect of the protocol stack before performing research in fundamental algorithms and protocols is a challenge.

To partially address this shortcoming, we are releasing an alpha version of our PortLand protocol.  This work was published in SIGCOMM 2009 and targets delivering a unified Layer 2 environment for easier management and support for basic functionality such as virtual machine migration.  I discussed our work on PortLand in an earlier post here and some of the issues of Layer 2 versus Layer 3 deployment here.

The page for downloading PortLand is now up.  It reflects the hard work of two graduate students in my group, Sambit Das and Malveeka Tewari, who took our research code and ported it to HP ProCurve switches running OpenFlow.  The same codebase runs on NetFPGA switches as well.  We hope the community can confirm that the same code runs on a variety of other OpenFlow-enabled switches.  Our goal is for PortLand to be a piece of the puzzle for a software environment for performing research in data center networking.  We encourage you to try it out and give us feedback.  In the meantime, Sambit and Malveeka are hard at work adding Hedera functionality for flow scheduling for our next code release.

Achieving Adaptive Multipath Forwarding in the Data Center

Later this month, we will be presenting our work on Hedera at NSDI 2010. The goal of the work is to improve data center network performance under a range of dynamically shifting communication patterns. Below I will present a quick overview of the work starting with some of the motivation for it.

The promise of adaptively choosing the right path for a packet based on dynamically changing network performance conditions is at least as old as the ARPANET.  The original goal was to track the current levels of congestion on all available paths between a source and destination and to then forward individual packets along the path likely to deliver the best performance on a packet by packet basis.

In the ARPANET, researchers attempted to achieve this functionality by distributing queue lengths and capacity as part of the routing protocol.  Each router would then have a view of not just network connectivity but also the performance available on individual paths.  Forwarding entries for each destination would then be calculated not just based on shortest number of hops but also dynamically changing performance measures.  Unfortunately, this approach suffered from stability issues.  Distribution of current queue lengths as part of the routing protocol was too coarse grained and deterministic.  Hence, packets would oscillate, all simultaneously chasing the path that had the best performance in the previous measurement epoch and leaving other paths idle.  The previously best performing path would in turn often become overwhelmed as part of this herd effect.  The next measurement cycle would reveal the resulting congestion and lead to yet another oscillation.
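
The herd effect described above is easy to reproduce in a toy model. The sketch below is not a model of the ARPANET routing protocol. It only assumes that every flow deterministically picks whichever path looked least loaded in the previous measurement epoch, and the function name and parameters are illustrative.

```python
def simulate_herd(num_paths=2, flows=100, epochs=6):
    """Return the per-path load in each epoch when every flow chases the
    path that was least loaded in the previous epoch."""
    last_load = [0] * num_paths
    history = []
    for _ in range(epochs):
        best = last_load.index(min(last_load))  # stale view, identical for all
        load = [0] * num_paths
        load[best] = flows                      # the whole herd moves together
        history.append(load)
        last_load = load
    return history

for epoch, load in enumerate(simulate_herd()):
    print(epoch, load)
```

With two paths the load flips between `[100, 0]` and `[0, 100]` on every epoch, so one path is always overwhelmed while the other sits idle.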

As a result of this early experience, inter-domain routing protocols such as BGP settled on hop count as the metric for packet delivery, eschewing performance goals in favor of simple connectivity.  Intra-domain routing protocols such as OSPF also initially opted for simplicity, again aiming for the shortest path between a source and destination.  Administrators could however set weights for individual links as a way to make particular paths more or less appealing by default.

More recently administrators perform coarse-grained traffic engineering among available paths using MPLS. With the rise of ISPs and the cost of operating hundreds or thousands of expensive long-haul links and customers with strict performance requirements, it became important to make better use of network resources within each domain/ISP.  Traffic engineering (TE) extensions to OSPF allowed for bundles of flows from the same ingress to egress points in the network to follow the same path, leveraging the long-term relative stability in traffic between various points of presence in a long-haul network.  For example, the amount of traffic from Los Angeles to Chicago aggregated over many customers might demonstrate stability modulo diurnal variations.  OSPF-TE allowed network operators to balance aggregations of flows among available paths to smooth the load across available links in a wide-area network.  Rebalancing of forwarding preferences could be done on a coarse granularity, perhaps with human oversight, given the relative stability in aggregate traffic characteristics.

Our recent focus has been on the data center and in that environment, the network displays much more bursty communication patterns with rapid shifts in load from one portion of the network to another.  At the same time, data center networks only achieve scalability through topologies that inherently provide multiple paths between every source and destination.   Leveraging coarse-grained stability on the order of days is not an option for performing traffic engineering in the data center.  And yet, attempting to send each packet along the best available path also seems like a non-starter from both a scalability perspective and a TCP compatibility perspective.  On the second point, TCP does not behave well when packets may potentially be delivered out of order as the common case.

The state of the art in load balancing in the data center is the Equal Cost Multipath (ECMP) extension to OSPF.  Here, each switch tracks the set of available next hops to a particular destination.  For each arriving packet, it extracts a potentially configurable set of headers (e.g., source and destination IP address, source and destination port, etc.) with the goal of deterministically identifying all of the packets that belong to the same logical higher-level flow.  The switch then applies a hash function to the concatenated flow identifier to assign the flow to one of the available output ports.

ECMP has the effect of load balancing flows among the available paths.  It can perform well under certain conditions, for example when flows are mostly of uniform, small size and when hosts communicate with one another with uniform probability.  However, long-term hash collisions can leave certain links oversubscribed while others remain idle.  In production networks, network administrators are sometimes left to manually tweak the ECMP hash function to achieve good performance for a particular communication pattern (though of course, the appropriate hash function depends on globally shifting communication patterns).
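
The hashing step can be sketched in a few lines. This is an illustration only: real switches use simpler hardware hash functions, and SHA-256 is used here just as a convenient deterministic stand-in. The function names, header fields, and addresses are assumptions for the example.

```python
import hashlib

def ecmp_port(src_ip, dst_ip, src_port, dst_port, proto, num_ports):
    """Map a flow's header fields to one of `num_ports` equal-cost next hops."""
    flow_id = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(flow_id).digest()
    return int.from_bytes(digest[:4], "big") % num_ports

def port_loads(flows, num_ports):
    """Sum bytes per output port. The hash ignores flow size, so a few large
    flows that collide can oversubscribe one link while others stay idle."""
    loads = [0] * num_ports
    for headers, size in flows:
        loads[ecmp_port(*headers, num_ports=num_ports)] += size
    return loads

flows = [
    (("10.0.1.2", "10.2.0.3", 40000, 80, "tcp"), 10**9),   # large flow
    (("10.0.1.3", "10.2.0.4", 40001, 80, "tcp"), 10**9),   # large flow
    (("10.0.1.4", "10.2.0.5", 40002, 80, "tcp"), 10**4),   # small flow
]
print(port_loads(flows, num_ports=4))
```

Every packet of a given flow lands on the same port, which keeps TCP segments in order, but nothing prevents the two large flows from hashing to the same port.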

In our work, we have found that ECMP can underutilize network bandwidth by a factor of 2-4 for moderate sized networks.  The worst-case overhead grows with network size.

Our work on Hedera shows how to improve network performance with small communication overhead to maintain overall network scalability.  The key idea, detailed in the paper, is to leverage a central network fabric manager that tracks the behavior of large flows.  By default, new flows that are initiated are considered small and scheduled using a technique similar to ECMP.  However, once a flow grows beyond a certain threshold, the fabric manager attempts to schedule it in light of the behavior of all other large flows in the network.  The fabric manager communicates with individual switches in the topology to track resource utilization using OpenFlow. This ensures that in the future our approach can be backward compatible with a range of commercially available switches.
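
The split between small and large flows might look roughly like the sketch below. The threshold value, the use of an observed rate as the size measure, and all names are assumptions for illustration. They are not taken from the Hedera implementation, which is described in the paper.

```python
LARGE_FLOW_BYTES_PER_S = 12_500_000  # hypothetical threshold (100 Mb/s)

def split_flows(flow_rates, threshold=LARGE_FLOW_BYTES_PER_S):
    """Partition flows into (small, large) by observed rate in bytes/s.
    Small flows keep their default hash-based path; large flows are handed
    to the central scheduler for explicit placement."""
    small, large = {}, {}
    for flow_id, rate in flow_rates.items():
        (large if rate >= threshold else small)[flow_id] = rate
    return small, large

observed = {"flow-a": 2_000, "flow-b": 90_000_000, "flow-c": 15_000_000}
small, large = split_flows(observed)
print(sorted(small), sorted(large))  # ['flow-a'] ['flow-b', 'flow-c']
```

Only the large set needs any central attention, which is what keeps the communication overhead small.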

An important consideration in our work is the ability to estimate the inherent demand of a TCP flow independent of its measured consumed bandwidth.  That is, the fabric manager cannot schedule flows based on the bandwidth measured for each flow.  This bandwidth can be off from what a flow would ideally achieve by a large factor because of poor previous scheduling decisions.  Hence, we designed an algorithm to estimate the best case bandwidth that would be available to a TCP flow assuming the presence of a perfect scheduler.  This demand estimator is then the input to our scheduling algorithm rather than any observed performance characteristics.

The final piece of the puzzle is an efficient scheduling algorithm for placing large flows in the network.  One important consideration is the length of the control loop.  That is, how quickly can we measure the behavior of existing flows and respond with a new placement of flows.  If network communication patterns are shifting more rapidly than we are able to observe and react, we will be left, in effect, continuously reacting to no longer meaningful network conditions.  We currently are able to measure and react at the granularity of approximately one second, but this is driven by some limitations in our switch hardware.  As part of future work, we hope to drive the overhead down to approximately 100ms.  It will likely take some hardware support, perhaps using an FPGA, to go much below 100 ms.

Overall, we have found that Hedera can deliver near-optimal network utilization for a range of communication patterns, with significant improvements relative to ECMP.  It remains an open question whether the network scheduling problem we need to solve is NP-hard or not.  But our current algorithms are reasonably efficient with acceptable performance under the conditions we have experimented with thus far.

We hope that the relative simplicity of our architecture along with its backward compatibility with existing switch hardware will enable more dynamic scheduling of data center network fabrics with higher levels of delivered performance and faster reaction to any network failures.

Presentation Summary “High Performance at Massive Scale: Lessons Learned at Facebook”

Recently, we were fortunate to host Jeff Rothschild, the Vice President of Technology at Facebook, for a visit for the CNS lecture series.  Jeff’s talk, “High Performance at Massive Scale: Lessons Learned at Facebook” was highly detailed, providing real insights into the Facebook architecture. Jeff spoke to a packed house of faculty, staff, and students interested in the technology and research challenges associated with running an Internet service at scale.  The talk is archived here as part of the CNS lecture series.  I encourage you to check it out; below are my notes on the presentation.
Site Statistics:
  • Facebook is the #2 property on the Internet as measured by the time users spend on the site.
  • Over 200 billion monthly page views.
  • >3.9 trillion feed actions processed per day.
  • Over 15,000 websites use Facebook content.
  • In 2004, the shape of the curve plotting user population as a function of time showed exponential growth to 2M users.  5 years later they have stayed on the same exponential curve with >300M users.
  • Facebook is a global site, with 70% of users outside of the US.
  • Today, there are 1.3B people in the world who have quality Internet connectivity, so there is at least another factor of 4 growth that Facebook is going after.
  • Jeff presented statistics for the number of users that each engineer supports at a variety of high-profile Internet companies: 1.1M for Facebook, 190,000 for Google, 94,000 for Amazon, 75,000 for Microsoft.
Photo sharing on Facebook:
  • Facebook stores 20 billion photos in 4 resolutions
  • 2-3 billion new photos uploaded every month
  • Originally provisioned photo storage for 6 months, but blew through available storage in 1.5 weeks.
  • Facebook serves 600k photos/second, so serving them is more difficult than storing them.
Scaling photos, first the easy way:
  • Upload tier: handles uploads, scales the images, stores on NFS tier
  • Serving tier: Images are served from NFS via HTTP
  • NFS Storage tier built from commercial products
  • Filesystems aren’t really good at supporting large numbers of files
Scaling photos, 2nd generation:
  • Cachr: cache the high volume smaller images to offload the main storage systems.
  • Only 300M images in 3 resolutions
  • Distribute these through a CDN to reduce network latency.
  • Cache them in memory.
Scaling photos, 3rd Generation System: Haystack
  • How many IO’s do you need to serve an image?  Originally, 10 I/O’s at Facebook because of the complex directory structure.
  • Optimizations got it down to 2-4 IOs per file served
  • Facebook built a better version called Haystack by merging multiple files into a single large file. In the common case, serving a photo now requires 1 I/O operation.  Haystack is available as open source.
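To make the "one I/O per photo" idea concrete, here is a minimal sketch of the general technique: many blobs appended to one large file, with an in-memory index from photo id to (offset, size). This is not Haystack's actual on-disk format or API; the class name `TinyHaystack` and its methods are hypothetical, and an in-memory `io.BytesIO` stands in for the large file on disk.

```python
import io


class TinyHaystack:
    """Toy append-only blob store: many photos in one file, index kept in RAM.

    Illustrative only; not Facebook's Haystack format.
    """

    def __init__(self, store):
        self.store = store  # any seekable binary file object
        self.index = {}     # photo_id -> (offset, size)

    def put(self, photo_id, data):
        # Append the blob at the end of the single large file.
        self.store.seek(0, io.SEEK_END)
        offset = self.store.tell()
        self.store.write(data)
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        # The lookup is served from memory, so no directory traversal
        # or per-file metadata read is needed.
        offset, size = self.index[photo_id]
        self.store.seek(offset)
        return self.store.read(size)  # the one read that touches storage


store = TinyHaystack(io.BytesIO())
store.put("a.jpg", b"first photo bytes")
store.put("b.jpg", b"second")
print(store.get("b.jpg"))
```

The design choice being illustrated is that the filesystem only ever sees one large file, so the per-photo metadata lookups that cost extra I/Os in the NFS design move into RAM.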
Facebook architecture consists of:
  • Load balancers act as the front end; requests are distributed to Web servers, which retrieve actual content from a large memcached layer because of the latency requirements for individual requests.
  • Presentation Layer employs PHP
  • Simple to learn: small set of expressions and statements
  • Simple to write: loose typing and universal “array”
  • Simple to read
But this comes at a cost:
  • High CPU and memory consumption.
  • C++ Interoperability Challenging.
  • PHP does not encourage good programming in the large (at 3M lines of code it is a significant organizational challenge).
  • Initialization cost of each page scales with size of code base
Thus Facebook engineers implemented a number of optimizations to PHP:
  • Lazy loading
  • Cache priming
  • More efficient locking semantics for variable cache
  • Memcache client extension
  • Asynchronous event-handling
Back-end services that require the performance are implemented in C++. Services Philosophy:
  • Create a service iff required.
  • Real overhead for deployment, maintenance, separate code base.
  • Another failure point.
  • Create a common framework and toolset that will allow for easier creation of services: Thrift (open source).
A number of things break at scale, one example: syslog
  • Became impossible to push large amounts of data through the logging infrastructure.
  • Implemented Scribe for logging.
  • Today, Scribe processes 25TB of messages/day.
Site Architecture
Overall, Facebook currently runs approximately 30k servers, with the bulk of them acting as web servers.
The Facebook Web Server, running PHP, is responsible for retrieving all of the data required to compose the web page.  The data itself is stored authoritatively in a large cluster of MySQL servers.  However, to hit performance targets, most of the data is also stored in memory across an array of memcached servers. For traditional websites, each user interacts with his or her own data.  And for most web sites, only 1-2% of registered users concurrently access the site at any given time.  Thus, the site only needs to cache 1-2% of all data in RAM.  However, data at Facebook is deeply interconnected; each user is interested in the state of hundreds of other users.  Hence, even with only 1-2% of the user population active at any given time, virtually all data must still be available in RAM.
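A back-of-the-envelope model shows why the working set balloons. The assumptions here are mine, not from the talk: friendships are symmetric, friends are spread uniformly at random over the user base, and each user is active independently with the same probability. Under those assumptions a user's data is needed whenever that user or any one of their friends is active.

```python
def fraction_of_data_needed(active_fraction, friends_per_user):
    """Fraction of users whose data some active user wants right now.

    Assumes (hypothetically) independent activity and uniformly random,
    symmetric friendships: a user's data is idle only if the user and
    all of their friends are idle at the same time.
    """
    idle = 1.0 - active_fraction
    return 1.0 - idle ** (friends_per_user + 1)


for active in (0.01, 0.02):
    for friends in (0, 100, 300):
        needed = fraction_of_data_needed(active, friends)
        print(f"active={active:.0%} friends={friends:>3} -> {needed:.1%} of data hot")
```

With zero friends the model reduces to the traditional case (1-2% of data hot); with a few hundred friends per user, nearly everything is hot. Real friendship graphs are clustered rather than random, so this is only a rough illustration of the trend.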
Data partitioning was easy when Facebook was a college web site: simply partition data at the level of individual colleges.  After considering a variety of data clustering algorithms, the engineers found that there was very little win for the additional complexity of clustering.  So at Facebook, user data is randomly partitioned across individual databases and machines across the cluster.  Hence, each user access requires retrieving data corresponding to user state spread across hundreds of machines.  Intra-cluster network performance is hence critical to site performance. Facebook employs memcache to store the vast majority of user data in memory spread across thousands of machines in the cluster.  In essence, nodes maintain a distributed hash table to determine the machine responsible for a particular user’s data.  Hot data from MySQL is stored in the cache.  The cache supports get/set/incr/decr and multiget/multiset operations.
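As a rough illustration of the hash-table lookup and of why multiget matters, the sketch below hashes each key to a server and groups a page's keys into one batch per server. The server names, key format, and function names are all hypothetical, and plain modulo hashing is used for brevity; it is not necessarily the scheme Facebook's memcache clients use.

```python
import hashlib
from collections import defaultdict

SERVERS = ["cache-00", "cache-01", "cache-02", "cache-03"]  # hypothetical names


def server_for(key, servers=SERVERS):
    """Pick the cache server responsible for a key by hashing the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def plan_multiget(keys, servers=SERVERS):
    """Group keys by responsible server: one multiget per server, not one get per key."""
    batches = defaultdict(list)
    for key in keys:
        batches[server_for(key, servers)].append(key)
    return dict(batches)


# A page that needs the state of 100 friends turns into at most
# len(SERVERS) batched requests, which are issued in parallel.
keys = [f"user:{i}" for i in range(100)]
for server, batch in sorted(plan_multiget(keys).items()):
    print(server, len(batch))
```

Because the data is randomly partitioned, a single page view fans out to many servers at once, which is what makes intra-cluster network behavior so important.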
Initially, the architecture needed to support 15-20k requests/sec/machine but that number has scaled to approximately 250k requests/sec/machine today.  Servers have gotten faster to keep up to some extent, but Facebook engineers also had to perform some fundamental re-engineering of memcached to improve its performance.  System performance improved from 50k requests/sec/machine to 150k to 200k to 250k by adding multithreading, polling device drivers, stats locking, and batched packet handling respectively. In aggregate, Memcache at Facebook processes 120M requests/sec.
One networking challenge with memcached was so-called Network Incast. A front-end web server would collect responses from hundreds of memcache machines in parallel to compose an individual HTTP response. All responses would come back within the same approximately 40 microsecond window.  Hence, while overall network utilization was low at Facebook, even at short time scales, there were significant, correlated packet losses at very fine timescales.  These microbursts overflowed the limited packet buffering in commodity switches (see my earlier post for more discussion on this issue).
To deal with the significant slowdown that resulted from synchronized loss in relatively small TCP windows, Facebook built a custom congestion-aware UDP-based transport that managed congestion across multiple requests rather than within a single connection. This optimization allowed Facebook to avoid the, for example, 200 ms timeouts associated with the loss of an entire window’s worth of data in TCP.
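The talk did not go into the mechanics of the transport, so the following is only a toy simulation of the general idea: one congestion window shared across the whole fan-out, limiting how many requests are outstanding at once. The additive-increase/multiplicative-decrease rule, the class and function names, and the simplified switch model (a burst larger than the buffer loses the excess) are all my assumptions.

```python
class SharedRequestWindow:
    """One AIMD window for all outstanding requests (hypothetical design)."""

    def __init__(self, initial=64, minimum=1, maximum=256):
        self.size = initial
        self.minimum = minimum
        self.maximum = maximum

    def on_success(self):
        self.size = min(self.maximum, self.size + 1)   # additive increase

    def on_loss(self):
        self.size = max(self.minimum, self.size // 2)  # multiplicative decrease


def fetch_all(num_servers, switch_buffer, window):
    """Simulate a fan-out; return (rounds, lost_responses)."""
    pending, rounds, losses = num_servers, 0, 0
    while pending > 0:
        burst = min(pending, window.size)
        rounds += 1
        if burst > switch_buffer:
            # Responses arrive together and overflow the switch buffer;
            # the excess is dropped and those requests must be retried.
            delivered = switch_buffer
            losses += burst - switch_buffer
            window.on_loss()
        else:
            delivered = burst
            window.on_success()
        pending -= delivered
    return rounds, losses


print(fetch_all(100, 16, SharedRequestWindow(initial=64)))
print(fetch_all(100, 16, SharedRequestWindow(initial=16, maximum=16)))
```

In this model the window quickly settles near the buffer size, so losses are confined to the first few bursts instead of recurring on every page view.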
Authoritative Storage
Authoritative Facebook data is stored in a pool of MySQL servers. The overall experience with MySQL has been very positive at Facebook, with thousands of MySQL servers in multiple datacenters.  It is simple, fast, and reliable.  Facebook currently has 8,000 server-years of runtime experience without data loss or corruption.
Facebook has learned a number of lessons about data management:
  • Shared architecture should be avoided; there are no joins in the code.
  • Storing dynamically changing data in a central database should be avoided.
  • Similarly, heavily-referenced static data should not be stored in a central database.
There are a number of challenges with MySQL as well, including:
  • Logical migration of data is very difficult.
  • Creating a large number of logical dbs, load balance them over varying number of physical nodes.
  • Easier to scale CPU on web tier than on the DB tier.
  • Data driven schemas make for happy programmers and difficult operations.

Lots of examples of Facebook’s contribution back to open source here.

Given its global user population, Facebook eventually had to move to replicating its content across multiple data centers.  Facebook now runs two large data centers, one on the West coast of the US and one on the East coast.  However, this introduces the age-old problem of data consistency. Facebook adopts a primary/slave replication scheme where the West coast MySQL replicas are the authoritative stores for data.  All updates are applied to these master replicas and asynchronously replicated to the slaves on the East coast.  However, without synchronous updates, consecutive requests to the same data item from the same user can return inconsistent or stale results.
The approach taken at Facebook is to set a cookie on user update requests that will redirect all subsequent requests from that user to the West coast master for some configurable time period to ensure that read operations do not return inconsistent results.  More details on this approach are available on the Facebook blog.
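The routing decision can be sketched in a few lines. This is a simplified model of the cookie scheme as described above, not Facebook's code: the cookie name `read_master_until`, the 20-second window, and the datacenter labels are hypothetical placeholders.

```python
import time

STICKY_SECONDS = 20.0  # hypothetical; the talk only says the period is configurable


def route(request_is_write, cookies, now=None):
    """Return (datacenter, updated_cookies) for a request from an East coast user."""
    now = time.time() if now is None else now
    cookies = dict(cookies)
    if request_is_write:
        # Writes always go to the master; remember that this user just wrote.
        cookies["read_master_until"] = now + STICKY_SECONDS
        return "west-master", cookies
    if cookies.get("read_master_until", 0.0) > now:
        # A recent write may not have replicated yet, so read from the master.
        return "west-master", cookies
    return "east-replica", cookies


_, jar = route(True, {}, now=100.0)    # user posts an update
print(route(False, jar, now=110.0)[0])  # read within the window
print(route(False, jar, now=130.0)[0])  # read after the window expires
```

The time period needs to exceed the typical replication lag; users who have not written recently keep reading from the nearby replica.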
Areas for future research at Facebook:
  • Load balancing
  • Middle tier: balance between programmer productivity and machine efficiency
  • Graph-based caching and storage systems
  • Search relevance via the social graph
  • Object discovery and ranking
  • Storage systems
  • Personalization
Jeff also relayed an interesting philosophy from Mark Zuckerberg: “Work fast and don’t be afraid to break things.”  Overall, the idea is to avoid working cautiously the entire year, delivering rock-solid code, but not much of it.  A corollary: if you take the entire site down, it’s not the end of your career.

The Ever Changing Face of the Internet

Craig Labovitz made a very interesting presentation at the recent NANOG meeting on the most recent measurements from Arbor’s ATLAS Internet observatory.  ATLAS takes real time Internet traffic measurements from 110+ ISPs with real-time access to more than 14 Tbps of Internet access.  One of the things that makes working in and around Internet research so interesting (and gratifying) is that the set of problems is constantly changing because the way that we use the Internet and the requirements of the applications that we run on the Internet are constantly evolving.  The rate of evolution has thus far been so rapid that we constantly seem to be hitting new tipping points in the set of “burning” problems that we need to address.

Craig, currently Chief Scientist at Arbor Networks, has long been at the forefront of identifying important architectural challenges in the Internet.  His modus operandi has been to conduct measurement studies at a scale far beyond what might have been considered feasible at any particular point in time.  His paper on Delayed Internet Routing Convergence from SIGCOMM 2000 is a classic, among the first to demonstrate the problems with wide-area Internet routing using a 2-year study of the effects of simulated failure and repair events injected from a “dummy” ISP and the many peering relationships that MERIT enjoyed with TIER-1 ISPs.  The paper showed that Internet routing, previously thought to be robust to failure, would often take minutes to converge after a failure event as a result of shortcomings of BGP and the way that ISPs typically configured their border routers.  This paper spawned a whole cottage industry on research into improved inter-domain routing protocols.

This presentation had three high level findings on Internet traffic:

  • Consolidation of Content Contributors: 50% of Internet traffic now originates from just 150 Autonomous Systems (down from thousands just two years ago).  More and more content is being aggregated through big players and content distribution networks.  As a group, CDN’s account for approximately 10% of Internet traffic.
  • Consolidation of Applications: The browser is increasingly running applications.  HTTP and Flash are the predominant protocols for application delivery.  One of the most interesting findings from the presentation is that P2P traffic as a category is declining fairly rapidly.  As a result of efforts by ISPs and others to rate-limit P2P traffic, in a strict “classifiable” sense (by port number), P2P traffic accounts for less than 1% of Internet traffic in 2009.  However the actual number is likely closer to 18% when accounting for various obfuscation techniques.  Still this is down significantly from estimates just a few years ago that 40-50% of Internet traffic consisted of P2P downloads.  Today, with a number of sites providing both paid and advertiser-supported audio and video content, the fraction of users turning to P2P for their content is declining rapidly.  Instead, streaming of audio and video over Flash/HTTP is one of the fastest growing application segments on the Internet.
  • Evolution of Internet Core: Increasingly, content is being delivered directly from providers to consumers without going through traditional ISPs.  Anecdotally, content providers such as Google, Microsoft, Yahoo!, etc. are peering directly with thousands of Autonomous Systems so that web content from these companies to consumers skips any intermediary tier-X ISPs in going from source to destination.
    When ranking AS’s by the total amount of data either originated or transited, Google ranked third and Comcast 6th in 2009, meaning that for the first time, a non-ISP ranked in the top 10.  Google accounts for 6% of Internet traffic, driven largely by YouTube videos.

Measurements are valuable in providing insight into what is happening in the network but also suggest interesting future directions.  I outline a few of the potential implications below:

  • Internet routing: with content providers taking on ever larger presence in the Internet topology, one important question is the resiliency of the Internet routing infrastructure.  In the past, domains that wished to remain resilient to individual link and router failures would “multi-home” by connecting to two or more ISPs.  Content providers such as Google would similarly receive transit from multiple ISPs, typically at multiple points in the network.  However, with an increasing fraction of Internet content and “critical” services provided by an ever-smaller number of Internet sites and with these content-providers directly peering with end customers rather than going through ISPs, there is the potential for reduced fault tolerance for the network as a whole.  While it is now possible for clients to receive better quality of service with direct connections to content providers, a single failure or perhaps a small number of correlated failures can potentially have much more impact on the resiliency of network services.
  • CDN architecture: The above trend can be even more worrisome if the cloud computing vision becomes reality and content providers begin to run on a small number of infrastructure providers.  Companies such as Google and Amazon are already operating their own content distribution networks to some extent and clearly they and others will be significant players in future cloud hosting services.  It will be interesting to consider the architectural challenges of a combined CDN and cloud hosting infrastructure.
  • Video is king: with an increasing fraction of Internet traffic devoted to video, there is significant opportunity in improved video and audio codecs, caching, and perhaps the adaptation of peer-to-peer protocols for fixed infrastructure settings.