Archive for the 'data center' Category

Gray Sort: The Most Fun I’ve Ever Had with (a few racks of) Computers

Things have been quiet on the blog, but not because there has not been a lot to say.  In fact, so much has been happening that I have not had the idle cycles to write about it.  However, I do want to highlight some of the interesting things that have taken place over the past few months.

There has been significant recent interest in large-scale data processing.  Many would snicker that this is far from a new problem, and indeed the database community has been pioneering this space for decades.  However, there has clearly been an uptick in commercial interest, for example to index and analyze the wealth of information available on the Internet or to process the billions of requests per day made to popular web services.  MapReduce and open-source tools like Hadoop have significantly intensified the debate over the right way to perform large-scale data processing (see my earlier post on this topic).

Observing this recent trend, along with my group’s recent focus on data center networking (and a healthy dose of naivete), led us to go after the world record in data sorting.  The team recently set the records for both Gray Sort (fastest time to sort 100 TB of data) and Minute Sort (most data sorted in one minute) in the “Indy” category. See the sort benchmark page for details. This has been one of the most gratifying projects I have ever been involved with.  The work was of course really interesting, but the best part was seeing the team (Alex Rasmussen, Radhika Niranjan Mysore, Harsha V. Madhyastha, Alexander Pucher, Michael Conley, and George Porter) go after a really challenging problem. While some members of the team would disagree, it was also at least interesting to set the records with just minutes to spare before the 2010 deadline.

Our focus in this work was not so much to set the record (though we are happy to have done so) but to go after high levels of efficiency while operating at scale. Recently, setting the sort record has largely been a test of how much computing an organization could throw at the problem, often sacrificing per-server efficiency. For example, Yahoo’s record for Gray Sort used an impressive 3452 servers to sort 100 TB of data in less than 3 hours.  However, per-server throughput worked out to less than 3 MB/s, a factor of 30 less bandwidth than is available from even a single disk.  Large-scale data sorting involves carefully balancing all per-server resources (CPU, memory capacity, disk capacity, disk I/O, and network I/O) while maintaining overall system scale.  We wanted to determine the limits of a scalable and efficient data processing system. Given current commodity server capacity, is it feasible to run at 30 MB/s or 300 MB/s per server?  That is, could we reduce the required number of machines for sorting 100 TB of data by a factor of 10 or even 100?
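To make the efficiency gap concrete, here is a quick back-of-envelope check of the per-server throughput figure above (the roughly three-hour runtime is an approximation, so the result is too):

```python
# Back-of-envelope: per-server throughput of the prior Gray Sort record.
data_tb = 100          # 100 TB sorted
servers = 3452         # servers used in the prior record
runtime_s = 3 * 3600   # roughly three hours, in seconds

per_server_mb_s = data_tb * 1e6 / servers / runtime_s  # 1 TB = 1e6 MB
print(f"{per_server_mb_s:.1f} MB/s per server")        # ~2.7 MB/s

# A single commodity disk streams ~100 MB/s, so this sits a factor of
# roughly 30-40 below even one disk's sequential bandwidth.
print(f"gap vs. one 100 MB/s disk: {100 / per_server_mb_s:.0f}x")
```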

The interesting thing about large-scale data sorting is that it exercises all aspects of the computer system.

  • CPU: sorting requires O(n log n) comparisons over the data.  While this is not the most compute-intensive of applications, CPU requirements nonetheless cannot be ignored.
  • Disk bandwidth: earlier work proves that an external memory sort (the case where the data set is larger than aggregate physical memory) requires at least two reads and two writes of the data.  One of the banes of system efficiency is the orders-of-magnitude gap between sequential and random disk I/O performance.  A key requirement for high-performance sort is ensuring that disks perform sequential I/O (either read or write) nearly continuously.
  • Disk capacity: sorting 100 TB of data requires at least 200 TB of storage, or 300 TB if the input data cannot be erased.  While not an enormous amount by modern standards, simply storing this much data is an interesting systems challenge in itself.
  • Memory capacity: certainly in our architecture, and perhaps fundamentally, ensuring streaming I/O while limiting disk operations to two reads and two writes per tuple requires a substantial amount of memory, and careful memory management, to buffer data in preparation for large, contiguous writes to disk.
  • Network bandwidth: in a parallel sort, data must be shuffled in an all-to-all manner across all servers.  Saturating available per-server CPU and storage capacity requires significant network bandwidth, approaching 10 Gb/s of sustained throughput per server in our configuration.

Managing the interaction of these disparate resources, along with parallelism both within a single server and across a cluster of machines, was far more challenging than we anticipated.  Our goal was to use commodity servers to break the sort record while focusing on high efficiency. We constructed a cluster with dual-socket, four-core Intel processors, initially 12 GB RAM (later upgraded to 24 GB once we realized we could not maintain sequential I/O with just 12 GB/server), 2x10GE NICs (only one port active for the experiment), and sixteen 500 GB drives.  The number of hard drives per server was key to delivering high levels of performance.  Each of our drives could sustain approximately 100 MB/s of sequential read or write throughput.  We knew that, in the optimal case (see this paper), we would read and write the data twice, in two discrete phases separated by a barrier.  So, if we managed everything perfectly, in the first phase we would read data from 8 drives at an aggregate rate of 800 MB/s (8 x 100 MB/s) while simultaneously writing it out to the remaining 8 disks at an identical rate.  In the second phase, we would similarly read the data at 800 MB/s while writing the fully sorted data out at 800 MB/s.  Once again, in the best case, we would average 400 MB/s of sorting per server.
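For readers who have not built one, a two-pass external sort has a simple skeleton.  The sketch below is a minimal single-machine illustration of the structure described above, not our system (which partitions records across servers and pipelines disk and network I/O); it sorts newline-delimited records and the file handling is deliberately simplistic:

```python
import heapq, os, tempfile

def external_sort(input_path, output_path, chunk_bytes=1 << 30):
    """Two-pass external sort: read and write the data exactly twice."""
    runs = []
    # Phase 1: read sequential chunks, sort in memory, write sorted runs.
    with open(input_path) as f:
        while True:
            lines = f.readlines(chunk_bytes)  # ~chunk_bytes worth of records
            if not lines:
                break
            lines.sort()
            run = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
            run.writelines(lines)
            run.close()
            runs.append(run.name)
    # Phase 2: stream every run and merge into the final sorted output.
    files = [open(r) for r in runs]
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*files))
    for f in files:
        f.close()
    for r in runs:
        os.remove(r)
```

The property that matters, per the discussion above, is that every byte is read exactly twice and written exactly twice, and all of that I/O is sequential.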

Interestingly, the continuing chasm between CPU performance and disk I/O (even for streaming access) means that a “balanced” data-intensive processing cluster requires a large number of drives per server. While 16 disks per server may seem like a lot, one conclusion of our work is that servers dedicated to large-scale data processing should likely have even more.  At the same time, significant work remains in the operating system and disk controllers to harness the I/O bandwidth of such large disk arrays in a scalable fashion.

Our initial goal was to break the record with just 30 servers.  This would correspond to 720 GB/min assuming 400 MB/s/server, allowing us to sort 100 TB of data in ~138 minutes. We did not quite get there (yet); our record-setting runs used a 48-server configuration. For our “certified” record-setting run, we ran at 582 GB/min on 48 servers, or roughly 200 MB/s/server.  This corresponds to 50% of the maximum capacity of our underlying hardware.  Since the certified experiments, we have further tuned our code to sort at ~780 GB/min aggregate, or about 267 MB/s/server, corresponding to ~67% efficiency.  Now obsessed with squeezing the last ounce of efficiency from the system, we continue to target >90% efficiency, or more than 1 TB/min of sorting on 48 machines.
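For anyone checking the arithmetic, converting aggregate GB/min to per-server MB/s and efficiency is straightforward (small differences from the figures above come from rounding the aggregate rates):

```python
# Convert an aggregate sort rate to per-server throughput and efficiency.
def per_server_mb_s(gb_per_min, servers):
    return gb_per_min * 1000 / 60 / servers  # GB/min -> MB/s, split per server

peak = 400  # MB/s/server: best case from above (800 MB/s per phase, two phases)
for rate in (582, 780):
    mbs = per_server_mb_s(rate, 48)
    print(f"{rate} GB/min on 48 servers = {mbs:.0f} MB/s/server "
          f"({mbs / peak:.0%} of peak)")
```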

While a full account is beyond the scope of this post, it is remarkable just how much we had to do to reach even this level of performance.  In no particular order:

  • We had to revise, redesign, and fine-tune both our architecture and our implementation multiple times. There is no single right architecture, because the right technique varies with evolving hardware capabilities and balance.
  • We had to experiment with multiple file systems and file system configurations before settling on ext4.
  • We were bitten multiple times by the performance and caching behavior of our hardware RAID controllers.
  • While our job overall is not CPU bound, thread scheduling and core contention became a significant issue.  In the end, we had to implement our own custom core allocation, bypassing the Linux kernel’s approach.  One interesting requirement was avoiding the core that, by default, performed most of the in-kernel system call work (see the sketch after this list).
  • Performing all-to-all communication at near 10 Gb/s, even among 48 hosts on a single switch, is an unsolved challenge to the best of our knowledge.  We had to resort to brittle and arcane socket configuration to sustain even ~5 Gb/s.
  • We had to run with virtual memory disabled because the operating system’s memory management behaved in unexpected ways when close to capacity.  Of course, with virtual memory disabled, we had to tolerate kernel panics if we were not careful about memory allocation.
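To give a flavor of the core-allocation issue above: on Linux, restricting workers to an explicit set of cores is only a few lines of code.  This is a minimal sketch of the general technique rather than our actual implementation; which core to reserve is system-specific, and the names here (RESERVED, worker_pids) are illustrative:

```python
import os

# Suppose profiling shows core 0 handles most in-kernel system call work
# (a common default, but system-specific); keep application workers off it.
RESERVED = {0}
workers = sorted(set(range(os.cpu_count())) - RESERVED)

def pin_worker(pid, core):
    """Restrict process `pid` to run on a single core (Linux only)."""
    os.sched_setaffinity(pid, {core})

# Example: assign worker processes round-robin across the allowed cores.
for i, core in enumerate(workers):
    print(f"worker {i} -> core {core}")
    # pin_worker(worker_pids[i], core)  # worker_pids is hypothetical
```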

In the end, simultaneously addressing these challenges turned out to be a lot of fun, especially with a great group of people working on the project.  Large-scale sort exercises many aspects of the operating system, the network protocol stack, and distributed systems.  It is far from trivial, but it is also simple enough to (mostly) keep in your head all at once. In addition to improving the efficiency of our system, we are also working to generalize our infrastructure to arbitrary MapReduce-style computation. Fundamentally, we are interested in determining how much efficiency and scale we can maintain in a general-purpose data processing infrastructure.

PortLand Code Release

The amount of interest in data centers and data center networking continues to grow.  For the past decade and more, the most savvy Internet companies have been focusing on infrastructure.  Essentially, planetary-scale services such as search, social networking, and e-commerce require a tremendous amount of computation and storage.  When operating at the scale of tens of thousands of computers and petabytes of storage, small gains in efficiency can result in millions of dollars of annual savings.  At the other extreme, efficient access to tremendous amounts of computation can enable companies to deliver more valuable content.  For example, Amazon is famous for tailoring web page contents to individual customers based on both their history and, potentially, the history of similar users.  Doing so while maintaining interactive response times (typically responding in less than 300 ms) requires fast, parallel access to data potentially spread across hundreds or even thousands of computers.  In an earlier post, I described the Facebook architecture and its reliance on clustering for delivering social networking content.

Over the last few years, academia has become increasingly interested in data centers and cloud computing. One reason is the opportunity for impact: the entire computing industry is undergoing another paradigm shift, and five years from now the way we build out computing and storage infrastructure will be radically different.  Another allure of the data center is that it is possible to do “clean slate” research and deployment.  One frustration of the networking research community has been the inability to deploy novel architectures and protocols because of the need to remain backward compatible and friendly to legacy systems.  Check out this paper for an excellent discussion. In the data center, it is at least possible to deploy entirely new architectures without the need to be compatible with every protocol developed over the years.

Of course, there are difficulties with performing data center research as well.  One is having access to the necessary infrastructure to perform research at scale.  With companies deploying data centers with tens of thousands of computers, it is difficult for most universities, and even research labs, to have access to infrastructure of that size.  In our own experience, we have found that it is possible to consider problems of scale even with a relatively modest number of machines.  Research infrastructures such as Emulab and OpenCirrus are open compute platforms that provide significant amounts of computing infrastructure to the research community.

Another challenge is the lack of software infrastructure for performing data center research, particularly in networking.  Eucalyptus provides an EC2-compatible environment for cloud computing, but there is a relative void of available software for networking research.  Having to rebuild every aspect of the protocol stack before investigating fundamental algorithms and protocols is a significant barrier.

To partially address this shortcoming, we are releasing an alpha version of our PortLand protocol.  This work was published at SIGCOMM 2009 and targets delivering a unified Layer 2 environment for easier management, with support for basic functionality such as virtual machine migration.  I discussed our work on PortLand in an earlier post here, and some of the issues of Layer 2 versus Layer 3 deployment here.

The page for downloading PortLand is now up.  It reflects the hard work of two graduate students in my group, Sambit Das and Malveeka Tewari, who took our research code and ported it to HP ProCurve switches running OpenFlow.  The same codebase runs on NetFPGA switches as well.  We hope the community can confirm that the same code runs on a variety of other OpenFlow-enabled switches.  Our goal is for PortLand to be a piece of the puzzle in a software environment for data center networking research.  We encourage you to try it out and give us feedback.  In the meantime, Sambit and Malveeka are hard at work adding Hedera flow-scheduling functionality for our next code release.

Achieving Adaptive Multipath Forwarding in the Data Center

Later this month, we will be presenting our work on Hedera at NSDI 2010. The goal of the work is to improve data center network performance under a range of dynamically shifting communication patterns. Below, I present a quick overview of the work, starting with some of the motivation for it.

The promise of adaptively choosing the right path for a packet based on dynamically changing network conditions is at least as old as the ARPANET.  The original goal was to track the current level of congestion on all available paths between a source and destination, and then to forward each packet along the path likely to deliver the best performance, on a packet-by-packet basis.

In the ARPANET, researchers attempted to achieve this functionality by distributing queue lengths and capacity as part of the routing protocol.  Each router would then have a view not just of network connectivity but also of the performance available on individual paths.  Forwarding entries for each destination would then be calculated based not just on the smallest number of hops but also on dynamically changing performance measures.  Unfortunately, this approach suffered from stability problems.  Distribution of current queue lengths through the routing protocol was too coarse-grained, and the response to it too deterministic.  Traffic would oscillate, with all senders simultaneously chasing the path that had performed best in the previous measurement epoch while leaving other paths idle.  The previously best-performing path would in turn become overwhelmed by this herd effect.  The next measurement cycle would reveal the resulting congestion and lead to yet another oscillation.

As a result of this early experience, inter-domain routing protocols such as BGP settled on simple hop counts (AS-path length, in BGP’s case) as the routing metric, eschewing performance goals in favor of simple connectivity.  Intra-domain routing protocols such as OSPF also initially opted for simplicity, again aiming for the shortest path between a source and destination.  Administrators could, however, set weights on individual links to make particular paths more or less appealing by default.

More recently, administrators have performed coarse-grained traffic engineering among available paths using MPLS. With the rise of large ISPs, the cost of operating hundreds or thousands of expensive long-haul links, and customers with strict performance requirements, it became important to make better use of network resources within each domain/ISP.  Traffic engineering (TE) extensions to OSPF allow bundles of flows between the same ingress and egress points in the network to follow the same path, leveraging the long-term relative stability of traffic between points of presence in a long-haul network.  For example, the amount of traffic from Los Angeles to Chicago, aggregated over many customers, might be stable modulo diurnal variations.  OSPF-TE allows network operators to balance aggregations of flows among available paths, smoothing the load across the links of a wide-area network.  Given the relative stability of aggregate traffic, rebalancing of forwarding preferences can be done at coarse granularity, perhaps with human oversight.

Our recent focus has been on the data center, and in that environment the network displays much burstier communication patterns, with rapid shifts in load from one portion of the network to another.  At the same time, data center networks only achieve scalability through topologies that inherently provide multiple paths between every source and destination.  Leveraging coarse-grained stability on the order of days is not an option for traffic engineering in the data center.  And yet attempting to send each packet along the best available path also seems like a non-starter, from both a scalability perspective and a TCP-compatibility perspective.  On the second point, TCP does not behave well when packets are routinely delivered out of order.

The state of the art in data center load balancing is the Equal-Cost Multipath (ECMP) extension to OSPF.  Here, each switch tracks the set of available next hops to a particular destination.  For each arriving packet, it extracts a (potentially configurable) set of headers, e.g., source and destination IP address and source and destination port, with the goal of deterministically identifying all packets that belong to the same logical higher-level flow.  The switch then hashes the concatenated flow identifier to assign the flow to one of the available output ports.
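In pseudocode terms, ECMP next-hop selection looks something like the sketch below.  The header fields and the CRC32 hash are representative stand-ins; real switches implement this in hardware with vendor-specific hash functions:

```python
import zlib

def ecmp_port(pkt, ports):
    """Pick an output port by hashing the flow's 5-tuple.

    Every packet of a flow hashes identically, so the flow stays on one
    path (and TCP never sees reordering), while distinct flows spread
    statistically across the available ports.
    """
    flow_id = (f"{pkt['src_ip']}|{pkt['dst_ip']}|"
               f"{pkt['src_port']}|{pkt['dst_port']}|{pkt['proto']}")
    return ports[zlib.crc32(flow_id.encode()) % len(ports)]

pkt = {"src_ip": "10.0.1.2", "dst_ip": "10.0.2.3",
       "src_port": 51234, "dst_port": 80, "proto": "tcp"}
print(ecmp_port(pkt, ports=[0, 1, 2, 3]))
```

Note that two flows that hash to the same port share that link for their entire lifetime, which is exactly the collision problem described next.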

ECMP has the effect of load balancing flows among the available paths.  It can perform well under certain conditions, for example when flows are mostly of uniform, small size and hosts communicate with one another with uniform probability.  However, hash collisions among long-lived flows can leave certain links oversubscribed while others remain idle.  In production networks, administrators are sometimes left to manually tweak the ECMP hash function to achieve good performance for a particular communication pattern (even though the appropriate hash function depends on globally shifting communication patterns).

In our work, we have found that ECMP can underutilize network bandwidth by a factor of 2-4 for moderately sized networks, and the worst-case overhead grows with network size.

Our work on Hedera shows how to improve network performance with small communication overhead, maintaining overall network scalability.  The key idea, detailed in the paper, is to leverage a central network fabric manager that tracks the behavior of large flows.  By default, newly initiated flows are considered small and are scheduled using a technique similar to ECMP.  However, once a flow grows beyond a certain threshold, the fabric manager schedules it in light of the behavior of all other large flows in the network.  The fabric manager communicates with individual switches in the topology using OpenFlow to track resource utilization. This ensures that, going forward, our approach can be backward compatible with a range of commercially available switches.
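At a high level, the fabric manager runs a simple periodic control loop: measure flows, identify the large ones, and re-place them.  The sketch below is a toy rendition of one such iteration, with a greedy least-loaded placement standing in for the scheduling algorithms in the paper; the threshold, path names, and Flow type are all made up for the example:

```python
from dataclasses import dataclass

LARGE_FLOW_MB_S = 100.0  # hypothetical threshold for calling a flow "large"

@dataclass(frozen=True)
class Flow:
    src: str
    dst: str
    rate_mb_s: float  # measured rate from switch byte counters

def control_step(flows, paths, placements):
    """One iteration of a Hedera-style control loop (toy version)."""
    # Flows below the threshold stay on their default ECMP-chosen paths.
    large = [f for f in flows if f.rate_mb_s >= LARGE_FLOW_MB_S]
    # Greedily place each large flow on the currently least-loaded path.
    # (The real scheduler works from estimated demand, not measured
    # rate -- see the demand estimator below.)
    load = {p: 0.0 for p in paths}
    for f in sorted(large, key=lambda f: -f.rate_mb_s):
        best = min(paths, key=lambda p: load[p])
        placements[f] = best
        load[best] += f.rate_mb_s
    return placements

flows = [Flow("h1", "h5", 450.0), Flow("h2", "h6", 300.0), Flow("h3", "h7", 2.0)]
print(control_step(flows, paths=["core0", "core1"], placements={}))
```

In deployment, a step like this would run once per measurement interval, with any changed placements pushed to the switches via OpenFlow.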

An important consideration in our work is the ability to estimate the inherent demand of a TCP flow independent of its measured throughput.  That is, the fabric manager cannot schedule flows based on per-flow observations of consumed bandwidth: that bandwidth can be off by a large factor from what a flow would ideally achieve, precisely because of poor previous scheduling decisions.  Hence, we designed an algorithm to estimate the bandwidth each TCP flow would achieve in the presence of a perfect scheduler.  This demand estimate, rather than any observed performance characteristic, is the input to our scheduling algorithm.
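The intuition is max-min fairness constrained only by the host NICs: a flow’s natural demand is what it would receive if each source split its NIC bandwidth evenly across its flows and each oversubscribed receiver scaled its incoming flows down.  The sketch below captures that intuition in deliberately simplified form; it is not the paper’s estimator (which, among other things, redistributes capacity freed up by receiver-limited flows).  Capacities are normalized to 1.0 per NIC:

```python
from collections import defaultdict

def estimate_demands(flows, cap=1.0, rounds=20):
    """Simplified natural-demand estimate for flows = [(src, dst), ...].

    Alternates between source fair-sharing and receiver throttling
    until the demands (roughly) stabilize.
    """
    demand = [cap] * len(flows)
    by_src, by_dst = defaultdict(list), defaultdict(list)
    for i, (s, d) in enumerate(flows):
        by_src[s].append(i)
        by_dst[d].append(i)
    for _ in range(rounds):
        # Each source splits its NIC capacity evenly across its flows.
        for idxs in by_src.values():
            share = cap / len(idxs)
            for i in idxs:
                demand[i] = min(demand[i], share)
        # Each oversubscribed receiver scales its incoming flows down.
        for idxs in by_dst.values():
            total = sum(demand[i] for i in idxs)
            if total > cap:
                for i in idxs:
                    demand[i] *= cap / total
    return demand

# Example: h1 sends to h3 and h4; h2 also sends to h3.
print(estimate_demands([("h1", "h3"), ("h1", "h4"), ("h2", "h3")]))
```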

The final piece of the puzzle is an efficient scheduling algorithm for placing large flows in the network.  One important consideration is the length of the control loop: how quickly can we measure the behavior of existing flows and respond with a new placement?  If network communication patterns shift faster than we can observe and react, we will in effect be continuously reacting to stale network conditions.  We are currently able to measure and react at a granularity of approximately one second, driven by limitations in our switch hardware.  As part of future work, we hope to drive this down to approximately 100 ms; going much below that will likely require hardware support, perhaps using an FPGA.

Overall, we have found that Hedera can deliver near-optimal network utilization for a range of communication patterns, with significant improvements relative to ECMP.  It remains an open question whether the scheduling problem we need to solve is NP-hard, but our current algorithms are reasonably efficient and perform acceptably under the conditions we have experimented with thus far.

We hope that the relative simplicity of our architecture along with its backward compatibility with existing switch hardware will enable more dynamic scheduling of data center network fabrics with higher levels of delivered performance and faster reaction to any network failures.

