Archive for November, 2010

Gray Sort: The Most Fun I’ve Ever Had with (a few racks of) Computers

Things have been quite on the blog, but not because there has not been a lot to say.  In fact, there has been so much happening that I have not had the idle cycles to write about them.  However, I do want to highlight some of the interesting things that have taken place over the past few months.

There has been significant recent interest in large-scale data processing.  Many would snicker that this is far from a new problem and indeed the database community has been pioneering in this space for decades.  However, I believe it is the case that there has been an uptick in commercial interest in this space, for example to index and analyze the wealth of information available on the Internet or to process the multiple billions of requests per day made to popular web services.  MapReduce and open source tools like Hadoop have significantly intensified the debate over the right way to perform large-scale data processing (see my earlier post on this topic).

Observing this recent trend along with my group’s recent focus on data center networking (along with a healthy dose of naivete) led us to go after the world record in data sorting.  The team recently set the records for both Gray Sort (fastest time to sort 100 TB of data) and Minute Sort (most data sorted in one minute) in the “Indy” category. See the sort benchmark page for details. This has been one of the most gratifying projects I have ever been involved with.  The work was of course really interesting but the best part was seeing the team (Alex Rasmussen, Radhika Niranjan Mysore, Harsha V. Madhyastha, Alexander Pucher, Michael Conley, and George Porter) go after a really challenging problem. While some members of the team would disagree, it was also at least interesting to set the records with just minutes to spare before the 2010 deadline.

Our focus in this work was not so much to set the record (though we are happy to have done so) but to go after high-levels of efficiency while operating at scale. Recently, setting the sort record has largely been a test of how much computing resources an organization could throw at the problem, often sacrificing on per-server efficiency. For example, Yahoo’s record for Gray sort used an impressive 3452 servers to sort 100 TB of data in less than 3 hours.  However, per server throughput worked out to less than 3 MB/s, a factor of 30 less bandwidth than available even from a single disk.  Large-scale data sorting involves carefully balancing all per-server resources (CPU, memory capacity, disk capacity, disk I/O, and network I/O), all while maintaining overall system scale.  We wanted to determine the limits of a scalable and efficient data processing system. Given current commodity server capacity, is it feasible to run at 30 MB/s or 300 MB/s per server?  That is, could we reduce the required number of machines for sorting 100 TB of data by a factor of 10 or even 100?

The interesting thing about large-scale data sorting is that it exercises all aspect of the computer system.

  • CPU is required to perform the O(n log n) operation to sort the data.  While not the most compute-intensive application, CPU requirements nonetheless cannot be ignored.
  • Disk Bandwidth: earlier work proves that external memory sort (the case where the data set size is larger than aggregate physical memory) requires at least two reads of the data and two writes of the data.  One of the banes of system efficiency is the orders of magnitude difference in I/O performance for sequential versus random disk I/O.  A key requirement for high-performance sort is ensuring that disks are performing sequential I/O (either read or write) near continuously.
  • Disk capacity: Sorting 100 TB of data requires at least 200 TB of storage, 300 TB if the input data cannot be erased.  While not an enormous amount of data by modern standards, simply storing this amount of data amounts to an interesting systems challenge.
  • Memory capacity: certainly in our architecture, and perhaps fundamentally, ensuring streaming I/O while simultaneously limiting the number of disk operations to 2 reads and 2 writes per tuple requires a substantial amount of memory and careful memory management to buffer data in preparation for large, contiguous writes to disk.
  • Network bandwidth: in a parallel sort system, data must be shuffled in an all-to-all manner across all servers.  Saturating available per-server CPU and storage capacity, requires significant network bandwidth, approaching 10 Gb/s of sustained network throughput per server in our configuration.

Managing the interaction of these disparate resources along with parallelism both within a single server and across a cluster of machines was far more challenging than we anticipated.  Our goal was to use commodity servers to break the sort record while focusing on high efficiency. We constructed a cluster with dual-socket, four-core Intel processors, initially 12GB RAM (later upgraded to 24GB RAM once we realized we good not maintain sequential I/O with just 12GB RAM/server), 2x10GE NIC (only one port active for the experiment), and 16 500GB drives.  The number of hard drives per server was key to delivering high levels of performance.  Each of our drives could sustain approximately 100 MB/s of sequential read or write throughput.  We knew that, in the optimal case (see this paper), we would read and write the data twice in two discrete phases separated by a barrier.  So, if we managed everything perfectly, in the first phase, we would read data from 8 drives at an aggregate rate of 800 MB/s (8*100 MB/s) while simultaneously writing it out to the remaining 8 disks at an identical rate.  In the second phase, we would similarly read the data at 800 MB/s while writing the fully-sorted data out at 800 MB/s.  Once again, in the best case, we would average 400 MB/s of sorting per server.

Interestingly, the continuing chasm between CPU performance and disk I/O (even in the streaming case) means that building a “balanced” data-intensive processing cluster requires a large number of drives per server to maintain overall system balance. While 16 disks per server seems large, one conclusion of our work is that servers dedicated to large-scale data processing should likely have even more disks.  At the same time, significant work needs to be done in the operating system and disk controllers to harness the I/O bandwidth available from such large disk arrays in a scalable fashion.

Our initial goal was to break the record with just 30 servers.  This would correspond to 720 GB/min assuming 400 MB/s/server, allowing us to sort 100 TB of data in ~138 minutes. We did not quite get there (yet); our record-setting runs were on a 48-server configuration. For our “certified” record-setting run, we ran at 582 GB/min on 48 servers, or 200 MB/s/server.  This corresponds to 50% of the maximum efficiency/capacity of our underlying hardware.  Since the certified experiments, we have further tuned our code to sort at ~780 GB/min aggregate or 267 MB/s/server. These newest runs correspond to ~67% efficiency.  Now obsessed with squeezing the last ounce of efficiency from the system, we continue to target >90% efficiency or more than 1 TB/min of sorting on 48 machines.

While beyond the scope of this post, it has been very interesting just how much we had to do for even this level of performance.  In no particular order:

  • We had to revise, redesign, and fine tune both our architecture and implementation multiple times. There is no one right architecture because the right technique varies with evolving hardware capabilities and balance.
  • We had to experiment with multiple file systems and file system configuration before settling on ext4.
  • We were bit multiple times by the performance and caching behavior of our hardware RAID controllers.
  • While our job overall is not CPU bound, thread scheduling and core contention became a significant issue.  In the end, we had to come up with our own custom core allocation bypassing the Linux kernel’s own approach.  One interesting requirement was avoiding the core that by default performed most of the in-kernel system call work.
  • Performing all-to-all communication at near 10 Gb/s, even among 48 hosts on a single switch, is an unsolved challenge to the best of our knowledge.  We had to resort to brittle and arcane socket configuration to sustain even ~5Gb/s.
  • We had to run with virtual memory disabled because the operating system’s memory management behaved in unexpected ways close to capacity.  Of course, with virtual memory disabled, we had to tolerate kernel panics if we were not careful about memory allocation.

In the end, simultaneously addressing these challenges turned out to be a lot of fun, especially with a great group of people working on the project.  Large-scale sort exercises many aspects of the operating system, the network protocol stack, and distributed systems.  It is far from trivial, but it is also simple enough to (mostly) keep in your head at once. In addition to improving the efficiency of our system, we are also working to generalize our infrastructure to arbitrary MapReduce-style computation. Fundamentally, we are interested to determine how much efficiency and scale we can maintain in a general-purpose data processing infrastructure.