Gray Sort: The Most Fun I’ve Ever Had with (a few racks of) Computers

Published November 25, 2010 academia , data center , networking , research , storage 8 Comments

Things have been quite on the blog, but not because there has not been a lot to say. In fact, there has been so much happening that I have not had the idle cycles to write about them. However, I do want to highlight some of the interesting things that have taken place over the past few months.

There has been significant recent interest in large-scale data processing. Many would snicker that this is far from a new problem and indeed the database community has been pioneering in this space for decades. However, I believe it is the case that there has been an uptick in commercial interest in this space, for example to index and analyze the wealth of information available on the Internet or to process the multiple billions of requests per day made to popular web services. MapReduce and open source tools like Hadoop have significantly intensified the debate over the right way to perform large-scale data processing (see my earlier post on this topic).

Observing this recent trend along with my group’s recent focus on data center networking (along with a healthy dose of naivete) led us to go after the world record in data sorting. The team recently set the records for both Gray Sort (fastest time to sort 100 TB of data) and Minute Sort (most data sorted in one minute) in the “Indy” category. See the sort benchmark page for details. This has been one of the most gratifying projects I have ever been involved with. The work was of course really interesting but the best part was seeing the team (Alex Rasmussen, Radhika Niranjan Mysore, Harsha V. Madhyastha, Alexander Pucher, Michael Conley, and George Porter) go after a really challenging problem. While some members of the team would disagree, it was also at least interesting to set the records with just minutes to spare before the 2010 deadline.

Our focus in this work was not so much to set the record (though we are happy to have done so) but to go after high-levels of efficiency while operating at scale. Recently, setting the sort record has largely been a test of how much computing resources an organization could throw at the problem, often sacrificing on per-server efficiency. For example, Yahoo’s record for Gray sort used an impressive 3452 servers to sort 100 TB of data in less than 3 hours. However, per server throughput worked out to less than 3 MB/s, a factor of 30 less bandwidth than available even from a single disk. Large-scale data sorting involves carefully balancing all per-server resources (CPU, memory capacity, disk capacity, disk I/O, and network I/O), all while maintaining overall system scale. We wanted to determine the limits of a scalable and efficient data processing system. Given current commodity server capacity, is it feasible to run at 30 MB/s or 300 MB/s per server? That is, could we reduce the required number of machines for sorting 100 TB of data by a factor of 10 or even 100?

The interesting thing about large-scale data sorting is that it exercises all aspect of the computer system.

CPU is required to perform the O(n log n) operation to sort the data. While not the most compute-intensive application, CPU requirements nonetheless cannot be ignored.
Disk Bandwidth: earlier work proves that external memory sort (the case where the data set size is larger than aggregate physical memory) requires at least two reads of the data and two writes of the data. One of the banes of system efficiency is the orders of magnitude difference in I/O performance for sequential versus random disk I/O. A key requirement for high-performance sort is ensuring that disks are performing sequential I/O (either read or write) near continuously.
Disk capacity: Sorting 100 TB of data requires at least 200 TB of storage, 300 TB if the input data cannot be erased. While not an enormous amount of data by modern standards, simply storing this amount of data amounts to an interesting systems challenge.
Memory capacity: certainly in our architecture, and perhaps fundamentally, ensuring streaming I/O while simultaneously limiting the number of disk operations to 2 reads and 2 writes per tuple requires a substantial amount of memory and careful memory management to buffer data in preparation for large, contiguous writes to disk.
Network bandwidth: in a parallel sort system, data must be shuffled in an all-to-all manner across all servers. Saturating available per-server CPU and storage capacity, requires significant network bandwidth, approaching 10 Gb/s of sustained network throughput per server in our configuration.

Managing the interaction of these disparate resources along with parallelism both within a single server and across a cluster of machines was far more challenging than we anticipated. Our goal was to use commodity servers to break the sort record while focusing on high efficiency. We constructed a cluster with dual-socket, four-core Intel processors, initially 12GB RAM (later upgraded to 24GB RAM once we realized we good not maintain sequential I/O with just 12GB RAM/server), 2x10GE NIC (only one port active for the experiment), and 16 500GB drives. The number of hard drives per server was key to delivering high levels of performance. Each of our drives could sustain approximately 100 MB/s of sequential read or write throughput. We knew that, in the optimal case (see this paper), we would read and write the data twice in two discrete phases separated by a barrier. So, if we managed everything perfectly, in the first phase, we would read data from 8 drives at an aggregate rate of 800 MB/s (8*100 MB/s) while simultaneously writing it out to the remaining 8 disks at an identical rate. In the second phase, we would similarly read the data at 800 MB/s while writing the fully-sorted data out at 800 MB/s. Once again, in the best case, we would average 400 MB/s of sorting per server.

Interestingly, the continuing chasm between CPU performance and disk I/O (even in the streaming case) means that building a “balanced” data-intensive processing cluster requires a large number of drives per server to maintain overall system balance. While 16 disks per server seems large, one conclusion of our work is that servers dedicated to large-scale data processing should likely have even more disks. At the same time, significant work needs to be done in the operating system and disk controllers to harness the I/O bandwidth available from such large disk arrays in a scalable fashion.

Our initial goal was to break the record with just 30 servers. This would correspond to 720 GB/min assuming 400 MB/s/server, allowing us to sort 100 TB of data in ~138 minutes. We did not quite get there (yet); our record-setting runs were on a 48-server configuration. For our “certified” record-setting run, we ran at 582 GB/min on 48 servers, or 200 MB/s/server. This corresponds to 50% of the maximum efficiency/capacity of our underlying hardware. Since the certified experiments, we have further tuned our code to sort at ~780 GB/min aggregate or 267 MB/s/server. These newest runs correspond to ~67% efficiency. Now obsessed with squeezing the last ounce of efficiency from the system, we continue to target >90% efficiency or more than 1 TB/min of sorting on 48 machines.

While beyond the scope of this post, it has been very interesting just how much we had to do for even this level of performance. In no particular order:

We had to revise, redesign, and fine tune both our architecture and implementation multiple times. There is no one right architecture because the right technique varies with evolving hardware capabilities and balance.
We had to experiment with multiple file systems and file system configuration before settling on ext4.
We were bit multiple times by the performance and caching behavior of our hardware RAID controllers.
While our job overall is not CPU bound, thread scheduling and core contention became a significant issue. In the end, we had to come up with our own custom core allocation bypassing the Linux kernel’s own approach. One interesting requirement was avoiding the core that by default performed most of the in-kernel system call work.
Performing all-to-all communication at near 10 Gb/s, even among 48 hosts on a single switch, is an unsolved challenge to the best of our knowledge. We had to resort to brittle and arcane socket configuration to sustain even ~5Gb/s.
We had to run with virtual memory disabled because the operating system’s memory management behaved in unexpected ways close to capacity. Of course, with virtual memory disabled, we had to tolerate kernel panics if we were not careful about memory allocation.

In the end, simultaneously addressing these challenges turned out to be a lot of fun, especially with a great group of people working on the project. Large-scale sort exercises many aspects of the operating system, the network protocol stack, and distributed systems. It is far from trivial, but it is also simple enough to (mostly) keep in your head at once. In addition to improving the efficiency of our system, we are also working to generalize our infrastructure to arbitrary MapReduce-style computation. Fundamentally, we are interested to determine how much efficiency and scale we can maintain in a general-purpose data processing infrastructure.

8 Responses to “Gray Sort: The Most Fun I’ve Ever Had with (a few racks of) Computers”

Feed for this Entry

1 Matt Welsh November 26, 2010 at 4:43 am

This is fantastic work – is there a technical writeup anywhere? Kind of funny that we’re still working on Gray sort – I remember Remzi and Eric Anderson working on this around 1999.

I really like the focus on per-server efficiency as I think that is key. It is amazing how our computer systems are not inherently designed to treat the flow of data from disk-CPU-network-CPU-disk efficiently. You have different data abstractions at different levels; lots of layers of virtualization; and decoupled scheduling across the layers. I’ve often thought that the right approach would be to orchestrate processing across these layers in a much more tightly coupled fashion, which is no doubt what your resulting architecture has to do.

Still, I’m worried about the emphasis on the sorting benchmark. It sounds like you had to go to extremes (disabling VM!) to get this level of performance, and it’s not at all clear how this would extend to other workloads. Real server farms are rarely dedicated to running a single job, load is often bursty, and the challenge is to obtain good resource utilization across a diverse workload mix rather than simply minimizing the time for a single job (though we could argue about that too). When I was working on fast Web servers, it was all too easy to focus on cranking out as much performance from that single application by fine-tuning everything, which would end up getting disrupted if I changed the application code in any substantial way.

Do you feel that there are techniques here that generalize across workloads and mixtures of workloads? Without giving up our precious virtual memory? 🙂
- 2 aminvahdat November 26, 2010 at 8:49 pm
  
  Matt, thanks for your comments.
  
  First, you are exactly right: we are trying to coordinate data movement across the various layers while minimizing data copies and allocating CPU to the appropriate portions of the pipeline as appropriate. Aspects of our architecture our in fact somewhat reminiscent of some nice work on SEDA that appeared way back in SOSP 2001. While we didn’t have Java get in our way, we paid for it with many epic C++ battles.
  
  The VM issue is a bit of a red herring as it allowed us to get a few more percent of performance from the system (5% as I recall). Essentially, we needed all the memory we could get to ensure sequential disk I/O. Since we were right on the cusp with 24GB/server, we had to go to extremes. If we had double the memory per server, we would have run with virtual memory for sure. At the same time, this does point to the inefficiencies that pile up across the OS. 5% here and 5% there start to add up to some real performance limitations.
  
  We do feel that our architecture generalizes and in fact we have a first version of a generalized MapReduce infrastructure running across our baseline architecture. The key question is how much of our efficiency do we give up in a general framework.
  
  No publication on the work yet, but hopefully we’ll have something ready soon.
3 Kevin Klues November 26, 2010 at 1:49 pm

I agree with Matt, in that one of the key benefits of this work is its focus on improving per-server efficiency. It’s amazing that a few simple changes to systems software (i.e. simple in the sense that once you know what you need to do, the changes are not that earth shattering) can have such a significant impact. The fact that these changes allow you to move from sorting 100TB of data on 3452 servers, to sorting it FASTER on just 48 servers is quite impressive (and these servers are still only running at ~67% efficiency!). Just think what could be accomplished if the OS, file system, and networking stacks were specifically co-designed with data transfer efficiency as a top priority.
- 4 aminvahdat November 26, 2010 at 8:57 pm
  
  Right; so one of the most interesting aspects of the work was just how much remains to be done to get the performance to perform at scale. In many different settings, I have seen servers running at dramatically less than hardware capacity because the people running the hardware were afraid of what would happen to the system if pushed.
  
  To be fair to the Hadoop results, our system, as measured, is neither general purpose nor fault tolerant. One of the key questions is how much performance one has to inherently give up in becoming incrementally more general purpose or fault tolerant.
5 Randall December 18, 2010 at 10:40 pm

My low-end estimates for what that a 47-node setup must have cost about $400k, estimating seven thousand something per node for server and disks and $20k for the big Cisco switch. Of course, an academic project, or anybody buying that much gear, might not pay list price.

If rate of sorting/$ were the *only* thing you were trying to optimize, do you think you’d pick an architecture like this or more but smaller machines (or fewer bigger)? (Hey, another SortBenchmark category!: speed/$ on a 100TB sort!)

I’m kind of wondering what kind of efficiency we can expect when our systems aren’t tuned specifically for sorting, particularly where we’re talking about kernel-level changes that make the system less usable for other stuff. If you had a tuned sort binary, but only minimal OS tweaking (couple of JBOD partitions and the right FS), do you think you’d be talking a 10-15% hit to efficiency? 50%?

(Not that you necessarily have time to answer all these random questions 🙂

Also, Gray sort has 100-byte records — trivial observation, but smaller records, or a workload that involves additional CPU work on the input or output stream, might tilt the balance slightly further towards CPU, and larger records would tilt things towards having more disks.
- 6 aminvahdat December 19, 2010 at 9:22 pm
  
  One of our goals in this work is to demonstrate the need to provision a large number of IO devices to keep the memory system and CPU “well fed”. While sort is not a CPU intensive workload, we actually had plenty of CPU cycles. This suggests that more complex data workloads could also be supported in our infrastructure. The bottom line for us is that we believe for I/O workloads, a large number of I/O devices seems to be warranted. In our case, we could support streaming large amounts of data so disk was the right choice. For other algorithms, flash may be a more appropriate choice.
7 Randall December 20, 2010 at 1:16 am

I didn’t mean my last paragraph argumentatively, if it read that way. Agree with that point — there’s some external-memory genomics stuff I’m kind of interested in, and I’d want what are otherwise “low-end” computers packed with lots of spinning disks to do it.

(And interested in the questions — roughly what kind of efficiency hit you’d take with an untuned or less-tuned kernel, and whether anything would be different if cost/speed had been the *only* concern. That said, I should wait for the writeup and not hassle you. 🙂 )

I should have made it clearer I’m really impressed by the work — huge accomplishment to set a record with a much smaller cluster, and finding and overcoming and starting to document all these practical obstacles is hard work and a valuable contribution. And I appreciate the effort to put this post together. It’s too easy to come off as negative when I just jump into questions and ruminations.
- 8 aminvahdat December 20, 2010 at 7:02 am
  
  Ah sorry, I did not view anything in your comment as argumentative, just responding quickly.
  
  Re: efficiency, all of our non-standard tweaks wind up saving 10%, save for perhaps tweaks to the transport at scale. It turns out (as far as we can tell) TCP out of the box does not work well for our communication patterns.
  
  The biggest challenge for performance from the operating systems perspective is the file system. Rightfully so, the most important thing about a file system is reliability (never lose a bit) and performance is a distance second. Since relatively few people wish to push 16 disks at 100MB/s each, systems are typically not well tuned for it.