Later this month, we will be presenting our work on Hedera at NSDI 2010. The goal of the work is to improve data center network performance under a range of dynamically shifting communication patterns. Below I will present a quick overview of the work starting with some of the motivation for it.
The promise of adaptively choosing the right path for a packet based on dynamically changing network performance conditions is at least as old as the ARPANET. The original goal was to track the current level of congestion on all available paths between a source and destination and then forward individual packets along the path likely to deliver the best performance, on a packet-by-packet basis.
In the ARPANET, researchers attempted to achieve this functionality by distributing queue lengths and capacity as part of the routing protocol. Each router would then have a view of not just network connectivity but also the performance available on individual paths. Forwarding entries for each destination would then be calculated based not just on the shortest number of hops but also on dynamically changing performance measures. Unfortunately, this approach suffered from stability issues. Distribution of current queue lengths as part of the routing protocol was too coarse-grained and deterministic: all routers would simultaneously chase the path that had the best performance in the previous measurement epoch, leaving other paths idle. The previously best-performing path would in turn often become overwhelmed by this herd effect. The next measurement cycle would reveal the resulting congestion and lead to yet another oscillation.
As a result of this early experience, inter-domain routing protocols such as BGP settled on hop count as the metric for packet delivery, eschewing performance goals in favor of simple connectivity. Intra-domain routing protocols such as OSPF also initially opted for simplicity, again aiming for the shortest path between a source and destination. Administrators could, however, set weights on individual links to make particular paths more or less appealing by default.
More recently, administrators have performed coarse-grained traffic engineering among available paths using MPLS. With the rise of ISPs, the cost of operating hundreds or thousands of expensive long-haul links, and customers with strict performance requirements, it became important to make better use of network resources within each domain/ISP. Traffic engineering (TE) extensions to OSPF allowed bundles of flows from the same ingress to egress points in the network to follow the same path, leveraging the long-term relative stability of traffic between various points of presence in a long-haul network. For example, the amount of traffic from Los Angeles to Chicago, aggregated over many customers, might demonstrate stability modulo diurnal variations. OSPF-TE allowed network operators to balance aggregations of flows among available paths to smooth the load across available links in a wide-area network. Rebalancing of forwarding preferences could be done at a coarse granularity, perhaps with human oversight, given the relative stability of aggregate traffic characteristics.
Our recent focus has been on the data center, and in that environment the network displays much more bursty communication patterns, with rapid shifts in load from one portion of the network to another. At the same time, data center networks only achieve scalability through topologies that inherently provide multiple paths between every source and destination. Leveraging coarse-grained stability on the order of days is not an option for performing traffic engineering in the data center. And yet, attempting to send each packet along the best available path also seems like a non-starter from both a scalability perspective and a TCP compatibility perspective. On the second point, TCP does not behave well when out-of-order packet delivery becomes the common case.
The state of the art in load balancing in the data center is the Equal Cost Multipath (ECMP) extension to OSPF. Here, each switch tracks the set of available next hops to a particular destination. For each arriving packet, it extracts a potentially configurable set of headers (e.g., source and destination IP address, source and destination port, etc.) with the goal of deterministically identifying all of the packets that belong to the same logical higher-level flow. The switch then applies a hash function to the concatenated flow identifier to assign the flow to one of the available output ports.
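The mechanism can be sketched in a few lines. This is an illustrative model of ECMP next-hop selection (not any particular switch's implementation; the function name and the use of MD5 are my own choices): hashing the flow's 5-tuple means every packet of a flow deterministically lands on the same output port, so packets within a flow are never reordered.

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, ports):
    """Pick one of `ports` based on a hash of the flow's 5-tuple.

    All packets of the same flow produce the same hash, so they all
    follow the same path (avoiding intra-flow reordering).
    """
    flow_id = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(flow_id).digest()
    index = int.from_bytes(digest[:4], "big") % len(ports)
    return ports[index]

# Two packets of the same flow always map to the same port:
ports = [0, 1, 2, 3]
p1 = ecmp_next_hop("10.0.0.1", "10.0.1.2", 40000, 80, 6, ports)
p2 = ecmp_next_hop("10.0.0.1", "10.0.1.2", 40000, 80, 6, ports)
assert p1 == p2
```

Real switches use cheaper hardware hash functions (e.g., CRC-based), but the flow-to-port mapping behaves the same way.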
ECMP has the effect of load balancing flows among the available paths. It can perform well under certain conditions, for example when flows are mostly of uniform, small size and when hosts communicate with one another with uniform probability. However, long-term hash collisions can leave certain links oversubscribed while others remain idle. In production networks, network administrators are sometimes left to manually tweak the ECMP hash function to achieve good performance for a particular communication pattern (though of course, the appropriate hash function depends on globally shifting communication patterns).
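The collision problem is easy to see with a toy simulation (my own illustration, not an experiment from the paper): when only a handful of large flows are hashed onto a few equal-cost links, an even split is unlikely, so some links carry several flows while others carry none.

```python
import random

def simulate_ecmp_loads(num_flows, num_links, seed):
    """Model ECMP placement of equal-demand flows onto equal-cost links.

    The hash of each flow is modeled as an independent uniform choice
    of link (fixed per flow, as with a real hash).  Returns the number
    of flows landing on each link.
    """
    rng = random.Random(seed)
    load = [0] * num_links
    for _ in range(num_flows):
        load[rng.randrange(num_links)] += 1
    return load

# 8 large flows on 4 links: a perfect split would put 2 on each link,
# but hash placement routinely leaves some links with 3+ flows and
# others idle.
print(simulate_ecmp_loads(num_flows=8, num_links=4, seed=1))
```

The imbalance does not wash out for long-lived flows: unlike short flows, which average out over time, a collision between two elephants persists for the lifetime of the flows.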
In our work, we have found that ECMP can underutilize network bandwidth by a factor of 2-4 for moderately sized networks. The worst-case overhead grows with network size.
Our work on Hedera shows how to improve network performance with small communication overhead to maintain overall network scalability. The key idea, detailed in the paper, is to leverage a central network fabric manager that tracks the behavior of large flows. By default, new flows that are initiated are considered small and scheduled using a technique similar to ECMP. However, once a flow grows beyond a certain threshold, the fabric manager attempts to schedule it in light of the behavior of all other large flows in the network. The fabric manager communicates with individual switches in the topology to track resource utilization using OpenFlow. This ensures that in the future our approach can be backward compatible with a range of commercially available switches.
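The fabric manager's classification step can be sketched as follows. This is a minimal sketch under my own assumptions (the threshold value and function names are illustrative, not Hedera's exact code): flows start out treated as small and follow their ECMP-style paths; once a flow's measured rate crosses the threshold, it becomes a candidate for central scheduling.

```python
# Assumed threshold for illustration only; the real system picks this
# relative to link capacity.
LARGE_FLOW_THRESHOLD_BPS = 10_000_000

def classify_flows(flow_stats):
    """Split flows into ECMP-handled small flows and centrally
    scheduled large flows, based on measured rate in bits/sec.

    flow_stats: dict mapping flow_id -> measured rate (bps).
    """
    small, large = [], []
    for flow_id, rate_bps in flow_stats.items():
        if rate_bps >= LARGE_FLOW_THRESHOLD_BPS:
            large.append(flow_id)
        else:
            small.append(flow_id)
    return small, large

small, large = classify_flows({"f1": 2_000_000, "f2": 50_000_000})
# f1 stays on its default ECMP path; f2 is handed to the scheduler.
```

In the real system the per-flow rates would come from switch counters polled via OpenFlow rather than from a dictionary.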
An important consideration in our work is the ability to estimate the inherent demand of a TCP flow independent of its measured bandwidth. That is, the fabric manager cannot schedule flows based simply on observed per-flow bandwidth. This bandwidth can differ by a large factor from what a flow would ideally achieve, because of poor earlier scheduling decisions. Hence, we designed an algorithm to estimate the best-case bandwidth that would be available to a TCP flow assuming the presence of a perfect scheduler. This demand estimator is then the input to our scheduling algorithm rather than any observed performance characteristics.
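The intuition behind demand estimation can be captured in a condensed sketch. This is my simplified reading of the idea, not the paper's exact algorithm: repeatedly give each sender's uncapped flows an equal share of its NIC capacity, then cap flows at any receiver whose total incoming demand exceeds its capacity, and iterate until the demands settle.

```python
def estimate_demands(flows, capacity=1.0, iters=50):
    """Estimate ideal per-flow TCP demand as a fraction of NIC capacity.

    flows: list of (src, dst) pairs, one entry per flow.
    Alternates between a sender phase (split remaining sender capacity
    equally among flows not yet capped by a receiver) and a receiver
    phase (cap flows at oversubscribed receivers).
    """
    demand = {f: 0.0 for f in flows}
    limited = {f: False for f in flows}  # receiver-limited flows
    for _ in range(iters):
        # Sender phase: distribute leftover sender capacity.
        for src in {s for s, _ in flows}:
            mine = [f for f in flows if f[0] == src]
            fixed = sum(demand[f] for f in mine if limited[f])
            free = [f for f in mine if not limited[f]]
            if free:
                share = max(capacity - fixed, 0.0) / len(free)
                for f in free:
                    demand[f] = share
        # Receiver phase: cap flows at oversubscribed receivers.
        for dst in {d for _, d in flows}:
            mine = [f for f in flows if f[1] == dst]
            if sum(demand[f] for f in mine) > capacity + 1e-9:
                share = capacity / len(mine)
                for f in mine:
                    demand[f] = share
                    limited[f] = True
    return demand

# Two senders converging on one receiver: each flow's estimated demand
# is half the receiver's capacity, regardless of what the flows are
# currently measured to achieve on possibly congested paths.
d = estimate_demands([("h1", "h3"), ("h2", "h3")])
```

The key property is that the output reflects what flows *would* get under perfect scheduling, so a flow squeezed by a bad placement is not mistaken for a flow with little to send.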
The final piece of the puzzle is an efficient scheduling algorithm for placing large flows in the network. One important consideration is the length of the control loop: that is, how quickly we can measure the behavior of existing flows and respond with a new placement. If network communication patterns shift more rapidly than we are able to observe and react, we will be left, in effect, continuously reacting to no-longer-meaningful network conditions. We are currently able to measure and react at a granularity of approximately one second, but this is driven by some limitations in our switch hardware. As part of future work, we hope to drive the overhead down to approximately 100 ms. It will likely take some hardware support, perhaps using an FPGA, to go much below 100 ms.
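One simple placement heuristic in the spirit of our approach is a greedy first fit: walk the large flows and assign each one to the first candidate path whose links can still absorb its estimated demand. The sketch below is illustrative (path names, link labels, and unit capacities are my own assumptions, not a specific topology from the paper).

```python
def first_fit(flows, paths, capacity=1.0):
    """Greedily place large flows on candidate paths.

    flows: list of (flow_id, estimated_demand) pairs.
    paths: list of paths, each a tuple of link names.
    Returns {flow_id: path} for flows that could be placed; a flow is
    skipped if no path has room for its full demand.
    """
    used = {}        # link name -> bandwidth already reserved
    placement = {}
    for flow_id, demand in flows:
        for path in paths:
            if all(used.get(link, 0.0) + demand <= capacity
                   for link in path):
                for link in path:
                    used[link] = used.get(link, 0.0) + demand
                placement[flow_id] = path
                break
    return placement

# Two equal-cost core paths between edge switches s1 and s2:
paths = [("s1-c1", "c1-s2"), ("s1-c2", "c2-s2")]
placed = first_fit([("f1", 0.6), ("f2", 0.6)], paths)
# f1 takes the first core path; f2 no longer fits there (0.6 + 0.6 > 1.0)
# and is steered onto the second, something ECMP's hash cannot guarantee.
```

Because the input demands come from the estimator rather than raw measurements, the placement targets what flows could achieve, not what a previous bad placement let them achieve.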
Overall, we have found that Hedera can deliver near-optimal network utilization for a range of communication patterns, with significant improvements relative to ECMP. It remains an open question whether the scheduling problem we need to solve is NP-hard. But our current algorithms are reasonably efficient, with acceptable performance under the conditions we have experimented with thus far.
We hope that the relative simplicity of our architecture along with its backward compatibility with existing switch hardware will enable more dynamic scheduling of data center network fabrics with higher levels of delivered performance and faster reaction to any network failures.