Archive for September, 2009

The Blurring of Layer 2 and Layer 3

Back when I took my graduate course on computer networks (from the tremendous Domenico Ferrari at UC Berkeley), the material was still taught strictly based on the seven-layer OSI protocol stack.  Essentially, our textbook had one chapter for each of the seven layers.  The running joke about the OSI model is that no one understands exactly what layers 5 (the session layer) and the layer 6 (the presentation layer) were all about.  In networking, we spend lots of time talking about layers 1, 2, 3, 4, and 7, but almost none about layers 5 and 6.  Recently, people have even started talking about layer 0, e.g., material scientists that create some of the physical substrates that support high levels of bandwidth on optical networks, and layer 8, the higher-level meaning that might be extracted from collections of applications and data, e.g., the Semantic Web.

What I have found interesting as of late however is that the line between two of the more well-defined layers, layer 2, the network layer, and layer 3, the internetwork layer has become increasingly blurred.  In fact, I would argue that much of the functionality that was traditionally relegated to either layer 2 or layer 3 has become blurred.  In the past, layer 2 was about getting data to/from hosts on the same physical network.  Layer 3 was about getting data among hosts on different physical networks.  Presumably, delivering data for hosts on, for instance, the same LAN segment should allow for simplifying assumptions relative to delivering data between networks.

However, technology forces have pushed us to a point where everything is about “inter-networking”.  A single physical LAN in isolation is just not interesting.  One would think that this would mean that layer 2 protocols would become increasingly marginalized and less important.  All the action should be at layer 3, because inter-networking is where all the action is.

However, just the opposite is in fact happening.  Just about all traditional layer 3/inter-networking functionality is migrating to layer 2 protocols.  So if one were to squint just a little bit, functionality at layer 2 and layer 3 is virtually indistinguishable and often duplicated.  Just as interesting perhaps is that layer 2 may in fact be the place where inter-networking takes place by default, at least within the campus, the enterprise, and the data center.  It would be too radical (for now) for me to make claims about it extending to the Internet as a whole, though a number of projects, including the 100×100 effort, have considered this very position.

Here, I will consider some of the reasons why inter-networking is migrating to layer 2.  There are at least two major forces at work here.

  • The first issue goes back to the original design of the Internet and its protocol suite.  The designers of the Internet made a crucial, and at the time entirely justified, design decision/optimization.  They used a host’s IP address to encode both its globally unique address and its hierarchical position in the global network.  That is, a host’s 32-bit IP address would be both the guaranteed unique handle for all potential senders and the basis for scalable routing/forwarding in Internet routers.  I recently heard a talk from Vint Cerf where he said that this was the one decision that he most wishes he could revisit.This design point was perfectly reasonable, and in fact a very nice optimization, as long as Internet hosts never, or at least very rarely, changed  locations in the network.  As soon as hosts could move from physical network to physical network with some frequency, then conflating host location with host identity introduces a number of challenges.  And of course today, we have exactly this situation with WiFi, smart phones, and virtual machine migration.  The problem stems from the fact that scalable Internet routing relies on hierarchically encoding IP addresses.  All hosts on the same LAN share the same prefix in their IP address; all hosts in the same organization share the same (typically shorter) prefix; etc.

    When a host moves from one layer 2 domain (previously one physical network) to another layer 2 domain, it must change its IP address (or use fairly clumsy forwarding schemes originally developed to support IP mobility with home agents, etc.).  Changing a host’s IP address breaks all outstanding TCP connections to that host and of course invalidates all network state that remote hosts were maintaining regarding a supposedly globally unique name.  Of course, it is worth noting that when the Internet protocols were being designed in the 70’s, an optimization targeting the case where host mobility was considered to be rare was entirely justified and even very clever!

  • The second major force at work in pushing inter-networking functionality into layer 2 is the relative difficulty of managing large layer-3 networks.  Essentially, because of the hierarchy imposed on the IP address name space, layer 3 devices in enterprise settings have to be configured with the unique subnet number corresponding to the prefix the switches are uniquely responsible for.  Similarly, end hosts must be configured through DHCP to receive an IP address corresponding to the first hop switch they connect to.

It is for these reasons that network designers and administrators became interested in managing multiple physical networks as a single layer 2 domain, even going back to some of the original work on layer 2 bridging and spanning tree protocols. In an extended LAN, any host could be assigned any IP address and it could maintain its IP address as it moved from switch to switch.  For instance, consider a campus WiFi network.  Technically, each WiFi base station forms its own distinct physical network.  If each base station were to be managed as a separate LAN, then hosts moving from one base station to another would need to be assigned a new IP address corresponding to the new subnet.  Similarly, with the advent of virtualization in the enterprise and data center, it is no longer necessary for a host to physically migrate from one network to another.  For load balancing, planned upgrades, and thermal management, it is desirable to migrate virtual machines from one physical host to another.  Once again, migrating a virtual machine should not necessitate resetting the machines globally unique name.

Of course, putting inter-networking functionality into layer 2 comes with significant challenges, especially when considering “textbook” Ethernet perhaps the most popular layer 2 network protocol:

  • Forwarding across LANs at layer 2 involves a single spanning tree that may result in sub-optimal routes and worse admits only path between each source and destination.
  • A number of support protocols, such as ARP, require broadcasting to the entire layer 2 domain, potentially limiting overall scalability.
  • Aggregation of forwarding entries becomes difficult/impossible because of flat MAC addresses increasing the amount of state in forwarding tables.  An earlier post discusses the memory limitations in modern switch hardware that makes this issue a significant challenge.
  • Forwarding loops can go on forever since layer 2 protocols do not have a TTL or Hop Count field in the header to enable looping packets to eventually be discarded.  This is especially problematic for broadcast packets.

In a subsequent post, I will discuss some of the techniques being explored to address these challenges.

YY Zhou Joins UC San Diego

I wanted to welcome Professor YY Zhou to UC San Diego.  YY is also joining our Center for Networked Systems as our 20th faculty member.  We were very happy to hire YY, most recently from the University of Illionois Computer Science department.  YY has been prolific in operating systems, storage, computer architecture, software engineering, and a variety of other areas.  I think it is fair to say that she and her students have performed some of the most creative work in recent years, pushing the state of the art in some of the most difficult problems in system reliability.

YY and her graduate students co-founded PatternInsight to commercialize some of their advancements.  The company already has a number of customers for their product, including places such as Intel, Cisco, Juniper, and Network Appliance.

Earlier, her work on software reliability has made quite a splash at SOSP, the premier computer systems conference, with six of her papers appearing there over the past three iterations.  Her most recent paper at SOSP 2009 investigates techniques for reproducing concurrency bugs in multicore/multiprocessor systems, a critical problem in software reliability as virtually all software must become increasingly concurrent to take advantage of performance improvements in underlying processors.

Her work at SOSP would be enough for most, but YY and her colleagues have also been regular contributors to MICRO, ISCA, ASPLOS, FAST, and OSDI.

We are very excited to have YY join our systems and networking group.

Scale Out Networking: “Data Center Switch Architecture in the Age of Merchant Silicon”

Last week, my PhD student Nathan Farrington presented our paper “Data Center Switch Architecture in the Age of Merchant Silicon” at Hot Interconnects.  My group has been thinking about the concept of scale out networking.  Today, we roughly understand how to build incrementally scalable computation infrastructures with clusters of commodity PC’s.  We similarly understand how to incrementally deploy storage in clusters through systems such as GFS or HDFS.  Higher-level software enables the computation and storage to be incrementally built out, achieving so-called “scale out” functionality.  Adding a number of CPUs and disks should result in a proportional increase in overall processing power and storage capacity.

However, achieving the same functionality for the network remains a challenge.  Adding a few high-bandwidth switches to a large topology may not increase the aggregate bandwidth available to applications running on the infrastructure.  In fact, ill-advised placement of new switches with original Ethernet spanning tree protocols could actually result in a reduction of bandwidth.

Of course, the ability to seamlessly harness additional CPUs and storage in some large-scale infrastructure did not become available overnight.  Significant monitoring and protocol work went into achieving such functionality.  So, one goal of our work is to consider the protocol, software, and hardware requirements of scale-out networking.  Essentially, how can developers of large-scale network infrastructures independently add both ports and bandwidth to their topology?

Along one dimension, the network should expand to accommodate more hosts by adding ports.  The bandwidth available in the global switching infrastructure should then be re-apportioned to the available ports.  This allocation may be influenced by higher-level administrator policy, importantly not necessarily on a link-by-link, port-by-port, or even path-by-path basis.  Rather, this allocation may take place on applications and services running on the infrastructure.  And, of course, the mapping of application to port-set may change dynamically.

Along a second dimension, the aggregate network bandwidth should be expandable by simply plugging in additional hardware.  This bandwidth should then correspond to increased available network performance across the network fabric, again subject to administrator policy.

Thus, I may have a network with 1000 ports of 10 Gigabit/sec of Ethernet.  The network fabric may support 1 Terabit/sec of aggregate bandwidth, making an average of 1 Gigabit/sec of bandwidth available to each port.  This would result in an oversubscription ratio of 10, which may be appropriate depending on the communication requirements of applications running on the framework.  Given this network, I should be able to expand the number of ports to 2000 while maintaining aggregate bandwidth in the switching fabric at 1 Terabit/sec, increasing the oversubscription ratio to 20.  Similarly, I might increase the aggregate bandwidth in the fabric to 2 Terabits/sec while maintaining the port count at 1000, decreasing the oversubscription ratio to 5.

Our paper considered the hardware requirements of such an architecture.  At a high level, we designed a modular two-level network architecture around available “merchant silicon.”  The first level, so-called pod switches, are large-scale fully functional Ethernet switches with between 100-1000 ports given current technology design points.  The pod switches are built from some number of merchant-silicon chips available economically from any number of manufacturers (including Fulcrum, Broadcom, Gnodal, etc.).  Fabric cards containing the merchant silicon control the amount of available aggregate bandwidth (and hence oversubscription ratio) in a pod. The second level of the architecture, the core switching array, similarly leverages the same merchant silicon in modular fabric cards to vary the amount of oversubscription available for global, or inter-pod, communication.

The system scales out the number of ports with additional pods (and line cards within a pod) and adds bandwidth to both pods and the network as a whole with modular line cards.

The work also considers the physical cabling challenges associated with any large-scale network infrastructure.  Essentially, transporting lots of bandwidth (e.g., potentially petabits/sec) across a room takes a lot of power and a lot of cables, especially if using traditional copper cable.  However, technology trends in optics is changing this side of the equation.  More on this in a separate post.

The availability of commodity, feature-rich switches will, I believe, change the face of networking in the same way that commodity processors changed the face of networked services and high-performance computing (back in the mid-90’s, the NOW project at UC Berkeley explored the use of clusters of commodity PC’s to address both domains).  Today, the highest performance compute systems are typically built from commodity x86 processors.  This was not necessarily true 10 and certainly not 20 years ago.  In the same way, the highest performance network fabrics will be built around commodity Ethernet switches on a chip moving forward.