Archive for the 'research' Category

The Blurring of Layer 2 and Layer 3

Back when I took my graduate course on computer networks (from the tremendous Domenico Ferrari at UC Berkeley), the material was still taught strictly based on the seven-layer OSI protocol stack.  Essentially, our textbook had one chapter for each of the seven layers.  The running joke about the OSI model is that no one understands exactly what layers 5 (the session layer) and the layer 6 (the presentation layer) were all about.  In networking, we spend lots of time talking about layers 1, 2, 3, 4, and 7, but almost none about layers 5 and 6.  Recently, people have even started talking about layer 0, e.g., material scientists that create some of the physical substrates that support high levels of bandwidth on optical networks, and layer 8, the higher-level meaning that might be extracted from collections of applications and data, e.g., the Semantic Web.

What I have found interesting as of late however is that the line between two of the more well-defined layers, layer 2, the network layer, and layer 3, the internetwork layer has become increasingly blurred.  In fact, I would argue that much of the functionality that was traditionally relegated to either layer 2 or layer 3 has become blurred.  In the past, layer 2 was about getting data to/from hosts on the same physical network.  Layer 3 was about getting data among hosts on different physical networks.  Presumably, delivering data for hosts on, for instance, the same LAN segment should allow for simplifying assumptions relative to delivering data between networks.

However, technology forces have pushed us to a point where everything is about “inter-networking”.  A single physical LAN in isolation is just not interesting.  One would think that this would mean that layer 2 protocols would become increasingly marginalized and less important.  All the action should be at layer 3, because inter-networking is where all the action is.

However, just the opposite is in fact happening.  Just about all traditional layer 3/inter-networking functionality is migrating to layer 2 protocols.  So if one were to squint just a little bit, functionality at layer 2 and layer 3 is virtually indistinguishable and often duplicated.  Just as interesting perhaps is that layer 2 may in fact be the place where inter-networking takes place by default, at least within the campus, the enterprise, and the data center.  It would be too radical (for now) for me to make claims about it extending to the Internet as a whole, though a number of projects, including the 100×100 effort, have considered this very position.

Here, I will consider some of the reasons why inter-networking is migrating to layer 2.  There are at least two major forces at work here.

  • The first issue goes back to the original design of the Internet and its protocol suite.  The designers of the Internet made a crucial, and at the time entirely justified, design decision/optimization.  They used a host’s IP address to encode both its globally unique address and its hierarchical position in the global network.  That is, a host’s 32-bit IP address would be both the guaranteed unique handle for all potential senders and the basis for scalable routing/forwarding in Internet routers.  I recently heard a talk from Vint Cerf where he said that this was the one decision that he most wishes he could revisit.This design point was perfectly reasonable, and in fact a very nice optimization, as long as Internet hosts never, or at least very rarely, changed  locations in the network.  As soon as hosts could move from physical network to physical network with some frequency, then conflating host location with host identity introduces a number of challenges.  And of course today, we have exactly this situation with WiFi, smart phones, and virtual machine migration.  The problem stems from the fact that scalable Internet routing relies on hierarchically encoding IP addresses.  All hosts on the same LAN share the same prefix in their IP address; all hosts in the same organization share the same (typically shorter) prefix; etc.

    When a host moves from one layer 2 domain (previously one physical network) to another layer 2 domain, it must change its IP address (or use fairly clumsy forwarding schemes originally developed to support IP mobility with home agents, etc.).  Changing a host’s IP address breaks all outstanding TCP connections to that host and of course invalidates all network state that remote hosts were maintaining regarding a supposedly globally unique name.  Of course, it is worth noting that when the Internet protocols were being designed in the 70’s, an optimization targeting the case where host mobility was considered to be rare was entirely justified and even very clever!

  • The second major force at work in pushing inter-networking functionality into layer 2 is the relative difficulty of managing large layer-3 networks.  Essentially, because of the hierarchy imposed on the IP address name space, layer 3 devices in enterprise settings have to be configured with the unique subnet number corresponding to the prefix the switches are uniquely responsible for.  Similarly, end hosts must be configured through DHCP to receive an IP address corresponding to the first hop switch they connect to.

It is for these reasons that network designers and administrators became interested in managing multiple physical networks as a single layer 2 domain, even going back to some of the original work on layer 2 bridging and spanning tree protocols. In an extended LAN, any host could be assigned any IP address and it could maintain its IP address as it moved from switch to switch.  For instance, consider a campus WiFi network.  Technically, each WiFi base station forms its own distinct physical network.  If each base station were to be managed as a separate LAN, then hosts moving from one base station to another would need to be assigned a new IP address corresponding to the new subnet.  Similarly, with the advent of virtualization in the enterprise and data center, it is no longer necessary for a host to physically migrate from one network to another.  For load balancing, planned upgrades, and thermal management, it is desirable to migrate virtual machines from one physical host to another.  Once again, migrating a virtual machine should not necessitate resetting the machines globally unique name.

Of course, putting inter-networking functionality into layer 2 comes with significant challenges, especially when considering “textbook” Ethernet perhaps the most popular layer 2 network protocol:

  • Forwarding across LANs at layer 2 involves a single spanning tree that may result in sub-optimal routes and worse admits only path between each source and destination.
  • A number of support protocols, such as ARP, require broadcasting to the entire layer 2 domain, potentially limiting overall scalability.
  • Aggregation of forwarding entries becomes difficult/impossible because of flat MAC addresses increasing the amount of state in forwarding tables.  An earlier post discusses the memory limitations in modern switch hardware that makes this issue a significant challenge.
  • Forwarding loops can go on forever since layer 2 protocols do not have a TTL or Hop Count field in the header to enable looping packets to eventually be discarded.  This is especially problematic for broadcast packets.

In a subsequent post, I will discuss some of the techniques being explored to address these challenges.

YY Zhou Joins UC San Diego

I wanted to welcome Professor YY Zhou to UC San Diego.  YY is also joining our Center for Networked Systems as our 20th faculty member.  We were very happy to hire YY, most recently from the University of Illionois Computer Science department.  YY has been prolific in operating systems, storage, computer architecture, software engineering, and a variety of other areas.  I think it is fair to say that she and her students have performed some of the most creative work in recent years, pushing the state of the art in some of the most difficult problems in system reliability.

YY and her graduate students co-founded PatternInsight to commercialize some of their advancements.  The company already has a number of customers for their product, including places such as Intel, Cisco, Juniper, and Network Appliance.

Earlier, her work on software reliability has made quite a splash at SOSP, the premier computer systems conference, with six of her papers appearing there over the past three iterations.  Her most recent paper at SOSP 2009 investigates techniques for reproducing concurrency bugs in multicore/multiprocessor systems, a critical problem in software reliability as virtually all software must become increasingly concurrent to take advantage of performance improvements in underlying processors.

Her work at SOSP would be enough for most, but YY and her colleagues have also been regular contributors to MICRO, ISCA, ASPLOS, FAST, and OSDI.

We are very excited to have YY join our systems and networking group.

Scale Out Networking: “Data Center Switch Architecture in the Age of Merchant Silicon”

Last week, my PhD student Nathan Farrington presented our paper “Data Center Switch Architecture in the Age of Merchant Silicon” at Hot Interconnects.  My group has been thinking about the concept of scale out networking.  Today, we roughly understand how to build incrementally scalable computation infrastructures with clusters of commodity PC’s.  We similarly understand how to incrementally deploy storage in clusters through systems such as GFS or HDFS.  Higher-level software enables the computation and storage to be incrementally built out, achieving so-called “scale out” functionality.  Adding a number of CPUs and disks should result in a proportional increase in overall processing power and storage capacity.

However, achieving the same functionality for the network remains a challenge.  Adding a few high-bandwidth switches to a large topology may not increase the aggregate bandwidth available to applications running on the infrastructure.  In fact, ill-advised placement of new switches with original Ethernet spanning tree protocols could actually result in a reduction of bandwidth.

Of course, the ability to seamlessly harness additional CPUs and storage in some large-scale infrastructure did not become available overnight.  Significant monitoring and protocol work went into achieving such functionality.  So, one goal of our work is to consider the protocol, software, and hardware requirements of scale-out networking.  Essentially, how can developers of large-scale network infrastructures independently add both ports and bandwidth to their topology?

Along one dimension, the network should expand to accommodate more hosts by adding ports.  The bandwidth available in the global switching infrastructure should then be re-apportioned to the available ports.  This allocation may be influenced by higher-level administrator policy, importantly not necessarily on a link-by-link, port-by-port, or even path-by-path basis.  Rather, this allocation may take place on applications and services running on the infrastructure.  And, of course, the mapping of application to port-set may change dynamically.

Along a second dimension, the aggregate network bandwidth should be expandable by simply plugging in additional hardware.  This bandwidth should then correspond to increased available network performance across the network fabric, again subject to administrator policy.

Thus, I may have a network with 1000 ports of 10 Gigabit/sec of Ethernet.  The network fabric may support 1 Terabit/sec of aggregate bandwidth, making an average of 1 Gigabit/sec of bandwidth available to each port.  This would result in an oversubscription ratio of 10, which may be appropriate depending on the communication requirements of applications running on the framework.  Given this network, I should be able to expand the number of ports to 2000 while maintaining aggregate bandwidth in the switching fabric at 1 Terabit/sec, increasing the oversubscription ratio to 20.  Similarly, I might increase the aggregate bandwidth in the fabric to 2 Terabits/sec while maintaining the port count at 1000, decreasing the oversubscription ratio to 5.

Our paper considered the hardware requirements of such an architecture.  At a high level, we designed a modular two-level network architecture around available “merchant silicon.”  The first level, so-called pod switches, are large-scale fully functional Ethernet switches with between 100-1000 ports given current technology design points.  The pod switches are built from some number of merchant-silicon chips available economically from any number of manufacturers (including Fulcrum, Broadcom, Gnodal, etc.).  Fabric cards containing the merchant silicon control the amount of available aggregate bandwidth (and hence oversubscription ratio) in a pod. The second level of the architecture, the core switching array, similarly leverages the same merchant silicon in modular fabric cards to vary the amount of oversubscription available for global, or inter-pod, communication.

The system scales out the number of ports with additional pods (and line cards within a pod) and adds bandwidth to both pods and the network as a whole with modular line cards.

The work also considers the physical cabling challenges associated with any large-scale network infrastructure.  Essentially, transporting lots of bandwidth (e.g., potentially petabits/sec) across a room takes a lot of power and a lot of cables, especially if using traditional copper cable.  However, technology trends in optics is changing this side of the equation.  More on this in a separate post.

The availability of commodity, feature-rich switches will, I believe, change the face of networking in the same way that commodity processors changed the face of networked services and high-performance computing (back in the mid-90’s, the NOW project at UC Berkeley explored the use of clusters of commodity PC’s to address both domains).  Today, the highest performance compute systems are typically built from commodity x86 processors.  This was not necessarily true 10 and certainly not 20 years ago.  In the same way, the highest performance network fabrics will be built around commodity Ethernet switches on a chip moving forward.

David vs. Goliath, UCSD vs. Microsoft?

In high school, I took journalism and worked on the school newspaper.  This means that I know the value of a good headline, that the headline may not reflect reality, and that the author of an article often does not write the headline.  In fact, the headline may not even reflect the contents of the article.  Still, I was surprised to see the headline to a recent Network World article, It’s Microsoft vs. the professors with competing data center architectures.  The article invokes an image of one side or the other throwing down the gauntlet and declaring war. In fact, nothing could be further from the truth.  If it were true, I would definitely feel worried about taking on Microsoft.

My group has been active in data center research.  The article does a very nice job of describing the architecture of PortLand, our recent work on a layer 2 network fabric designed to scale to very large data centers.  The article also describes recent work on VL2 from my colleagues at Microsoft Research.  Both papers appeared at SIGCOMM 2009 and were presented back-to-back in the same session at the conference.

I will leave detailed comparison between the two approaches to the papers themselves.  However, at a high level, both efforts start with a similar premise: the data center networking fabric, at the scale of 10k-100k ports, should be managed as a single network fabric.  One desirable goal here is to manage the netwrok as a single layer 2 domain.  However, conventional wisdom dictates that you cannot go beyond a few 100’s of ports for a layer 2 domain because of scalability and performance problems with traditional layer 2 protocols.  I described one such scalability limitation, limited switch state for forwarding tables, in an earlier post.  There are other challenges including spanning tree protocols, and broadcast overhead of ARP.

So the main takeaway is that we cannot scale a layer 2 network to target levels without changing some of the underlying protocols, at least a bit.  With perfect hindsight, the key difference between PortLand and VL2 is one of philosophy.  Both groups agree that the network should consist of unmodified switch hardware.  However, we believe that the end hosts should also remain unmodified, instead implementing new functionality by modifying switch software.  All switch hardware vendors export some API for programming switch forwarding tables and recently, systems such as OpenFlow export standard APIs for programming switch forwarding tables.  In fact, we implemented our prototype of PortLand using OpenFlow with the goal of maintaining the boundary between system and network administration.  VL2, on the other hand, prefers to leave the switch software unmodified and instead introduces its new functionality by modifying the end hosts themselves.  This leads to different architectural techniques and different designs.

One of our overriding goals is to reduce management burden, so we further introduce a decentralized Location Discovery Protocol (LDP) to automatically assign hierarchical prefixes to switches and end hosts.  These prefixes are the basis for compact forwarding tables in intermediate switches.  Both VL2 and PortLand leverage a directory service to essentially find an efficient path between a source and destination without resorting to broadcast (as would be required by default with ARP).

I consider the VL2 paper to be excellent.  I certainly learned a lot from reading the paper.  Perhaps the ultimate complement I can give is that I plan to assign it to my class in the spring when I teach graduate computer networks again.

Still, it is true that one of the best things about research is that we live in a marketplace of ideas and hence there must be some implicit competition.  We can only get better knowing that the folks at Microsoft are working on similar problems and certainly the “truth” as ascertained with 20/20 hindsight in 5-10 years will consist of some mixture of the competing techniques.  That way, everyone can declare victory.

“When 640KB is Still a Lot of Memory” or “Another Reason Scaling Layer 2 Networks is Hard”

In one apocryphal part of computing lore, Bill Gates famously explained the 640KB main memory limit in DOS back in 1981 by stating that 640KB should be sufficient for any program.  (According to at least Wikipedia, Bill Gates never made such a statement.)  Nearly 30 years later, computers routinely ship with gigabytes of memory.  We recently installed some machines 256GB of memory here at UCSD.  So we have all become desensitized to memory limitations in many settings.

Recently, we have been considering building large-scale Layer 2 networks for the data center environment.  In a Layer 2 network, each switch performs packet forwarding based on flat MAC addresses.  For any possible destination, the switch must match the destination MAC address in a packet in a lookup table and determine the output port for that destination.  High end switches today typically allocate 32k-64k entries in their MAC forwarding tables.  This means, assuming the potential for (eventual) all-to-all communication, the switches can scale to networks with up to 64k communicating end points.

Let’s assume that a forwarding table entry consists of 10 bytes, 6 bytes for the 48-bit MAC address and 4 bytes for the output port and any other bookkeeping information.  The resulting forwarding tables for such a high end switch would consist of 640KB of memory.

Initially, supporting 64k MAC entries may seem like it should be sufficient for just about any situation.  However, today we are starting to see data centers with hundreds of thousands of hosts.  Further, with the advent of virtualization, we often see 10+ virtual machines, each with their own unique MAC address, multiplexed onto individual physical machines.  So, let’s consider an extreme scenario where we would like to enable potentially all-to-all communication in a data center with 10 million virtual Layer 2 end points for communication (e.g., 500k hosts each with 20 virtual machines).

Clearly, in the short term at least, we will not have applications running on 10 million hosts simultaneously (I won’t make any pronouncements about never needing such application support!).  However, for maximum flexibility, a switch has to be at least be prepared for any directly-connected host to wish to an arbitrary host somewhere else in the data center.  Otherwise, the switch would have to run a reactive routing protocol to find an appropriate path to a destination for a given packet, introducing unnecessary and perhaps intolerable delays in establishing communication with a new destination.

One way to deal with this limitation is to partition the network into individual Layer 3 zones and require Layer 3/IP routing for hosts in different zones.  Employing Layer 3 routing in the data center decreases flexibility and increases administrative costs, as further discussed in our SIGCOMM 2009 paper on PortLand.

So let’s consider scaling a switch to support Layer 2 forwarding for 10 million end points.  Again assuming 10 bytes per forwarding table entry, this would require a forwarding table with 100 MB of memory.  For someone like me coming from an operating systems/application background, 100 MB of memory sounds like a tiny amount of memory.  After all, today I can buy 2GB of DRAM for about $25.  So what’s the problem?  We can scale switches to support the largest data centers imaginable by just adding a few dollars of memory.

Unfortunately, the lookups have to take place on the fast path of packet forwarding.  Switches operating today at 10 Gb/s have a few nanoseconds to perform such a lookup and determine the appropriate output port for a switch.  This requirement by itself eliminates the possibility of employing DRAM, it is simply not fast enough.  Still 100 MB of fast SRAM should still be affordable.  Unfortunately, the forwarding latency and the required bandwidth means that the forwarding tables have to be on-chip, i.e., on the same physical die as the switch ASIC.  At least for commodity switches, all functionality has to be on a single chip.  Otherwise, the cost for engineering hardware architecture that deliver sufficient bandwidth between a switch ASIC and off-chip SRAM (or TCAM) is prohibitive and eliminates the possibility of leveraging commodity hardware.  After all, one cannot expect commodity switch designers to target scenarios with 10 million potentially communicating end points as their target market while still maintaining their cost structure.

By analogy, even high-end processors from Intel/ACM using the very latest manufacturing technology (commodity switch hardware typically lags processor manufacturing by a generation or two) have Layer 1 caches with only a few MB of capacity.  Putting 100 MB of Layer 1 cache on a processor would be prohibitively expensive.  Similarly, having 640KB of fast forwarding table memory for commodity switches is at the high end (especially considering the significant amount of on-chip memory that must be allocated for packet buffering).

The bottom line is that getting 10’s or 100’s of MB of memory onto a switch ASIC just for forwarding tables is prohibitively expensive.  If we want to scale Layer 2 networks to potentially hundreds of thousands or millions of end hosts in the near future, we will require techniques to avoid having a single entry for each possible destination in switch forwarding tables.  This is one of the goals of our PortLand work: essentially, how to introduce hierarchy into Layer 2 addresses internally within the switch infrastructure to enable hierarchical (and much more compact) entries in forwarding tables.  With appropriate organization of the MAC address space, we should be able to support essentially arbitrary-sized data centers with a few hundred or a few thousand forwarding table entries, well within the bounds of commodity switch hardware.

The Push Toward Cloud Computing and Mega Data Centers

The computing industry is evolving toward a world where an increasing fraction of global computation and storage will be delivered from a relatively small number of dense data centers.  I think that is yet another very exciting to be in computing as we are once again in the process of reinventing what the computing model will look like ten years out.  Here I will talk about some of the drivers behind this trend.  A comprehensive discussion of these issues and more can be found in James Hamilton’s excellent Perspectives.

The driving forces behind this evolution are economics and convenience.  From an economic perspective, energy costs, rather than initial capital outlay, can dominate the cost of operating computing and storage equipment.  In this environment, operating large numbers of computers in regions of the world with cheap and clean power can incur an order of magnitude less cost while simultaneously making computing more environmentally friendly. Similarly, a well-run, dense data center may require an order of magnitude fewer people to operate than similar amounts of computing spread across multiple organizations and physical locations.

Finally, most computers operate at less than 10% (and often less than 1%) overall utilization when measured over long time scales. The advent of virtualization technologies allows for the multiplexing of multiple logical virtual machines onto individual physical machines, allowing for more efficient hardware utilization and also enabling end users and organization to only pay for the resources that they actually require at fine granularity.  Using 100 computers for 1 hour costs the same as using 1 computer for 100 hours without the need to procure and manage 100 computers for the long term.

Much of the computation running in data centers today run in parallel on hundreds, thousands, or even tens of thousands of processors.  A simple search request to Google, Yahoo!, etc. runs in parallel across a multi-petabyte dataset on thousands of computers.  Results must be returned interactively (e.g., less than 300 ms) for queries that require significant computation and communication.  At the other end of the interactivity spectrum, companies wish to run very large-scale data processing and data mining on their own petabyte-scale datasets.  For example, consider queries running over all the items stocked and sold by Walmart.

My group has become very interested in some of the issues in:

  • building out large-scale data centers, particularly the data center network infrastructure;
  • programming and managing applications running across multiple wide area data centers.

I will summarize some of our work in these areas in subsequent posts.

A Revolution in DBMSs

As a one-hop removed database outsider (partially tainted by co-authoring a VLDB 2000 paper), it seems to me that there is a revolution taking place right now in the field.  While the relational debate was before my time, the explosion in the data analytics market is forcing a re-examining of the appropriate architecture and interface for petabyte-scale data processing.  The MapReduce programming model and the open-source Hadoop framework have been receiving tremendous attention for the job of running ad hoc queries over extremely large, often unstructured datasets.  For instance, Facebook is running Hadoop queries over their multi-petabyte data warehouse.

Naturally, the database community argues that the traditional relational model is superior for these queries and that parallel database technology has been around for two decades to allow appropriate scaling.  One of the most entertaining papers I have read over the past few years takes this debate head on:

A Comparison of Approaches to Large-Scale Data Analysis by Pavlo et al in SIGMOD 2009.

The paper compares some of the pros and cons of each approach.  On the side for parallel databases:

  • Potentially superior performance from building appropriate indices over the data.
  • Long-term support for data evolution.  The queries need not change as “columns” are added to the underlying data.
  • Familiar declarative query interface

For shared nothing MapReduce jobs there are a number of benefits as well:

  • Much simpler system setup.  Configuring a parallel DB is likely more complex than configuring and installing an operating system.
  • Better support for fault tolerance.  MapReduce restarts individual jobs associated with failed or slow nodes.
  • Better scalability.  Parallel DB’s typically do not scale beyond about 100 nodes, whereas MapReduce jobs often run on thousands of nodes and anecdotally run on tens of thousands of nodes.

On this last point, the authors claim that petabyte-scale data warehouses are rare and that it is far from impossible to store 1 PB of data on 100 machines. On this last point, I believe this argument very conveniently ignores the CPU processing benefits available from large shared nothing clusters.  Having a factor of 10-100 more machines to carry out the processing can be a big win for certain queries relative to the parallel DB approach.

Overall however, the paper makes for a compelling read and I am sure that it will spark lively discussion.  I just recently read one very interesting followup on “HadoopDB” in VLDB 2009:

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads by Abouzeid et al in VLDB 2009.

This paper shows how to take the pre-existing Hive SQL front-end to MapReduce and instead push the functionality to run over DBMS’s running on individual nodes in the cluster that have been preloaded with data that would normally reside in HDFS.

A final effort worth mentioning in this space is the BOOM Analytics paper out of Joe Hellerstein’s group at Berkeley.  They have a re-implementation of Hadoop in a DataLog-like declarative programming language that should support a range of alternative query techniques over emerging massive data sets.

Perhaps most exciting to me is that the traditionally separate database and systems communities (check out this classic) will likely be coming together to build out a shared vision of “cloud computing” in the years ahead.

Amin Vahdat is a Professor in Computer Science and Engineering at UC San Diego.

May 2020