“When 640KB is Still a Lot of Memory” or “Another Reason Scaling Layer 2 Networks is Hard”

In one apocryphal piece of computing lore, Bill Gates explained the 640KB main memory limit in DOS back in 1981 by stating that 640KB should be sufficient for any program.  (According to Wikipedia, at least, Bill Gates never made such a statement.)  Nearly 30 years later, computers routinely ship with gigabytes of memory.  We recently installed some machines with 256GB of memory here at UCSD.  So we have all become desensitized to memory limitations in many settings.

Recently, we have been considering building large-scale Layer 2 networks for the data center environment.  In a Layer 2 network, each switch performs packet forwarding based on flat MAC addresses.  For any possible destination, the switch must look up the packet's destination MAC address in a forwarding table to determine the output port for that destination.  High end switches today typically allocate 32k-64k entries in their MAC forwarding tables.  This means that, assuming the potential for (eventual) all-to-all communication, the switches can scale to networks with up to 64k communicating end points.

Let’s assume that a forwarding table entry consists of 10 bytes, 6 bytes for the 48-bit MAC address and 4 bytes for the output port and any other bookkeeping information.  The resulting forwarding tables for such a high end switch would consist of 640KB of memory.

At first glance, supporting 64k MAC entries may seem sufficient for just about any situation.  However, today we are starting to see data centers with hundreds of thousands of hosts.  Further, with the advent of virtualization, we often see 10+ virtual machines, each with its own unique MAC address, multiplexed onto individual physical machines.  So, let’s consider an extreme scenario where we would like to enable potentially all-to-all communication in a data center with 10 million virtual Layer 2 end points (e.g., 500k hosts each with 20 virtual machines).

Clearly, in the short term at least, we will not have applications running on 10 million hosts simultaneously (I won’t make any pronouncements about never needing such application support!).  However, for maximum flexibility, a switch has to at least be prepared for any directly-connected host to wish to communicate with an arbitrary host somewhere else in the data center.  Otherwise, the switch would have to run a reactive routing protocol to find an appropriate path to a destination for a given packet, introducing unnecessary and perhaps intolerable delays in establishing communication with a new destination.

One way to deal with this limitation is to partition the network into individual Layer 3 zones and require Layer 3/IP routing for hosts in different zones.  Employing Layer 3 routing in the data center decreases flexibility and increases administrative costs, as further discussed in our SIGCOMM 2009 paper on PortLand.

So let’s consider scaling a switch to support Layer 2 forwarding for 10 million end points.  Again assuming 10 bytes per forwarding table entry, this would require a forwarding table with 100 MB of memory.  For someone like me coming from an operating systems/application background, 100 MB of memory sounds like a tiny amount of memory.  After all, today I can buy 2GB of DRAM for about $25.  So what’s the problem?  We can scale switches to support the largest data centers imaginable by just adding a few dollars of memory.
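The arithmetic behind these two table sizes can be sketched in a few lines.  This is only a back-of-the-envelope calculation using the 10-byte entry assumed above; reading "64k" as 65,536 entries is my assumption, and the function and variable names are purely illustrative.

```python
ENTRY_BYTES = 10  # 6-byte MAC address + 4 bytes for output port/bookkeeping


def table_bytes(entries: int, entry_bytes: int = ENTRY_BYTES) -> int:
    """Memory needed for a flat table with one entry per destination."""
    return entries * entry_bytes


# Today's high end switch: "64k" entries, read here as 65,536 (an assumption).
today_entries = 64 * 1024

# The extreme scenario from the text: 500k hosts, 20 virtual machines each.
hosts, vms_per_host = 500_000, 20
future_entries = hosts * vms_per_host

today = table_bytes(today_entries)
future = table_bytes(future_entries)

print(f"today:  {today_entries:,} entries -> {today / 1024:.0f} KiB")
print(f"future: {future_entries:,} entries -> {future / 1e6:.0f} MB")
print(f"growth: {future / today:.1f}x")
```

The flat table grows linearly with the number of end points, so the extreme scenario needs roughly 150 times the forwarding memory of today's high end switch.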

Unfortunately, the lookups have to take place on the fast path of packet forwarding.  Switches operating today at 10 Gb/s have only a few nanoseconds to perform such a lookup and determine the appropriate output port for a packet.  This requirement by itself eliminates the possibility of employing DRAM; it is simply not fast enough.  Still, 100 MB of fast SRAM should be affordable.  Unfortunately, the forwarding latency and the required bandwidth mean that the forwarding tables have to be on-chip, i.e., on the same physical die as the switch ASIC.  At least for commodity switches, all functionality has to be on a single chip.  Otherwise, the cost of engineering a hardware architecture that delivers sufficient bandwidth between a switch ASIC and off-chip SRAM (or TCAM) is prohibitive and eliminates the possibility of leveraging commodity hardware.  After all, one cannot expect commodity switch designers to target scenarios with 10 million potentially communicating end points while still maintaining their cost structure.
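One rough way to see where a budget of a few nanoseconds comes from is to compute the worst-case packet arrival rate.  The sketch below assumes back-to-back minimum-size Ethernet frames (64 bytes plus 8 bytes of preamble and a 12-byte inter-frame gap).  The 24-port count and the idea that all ports share one lookup pipeline are hypothetical choices of mine for illustration, not a description of any particular switch ASIC.

```python
MIN_FRAME_BYTES = 64        # minimum Ethernet frame, including the FCS
PREAMBLE_BYTES = 8          # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames


def ns_per_packet(link_gbps: float) -> float:
    """Worst-case time between packet arrivals on one port, in nanoseconds."""
    wire_bits = (MIN_FRAME_BYTES + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    # bits divided by Gb/s gives nanoseconds
    return wire_bits / link_gbps


def ns_per_lookup(link_gbps: float, ports: int) -> float:
    """Lookup budget if `ports` ports share a single lookup pipeline."""
    return ns_per_packet(link_gbps) / ports


per_port = ns_per_packet(10.0)
shared = ns_per_lookup(10.0, ports=24)  # 24 ports is a hypothetical figure

print(f"one 10 Gb/s port:     {per_port:.1f} ns per packet")
print(f"24 ports, one lookup: {shared:.1f} ns per lookup")
```

Under these assumptions a single port leaves about 67 ns per packet, and sharing the lookup logic across a couple of dozen ports brings the budget down to a few nanoseconds, well below typical DRAM access times.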

By analogy, even high-end processors from Intel/AMD using the very latest manufacturing technology (commodity switch hardware typically lags processor manufacturing by a generation or two) have on-chip caches with only a few MB of capacity.  Putting 100 MB of cache on a processor die would be prohibitively expensive.  Similarly, having 640KB of fast forwarding table memory for commodity switches is at the high end (especially considering the significant amount of on-chip memory that must be allocated for packet buffering).

The bottom line is that getting tens or hundreds of MB of memory onto a switch ASIC just for forwarding tables is prohibitively expensive.  If we want to scale Layer 2 networks to potentially hundreds of thousands or millions of end hosts in the near future, we will require techniques to avoid having a single entry for each possible destination in switch forwarding tables.  This is one of the goals of our PortLand work: essentially, how to introduce hierarchy into Layer 2 addresses internally within the switch infrastructure to enable hierarchical (and much more compact) entries in forwarding tables.  With appropriate organization of the MAC address space, we should be able to support essentially arbitrary-sized data centers with a few hundred or a few thousand forwarding table entries, well within the bounds of commodity switch hardware.
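To make the effect of hierarchy concrete, here is a minimal sketch of a forwarding table keyed on address prefixes rather than on full addresses.  The 48-bit layout (16 bits each for pod, edge switch, and host), the class name, and the port numbers are hypothetical choices for illustration; this is not meant to reproduce PortLand's actual address encoding or switch implementation.

```python
from typing import Dict, Optional

# Hypothetical 48-bit hierarchical address, for illustration only:
# | pod: 16 bits | edge switch: 16 bits | host: 16 bits |


def make_addr(pod: int, edge: int, host: int) -> int:
    """Pack the three fields into one 48-bit address."""
    return (pod << 32) | (edge << 16) | host


class HierarchicalTable:
    """Forwarding state for one edge switch, one entry per prefix."""

    def __init__(self, my_pod: int, my_edge: int) -> None:
        self.my_pod = my_pod
        self.my_edge = my_edge
        self.pod_ports: Dict[int, int] = {}   # remote pod -> output port
        self.edge_ports: Dict[int, int] = {}  # edge switch in my pod -> port
        self.host_ports: Dict[int, int] = {}  # directly attached host -> port

    def lookup(self, addr: int) -> Optional[int]:
        """Match on the shortest prefix that is enough to pick a port."""
        pod = addr >> 32
        edge = (addr >> 16) & 0xFFFF
        host = addr & 0xFFFF
        if pod != self.my_pod:
            return self.pod_ports.get(pod)
        if edge != self.my_edge:
            return self.edge_ports.get(edge)
        return self.host_ports.get(host)

    def size(self) -> int:
        return len(self.pod_ports) + len(self.edge_ports) + len(self.host_ports)


# 100 pods x 100 edge switches x 1,000 end points = 10 million end points.
PODS, EDGES, HOSTS = 100, 100, 1000
table = HierarchicalTable(my_pod=0, my_edge=0)
for pod in range(1, PODS):
    table.pod_ports[pod] = 0               # port 0: uplink toward other pods
for edge in range(1, EDGES):
    table.edge_ports[edge] = 1             # port 1: uplink within the pod
for host in range(HOSTS):
    table.host_ports[host] = 2 + host % 48  # ports 2..49: local end points

print(f"end points reachable: {PODS * EDGES * HOSTS:,}")
print(f"table entries needed: {table.size():,}")
```

In this toy configuration the switch can reach 10 million end points with 1,198 entries, compared with 10 million entries for a flat table.  A real edge switch would likely need even fewer, since it could use a default route for everything that is not local.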

5 Responses to ““When 640KB is Still a Lot of Memory” or “Another Reason Scaling Layer 2 Networks is Hard””

  1. 1 Adam August 19, 2009 at 12:50 pm

    PortLand seems to me like someone is reinventing the IPv6 wheel.

    • 2 aminvahdat August 19, 2009 at 1:21 pm

      Interesting, but I am not sure how so. I assume you are referring to IPv6 auto configuration, where hosts can use their MAC addresses to assign themselves unique IP addresses. If I understand correctly, you would still have to configure individual switch subnets to reduce routing state and you still could not easily migrate virtual machines without reassigning their IP address (breaking all outstanding connections, invalidating remote state, etc.).

  2. 3 Jeff August 21, 2009 at 2:26 pm

    I think the work you and your colleagues are doing with PortLand is most welcome. After spending quite a few years in telecom, my perception is that the carrier edge generally ends up being the simple part of your network to maintain, while the internal co-lo switching and routing infrastructure is a major challenge: customers coming and going, their ever more complex deployment designs that may demand things your architecture just can’t deliver, and frequent misconfiguration as a result of poor record keeping or user error.

    Generally speaking, customers demand fast fail-over and very often things like HSRP or VRRP alone just can’t cut the mustard in terms of speed of convergence.

    I will be very interested to see how PortLand evolves. Will this code be licensed under GPL or similar? I’d be very keen on doing some testing and experimenting with it in the near future.

    Congrats again to you and your colleagues on this project.

  1. 1 David vs. Goliath, UCSD vs. Microsoft « Idle Process Trackback on August 21, 2009 at 11:11 pm
  2. 2 The Blurring of Layer 2 and Layer 3 « Idle Process Trackback on September 24, 2009 at 2:30 pm

Amin Vahdat is a Professor in Computer Science and Engineering at UC San Diego.

August 2009
