Scale Out Networking: “Data Center Switch Architecture in the Age of Merchant Silicon”

Last week, my PhD student Nathan Farrington presented our paper “Data Center Switch Architecture in the Age of Merchant Silicon” at Hot Interconnects.  My group has been thinking about the concept of scale out networking.  Today, we roughly understand how to build incrementally scalable computation infrastructures from clusters of commodity PCs.  We similarly understand how to incrementally deploy storage in clusters through systems such as GFS or HDFS.  Higher-level software enables the computation and storage to be incrementally built out, achieving so-called “scale out” functionality: adding CPUs and disks should result in a proportional increase in overall processing power and storage capacity.

However, achieving the same functionality for the network remains a challenge.  Adding a few high-bandwidth switches to a large topology may not increase the aggregate bandwidth available to applications running on the infrastructure.  In fact, ill-advised placement of new switches under the original Ethernet spanning tree protocol could actually reduce the available bandwidth.

Of course, the ability to seamlessly harness additional CPUs and storage in some large-scale infrastructure did not become available overnight.  Significant monitoring and protocol work went into achieving such functionality.  So, one goal of our work is to consider the protocol, software, and hardware requirements of scale-out networking.  Essentially, how can developers of large-scale network infrastructures independently add both ports and bandwidth to their topology?

Along one dimension, the network should expand to accommodate more hosts by adding ports.  The bandwidth available in the global switching infrastructure should then be re-apportioned among the available ports.  This allocation may be influenced by higher-level administrator policy, and importantly that policy need not be expressed on a link-by-link, port-by-port, or even path-by-path basis.  Rather, bandwidth may be allocated to the applications and services running on the infrastructure.  And, of course, the mapping of application to port-set may change dynamically.

Along a second dimension, the aggregate network bandwidth should be expandable by simply plugging in additional hardware.  This bandwidth should then correspond to increased available network performance across the network fabric, again subject to administrator policy.

Thus, I may have a network with 1000 ports of 10 Gigabit/sec Ethernet.  The network fabric may support 1 Terabit/sec of aggregate bandwidth, making an average of 1 Gigabit/sec of fabric bandwidth available to each port.  This results in an oversubscription ratio of 10, which may be appropriate depending on the communication requirements of the applications running on the infrastructure.  Given this network, I should be able to expand the number of ports to 2000 while maintaining aggregate bandwidth in the switching fabric at 1 Terabit/sec, increasing the oversubscription ratio to 20.  Similarly, I might increase the aggregate bandwidth in the fabric to 2 Terabits/sec while maintaining the port count at 1000, decreasing the oversubscription ratio to 5.
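
The arithmetic above can be sketched in a few lines of Python.  This is only an illustration, not anything from the paper: the function names `oversubscription_ratio` and `per_port_share_gbps` are made up here, and the sketch assumes every port runs at the same line rate and that fabric bandwidth is shared evenly.

```python
def oversubscription_ratio(ports: int, port_gbps: float, fabric_gbps: float) -> float:
    """Total host-facing port bandwidth divided by aggregate fabric bandwidth.

    Hypothetical helper: assumes all ports run at the same line rate.
    """
    if ports <= 0 or port_gbps <= 0 or fabric_gbps <= 0:
        raise ValueError("ports and bandwidths must be positive")
    return (ports * port_gbps) / fabric_gbps


def per_port_share_gbps(ports: int, fabric_gbps: float) -> float:
    """Average fabric bandwidth per port, assuming an even split."""
    if ports <= 0 or fabric_gbps <= 0:
        raise ValueError("ports and bandwidth must be positive")
    return fabric_gbps / ports


# The three scenarios from the text (1 Terabit/sec = 1000 Gigabit/sec).
baseline = oversubscription_ratio(ports=1000, port_gbps=10, fabric_gbps=1000)
more_ports = oversubscription_ratio(ports=2000, port_gbps=10, fabric_gbps=1000)
more_fabric = oversubscription_ratio(ports=1000, port_gbps=10, fabric_gbps=2000)

print(baseline, more_ports, more_fabric)  # 10.0 20.0 5.0
```

Scaling out ports raises the ratio and scaling out fabric bandwidth lowers it, which is why the two dimensions need to be independently adjustable.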

Our paper considered the hardware requirements of such an architecture.  At a high level, we designed a modular two-level network architecture around available “merchant silicon.”  The first level consists of so-called pod switches: large-scale, fully functional Ethernet switches with between 100 and 1000 ports given current technology design points.  The pod switches are built from some number of merchant-silicon chips available economically from any number of manufacturers (including Fulcrum, Broadcom, Gnodal, etc.).  Fabric cards containing the merchant silicon control the amount of available aggregate bandwidth (and hence the oversubscription ratio) in a pod.  The second level of the architecture, the core switching array, leverages the same merchant silicon in modular fabric cards to vary the amount of oversubscription for global, or inter-pod, communication.

The system scales out the number of ports with additional pods (and line cards within a pod) and adds bandwidth to both pods and the network as a whole with modular fabric cards.

The work also considers the physical cabling challenges associated with any large-scale network infrastructure.  Essentially, transporting lots of bandwidth (e.g., potentially petabits/sec) across a room takes a lot of power and a lot of cables, especially if using traditional copper cable.  However, technology trends in optics are changing this side of the equation.  More on this in a separate post.

The availability of commodity, feature-rich switches will, I believe, change the face of networking in the same way that commodity processors changed the face of networked services and high-performance computing (back in the mid-90s, the NOW project at UC Berkeley explored the use of clusters of commodity PCs to address both domains).  Today, the highest performance compute systems are typically built from commodity x86 processors.  This was not necessarily true 10 years ago, and certainly not 20 years ago.  In the same way, the highest performance network fabrics will, moving forward, be built around commodity Ethernet switches on a chip.

4 Responses to “Scale Out Networking: “Data Center Switch Architecture in the Age of Merchant Silicon””

  1. 1 James Liao February 16, 2011 at 4:48 pm


    This was a nice post. Most of the points you foresaw in 2009 are actually happening today. Given what you and your team have experimented with and learned since 2009, would you consider TRILL to meet the expectations you set in this post?


    • 2 aminvahdat February 19, 2011 at 9:25 pm


      This is probably the subject of a longer post but here is a quick summary. Overall, I believe TRILL to be a big step forward in terms of L2 routing/forwarding. However, I remain concerned with three aspects of TRILL:
      – It remains flooding based to distribute topology information
      – In the end, each switch must track every host system wide
      – Multipath forwarding support is less than ideal (not that L3 ECMP is perfect)


Amin Vahdat is a Professor in Computer Science and Engineering at UC San Diego.

September 2009
