Archive for August, 2009

David vs. Goliath, UCSD vs. Microsoft?

In high school, I took journalism and worked on the school newspaper.  This means that I know the value of a good headline, that the headline may not reflect reality, and that the author of an article often does not write the headline.  In fact, the headline may not even reflect the contents of the article.  Still, I was surprised to see the headline to a recent Network World article, It’s Microsoft vs. the professors with competing data center architectures.  The article invokes an image of one side or the other throwing down the gauntlet and declaring war. In fact, nothing could be further from the truth.  If it were true, I would definitely feel worried about taking on Microsoft.

My group has been active in data center research.  The article does a very nice job of describing the architecture of PortLand, our recent work on a layer 2 network fabric designed to scale to very large data centers.  The article also describes recent work on VL2 from my colleagues at Microsoft Research.  Both papers appeared at SIGCOMM 2009 and were presented back-to-back in the same session at the conference.

I will leave detailed comparison between the two approaches to the papers themselves.  However, at a high level, both efforts start with a similar premise: the data center networking fabric, at the scale of 10k-100k ports, should be managed as a single network fabric.  One desirable goal here is to manage the netwrok as a single layer 2 domain.  However, conventional wisdom dictates that you cannot go beyond a few 100’s of ports for a layer 2 domain because of scalability and performance problems with traditional layer 2 protocols.  I described one such scalability limitation, limited switch state for forwarding tables, in an earlier post.  There are other challenges including spanning tree protocols, and broadcast overhead of ARP.

So the main takeaway is that we cannot scale a layer 2 network to target levels without changing some of the underlying protocols, at least a bit.  With perfect hindsight, the key difference between PortLand and VL2 is one of philosophy.  Both groups agree that the network should consist of unmodified switch hardware.  However, we believe that the end hosts should also remain unmodified, instead implementing new functionality by modifying switch software.  All switch hardware vendors export some API for programming switch forwarding tables and recently, systems such as OpenFlow export standard APIs for programming switch forwarding tables.  In fact, we implemented our prototype of PortLand using OpenFlow with the goal of maintaining the boundary between system and network administration.  VL2, on the other hand, prefers to leave the switch software unmodified and instead introduces its new functionality by modifying the end hosts themselves.  This leads to different architectural techniques and different designs.

One of our overriding goals is to reduce management burden, so we further introduce a decentralized Location Discovery Protocol (LDP) to automatically assign hierarchical prefixes to switches and end hosts.  These prefixes are the basis for compact forwarding tables in intermediate switches.  Both VL2 and PortLand leverage a directory service to essentially find an efficient path between a source and destination without resorting to broadcast (as would be required by default with ARP).

I consider the VL2 paper to be excellent.  I certainly learned a lot from reading the paper.  Perhaps the ultimate complement I can give is that I plan to assign it to my class in the spring when I teach graduate computer networks again.

Still, it is true that one of the best things about research is that we live in a marketplace of ideas and hence there must be some implicit competition.  We can only get better knowing that the folks at Microsoft are working on similar problems and certainly the “truth” as ascertained with 20/20 hindsight in 5-10 years will consist of some mixture of the competing techniques.  That way, everyone can declare victory.

“When 640KB is Still a Lot of Memory” or “Another Reason Scaling Layer 2 Networks is Hard”

In one apocryphal part of computing lore, Bill Gates famously explained the 640KB main memory limit in DOS back in 1981 by stating that 640KB should be sufficient for any program.  (According to at least Wikipedia, Bill Gates never made such a statement.)  Nearly 30 years later, computers routinely ship with gigabytes of memory.  We recently installed some machines 256GB of memory here at UCSD.  So we have all become desensitized to memory limitations in many settings.

Recently, we have been considering building large-scale Layer 2 networks for the data center environment.  In a Layer 2 network, each switch performs packet forwarding based on flat MAC addresses.  For any possible destination, the switch must match the destination MAC address in a packet in a lookup table and determine the output port for that destination.  High end switches today typically allocate 32k-64k entries in their MAC forwarding tables.  This means, assuming the potential for (eventual) all-to-all communication, the switches can scale to networks with up to 64k communicating end points.

Let’s assume that a forwarding table entry consists of 10 bytes, 6 bytes for the 48-bit MAC address and 4 bytes for the output port and any other bookkeeping information.  The resulting forwarding tables for such a high end switch would consist of 640KB of memory.

Initially, supporting 64k MAC entries may seem like it should be sufficient for just about any situation.  However, today we are starting to see data centers with hundreds of thousands of hosts.  Further, with the advent of virtualization, we often see 10+ virtual machines, each with their own unique MAC address, multiplexed onto individual physical machines.  So, let’s consider an extreme scenario where we would like to enable potentially all-to-all communication in a data center with 10 million virtual Layer 2 end points for communication (e.g., 500k hosts each with 20 virtual machines).

Clearly, in the short term at least, we will not have applications running on 10 million hosts simultaneously (I won’t make any pronouncements about never needing such application support!).  However, for maximum flexibility, a switch has to be at least be prepared for any directly-connected host to wish to an arbitrary host somewhere else in the data center.  Otherwise, the switch would have to run a reactive routing protocol to find an appropriate path to a destination for a given packet, introducing unnecessary and perhaps intolerable delays in establishing communication with a new destination.

One way to deal with this limitation is to partition the network into individual Layer 3 zones and require Layer 3/IP routing for hosts in different zones.  Employing Layer 3 routing in the data center decreases flexibility and increases administrative costs, as further discussed in our SIGCOMM 2009 paper on PortLand.

So let’s consider scaling a switch to support Layer 2 forwarding for 10 million end points.  Again assuming 10 bytes per forwarding table entry, this would require a forwarding table with 100 MB of memory.  For someone like me coming from an operating systems/application background, 100 MB of memory sounds like a tiny amount of memory.  After all, today I can buy 2GB of DRAM for about $25.  So what’s the problem?  We can scale switches to support the largest data centers imaginable by just adding a few dollars of memory.

Unfortunately, the lookups have to take place on the fast path of packet forwarding.  Switches operating today at 10 Gb/s have a few nanoseconds to perform such a lookup and determine the appropriate output port for a switch.  This requirement by itself eliminates the possibility of employing DRAM, it is simply not fast enough.  Still 100 MB of fast SRAM should still be affordable.  Unfortunately, the forwarding latency and the required bandwidth means that the forwarding tables have to be on-chip, i.e., on the same physical die as the switch ASIC.  At least for commodity switches, all functionality has to be on a single chip.  Otherwise, the cost for engineering hardware architecture that deliver sufficient bandwidth between a switch ASIC and off-chip SRAM (or TCAM) is prohibitive and eliminates the possibility of leveraging commodity hardware.  After all, one cannot expect commodity switch designers to target scenarios with 10 million potentially communicating end points as their target market while still maintaining their cost structure.

By analogy, even high-end processors from Intel/ACM using the very latest manufacturing technology (commodity switch hardware typically lags processor manufacturing by a generation or two) have Layer 1 caches with only a few MB of capacity.  Putting 100 MB of Layer 1 cache on a processor would be prohibitively expensive.  Similarly, having 640KB of fast forwarding table memory for commodity switches is at the high end (especially considering the significant amount of on-chip memory that must be allocated for packet buffering).

The bottom line is that getting 10’s or 100’s of MB of memory onto a switch ASIC just for forwarding tables is prohibitively expensive.  If we want to scale Layer 2 networks to potentially hundreds of thousands or millions of end hosts in the near future, we will require techniques to avoid having a single entry for each possible destination in switch forwarding tables.  This is one of the goals of our PortLand work: essentially, how to introduce hierarchy into Layer 2 addresses internally within the switch infrastructure to enable hierarchical (and much more compact) entries in forwarding tables.  With appropriate organization of the MAC address space, we should be able to support essentially arbitrary-sized data centers with a few hundred or a few thousand forwarding table entries, well within the bounds of commodity switch hardware.

Yahoo!’s Geo-Replication Service, PNUTS

A few weeks ago, I had the chance to visit Yahoo! Research.  I had nice conversations with Brian Cooper and Raghu Ramakrishnan regarding their new storage infrastructure, PNUTS.  I had a great time during my visit and wanted to write a bit about PNUTS after going through their paper in more detail.  Their work is addressing what I consider to be an increasingly important problem, delivering applications to a global audience from data centers spread all across the planet.

Such geo-replication of application data is required because no single data center can provide requisite levels of availability to clients and because speed of light delays and wide-area network congestion make it impossible to deliver interactive response times for clients potentially half ways across the planet.

PNUTS goal is to provide a hosted storage infrastructure exporting a record-based API.  Clients may insert records into tables following a loose scheme (not all columns have to be specified for all records).  Each record has a primary key and an assigned owner, used to deliver PNUTS’s consistency guarantees.  A table’s primary keys may be ordered or hashed, with ordering more naturally supporting range queries and hashing lending itself to load balancing.

Perhaps the primary question for any wide-area replication service is the consistency model.  Because the Yahoo! services leveraging PNUTS have strict performance requirements, the PNUTS designers deemed the overhead of providing strong consistency to be too high.  Instead, individual individual records export “timeline consistency.”  Essentially, all updates are forwarded to a per-record master.  Once the write is applied at the master, the synchronous portion of a write completes and success is returned to the client.  PNUTS then propagates the writes asynchronously to the other data centers replicating the record.  While reads to remote data centers may return stale data, updates will be ordered at the master (hence no conflicts) and pushed to remote replicas in order.

PNUTS aims to scale to ten+ wide-area data centers, each with 1,000 storage machines (petabyte-scale storage).  PNUTS targets record-based storage for online access and hence is complementary to storage systems such as HDFS that target batch-based analysis or other storage systems that target large “blob” storage (video, audio, etc.).

I find this space to be extremely interesting and really in its infancy.  Kudos to the folks at Yahoo! for being among the first to tackle this important space. There are a number of alternative techniques.  Dynamo from Amazon targets single data-center storage and exports an eventual consistency model that may leave updates applied “out of order” from the perspective of a client.  BigTable/GFS provide stronger consistency guarantees but synchronously apply updates to multiple replicas within the data center, making them less appropriate for geo-replication.

In my own group, we are also building a system targeting geo-replication across multiple wide-area data centers.  Our goal is to quantify the exact costs of strong consistency is for web services leveraging data replicated across multiple data centers.  We feel that there are applications that would benefit from strong consistency and sacrifice it in terms of significant additional complexity.  From Yahoo!’s internal measurements from the paper, we see that 85% of writes to a record originate from the same data center.  This certainly justifies locating the master at this location.  However, with timeline consistency, the remaining writes must go across the wide-area anyway to the master, making it difficult to enforce SLAs that are often set at the 99 or even 99.9%.  Further, unavailability of the master makes a record either unavailable or imposes the need to “fork” the timeline, requiring the client application to potentially reconcile conflicting updates.

Can we architect a system that enforces strong consistency, delivers acceptable performance, and maintains availability in the face of any single replica failure?  Clearly, there will be cases where the answer will be a resounding “No!”  We want to understand the scenarios where achieving these properties is possible and, with the appropriate architecture and design, expand the space.

Update: I recently found another nice writeup on PNUTS here.

University College London Mosaic


As an academic, I have two principal goals.  The first goal is to teach students and hopefully to have a positive impact on their lives.  The second (and it is really a by product of the first in my opinion) is to perform good research and publish papers that advances scientific understanding.  Recently, some of our work inspired some artwork on display at University College London.  I found this to be incredibly gratifying and I am left wondering if this mosaic will have more “impact” than most of the papers I publish?

First, some background.  A few years ago, our research in network emulation led us to consider the characteristics of Internet topologies.  We wanted to be able to generate random graphs of varying sizes that still maintained the essential characteristics of original graphs.  As is often the case with research, the work led us in unanticipated directions.  My former student, Priya Mahadevan,worked with a terrific former Physics PhD at CAIDA, Dmitri Krioukov, to find a series of graph degree distributions that are guaranteed to always converge to the original graph.  Interestingly, it appears that capturing the distribution of degree-labeled node triples in a wide range of complex graphs is sufficient to reproduce global graph properties.

We described this approach, called the dK-series, in a SIGCOMM 2006 paper.  In a SIGCOMM 2007 paper, we described Orbis, a graph generator that employs a number of algorithms to generate random graphs that reproduce a given dK-distribution.  Both papers contained visualizations of the graphs.  You can see some of the original pictures at the Orbis web page.

Back to the artwork.  The folks in the Computer Science Department at University College London liked some of the depictions of Internet topology and their resemblance to “digital dandelions.”  They commissioned an artist to create a mosaic based on these depictions and have it on display in their new building.  The Department Head of CS at UCL, Anthony Finkelstein, was kind enough to forward some great pictures.  The inscription reads:

Digital Dandelion
Mosaic by Hannah Griffiths 2009

Networking researchers model the structure and growth of the Internet to
predict its evolution. These models reveal the scaling characteristics of
networks, and can help scientists evaluate how new network algorithms and
architectures will perform. This mosaic, which looks like the head of a
dandelion, is based on a map of the Internet generated by algorithms from
computer scientists at UC San Diego in 2007. This map features Internet
nodes – the dots – and linkages – the lines. It is a (mostly) randomly
generated graph that retains the essential interconnectivity characteristics
of a specific corner of the Internet but doubles the number of nodes.



The Push Toward Cloud Computing and Mega Data Centers

The computing industry is evolving toward a world where an increasing fraction of global computation and storage will be delivered from a relatively small number of dense data centers.  I think that is yet another very exciting to be in computing as we are once again in the process of reinventing what the computing model will look like ten years out.  Here I will talk about some of the drivers behind this trend.  A comprehensive discussion of these issues and more can be found in James Hamilton’s excellent Perspectives.

The driving forces behind this evolution are economics and convenience.  From an economic perspective, energy costs, rather than initial capital outlay, can dominate the cost of operating computing and storage equipment.  In this environment, operating large numbers of computers in regions of the world with cheap and clean power can incur an order of magnitude less cost while simultaneously making computing more environmentally friendly. Similarly, a well-run, dense data center may require an order of magnitude fewer people to operate than similar amounts of computing spread across multiple organizations and physical locations.

Finally, most computers operate at less than 10% (and often less than 1%) overall utilization when measured over long time scales. The advent of virtualization technologies allows for the multiplexing of multiple logical virtual machines onto individual physical machines, allowing for more efficient hardware utilization and also enabling end users and organization to only pay for the resources that they actually require at fine granularity.  Using 100 computers for 1 hour costs the same as using 1 computer for 100 hours without the need to procure and manage 100 computers for the long term.

Much of the computation running in data centers today run in parallel on hundreds, thousands, or even tens of thousands of processors.  A simple search request to Google, Yahoo!, etc. runs in parallel across a multi-petabyte dataset on thousands of computers.  Results must be returned interactively (e.g., less than 300 ms) for queries that require significant computation and communication.  At the other end of the interactivity spectrum, companies wish to run very large-scale data processing and data mining on their own petabyte-scale datasets.  For example, consider queries running over all the items stocked and sold by Walmart.

My group has become very interested in some of the issues in:

  • building out large-scale data centers, particularly the data center network infrastructure;
  • programming and managing applications running across multiple wide area data centers.

I will summarize some of our work in these areas in subsequent posts.

Amin Vahdat is a Professor in Computer Science and Engineering at UC San Diego.

August 2009