Archive for July, 2009

Jeannie Albrecht secures NSF CAREER award

I am happy to announce that my former student, Jeannie Albrecht, has been recognized with an NSF CAREER Award. This is one of the top honors for a young scientist. Congratulations, Jeannie!

A Revolution in DBMSs

As a one-hop removed database outsider (partially tainted by co-authoring a VLDB 2000 paper), it seems to me that there is a revolution taking place right now in the field. While the relational debate was before my time, the explosion in the data analytics market is forcing a re-examination of the appropriate architecture and interface for petabyte-scale data processing. The MapReduce programming model and the open-source Hadoop framework have been receiving tremendous attention for the job of running ad hoc queries over extremely large, often unstructured datasets. For instance, Facebook is running Hadoop queries over its multi-petabyte data warehouse.

Naturally, the database community argues that the traditional relational model is superior for these queries and that parallel database technology has been around for two decades to allow appropriate scaling.  One of the most entertaining papers I have read over the past few years takes this debate head on:

A Comparison of Approaches to Large-Scale Data Analysis by Pavlo et al. in SIGMOD 2009.

The paper compares some of the pros and cons of each approach.  On the side for parallel databases:

  • Potentially superior performance from building appropriate indices over the data.
  • Long-term support for data evolution.  The queries need not change as “columns” are added to the underlying data.
  • Familiar declarative query interface

For shared-nothing MapReduce jobs there are a number of benefits as well:

  • Much simpler system setup.  Configuring a parallel DB is likely more complex than configuring and installing an operating system.
  • Better support for fault tolerance.  MapReduce restarts the individual tasks associated with failed or slow nodes rather than the whole job.
  • Better scalability.  Parallel DBs typically do not scale beyond about 100 nodes, whereas MapReduce jobs often run on thousands of nodes and anecdotally run on tens of thousands of nodes.
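
For readers less familiar with the programming model, here is a minimal single-process sketch of a MapReduce-style aggregation. It is not Hadoop code and is not taken from the paper: the record layout, the `map_fn`, `reduce_fn`, and `run_mapreduce` names, and the sample data are all made up for illustration. A real framework would spread the map and reduce calls across many machines.

```python
from collections import defaultdict

# Hypothetical log records: (user_id, page, bytes_sent). Illustrative data only.
RECORDS = [
    ("alice", "/home", 120),
    ("bob", "/home", 300),
    ("alice", "/photos", 450),
    ("carol", "/home", 80),
    ("bob", "/photos", 200),
]


def map_fn(record):
    """Emit (key, value) pairs for one input record."""
    _user, page, nbytes = record
    yield page, nbytes


def reduce_fn(key, values):
    """Combine all values that share a key into one result."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Roughly what "SELECT page, SUM(bytes_sent) ... GROUP BY page" would express
# in a declarative interface.
bytes_per_page = run_mapreduce(RECORDS, map_fn, reduce_fn)
print(bytes_per_page)
```

The contrast with the declarative interface is visible in the last comment: the database user states what to compute, while the MapReduce user writes the map and reduce functions by hand.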

On this last point, the authors claim that petabyte-scale data warehouses are rare and that it is far from impossible to store 1 PB of data on 100 machines. I believe this argument very conveniently ignores the CPU processing benefits available from large shared-nothing clusters. Having a factor of 10-100 more machines to carry out the processing can be a big win for certain queries relative to the parallel DB approach.
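
A back-of-the-envelope sketch of that argument follows. The only figures taken from the text are 1 PB of data and the 100-node starting point; the per-node processing rate of 100 MB/s is purely an assumption chosen to make the arithmetic concrete, and the function name is hypothetical.

```python
PETABYTE = 10**15  # bytes, decimal units

# Assumption for illustration only: each node can process its local data
# at 100 MB/s. This figure is not from the paper or from this post.
ASSUMED_BYTES_PER_SEC = 100 * 10**6


def full_pass_hours(total_bytes, nodes, rate=ASSUMED_BYTES_PER_SEC):
    """Hours for every node to process its equal share of the data in parallel."""
    bytes_per_node = total_bytes / nodes
    return bytes_per_node / rate / 3600


for nodes in (100, 1_000, 10_000):
    per_node_tb = PETABYTE / nodes / 10**12
    print(f"{nodes:>6} nodes: {per_node_tb:6.1f} TB per node, "
          f"{full_pass_hours(PETABYTE, nodes):6.2f} h per full pass")
```

Under these assumptions, storing the data on 100 machines is indeed feasible, but each node is left with 10 TB to chew through per query that touches everything. The time per pass falls in proportion to the node count only for work that parallelizes cleanly, which is why the benefit applies to certain queries and not all of them.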

Overall, however, the paper makes for a compelling read and I am sure that it will spark lively discussion. I just recently read one very interesting followup on “HadoopDB” in VLDB 2009:

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads by Abouzeid et al. in VLDB 2009.

This paper shows how to take the pre-existing Hive SQL front-end to MapReduce and instead push its functionality down to DBMSs running on the individual nodes of the cluster. These DBMSs are preloaded with the data that would normally reside in HDFS.

A final effort worth mentioning in this space is the BOOM Analytics paper out of Joe Hellerstein’s group at Berkeley. They have a re-implementation of Hadoop in a Datalog-like declarative programming language that should support a range of alternative query techniques over emerging massive data sets.

Perhaps most exciting to me is that the traditionally separate database and systems communities (check out this classic) will likely be coming together to build out a shared vision of “cloud computing” in the years ahead.

Facebook’s Network Requirements

At the IEEE LEOS meeting, I had the chance to hear an excellent presentation by Donn Lee from Facebook on their network infrastructure and pain points. I first met Donn when I was speaking at Stanford on our own Data Center networking project.  He had a lot of great questions and feedback based on his experience at Facebook (not to mention Google, Cisco, etc.).

One interesting thing that came out of the presentation is the rate at which switch capacity has increased relative to the size and bandwidth requirements of data centers over the last decade or so. Today, the biggest switch that one can buy is approximately a 128-port 10 Gigabit Ethernet switch. However, data centers with hundreds of thousands of ports are not unheard of today and individual distributed applications can run on tens of thousands of machines.

A significant challenge is interconnecting all of these machines. Donn mentioned that his ideal switch, from an operational perspective at Facebook, would have 1500 ports. The switch would have 1000 10GbE ports facing downward to end hosts (either 10k at 1 GbE each or 1k at 10 GbE each) and 50 100GbE ports facing up to another switch to allow communication with other logical clusters. This suggests a requirement of nonblocking bandwidth at the granularity of 1-10k hosts and an oversubscription ratio of 2 in talking to other clusters. This also suggests a switch that has 15 Terabits/sec of aggregate capacity. The fact that this represents about a factor of 15 more bandwidth than what is available from commercial switches (not to mention the fact that there is no standard for 100 Gigabit Ethernet yet) means that Facebook has to build out complex meshes, presumably with some performance-limiting hashing to map flows to paths.
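
The bandwidth arithmetic behind those numbers can be sketched as follows. The port counts and speeds come from the text above; the variable names and the decision to count capacity as downlink plus uplink bandwidth are my own simplifications.

```python
GBPS_PER_TBPS = 1000

# Figures taken from the text above.
down_ports, down_gbps = 1000, 10        # 10GbE ports facing end hosts
up_ports, up_gbps = 50, 100             # 100GbE ports facing other clusters
largest_ports, largest_gbps = 128, 10   # largest switch one can buy today

down_tbps = down_ports * down_gbps / GBPS_PER_TBPS
up_tbps = up_ports * up_gbps / GBPS_PER_TBPS

# Simplification: aggregate capacity is counted as downlink plus uplink.
aggregate_tbps = down_tbps + up_tbps
oversubscription = down_tbps / up_tbps
largest_tbps = largest_ports * largest_gbps / GBPS_PER_TBPS

print(f"downlink:         {down_tbps:.2f} Tbps")
print(f"uplink:           {up_tbps:.2f} Tbps")
print(f"aggregate:        {aggregate_tbps:.2f} Tbps")
print(f"oversubscription: {oversubscription:.1f}")
print(f"vs. 128 x 10GbE:  {aggregate_tbps / largest_tbps:.1f}x")
```

With this simple accounting, the gap to a 128-port 10GbE switch comes out closer to a factor of 12, which is in the same ballpark as the rough factor of 15 quoted above.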

Given that doubling single switch capacity roughly requires a factor of 4 more logic, and the continued buildout of ever denser data centers, the communication fabric for these data centers is likely to form a topic of increasing interest.

My group remains quite interested in this space. In fact, our upcoming Merchant Silicon paper at Hot Interconnects this year considers the design of a multi-stage 34 Tbps switch.

UPDATE: The slides from Donn’s talk are now available here.

Greg Papadopoulos Keynote

Greg Papadopoulos, CTO of Sun Microsystems, gave the keynote presentation at the CNS research review this morning. This was personally very gratifying for me as Greg was an inspiration to me (and many other graduate students) during our work on the Network of Workstations (NOW) project.

Greg pushed the envelope in addressing some of the commonly held wisdom in large-scale computer systems design, namely:

  • Ethernet is good enough
  • SMPs are too expensive
  • Failures are frequent

As systems designers, we have become accustomed to embracing simplicity and weak semantics in the underlying system architecture, though often at the cost of increased complexity in the higher-level applications that are left to deal with the failures. His challenge to the audience was to consider the inherent costs of building more reliable, higher performance systems. In many cases, the savings to the application developers and the improvements in end-to-end performance will be valuable.

The talk went through a number of interesting examples. In Ethernet, we consider congestion spreading through the network (no isolation), packet drops, etc. to be things that application developers have to account for. In building large-scale systems, we assume that all parts should be commodity and quite failure prone. The networks we employ to connect data-center scale systems cap out at a relatively modest maximum bisection bandwidth, leaving many applications starved on the network.

Put another way, Greg advocates applying the end-to-end argument, a driving force behind network design. The end-to-end argument states that reliability, fault tolerance, etc. ultimately belong in the application because that is the only place where they can be completely implemented. One of my takeaways from the talk is that it is possible that we have become too lazy in applying the end-to-end principle. The original paper clearly states that there are exceptions to the principle, especially where additional performance gains are possible, or by extension in the current environment, reduced cost or energy consumption.

Are there opportunities to add additional engineering into the system infrastructure to reduce application complexity and reduce overall cost?

CNS Research Review


We are busy preparing for the summer CNS Research Review at UCSD. This will be our eleventh semi-annual review and our biggest one yet with more than 100 attendees registered. Time certainly seems to be accelerating as this will be the ninth review that I am running as the Center Director. Special thanks to Kathy Krane and Paul Terry who are managing every detail for the event.

Greg Papadopoulos will kick off the meeting with a keynote on his perspective of the future of computing. Greg has been an inspiration to me personally since he attended the NOW project retreats while I was a graduate student. At one of these meetings about 15 years ago, Greg said something very simple and yet profound, “The only good network is a boring network.” Certainly in the age of ultra-large-scale data centers and a billion-node Internet, we still fall far short of this ideal.

Some of the highlights of the meeting will include:

  • Four presentations from our industrial partners on some of the top challenges that they are facing.
  • Updates and final reports on 11 ongoing CNS collaborations with our industrial partners.
  • Proposals for new 2009-2011 CNS research efforts.
  • A student poster session.
  • Feedback from our guests on our research. This is always the most valuable portion of the review. Nothing like having pointed feedback from some of the top people in industry about the directions we undertake.