Archive for July 26th, 2009

A Revolution in DBMSs

As a one-hop removed database outsider (partially tainted by co-authoring a VLDB 2000 paper), it seems to me that there is a revolution taking place right now in the field.  While the relational debate was before my time, the explosion in the data analytics market is forcing a re-examining of the appropriate architecture and interface for petabyte-scale data processing.  The MapReduce programming model and the open-source Hadoop framework have been receiving tremendous attention for the job of running ad hoc queries over extremely large, often unstructured datasets.  For instance, Facebook is running Hadoop queries over their multi-petabyte data warehouse.

Naturally, the database community argues that the traditional relational model is superior for these queries and that parallel database technology has been around for two decades to allow appropriate scaling.  One of the most entertaining papers I have read over the past few years takes this debate head on:

A Comparison of Approaches to Large-Scale Data Analysis by Pavlo et al in SIGMOD 2009.

The paper compares some of the pros and cons of each approach.  On the side for parallel databases:

  • Potentially superior performance from building appropriate indices over the data.
  • Long-term support for data evolution.  The queries need not change as “columns” are added to the underlying data.
  • Familiar declarative query interface

For shared nothing MapReduce jobs there are a number of benefits as well:

  • Much simpler system setup.  Configuring a parallel DB is likely more complex than configuring and installing an operating system.
  • Better support for fault tolerance.  MapReduce restarts individual jobs associated with failed or slow nodes.
  • Better scalability.  Parallel DB’s typically do not scale beyond about 100 nodes, whereas MapReduce jobs often run on thousands of nodes and anecdotally run on tens of thousands of nodes.

On this last point, the authors claim that petabyte-scale data warehouses are rare and that it is far from impossible to store 1 PB of data on 100 machines. On this last point, I believe this argument very conveniently ignores the CPU processing benefits available from large shared nothing clusters.  Having a factor of 10-100 more machines to carry out the processing can be a big win for certain queries relative to the parallel DB approach.

Overall however, the paper makes for a compelling read and I am sure that it will spark lively discussion.  I just recently read one very interesting followup on “HadoopDB” in VLDB 2009:

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads by Abouzeid et al in VLDB 2009.

This paper shows how to take the pre-existing Hive SQL front-end to MapReduce and instead push the functionality to run over DBMS’s running on individual nodes in the cluster that have been preloaded with data that would normally reside in HDFS.

A final effort worth mentioning in this space is the BOOM Analytics paper out of Joe Hellerstein’s group at Berkeley.  They have a re-implementation of Hadoop in a DataLog-like declarative programming language that should support a range of alternative query techniques over emerging massive data sets.

Perhaps most exciting to me is that the traditionally separate database and systems communities (check out this classic) will likely be coming together to build out a shared vision of “cloud computing” in the years ahead.

Facebook’s Network Requirements

At the IEEE LEOS meeting, I had the chance to hear an excellent presentation by Donn Lee from Facebook on their network infrastructure and pain points. I first met Donn when I was speaking at Stanford on our own Data Center networking project.  He had a lot of great questions and feedback based on his experience at Facebook (not to mention Google, Cisco, etc.).

One interesting thing that came out of the presentation is the rate at which switch capacity has increased relative to the size and bandwidth requirements of data centers over the last decade or so.  Today, the biggest switch that one can buy is approximately a 128-port 10 Gigabit Ethernet switch.  However, data centers with 100,000’s of thousands of ports are not unheard of today and individual distributed applications can run on tens of thousands of machines.

A significant challenge is interconnecting all of these machines.  Donn mentioned that his ideal switch from an operational perspective at Facebook.  Would have 1500 ports.  The switch would have 1000 10GbE ports facing downward to end hosts (either 10k at 1 GbE each or 1k at 10 GbE each) and 50 100GbE ports facing up to another switch to allow communication with other logical clusters.  This suggests a requirement of nonblocking bandwidth at the granularity of 1-10k hosts and an oversubscription ratio of 2 in talking to other clusters.  This also suggests a switch that has 15 Terabits/sec of aggregate capacity.  The fact that this represents about a factor of 15 more bandwidth than what is available from commercial switches (not to mention the fact that there is no standard for 100 Gigabit Ethernet yet) means that Facebook has to build out complex meshes, presumably with some performance-limiting hashing to map flows to paths.

Given that doubling single switch capacity roughly requires a factor of 4 more logic, and the continued buildout of ever denser data centers, the communication fabric for these data centers is likely to form a top of increasing interest.

My group remains quite interested in this space.  In fact, our upcoming Merchant Silicon paper at Hot Interconnects this year considers the design of multi-stage 34 Tbps switch.

UPDATE: The slides from Donn’s talk are now available here.