As a one-hop removed database outsider (partially tainted by co-authoring a VLDB 2000 paper), it seems to me that there is a revolution taking place right now in the field. While the relational debate was before my time, the explosion in the data analytics market is forcing a re-examining of the appropriate architecture and interface for petabyte-scale data processing. The MapReduce programming model and the open-source Hadoop framework have been receiving tremendous attention for the job of running ad hoc queries over extremely large, often unstructured datasets. For instance, Facebook is running Hadoop queries over their multi-petabyte data warehouse.
Naturally, the database community argues that the traditional relational model is superior for these queries and that parallel database technology has been around for two decades to allow appropriate scaling. One of the most entertaining papers I have read over the past few years takes this debate head on:
A Comparison of Approaches to Large-Scale Data Analysis by Pavlo et al in SIGMOD 2009.
The paper compares some of the pros and cons of each approach. On the side for parallel databases:
- Potentially superior performance from building appropriate indices over the data.
- Long-term support for data evolution. The queries need not change as “columns” are added to the underlying data.
- Familiar declarative query interface
For shared nothing MapReduce jobs there are a number of benefits as well:
- Much simpler system setup. Configuring a parallel DB is likely more complex than configuring and installing an operating system.
- Better support for fault tolerance. MapReduce restarts individual jobs associated with failed or slow nodes.
- Better scalability. Parallel DB’s typically do not scale beyond about 100 nodes, whereas MapReduce jobs often run on thousands of nodes and anecdotally run on tens of thousands of nodes.
On this last point, the authors claim that petabyte-scale data warehouses are rare and that it is far from impossible to store 1 PB of data on 100 machines. On this last point, I believe this argument very conveniently ignores the CPU processing benefits available from large shared nothing clusters. Having a factor of 10-100 more machines to carry out the processing can be a big win for certain queries relative to the parallel DB approach.
Overall however, the paper makes for a compelling read and I am sure that it will spark lively discussion. I just recently read one very interesting followup on “HadoopDB” in VLDB 2009:
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads by Abouzeid et al in VLDB 2009.
This paper shows how to take the pre-existing Hive SQL front-end to MapReduce and instead push the functionality to run over DBMS’s running on individual nodes in the cluster that have been preloaded with data that would normally reside in HDFS.
A final effort worth mentioning in this space is the BOOM Analytics paper out of Joe Hellerstein’s group at Berkeley. They have a re-implementation of Hadoop in a DataLog-like declarative programming language that should support a range of alternative query techniques over emerging massive data sets.
Perhaps most exciting to me is that the traditionally separate database and systems communities (check out this classic) will likely be coming together to build out a shared vision of “cloud computing” in the years ahead.