A Comparison of Approaches to Large-Scale Data Analysis

Andrew Pavlo (Brown University, pavlo@cs.brown.edu)
Erik Paulson (University of Wisconsin, epaulson@cs.wisc.edu)
Alexander Rasin (Brown University, alexr@cs.brown.edu)
Daniel J. Abadi (Yale University, dna@cs.yale.edu)
David J. DeWitt (Microsoft Inc., dewitt@microsoft.com)
Samuel Madden (M.I.T. CSAIL, madden@csail.mit.edu)
Michael Stonebraker (M.I.T. CSAIL, stonebraker@csail.mit.edu)

ABSTRACT

There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.

Categories and Subject Descriptors

H.2.4 [Database Management]: Systems—Parallel databases

General Terms

Database Applications, Use Cases, Database Programming

1. INTRODUCTION

Recently the trade press has been filled with news of the revolution of “cluster computing”. This paradigm entails harnessing large numbers of (low-end) processors working in parallel to solve a computing problem. In effect, this suggests constructing a data center by lining up a large number of low-end servers instead of deploying a smaller set of high-end servers. With this rise of interest in clusters has come a proliferation of tools for programming them. One of the earliest and best known such tools is MapReduce (MR) [8]. MapReduce is attractive because it provides a simple model through which users can express relatively sophisticated distributed programs, leading to significant interest in the educational community. For example, IBM and Google have announced plans to make a 1000-processor MapReduce cluster available to teach students distributed programming.

Given this interest in MapReduce, it is natural to ask “Why not use a parallel DBMS instead?” Parallel database systems (which all share a common architectural design) have been commercially available for nearly two decades, and there are now about a dozen in the marketplace, including Teradata, Aster Data, Netezza, DATAllegro (and therefore soon Microsoft SQL Server via Project Madison), Dataupia, Vertica, ParAccel, Neoview, Greenplum, DB2 (via the Database Partitioning Feature), and Oracle (via Exadata). They are robust, high performance computing platforms. Like MapReduce, they provide a high-level programming environment and parallelize readily. Though it may seem that MR and parallel databases target different audiences, it is in fact possible to write almost any parallel processing task as either a set of database queries (possibly using user defined functions and aggregates to filter and combine data) or a set of MapReduce jobs. Inspired by this question, our goal is to understand the differences between the MapReduce approach to performing large-scale data analysis and the approach taken by parallel database systems. The two classes of systems make different choices in several key areas. For example, all DBMSs require that data conform to a well-defined schema, whereas MR permits data to be in any arbitrary format. Other differences also include how each system provides indexing and compression optimizations, programming models, the way in which data is distributed, and query execution strategies.

The purpose of this paper is to consider these choices, and the trade-offs that they entail. We begin in Section 2 with a brief review of the two alternative classes of systems, followed by a discussion in Section 3 of the architectural trade-offs. Then, in Section 4 we present our benchmark consisting of a variety of tasks, one taken from the MR paper [8], and the rest a collection of more demanding tasks. In addition, we present the results of running the benchmark on a 100-node cluster to execute each task. We tested the publicly available open-source version of MapReduce, Hadoop [1], against two parallel SQL DBMSs, Vertica [3] and a second system from a major relational vendor. We also present results on the time each system took to load the test data and report informally on the procedures needed to set up and tune the software for each task.

In general, the SQL DBMSs were significantly faster and required less code to implement each task, but took longer to tune and load the data. Hence, we conclude with a discussion on the reasons for the differences between the approaches and provide suggestions on the best practices for any large-scale data analysis engine.

Some readers may feel that experiments conducted using 100 nodes are not interesting or representative of real world data processing systems. We disagree with this conjecture on two points. First, as we demonstrate in Section 4, at 100 nodes the two parallel DBMSs range from a factor of 3.1 to 6.5 faster than MapReduce on a variety of analytic tasks. While MR may indeed be capable of scaling up to 1000s of nodes, the superior efficiency of modern DBMSs alleviates the need to use such massive hardware on datasets in the range of 1–2PB (1000 nodes with 2TB of disk/node has a total disk capacity of 2PB). For example, eBay's Teradata configuration uses just 72 nodes (two quad-core CPUs, 32GB RAM, 104 300GB disks per node) to manage approximately 2.4PB of relational data. As another example, Fox Interactive Media's warehouse is implemented using a 40-node Greenplum DBMS. Each node is a Sun X4500 machine with two dual-core CPUs, 48 500GB disks, and 16GB RAM (1PB total disk space) [7]. Since few data sets in the world even approach a petabyte in size, it is not at all clear how many MR users really need 1,000 nodes.

2. TWO APPROACHES TO LARGE SCALE DATA ANALYSIS

The two classes of systems we consider in this paper run on a “shared nothing” collection of computers [19]. That is, the system is deployed on a collection of independent machines, each with local disk and local main memory, connected together on a high-speed local area network. Both systems achieve parallelism by dividing any data set to be utilized into partitions, which are allocated to different nodes to facilitate parallel processing. In this section, we provide an overview of how both the MR model and traditional parallel DBMSs operate in this environment.

2.1 MapReduce

One of the attractive qualities about the MapReduce programming model is its simplicity: an MR program consists only of two functions, called Map and Reduce, that are written by a user to process key/value data pairs. The input data set is stored in a collection of partitions in a distributed file system deployed on each node in the cluster. The program is then injected into a distributed processing framework and executed in a manner to be described.

The Map function reads a set of “records” from an input file, does any desired filtering and/or transformations, and then outputs a set of intermediate records in the form of new key/value pairs. As the Map function produces these output records, a “split” function partitions the records into R disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. Each map bucket is written to the processing node's local disk. The Map function terminates having produced R output files, one for each bucket. In general, there are multiple instances of the Map function running on different nodes of a compute cluster. We use the term instance to mean a unique running invocation of either the Map or Reduce function. Each Map instance is assigned a distinct portion of the input file by the MR scheduler to process. If there are M such distinct portions of the input file, then there are R files on disk storage for each of the M Map tasks, for a total of M × R files: F_ij, 1 ≤ i ≤ M, 1 ≤ j ≤ R. The key observation is that all Map instances use the same hash function; thus, all output records with the same hash value are stored in the same output file.

The second phase of a MR program executes R instances of the Reduce program, where R is typically the number of nodes. The input for each Reduce instance R_j consists of the files F_ij, 1 ≤ i ≤ M. These files are transferred over the network from the Map nodes' local disks. Note that again all output records from the Map phase with the same hash value are consumed by the same Reduce instance, regardless of which Map instance produced the data. Each Reduce instance processes or combines the records assigned to it in some way, and then writes records to an output file (in the distributed file system), which forms part of the computation's final output.

The input data set exists as a collection of one or more partitions in the distributed file system. It is the job of the MR scheduler to decide how many Map instances to run and how to allocate them to available nodes. Likewise, the scheduler must also decide on the number and location of nodes running Reduce instances. The MR central controller is responsible for coordinating the system activities on each node. A MR program finishes execution once the final result is written as new files in the distributed file system.

2.2 Parallel DBMSs

Database systems capable of running on clusters of shared nothing nodes have existed since the late 1980s. These systems all support standard relational tables and SQL, and thus the fact that the data is stored on multiple machines is transparent to the end-user. Many of these systems build on the pioneering research from the Gamma [10] and Grace [11] parallel DBMS projects. The two key aspects that enable parallel execution are that (1) most (or even all) tables are partitioned over the nodes in a cluster and that (2) the system uses an optimizer that translates SQL commands into a query plan whose execution is divided amongst multiple nodes. Because programmers only need to specify their goal in a high level language, they are not burdened by the underlying storage details, such as indexing options and join strategies.

Consider a SQL command to filter the records in a table T1 based on a predicate, along with a join to a second table T2 with an aggregate computed on the result of the join. A basic sketch of how this command is processed in a parallel DBMS consists of three phases. Since the database will have already stored T1 on some collection of the nodes partitioned on some attribute, the filter sub-query is first performed in parallel at these sites, similar to the filtering performed in a Map function. Following this step, one of two common parallel join algorithms is employed based on the size of the data tables. For example, if the number of records in T2 is small, then the DBMS could replicate it on all nodes when the data is first loaded. This allows the join to execute in parallel at all nodes. Following this, each node then computes the aggregate using its portion of the answer to the join. A final “roll-up” step is required to compute the final answer from these partial aggregates [9].

If the size of the data in T2 is large, then T2's contents will be distributed across multiple nodes. If these tables are partitioned on different attributes than those used in the join, the system will have to hash both T2 and the filtered version of T1 on the join attribute using a common hash function. The redistribution of both T2 and the filtered version of T1 to the nodes is similar to the processing that occurs between the Map and the Reduce functions. Once each node has the necessary data, it then performs a hash join and calculates the preliminary aggregate function. Again, a roll-up computation must be performed as a last step to produce the final answer.
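As a concrete illustration (a minimal sketch; the column names joinkey, groupkey, attr, and amount are hypothetical, since the text only names the tables T1 and T2), such a command might look like:

   SELECT T1.groupkey, SUM(T2.amount) AS total
     FROM T1 JOIN T2 ON T1.joinkey = T2.joinkey
    WHERE T1.attr > 100
    GROUP BY T1.groupkey;

Here the WHERE predicate corresponds to the parallel filter step, the JOIN to either the replicated or the re-partitioned join described above, and the GROUP BY aggregate to the per-node partial aggregation followed by the final roll-up.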
At first glance, these two approaches to data analysis and processing have many common elements; however, there are notable differences that we consider in the next section.

3. ARCHITECTURAL ELEMENTS

In this section, we consider aspects of the two system architectures that are necessary for processing large amounts of data in a distributed environment. One theme in our discussion is that the nature of the MR model is well suited for development environments with a small number of programmers and a limited application domain. This lack of constraints, however, may not be appropriate for longer-term and larger-sized projects.
3.1 Schema Support

Parallel DBMSs require data to fit into the relational paradigm of rows and columns. In contrast, the MR model does not require that data files adhere to a schema defined using the relational data model. That is, the MR programmer is free to structure their data in any manner or even to have no structure at all.

One might think that the absence of a rigid schema automatically makes MR the preferable option. For example, SQL is often criticized for its requirement that the programmer must specify the “shape” of the data in a data definition facility. On the other hand, the MR programmer must often write a custom parser in order to derive the appropriate semantics for their input records, which is at least an equivalent amount of work. But there are also other potential problems with not using a schema for large data sets.

Whatever structure exists in MR input files must be built into the Map and Reduce programs. Existing MR implementations provide built-in functionality to handle simple key/value pair formats, but the programmer must explicitly write support for more complex data structures, such as compound keys. This is possibly an acceptable approach if a MR data set is not accessed by multiple applications. If such data sharing exists, however, a second programmer must decipher the code written by the first programmer to decide how to process the input file. A better approach, followed by all SQL DBMSs, is to separate the schema from the application and store it in a set of system catalogs that can be queried.
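For example, in systems that expose the SQL-standard INFORMATION_SCHEMA views, a programmer can discover a table's layout directly from the catalogs rather than from another programmer's code (a minimal sketch; the table name Documents is only illustrative):

   SELECT column_name, data_type
     FROM INFORMATION_SCHEMA.COLUMNS
    WHERE table_name = 'Documents';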
But even if the schema is separated from the application and made available to multiple MR programs through a description facility, the developers must also agree on a single schema. This obviously requires some commitment to a data model or models, and the input files must obey this commitment as it is cumbersome to modify data attributes once the files are created.

Once the programmers agree on the structure of data, something or someone must ensure that any data added or modified does not violate integrity or other high-level constraints (e.g., employee salaries must be non-negative). Such conditions must be known and explicitly adhered to by all programmers modifying a particular data set; a MR framework and its underlying distributed storage system have no knowledge of these rules, and thus allow input data to be easily corrupted with bad data. By again separating such constraints from the application and enforcing them automatically by the run time system, as is done by all SQL DBMSs, the integrity of the data is enforced without additional work on the programmer's behalf.

In summary, when no sharing is anticipated, the MR paradigm is quite flexible. If sharing is needed, however, then we argue that it is advantageous for the programmer to use a data description language and factor schema definitions and integrity constraints out of application programs. This information should be installed in common system catalogs accessible to the appropriate users and applications.
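A minimal sketch of such a factored-out definition, using a hypothetical Employees table and the non-negative salary rule mentioned above:

   CREATE TABLE Employees (
       empid   INTEGER PRIMARY KEY,
       name    VARCHAR(64),
       salary  DECIMAL(10,2) CHECK (salary >= 0) );

Any program that inserts or updates Employees is then subject to the same constraint, enforced by the DBMS rather than by convention among programmers.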
3.2 Indexing

All modern DBMSs use hash or B-tree indexes to accelerate access to data. If one is looking for a subset of records (e.g., employees with a salary greater than $100,000), then using a proper index reduces the scope of the search dramatically. Most database systems also support multiple indexes per table. Thus, the query optimizer can decide which index to use for each query or whether to simply perform a brute-force sequential search.
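For instance, the salary lookup above could be supported as follows (a generic sketch reusing the hypothetical Employees table from Section 3.1; exact index DDL varies by system):

   CREATE INDEX emp_salary_idx ON Employees (salary);

   SELECT * FROM Employees WHERE salary > 100000;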
Because the MR model is so simple, MR frameworks do not provide built-in indexes. The programmer must implement any indexes that they may desire to speed up access to the data inside of their application. This is not easily accomplished, as the framework's data fetching mechanisms must also be instrumented to use these indexes when pushing data to running Map instances. Once more, this is an acceptable strategy if the indexes do not need to be shared between multiple programmers, despite requiring every MR programmer to re-implement the same basic functionality.

If sharing is needed, however, then the specifications of what indexes are present and how to use them must be transferred between programmers. It is again preferable to store this index information in a standard format in the system catalogs, so that programmers can query this structure to discover such knowledge.

3.3 Programming Model

During the 1970s, the database research community engaged in a contentious debate between the relational advocates and the Codasyl advocates [18]. The salient issue of this discussion was whether a program to access data in a DBMS should be written either by:

  1. Stating what you want – rather than presenting an algorithm for how to get it (Relational)

  2. Presenting an algorithm for data access (Codasyl)

In the end, the former view prevailed and the last 30 years is a testament to the value of relational database systems. Programs in high-level languages, such as SQL, are easier to write, easier to modify, and easier for a new person to understand. Codasyl was criticized for being “the assembly language of DBMS access”. We argue that MR programming is somewhat analogous to Codasyl programming: one is forced to write algorithms in a low-level language in order to perform record-level manipulation. On the other hand, to many people brought up programming in procedural languages, such as C/C++ or Java, describing tasks in a declarative language like SQL can be challenging.

Anecdotal evidence from the MR community suggests that there is widespread sharing of MR code fragments to do common tasks, such as joining data sets. To alleviate the burden of having to re-implement repetitive tasks, the MR community is migrating high-level languages on top of the current interface to move such functionality into the run time. Pig [15] and Hive [2] are two notable projects in this direction.

3.4 Data Distribution

The conventional wisdom for large-scale databases is to always send the computation to the data, rather than the other way around. In other words, one should send a small program over the network to a node, rather than importing a large amount of data from the node. Parallel DBMSs use knowledge of data distribution and location to their advantage: a parallel query optimizer strives to balance computational workloads while minimizing the amount of data transmitted over the network connecting the nodes of the cluster.

Aside from the initial decision on where to schedule Map instances, a MR programmer must perform these tasks manually. For example, suppose a user writes a MR program to process a collection of documents in two parts. First, the Map function scans the documents and creates a histogram of frequently occurring words. The documents are then passed to a Reduce function that groups files by their site of origin. Using this data, the user, or another user building on the first user's work, now wants to find sites with a document that contains more than five occurrences of the word ‘Google’ or the word ‘IBM’. In the naive implementation of this query, where the Map is executed over the accumulated statistics, the filtration is done after the statistics for all documents are computed and shipped to reduce workers, even though only a small subset of documents satisfy the keyword filter.

In contrast, the following SQL view and select queries perform a similar computation:
   CREATE VIEW Keywords AS
   SELECT siteid, docid, word, COUNT(*) AS wordcount
     FROM Documents
    GROUP BY siteid, docid, word;

   SELECT DISTINCT siteid
     FROM Keywords
    WHERE (word = 'IBM' OR word = 'Google') AND wordcount > 5;

A modern DBMS would rewrite the second query such that the view definition is substituted for the Keywords table in the FROM clause. Then, the optimizer can push the WHERE clause in the query down so that it is applied to the Documents table before the COUNT is computed, substantially reducing computation. If the documents are spread across multiple nodes, then this filter can be applied on each node before documents belonging to the same site are grouped together, generating much less network I/O.
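One possible form of the rewritten query (a sketch; the optimizer's actual plan may differ) inlines the view and pushes the keyword filter below the aggregation:

   SELECT DISTINCT siteid
     FROM Documents
    WHERE word = 'IBM' OR word = 'Google'
    GROUP BY siteid, docid, word
   HAVING COUNT(*) > 5;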
3.5 Execution Strategy

There is a potentially serious performance problem related to MR's handling of data transfer between Map and Reduce jobs. Recall that each of the M Map instances produces R output files, each destined for a different Reduce instance. These files are written to the local disk on the node executing each particular Map instance. If M is 1,000 and R is 500, the Map phase of the program produces 500,000 local files. When the Reduce phase starts, each of the 500 Reduce instances needs to read its 1,000 input files and must use a file-transfer protocol to “pull” each of its input files from the nodes on which the Map instances were run. With 100s of Reduce instances running simultaneously, it is inevitable that two or more Reduce instances will attempt to read their input files from the same map node simultaneously, inducing large numbers of disk seeks and slowing the effective disk transfer rate. This is why parallel database systems do not materialize their split files and instead use a push approach to transfer data rather than a pull.

3.6 Flexibility

Despite its widespread adoption, SQL is routinely criticized for its insufficient expressive prowess. Some believe that it was a mistake for the database research community in the 1970s to focus on data sub-languages that could be embedded in any programming language, rather than adding high-level data access to all programming languages. Fortunately, new application frameworks, such as Ruby on Rails [21] and LINQ [14], have started to reverse this situation by leveraging new programming language functionality to implement an object-relational mapping pattern. These programming environments allow developers to benefit from the robustness of DBMS technologies without the burden of writing complex SQL.

Proponents of the MR model argue that SQL does not facilitate the desired generality that MR provides. But almost all of the major DBMS products (commercial and open-source) now provide support for user-defined functions, stored procedures, and user-defined aggregates in SQL. Although this does not have the full generality of MR, it does improve the flexibility of database systems.

3.7 Fault Tolerance

The MR frameworks provide a more sophisticated failure model than parallel DBMSs. While both classes of systems use some form of replication to deal with disk failures, MR is far more adept at handling node failures during the execution of a MR computation. In a MR system, if a unit of work (i.e., processing a block of data) fails, then the MR scheduler can automatically restart the task on an alternate node. Part of this flexibility is the result of the fact that the output files of the Map phase are materialized locally instead of being streamed to the nodes running the Reduce tasks. Similarly, pipelines of MR jobs, such as the one described in Section 4.3.4, materialize intermediate results to files each step of the way. This differs from parallel DBMSs, which have larger granules of work (i.e., transactions) that are restarted in the event of a failure. Part of the reason for this approach is that DBMSs avoid saving intermediate results to disk whenever possible. Thus, if a single node fails during a long running query in a DBMS, the entire query must be completely restarted.

4. PERFORMANCE BENCHMARKS

In this section, we present our benchmark consisting of five tasks that we use to compare the performance of the MR model with that of parallel DBMSs. The first task is taken directly from the original MapReduce paper [8] that the authors claim is representative of common MR tasks. Because this task is quite simple, we also developed four additional tasks, comprised of more complex analytical workloads designed to explore the trade-offs discussed in the previous section. We executed our benchmarks on a well-known MR implementation and two parallel DBMSs.

4.1 Benchmark Environment

As we describe the details of our benchmark environment, we note how the different data analysis systems that we test differ in operating assumptions and discuss the ways in which we dealt with them in order to make the experiments uniform.

4.1.1 Tested Systems

Hadoop: The Hadoop system is the most popular open-source implementation of the MapReduce framework, under development by Yahoo! and the Apache Software Foundation [1]. Unlike the Google implementation of the original MR framework written in C++, the core Hadoop system is written entirely in Java. For our experiments in this paper, we use Hadoop version 0.19.0 running on Java 1.6.0. We deployed the system with the default configuration settings, except for the following changes that we found yielded better performance without diverging from core MR fundamentals: (1) data is stored using 256MB data blocks instead of the default 64MB, (2) each task executor JVM ran with a maximum heap size of 512MB and the DataNode/JobTracker JVMs ran with a maximum heap size of 1024MB (for a total size of 3.5GB per node), (3) we enabled Hadoop's “rack awareness” feature for data locality in the cluster, and (4) we allowed Hadoop to reuse the task JVM executor instead of starting a new process for each Map/Reduce task. Moreover, we configured the system to run two Map instances and a single Reduce instance concurrently on each node.

The Hadoop framework also provides an implementation of the Google distributed file system [12]. For each benchmark trial, we store all input and output data in the Hadoop distributed file system (HDFS). We used the default settings of HDFS of three replicas per block and without compression; we also tested other configurations, such as using only a single replica per block as well as block- and record-level compression, but we found that our tests almost always executed at the same speed or worse with these features enabled (see Section 5.1.3). After each benchmark run finishes for a particular node scaling level, we delete the data directories on each node and reformat HDFS so that the next set of input data is replicated uniformly across all nodes.

Hadoop uses a central job tracker and a “master” HDFS daemon to coordinate node activities. To ensure that these daemons do not affect the performance of worker nodes, we execute both of these additional framework components on a separate node in the cluster.

DBMS-X: We used the latest release of DBMS-X, a parallel SQL DBMS from a major relational database vendor that stores data in a row-based format.
The system is installed on each node and configured to use 4GB shared memory segments for the buffer pool and other temporary space. Each table is hash partitioned across all nodes on the salient attribute for that particular table, and then sorted and indexed on different attributes (see Sections 4.2.1 and 4.3.1). Like the Hadoop experiments, we deleted the tables in DBMS-X and reloaded the data for each trial to ensure that the tuples were uniformly distributed in the cluster.

By default DBMS-X does not compress data in its internal storage, but it does provide the ability to compress tables using a well-known dictionary-based scheme. We found that enabling compression reduced the execution times for almost all the benchmark tasks by 50%, and thus we only report results with compression enabled. In only one case did we find that using compression actually performed worse. Furthermore, because all of our benchmarks are read-only, we did not enable replication features in DBMS-X, since this would not have improved performance and complicates the installation process.

Vertica: The Vertica database is a parallel DBMS designed for large data warehouses [3]. The main distinction of Vertica from other DBMSs (including DBMS-X) is that all data is stored as columns, rather than rows [20]. It uses a unique execution engine designed specifically for operating on top of a column-oriented storage layer. Unlike DBMS-X, Vertica compresses data by default since its executor can operate directly on compressed tables. Because disabling this feature is not typical in Vertica deployments, the Vertica results in this paper are generated using only compressed data. Vertica also sorts every table by one or more attributes based on a clustered index.

We found that the default 256MB buffer size per node performed well in our experiments. The Vertica resource manager is responsible for setting the amount of memory given to queries, but we provide a hint to the system to expect to execute only one query at a time. Thus, each query receives most of the maximum amount of memory available on each node at runtime.

4.1.2 Node Configuration

All three systems were deployed on a 100-node cluster. Each node has a single 2.40 GHz Intel Core 2 Duo processor running 64-bit Red Hat Enterprise Linux 5 (kernel version 2.6.18) with 4GB RAM and two 250GB SATA-I hard disks. According to hdparm, the hard disks deliver 7GB/sec for cached reads and about 74MB/sec for buffered reads. The nodes are connected with Cisco Catalyst 3750E-48TD switches. This switch has gigabit Ethernet ports for each node and an internal switching fabric of 128Gbps [6]. There are 50 nodes per switch. The switches are linked together via Cisco StackWise Plus, which creates a 64Gbps ring between the switches. Traffic between two nodes on the same switch is entirely local to the switch and does not impact traffic on the ring.

4.1.3 Benchmark Execution

For each benchmark task, we describe the steps used to implement the MR program as well as provide the equivalent SQL statement(s) executed by the two database systems. We executed each task three times and report the average of the trials. Each system executes the benchmark tasks separately to ensure exclusive access to the cluster's resources. To measure the basic performance without the overhead of coordinating parallel tasks, we first execute each task on a single node. We then execute the task on different cluster sizes to show how each system scales as both the amount of data processed and available resources are increased. We only report results using trials where all nodes are available and the system's software operates correctly during the benchmark execution.

We also measured the time it takes for each system to load the test data. The results from these measurements are split between the actual loading of the data and any additional operations after the loading that each system performs, such as compressing or building indexes. The initial input data on each node is stored on one of its two locally installed disks.

Unless otherwise indicated, the final results from the queries executing in Vertica and DBMS-X are piped from a shell command into a file on the disk not used by the DBMS. Although it is possible to do an equivalent operation in Hadoop, it is easier (and more common) to store the results of a MR program in the distributed file system. This procedure, however, is not analogous to how the DBMSs produce their output data; rather than storing the results in a single file, the MR program produces one output file for each Reduce instance and stores them in a single directory. The standard practice is for developers then to use these output directories as a single input unit for other MR jobs. If, however, a user wishes to use this data in a non-MR application, they must first combine the results into a single file and download it to the local file system.

Because of this discrepancy, we execute an extra Reduce function for each MR benchmark task that simply combines the final output into a single file in HDFS. Our results differentiate between the execution times for Hadoop running the actual benchmark task versus the additional combine operation. Thus, the Hadoop results displayed in the graphs for this paper are shown as stacked bars: the lower portion of each bar is the execution time for just the specific benchmark task, while the upper portion is the execution time for the single Reduce function to combine all of the program's output data into a single file.

4.2 The Original MR Task

Our first benchmark task is the “Grep task” taken from the original MapReduce paper, which the authors describe as “representative of a large subset of the real programs written by users of MapReduce” [8]. For this task, each system must scan through a data set of 100-byte records looking for a three-character pattern. Each record consists of a unique key in the first 10 bytes, followed by a 90-byte random value. The search pattern is only found in the last 90 bytes once in every 10,000 records.

The input data is stored on each node in plain text files, with one record per line. For the Hadoop trials, we uploaded these files unaltered directly into HDFS. To load the data into Vertica and DBMS-X, we execute each system's proprietary load commands in parallel on each node and store the data using the following schema:

   CREATE TABLE Data (
       key VARCHAR(10) PRIMARY KEY,
       field VARCHAR(90) );

We execute the Grep task using two different data sets. The measurements in the original MapReduce paper are based on processing 1TB of data on approximately 1800 nodes, which is 5.6 million records or roughly 535MB of data per node. For each system, we execute the Grep task on cluster sizes of 1, 10, 25, 50, and 100 nodes. The total number of records processed for each cluster size is therefore 5.6 million times the number of nodes. The performance of each system not only illustrates how each system scales as the amount of data is increased, but also allows us to (to some extent) compare the results to the original MR system.

While our first dataset fixes the size of the data per node to be the same as the original MR benchmark and only varies the number of nodes, our second dataset fixes the total dataset size to be the same as the original MR benchmark (1TB) and evenly divides the data amongst a variable number of nodes. This task measures how well each system scales as the number of available nodes is increased.
[Figure 1: Load Times – Grep Task Data Set (535MB/node)]
[Figure 2: Load Times – Grep Task Data Set (1TB/cluster)]
[Figure 3: Load Times – UserVisits Data Set (20GB/node)]
Since Hadoop needs a total of 3TB of disk space in order to store                                                            ferent node based on the hash of its primary key. Once the data is
three replicas of each block in HDFS, we were limited to running                                                             loaded, the columns are automatically sorted and compressed ac-
this benchmark only on 25, 50, and 100 nodes (at fewer than 25                                                               cording to the physical design of the database.
nodes, there is not enough available disk space to store 3TB).
                                                                                                                             Results & Discussion: The results for loading both the 535MB/node
4.2.1 Data Loading
                                                                                                                             and 1TB/cluster data sets are shown in Figures 1 and 2, respectively.
  We now describe the procedures used to load the data from the                                                              For DBMS-X, we separate the times of the two loading phases,
nodes’ local files into each system’s internal storage representation.                                                        which are shown as a stacked bar in the graphs: the bottom seg-
                                                                                                                             ment represents the execution time of the parallel LOAD commands
Hadoop: There are two ways to load data into Hadoop’s distributed                                                            and the top segment is the reorganization process.
file system: (1) use Hadoop’s command-line file utility to upload                                                                 The most striking feature of the results for the load times in
files stored on the local filesystem into HDFS or (2) create a custom                                                          535MB/node data set shown in Figure 1 is the difference in perfor-
data loader program that writes data using Hadoop’s internal I/O                                                             mance of DBMS-X compared to Hadoop and Vertica. Despite issu-
API. We did not need to alter the input data for our MR programs,                                                            ing the initial LOAD command in the first phase on each node in par-
therefore we loaded the files on each node in parallel directly into                                                          allel, the data was actually loaded on each node sequentially. Thus,
HDFS as plain text using the command-line utility. Storing the data                                                          as the total of amount of data is increased, the load times also in-
in this manner enables MR programs to access data using Hadoop’s                                                             creased proportionately. This also explains why, for the 1TB/cluster
TextInputFormat data format, where the keys are line num-                                                                    data set, the load times for DBMS-X do not decrease as less data
bers in each file and their corresponding values are the contents of                                                          is stored per node. However, the compression and housekeeping on
each line. We found that this approach yielded the best performance                                                          DBMS-X can be done in parallel across nodes, and thus the execu-
in both the loading process and task execution, as opposed to using                                                          tion time of the second phase of the loading process is cut in half
Hadoop’s serialized data formats or compression features.                                                                    when twice as many nodes are used to store the 1TB of data.
                                                                                                                                Without using either block- or record-level compression, Hadoop
DBMS-X: The loading process in DBMS-X occurs in two phases. First, we execute the LOAD SQL command in parallel on each node in the cluster to read data from the local filesystem and insert its contents into a particular table in the database. We specify in this command that the local data is delimited by a special character, and thus we did not need to write a custom program to transform the data before loading it. But because our data generator simply creates random keys for each record on each node, the system must redistribute the tuples to other nodes in the cluster as it reads each record from the input files, based on the target table's partitioning attribute. It would be possible to generate a "hash-aware" version of the data generator that would allow DBMS-X to just load the input files on each node without this redistribution process, but we do not believe that this would improve load times very much.

Once the initial loading phase is complete, we then execute an administrative command to reorganize the data on each node. This process executes in parallel on each node to compress data, build each table's indexes, and perform other housekeeping.

Vertica: Vertica also provides a COPY SQL command that is issued from a single host and then coordinates the loading process on multiple nodes in parallel in the cluster. The user gives the COPY command as input a list of nodes to execute the loading operation for. This process is similar to DBMS-X: on each node the Vertica loader splits the input data files on a delimiter, creates a new tuple for each line in an input file, and redistributes that tuple to a different node.

The most striking feature of the results for the load times in the 535MB/node data set shown in Figure 1 is the difference in performance of DBMS-X compared to Hadoop and Vertica. Despite issuing the initial LOAD command in the first phase on each node in parallel, the data was actually loaded on each node sequentially. Thus, as the total amount of data is increased, the load times also increased proportionately. This also explains why, for the 1TB/cluster data set, the load times for DBMS-X do not decrease as less data is stored per node. However, the compression and housekeeping on DBMS-X can be done in parallel across nodes, and thus the execution time of the second phase of the loading process is cut in half when twice as many nodes are used to store the 1TB of data.

Without using either block- or record-level compression, Hadoop clearly outperforms both DBMS-X and Vertica since each node is simply copying each data file from the local disk into the local HDFS instance and then distributing two replicas to other nodes in the cluster. If we load the data into Hadoop using only a single replica per block, then the load times are reduced by a factor of three. But as we will discuss in Section 5, the lack of multiple replicas often increases the execution times of jobs.

4.2.2 Task Execution

SQL Commands: A pattern search for a particular field is simply the following query in SQL. Neither SQL system contained an index on the field attribute, so this query requires a full table scan.

   SELECT * FROM Data WHERE field LIKE ‘%XYZ%’;

MapReduce Program: The MR program consists of just a Map function that is given a single record already split into the appropriate key/value pair and then performs a sub-string match on the value. If the search pattern is found, the Map function simply outputs the input key/value pair to HDFS. Because no Reduce function is defined, the output generated by each Map instance is the final output of the program.
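The benchmark's own source is not reproduced here, but a minimal sketch of such a Map-only job, written against the newer org.apache.hadoop.mapreduce API rather than the 0.19-era interfaces used in the study (the class name and hard-coded pattern are illustrative), could look as follows:

   import java.io.IOException;

   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Mapper;

   // Map-only Grep: pass a record through unchanged when it contains the pattern.
   public class GrepMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
       private static final String PATTERN = "XYZ";   // search string (illustrative)

       @Override
       protected void map(LongWritable key, Text value, Context context)
               throws IOException, InterruptedException {
           if (value.toString().contains(PATTERN)) {
               context.write(key, value);   // emit the matching key/value pair as-is
           }
       }
   }

Setting job.setNumReduceTasks(0) in the driver suppresses the Reduce phase, so each Map task's output file becomes part of the final result.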

[Figure 4: Grep Task Results – 535MB/node Data Set]
[Figure 5: Grep Task Results – 1TB/cluster Data Set]

Results & Discussion: The performance results for the three systems for this task are shown in Figures 4 and 5. Surprisingly, the relative differences between the systems are not consistent in the
two figures. In Figure 4, the two parallel databases perform about the same, more than a factor of two faster than Hadoop. But in Figure 5, both DBMS-X and Hadoop perform more than a factor of two slower than Vertica. The reason is that the amount of data processing varies substantially between the two experiments. For the results in Figure 4, very little data is being processed (535MB/node). This causes Hadoop's non-insignificant start-up costs to become the limiting factor in its performance. As will be described in Section 5.1.2, for short-running queries (i.e., queries that take less than a minute), Hadoop's start-up costs can dominate the execution time. In our observations, we found that it takes 10–25 seconds before all Map tasks have been started and are running at full speed across the nodes in the cluster. Furthermore, as the total number of allocated Map tasks increases, there is additional overhead required for the central job tracker to coordinate node activities. Hence, this fixed overhead increases slightly as more nodes are added to the cluster; for longer data processing tasks, as shown in Figure 5, this fixed cost is dwarfed by the time to complete the required processing.

The upper segments of each Hadoop bar in the graphs represent the execution time of the additional MR job to combine the output into a single file. Since we ran this as a separate MapReduce job, these segments consume a larger percentage of overall time in Figure 4, as the fixed start-up overhead cost again dominates the work needed to perform the rest of the task. Even though the Grep task is selective, the results in Figure 5 show how this combine phase can still take hundreds of seconds due to the need to open and combine many small output files. Each Map instance produces its output in a separate HDFS file, and thus even though each file is small there are many Map tasks and therefore many files on each node.

For the 1TB/cluster data set experiments, Figure 5 shows that all systems executed the task on twice as many nodes in nearly half the amount of time, as one would expect since the total amount of data was held constant across nodes for this experiment. Hadoop and DBMS-X perform approximately the same, since Hadoop's start-up cost is amortized across the increased amount of data processing for this experiment. However, the results clearly show that Vertica outperforms both DBMS-X and Hadoop. We attribute this to Vertica's aggressive use of data compression (see Section 5.1.3), which becomes more effective as more data is stored per node.

4.3 Analytical Tasks

To explore more complex uses of both types of systems, we developed four tasks related to HTML document processing. We first generate a collection of random HTML documents, similar to that which a web crawler might find. Each node is assigned a set of 600,000 unique HTML documents, each with a unique URL. In each document, we randomly generate links to other pages set using a Zipfian distribution.

We also generated two additional data sets meant to model log files of HTTP server traffic. These data sets consist of values derived from the HTML documents as well as several randomly generated attributes. The schema of these three tables is as follows:

   CREATE TABLE Documents (
     url VARCHAR(100) PRIMARY KEY,
     contents TEXT );

   CREATE TABLE Rankings (
     pageURL VARCHAR(100) PRIMARY KEY,
     pageRank INT,
     avgDuration INT );

   CREATE TABLE UserVisits (
     sourceIP VARCHAR(16),
     destURL VARCHAR(100),
     visitDate DATE,
     adRevenue FLOAT,
     userAgent VARCHAR(64),
     countryCode VARCHAR(3),
     languageCode VARCHAR(6),
     searchWord VARCHAR(32),
     duration INT );

Our data generator created unique files with 155 million UserVisits records (20GB/node) and 18 million Rankings records (1GB/node) on each node. The visitDate, adRevenue, and sourceIP fields are picked uniformly at random from specific ranges. All other fields are picked uniformly from sampling real-world data sets. Each data file is stored on each node as a column-delimited text file.

4.3.1 Data Loading

We now describe the procedures for loading the UserVisits and Rankings data sets. For reasons to be discussed in Section 4.3.5, only Hadoop needs to directly load the Documents files into its internal storage system. DBMS-X and Vertica both execute a UDF that processes the Documents on each node at runtime and loads the data into a temporary table. We account for the overhead of this approach in the benchmark times, rather than in the load times. Therefore, we do not provide results for loading this data set.

Hadoop: Unlike the Grep task's data set, which was uploaded directly into HDFS unaltered, the UserVisits and Rankings data sets needed to be modified so that the first and second columns are separated by a tab delimiter and all other fields in each line are separated by a unique field delimiter. Because there are no schemas in the MR model, in order to access the different attributes at run time, the Map and Reduce functions in each task must manually split the value by the delimiter character into an array of strings.

We wrote a custom data loader executed in parallel on each node to read in each line of the data sets, prepare the data as needed, and then write the tuple into a plain text file in HDFS. Loading the data sets in this manner was roughly three times slower than using the command-line utility, but did not require us to write custom input handlers in Hadoop; the MR programs are able to use Hadoop's KeyValueTextInputFormat interface on the data files to automatically split lines of text files into key/value pairs by the tab delimiter. Again, we found that other data format options, such as SequenceFileInputFormat or custom Writable tuples, resulted in both slower load and execution times.
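As an illustration of this setup (a sketch under the newer Hadoop API, not the authors' driver code; paths and the job name are placeholders), the job configuration below selects KeyValueTextInputFormat so that each input line is split on the first tab into a Text key and a Text value, with the task-specific Mapper and Reducer classes registered in the same place:

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

   public class BenchmarkDriver {
       public static void main(String[] args) throws Exception {
           Job job = Job.getInstance(new Configuration(), "benchmark-task");
           job.setJarByClass(BenchmarkDriver.class);
           // Split each input line on the first tab: the text before the tab becomes
           // the key (e.g., pageURL or sourceIP), the remainder becomes the value.
           job.setInputFormatClass(KeyValueTextInputFormat.class);
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(Text.class);
           // job.setMapperClass(...) and job.setReducerClass(...) are task-specific.
           FileInputFormat.addInputPath(job, new Path(args[0]));
           FileOutputFormat.setOutputPath(job, new Path(args[1]));
           System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
   }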
DBMS-X: We used the same loading procedures for DBMS-X as discussed in Section 4.2. The Rankings table was hash partitioned across the cluster on pageURL and the data on each node was sorted by pageRank. Likewise, the UserVisits table was hash partitioned on destURL and sorted by visitDate on each node.

Vertica: Similar to DBMS-X, Vertica used the same bulk load commands discussed in Section 4.2 and sorted the UserVisits and Rankings tables by the visitDate and pageRank columns, respectively.

Results & Discussion: Since the results of loading the UserVisits and Rankings data sets are similar, we only provide the results for loading the larger UserVisits data in Figure 3. Just as with loading the Grep 535MB/node data set (Figure 1), the loading times for each system increase in proportion to the number of nodes used.

4.3.2 Selection Task

The Selection task is a lightweight filter to find the pageURLs in the Rankings table (1GB/node) with a pageRank above a user-defined threshold. For our experiments, we set this threshold parameter to 10, which yields approximately 36,000 records per data file on each node.

SQL Commands: The DBMSs execute the selection task using the following simple SQL statement:

   SELECT pageURL, pageRank
     FROM Rankings WHERE pageRank > X;

MapReduce Program: The MR program uses only a single Map function that splits the input value based on the field delimiter and outputs the record's pageURL and pageRank as a new key/value pair if its pageRank is above the threshold. This task does not require a Reduce function, since each pageURL in the Rankings data set is unique across all nodes.
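A minimal sketch of such a filtering Map function is shown below, assuming the Rankings files are read through KeyValueTextInputFormat with pageURL as the key and that pageRank is the first field of the delimited value; the delimiter and field position are assumptions, not taken from the paper:

   import java.io.IOException;

   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Mapper;

   // Selection: emit (pageURL, pageRank) only when the rank exceeds the threshold.
   public class SelectionMapper extends Mapper<Text, Text, Text, IntWritable> {
       private static final int THRESHOLD = 10;          // cut-off used in the experiments
       private static final String DELIMITER = "\\|";    // field delimiter (assumed)

       @Override
       protected void map(Text pageURL, Text value, Context context)
               throws IOException, InterruptedException {
           String[] fields = value.toString().split(DELIMITER);
           int pageRank = Integer.parseInt(fields[0]);   // pageRank assumed to be the first field
           if (pageRank > THRESHOLD) {
               context.write(pageURL, new IntWritable(pageRank));
           }
       }
   }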
[Figure 6: Selection Task Results]

Results & Discussion: As was discussed in the Grep task, the results from this experiment, shown in Figure 6, demonstrate again that the parallel DBMSs outperform Hadoop by a rather significant factor across all cluster scaling levels. Although the relative performance of all systems degrades as both the number of nodes and the total amount of data increase, Hadoop is most affected. For example, there is almost a 50% difference in the execution time between the 1 node and 10 node experiments. This is again due to Hadoop's increased start-up costs as more nodes are added to the cluster, which take up a proportionately larger fraction of total query time for short-running queries.

Another important reason for why the parallel DBMSs are able to outperform Hadoop is that both Vertica and DBMS-X use an index on the pageRank column and store the Rankings table already sorted by pageRank. Thus, executing this query is trivial. It should also be noted that although Vertica's absolute times remain low, its relative performance degrades as the number of nodes increases. This is in spite of the fact that each node still executes the query in the same amount of time (about 170ms). But because the nodes finish executing the query so quickly, the system becomes flooded with control messages from too many nodes, which then takes a longer time for the system to process. Vertica uses a reliable message layer for query dissemination and commit protocol processing [4], which we believe has considerable overhead when more than a few dozen nodes are involved in the query.

4.3.3 Aggregation Task

Our next task requires each system to calculate the total adRevenue generated for each sourceIP in the UserVisits table (20GB/node), grouped by the sourceIP column. We also ran a variant of this query where we grouped by the seven-character prefix of the sourceIP column to measure the effect of reducing the total number of groups on query performance. We designed this task to measure the performance of parallel analytics on a single read-only table, where nodes need to exchange intermediate data with one another in order to compute the final value. Regardless of the number of nodes in the cluster, this task always produces 2.5 million records (53 MB); the variant query produces 2,000 records (24KB).

SQL Commands: The SQL command to calculate the total adRevenue is straightforward:

   SELECT sourceIP, SUM(adRevenue)
     FROM UserVisits GROUP BY sourceIP;

The variant query is:

   SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
     FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);

MapReduce Program: Unlike the previous tasks, the MR program for this task consists of both a Map and Reduce function. The Map function first splits the input value by the field delimiter, and then outputs the sourceIP field (given as the input key) and the adRevenue field as a new key/value pair. For the variant query, only the first seven characters (representing the first two octets, each stored as three digits) of the sourceIP are used. These two Map functions share the same Reduce function that simply adds together all of the adRevenue values for each sourceIP and then outputs the prefix and revenue total. We also used MR's Combine feature to perform the pre-aggregation before data is transmitted to the Reduce instances, improving the first query's execution time by a factor of two [8].
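A compact sketch of this Map/Reduce pair is shown below, assuming KeyValueTextInputFormat with sourceIP as the key; the field delimiter and the position of adRevenue within the value are assumptions. Registering SumReducer as the Combiner as well (job.setCombinerClass(SumReducer.class)) gives the pre-aggregation described above:

   import java.io.IOException;

   import org.apache.hadoop.io.FloatWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Mapper;
   import org.apache.hadoop.mapreduce.Reducer;

   public class RevenueAggregation {

       // Map: emit (sourceIP, adRevenue) for every UserVisits record.
       public static class AggMapper extends Mapper<Text, Text, Text, FloatWritable> {
           private static final String DELIMITER = "\\|";   // field delimiter (assumed)

           @Override
           protected void map(Text sourceIP, Text value, Context context)
                   throws IOException, InterruptedException {
               String[] fields = value.toString().split(DELIMITER);
               float adRevenue = Float.parseFloat(fields[2]);   // adRevenue position is assumed
               context.write(sourceIP, new FloatWritable(adRevenue));
           }
       }

       // Reduce (and Combine): sum the revenue values for each sourceIP.
       public static class SumReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {
           @Override
           protected void reduce(Text sourceIP, Iterable<FloatWritable> values, Context context)
                   throws IOException, InterruptedException {
               float total = 0f;
               for (FloatWritable v : values) {
                   total += v.get();
               }
               context.write(sourceIP, new FloatWritable(total));
           }
       }
   }

For the variant query, the Map function would instead emit the seven-character prefix, new Text(sourceIP.toString().substring(0, 7)), as the key; the Reduce function is unchanged.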
[Figure 7: Aggregation Task Results (2.5 million Groups)]
[Figure 8: Aggregation Task Results (2,000 Groups)]

Results & Discussion: The results of the aggregation task experiment in Figures 7 and 8 show once again that the two DBMSs outperform Hadoop. The DBMSs execute these queries by having each node scan its local table, extract the sourceIP and adRevenue fields, and perform a local group by. These local groups are then merged at
the query coordinator, which outputs results to the user. The results in Figure 7 illustrate that the two DBMSs perform about the same for a large number of groups, as their runtime is dominated by the cost to transmit the large number of local groups and merge them at the coordinator. For the experiments using fewer nodes, Vertica performs somewhat better, since it has to read less data (it can directly access the sourceIP and adRevenue columns), but it becomes slightly slower as more nodes are used.

Based on the results in Figure 8, it is more advantageous to use a column-store system when processing fewer groups for this task. This is because the two columns accessed (sourceIP and adRevenue) consist of only 20 bytes out of the more than 200 bytes per UserVisits tuple, and therefore there are relatively few groups that need to be merged, so communication costs are much lower than in the non-variant plan. Vertica is thus able to outperform the other two systems by not reading unused parts of the UserVisits tuples.

Note that the execution times for all systems are roughly consistent for any number of nodes (modulo Vertica's slight slowdown as the number of nodes increases). Since this benchmark task requires the system to scan through the entire data set, the run time is always bounded by the constant sequential scan performance and network repartitioning costs for each node.

4.3.4 Join Task

The join task consists of two sub-tasks that perform a complex calculation on two data sets. In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range. Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval. We use the week of January 15-22, 2000 in our experiments, which matches approximately 134,000 records in the UserVisits table.

The salient aspect of this task is that it must consume two different data sets and join them together in order to find pairs of Rankings and UserVisits records with matching values for pageURL and destURL. This task stresses each system using fairly complex operations over a large amount of data. The performance results are also a good indication of how well the DBMS's query optimizer produces efficient join plans.

SQL Commands: In contrast to the complexity of the MR program described below, the DBMSs need only two fairly simple queries to complete the task. The first statement creates a temporary table and uses it to store the output of the SELECT statement that performs the join of UserVisits and Rankings and computes the aggregates. Once this table is populated, it is then trivial to use a second query to output the record with the largest totalRevenue field.

   SELECT INTO Temp sourceIP,
          AVG(pageRank) as avgPageRank,
          SUM(adRevenue) as totalRevenue
     FROM Rankings AS R, UserVisits AS UV
    WHERE R.pageURL = UV.destURL
      AND UV.visitDate BETWEEN Date(‘2000-01-15’)
                           AND Date(‘2000-01-22’)
    GROUP BY UV.sourceIP;

   SELECT sourceIP, totalRevenue, avgPageRank
     FROM Temp
    ORDER BY totalRevenue DESC LIMIT 1;

MapReduce Program: Because the MR model does not have an inherent ability to join two or more disparate data sets, the MR program that implements the join task must be broken out into three separate phases. Each of these phases is implemented as a single MR program in Hadoop, but a phase does not begin executing until the previous phase is complete.

Phase 1 – The first phase filters UserVisits records that are outside the desired date range and then joins the qualifying records with records from the Rankings file. The MR program is initially given all of the UserVisits and Rankings data files as input.

Map Function: For each key/value input pair, we determine its record type by counting the number of fields produced when splitting the value on the delimiter. If it is a UserVisits record, we apply the filter based on the date range predicate. These qualifying records are emitted with composite keys of the form (destURL, K1), where K1 indicates that it is a UserVisits record. All Rankings records are emitted with composite keys of the form (pageURL, K2), where K2 indicates that it is a Rankings record. These output records are repartitioned using a user-supplied partitioning function that only hashes on the URL portion of the composite key.

Reduce Function: The input to the Reduce function is a single sorted run of records in URL order. For each URL, we divide its values into two sets based on the tag component of the composite key. The function then forms the cross product of the two sets to complete the join and outputs a new key/value pair with the sourceIP as the key and the tuple (pageURL, pageRank, adRevenue) as the value.
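Assuming the composite key is encoded as a single Text of the form "url\tTAG" (an assumption; the paper does not give its encoding), a user-supplied Partitioner that hashes only the URL prefix, as described above, might look like this:

   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Partitioner;

   // Route map output by the URL portion of the composite key "url\tTAG" so that
   // tagged UserVisits and Rankings records for the same URL meet at one reducer.
   public class UrlPartitioner extends Partitioner<Text, Text> {
       @Override
       public int getPartition(Text compositeKey, Text value, int numPartitions) {
           String url = compositeKey.toString().split("\t", 2)[0];   // drop the record tag
           return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
       }
   }

The driver would install this with job.setPartitionerClass(UrlPartitioner.class); a complete implementation would also set a grouping comparator on the URL prefix so that both tags for a given URL arrive in a single reduce() call, the usual secondary-sort pattern.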
Phase 2 – The next phase computes the total adRevenue and average pageRank based on the sourceIP of records generated in Phase 1. This phase uses a Reduce function in order to gather all of the
records for a particular sourceIP on a single node. We use the identity Map function in the Hadoop API to supply records directly to the split process [1, 8].

Reduce Function: For each sourceIP, the function adds up the adRevenue and computes the average pageRank, retaining the one with the maximum total ad revenue. Each Reduce instance outputs a single record with sourceIP as the key and the value as a tuple of the form (avgPageRank, totalRevenue).

Phase 3 – In the final phase, we again only need to define a single Reduce function that uses the output from the previous phase to produce the record with the largest total adRevenue. We only execute one instance of the Reduce function on a single node to scan all the records from Phase 2 and find the target record.

Reduce Function: The function processes each key/value pair and keeps track of the record with the largest totalRevenue field. Because the Hadoop API does not easily expose the total number of records that a Reduce instance will process, there is no way for the Reduce function to know that it is processing the last record. Therefore, we override the closing callback method in our Reduce implementation so that the MR program outputs the largest record right before it exits.
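A sketch of this pattern is shown below; the closing callback the paper overrides corresponds to close() in the older mapred interface, and cleanup() plays the same role in the newer API. The value encoding is an assumption:

   import java.io.IOException;

   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Reducer;

   // Phase 3: remember the record with the largest totalRevenue and emit it only
   // after the last reduce() call, from the cleanup() hook.
   public class MaxRevenueReducer extends Reducer<Text, Text, Text, Text> {
       private String bestSourceIP = null;
       private String bestStats = null;
       private double bestRevenue = Double.NEGATIVE_INFINITY;

       @Override
       protected void reduce(Text sourceIP, Iterable<Text> values, Context context) {
           for (Text v : values) {
               // value assumed to be encoded as "avgPageRank,totalRevenue"
               double revenue = Double.parseDouble(v.toString().split(",")[1]);
               if (revenue > bestRevenue) {
                   bestRevenue = revenue;
                   bestSourceIP = sourceIP.toString();
                   bestStats = v.toString();
               }
           }
       }

       @Override
       protected void cleanup(Context context) throws IOException, InterruptedException {
           if (bestSourceIP != null) {
               context.write(new Text(bestSourceIP), new Text(bestStats));   // the single winner
           }
       }
   }

Running this with a single Reduce task (job.setNumReduceTasks(1)) matches the single-node scan of the Phase 2 output described above.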
[Figure 9: Join Task Results]
[Figure 10: UDF Aggregation Task Results]

Results & Discussion: The performance results for this task are displayed in Figure 9. We had to slightly change the SQL used in the 100 node experiments for Vertica due to an optimizer bug in the system, which is why there is an increase in the execution time for Vertica going from 50 to 100 nodes. But even with this increase, it is clear that this task results in the biggest performance difference between Hadoop and the parallel database systems. The reason for this disparity is two-fold.
First, despite the increased complexity of the query, the performance of Hadoop is yet again limited by the speed with which the large UserVisits table (20GB/node) can be read off disk. The MR program has to perform a complete table scan, while the parallel database systems were able to take advantage of clustered indexes on UserVisits.visitDate to significantly reduce the amount of data that needed to be read. When breaking down the costs of the different parts of the Hadoop query, we found that regardless of the number of nodes in the cluster, phase 2 and phase 3 took on average 24.3 seconds and 12.7 seconds, respectively. In contrast, phase 1, which contains the Map task that reads in the UserVisits and Rankings tables, takes an average of 1434.7 seconds to complete. Interestingly, it takes approximately 600 seconds of raw I/O to read the UserVisits and Rankings tables off of disk and then another 300 seconds to split, parse, and deserialize the various attributes. Thus, the CPU overhead needed to parse these tables on the fly is the limiting factor for Hadoop.
Second, the parallel DBMSs are able to take advantage of the fact that both the UserVisits and the Rankings tables are partitioned by the join key. This means that both systems are able to do the join locally on each node, without any network overhead of repartitioning before the join. Thus, they simply have to do a local hash join between the Rankings table and a selective part of the UserVisits table on each node, with a trivial ORDER BY clause across nodes.

4.3.5 UDF Aggregation Task

The final task is to compute the inlink count for each document in the dataset, a task that is often used as a component of PageRank calculations. Specifically, for this task, the systems must read each document file and search for all the URLs that appear in the contents. The systems must then, for each unique URL, count the number of unique pages that reference that particular URL across the entire set of files. It is this type of task that MR is believed to be commonly used for.

We make two adjustments for this task in order to make processing easier in Hadoop. First, we allow the aggregate to include self-references, as it is non-trivial for a Map function to discover the name of the input file it is processing. Second, on each node we concatenate the HTML documents into larger files when storing them in HDFS. We found this improved Hadoop's performance by a factor of two and helped avoid memory issues with the central HDFS master when a large number of files are stored in the system.
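As a rough illustration of the MR side of this task (a sketch, not the authors' program), the Map function can scan each document's contents for links and emit (URL, 1), while the Reduce function sums the counts per URL; the href pattern is an assumption, and de-duplicating multiple links from the same source page is omitted for brevity:

   import java.io.IOException;
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Mapper;
   import org.apache.hadoop.mapreduce.Reducer;

   public class InlinkCount {

       // Map: emit (URL, 1) for every link found in the document contents.
       public static class LinkMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
           private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");
           private static final LongWritable ONE = new LongWritable(1);

           @Override
           protected void map(LongWritable offset, Text contents, Context context)
                   throws IOException, InterruptedException {
               Matcher m = HREF.matcher(contents.toString());
               while (m.find()) {
                   context.write(new Text(m.group(1)), ONE);   // one count per reference found
               }
           }
       }

       // Reduce (also usable as a Combiner): sum the counts for each URL.
       public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
           @Override
           protected void reduce(Text url, Iterable<LongWritable> counts, Context context)
                   throws IOException, InterruptedException {
               long total = 0;
               for (LongWritable c : counts) {
                   total += c.get();
               }
               context.write(url, new LongWritable(total));
           }
       }
   }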
SQL Commands: To perform this task in a parallel DBMS requires a user-defined function F that parses the contents of each record in the Documents table and emits URLs into the database. This function can be written in a general-purpose language and is effectively identical to the Map program discussed below. With this function F, we populate a temporary table with a list of URLs and can then execute a simple query to calculate the inlink count:

   SELECT INTO Temp F(contents) FROM Documents;

   SELECT url, SUM(value) FROM Temp GROUP BY url;

Despite the simplicity of this proposed UDF, we found that in practice it was difficult to implement in the DBMSs.

For DBMS-X, we translated the MR program used in Hadoop into an equivalent C program that uses the POSIX regular expression library to search for links in the document. For each URL found in the document contents, the UDF returns a new tuple (URL, 1).
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 

Kürzlich hochgeladen

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Kürzlich hochgeladen (20)

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

benchmarks-sigmod09

  • 1. A Comparison of Approaches to Large-Scale Data Analysis Andrew Pavlo Erik Paulson Alexander Rasin Brown University University of Wisconsin Brown University pavlo@cs.brown.edu epaulson@cs.wisc.edu alexr@cs.brown.edu Daniel J. Abadi David J. DeWitt Samuel Madden Michael Stonebraker Yale University Microsoft Inc. M.I.T. CSAIL M.I.T. CSAIL dna@cs.yale.edu dewitt@microsoft.com madden@csail.mit.edu stonebraker@csail.mit.edu ABSTRACT model through which users can express relatively sophisticated dis- tributed programs, leading to significant interest in the educational There is currently considerable enthusiasm around the MapReduce community. For example, IBM and Google have announced plans (MR) paradigm for large-scale data analysis [17]. Although the to make a 1000 processor MapReduce cluster available to teach stu- basic control flow of this framework has existed in parallel SQL dents distributed programming. database management systems (DBMS) for over 20 years, some Given this interest in MapReduce, it is natural to ask “Why not have called MR a dramatically new computing model [8, 17]. In use a parallel DBMS instead?” Parallel database systems (which this paper, we describe and compare both paradigms. Furthermore, all share a common architectural design) have been commercially we evaluate both kinds of systems in terms of performance and de- available for nearly two decades, and there are now about a dozen in velopment complexity. To this end, we define a benchmark con- the marketplace, including Teradata, Aster Data, Netezza, DATAl- sisting of a collection of tasks that we have run on an open source legro (and therefore soon Microsoft SQL Server via Project Madi- version of MR as well as on two parallel DBMSs. For each task, son), Dataupia, Vertica, ParAccel, Neoview, Greenplum, DB2 (via we measure each system’s performance for various degrees of par- the Database Partitioning Feature), and Oracle (via Exadata). They allelism on a cluster of 100 nodes. Our results reveal some inter- are robust, high performance computing platforms. Like MapRe- esting trade-offs. Although the process to load data into and tune duce, they provide a high-level programming environment and par- the execution of parallel DBMSs took much longer than the MR allelize readily. Though it may seem that MR and parallel databases system, the observed performance of these DBMSs was strikingly target different audiences, it is in fact possible to write almost any better. We speculate about the causes of the dramatic performance parallel processing task as either a set of database queries (possibly difference and consider implementation concepts that future sys- using user defined functions and aggregates to filter and combine tems should take from both kinds of architectures. data) or a set of MapReduce jobs. Inspired by this question, our goal is to understand the differences between the MapReduce approach Categories and Subject Descriptors to performing large-scale data analysis and the approach taken by parallel database systems. The two classes of systems make differ- H.2.4 [Database Management]: Systems—Parallel databases ent choices in several key areas. For example, all DBMSs require that data conform to a well-defined schema, whereas MR permits General Terms data to be in any arbitrary format. 
Other differences also include how each system provides indexing and compression optimizations, Database Applications, Use Cases, Database Programming programming models, the way in which data is distributed, and query execution strategies. 1. INTRODUCTION The purpose of this paper is to consider these choices, and the Recently the trade press has been filled with news of the rev- trade-offs that they entail. We begin in Section 2 with a brief review olution of “cluster computing”. This paradigm entails harnessing of the two alternative classes of systems, followed by a discussion large numbers of (low-end) processors working in parallel to solve in Section 3 of the architectural trade-offs. Then, in Section 4 we a computing problem. In effect, this suggests constructing a data present our benchmark consisting of a variety of tasks, one taken center by lining up a large number of low-end servers instead of from the MR paper [8], and the rest a collection of more demanding deploying a smaller set of high-end servers. With this rise of in- tasks. In addition, we present the results of running the benchmark terest in clusters has come a proliferation of tools for programming on a 100-node cluster to execute each task. We tested the publicly them. One of the earliest and best known such tools in MapReduce available open-source version of MapReduce, Hadoop [1], against (MR) [8]. MapReduce is attractive because it provides a simple two parallel SQL DBMSs, Vertica [3] and a second system from a major relational vendor. We also present results on the time each system took to load the test data and report informally on the pro- cedures needed to set up and tune the software for each task. Permission to make digital or hard copies of all or part of this work for In general, the SQL DBMSs were significantly faster and re- personal or classroom use is granted without fee provided that copies are quired less code to implement each task, but took longer to tune and not made or distributed for profit or commercial advantage and that copies load the data. Hence, we conclude with a discussion on the reasons bear this notice and the full citation on the first page. To copy otherwise, to for the differences between the approaches and provide suggestions republish, to post on servers or to redistribute to lists, requires prior specific on the best practices for any large-scale data analysis engine. permission and/or a fee. SIGMOD’09, June 29–July 2, 2009, Providence, Rhode Island, USA. Some readers may feel that experiments conducted using 100 Copyright 2009 ACM 978-1-60558-551-2/09/06 ...$5.00.
  • 2. nodes are not interesting or representative of real world data pro- instance, regardless of which Map instance produced the data. Each cessing systems. We disagree with this conjecture on two points. Reduce processes or combines the records assigned to it in some First, as we demonstrate in Section 4, at 100 nodes the two parallel way, and then writes records to an output file (in the distributed file DBMSs range from a factor of 3.1 to 6.5 faster than MapReduce system), which forms part of the computation’s final output. on a variety of analytic tasks. While MR may indeed be capable The input data set exists as a collection of one or more partitions of scaling up to 1000s of nodes, the superior efficiency of mod- in the distributed file system. It is the job of the MR scheduler to ern DBMSs alleviates the need to use such massive hardware on decide how many Map instances to run and how to allocate them datasets in the range of 1–2PB (1000 nodes with 2TB of disk/node to available nodes. Likewise, the scheduler must also decide on has a total disk capacity of 2PB). For example, eBay’s Teradata con- the number and location of nodes running Reduce instances. The figuration uses just 72 nodes (two quad-core CPUs, 32GB RAM, MR central controller is responsible for coordinating the system 104 300GB disks per node) to manage approximately 2.4PB of re- activities on each node. A MR program finishes execution once the lational data. As another example, Fox Interactive Media’s ware- final result is written as new files in the distributed file system. house is implemented using a 40-node Greenplum DBMS. Each 2.2 Parallel DBMSs node is a Sun X4500 machine with two dual-core CPUs, 48 500GB Database systems capable of running on clusters of shared noth- disks, and 16 GB RAM (1PB total disk space) [7]. Since few data ing nodes have existed since the late 1980s. These systems all sup- sets in the world even approach a petabyte in size, it is not at all port standard relational tables and SQL, and thus the fact that the clear how many MR users really need 1,000 nodes. data is stored on multiple machines is transparent to the end-user. Many of these systems build on the pioneering research from the 2. TWO APPROACHES TO LARGE SCALE Gamma [10] and Grace [11] parallel DBMS projects. The two key DATA ANALYSIS aspects that enable parallel execution are that (1) most (or even all) The two classes of systems we consider in this paper run on a tables are partitioned over the nodes in a cluster and that (2) the sys- “shared nothing” collection of computers [19]. That is, the sys- tem uses an optimizer that translates SQL commands into a query tem is deployed on a collection of independent machines, each with plan whose execution is divided amongst multiple nodes. Because local disk and local main memory, connected together on a high- programmers only need to specify their goal in a high level lan- speed local area network. Both systems achieve parallelism by guage, they are not burdened by the underlying storage details, such dividing any data set to be utilized into partitions, which are al- as indexing options and join strategies. located to different nodes to facilitate parallel processing. In this Consider a SQL command to filter the records in a table T1 based section, we provide an overview of how both the MR model and on a predicate, along with a join to a second table T2 with an aggre- traditional parallel DBMSs operate in this environment. gate computed on the result of the join. 
A basic sketch of how this command is processed in a parallel DBMS consists of three phases. 2.1 MapReduce Since the database will have already stored T1 on some collection One of the attractive qualities about the MapReduce program- of the nodes partitioned on some attribute, the filter sub-query is ming model is its simplicity: an MR program consists only of two first performed in parallel at these sites similar to the filtering per- functions, called Map and Reduce, that are written by a user to formed in a Map function. Following this step, one of two common process key/value data pairs. The input data set is stored in a col- parallel join algorithms are employed based on the size of data ta- lection of partitions in a distributed file system deployed on each bles. For example, if the number of records in T2 is small, then the node in the cluster. The program is then injected into a distributed DBMS could replicate it on all nodes when the data is first loaded. processing framework and executed in a manner to be described. This allows the join to execute in parallel at all nodes. Following The Map function reads a set of “records” from an input file, this, each node then computes the aggregate using its portion of the does any desired filtering and/or transformations, and then outputs answer to the join. A final “roll-up” step is required to compute the a set of intermediate records in the form of new key/value pairs. As final answer from these partial aggregates [9]. the Map function produces these output records, a “split” function If the size of the data in T2 is large, then T2 ’s contents will be partitions the records into R disjoint buckets by applying a function distributed across multiple nodes. If these tables are partitioned on to the key of each output record. This split function is typically a different attributes than those used in the join, the system will have hash function, though any deterministic function will suffice. Each to hash both T2 and the filtered version of T1 on the join attribute us- map bucket is written to the processing node’s local disk. The Map ing a common hash function. The redistribution of both T2 and the function terminates having produced R output files, one for each filtered version of T1 to the nodes is similar to the processing that bucket. In general, there are multiple instances of the Map function occurs between the Map and the Reduce functions. Once each node running on different nodes of a compute cluster. We use the term has the necessary data, it then performs a hash join and calculates instance to mean a unique running invocation of either the Map or the preliminary aggregate function. Again, a roll-up computation Reduce function. Each Map instance is assigned a distinct portion must be performed as a last step to produce the final answer. of the input file by the MR scheduler to process. If there are M At first glance, these two approaches to data analysis and pro- such distinct portions of the input file, then there are R files on disk cessing have many common elements; however, there are notable storage for each of the M Map tasks, for a total of M × R files; differences that we consider in the next section. Fij , 1 ≤ i ≤ M, 1 ≤ j ≤ R. The key observation is that all Map 3. ARCHITECTURAL ELEMENTS instances use the same hash function; thus, all output records with the same hash value are stored in the same output file. 
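To make the split step concrete, here is a minimal sketch of such a hash-based split function. It is an illustration of the behavior just described, not code from the systems under test; Hadoop's default partitioner computes its bucket in essentially this way, but any deterministic function of the key would do.
// Minimal sketch of the "split" function described above: every Map instance
// applies the same deterministic hash to an output record's key, so records
// with equal keys always land in the same one of the R buckets and are
// therefore consumed by the same Reduce instance.
public class HashSplit {
  private final int numBuckets;   // R in the paper's notation

  public HashSplit(int numBuckets) {
    this.numBuckets = numBuckets;
  }

  // Returns the bucket index (0..R-1) for a given output key.
  public int bucketFor(String key) {
    // Mask off the sign bit so the modulus is always non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }
}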
In this section, we consider aspects of the two system architec- The second phase of a MR program executes R instances of the tures that are necessary for processing large amounts of data in a Reduce program, where R is typically the number of nodes. The distributed environment. One theme in our discussion is that the na- input for each Reduce instance Rj consists of the files Fij , 1 ≤ ture of the MR model is well suited for development environments i ≤ M . These files are transferred over the network from the Map with a small number of programmers and a limited application do- nodes’ local disks. Note that again all output records from the Map main. This lack of constraints, however, may not be appropriate for phase with the same hash value are consumed by the same Reduce longer-term and larger-sized projects.
  • 3. 3.1 Schema Support this is an acceptable strategy if the indexes do not need to be shared between multiple programmers, despite requiring every MR pro- Parallel DBMSs require data to fit into the relational paradigm grammer re-implement the same basic functionality. of rows and columns. In contrast, the MR model does not require If sharing is needed, however, then the specifications of what in- that data files adhere to a schema defined using the relational data dexes are present and how to use them must be transferred between model. That is, the MR programmer is free to structure their data in programmers. It is again preferable to store this index information any manner or even to have no structure at all. in a standard format in the system catalogs, so that programmers One might think that the absence of a rigid schema automati- can query this structure to discover such knowledge. cally makes MR the preferable option. For example, SQL is often criticized for its requirement that the programmer must specify the 3.3 Programming Model “shape” of the data in a data definition facility. On the other hand, During the 1970s, the database research community engaged in a the MR programmer must often write a custom parser in order to contentious debate between the relational advocates and the Coda- derive the appropriate semantics for their input records, which is at syl advocates [18]. The salient issue of this discussion was whether least an equivalent amount of work. But there are also other poten- a program to access data in a DBMS should be written either by: tial problems with not using a schema for large data sets. Whatever structure exists in MR input files must be built into 1. Stating what you want – rather than presenting an algorithm the Map and Reduce programs. Existing MR implementations pro- for how to get it (Relational) vide built-in functionality to handle simple key/value pair formats, but the programmer must explicitly write support for more com- 2. Presenting an algorithm for data access (Codasyl) plex data structures, such as compound keys. This is possibly an acceptable approach if a MR data set is not accessed by multiple In the end, the former view prevailed and the last 30 years is applications. If such data sharing exists, however, a second pro- a testament to the value of relational database systems. Programs grammer must decipher the code written by the first programmer to in high-level languages, such as SQL, are easier to write, easier decide how to process the input file. A better approach, followed to modify, and easier for a new person to understand. Codasyl by all SQL DBMSs, is to separate the schema from the application was criticized for being “the assembly language of DBMS access”. and store it in a set of system catalogs that can be queried. We argue that MR programming is somewhat analogous to Codasyl But even if the schema is separated from the application and programming: one is forced to write algorithms in a low-level lan- made available to multiple MR programs through a description fa- guage in order to perform record-level manipulation. On the other cility, the developers must also agree on a single schema. This ob- hand, to many people brought up programming in procedural lan- viously requires some commitment to a data model or models, and guages, such as C/C++ or Java, describing tasks in a declarative the input files must obey this commitment as it is cumbersome to language like SQL can be challenging. 
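As a concrete illustration of what writing "support for more complex data structures, such as compound keys" means in practice, a hand-rolled Hadoop key type might look like the sketch below. The field names (siteId, docId) echo the document-processing example used later in this section and are purely illustrative; nothing like this class appears in the paper itself.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Illustrative compound key that an MR programmer must write (and document)
// by hand, since the framework ships only simple key/value types and has no
// catalog describing record structure.
public class SiteDocKey implements WritableComparable<SiteDocKey> {
  private String siteId;
  private String docId;

  public SiteDocKey() { }                        // no-arg constructor required by Hadoop

  public SiteDocKey(String siteId, String docId) {
    this.siteId = siteId;
    this.docId = docId;
  }

  public void write(DataOutput out) throws IOException {
    out.writeUTF(siteId);
    out.writeUTF(docId);
  }

  public void readFields(DataInput in) throws IOException {
    siteId = in.readUTF();
    docId = in.readUTF();
  }

  public int compareTo(SiteDocKey other) {
    int c = siteId.compareTo(other.siteId);
    return c != 0 ? c : docId.compareTo(other.docId);
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof SiteDocKey)) return false;
    SiteDocKey k = (SiteDocKey) o;
    return siteId.equals(k.siteId) && docId.equals(k.docId);
  }

  @Override
  public int hashCode() {
    // Keep hashing consistent so a hash partitioner routes equal keys together.
    return 31 * siteId.hashCode() + docId.hashCode();
  }
}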
modify data attributes once the files are created. Anecdotal evidence from the MR community suggests that there Once the programmers agree on the structure of data, something is widespread sharing of MR code fragments to do common tasks, or someone must ensure that any data added or modified does not such as joining data sets. To alleviate the burden of having to re- violate integrity or other high-level constraints (e.g., employee salaries implement repetitive tasks, the MR community is migrating high- must be non negative). Such conditions must be known and explic- level languages on top of the current interface to move such func- itly adhered to by all programmers modifying a particular data set; tionality into the run time. Pig [15] and Hive [2] are two notable a MR framework and its underlying distributed storage system has projects in this direction. no knowledge of these rules, and thus allows input data to be easily corrupted with bad data. By again separating such constraints from 3.4 Data Distribution the application and enforcing them automatically by the run time The conventional wisdom for large-scale databases is to always system, as is done by all SQL DBMSs, the integrity of the data is send the computation to the data, rather than the other way around. enforced without additional work on the programmer’s behalf. In other words, one should send a small program over the network In summary, when no sharing is anticipated, the MR paradigm is to a node, rather than importing a large amount of data from the quite flexible. If sharing is needed, however, then we argue that it is node. Parallel DBMSs use knowledge of data distribution and loca- advantageous for the programmer to use a data description language tion to their advantage: a parallel query optimizer strives to balance and factor schema definitions and integrity constraints out of appli- computational workloads while minimizing the amount data trans- cation programs. This information should be installed in common mitted over the network connecting the nodes of the cluster. system catalogs accessible to the appropriate users and applications. Aside from the initial decision on where to schedule Map in- stances, a MR programmer must perform these tasks manually. For 3.2 Indexing example, suppose a user writes a MR program to process a collec- All modern DBMSs use hash or B-tree indexes to accelerate ac- tion of documents in two parts. First, the Map function scans the cess to data. If one is looking for a subset of records (e.g., em- documents and creates a histogram of frequently occurring words. ployees with a salary greater than $100,000), then using a proper The documents are then passed to a Reduce function that groups index reduces the scope of the search dramatically. Most database files by their site of origin. Using this data, the user, or another systems also support multiple indexes per table. Thus, the query user building on the first user’s work, now wants to find sites with optimizer can decide which index to use for each query or whether a document that contains more than five occurrences of the word to simply perform a brute-force sequential search. ‘Google’ or the word ‘IBM’. In the naive implementation of this Because the MR model is so simple, MR frameworks do not pro- query, where the Map is executed over the accumulated statistics, vide built-in indexes. 
The programmer must implement any indexes the filtration is done after the statistics for all documents are com- that they may desire to speed up access to the data inside of their puted and shipped to reduce workers, even though only a small sub- application. This is not easily accomplished, as the framework’s set of documents satisfy the keyword filter. data fetching mechanisms must also be instrumented to use these In contrast, the following SQL view and select queries perform a indexes when pushing data to running Map instances. Once more, similar computation:
  • 4. materialize intermediate results to files each step of the way. This CREATE VIEW Keywords AS SELECT siteid, docid, word, COUNT(*) AS wordcount differs from parallel DBMSs, which have larger granules of work FROM Documents (i.e., transactions) that are restarted in the event of a failure. Part of GROUP BY siteid, docid, word; the reason for this approach is that DBMSs avoid saving interme- SELECT DISTINCT siteid FROM Keywords diate results to disk whenever possible. Thus, if a single node fails WHERE (word = ‘IBM’ OR word = ‘Google’) AND wordcount > 5; during a long running query in a DBMS, the entire query must be completely restarted. A modern DBMS would rewrite the second query such that the view definition is substituted for the Keywords table in the FROM 4. PERFORMANCE BENCHMARKS clause. Then, the optimizer can push the WHERE clause in the query down so that it is applied to the Documents table before the COUNT In this section, we present our benchmark consisting of five tasks is computed, substantially reducing computation. If the documents that we use to compare the performance of the MR model with that are spread across multiple nodes, then this filter can be applied on of parallel DBMSs. The first task is taken directly from the origi- each node before documents belonging to the same site are grouped nal MapReduce paper [8] that the authors’ claim is representative of together, generating much less network I/O. common MR tasks. Because this task is quite simple, we also devel- oped four additional tasks, comprised of more complex analytical 3.5 Execution Strategy workloads designed to explore the trade-offs discussed in the pre- There is a potentially serious performance problem related to vious section. We executed our benchmarks on a well-known MR MR’s handling of data transfer between Map and Reduce jobs. Re- implementation and two parallel DBMSs. call that each of the N Map instances produces M output files, 4.1 Benchmark Environment each destined for a different Reduce instance. These files are writ- ten to the local disk on the node executing each particular Map in- As we describe the details of our benchmark environment, we stance. If N is 1000 and M is 500, the Map phase of the program note how the different data analysis systems that we test differ in produces 500,000 local files. When the Reduce phase starts, each operating assumptions and discuss the ways in which we dealt with of the 500 Reduce instances needs to read its 1000 input files and them in order to make the experiments uniform. must use a file-transfer protocol to “pull” each of its input files from 4.1.1 Tested Systems the nodes on which the Map instances were run. With 100s of Re- duce instances running simultaneously, it is inevitable that two or Hadoop: The Hadoop system is the most popular open-source im- more Reduce instances will attempt to read their input files from plementation of the MapReduce framework, under development the same map node simultaneously, inducing large numbers of disk by Yahoo! and the Apache Software Foundation [1]. Unlike the seeks and slowing the effective disk transfer rate. This is why par- Google implementation of the original MR framework written in allel database systems do not materialize their split files and instead C++, the core Hadoop system is written entirely in Java. For our use a push approach to transfer data instead of a pull. experiments in this paper, we use Hadoop version 0.19.0 running on Java 1.6.0. 
We deployed the system with the default configura- 3.6 Flexibility tion settings, except for the following changes that we found yielded Despite its widespread adoption, SQL is routinely criticized for better performance without diverging from core MR fundamentals: its insufficient expressive prowess. Some believe that it was a mis- (1) data is stored using 256MB data blocks instead of the default take for the database research community in the 1970s to focus on 64MB, (2) each task executor JVM ran with a maximum heap size data sub-languages that could be embedded in any programming of 512MB and the DataNode/JobTracker JVMs ran with a maxi- language, rather than adding high-level data access to all program- mum heap size of a 1024MB (for a total size of 3.5GB per node), ming languages. Fortunately, new application frameworks, such as (3) we enabled Hadoop’s “rack awareness” feature for data locality Ruby on Rails [21] and LINQ [14], have started to reverse this sit- in the cluster, and (4) we allowed Hadoop to reuse the task JVM uation by leveraging new programming language functionality to executor instead starting a new process for each Map/Reduce task. implement an object-relational mapping pattern. These program- Moreover, we configured the system to run two Map instances and ming environments allow developers to benefit from the robustness a single Reduce instance concurrently on each node. of DBMS technologies without the burden of writing complex SQL. The Hadoop framework also provides an implementation of the Proponents of the MR model argue that SQL does not facilitate Google distributed file system [12]. For each benchmark trial, we the desired generality that MR provides. But almost all of the major store all input and output data in the Hadoop distributed file system DBMS products (commercial and open-source) now provide sup- (HDFS). We used the default settings of HDFS of three replicas port for user-defined functions, stored procedures, and user-defined per block and without compression; we also tested other configura- aggregates in SQL. Although this does not have the full generality tions, such as using only a single replica per block as well as block- of MR, it does improve the flexibility of database systems. and record-level compression, but we found that our tests almost always executed at the same speed or worse with these features en- 3.7 Fault Tolerance abled (see Section 5.1.3). After each benchmark run finishes for a The MR frameworks provide a more sophisticated failure model particular node scaling level, we delete the data directories on each than parallel DBMSs. While both classes of systems use some form node and reformat HDFS so that the next set of input data is repli- of replication to deal with disk failures, MR is far more adept at cated uniformly across all nodes. handling node failures during the execution of a MR computation. Hadoop uses a central job tracker and a “master” HDFS daemon In a MR system, if a unit of work (i.e., processing a block of data) to coordinate node activities. To ensure that these daemons do not fails, then the MR scheduler can automatically restart the task on affect the performance of worker nodes, we execute both of these an alternate node. Part of the flexibility is the result of the fact that additional framework components on a separate node in the cluster. the output files of the Map phase are materialized locally instead of being streamed to the nodes running the Reduce tasks. 
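The paper does not reproduce its configuration files; as a rough sketch, the Hadoop tuning described above corresponds to 0.19-era properties along the following lines. The property names and the way they are set here are our assumptions about a typical setup of that release, not the authors' actual configuration, and the two *.tasks.maximum values are TaskTracker-side settings that would normally live in hadoop-site.xml rather than in a job.
import org.apache.hadoop.mapred.JobConf;

// Sketch of the tuning described above, expressed as pre-0.20 property names.
// Treat the names and their placement as assumptions; the paper only
// describes these settings in prose.
public class TunedJobConf {
  public static JobConf create(Class<?> jobClass) {
    JobConf conf = new JobConf(jobClass);
    conf.setLong("dfs.block.size", 256L * 1024 * 1024);         // 256MB HDFS blocks
    conf.set("mapred.child.java.opts", "-Xmx512m");              // 512MB heap per task JVM
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);           // reuse task executor JVMs
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 2);      // two concurrent Map tasks per node
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 1);   // one concurrent Reduce task per node
    return conf;
  }
}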
Similarly, DBMS-X: We used the latest release of DBMS-X, a parallel SQL pipelines of MR jobs, such as the one described in Section 4.3.4, DBMS from a major relational database vendor that stores data in
  • 5. a row-based format. The system is installed on each node and con- We also measured the time it takes for each system to load the figured to use 4GB shared memory segments for the buffer pool test data. The results from these measurements are split between and other temporary space. Each table is hash partitioned across the actual loading of the data and any additional operations after the all nodes on the salient attribute for that particular table, and then loading that each system performs, such as compressing or building sorted and indexed on different attributes (see Sections 4.2.1 and indexes. The initial input data on each node is stored on one of its 4.3.1). Like the Hadoop experiments, we deleted the tables in DBMS- two locally installed disks. X and reloaded the data for each trial to ensure that the tuples was Unless otherwise indicated, the final results from the queries ex- uniformly distributed in the cluster. ecuting in Vertica and DBMS-X are piped from a shell command By default DBMS-X does not compress data in its internal stor- into a file on the disk not used by the DBMS. Although it is possi- age, but it does provide ability to compress tables using a well- ble to do an equivalent operation in Hadoop, it is easier (and more known dictionary-based scheme. We found that enabling compres- common) to store the results of a MR program into the distributed sion reduced the execution times for almost all the benchmark tasks file system. This procedure, however, is not analogous to how the by 50%, and thus we only report results with compression enabled. DBMSs produce their output data; rather than storing the results in In only one case did we find that using compression actually per- a single file, the MR program produces one output file for each Re- formed worse. Furthermore, because all of our benchmarks are duce instance and stores them in a single directory. The standard read-only, we did not enable replication features in DBMS-X, since practice is for developers then to use these output directories as a this would not have improved performance and complicates the in- single input unit for other MR jobs. If, however, a user wishes to stallation process. use this data in a non-MR application, they must first combine the results into a single file and download it to the local file system. Vertica: The Vertica database is a parallel DBMS designed for Because of this discrepancy, we execute an extra Reduce func- large data warehouses [3]. The main distinction of Vertica from tion for each MR benchmark task that simply combines the final other DBMSs (including DBMS-X) is that all data is stored as columns, output into a single file in HDFS. Our results differentiate between rather than rows [20]. It uses a unique execution engine designed the execution times for Hadoop running the actual benchmark task specifically for operating on top of a column-oriented storage layer. versus the additional combine operation. Thus, the Hadoop results Unlike DBMS-X, Vertica compresses data by default since its ex- displayed in the graphs for this paper are shown as stacked bars: the ecutor can operate directly on compressed tables. Because dis- lower portion of each bar is the execution time for just the specific abling this feature is not typical in Vertica deployments, the Ver- benchmark task, while the upper portion is the execution time for tica results in this paper are generated using only compressed data. 
the single Reduce function to combine all of the program’s output Vertica also sorts every table by one or more attributes based on a data into a single file. clustered index. 4.2 The Original MR Task We found that the default 256MB buffer size per node performed Our first benchmark task is the “Grep task” taken from the orig- well in our experiments. The Vertica resource manager is respon- inal MapReduce paper, which the authors describe as “represen- sible for setting the amount of memory given to queries, but we tative of a large subset of the real programs written by users of provide a hint to the system to expect to execute only one query at MapReduce” [8]. For this task, each system must scan through a a time. Thus, each query receives most the maximum amount of data set of 100-byte records looking for a three-character pattern. memory available on each node at runtime. Each record consists of a unique key in the first 10 bytes, followed 4.1.2 Node Configuration by a 90-byte random value. The search pattern is only found in the All three systems were deployed on a 100-node cluster. Each last 90 bytes once in every 10,000 records. node has a single 2.40 GHz Intel Core 2 Duo processor running 64- The input data is stored on each node in plain text files, with one bit Red Hat Enterprise Linux 5 (kernel version 2.6.18) with 4GB record per line. For the Hadoop trials, we uploaded these files unal- RAM and two 250GB SATA-I hard disks. According to hdparm, tered directly into HDFS. To load the data into Vertica and DBMS- the hard disks deliver 7GB/sec for cached reads and about 74MB/sec X, we execute each system’s proprietary load commands in parallel for buffered reads. The nodes are connected with Cisco Catalyst on each node and store the data using the following schema: 3750E-48TD switches. This switch has gigabit Ethernet ports for CREATE TABLE Data ( each node and an internal switching fabric of 128Gbps [6]. There key VARCHAR(10) PRIMARY KEY, are 50 nodes per switch. The switches are linked together via Cisco field VARCHAR(90) ); StackWise Plus, which creates a 64Gbps ring between the switches. We execute the Grep task using two different data sets. The mea- Traffic between two nodes on the same switch is entirely local to the surements in the original MapReduce paper are based on process- switch and does not impact traffic on the ring. ing 1TB of data on approximately 1800 nodes, which is 5.6 million 4.1.3 Benchmark Execution records or roughly 535MB of data per node. For each system, we For each benchmark task, we describe the steps used to imple- execute the Grep task on cluster sizes of 1, 10, 25, 50, and 100 ment the MR program as well as provide the equivalent SQL state- nodes. The total number of records processed for each cluster size ment(s) executed by the two database systems. We executed each is therefore 5.6 million times the number of nodes. The perfor- task three times and report the average of the trials. Each system ex- mance of each system not only illustrates how each system scales ecutes the benchmark tasks separately to ensure exclusive access to as the amount of data is increased, but also allows us to (to some the cluster’s resources. To measure the basic performance without extent) compare the results to the original MR system. the overhead of coordinating parallel tasks, we first execute each While our first dataset fixes the size of the data per node to be the task on a single node. 
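The extra Reduce pass described above, which merges an MR program's many per-Map output files into one file in HDFS, is not listed in the paper. One plausible way to express it is a trivial identity job with a single Reduce task, sketched here using the pre-0.20 mapred API; the class name and path arguments are illustrative.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Identity map/reduce pass with a single Reduce task, so that the many
// per-Map output files of a benchmark job end up as one file in HDFS.
public class CombineOutputs {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CombineOutputs.class);
    conf.setJobName("combine-output");
    conf.setInputFormat(KeyValueTextInputFormat.class);   // re-read the tab-separated job output
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setNumReduceTasks(1);                             // one reducer => one output file
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // directory produced by the benchmark job
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // single combined result
    JobClient.runJob(conf);
  }
}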
We then execute the task on different cluster same as the original MR benchmark and only varies the number of sizes to show how each system scales as both the amount of data nodes, our second dataset fixes the total dataset size to be the same processed and available resources are increased. We only report as the original MR benchmark (1TB) and evenly divides the data results using trials where all nodes are available and the system’s amongst a variable number of nodes. This task measures how well software operates correctly during the benchmark execution. each system scales as the number of available nodes is increased.
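To make the Grep task concrete, a Map-only job for this kind of pattern search might look roughly like the sketch below (pre-0.20 mapred API). This is our illustration rather than the benchmark's actual program; it assumes each input line holds the 10-byte key immediately followed by the 90-byte value, and the pattern string is just an example.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Map-only Grep: emit any record whose 90-byte value contains the search
// pattern. With no Reduce function configured (setNumReduceTasks(0)), each
// Map instance's output is written directly to HDFS as the final result.
public class GrepMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private static final String PATTERN = "XYZ";   // example three-character pattern

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String record = line.toString();
    if (record.length() < 10) {
      return;                                    // skip malformed lines
    }
    String key = record.substring(0, 10);        // 10-byte unique key
    String value = record.substring(10);         // 90-byte random value
    if (value.contains(PATTERN)) {
      output.collect(new Text(key), new Text(value));
    }
  }
}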
  • 6. [Figures omitted — Figure 1: Load Times – Grep Task Data Set (535MB/node); Figure 2: Load Times – Grep Task Data Set (1TB/cluster); Figure 3: Load Times – UserVisits Data Set (20GB/node)]
Since Hadoop needs a total of 3TB of disk space in order to store ferent node based on the hash of its primary key. Once the data is three replicas of each block in HDFS, we were limited to running loaded, the columns are automatically sorted and compressed ac- this benchmark only on 25, 50, and 100 nodes (at fewer than 25 cording to the physical design of the database. nodes, there is not enough available disk space to store 3TB). Results & Discussion: The results for loading both the 535MB/node 4.2.1 Data Loading and 1TB/cluster data sets are shown in Figures 1 and 2, respectively. We now describe the procedures used to load the data from the For DBMS-X, we separate the times of the two loading phases, nodes’ local files into each system’s internal storage representation. which are shown as a stacked bar in the graphs: the bottom seg- ment represents the execution time of the parallel LOAD commands Hadoop: There are two ways to load data into Hadoop’s distributed and the top segment is the reorganization process. file system: (1) use Hadoop’s command-line file utility to upload The most striking feature of the results for the load times in files stored on the local filesystem into HDFS or (2) create a custom 535MB/node data set shown in Figure 1 is the difference in perfor- data loader program that writes data using Hadoop’s internal I/O mance of DBMS-X compared to Hadoop and Vertica. Despite issu- API. We did not need to alter the input data for our MR programs, ing the initial LOAD command in the first phase on each node in par- therefore we loaded the files on each node in parallel directly into allel, the data was actually loaded on each node sequentially. Thus, HDFS as plain text using the command-line utility. Storing the data as the total amount of data is increased, the load times also in- in this manner enables MR programs to access data using Hadoop’s creased proportionately. This also explains why, for the 1TB/cluster TextInputFormat data format, where the keys are line num- data set, the load times for DBMS-X do not decrease as less data bers in each file and their corresponding values are the contents of is stored per node. However, the compression and housekeeping on each line. We found that this approach yielded the best performance DBMS-X can be done in parallel across nodes, and thus the execu- in both the loading process and task execution, as opposed to using tion time of the second phase of the loading process is cut in half Hadoop’s serialized data formats or compression features. when twice as many nodes are used to store the 1TB of data. Without using either block- or record-level compression, Hadoop DBMS-X: The loading process in DBMS-X occurs in two phases.
clearly outperforms both DBMS-X and Vertica since each node is First, we execute the LOAD SQL command in parallel on each node simply copying each data file from the local disk into the local in the cluster to read data from the local filesystem and insert its HDFS instance and then distributing two replicas to other nodes contents into a particular table in the database. We specify in this in the cluster. If we load the data into Hadoop using only a sin- command that the local data is delimited by a special character, thus gle replica per block, then the load times are reduced by a factor we did not need to write a custom program to transform the data of three. But as we will discuss in Section 5, the lack of multiple before loading it. But because our data generator simply creates replicas often increases the execution times of jobs. random keys for each record on each node, the system must redis- tribute the tuples to other nodes in the cluster as it reads each record 4.2.2 Task Execution from the input files based on the target table’s partitioning attribute. SQL Commands: A pattern search for a particular field is sim- It would be possible to generate a “hash-aware” version of the data ply the following query in SQL. Neither SQL system contained an generator that would allow DBMS-X to just load the input files on index on the field attribute, so this query requires a full table scan. each node without this redistribution process, but we do not believe that this would improve load times very much. SELECT * FROM Data WHERE field LIKE ‘%XYZ%’; Once the initial loading phase is complete, we then execute an administrative command to reorganize the data on each node. This MapReduce Program: The MR program consists of just a Map process executes in parallel on each node to compress data, build function that is given a single record already split into the appro- each table’s indexes, and perform other housekeeping. priate key/value pair and then performs a sub-string match on the value. If the search pattern is found, the Map function simply out- Vertica: Vertica also provides a COPY SQL command that is is- puts the input key/value pair to HDFS. Because no Reduce function sued from a single host and then coordinates the loading process on is defined, the output generated by each Map instance is the final multiple nodes in parallel in the cluster. The user gives the COPY output of the program. command as input a list of nodes to execute the loading operation for. This process is similar to DBMS-X: on each node the Vertica Results & Discussion: The performance results for the three sys- loader splits the input data files on a delimiter, creates a new tuple tems for this task is shown in Figures 4 and 5. Surprisingly, the for each line in an input file, and redistributes that tuple to a dif- relative differences between the systems are not consistent in the
  • 7. [Figures omitted — Figure 4: Grep Task Results – 535MB/node Data Set; Figure 5: Grep Task Results – 1TB/cluster Data Set]
two figures. In Figure 4, the two parallel databases perform about 600,000 unique HTML documents, each with a unique URL. In the same, more than a factor of two faster in Hadoop. But in Fig- each document, we randomly generate links to other pages set us- ure 5, both DBMS-X and Hadoop perform more than a factor of ing a Zipfian distribution. two slower than Vertica. The reason is that the amount of data pro- We also generated two additional data sets meant to model log cessing varies substantially from the two experiments. For the re- files of HTTP server traffic. These data sets consist of values de- sults in Figure 4, very little data is being processed (535MB/node). rived from the HTML documents as well as several randomly gen- This causes Hadoop’s non-insignificant start-up costs to become the erated attributes. The schema of these three tables is as follows: limiting factor in its performance. As will be described in Section CREATE TABLE Documents ( CREATE TABLE UserVisits ( 5.1.2, for short-running queries (i.e., queries that take less than a url VARCHAR(100) sourceIP VARCHAR(16), PRIMARY KEY, destURL VARCHAR(100), minute), Hadoop’s start-up costs can dominate the execution time. contents TEXT ); visitDate DATE, In our observations, we found that it takes 10–25 seconds before all adRevenue FLOAT, Map tasks have been started and are running at full speed across the CREATE TABLE Rankings ( userAgent VARCHAR(64), pageURL VARCHAR(100) countryCode VARCHAR(3), nodes in the cluster. Furthermore, as the total number of allocated PRIMARY KEY, languageCode VARCHAR(6), Map tasks increases, there is additional overhead required for the pageRank INT, searchWord VARCHAR(32), central job tracker to coordinate node activities. Hence, this fixed avgDuration INT ); duration INT ); overhead increases slightly as more nodes are added to the cluster Our data generator created unique files with 155 million UserVis- and for longer data processing tasks, as shown in Figure 5, this fixed its records (20GB/node) and 18 million Rankings records (1GB/node) cost is dwarfed by the time to complete the required processing. on each node. The visitDate, adRevenue, and sourceIP fields are The upper segments of each Hadoop bar in the graphs represent picked uniformly at random from specific ranges. All other fields the execution time of the additional MR job to combine the output are picked uniformly from sampling real-world data sets. Each data into a single file. Since we ran this as a separate MapReduce job, file is stored on each node as a column-delimited text file. these segments consume a larger percentage of overall time in Fig- 4.3.1 Data Loading ure 4, as the fixed start-up overhead cost again dominates the work needed to perform the rest of the task. Even though the Grep task is We now describe the procedures for loading the UserVisits and selective, the results in Figure 5 show how this combine phase can Rankings data sets. For reasons to be discussed in Section 4.3.5, still take hundreds of seconds due to the need to open and combine only Hadoop needs to directly load the Documents files into its in- many small output files. Each Map instance produces its output in ternal storage system.
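Because the data files carry no schema, an MR task that reads them has to split each line on the field delimiter and interpret the pieces positionally at run time, following the UserVisits layout above (sourceIP, destURL, visitDate, adRevenue, userAgent, countryCode, languageCode, searchWord, duration). The sketch below illustrates that parsing step; the delimiter and helper names are assumptions, not the benchmark's actual code.
// Illustrative run-time parsing of a delimited UserVisits line; the field
// order follows the schema above, but the delimiter and method names are
// only examples.
public class UserVisitsParser {
  private static final String FIELD_DELIMITER = "\\|";   // example field delimiter (regex for '|')

  // Splits one line into its nine fields.
  public static String[] fields(String line) {
    return line.split(FIELD_DELIMITER);
  }

  public static String sourceIP(String line) {
    return fields(line)[0];
  }

  public static float adRevenue(String line) {
    // adRevenue is the fourth column; nothing in the file itself says so,
    // which is exactly the schema knowledge each program must re-encode.
    return Float.parseFloat(fields(line)[3]);
  }
}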
4.3.1 Data Loading

We now describe the procedures for loading the UserVisits and Rankings data sets. For reasons to be discussed in Section 4.3.5, only Hadoop needs to directly load the Documents files into its internal storage system. DBMS-X and Vertica both execute a UDF that processes the Documents on each node at runtime and loads the data into a temporary table. We account for the overhead of this approach in the benchmark times, rather than in the load times. Therefore, we do not provide results for loading this data set.

Hadoop: Unlike the Grep task's data set, which was uploaded directly into HDFS unaltered, the UserVisits and Rankings data sets needed to be modified so that the first and second columns are separated by a tab delimiter and all other fields in each line are separated by a unique field delimiter. Because there are no schemas in the MR model, in order to access the different attributes at run time, the Map and Reduce functions in each task must manually split the value by the delimiter character into an array of strings.

We wrote a custom data loader executed in parallel on each node to read in each line of the data sets, prepare the data as needed, and then write the tuple into a plain text file in HDFS. Loading the data sets in this manner was roughly three times slower than using the command-line utility, but it did not require us to write custom input handlers in Hadoop; the MR programs are able to use Hadoop's KeyValueTextInputFormat interface on the data files to automatically split lines of text files into key/value pairs by the tab delimiter. Again, we found that other data format options, such as SequenceFileInputFormat or custom Writable tuples, resulted in both slower load and execution times.
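The paper does not list this loader, so the sketch below is only a guess at its shape. It assumes the generator's files are comma-delimited and uses '|' as a stand-in for the unnamed field delimiter; the script name and both delimiters are assumptions, not details from the paper.

#!/usr/bin/env python3
# prepare_for_hdfs.py -- illustrative sketch of the per-node record preparation
# described above (not the authors' loader). The first column becomes the key,
# a tab separates it from the rest of the record, and the remaining columns are
# re-joined with a single field delimiter inside the value.
import sys

SOURCE_DELIM = ","   # assumed delimiter in the generator's output files
FIELD_DELIM = "|"    # assumed stand-in for the "unique field delimiter"

for line in sys.stdin:
    cols = line.rstrip("\n").split(SOURCE_DELIM)
    print(cols[0] + "\t" + FIELD_DELIM.join(cols[1:]))

The rewritten files can then be copied into HDFS, where KeyValueTextInputFormat splits each line on the tab into the key/value pair handed to the Map function.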
DBMS-X: We used the same loading procedures for DBMS-X as discussed in Section 4.2. The Rankings table was hash partitioned across the cluster on pageURL and the data on each node was sorted by pageRank. Likewise, the UserVisits table was hash partitioned on destinationURL and sorted by visitDate on each node.

Vertica: Similar to DBMS-X, Vertica used the same bulk load commands discussed in Section 4.2 and sorted the UserVisits and Rankings tables by the visitDate and pageRank columns, respectively.

Results & Discussion: Since the results of loading the UserVisits and Rankings data sets are similar, we only provide the results for loading the larger UserVisits data in Figure 3. Just as with loading the Grep 535MB/node data set (Figure 1), the loading times for each system increase in proportion to the number of nodes used.

4.3.2 Selection Task

The Selection task is a lightweight filter to find the pageURLs in the Rankings table (1GB/node) with a pageRank above a user-defined threshold. For our experiments, we set this threshold parameter to 10, which yields approximately 36,000 records per data file on each node.

SQL Commands: The DBMSs execute the selection task using the following simple SQL statement:

SELECT pageURL, pageRank
FROM Rankings WHERE pageRank > X;

MapReduce Program: The MR program uses only a single Map function that splits the input value based on the field delimiter and outputs the record's pageURL and pageRank as a new key/value pair if its pageRank is above the threshold. This task does not require a Reduce function, since each pageURL in the Rankings data set is unique across all nodes.
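As with the Grep task, the authors' implementation is a Java Map-only job; the following Streaming-style sketch (a hypothetical selection_map.py) is illustrative only and reuses the assumed '|' field delimiter and field ordering from the loading sketch above.

#!/usr/bin/env python3
# selection_map.py -- illustrative Hadoop Streaming mapper for the Selection
# task (a sketch, not the authors' Java code). The key is the pageURL and the
# value holds the remaining Rankings fields joined by the field delimiter;
# records whose pageRank exceeds the threshold are emitted as
# (pageURL, pageRank) pairs.
import sys

FIELD_DELIM = "|"   # assumed field delimiter, as in the loading sketch
THRESHOLD = 10      # the pageRank cutoff used in the experiments

for line in sys.stdin:
    page_url, _, value = line.rstrip("\n").partition("\t")
    page_rank = int(value.split(FIELD_DELIM)[0])  # assumes pageRank is the first value field
    if page_rank > THRESHOLD:
        print(f"{page_url}\t{page_rank}")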
Figure 6: Selection Task Results

Results & Discussion: As was discussed in the Grep task, the results from this experiment, shown in Figure 6, demonstrate again that the parallel DBMSs outperform Hadoop by a rather significant factor across all cluster scaling levels. Although the relative performance of all systems degrades as both the number of nodes and the total amount of data increase, Hadoop is most affected. For example, there is almost a 50% difference in the execution time between the 1-node and 10-node experiments. This is again due to Hadoop's increased start-up costs as more nodes are added to the cluster, which take up a proportionately larger fraction of total query time for short-running queries.

Another important reason why the parallel DBMSs are able to outperform Hadoop is that both Vertica and DBMS-X use an index on the pageRank column and store the Rankings table already sorted by pageRank. Thus, executing this query is trivial. It should also be noted that although Vertica's absolute times remain low, its relative performance degrades as the number of nodes increases. This is in spite of the fact that each node still executes the query in the same amount of time (about 170ms). But because the nodes finish executing the query so quickly, the system becomes flooded with control messages from too many nodes, which then take longer for the system to process. Vertica uses a reliable message layer for query dissemination and commit protocol processing [4], which we believe has considerable overhead when more than a few dozen nodes are involved in the query.

4.3.3 Aggregation Task

Our next task requires each system to calculate the total adRevenue generated for each sourceIP in the UserVisits table (20GB/node), grouped by the sourceIP column. We also ran a variant of this query where we grouped by the seven-character prefix of the sourceIP column to measure the effect of reducing the total number of groups on query performance. We designed this task to measure the performance of parallel analytics on a single read-only table, where nodes need to exchange intermediate data with one another in order to compute the final value. Regardless of the number of nodes in the cluster, this task always produces 2.5 million records (53 MB); the variant query produces 2,000 records (24KB).

SQL Commands: The SQL command to calculate the total adRevenue is straightforward:

SELECT sourceIP, SUM(adRevenue)
FROM UserVisits GROUP BY sourceIP;

The variant query is:

SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);

MapReduce Program: Unlike the previous tasks, the MR program for this task consists of both a Map and a Reduce function. The Map function first splits the input value by the field delimiter, and then outputs the sourceIP field (given as the input key) and the adRevenue field as a new key/value pair. For the variant query, only the first seven characters (representing the first two octets, each stored as three digits) of the sourceIP are used. These two Map functions share the same Reduce function, which simply adds together all of the adRevenue values for each sourceIP and then outputs the prefix and revenue total. We also used MR's Combine feature to perform the pre-aggregation before data is transmitted to the Reduce instances, improving the first query's execution time by a factor of two [8].
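A Streaming-style rendering of this Map/Reduce pair might look like the sketch below (a hypothetical agg.py; the '|' delimiter and the position of adRevenue within the value are assumptions carried over from the earlier sketches, and this is not the authors' Java code). The reducer relies on the framework delivering its input sorted by key; the same reduce logic could also be registered as a combiner to obtain the pre-aggregation described above, and for the variant query the mapper would emit source_ip[:7] instead of the full sourceIP.

#!/usr/bin/env python3
# agg.py -- illustrative Hadoop Streaming mapper/reducer for the Aggregation
# task (a sketch, not the authors' Java code). Run as "agg.py map" for the map
# side and "agg.py reduce" for the reduce side.
import sys

FIELD_DELIM = "|"   # assumed field delimiter

def mapper():
    for line in sys.stdin:
        source_ip, _, value = line.rstrip("\n").partition("\t")
        ad_revenue = value.split(FIELD_DELIM)[2]   # assumed position of adRevenue
        print(f"{source_ip}\t{ad_revenue}")

def reducer():
    current_key, total = None, 0.0
    for line in sys.stdin:
        key, _, revenue = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0.0
        total += float(revenue)
    if current_key is not None:
        print(f"{current_key}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()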
Figure 7: Aggregation Task Results (2.5 million Groups)
Figure 8: Aggregation Task Results (2,000 Groups)

Results & Discussion: The results of the aggregation task experiment in Figures 7 and 8 show once again that the two DBMSs outperform Hadoop. The DBMSs execute these queries by having each node scan its local table, extract the sourceIP and adRevenue fields, and perform a local group by. These local groups are then merged at the query coordinator, which outputs results to the user. The results in Figure 7 illustrate that the two DBMSs perform about the same for a large number of groups, as their runtime is dominated by the cost to transmit the large number of local groups and merge them at the coordinator. For the experiments using fewer nodes, Vertica performs somewhat better, since it has to read less data (it can directly access the sourceIP and adRevenue columns), but it becomes slightly slower as more nodes are used.

Based on the results in Figure 8, it is more advantageous to use a column-store system when processing fewer groups for this task. This is because the two columns accessed (sourceIP and adRevenue) consist of only 20 bytes out of the more than 200 bytes per UserVisits tuple, and therefore there are relatively few groups that need to be merged, so communication costs are much lower than in the non-variant plan. Vertica is thus able to outperform the other two systems by not reading unused parts of the UserVisits tuples.

Note that the execution times for all systems are roughly consistent for any number of nodes (modulo Vertica's slight slowdown as the number of nodes increases). Since this benchmark task requires the system to scan through the entire data set, the run time is always bounded by the constant sequential scan performance and network repartitioning costs for each node.

4.3.4 Join Task

The join task consists of two sub-tasks that perform a complex calculation on two data sets. In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range. Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval. We use the week of January 15–22, 2000 in our experiments, which matches approximately 134,000 records in the UserVisits table.

The salient aspect of this task is that it must consume two different data sets and join them together in order to find pairs of Rankings and UserVisits records with matching values for pageURL and destURL. This task stresses each system using fairly complex operations over a large amount of data. The performance results are also a good indication of how well each DBMS's query optimizer produces efficient join plans.

SQL Commands: In contrast to the complexity of the MR program described below, the DBMSs need only two fairly simple queries to complete the task. The first statement creates a temporary table and uses it to store the output of the SELECT statement that performs the join of UserVisits and Rankings and computes the aggregates. Once this table is populated, it is then trivial to use a second query to output the record with the largest totalRevenue field.

SELECT INTO Temp sourceIP,
       AVG(pageRank) as avgPageRank,
       SUM(adRevenue) as totalRevenue
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL
  AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
GROUP BY UV.sourceIP;

SELECT sourceIP, totalRevenue, avgPageRank
FROM Temp
ORDER BY totalRevenue DESC LIMIT 1;
MapReduce Program: Because the MR model does not have an inherent ability to join two or more disparate data sets, the MR program that implements the join task must be broken out into three separate phases. Each of these phases is implemented as a single MR program in Hadoop, but a phase does not begin executing until the previous phase is complete.

Phase 1 – The first phase filters UserVisits records that are outside the desired date range and then joins the qualifying records with records from the Rankings file. The MR program is initially given all of the UserVisits and Rankings data files as input.

Map Function: For each key/value input pair, we determine its record type by counting the number of fields produced when splitting the value on the delimiter. If it is a UserVisits record, we apply the filter based on the date range predicate. These qualifying records are emitted with composite keys of the form (destURL, K1), where K1 indicates that it is a UserVisits record. All Rankings records are emitted with composite keys of the form (pageURL, K2), where K2 indicates that it is a Rankings record. These output records are repartitioned using a user-supplied partitioning function that only hashes on the URL portion of the composite key.

Reduce Function: The input to the Reduce function is a single sorted run of records in URL order. For each URL, we divide its values into two sets based on the tag component of the composite key. The function then forms the cross product of the two sets to complete the join and outputs a new key/value pair with the sourceIP as the key and the tuple (pageURL, pageRank, adRevenue) as the value.
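Neither the Map nor the Reduce code appears in the paper; the Streaming-style sketch below (a hypothetical join_phase1.py) is only an approximation of the tagged repartition join just described. For brevity it folds the record-type tag into the value instead of using a composite (URL, tag) key, so the default hash partitioning on the URL alone plays the role of the authors' custom partitioner; the field positions, the record-type test, the ISO date format, and the '|' delimiter are all assumptions.

#!/usr/bin/env python3
# join_phase1.py -- illustrative Hadoop Streaming sketch of Phase 1 of the join
# task (not the authors' Java code). The mapper tags each record as UserVisits
# or Rankings by its field count, filters UserVisits on the date range, and
# re-keys everything by URL; the reducer buffers the two tagged sets for each
# URL and emits their cross product. Run as "join_phase1.py map" or
# "join_phase1.py reduce".
import sys

FIELD_DELIM = "|"                       # assumed field delimiter
DATE_LO, DATE_HI = "2000-01-15", "2000-01-22"

def mapper():
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        fields = value.split(FIELD_DELIM)
        if len(fields) > 2:             # assumed test for a UserVisits record
            dest_url, visit_date, ad_revenue = fields[0], fields[1], fields[2]
            if DATE_LO <= visit_date <= DATE_HI:
                print(f"{dest_url}\tV{FIELD_DELIM}{key}{FIELD_DELIM}{ad_revenue}")
        else:                           # otherwise a Rankings record
            print(f"{key}\tR{FIELD_DELIM}{fields[0]}")

def reducer():
    def flush(url, visits, ranks):
        for source_ip, ad_revenue in visits:
            for page_rank in ranks:
                print(f"{source_ip}\t{url}{FIELD_DELIM}{page_rank}{FIELD_DELIM}{ad_revenue}")

    current_url, visits, ranks = None, [], []
    for line in sys.stdin:
        url, _, value = line.rstrip("\n").partition("\t")
        if url != current_url:
            if current_url is not None:
                flush(current_url, visits, ranks)
            current_url, visits, ranks = url, [], []
        tag, *rest = value.split(FIELD_DELIM)
        if tag == "V":
            visits.append((rest[0], rest[1]))
        else:
            ranks.append(rest[0])
    if current_url is not None:
        flush(current_url, visits, ranks)

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()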
Figure 9: Join Task Results
Figure 10: UDF Aggregation Task Results

Phase 2 – The next phase computes the total adRevenue and average pageRank based on the sourceIP of the records generated in Phase 1. This phase uses a Reduce function in order to gather all of the records for a particular sourceIP on a single node. We use the identity Map function in the Hadoop API to supply records directly to the split process [1, 8].

Reduce Function: For each sourceIP, the function adds up the adRevenue and computes the average pageRank, retaining the one with the maximum total ad revenue. Each Reduce instance outputs a single record with sourceIP as the key and the value as a tuple of the form (avgPageRank, totalRevenue).

Phase 3 – In the final phase, we again only need to define a single Reduce function that uses the output from the previous phase to produce the record with the largest total adRevenue. We only execute one instance of the Reduce function on a single node to scan all the records from Phase 2 and find the target record.

Reduce Function: The function processes each key/value pair and keeps track of the record with the largest totalRevenue field. Because the Hadoop API does not easily expose the total number of records that a Reduce instance will process, there is no way for the Reduce function to know that it is processing the last record. Therefore, we override the closing callback method in our Reduce implementation so that the MR program outputs the largest record right before it exits.
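The sketch below (a hypothetical join_phase2_reduce.py) illustrates this style of reducer under the same assumptions as the earlier sketches; it is not the authors' Java code. It aggregates adRevenue and pageRank per sourceIP over the Phase 1 output and keeps only the sourceIP with the largest total revenue, printing that single record once its input is exhausted. In a streaming script the end of input plays the role of the overridden closing callback, and Phase 3's single Reduce instance performs the same comparison across the per-instance records that Phase 2 emits.

#!/usr/bin/env python3
# join_phase2_reduce.py -- illustrative Hadoop Streaming reducer for Phase 2 of
# the join task (a sketch, not the authors' Java code). Input records are keyed
# by sourceIP with (pageURL, pageRank, adRevenue) values produced by Phase 1.
import sys

FIELD_DELIM = "|"   # assumed delimiter, matching the earlier sketches
best = None         # (totalRevenue, sourceIP, avgPageRank) seen so far

def consider(source_ip, rank_sum, count, revenue_total):
    global best
    if best is None or revenue_total > best[0]:
        best = (revenue_total, source_ip, rank_sum / count)

current_ip, rank_sum, count, revenue_total = None, 0.0, 0, 0.0
for line in sys.stdin:
    source_ip, _, value = line.rstrip("\n").partition("\t")
    _url, page_rank, ad_revenue = value.split(FIELD_DELIM)
    if source_ip != current_ip:
        if current_ip is not None:
            consider(current_ip, rank_sum, count, revenue_total)
        current_ip, rank_sum, count, revenue_total = source_ip, 0.0, 0, 0.0
    rank_sum += float(page_rank)
    count += 1
    revenue_total += float(ad_revenue)
if current_ip is not None:
    consider(current_ip, rank_sum, count, revenue_total)
if best is not None:
    revenue_total, source_ip, avg_rank = best
    print(f"{source_ip}\t{avg_rank}{FIELD_DELIM}{revenue_total}")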
Results & Discussion: The performance results for this task are displayed in Figure 9. We had to slightly change the SQL used in the 100-node experiments for Vertica due to an optimizer bug in the system, which is why there is an increase in the execution time for Vertica going from 50 to 100 nodes. But even with this increase, it is clear that this task results in the biggest performance difference between Hadoop and the parallel database systems. The reason for this disparity is two-fold.

First, despite the increased complexity of the query, the performance of Hadoop is yet again limited by the speed with which the large UserVisits table (20GB/node) can be read off disk. The MR program has to perform a complete table scan, while the parallel database systems were able to take advantage of clustered indexes on UserVisits.visitDate to significantly reduce the amount of data that needed to be read. When breaking down the costs of the different parts of the Hadoop query, we found that regardless of the number of nodes in the cluster, phase 2 and phase 3 took on average 24.3 seconds and 12.7 seconds, respectively. In contrast, phase 1, which contains the Map task that reads in the UserVisits and Rankings tables, takes an average of 1434.7 seconds to complete. Interestingly, it takes approximately 600 seconds of raw I/O to read the UserVisits and Rankings tables off of disk and then another 300 seconds to split, parse, and deserialize the various attributes. Thus, the CPU overhead needed to parse these tables on the fly is the limiting factor for Hadoop.

Second, the parallel DBMSs are able to take advantage of the fact that both the UserVisits and the Rankings tables are partitioned by the join key. This means that both systems are able to do the join locally on each node, without any network overhead of repartitioning before the join. Thus, they simply have to do a local hash join between the Rankings table and a selective part of the UserVisits table on each node, with a trivial ORDER BY clause across nodes.

4.3.5 UDF Aggregation Task

The final task is to compute the inlink count for each document in the dataset, a task that is often used as a component of PageRank calculations. Specifically, for this task, the systems must read each document file and search for all the URLs that appear in the contents. The systems must then, for each unique URL, count the number of unique pages that reference that particular URL across the entire set of files. It is this type of task that MR is believed to be commonly used for.

We make two adjustments for this task in order to make processing easier in Hadoop. First, we allow the aggregate to include self-references, as it is non-trivial for a Map function to discover the name of the input file it is processing. Second, on each node we concatenate the HTML documents into larger files when storing them in HDFS. We found this improved Hadoop's performance by a factor of two and helped avoid memory issues with the central HDFS master when a large number of files are stored in the system.

SQL Commands: To perform this task in a parallel DBMS requires a user-defined function F that parses the contents of each record in the Documents table and emits URLs into the database. This function can be written in a general-purpose language and is effectively identical to the Map program discussed below. With this function F, we populate a temporary table with a list of URLs and then can execute a simple query to calculate the inlink count:

SELECT INTO Temp F(contents) FROM Documents;
SELECT url, SUM(value) FROM Temp GROUP BY url;

Despite the simplicity of this proposed UDF, we found that in practice it was difficult to implement in the DBMSs.

For DBMS-X, we translated the MR program used in Hadoop into an equivalent C program that uses the POSIX regular expression library to search for links in the document. For each URL found in the document contents, the UDF returns a new tuple (URL,