Second, the underlying single-machine DBMSs are unable to use a global index residing outside each single system, so they may be suboptimal in performance for some kinds of queries. This could perhaps be handled in the middleware, but that would still be a detour.

Third, the data loading speed of a DBMS is slow due to very strict constraints on data schema and semantics. The data usually need to go through complex logic before being stored, which is unnecessary for many applications.

The construction of HadoopDB takes a DBMS basis, and so incurs inevitable limitations from the DBMS legacy. We believe that Hadoop has already done a lot to make itself a competent system for large-scale data processing applications, so the right way is to take a Hadoop basis and borrow DBMS techniques where appropriate.

In this paper, we propose an approach that integrates DBMSs into Hadoop as a read-only execution layer. Based on Hadoop, we incorporate modified DBMS engines augmented with a customized storage engine that can directly access data on HDFS and make use of a global index access method. In this architecture, the DBMS provides efficient read-only operators instead of managing the data. This yields the following benefits:

(1) With the data placed on HDFS, fault tolerance in the data layer is handled naturally.

(2) The DBMS engine executes sub-queries with the same efficiency advantage as in HadoopDB. In addition, based on HDFS and MapReduce, a global index mechanism can be put into action with the DBMS engines, significantly improving the performance of certain queries.

(3) As a read-only layer, the DBMS is not responsible for data loading; instead, the data are loaded through a loader outside the DBMS, or the user can write the data directly to HDFS in a predefined manner. This accelerates data loading to the raw speed of HDFS writing, while keeping convenience and flexibility for the user.

The remainder of the paper is organized as follows: Section II introduces related work; Section III analyzes the application requirements and existing systems, and then positions the desired system; Section IV describes our proposed system; Section V gives experimental results; Section VI concludes the paper.

II. RELATED WORK

With the wide deployment of Hadoop in the data analysis field, two types of work have appeared to improve Hadoop-based systems. The first targets usability. Despite its flexibility, for many users the low-level language of MapReduce is inconvenient to use compared to a higher-level language such as the SQL of relational DBMSs, so several systems have been developed on top of Hadoop. Facebook's Hive [8] and Yahoo's Pig [9] are examples of this kind. They provide simple declarative languages capable of expressing complex ad-hoc queries on structured data. Some other higher-level languages or system-level products have also been developed on top of either MapReduce or similar techniques [10][11]. These works mainly address the language problem.

The second type targets efficiency. Some works improve the Hadoop kernel itself [12][13][14][15], while others improve the system through external mechanisms [7][16]. The work on integrating DBMS and Hadoop stems from the observations in [5], which point out that the brute-force style of MapReduce is not optimal in terms of efficiency, and that the techniques of MapReduce and DBMS should be merged. HadoopDB [7] is, as far as we know, the first attempt to merge the two systems. Although it pioneered the idea, HadoopDB still has limitations that bottleneck its application in real scenes.

III. SYSTEMS FOR LARGE-SCALE DATA ANALYSIS

In this section we first introduce the requirements of our large dataset analysis application; we believe these requirements are typical of many other applications. Based on these requirements, we revisit existing systems for data processing. Finally, in terms of merging the techniques of both Hadoop and DBMS, we determine the position where our system should stand.

A. Application Requirements

Our case is a network security application. A number of monitors keep watching the whole network and generate sampling records for captured events. These data stream into the analytic platform in real time. Once stored in the system, the data should be available for ad-hoc read-only queries. This analytic platform is the one we focus on.

1) Large-scale Parallel Processing

Large-scale parallel processing power is a basic requirement for all large dataset analysis applications. Unlike the point operations of the key-value model in web services, analysis work usually needs to access a big part of the whole dataset even for a single query, and so must resort to large-scale parallel processing for raw power. In such an environment, automatic mechanisms are critical: automatic parallelization, scheduling, and fault handling liberate the user from heavy programming and maintenance work, and play an important role in guaranteeing the scalability of the system.

2) High Efficiency

Efficiency is a necessary concern because each single query takes up many resources, and higher efficiency can lower the cost significantly.

Besides the common reasons, there is a special one in our case. In many other applications the data have a short life cycle: the data are loaded into the system in batch mode, some almost-fixed queries are run over them, and after that the data are removed or offloaded to an offline system. Under such conditions, organizing the data into sophisticated structures is not worthwhile given the extra maintenance cost and the low utility. Sometimes the tasks on the data simply generate statistical reports as timed jobs, for example running only during the night, so it
may be all right even if the execution takes a less efficient way. These applications can be seen as data processing rather than data analysis.

By comparison, our application deals with ad-hoc analytic queries over a long-lived dataset. It is worthwhile to adopt optimized data structures and execution mechanisms to improve query efficiency. For example, the queries often carry predicates on some attributes, for which using an index can reduce the execution cost in certain cases. Although index maintenance has extra cost, this is amortized by repeated usage. Ad-hoc queries also make caching useful: queries on the same set of data may recur, so the cache will pay off, and using an index also creates demand for an index data cache. After all, the long life cycle of the data and the ad-hoc queries justify the effort for optimization.

3) Continuous High Speed Data Loading

The data in our application stream in continuously at relatively high speed. This requires the system to be capable of loading data at high speed in an online mode. The data can be stored in an append-only manner, and once stored they are never updated. High-speed online loading requires the logic on the loading path to be simple enough; unnecessarily strict semantics checking should be avoided.

B. Existing Systems Reconsidered

In response to the requirements of the application, we consider three types of systems: database management systems, Hadoop, and HadoopDB. If we treat the techniques of DBMS and Hadoop as two extremes, there is a broad spectrum between them. To satisfy the requirements of the application, it is right to draw strength from both. HadoopDB is essentially a DBMS equipped with some Hadoop techniques; we believe, however, that our system should start from the other side.

1) Existing Systems

DBMS: DBMSs have a long history in data management. The irreplaceable domain of the DBMS is transaction processing, where the ACID properties must be guaranteed. In data analysis, DBMSs also hold an important position: parallel DBMSs are very popular in the data warehouse market, where they provide a high degree of parallelism and achieve good performance for analytic queries.

For parallel processing, parallel DBMSs compete only at a limited scale. The serious problem with DBMSs is that most of them are not fault tolerant: if something goes wrong during the execution of a query, the query has to restart entirely. In a large-scale system consisting of thousands of components, a long-running query will never succeed considering the high failure rate. While parallel DBMSs are competent for data warehousing, they are not suitable for large-scale applications in this respect. The largest DBMS-based analytic system we know of consists of only 100 machines.

For efficiency, DBMSs embody decades of academic and industrial research. Optimizations have been applied to every layer and component of the DBMS, such as optimized data storage formats, diverse access methods, sophisticated query execution, and efficient data caching. Many of these techniques are widely copied and reinvented by other systems [17]. This is the advantage of the DBMS, and it is desirable for large dataset analysis.

For data loading, DBMS-based systems are limited. Due to strict constraints such as the ACID properties, the system cannot load data efficiently, especially online. As far as we know, the online loading speed of any DBMS node is lower than 10MB/s. Systems in data warehouse applications are often offline systems with very weak online loading requirements, where the common case is to load data in batches at regular time points.

Hadoop: Many applications replace the DBMS with Hadoop for a couple of reasons. One of the most important is that Hadoop is scalable thanks to its fault tolerance. In addition, Hadoop is easier to deploy and use. For data processing tasks, MapReduce provides a very simple and flexible parallel programming paradigm, and is able to express complex queries.

As to parallel processing, Hadoop is born for it. The MapReduce run-time system is fully able to parallelize the whole processing across large-scale systems. Both MapReduce and HDFS are completely fault tolerant, making the whole system highly scalable. This is one of the most important properties of Hadoop, and is critical for large dataset analysis. In addition, the block-level replication of HDFS gives MapReduce great opportunities for a high degree of parallelism and fine-grained execution fault tolerance, which improves performance significantly.

For efficiency, Hadoop seems to have a long way to go. MapReduce works in a brute-force way: whatever the query is, it has to scan all the data without helper structures such as indexes. The scan and processing are not guaranteed to be efficient, because implementing the details is the user's work. The data are often stored on HDFS in text format, which is straightforward for the user but not compact. In these respects, Hadoop is not as good as the DBMS, at least for now.

As to data loading, writing data directly to HDFS proceeds at an acceptable speed. For many large-scale data analysis applications, including ours, weak consistency is enough. The data can go through without complex logic, so writing structured data can achieve the same speed as writing an unstructured byte stream. Although the speed drops when replicas are stored, this is an inevitable tradeoff with fault tolerance on data.

HadoopDB: DBMS and Hadoop each have their own superiority in the appropriate domain. To better satisfy the emerging applications, the DBMS must incorporate fault tolerance in order to be scalable, while Hadoop should borrow techniques from the DBMS to improve efficiency. HadoopDB is constructed on this idea: DBMSs are taken as the storage and execution units, and the MapReduce mechanism takes responsibility for parallelization and fault
tolerance on top of the underlying DBMSs. Fig. 1 shows the architecture of HadoopDB briefly. HDFS is used to store the system metadata and the result set of the query. All source data are stored in the DBMSs. When executing a query, Maps are scheduled to the nodes according to the metadata, which tells the location of each block of the data. Maps issue SQL queries to the underlying DBMSs and emit the result records to Reduces. Reduces aggregate result sets from multiple nodes and write the final results onto HDFS.

HadoopDB tries to introduce fault tolerance and fine-grained parallelism into the parallel DBMS, so it belongs to the first category. While high efficiency is preserved, all strict constraints of the DBMS are also inherited. HadoopDB suits applications where the strict schema and semantics of data are given high priority, so it is capable of dealing with traditional database applications. After all, HadoopDB means Hadoop database, not database Hadoop.

Figure 1. Architecture of HadoopDB.

For parallel processing, HadoopDB is only partially fault tolerant. MapReduce guarantees fault tolerance only in the execution layer. The data are stored in DBMSs rather than HDFS, so the availability of the data must be specially handled. A common DBMS has no special concern for the fine-grained data replication needed for intra-query parallelization, without which the MapReduce framework cannot completely exploit the parallelism. HadoopDB uses a batched approach to dump the data out of the database and replicate them to some other nodes, which does not support online loading at all. This functionality could be achieved in the middleware, but implementing fine-grained replication on top of DBMS tables requires non-trivial work, which makes it hard to use in practice.

For efficiency, the advantage of the DBMS is reflected in the integrated system, because each query is translated into sub-queries actually executed by individual DBMS query engines. Despite the improvement, there is still one limitation concerning global structures. Because all the DBMSs in HadoopDB are unmodified single-machine versions, this layer cannot make use of any global structure, such as a global index, which makes sense when the query carries predicates of high selectivity.

Finally comes the data loading requirement. In HadoopDB, the DBMS-based storage obviously inherits the limitation: the data have to be loaded through the complete DBMS logic, so it is difficult to improve the loading speed.

2) Discussion

We have analyzed two typical systems and the integrated HadoopDB. Lack of fault tolerance eliminates traditional parallel DBMSs from the candidates for large-scale data processing applications. However, HadoopDB, in essence a fault-tolerant parallel DBMS, becomes a promising representative from the DBMS camp, which seems capable of supporting this kind of application.

HadoopDB merges techniques from both DBMS and Hadoop, but it can hardly be used for these applications due to the DBMS legacy and the further implementation work required. We can now identify two different ways of merging DBMS and Hadoop techniques, which helps position the desired system. The difference between the two ways lies in the starting point for constructing an integrated system: one starts from the DBMS, the other from Hadoop.

The system we need should go from the other side. Hadoop satisfies the majority of our needs except efficiency, so we should integrate DBMS techniques into a Hadoop-based system, rather than the reverse. MapReduce and HDFS were developed for large-scale data processing applications, and it is only the efficiency that needs special concern. Hence, we should position our system closer to the Hadoop side, while actively incorporating desired properties from the DBMS.

C. Our Approach

Unlike HadoopDB, we take DBMSs as read-only execution components. For a specific query, the dataset on HDFS is split logically into blocks as usual, and each block is assigned an executor, which is now a database execution thread; all intermediate results computed by the database engines are aggregated by Reducers, which are the same as before; final results are written onto HDFS naturally.

Using this approach, the parallel execution is still at block granularity, and fault tolerance is guaranteed in both the data layer and the execution layer.

Efficiency now depends on the DBMS layer. Many techniques such as the data cache, query cache, and optimized operators in the database still take effect. But the data access methods are partially different than before due to the customized storage engine using HDFS; the index access mechanism must be reconsidered and adapted to work in this situation.

The data loading process is intuitive. Streaming data can be packed into an optimized format as in a DBMS, and then written directly onto HDFS, bypassing the transaction logic; this simple path costs little.

IV. DBMS ENGINE INTEGRATED HADOOP SYSTEM

In this section, we give a detailed description of the system constructed for our application. We first give an overview of the system architecture, and then focus on the query execution process. Besides the familiar full scan execution, a global index access mechanism in the MapReduce framework is introduced.
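The read-only execution flow described above — the dataset logically split into blocks, one database executor running a sub-query per block, and Reducers merging the partial results — can be sketched with a toy in-memory analogue. Everything here (the names, the block size, and the group-and-sum stand-in for the SQL sub-query) is illustrative only, not the system's actual code:

```python
from collections import defaultdict

# Toy stand-in for the execution flow of Section III-C: split the dataset
# into blocks, run a read-only "sub-query" per block (the Map side), then
# merge the per-block partial results (the Reduce side).

BLOCK_SIZE = 4  # records per block; the real system uses 64MB HDFS blocks

def split_into_blocks(records, block_size=BLOCK_SIZE):
    """Logically split the dataset into fixed-size blocks."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]

def map_subquery(block):
    """Per-block executor: SELECT key, SUM(val) ... GROUP BY key."""
    partial = defaultdict(int)
    for key, val in block:
        partial[key] += val
    return partial

def reduce_merge(partials):
    """Aggregate the partial results from all blocks into the final result."""
    final = defaultdict(int)
    for partial in partials:
        for key, val in partial.items():
            final[key] += val
    return dict(final)

records = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5), ("c", 6)]
result = reduce_merge(map_subquery(b) for b in split_into_blocks(records))
```

This shape is also what makes the fault-tolerance claim cheap: a failed per-block task can simply be re-executed against a replica of its block, without restarting the whole query.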
A. Overview

The system consists of four parts, as shown in Fig. 2. The bottom is the storage layer, HDFS. On top of HDFS are the database query engines as the executors. The top is the MapReduce system. The middleware layer contains the data loader, the indexer, etc.

Figure 2. Architecture.

The data loader stores the incoming data onto storage in a simple way. The data are packed in binary format into pages, each of which is the smallest I/O unit of the database query engine. The page size is fixed as a system parameter, and is 32KB by default. Using a binary format reduces the occupied disk space compared to the text representation, and so improves both loading and query performance. The binary format is also obeyed by the customized storage engine in the database when parsing the data into tuples. Note that although the structured data are in binary format, the CPU cost of loading is not increased; in fact, it is more efficient than using a text format. The data are replicated automatically by HDFS at block granularity. The default block size is 64MB, and is configurable. The block size of HDFS is an integer multiple of the page size for ease of implementation.

The indexer can create indexes on the loaded data in batch mode. Because HDFS is append-only, it would be complex to build an updatable index in a real-time manner. In practice, the data are often queried with a time range predicate, so creating separate indexes along the time dimension on a periodic basis is an acceptable solution in real scenes. We support the B+-tree index for now. The B+-tree index searches all the data across the cluster, so it is a global index structure. The index data are also stored in HDFS, and can be seen by each database executor. The detailed index structure and the index access method in the MapReduce framework are described in Section IV.C.

The database executors are actually MySQL server threads. The MapReduce run-time system schedules a sub-task (Map instance) to a specific node, on which the sub-task issues a SQL query to the underlying MySQL server on that node. We implement a new storage engine for MySQL so that the query engine can get tuples from HDFS data files. Some tricks are applied to make the query engine capable of accessing tuples from a specific block of the HDFS file, which provides the ability to execute at the block level and fits the DBMSs into the MapReduce execution framework very well.

From the perspective of the whole system, we embed modified database engines into Hadoop rather than just gluing them together as HadoopDB does, which yields a more coordinated system.

B. Query Execution

Fig. 3 describes the framework of the query execution. The query is first translated into sub-queries expressed in SQL. The sub-queries are executed by individual database engine threads. Besides the operations to apply, a sub-query also indicates the position of the data block that the database engine thread should process. This position information is figured out by the splitting process on the source data file, and is used by the MapReduce runtime system to schedule the sub-tasks. Each sub-query is passed to a Map instance on a specific node, where the position parameters in the sub-query are set to the corresponding splitting result values.

Figure 3. Query execution.

Each Map instance issues the sub-query in SQL form to the local database engine thread, and emits the results returned by the database. The Map instance does not need to aggregate the local intermediate results, because the sub-query executed by the database already does this. The most critical part of this process is the customized storage engine, which provides the ability to access HDFS data at block level.

1) Customized Storage Engine

The query engine accesses the data through the storage engine using a collection of routines in an iterator manner. init() is invoked first every time the executor wants to access the data. After that, the get_next() routine is called repeatedly by the executor, returning a tuple each time, and the executor applies the operations on the stream of tuples. When the query is finished, a close() function is called, which cleans up the context.

We store the dataset on HDFS, so the first thing we do is implement these routines using the HDFS API. The implementation is almost the same as that for a local file system, except that all file system calls are replaced by their HDFS counterparts. The data are page formatted, and reading is on a page basis. The predefined data format is obeyed when parsing tuples out of a page. A data
cache is also implemented in the storage engine, where the usual LRU eviction algorithm is adopted.

The schema definition of the HDFS dataset must be registered in the database, so that queries can be executed by the database without syntax or semantics exceptions. This is achieved by the cooperation of the data loader, the database query engine, and the storage engine. When a new table is to be created, the data loader creates the necessary data files for the table and issues 'create table' to all nodes. The database query engine processes this DDL (Data Definition Language) query and records the information about the table in the metadata. Then the query engine calls the create() function of the storage engine with the necessary parameters. In create(), the normal routine is to create data files and other needed data structures, but in our implementation it just opens the data files already created by the data loader and initializes the context. After executing the command, the data loader is ready to load the incoming data into the new table's data file, and all databases are available for queries on this table.

Every database query engine is now able to see the whole dataset on HDFS. However, it can only execute queries at the table level, so one cannot specify which part of the table a query should process. This manner is not appropriate for MapReduce: the MapReduce framework logically splits the dataset into blocks and assigns each Map instance a block to process. To fit into this paradigm, the database must be able to process at the block level rather than the table level. Here we make use of a pseudo column to achieve this goal. Besides the columns for the data, we introduce an additional pseudo column blk, which exists in the metadata of the database but is not actually stored. This column is used to pass parameters about the position of the data block that needs processing. With the data block determined, the Map instance adds a predicate on the pseudo column to the where clause of the SQL query. When this modified query is executed, the position constants are sent to the storage engine, so only the tuples in the indicated storage range are read in. During this process, the query engine works as before, without perceiving this matter.

Till now, the database engines are well integrated into the Hadoop framework.

C. Global Index Mechanism

One of the useful auxiliary data structures in a DBMS is the index. For certain queries, index-assisted execution can improve efficiency. For example, with a predicate of high selectivity, the tuples qualifying for the query are only a small part of the whole; a brute-force scan of the whole dataset would waste too much energy. If there is an index on the predicate attribute, using the index to directly retrieve the qualified tuples may save a lot.

In a parallel DBMS, there are two kinds of indexes in terms of locality. A local index resides on one node and only searches the local dataset on that node; a global index has references to the data across the whole cluster, and the global index itself usually distributes across all the nodes. When using a local index, all nodes must be launched, because all the local indexes search the same value domain. The control overhead, such as that of setting up and cleaning up sub-tasks, occupies a big part of the total running time. Comparatively, when using a global index, only the nodes possessing the qualified index entries need to be started. However, global index access incurs global communication, which should not be neglected.

HadoopDB consists of single-machine DBMSs that are unable to make use of a global index, so it supports only the local index mechanism.

In our integrated system, the data are replicated and distributed by HDFS, so the DBMS layer has no knowledge about the locality of the data, which means a local index does not make sense. We choose to implement the global index mechanism. The index mechanism must take the MapReduce execution style into consideration so that index access can be parallelized in this framework.

1) Index Creation

The indexer is responsible for building a B+-tree index on the dataset, and the index file is stored in HDFS, so it can be accessed by all the nodes. Because there will only be read requests, the index does not need to support update operations, so the entries in the B+-tree nodes are densely packed, leaving no free space for later inserts. We create the B+-tree index as follows: first, sort the (value_on_index_attribute, offset) pairs from all the records and write them sequentially into the index file, which directly forms the leaf nodes of the tree; then scan the leaf nodes, create all the intermediate nodes and the root node in a bottom-up fashion, and append them to the index file. A traditional B+-tree index may be created through insert operations in an online mode, which leaves the leaf nodes not physically contiguous in the file but connected by pointers. Our approach guarantees that the leaf nodes occupy a contiguous range of space in the index file, which facilitates parallel access to these leaf nodes during a query. The structure of the index is illustrated in Fig. 4.

Figure 4. Leaf nodes occupy continuous space in the index file and Maps work on selected leaf node data blocks.
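The bulk-loading steps above (sort the pairs, write the leaf level sequentially, then build and append the upper levels bottom-up) can be sketched as follows. This is a simplified in-memory illustration: list positions stand in for file offsets, and the node layout and fan-out are assumptions, not the system's actual on-disk format.

```python
# Sketch of the index creation scheme: sorted (value, offset) pairs are
# densely packed into leaf nodes written first, then each upper level is
# built bottom-up and appended after the leaves, so the leaf level
# occupies one contiguous range of the index "file".

FANOUT = 4  # entries per node; real nodes would be sized to the 32KB page

def first_key(node):
    """Smallest key covered by a node (the separator key for its parent)."""
    return node["keys"][0][0] if node["leaf"] else node["keys"][0]

def bulk_build(pairs):
    """Build the index; returns (nodes, root_position), leaves first."""
    entries = sorted(pairs)          # (value_on_index_attribute, offset)
    nodes, level = [], []
    # 1) Dense-pack the leaf level into contiguous positions 0..n-1.
    for i in range(0, len(entries), FANOUT):
        level.append(len(nodes))
        nodes.append({"keys": entries[i:i + FANOUT], "leaf": True})
    # 2) Build internal levels bottom-up, appending after the leaves.
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), FANOUT):
            children = level[i:i + FANOUT]
            next_level.append(len(nodes))
            nodes.append({"keys": [first_key(nodes[c]) for c in children],
                          "children": children, "leaf": False})
        level = next_level
    return nodes, level[0]

# Ten records indexed on one attribute; the offsets are fabricated.
pairs = [(v, 100 + v) for v in (7, 1, 5, 3, 9, 2, 8, 4, 6, 0)]
nodes, root = bulk_build(pairs)
# The three leaves sit contiguously at positions 0..2; the root comes last.
```

Because the leaf level is contiguous, a range predicate maps to a contiguous span of leaf positions, which is what allows Maps to be attached to selected leaf-node blocks as in Fig. 4.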
2) Index Access

To support the index access method, we add another pseudo column, idx, to the table schema, and modify the storage engine implementation accordingly. In the case of index access, the idx and blk attributes are used together to give the indication to the storage engine.

When a query carries a predicate of high selectivity on the indexed attribute, the index access method is chosen. Before starting the MapReduce tasks, several traversals through the index are made to locate the start and end positions in the leaf nodes for each predicate value. The index entries in the leaf nodes between the start and end positions are those pointing to the records satisfying the predicate. If the predicate is a range, only two traversals are needed. Because the height of the tree is usually very low, this process takes little time compared to the later processing using MapReduce.

According to the start and end positions, Maps are generated attached to the leaf node pages in the selected ranges, as shown in Fig. 4. Each Map adds predicates on blk and idx to the where clause of the SQL query. The idx parameter specifies the index to be used, and the blk parameters now indicate the start and end offsets into the leaf nodes. During the execution of the sub-query, the storage engine scans the index entries in the selected range of index leaf nodes, and retrieves each tuple using the offset in the index entry. The other phases of the execution are the same as in the full scan case.

In this execution mode, the number of Maps is related to the selectivity of the predicate. Using the index, a minimal number of Maps are generated and only the qualified records are read in.

D. Summary

We have proposed a new system architecture integrating modified database engines as a read-only execution layer into Hadoop. Data replication is handled by HDFS naturally. The modified database engine is able to process the data from HDFS files at the block level, so it fits very well into MapReduce. The global index access mechanism is added according to the MapReduce paradigm. The loading speed can also be guaranteed using HDFS. This integrated system satisfies our application very well. The essential difference from HadoopDB is that we construct our system on a Hadoop basis, rather than a DBMS basis: the DBMS provides us the efficient operators, while the management of the data is handled by other Hadoop-based components.

V. EXPERIMENTS

A. Configurations

The experiments are conducted on a cluster consisting of 15 nodes connected by gigabit Ethernet, which is part of a public computing platform. Each node has two dual-core AMD Opteron™ 275 processors, 8GB DRAM, and a 136GB SCSI disk. The operating system kernel is Linux 2.6.9-4.2.ELsmp x86_64. The bandwidth of local file system sequential I/O is about 60MB/s. Hadoop 0.19.2 is set up on the cluster, and one MySQL server of version 5.0 runs on each individual node.

The benchmark is from our application. Although from a specific domain, the data schema and operations are very common to many other applications. The data schema is a table with 9 integer columns: time, systemID, deviceID, eventType, port, inBytes, outBytes, inPackets, and outPackets. time is the number of seconds since the Epoch; it starts from 1235750430 in the benchmark, and increases by 1 every 131072 records. systemID is uniformly distributed in the integer range [1, 15]. deviceID and eventType are uniformly distributed in the integer ranges [1, 50] and [100000000, 100000014], respectively. port, inBytes, outBytes, inPackets, and outPackets are all uniformly distributed in the integer range [0, 65535]. The whole dataset has about 471,859,200 records over a time range of one hour.

The queries we use are

SELECT truncate(time/60,0), systemID, deviceID, eventType,
  sum(inBytes), sum(outBytes), sum(inPackets), sum(outPackets)
FROM table
[WHERE port IN (port_list)]
GROUP BY truncate(time/60, 0), systemID, deviceID, eventType

where the where clause may vary in different experiments. This query groups and summarizes the data by the systemID, deviceID, and eventType attributes at minute granularity, with a predicate on the port attribute.

We compare three systems: Hadoop, a HadoopDB-like system (HadoopDB-L for short) implemented on top of MySQL, and our database engine integrated Hadoop system (DBEHadoop for short). The data in Hadoop are in text format with columns separated by space characters, and the whole dataset in the benchmark occupies about 25GB. The data in HadoopDB-L and DBEHadoop are in paged binary format with each attribute value occupying four bytes, and the whole dataset occupies about 15GB. All systems are configured with 2 replicas at 64MB block granularity for the data, so the actual storage space for the dataset doubles. The data of HadoopDB-L are manually replicated across all nodes, and each 64MB block of data is simulated using a separate table. For each block (table), there is a local index on the port attribute. In DBEHadoop, a global index on port is created. Because we are on a public platform, we set the maximum number of parallel Maps per node to 3, which is less than the number of cores in a node.

B. Query without Predicate

The first experiment is on the query without a where clause, so all the data need to be scanned. The result set contains 2,287,500 records. The system buffer is cleaned before the execution.

Fig. 5 shows the running time for the three systems to execute this query. The execution time of Hadoop is much longer than that of HadoopDB-L and DBEHadoop. There are two factors affecting the performance of Hadoop: the first is the raw size of the data file, and the second is the CPU efficiency during the execution. The raw data file in Hadoop is larger than in the two other systems, so the
23
8. disk I/O is larger in Hadoop. In addition, larger data file 160 Hadoop HadoopDB-L DBEHadoop
running time in seconds
needs more Maps to process, which incurs more control
overhead. In terms of CPU efficiency, due to Java language, 120
Hadoop is less I/O bound when processing large amount of
records with more columns like the one in our case, 80
especially in text format. For this query, Hadoop takes
much more CPU time.
40
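Much of this CPU gap comes from record decoding. The following sketch illustrates why the space-separated text format costs more CPU per record than the 4-byte-per-value paged binary format; the five-integer schema and the sample values are hypothetical, not the actual benchmark code.

```python
import struct

# Hypothetical five-integer record; the benchmark schema is not spelled
# out in the text, so the column count and values are illustrative only.
NUM_COLS = 5
ROW_FMT = "<" + "i" * NUM_COLS       # five 4-byte values, as in the paged format
ROW_SIZE = struct.calcsize(ROW_FMT)  # 20 bytes per record

def parse_text_record(line):
    # Text format: space-separated columns; every value is re-parsed
    # from characters into an integer, which is CPU-heavy per record.
    return tuple(int(tok) for tok in line.split())

def parse_binary_record(buf, offset=0):
    # Paged binary format: fixed 4-byte fields decoded with a single
    # fixed-layout unpack, with far less per-record CPU work.
    return struct.unpack_from(ROW_FMT, buf, offset)

# The same logical record in both formats decodes identically:
text_row = "1 80 1024 7 42"
binary_row = struct.pack(ROW_FMT, 1, 80, 1024, 7, 42)
assert parse_text_record(text_row) == parse_binary_record(binary_row)
```

The fixed record width also lets the engine address the n-th record of a page by offset arithmetic alone, which is what makes block-level processing by the database engine cheap.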
Figure 5. Running time for the query without predicate. (Bar chart of running time in seconds for Hadoop, HadoopDB-L, and DBEHadoop.)

The performance of HadoopDB-L and DBEHadoop is almost the same. They have similar data formats, so the data sizes are similar. Reading data through HDFS does not cause too much overhead for DBEHadoop because of the paged sequential I/O, which makes the two systems essentially the same in query execution. Using the database for the underlying execution improves CPU efficiency, so these systems are more I/O bound compared to Hadoop.

C. Query with Predicate
We now add a predicate to the query. The query processes only the records whose port value appears in the port_list. When a predicate of high selectivity exists, there is a chance to use an index to accelerate the execution, so we conduct this set of experiments under a high-selectivity predicate, where an index makes sense. We select 1, 10 and 20 ports respectively and repeat the experiment. Because the value of port is uniformly distributed, the number of selected ports rather than the specific values determines the performance. The system buffer is cleaned before execution.
Fig. 6 shows the result for the three systems using full scan execution, which scans all the data and applies the predicate to qualify the records. The time is shorter than that without the predicate for each system, because fewer records need processing after the predicate is applied. Full scan has the same performance in all cases for each specific system, because with only a small number of qualified records, the I/O and control costs actually dominate the total time. Hadoop takes more time than the other systems mainly because of the larger data size (more I/O and Map instances). HadoopDB-L and DBEHadoop have similar performance.

Figure 6. Running time for full scan execution. (Running time in seconds versus the number of ports selected: 1, 10, 20.)

Next we compare the performance of HadoopDB-L and DBEHadoop with index assistance. Fig. 7 shows the result. For the case of one port, index access is better than full scan, because random reads of a very small number of records outperform a sequential scan of the large dataset. For HadoopDB-L, index access is only a little better than full scan; this is because in both modes each data block needs a Map with actually little work to do due to the high selectivity, so control overhead takes a considerable part of the total running time. DBEHadoop is much more efficient than HadoopDB-L because of the very small number of Maps needed. For the case of 10 ports, the performance of index access is almost the same as full scan execution, which indicates that index access will not be superior beyond this point. For HadoopDB-L, the cost of random reads offsets the benefit of the small I/O volume. For DBEHadoop, random reads and communication cost together offset the benefit. For the case of 20 ports, the DBEHadoop index access method is much more expensive than the full scan method. Compared to a local index, the cost of global index access increases more evidently due to the network communication in addition to the random disk I/O, which offsets the benefit of the small number of Maps. However, under these conditions the full scan method will be chosen.

Figure 7. Full scan vs. index access. (Running time in seconds versus the number of ports selected, for HadoopDB-L and DBEHadoop in full scan and index access modes.)
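The crossover in Fig. 7 suggests a simple cost-based choice between the two execution modes. The sketch below is purely illustrative: the paper reports the trade-off only qualitatively, so the per-Map overhead, random-read cost, and network lookup cost are assumed constants, not measurements.

```python
# Illustrative cost comparison between full scan and global index
# access. All constants are assumptions for the sketch; only BLOCK_MB
# and SEQ_MB_PER_S come from the experimental setup described above.

BLOCK_MB = 64                 # HDFS block size used in the experiments
SEQ_MB_PER_S = 60.0           # sequential I/O bandwidth reported per node
MAP_OVERHEAD_S = 0.5          # assumed startup/control cost per Map task
RANDOM_READ_S = 0.010         # assumed cost of one random page read
NET_LOOKUP_S = 0.002          # assumed global-index lookup (network) cost

def full_scan_cost(total_blocks):
    # Every block is scanned sequentially by its own Map.
    return total_blocks * (BLOCK_MB / SEQ_MB_PER_S + MAP_OVERHEAD_S)

def index_access_cost(matching_records, blocks_with_matches):
    # Only blocks containing matches get a Map; each matching record
    # costs a global-index lookup plus a random page read.
    return (blocks_with_matches * MAP_OVERHEAD_S
            + matching_records * (NET_LOOKUP_S + RANDOM_READ_S))

def choose_plan(matching_records, blocks_with_matches, total_blocks):
    if index_access_cost(matching_records, blocks_with_matches) < full_scan_cost(total_blocks):
        return "index access"
    return "full scan"

# The ~15GB binary dataset at 64MB per block is roughly 240 blocks.
assert choose_plan(1_000, 5, 240) == "index access"    # highly selective
assert choose_plan(500_000, 240, 240) == "full scan"   # low selectivity
```

Under these assumed constants the model reproduces the observed behavior: with few matching ports the small number of Maps wins, while at lower selectivity the random reads and network lookups overwhelm the savings and full scan is preferred.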
D. Execution under Warm Buffer
In many cases, the same set of data may be queried by different users, so it is necessary to evaluate the performance under a warm buffer. The previous experiments
are repeated in the same way, except that the data are already in the system buffer cache.
Fig. 8 shows the result for full scan execution. Without the predicate, the time for Hadoop is similar to that under a cold buffer, due to its CPU-bound behavior. HadoopDB-L and DBEHadoop are faster than in the cold buffer case. With a predicate of high selectivity, all three systems take much shorter time than in the cold buffer case, which is a result of saving on I/O cost. The difference in performance between Hadoop and the other two systems becomes smaller, because the reduced amount of work involved hides the inefficiency of Hadoop to some degree. HadoopDB-L and DBEHadoop have similar performance in all cases.

Figure 8. Running time for full scan execution under warm buffer. (Running time in seconds for Hadoop, HadoopDB-L, and DBEHadoop, for the scan over all data and for 1, 10, and 20 ports selected.)

Fig. 9 gives the result for index access execution. Unlike the cold buffer case, the DBEHadoop index access method achieves performance equal to the full scan method when 20 ports are selected, while the HadoopDB-L index access method is almost the same as the full scan method in all cases. Without I/O cost, the small number of Maps is the main contributor to the improvement in performance.

Figure 9. Running time for index access execution under warm buffer. (Running time in seconds versus the number of ports selected, for full scan (identical for HadoopDB-L and DBEHadoop), HadoopDB-L index access, and DBEHadoop index access.)

E. Data Loading
In the cluster, we start several loaders, one on each node, to test the streaming data loading speed. The loader just generates the data itself and loads it into the system. Done this way, the test ignores the overhead of communication between the data source and the loader, and thus evaluates the raw loading speed of the underlying system. For HadoopDB-L, the loader loads the records into the local MySQL server through prepared batch inserts using JDBC, and no replication is considered. For DBEHadoop, the loader loads records through routines that format the data into pages and write them onto HDFS, and replication of degree 2 is automatically maintained by HDFS.
Fig. 10 gives the result, from which we can see that DBEHadoop is much faster than HadoopDB-L. HadoopDB-L must load the data through DBMS logic, so the speed is very low, especially in the online mode for streaming data. Although only the MySQL server is used here, other DBMSs are, to our knowledge, in the same order of magnitude. DBEHadoop preserves the high loading speed of Hadoop through direct writing to HDFS. The replication mechanism causes extra overhead, and when multiple loaders work in parallel, the overhead increases due to increased network traffic. However, the loading speed is fast enough for common applications.

Figure 10. Data loading speed. (Loading speed in MB/s versus the number of loaders, 1 to 15, for HadoopDB-L and DBEHadoop.)

F. Summary
Through these experiments, we show that DBEHadoop is as efficient as HadoopDB-L for full scan queries, and that for queries with a predicate of high selectivity, the global index access mechanism adopted in DBEHadoop is much more efficient than that of HadoopDB-L. For data loading, DBEHadoop achieves very good performance, far better than HadoopDB-L.

VI. CONCLUSION
Neither Hadoop nor DBMSs are ideal for large dataset analysis. HadoopDB, as an integrated system merging techniques from both, is promising, but it is still limited for reasons that are difficult to overcome. We believe that it is the way in which HadoopDB is constructed that makes it hard to satisfy the emerging applications. Taking a Hadoop basis, rather than a DBMS basis, and incorporating DBMS techniques is the right way to construct systems for large-scale data processing applications.
We propose a new system architecture integrating modified DBMS engines as a read-only execution layer into Hadoop, where the DBMS plays the role of providing efficient operators instead of managing the data. Besides sharing the advantages of HadoopDB, our system overcomes the limitations of HadoopDB in real scenarios. The HDFS-based storage solves the fault tolerance problem in the data layer. The modified database engine is able to process data from an HDFS file at the block level, so it fits very well into MapReduce. A global index access mechanism adapted to the MapReduce paradigm is added and shows better performance than HadoopDB for certain queries. The proposed system preserves the advantage of Hadoop in data loading speed, which is far better than that of HadoopDB. All these properties make the system more appropriate for large-scale dataset analysis applications.

ACKNOWLEDGMENT
We would like to thank the anonymous reviewers for their valuable feedback on this work. This research is supported by the National Natural Science Foundation of China (Grant No. 60903047).