The 11th International Conference on Parallel and Distributed Computing, Applications and Technologies



Integrating DBMSs as a Read-Only Execution Layer into Hadoop

Mingyuan An, Yang Wang
Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences
Institute of Computing Technology, Chinese Academy of Sciences
Graduate University of Chinese Academy of Sciences
Beijing, China
{anmingyuan, aaron}@ncic.ac.cn

Weiping Wang, Ninghui Sun
Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences
Institute of Computing Technology, Chinese Academy of Sciences
Beijing, China
{wpwang, snh}@ncic.ac.cn

ABSTRACT—To obtain the efficiency of a DBMS, HadoopDB combines Hadoop and DBMSs, and claims superiority over Hadoop in terms of performance. However, the approach of HadoopDB simply puts MapReduce on top of unmodified single-machine DBMSs, which has several obvious weaknesses. In essence, HadoopDB is a parallel DBMS with fault tolerance, and it incurs unnecessary overhead due to the DBMS legacy. Instead of augmenting a DBMS with Hadoop techniques, we propose a new system architecture that integrates modified DBMS engines as a read-only execution layer into Hadoop, where the DBMS plays the role of providing efficient read-only operators rather than managing the data. Besides the efficiency obtained from the DBMS engine, there are other advantages. The modified DBMS engine is able to directly process data from HDFS (Hadoop Distributed File System) files at the block level, which means that data replication is handled by HDFS naturally and block-level parallelism is easily achieved. A global index access mechanism is added according to the MapReduce paradigm. The data loading speed is also guaranteed by writing the data directly into HDFS with simplified logic. Experiments show that our system outperforms both the original Hadoop and a HadoopDB-style system.

Keywords-Hadoop, database, large-scale data processing, global index access

I. INTRODUCTION

Google File System (GFS) [1] and MapReduce [2] were developed (or popularized) by Google for large-scale dataset storage and processing. GFS is a distributed file system optimized for large sequential read operations, and it provides a fault tolerance mechanism in the data layer. MapReduce is a programming paradigm for parallel processing. Using MapReduce, the user can easily express an application task without dealing with the complexity of detailed parallel execution. The MapReduce runtime system parallelizes and schedules the job to make full use of the available resources in a large-scale parallel system, while providing a fault tolerance mechanism in the execution layer.

Because of their fault tolerance, high scalability, and ease of use, the techniques underlying MapReduce and GFS are very attractive for large-scale data processing applications. Hadoop [3], an open source system implementing these techniques, has greatly pushed their popularity, and many systems have been constructed on top of Hadoop.

The loose constraints on data schema and execution style in Hadoop give the user maximum flexibility: the user can implement the upper application in an unrestricted way. However, as a very thin layer with only basic mechanisms and functionality, Hadoop is not efficient enough when directly facing the user [4][5][6]. In common cases, it lacks many performance-critical optimizations such as compact data representation, helper structures, etc.

A database management system, by comparison, has an optimized implementation that improves efficiency greatly. It has read-optimized storage formats, sophisticated query execution, different kinds of indexes, and data or query caches that better exploit the semantics of the application. However, the lack of fault tolerance, as one of the most important reasons, makes the DBMS incompetent for large-scale data processing applications.

HadoopDB [7] puts a middleware layer between Hadoop and DBMSs, and so gets fault tolerance from Hadoop. HadoopDB makes itself a parallel DBMS with fault tolerance, claiming the ability to support large-scale data processing applications. But this method simply takes complete DBMSs as the underlying storage and execution units, which has some problems.

First, with respect to fault tolerance, although HadoopDB can take advantage of MapReduce to achieve it in the execution layer, replication in the data layer is not fully implemented. In the experiments conducted in the HadoopDB project, the data replicas are maintained manually: before starting the benchmark, the data are first split into chunks and replicated onto the nodes in batch mode using scripts. This approach obviously does not support online loading. Without fault tolerance in the data layer, the system will still suffer from failures, and so will not be a truly scalable system in a large-scale environment. Implementing a replication mechanism in the middleware on top of the DBMSs would require great effort, amounting to a major project in the fault tolerance domain. All of this makes HadoopDB hard to use in practice. Actually, HadoopDB is just a prototype focusing on testing query execution performance with data replicas prepared in advance, rather than a complete system architecture solution.
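The MapReduce paradigm described above can be illustrated with a minimal, single-process sketch (our own illustration, not code from any of the cited systems): the user supplies only a map function and a reduce function, and the framework handles applying them and grouping intermediate results by key.

```python
from itertools import groupby
from operator import itemgetter

# A minimal, single-process sketch of the MapReduce paradigm.
# Illustrative only: a real Hadoop job runs these phases in
# parallel across a cluster, with fault tolerance and scheduling.

def map_phase(records, map_fn):
    # Apply the user's map function to every input record.
    for record in records:
        yield from map_fn(record)

def reduce_phase(pairs, reduce_fn):
    # "Shuffle": group intermediate pairs by key, then reduce each group.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, reduce_fn(key, [v for _, v in group])

# User code: the classic word count expressed as map and reduce.
def wc_map(line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

lines = ["hadoop hdfs hadoop", "dbms hadoop"]
result = dict(reduce_phase(map_phase(lines, wc_map), wc_reduce))
# result == {"dbms": 1, "hadoop": 3, "hdfs": 1}
```

The point of the paradigm is visible in the sketch: `wc_map` and `wc_reduce` contain no parallelism, scheduling, or fault handling; all of that belongs to the runtime.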

978-0-7695-4287-4/10 $26.00 © 2010 IEEE
DOI 10.1109/PDCAT.2010.43
Second, the underlying single-machine DBMSs are unable to use a global index residing outside each single system, so they may not be optimal in terms of performance for some kinds of queries. This might be handled in the middleware in some way, but that is still a detour.

Third, the data loading speed of a DBMS is slow due to very strict constraints on data schema and semantics. The data usually need to go through complex logic before being stored, which is not necessary for many applications.

The construction of HadoopDB starts from a DBMS basis, and so incurs inevitable limitations due to the DBMS legacy. We believe that Hadoop has already done a lot to make itself a competent system for large-scale data processing applications, so the right way is to start from a Hadoop basis and borrow DBMS techniques when appropriate.

In this paper, we propose our approach of integrating DBMSs as a read-only execution layer into Hadoop. Based on Hadoop, we incorporate modified DBMS engines that are augmented with a customized storage engine capable of directly accessing data from HDFS and making use of a global index access method. In this architecture, the DBMS plays the role of providing efficient read-only operators instead of managing the data. The following benefits are obtained:
(1) With the data placed on HDFS, the fault tolerance problem in the data layer is solved naturally.
(2) The DBMS engine executes the sub-queries with the same efficiency advantage as in HadoopDB. Moreover, based on HDFS and MapReduce, a global index mechanism can be put into action with the DBMS engines, which significantly improves the performance of certain queries.
(3) As a read-only layer, the DBMS is not responsible for data loading; instead, the data are loaded through a loader outside the DBMS, or the user can write the data directly to HDFS in a predefined manner. This accelerates data loading to the raw speed of HDFS writing, while keeping convenience and flexibility for the user.

The remainder of the paper is organized as follows: Section II introduces related work; Section III analyzes the application requirements and existing systems, and then positions the desired system; Section IV describes our proposed system; Section V gives experimental results; Section VI concludes the paper.

II. RELATED WORK

With the wide deployment of Hadoop in the data analysis field, two types of work have appeared to improve Hadoop-based systems. The first is for usability. Despite the flexibility, for many users the low-level language of MapReduce is somewhat inconvenient compared to a higher-level language such as the SQL of a relational DBMS. So some systems on top of Hadoop have been developed. Facebook's Hive [8] and Yahoo's Pig [9] are examples of this kind. They provide simple declarative languages capable of expressing complex ad-hoc queries on structured data. Some other higher-level languages or system-level products have also been developed on top of either MapReduce or similar techniques [10][11]. These works mainly deal with the language problem.

The second is for efficiency. Some works improve the kernel of Hadoop [12][13][14][15]. Other works improve the system by external mechanisms [7][16]. The work on integrating DBMS and Hadoop comes from the observations in [5], which point out that the brute-force style of MapReduce is not optimal in terms of efficiency, and that something should be done to merge the techniques of MapReduce and DBMS. HadoopDB [7] is the first system, as far as we know, to try to merge the two. Although it pioneered the idea, HadoopDB still has limitations that will bottleneck applications in real scenarios.

III. SYSTEMS FOR LARGE-SCALE DATA ANALYSIS

In this section we first introduce the requirements of our large dataset analysis application. We believe that these requirements are typical of many other applications. Based on these requirements, we revisit existing systems for data processing. Finally, in terms of merging the techniques of both Hadoop and DBMS, we determine the position where our system should stand.

A. Application Requirements

Our case is a network security application. Some monitors keep watching the whole network and generate sampling records for captured events. These generated data stream into the analytic platform in real time. Once stored in the system, the data should be available for ad-hoc read-only queries. This analytic platform is the one we focus on.

1) Large-scale Parallel Processing
Large-scale parallel processing power is a basic requirement for all large dataset analysis applications. Different from the point operations of the key-value model in web services, analysis work usually needs to access a big part of the whole dataset even for a single query, and so must resort to large-scale parallel processing for huge raw power. In such an environment, automatic mechanisms are critical. Automatic parallelization, scheduling, and fault handling liberate the user from heavy programming and maintenance work. All of these play an important role in guaranteeing the scalability of the system.

2) High Efficiency
Efficiency is a necessary concern because each single query takes up so many resources, and higher efficiency can lower the cost significantly.

Besides the common reasons, there is a special one in our case. In many other applications, the data have a short life cycle: the data are loaded into the system in batch mode, then some almost fixed queries are run on the data, and after that the data are removed or offloaded to an offline system. In such conditions, organizing the data into some sophisticated structure is not worthwhile given the extra maintenance cost and the low utility. Sometimes the tasks on the data are simply to generate statistical reports as timed jobs, for example only running during the night, so it


may be all right even if the execution takes a less efficient way. These applications can be seen as data processing rather than data analysis.

By comparison, our application deals with ad-hoc analytic queries over a long-existing dataset. It is worthwhile to adopt optimized data structures and execution mechanisms to improve query efficiency. For example, the queries often carry predicates on some attributes, for which using an index can reduce the execution cost in certain cases. Although index maintenance has extra cost, this will be amortized by repeated usage. Ad-hoc queries also make a cache usable: queries on the same set of data may recur, so the cache will make sense. Using an index also poses a demand for an index data cache. After all, the long life cycle of the data and the ad-hoc queries justify the effort spent on optimization.

3) Continuous High Speed Data Loading
The data in our application stream in continuously at relatively high speed. This requires the system to be capable of loading data at high speed in an online mode. The data can be stored in an append-only manner, and once stored there will be no updates on them. High speed online loading requires that the logic on the loading path be simple enough; unnecessarily strict semantics checking should be avoided.

B. Existing Systems Reconsidered

In response to the requirements of the application, we consider three types of systems: database management systems, Hadoop, and HadoopDB. If we treat the techniques of DBMS and Hadoop as two extremes, there is a broad spectrum between them. To satisfy the requirements of the application, it is right to draw strength from both. HadoopDB is actually a DBMS equipped with some Hadoop techniques. However, we believe that our system should start from the other side.

1) Existing Systems
DBMS: The DBMS has a long history in data management. The irreplaceable domain of the DBMS is transaction processing, where the ACID properties must be guaranteed. As to data analysis, DBMSs also hold an important position. Parallel DBMSs are very popular in the data warehouse market. In this domain, a parallel DBMS provides a high degree of parallelism and achieves good performance for analytic queries.

For parallel processing, a parallel DBMS only competes at a limited scale. The serious problem with DBMSs is that most of them are not fault tolerant: if something goes wrong during the execution of a query, the query has to restart entirely. In a large-scale system consisting of thousands of components, a long-running query will never succeed considering the high failure rate. While parallel DBMSs are competent for data warehousing, they are not suitable for large-scale applications in this respect. The largest DBMS-based analytic system we know of consists of only 100 machines.

For efficiency, the DBMS embodies decades of academic and industrial research. Optimizations have been applied to every layer and component of the DBMS, such as optimized data storage formats, diverse access methods, sophisticated query execution, efficient data caches, etc. Many of these techniques have been widely copied and reinvented by other systems [17]. This is the advantage of the DBMS, and it is desirable for large dataset analysis.

For data loading, there is a limitation in DBMS-based systems. Due to strict constraints such as the ACID properties, the system cannot load data in an efficient way, especially for online loading. The online loading speed of any DBMS node is lower than 10MB/s, as far as we know. The systems in data warehouse applications are often offline systems with very weak online loading requirements, where the common case is to load data in batch at regular time points.

Hadoop: Many applications replace the DBMS with Hadoop for a couple of reasons. One of the most important is that Hadoop is scalable due to its fault tolerance. In addition, Hadoop is easier to deploy and use. For data processing tasks, MapReduce provides a very simple and flexible parallel programming paradigm, and is able to express complex queries.

As to parallel processing, Hadoop is born for it. The MapReduce run-time system has the full ability to parallelize the whole processing in large-scale systems. Both MapReduce and HDFS are completely fault tolerant, making the whole system highly scalable. This is one of the most important properties of Hadoop, and is critical for large dataset analysis. In addition, the block-level replication of HDFS gives MapReduce great opportunities for a high degree of parallelism and fine-grained execution fault tolerance, which improves performance significantly.

For efficiency, Hadoop seems to have a long way to go. MapReduce works in a brute-force way: whatever the query is, it has to scan all the data without helper structures such as indexes. The scan and processing are not guaranteed to be efficient, because implementing the details is the user's work. The data are often stored on HDFS in text format, which is straightforward for the user but not compact. In these respects, Hadoop is not as good as the DBMS, at least for now.

As to data loading, writing data directly to HDFS can be guaranteed an acceptable speed. For many large-scale data analysis applications, including ours, weak consistency is enough. The data can go through without complex logic, so writing structured data can achieve the same speed as writing an unstructured byte stream. Although the speed is lowered when replicas are stored, this is an inevitable tradeoff with fault tolerance on the data.

HadoopDB: DBMS and Hadoop each have their own superiority in the appropriate domain. To better satisfy the emerging applications, the DBMS must incorporate fault tolerance in order to be scalable, while Hadoop should borrow techniques from the DBMS to improve efficiency. HadoopDB is constructed based on this idea. DBMSs are taken as the storage and execution units, and the MapReduce mechanism takes responsibility for parallelization and fault


tolerance on top of the underlying DBMSs. Fig. 1 shows the architecture of HadoopDB briefly. HDFS is used to store the system metadata and the result set of the query. All source data are stored in the DBMSs. When executing a query, Maps are scheduled to the nodes according to the metadata, which tells the location of each block of the data. Maps issue SQL queries to the underlying DBMSs and emit the result records to Reduces. Reduces aggregate result sets from multiple nodes, and write the final results onto HDFS.

Figure 1. Architecture of HadoopDB.

For parallel processing, HadoopDB is only partially fault tolerant. MapReduce only guarantees fault tolerance in the execution layer. The data are stored in DBMSs rather than HDFS, so the availability of the data has to be specially handled. A common DBMS has no special concern for fine-grained data replication for intra-query parallelization, without which the MapReduce framework cannot completely exploit the parallelism. HadoopDB uses a batched approach to dump the data out of the database and replicate them to other nodes, which does not support online loading at all. This functionality could be achieved in the middleware, but implementing fine-grained replication on top of DBMS tables needs non-trivial work, which makes it hard to use in practice.

For efficiency, the advantage of the DBMS is reflected in the integrated system, because each query is translated into sub-queries actually executed by individual DBMS query engines. Despite the improvement, there is still one limitation on global structure mechanisms. Because all DBMSs in HadoopDB are unmodified single-machine versions, this layer cannot make use of any global structures, such as a global index, which makes sense when the query has predicates of high selectivity.

Finally comes the data loading requirement. In HadoopDB, the DBMS-based storage obviously inherits the limitation. The data have to be loaded through the complete DBMS logic, so it is difficult to improve the loading speed.

2) Discussion
We have given an analysis of the two typical systems and the integrated HadoopDB. The lack of fault tolerance eliminates traditional parallel DBMSs from the candidates for large-scale data processing applications. However, HadoopDB, as a fault tolerant parallel DBMS in essence, becomes a promising representative from the DBMS camp, which seems capable of supporting this kind of application.

HadoopDB merges techniques from both DBMS and Hadoop, but it can hardly be used for these applications due to the DBMS legacy and the further implementation work required. So we must identify two different ways of merging DBMS and Hadoop techniques, which can help position the desired system. The difference between the two ways lies in the starting point for constructing an integrated system: one starts from the DBMS, the other from Hadoop.

HadoopDB tries to introduce fault tolerance and fine-grained parallelism into the parallel DBMS, and so belongs to the first category. While high efficiency is retained, all the strict constraints of the DBMS are also inherited. HadoopDB is suited for applications where the strict schema and semantics of the data are given high priority, so it is capable of dealing with traditional database applications. After all, HadoopDB means Hadoop database, not database Hadoop.

The system we need should go from the other side. Hadoop satisfies the majority of our needs except for efficiency, so we should integrate DBMS techniques into a Hadoop-based system, rather than the reverse. MapReduce and HDFS were both developed for large-scale data processing applications, and it is only the efficiency that needs special concern. Hence, we position our system closer to the Hadoop side, while actively incorporating desired properties from the DBMS.

C. Our Approach

Different from HadoopDB, we take DBMSs as read-only execution components. For a specific query, the dataset on HDFS is split logically into blocks as usual, and each block is assigned an executor, which is now a database execution thread; all intermediate results computed by the database engines are aggregated by Reducers, which are the same as before; final results are written onto HDFS naturally.

Using this approach, the parallel execution is still at block granularity, and fault tolerance is guaranteed in both the data layer and the execution layer.

Efficiency now depends on the DBMS layer. Many techniques such as the data cache, query cache, and optimized operators in the database still take effect. But the data access methods are partially different from before, due to the customized storage engine using HDFS. The index access mechanism has to be reconsidered and adapted to work in this situation.

The data loading process is intuitive. Streaming data can be packed into an optimized format as in a DBMS, and then written directly onto HDFS, bypassing the logic for transactions; this simplified logic does not cost too much.

IV. DBMS ENGINE INTEGRATED HADOOP SYSTEM

In this section, we give a detailed description of the system constructed for our application. We first give an overview of the system architecture, and then focus on the query execution process. Besides the familiar full scan execution, a global index access mechanism in the MapReduce framework is introduced.


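The per-block scheme of Section III.C, in which a query is translated into sub-queries carrying the HDFS position of their block, can be sketched as follows. This is our own illustration under stated assumptions: the table name `events_block`, the SQL template, and the 200MB file size are hypothetical, not from the paper.

```python
# Illustrative sketch: translating one query into per-block
# sub-queries that carry HDFS position parameters (offset, length).

BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size: 64MB

def split_blocks(file_size, block_size=BLOCK_SIZE):
    # Logical splits of an HDFS file: one (offset, length) pair per block.
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

def make_subqueries(sql_template, file_size):
    # Each Map instance receives one sub-query whose position
    # parameters are filled in with its block's split values.
    return [sql_template.format(offset=off, length=length)
            for off, length in split_blocks(file_size)]

# Hypothetical sub-query template over a hypothetical table function.
template = ("SELECT src_ip, COUNT(*) FROM events_block({offset}, {length}) "
            "GROUP BY src_ip")
subqueries = make_subqueries(template, file_size=200 * 1024 * 1024)
# 200MB in 64MB blocks -> 4 sub-queries; the last block is 8MB.
```

Each generated sub-query is self-contained: the MapReduce runtime can schedule it to whichever node holds a replica of that block, which is how block-level parallelism and execution-layer fault tolerance carry over unchanged.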
A. Overview                                                            fits DBMSs into the MapReduce execution framework very
    The system consists of four parts as shown in Fig. 2.              well.
The bottom is the storage layer HDFS. On top of HDFS are                    From the perspective of the whole system, we embed
the database query engines as the executors. The top is                modified database engines into Hadoop rather than just
MapReduce system. The middleware layer contains the                    gluing them together like HadoopDB, which yields a more
data loader, the indexer, etc.                                         coordinated system.
    The data loader stores the incoming data onto storage in           B. Query Execution
a simple way. The data are packed in binary format into the
pages, each of which is the smallest I/O unit of the database              Fig. 3 describes the framework of the query execution.
query engine. The                                                      The query is first translated into sub-queries expressed in
                                                                       SQL. Sub-queries will be executed by each database engine
                                                                       thread. Besides the operations applied, the sub-query also
                                                                       indicates the position information of the data block on
                                                                       which the database engine thread should process. This
                                                                       position information is figured out by the splitting process
                                                                       on the source data file, and is used by MapReduce runtime
                                                                       system to schedule the sub-tasks. Each sub-query is passed
                                                                       to an instance of Map on a specific node, where the
                                                                       position parameters in the sub-query will be set to the
                                                                       according splitting result values.


                     Figure 2. Architecture.

    page size is fixed according to a system parameter, and is
32KB by default. Using a binary format reduces the occupied
disk space compared to the text representation, and so improves
both the loading and query performance. The binary format
is also obeyed by the customized storage engine in the
database when parsing the data into tuples. Note that
although the structured data are in binary format, the CPU
cost of loading is not increased; in fact, loading is more
efficient than with the text format. The data are replicated
automatically by HDFS at block granularity. The default
block size is 64MB, and is configurable. The block size of
HDFS is an integer multiple of the page size for ease of
implementation.
    The indexer can create indexes on the loaded data in
batch mode. Because HDFS is append-only, it would be
complex to build an updatable index in a real-time manner.
In practice, the data are often queried with a time-range
predicate, so creating separate indexes along the time
dimension on a periodic basis is an acceptable solution in
real scenes. We support the B+-tree index for now. The B+-
tree index covers all the data across the cluster, so it is a
global index structure. The index data are also stored in
HDFS, and are visible to each database executor. The
detailed index structure and the index access method in the
MapReduce framework are described in Section IV.C.
    The database executors are actually MySQL server
threads. The MapReduce run-time system schedules a sub-task
(Map instance) to a specific node, on which the sub-task
issues a SQL query to the underlying MySQL server on that
node. We implement a new storage engine for MySQL so
that the query engine can get tuples from HDFS data files.
Some tricks are applied to make the query engine capable
of accessing tuples from a specific block of the HDFS file,
which provides the ability to execute at the block level and
fit into the MapReduce framework.

                     Figure 3. Query execution.

    Each Map instance issues its sub-query in SQL form to
the local database engine thread, and emits the results
returned by the database. The Map instance does not need
to aggregate the local intermediate results, because the
sub-query executed by the database already does this.
    The most critical part of this process is the customized
storage engine that provides the ability to access HDFS
data at the block level.

  1) Customized Storage Engine
    The query engine accesses the data through the storage
engine using a collection of routines in an iterator manner.
init() is invoked first every time the executor wants to
access the data. After that, the get_next() routine is called
repeatedly by the executor; it returns a tuple each time,
and the executor applies the operations on the stream of
tuples. When the query is finished, a close() function is
called, which cleans up the context.
    We store the dataset on HDFS, so the first thing we do
is to implement these routines using the HDFS API. The
implementation is almost the same as that for a local file
system, except that all file system calls are replaced by
their HDFS counterparts. The data are page formatted, and
reading is done on a page basis. The predefined data format
is obeyed when parsing tuples out of a page. A data
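The iterator protocol above can be sketched as follows. This is a minimal Python sketch, not MySQL's actual handler API; the page layout (a 4-byte row count followed by fixed-width 9-integer rows) and the use of a local file in place of the HDFS API are illustrative assumptions:

```python
import struct

PAGE_SIZE = 32 * 1024          # 32KB pages, as in the paper
ROW_FMT = "<9i"                # 9 four-byte integer columns
ROW_SIZE = struct.calcsize(ROW_FMT)
HEADER_FMT = "<i"              # assumed page header: row count
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def write_page(f, rows):
    """Pack rows into one fixed-size page: a row count, then the rows."""
    buf = bytearray(PAGE_SIZE)
    struct.pack_into(HEADER_FMT, buf, 0, len(rows))
    for i, r in enumerate(rows):
        struct.pack_into(ROW_FMT, buf, HEADER_SIZE + i * ROW_SIZE, *r)
    f.write(buf)

class BlockScanner:
    """Iterator-style access to one block of a paged data file.

    init()/get_next()/close() mirror the storage-engine routines;
    start_page/end_page delimit the block assigned to a Map instance.
    """
    def __init__(self, path, start_page, end_page):
        self.path, self.start, self.end = path, start_page, end_page

    def init(self):
        # In the real engine this open would go through the HDFS API.
        self.f = open(self.path, "rb")
        self.page_no = self.start
        self.rows, self.row_idx = [], 0

    def _load_page(self):
        self.f.seek(self.page_no * PAGE_SIZE)
        page = self.f.read(PAGE_SIZE)
        (count,) = struct.unpack_from(HEADER_FMT, page, 0)
        self.rows = [struct.unpack_from(ROW_FMT, page, HEADER_SIZE + i * ROW_SIZE)
                     for i in range(count)]
        self.row_idx = 0
        self.page_no += 1

    def get_next(self):
        # Returns one tuple per call, or None when the block is exhausted.
        if self.row_idx >= len(self.rows):
            if self.page_no >= self.end:
                return None
            self._load_page()
            if not self.rows:
                return None
        t = self.rows[self.row_idx]
        self.row_idx += 1
        return t

    def close(self):
        self.f.close()
```

Because the scanner only ever touches pages in `[start_page, end_page)`, the same table file can be processed by many Map instances in parallel, one block each.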
cache is also implemented in the storage engine, where the
usual LRU eviction algorithm is adopted.
    The schema definition of the HDFS dataset must be
registered in the database, so that the query can be executed
by the database without syntax or semantics exceptions.
This is achieved by the cooperation of the data loader, the
database query engine and the storage engine. When a new
table is to be created, the data loader creates the necessary
data files for this table and issues a 'create table' statement
to all nodes. The database query engine processes this DDL
(Data Definition Language) query and records the
information about the table in the metadata. Then the query
engine calls the create() function of the storage engine with
the necessary parameters. In create(), the normal routine is
to create data files and other needed data structures, but in
our implementation it just opens the data files already
created by the data loader, and initializes the context. After
executing the command, the data loader is ready to load the
incoming data into the new table's data file, and all
databases are available for queries on this table.
    Every database query engine is now able to see the
whole dataset on HDFS. However, it can only execute a
query at the table level, so one cannot specify which part of
the table the query should process. This manner is not
appropriate for MapReduce. The MapReduce framework
logically splits the dataset into blocks, and assigns each
Map instance a block to process. To fit into this paradigm,
the database must be able to process data at the block level
rather than the table level. Here we make use of a pseudo
column to achieve this goal. Besides the columns for the
data, we introduce an additional pseudo column blk, which
exists in the metadata of the database but is not actually
stored. This column is used to pass parameters about the
position of the data block that needs processing. Given the
assigned data block, the Map instance adds a predicate on
the pseudo column to the where clause of the SQL query.
When this modified query is executed, the position
constants are passed to the storage engine, so only the
tuples in the indicated storage range are read in. During this
process, the query engine works as before, unaware of this
mechanism.
    With this, the database engines are well integrated into
the Hadoop framework.

C. Global Index Mechanism
    One of the useful auxiliary data structures in a DBMS is
the index. For certain queries, index-assisted execution can
improve efficiency. For example, with a predicate of high
selectivity, the qualified tuples are only a small part of the
whole dataset. A brute-force scan of the whole dataset
wastes too much work. If there is an index on the predicate
attribute, using it to directly retrieve the qualified tuples
may save a lot.
    In a parallel DBMS, there are two kinds of indexes in
terms of locality. A local index resides on one node and
only searches the local dataset on that node; a global index
has references to data across the whole cluster, and the
global index itself usually distributes across all the nodes.
When using a local index, all nodes must be launched,
because all the local indexes search the same value domain.
The control overhead, such as that of setting up and
cleaning up sub-tasks, then occupies a big part of the total
running time. Comparatively, when using a global index,
only the nodes possessing the qualified index entries need
to be started. However, global index access incurs global
communication which should not be neglected.
    HadoopDB consists of single-machined DBMSs which
are unable to make use of a global index, so it only supports
the local index mechanism.
    In our integrated system, the data are replicated and
distributed by HDFS, so the DBMS layer has no knowledge
about the locality of the data, which means that a local
index does not make sense. We choose to implement the
global index mechanism. The index mechanism must take
the MapReduce execution style into consideration so that
index access can be parallelized within this framework.

  1) Index Creation
    The indexer is responsible for building the B+-tree index
on the dataset, and the index file is stored in HDFS, so it
can be accessed by all the nodes. Because there will only be
read requests, the index does not need to support update
operations, so the entries in the B+-tree nodes are densely
packed, leaving no free space for later inserts. We create the
B+-tree index as follows: first, sort the
(value_on_index_attribute, offset) pairs from all the records
and write them sequentially into the index file, which
directly forms the leaf nodes of the tree; then scan the leaf
nodes, create all the intermediate nodes and the root node
in a bottom-up fashion, and append them to the index file.
A traditional B+-tree index may be created through insert
operations in an online mode, which leaves the leaf nodes
not physically contiguous in the file, only connected by
pointers. Our approach guarantees that the leaf nodes
occupy a contiguous range of the index file, which
facilitates parallel access to these leaf nodes during a query.
The structure of the index is illustrated in Fig. 4.

Figure 4. Leaf nodes occupy continuous space in the index file and Maps
work on selected leaf node data blocks.
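The bulk-load procedure above can be sketched as follows. This is a simplified Python sketch under illustrative assumptions (in-memory sort, a tiny fixed fan-out, nodes modeled as plain lists in a list standing in for the index file); the real indexer writes fixed-size pages into HDFS:

```python
import bisect

FANOUT = 4  # entries per tree node; tiny here for illustration

def bulk_load(pairs):
    """Bulk-load a dense B+-tree from (key, record_offset) pairs.

    Returns (nodes, n_leaves): 'nodes' is the index file as a list of
    nodes in file order -- all leaves first, occupying a contiguous
    prefix as in Fig. 4, then each internal level appended bottom-up.
    Internal entries are (smallest_key_below, child_position) pairs.
    """
    entries = sorted(pairs)                     # step 1: sort the pairs
    # Step 2: write the sorted entries sequentially as dense leaves.
    nodes = [entries[i:i + FANOUT] for i in range(0, len(entries), FANOUT)]
    n_leaves = len(nodes)
    # Step 3: build the upper levels bottom-up and append them.
    level = list(range(n_leaves))               # positions of current level
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), FANOUT):
            children = level[i:i + FANOUT]
            nodes.append([(nodes[c][0][0], c) for c in children])
            parents.append(len(nodes) - 1)
        level = parents
    return nodes, n_leaves

def locate(nodes, n_leaves, key):
    """Descend from the root to the first leaf entry with entry_key >= key."""
    pos = len(nodes) - 1                        # root is the last node written
    while pos >= n_leaves:                      # still on an internal node
        node = nodes[pos]
        keys = [k for k, _ in node]
        i = max(bisect.bisect_right(keys, key) - 1, 0)
        pos = node[i][1]
    keys = [k for k, _ in nodes[pos]]
    return pos, bisect.bisect_left(keys, key)
```

Two `locate` calls give the start and end positions for a range predicate; because the leaves sit in a contiguous prefix of the file, the leaf pages between the two positions can be handed out to Map instances directly.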
  2) Index Access
    To support the index access method, we add another
pseudo column idx to the table schema, and modify the
storage engine implementation accordingly. In the case of
index access, the idx and blk attributes are used together to
give the indication to the storage engine.
    When a query has a predicate of high selectivity on the
indexed attribute, the index access method is chosen.
Before starting up the MapReduce tasks, several traversals
of the index are performed to locate the start and end
positions in the leaf nodes for each predicate value. The
index entries in the leaf nodes between the start and end
positions are those pointing to the records satisfying the
predicate. If the predicate is a range, only two traversals
are needed. Because the height of the tree is usually very
low, this process does not take much time compared to the
later processing using MapReduce.
    According to the start and end positions, Maps are
generated attached to the leaf node pages in the selected
ranges, as shown in Fig. 4. Each Map adds the predicates
on blk and idx to the where clause of the SQL query. The
idx parameter specifies the index to be used, and the blk
parameters now indicate the start and end offsets into the
leaf nodes. During the execution of the sub-query, the
storage engine scans the index entries in the selected range
of index leaf nodes, and retrieves each tuple using the offset
in the index entry. The other phases of the execution are the
same as in the full scan case.
    In this execution mode, the number of Maps is related
to the selectivity of the predicate. Using the index, a
minimal number of Maps are generated and only the
qualified records are read in.

D. Summary
    We have proposed a new system architecture that
integrates modified database engines into Hadoop as a
read-only execution layer. Data replication is handled
naturally by HDFS. The modified database engine is able to
process data from an HDFS file at the block level, so it fits
very well into MapReduce. The global index access
mechanism is added according to the MapReduce paradigm.
The loading speed can also be guaranteed using HDFS.
This integrated system satisfies our application's needs very
well.
    The essential difference from HadoopDB is that we
construct our system on a Hadoop basis, rather than a
DBMS basis. The DBMS provides us with efficient
operators, while the management of data is handled by
other Hadoop-based components.

                     V. EXPERIMENTS

A. Configurations
    The experiments are conducted on a cluster of 15 nodes
connected by gigabit Ethernet, which is part of a public
computing platform. Each node has two dual-core AMD
Opteron™ 275 processors, 8GB DRAM, and a 136GB
SCSI disk. The operating system kernel is Linux
2.6.9-4.2.ELsmp x86_64. The bandwidth of local file
system sequential I/O is about 60MB/s. Hadoop 0.19.2 is
set up on the cluster, and one MySQL server of version 5.0
is running on each node.
    The benchmark is from our application. Although from
a specific domain, the data schema and operations are very
common to many other applications. The data schema is a
table with 9 integer columns: time, systemID, deviceID,
eventType, port, inBytes, outBytes, inPackets and
outPackets. time is the number of seconds since the Epoch;
it starts from 1235750430 in the benchmark, and increases
by 1 every 131072 records. systemID is uniformly
distributed in the integer range [1, 15]. deviceID and
eventType are uniformly distributed in the integer ranges
[1, 50] and [100000000, 100000014] respectively. port,
inBytes, outBytes, inPackets and outPackets are all
uniformly distributed in the integer range [0, 65535]. The
whole dataset has about 471,859,200 records over a time
range of one hour.
    The queries we use are

 SELECT truncate(time/60,0), systemID, deviceID, eventType,
        sum(inBytes), sum(outBytes), sum(inPackets), sum(outPackets)
 FROM table
 [WHERE port IN (port_list)]
 GROUP BY truncate(time/60, 0), systemID, deviceID, eventType

where the where clause may vary in different experiments.
This query groups and summarizes the data by the
systemID, deviceID and eventType attributes at a minute
granularity, with a predicate on the port attribute.
    We compare three systems: Hadoop, a HadoopDB-like
system (HadoopDB-L for short) implemented on top of
MySQL, and our database engine integrated Hadoop
system (DBEHadoop for short). The data in Hadoop are in
text format with columns separated by space characters,
and the whole dataset in the benchmark occupies about
25GB. The data in HadoopDB-L and DBEHadoop are in
paged binary format with each attribute value occupying
four bytes, and the whole dataset occupies about 15GB. All
systems are configured with 2 replicas at 64MB block
granularity for the data, so the actual storage space for the
dataset doubles. The data of HadoopDB-L are manually
replicated across all nodes, and each 64MB block of data is
simulated using a separate table. For each block (table),
there is a local index on the port attribute. In DBEHadoop,
a global index on port is created. Because we are on a
public platform, we set the maximum number of parallel
Maps per node to 3, which is less than the number of cores
per node.

B. Query without Predicate
    The first experiment is on the query without the where
clause, so all the data needs to be scanned. The result set
contains 2,287,500 records. The system buffer is cleaned
before the execution.
    Fig. 5 shows the running time for the three systems to
execute this query. The execution time of Hadoop is much
longer than that of HadoopDB-L and DBEHadoop. There
are two factors affecting the performance of Hadoop. The
first is the raw size of the data file, and the second is the
CPU efficiency during the execution. The raw data file in
Hadoop is larger than in the two other systems, so the
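These benchmark figures are internally consistent; as a quick arithmetic check (our own verification, not from the paper), one time tick per 131,072 records spans exactly one hour, and 9 four-byte columns per record give roughly the reported binary dataset size:

```python
records = 471_859_200
# time advances 1 second every 131072 records -> total covered time span
assert records / 131_072 == 3600            # exactly one hour of data

# paged binary format: 9 integer columns, 4 bytes each
binary_bytes = records * 9 * 4
print(round(binary_bytes / 2**30, 1))       # prints 15.8, i.e. "about 15GB"
```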
disk I/O is larger in Hadoop. In addition, a larger data file
needs more Maps to process it, which incurs more control
overhead. In terms of CPU efficiency, partly due to the Java
language, Hadoop is less I/O bound (i.e., more CPU bound)
when processing a large number of records with many
columns, as in our case, especially in text format. For this
query, Hadoop takes much more CPU time.

        Figure 5. Running time for the query without predicate.

    The performance of HadoopDB-L and DBEHadoop is
almost the same. They have similar data formats, so the
data sizes are similar. Reading data through HDFS does not
cause too much overhead for DBEHadoop because of the
paged sequential I/O, which makes the two systems
essentially the same in query execution. Using the database
for the underlying execution improves the CPU efficiency,
so these systems are more I/O bound compared to Hadoop.

C. Query with Predicate
    We now add a predicate to the query. The query only
processes the records whose port value is in the port_list.
When a predicate of high selectivity exists, there is a
chance to use an index to accelerate the execution. So we
conduct this set of experiments under a high-selectivity
predicate, where an index makes sense. We select 1, 10 and
20 ports respectively to repeat the experiment. Because the
value of port is uniformly distributed, the number of
selected ports rather than the specific values determines the
performance. The system buffer is cleaned before execution.
    Fig. 6 shows the result for the three systems using full
scan execution, which scans all the data and applies the
predicate to qualify the records. The time is shorter than
that without the predicate for each system, because fewer
records need processing after applying the predicate. Full
scan has the same performance in all cases for each specific
system, because with only a small number of qualified
records, the I/O and control cost dominate the total time.
Hadoop takes more time than the other systems mainly
because of the larger data size (more I/O and Map
instances). HadoopDB-L and DBEHadoop have similar
performance.

           Figure 6. Running time for full scan execution.

    Next we compare the performance of HadoopDB-L and
DBEHadoop with index assistance. Fig. 7 shows the result.
For the case of one port, index access is better than full
scan, because random reads of a very small number of
records outperform a sequential scan of the large dataset.
For HadoopDB-L, index access is only a little better than
full scan; this is because in both modes each data block
needs a Map with actually little work to do due to the high
selectivity, so the control overhead takes a considerably
large part of the total running time. DBEHadoop is much
more efficient than HadoopDB-L, because of the very small
number of Maps needed. For the case of 10 ports, the
performance of index access is almost the same as that of
full scan execution, which indicates that index access is no
longer superior from this point on. For HadoopDB-L, the
cost of random reads offsets the benefit of the small I/O
volume. For DBEHadoop, random reads and communication
cost together offset the benefit. For the case of 20 ports, the
DBEHadoop index access method is much more expensive
than the full scan method. Compared to a local index, the
cost of global index access increases more evidently due to
the network communication in addition to the random disk
I/O, which offsets the benefit of the small number of Maps.
However, in these conditions, the full scan method would
be chosen.

              Figure 7. Full scan vs. index access.

D. Execution under Warm Buffer
    In many cases, the same set of data may be queried by
different users, so it is necessary to evaluate the
performance under a warm buffer. The previous experiments
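The full-scan vs. index-access tradeoff observed here can be illustrated with a back-of-envelope cost model (our own illustration with assumed constants, not measurements from the paper): full scan pays sequential I/O over the whole dataset regardless of selectivity, while index access pays roughly one random read per qualified record, so it loses once enough records qualify. The model deliberately ignores the control and communication overheads discussed above, which shift the real crossover point:

```python
def full_scan_secs(data_bytes, nodes, seq_bw):
    """Sequential scan cost: every node streams its share of the data."""
    return data_bytes / nodes / seq_bw

def index_access_secs(n_qualified, nodes, random_read_secs):
    """Index access cost: one random read per qualified record,
    spread over the nodes (index traversal time neglected)."""
    return n_qualified / nodes * random_read_secs

# Assumed constants, loosely inspired by the experimental setup.
DATA = 15 * 2**30          # ~15GB binary dataset
NODES = 15
SEQ_BW = 60 * 2**20        # ~60MB/s sequential bandwidth per disk
RAND = 0.01                # assumed ~10ms per random read

scan = full_scan_secs(DATA, NODES, SEQ_BW)
# port is uniform over 65536 values, so each selected port qualifies
# about 471859200 / 65536 = 7200 records.
per_port = 471_859_200 // 65_536
ports = 1
while index_access_secs((ports + 1) * per_port, NODES, RAND) < scan:
    ports += 1
print(ports)  # crossover under these assumptions; more ports -> full scan wins
```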
are repeated using the same way except that the data are                                                         generates the data itself and loads it into the system. Doing
already in the system buffer cache.                                                                              this way ignores the overhead of communication between
    Fig. 8 shows the result for full scan execution. When                                                        the data source and the loader, thus evaluates the raw
without the predicate, the time for Hadoop is similar to that                                                    loading speed of the underlying system. For HadoopDB-L,
under cold buffer, due to the CPU bound behavior.                                                                the loader loads the records into local MySQL server
HadoopDB-L and DBEHadoop are faster than the case with                                                           through prepared batch inserts using JDBC, and no
cold buffer. When with predicate of high selectivity, all                                                        replication is considered. For DBEHadoop, the loader loads
three systems take much shorter time than the cold buffer                                                        records through routines that format the data into pages and
                                                      200                                                        write them onto HDFS, and a replication of degree 2 is
                                                                                                                 automatically maintained by HDFS.
                                                                                               Hadoop
                                                                                                                     Fig. 10 gives the result, from which we can see
                            running time in seconds




                                                      160                                      HadoopDB-L
                                                                                               DBEHadoop         DBEHadoop is much faster than HadoopDB-L.
                                                      120                                                        HadoopDB-L must load the data through DBMS logic, so
                                                                                                                 the speed is very low, especially in the online mode for
                                                      80                                                         streaming data. Although only using MySQL server here, it
                                                                                                                 is in the same order of magnitude for other DBMSs as we
                                                      40                                                         know. DBEHadoop reserves the advantage of high loading
                                                                                                                 speed of Hadoop though direct writing to HDFS.
                                                       0                                                         Replication mechanism causes extra overhead, and when
                                                            all        1              10            20
                                                                                                                 multiple loaders work in parallel, the overhead increases
                                                                    number of ports selected                     due to increased network traffic. However, the loading
                                                                                                                 speed is fast enough for common applications.
       Figure 8. Running time for full scan execution under warm buffer.
                                                                                                                                          350
case, which is a result of saving on I/O cost. The difference
                                                                                                                   loading speed (MB/s)                     HadoopDB-L
                                                                                                                                          300
in performance between Hadoop and other two systems                                                                                       250               DBEHadoop
becomes smaller, which is because decreased operations
                                                                                                                                          200
involved hide the inefficiency of Hadoop in some degree.
HadoopDB-L and DBEHadoop have similar performance                                                                                         150
in all cases.                                                                                                                             100
    Fig. 9 gives the result for the index access execution.                                                                                50
Different with the cold buffer case, DBEHadoop index
                                                                                                                                            0
access method gets the equal performance with full scan
method when 20 ports are selected, while the HadoopDB-L index access method performs almost the same as the full scan method in all cases. With no I/O cost involved, the smaller number of Map tasks is the main contributor to the performance improvement.

Figure 9. Running time for index access execution under warm buffer. (Running time in seconds versus number of ports selected: 1, 10, 20; series: HadoopDB-L & DBEHadoop full scan, HadoopDB-L index access, DBEHadoop index access.)

E. Data Loading
    In the cluster, we start several loaders, one on each node, to test the streaming data loading speed. The loader just

Figure 10. Data loading speed. (Loading speed versus number of loaders: 1 to 15.)

F. Summary
    Through these experiments, we show that DBEHadoop is as efficient as HadoopDB-L for full-scan queries, and that for queries with highly selective predicates, the global index access mechanism adopted in DBEHadoop is much more efficient than HadoopDB-L. For data loading, DBEHadoop achieves very good performance, far better than HadoopDB-L.

                     VI. CONCLUSION
    Neither Hadoop nor DBMSs alone are ideal for large-dataset analysis. HadoopDB, an integrated system merging techniques from both, is promising, but it remains limited for reasons that are difficult to overcome. We believe it is the way HadoopDB is constructed that makes it hard to satisfy emerging applications. Taking a Hadoop basis, rather than a DBMS basis, and incorporating DBMS techniques is the right way to construct systems for large-scale data processing applications.
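The global index access mechanism evaluated above can be illustrated with a simplified, hypothetical sketch (not the system's actual implementation; the class and names below are invented for illustration): a sparse global index maps the first key of each sorted HDFS block to that block, so a query with a selective predicate schedules Map tasks only on the blocks that can contain matching rows, instead of scanning all of them.

```python
# Hypothetical sketch of global index access over HDFS blocks:
# one (first_key, block_id) entry per sorted block; a range predicate
# selects only the blocks whose key range can overlap [lo, hi].
from bisect import bisect_right

class GlobalIndex:
    """Sparse global index: one (first_key, block_id) entry per sorted block."""
    def __init__(self, block_first_keys):
        # block_first_keys: list of (first_key, block_id), sorted by first_key.
        self.keys = [k for k, _ in block_first_keys]
        self.blocks = [b for _, b in block_first_keys]

    def blocks_for_range(self, lo, hi):
        """Return the block ids that may contain keys in [lo, hi]."""
        # The block whose first key is <= lo may still contain lo.
        start = max(bisect_right(self.keys, lo) - 1, 0)
        # Blocks whose first key is > hi cannot contain any matching key.
        end = bisect_right(self.keys, hi)
        return self.blocks[start:end]

# Six blocks, each starting at the given key.
index = GlobalIndex([(0, "blk_0"), (100, "blk_1"), (200, "blk_2"),
                     (300, "blk_3"), (400, "blk_4"), (500, "blk_5")])

# A selective predicate (keys 150..250) touches only two blocks, so only
# two Map tasks would be launched instead of a full scan over all six.
print(index.blocks_for_range(150, 250))  # ['blk_1', 'blk_2']
```

In a MapReduce setting, the job driver would consult such an index before constructing input splits, so the number of Map tasks scales with the selectivity of the predicate rather than with the table size.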
    We propose a new system architecture that integrates modified DBMS engines into Hadoop as a read-only execution layer, where the DBMS provides efficient operators instead of managing the data. Besides offering the same advantages as HadoopDB, our system overcomes the limitations HadoopDB faces in real deployments. The HDFS-based storage solves the fault tolerance problem in the data layer. The modified database engine can process data from HDFS files at the block level, so it fits very well into MapReduce. A global index access mechanism, adapted to the MapReduce paradigm, is added and shows better performance than HadoopDB for certain queries. The proposed system preserves Hadoop's advantage in data loading speed, which is far better than HadoopDB's. All these properties make the system more appropriate for large-scale dataset analysis applications.

                    ACKNOWLEDGMENT
    We would like to thank the anonymous reviewers for their valuable feedback on this work. This research is supported by the National Natural Science Foundation of China (Grant No. 60903047).

                      REFERENCES
[1]  S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proc. SOSP'03, 2003, p. 29.
[2]  J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008.
[3]  Hadoop website. [Online]. Available: http://lucene.apache.org/hadoop
[4]  A. Pavlo, E. Paulson, and A. Rasin, "A comparison of approaches to large-scale data analysis," in Proc. SIGMOD'09, 2009, p. 165.
[5]  D. J. Abadi, "Data management in the cloud: Limitations and opportunities," Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2009.
[6]  M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, "MapReduce and parallel DBMSs: Friends or foes?" Communications of the ACM, vol. 53, no. 1, 2010.
[7]  A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, "HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads," in Proc. VLDB'09, 2009.
[8]  A. Thusoo, J. Sen Sarma, N. Jain, and Z. Shao, "Hive - A warehousing solution over a MapReduce framework," in Proc. VLDB'09, 2009.
[9]  C. Olston, B. Reed, and U. Srivastava, "Pig Latin: A not-so-foreign language for data processing," in Proc. SIGMOD'08, 2008.
[10] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey, "DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language," 2008.
[11] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the data: Parallel analysis with Sawzall," Scientific Programming, vol. 13, no. 4, 2005.
[12] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Job scheduling for multi-user MapReduce clusters," Technical Report UCB/EECS-2009-55, 2009.
[13] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears, "MapReduce online," Technical Report UCB/EECS-2009-136.
[14] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in Proc. HPCA'07, 2007.
[15] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proc. OSDI'08, 2008.
[16] J. Lin, S. Konda, and S. Mahindrakar, "Low-latency, high-throughput access to static global sources within the Hadoop framework," HCIL Technical Report HCIL-2009-01, 2009.
[17] J. M. Hellerstein, M. Stonebraker, and J. Hamilton, "Architecture of a database system," Foundations and Trends in Databases, vol. 1, no. 2, pp. 141-259, 2007.

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Integrating DBMSs as a Read-Only Execution Layer into Hadoop

The 11th International Conference on Parallel and Distributed Computing, Applications and Technologies

Mingyuan An, Yang Wang
Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences
Institute of Computing Technology, Chinese Academy of Sciences
Graduate University of Chinese Academy of Sciences
Beijing, China
{anmingyuan, aaron}@ncic.ac.cn

Weiping Wang, Ninghui Sun
Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences
Institute of Computing Technology, Chinese Academy of Sciences
Beijing, China
{wpwang, snh}@ncic.ac.cn

ABSTRACT: To obtain the efficiency of a DBMS, HadoopDB combines Hadoop and DBMSs and claims superiority over Hadoop in terms of performance. However, HadoopDB simply puts MapReduce on top of unmodified single-machine DBMSs, which has several obvious weaknesses. In essence, HadoopDB is a parallel DBMS with fault tolerance, and it incurs unnecessary overhead due to its DBMS legacy. Instead of augmenting a DBMS with Hadoop techniques, we propose a new system architecture that integrates modified DBMS engines into Hadoop as a read-only execution layer, where the DBMS provides efficient read-only operators rather than managing the data. Besides the efficiency gained from the DBMS engine, there are other advantages. The modified DBMS engine can directly process data from HDFS (Hadoop Distributed File System) files at the block level, which means that data replication is handled naturally by HDFS, and block-level parallelism is easily achieved. A global index access mechanism is added following the MapReduce paradigm. Fast data loading is also guaranteed by writing the data directly into HDFS with simplified logic. Experiments show that our system outperforms both the original Hadoop and a HadoopDB-style system.

Keywords: Hadoop, database, large-scale data processing, global index access

I. INTRODUCTION

Google File System (GFS) [1] and MapReduce [2] were developed (or popularized) by Google for large-scale dataset storage and processing. GFS is a distributed file system optimized for large sequential reads, and it provides a fault tolerance mechanism in the data layer. MapReduce is a programming paradigm for parallel processing. Using MapReduce, the user can easily express an application task without dealing with the complexity of parallel execution. The MapReduce runtime parallelizes and schedules the job to make full use of the available resources in a large-scale parallel system, while providing a fault tolerance mechanism in the execution layer.

Because of this fault tolerance, high scalability, and ease of use, the techniques underlying MapReduce and GFS are very attractive for large-scale data processing applications. Hadoop [3], an open-source implementation of these techniques, has greatly pushed their popularity, and many systems have been constructed on top of Hadoop.

The loose constraints on data schema and execution style in Hadoop give the user maximum flexibility: the upper application can be implemented in an unrestricted way. However, as a very thin layer with only basic mechanisms and functionality, Hadoop is not efficient enough to face the user directly [4][5][6]. In common cases, it lacks many performance-critical optimizations such as compact data representation, helper structures, etc.

Database management systems, by comparison, obtain efficiency from optimized implementation: read-optimized storage formats, sophisticated query execution, different kinds of indexes, and data or query caches that exploit the semantics of the application. However, the lack of fault tolerance is one of the most important reasons that DBMSs are incompetent for large-scale data processing applications.

HadoopDB [7] puts a middleware layer between Hadoop and DBMSs, and thus gets fault tolerance from Hadoop. HadoopDB makes itself a parallel DBMS with fault tolerance, claiming the ability to support large-scale data processing applications. But this method simply takes complete DBMSs as the underlying storage and execution units, which has several problems.

First, with respect to fault tolerance, although HadoopDB can use MapReduce to achieve it in the execution layer, replication in the data layer is not fully implemented. In the experiments conducted in the HadoopDB project, the data replicas were maintained manually: before the benchmark starts, the data are split into chunks and replicated onto the nodes in batch mode using scripts. This approach obviously does not support online loading. Without fault tolerance in the data layer, the system still suffers from failures and will not be truly scalable in a large-scale environment. Implementing a replication mechanism in the middleware on top of the DBMSs would take great effort, amounting to a major fault tolerance project in itself. All of this makes HadoopDB hard to use in practice. Actually, HadoopDB is a prototype focused on testing query execution performance with data replicas prepared in advance, rather than a complete system architecture solution.

978-0-7695-4287-4/10 $26.00 © 2010 IEEE. DOI 10.1109/PDCAT.2010.43
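The MapReduce paradigm described above can be illustrated with a minimal sketch (plain Python, not the actual Hadoop API; the event records and the counting query here are made up for illustration). The user supplies only a map function emitting key-value pairs and a reduce function aggregating the values of each key; the runtime normally parallelizes the map phase over data blocks and the reduce phase over keys:

```python
from collections import defaultdict

# Minimal sketch of the MapReduce programming model (not the Hadoop API).
# Example task: count events per source address.

def map_fn(record):
    # Emit (key, value) pairs for one input record.
    src, port = record
    yield src, 1

def reduce_fn(key, values):
    # Aggregate all values grouped under one key.
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for rec in records:                 # map phase
        for k, v in map_fn(rec):
            groups[k].append(v)
    # shuffle is implicit in the grouping; reduce phase follows
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

events = [("10.0.0.1", 80), ("10.0.0.2", 53), ("10.0.0.1", 80)]
print(run_mapreduce(events, map_fn, reduce_fn))  # {'10.0.0.1': 2, '10.0.0.2': 1}
```

In a real deployment the grouping step is the distributed shuffle, and both phases run with automatic scheduling and fault recovery, which is exactly what makes the model attractive at scale.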
Second, the underlying single-machine DBMSs are unable to use a global index residing outside each individual system, so they may not be optimal in terms of performance for some kinds of queries. This could perhaps be handled in the middleware in some way, but it would still be a detour.

Third, the data loading speed of a DBMS is slow due to very strict constraints on data schema and semantics. The data usually have to go through complex logic before being stored, which is unnecessary for many applications.

The construction of HadoopDB starts from a DBMS basis, and so incurs inevitable limitations due to the DBMS legacy. We believe that Hadoop has already done a lot to make itself a competent system for large-scale data processing applications, so the right way is to start from a Hadoop basis and borrow DBMS techniques where appropriate.

In this paper, we propose our approach of integrating DBMSs as a read-only execution layer into Hadoop. Based on Hadoop, we incorporate modified DBMS engines that are augmented with a customized storage engine capable of directly accessing data from HDFS and of using a global index access method. In this architecture, the DBMS provides efficient read-only operators instead of managing the data. The following benefits are obtained:

(1) With the data placed on HDFS, the fault tolerance problem in the data layer is solved naturally.

(2) The DBMS engine executes the sub-queries with the same efficiency advantage as in HadoopDB. Moreover, based on HDFS and MapReduce, a global index mechanism can be put into action with the DBMS engines, significantly improving the performance of certain queries.

(3) As a read-only layer, the DBMS is not responsible for data loading; instead, the data are loaded through a loader outside the DBMS, or the user can write the data directly to HDFS in a predefined format. This accelerates data loading to the raw speed of HDFS writes, while keeping convenience and flexibility for the user.

The remainder of the paper is organized as follows: Section II introduces related work; Section III analyzes the application requirements and existing systems, and then positions the desired system; Section IV describes our proposed system; Section V gives experimental results; Section VI concludes the paper.

II. RELATED WORK

With the wide deployment of Hadoop in the data analysis field, two types of work have appeared to improve Hadoop-based systems.

The first addresses usability. Despite the flexibility, for many users the low-level language of MapReduce is somewhat inconvenient compared to a higher-level language such as the SQL of relational DBMSs. So several systems have been developed on top of Hadoop; Facebook's Hive [8] and Yahoo's Pig [9] are examples of this kind. They provide simple declarative languages capable of expressing complex ad-hoc queries on structured data. Some other higher-level languages or system-level products have also been developed on top of either MapReduce or similar techniques [10][11]. These works mainly deal with the language problem.

The second addresses efficiency. Some work improves the kernel of Hadoop itself [12][13][14][15]; other work improves the system through external mechanisms [7][16]. The work on integrating DBMSs and Hadoop stems from the observations in [5], which point out that the brute-force style of MapReduce is not optimal in terms of efficiency, and that the techniques of MapReduce and DBMSs should be merged. HadoopDB [7] is, as far as we know, the first attempt to merge the two systems. Although it pioneered the idea, HadoopDB still has limitations that bottleneck applications in real scenes.

III. SYSTEMS FOR LARGE-SCALE DATA ANALYSIS

In this section we first introduce the requirements of our large-dataset analysis application; we believe these requirements are typical of many other applications. Based on these requirements, we revisit existing systems for data processing. Finally, in terms of merging the techniques of Hadoop and DBMSs, we determine the position where our system should stand.

A. Application Requirements

Our case is a network security application. Monitors keep watching the whole network and generate sampling records for captured events. The generated data stream into the analytic platform in real time. Once stored in the system, the data should be available for ad-hoc read-only queries. This analytic platform is the one we focus on.

1) Large-scale Parallel Processing

Large-scale parallel processing power is a basic requirement of all large-dataset analysis applications. Unlike the point operations of the key-value model in web services, analysis work usually needs to access a big part of the whole dataset even for a single query, and so must resort to large-scale parallel processing for raw power. In such an environment, automatic mechanisms are critical: automatic parallelization, scheduling, and fault handling liberate the user from heavy programming and maintenance work. All of these play an important role in guaranteeing the scalability of the system.

2) High Efficiency

Efficiency is a necessary concern because each single query takes up many resources, and higher efficiency can lower the cost significantly.

Besides the common reasons, there is a special one in our case. In many other applications, the data have a short life cycle: the data are loaded into the system in batch mode, some almost fixed queries are run over them, and after that the data are removed or offloaded to an offline system. Under such conditions, organizing the data into sophisticated structures is not worthwhile given the extra maintenance cost and the low utility. Sometimes the tasks on the data are simply to generate statistical reports as timed jobs, for example running only during the night, so it
may be all right even if the execution takes a less efficient way. These applications can be seen as data processing rather than data analysis.

By comparison, our application deals with ad-hoc analytic queries over a long-lived dataset. It is worthwhile to adopt optimized data structures and execution mechanisms to improve query efficiency. For example, queries often carry predicates on certain attributes, for which an index can reduce the execution cost in certain cases. Although an index has extra maintenance cost, this is amortized by repeated usage. Ad-hoc queries also make caching useful: queries over the same set of data may recur, so a data cache makes sense, and using an index also creates demand for an index cache. After all, the long life cycle of the data and the ad-hoc queries justify the effort spent on optimization.

3) Continuous High-Speed Data Loading

The data in our application stream in continuously at relatively high speed. This requires the system to be capable of loading data at high speed in an online mode. The data can be stored in an append-only manner, and once stored they are never updated. High-speed online loading requires the logic on the path to be simple enough; unnecessarily strict semantics checking should be avoided.

B. Existing Systems Reconsidered

In response to the requirements of the application, we consider three types of systems: database management systems, Hadoop, and HadoopDB. If we treat the techniques of the DBMS and Hadoop as two extremes, there is a broad spectrum between them, and to satisfy the requirements of the application it is right to draw strength from both. HadoopDB is essentially a DBMS equipped with some Hadoop techniques; we believe, however, that our system should start from the other side.

1) Existing Systems

DBMS: DBMSs have a long history in data management. The irreplaceable domain of the DBMS is transaction processing, where the ACID properties must be guaranteed. In data analysis, DBMSs also hold an important position: parallel DBMSs are very popular in the data warehouse market, where they provide a high degree of parallelism and achieve good performance for analytic queries.

For parallel processing, parallel DBMSs compete only at a limited scale. The serious problem with DBMSs is that most of them are not fault tolerant: if something goes wrong during the execution of a query, the query has to restart entirely. In a large-scale system consisting of thousands of components, a long-running query would never succeed given the high failure rate. While parallel DBMSs are competent for data warehousing, in this respect they are not suitable for large-scale applications. The largest DBMS-based analytic system we know of consists of only 100 machines.

For efficiency, DBMSs embody decades of academic and industrial research. Optimizations have been applied to every layer and component of the DBMS, such as optimized data storage formats, diverse access methods, sophisticated query execution, efficient data caches, etc. Many of these techniques are widely copied and reinvented by other systems [17]. This is the advantage of the DBMS, and it is desirable for large-dataset analysis.

For data loading, DBMS-based systems have a limitation. Due to strict constraints such as the ACID properties, the system cannot load data efficiently, especially online. The online loading speed of any DBMS node is lower than 10 MB/s, as far as we know. The systems in data warehouse applications are often offline systems with very weak online loading requirements, where the common case is to load data in batches at regular time points.

Hadoop: Many applications replace the DBMS with Hadoop for a couple of reasons. One of the most important is that Hadoop is scalable thanks to its fault tolerance. In addition, Hadoop is easier to deploy and use. For data processing tasks, MapReduce provides a very simple and flexible parallel programming paradigm that is able to express complex queries.

As to parallel processing, Hadoop is born for it. The MapReduce run-time system has full ability to parallelize the whole processing across large-scale systems. Both MapReduce and HDFS are fully fault tolerant, making the whole system highly scalable. This is one of the most important properties of Hadoop and is critical for large-dataset analysis. In addition, the block-level replication of HDFS gives MapReduce great opportunities for a high degree of parallelism and fine-grained execution fault tolerance, which improves performance significantly.

For efficiency, Hadoop seems to have a long way to go. MapReduce works in a brute-force way: whatever the query is, it has to scan all the data without helper structures such as indexes. The scan and processing are not guaranteed to be efficient, because implementing the details is the user's job. Moreover, the data are often stored on HDFS in text format, which is straightforward for the user but not compact. In these respects, Hadoop is not as good as a DBMS, at least for now.

As to data loading, writing data directly to HDFS achieves an acceptable speed. For many large-scale data analysis applications, including ours, weak consistency is enough. The data can pass through without complex logic, so writing structured data can achieve the same speed as writing an unstructured byte stream. Although the speed is lowered when replicas are stored, this is an inevitable tradeoff for fault tolerance on the data.

HadoopDB: DBMSs and Hadoop each have their own superiority in the appropriate domain. To better satisfy the emerging applications, the DBMS must incorporate fault tolerance in order to be scalable, while Hadoop should borrow techniques from the DBMS to improve efficiency. HadoopDB is constructed on this idea: DBMSs are taken as the storage and execution units, and the MapReduce mechanism takes responsibility for parallelization and fault
tolerance on top of the underlying DBMSs. Fig. 1 shows the architecture of HadoopDB briefly. HDFS is used to store the system metadata and the result sets of queries; all source data are stored in the DBMSs. When executing a query, Maps are scheduled to the nodes according to the metadata, which tells the location of each block of the data. The Maps issue SQL queries to the underlying DBMSs and emit the result records to the Reduces. The Reduces aggregate the result sets from multiple nodes and write the final results onto HDFS.

Figure 1. Architecture of HadoopDB.

For parallel processing, HadoopDB is only partially fault tolerant. MapReduce guarantees fault tolerance only in the execution layer. The data are stored in DBMSs rather than in HDFS, so the availability of the data must be specially handled. A common DBMS has no special support for the fine-grained data replication needed for intra-query parallelization, without which the MapReduce framework cannot completely exploit the parallelism. HadoopDB uses a batched approach to dump the data out of the database and replicate them to other nodes, which does not support online loading at all. This functionality could be achieved in the middleware, but implementing fine-grained replication on top of DBMS tables requires non-trivial work, which makes it hard to use in practice.

For efficiency, the advantage of the DBMS is reflected in the integrated system, because each query is translated into sub-queries actually executed by the individual DBMS query engines. Despite the improvement, there is still one limitation concerning global structures. Because all the DBMSs in HadoopDB are unmodified single-machine versions, this layer cannot use any global structure such as a global index, which matters when the query has predicates of high selectivity.

Finally comes the data loading requirement. In HadoopDB, the DBMS-based storage obviously inherits the limitation: the data have to be loaded through the complete DBMS logic, so it is difficult to improve the loading speed.

2) Discussion

We have analyzed two typical systems and the integrated HadoopDB. The lack of fault tolerance eliminates traditional parallel DBMSs from the candidates for large-scale data processing applications. HadoopDB, however, being in essence a fault-tolerant parallel DBMS, becomes a promising representative of the DBMS camp, which seems capable of supporting this kind of application. HadoopDB merges techniques from both the DBMS and Hadoop, but it can hardly be used for these applications due to the DBMS legacy and the further implementation work required. So we must now identify two different ways of merging DBMS and Hadoop techniques, which helps position the desired system. The difference between the two ways lies in the starting point for constructing an integrated system: one starts from the DBMS, the other from Hadoop.

HadoopDB tries to introduce fault tolerance and fine-grained parallelism into the parallel DBMS, and so belongs to the first category. While high efficiency is retained, all the strict constraints of the DBMS are also inherited. HadoopDB suits applications where the strict schema and semantics of the data are given high priority, so it is capable of dealing with traditional database applications. After all, HadoopDB means "Hadoop database", not "database Hadoop".

The system we need should come from the other side. Hadoop satisfies the majority of our needs except efficiency, so we should integrate DBMS techniques into a Hadoop-based system, rather than the reverse. MapReduce and HDFS were developed precisely for large-scale data processing applications, and it is only the efficiency that needs special concern. Hence, we position our system closer to the Hadoop side, while actively incorporating desired properties from the DBMS.

C. Our Approach

Unlike HadoopDB, we take DBMSs as read-only execution components. For a specific query, the dataset on HDFS is logically split into blocks as usual, and each block is assigned an executor, which is now a database execution thread; all intermediate results computed by the database engines are aggregated by Reducers as before; final results are written onto HDFS naturally. With this approach, parallel execution is still at block granularity, and fault tolerance is guaranteed in both the data layer and the execution layer.

Efficiency now depends on the DBMS layer. Many techniques such as the data cache, the query cache, and the optimized operators in the database still take effect. But the data access methods are partially different from before, due to the customized storage engine using HDFS; the index access mechanism must be reconsidered and adapted to work in this situation.

The data loading process is intuitive. Streaming data can be packed into an optimized format, as in a DBMS, and then written directly onto HDFS, bypassing the transaction logic; this simple path does not cost too much.

IV. DBMS ENGINE INTEGRATED HADOOP SYSTEM

In this section we give a detailed description of the system constructed for our application. We first give an overview of the system architecture, and then focus on the query execution process. Besides the familiar full-scan execution, a global index access mechanism in the MapReduce framework is introduced.
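The per-block execution flow just described can be modeled with a short sketch (plain Python with hypothetical helper names; the real system schedules each sub-query as a Map task against a MySQL engine thread, and the 'offset:length' encoding of the block position is an assumption made for illustration, the paper carrying it via a pseudo column predicate):

```python
# Rough, illustrative model of splitting an HDFS file into logical blocks
# and deriving one read-only sub-query per block. All helper names are
# hypothetical; the actual system runs sub-queries in modified MySQL
# engine threads scheduled by the MapReduce runtime.

BLOCK_SIZE = 64 * 1024 * 1024   # HDFS default block size (64 MB)

def logical_splits(file_size, block_size=BLOCK_SIZE):
    # Logical splitting: (offset, length) pairs, one per block, as the
    # MapReduce runtime computes before scheduling Map instances.
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

def subquery_for_block(offset, length):
    # One sub-query per block; the block position travels with the
    # sub-query (encoded here, hypothetically, as "offset:length").
    return ("SELECT src, COUNT(*) FROM events "
            f"WHERE port = 80 AND blk = '{offset}:{length}' GROUP BY src")

splits = logical_splits(200 * 1024 * 1024)      # a 200 MB file
subqueries = [subquery_for_block(o, n) for o, n in splits]
print(len(subqueries))                          # 4 blocks
print(subqueries[0])
```

Each generated sub-query would run against one block, with Reducers merging the partial aggregates; since every block is replicated by HDFS, a failed sub-query can simply be rescheduled on another replica holder.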
A. Overview

The system consists of four parts, as shown in Fig. 2. The bottom is the storage layer, HDFS. On top of HDFS are the database query engines acting as the executors. The top is the MapReduce system. The middleware layer contains the data loader, the indexer, etc.

Figure 2. Architecture.

The data loader stores the incoming data onto storage in a simple way. The data are packed in binary format into pages, each of which is the smallest I/O unit of the database query engine. The page size is fixed by a system parameter and is 32KB by default. Using a binary format reduces the occupied disk space compared to a text representation, and so improves both loading and query performance. The binary format is also obeyed by the customized storage engine in the database when parsing the data into tuples. Note that although structured data are stored in binary format, the CPU cost of loading is not increased; in fact, it is more efficient than using a text format. The data are replicated automatically by HDFS at block granularity. The default block size is 64MB and is configurable; the HDFS block size is an integer multiple of the page size for ease of implementation.

The indexer can create indexes on the loaded data in batch mode. Because HDFS is append-only, it would be complex to build an updatable index in a real-time manner. In practice, the data are often queried with a time-range predicate, so creating separate indexes along the time dimension on a periodic basis is an acceptable solution in real scenes. We support a B+-tree index for now. The B+-tree index searches all the data across the cluster, so it is a global index structure. The index data are also stored in HDFS and can be seen by every database executor. The detailed index structure and the index access method in the MapReduce framework are described in Section IV.C.

The database executors are actually MySQL server threads. The MapReduce run-time system schedules a sub-task (a Map instance) to a specific node, on which the sub-task issues a SQL query to the underlying MySQL server. We implement a new storage engine for MySQL so that the query engine can get tuples from HDFS data files. Some tricks are applied to make the query engine capable of accessing tuples from a specific block of an HDFS file, which provides the ability to execute at the block level and fits the DBMSs into the MapReduce execution framework very well.

From the perspective of the whole system, we embed modified database engines into Hadoop rather than just gluing them together as HadoopDB does, which yields a more coordinated system.

B. Query Execution

Fig. 3 describes the framework of query execution. The query is first translated into sub-queries expressed in SQL; the sub-queries are executed by the database engine threads. Besides the operations to apply, a sub-query also indicates the position of the data block that the database engine thread should process. This position information is computed by the splitting process on the source data file and is used by the MapReduce runtime system to schedule the sub-tasks. Each sub-query is passed to a Map instance on a specific node, where the position parameters in the sub-query are set to the corresponding splitting result values.

Figure 3. Query execution.

Each Map instance issues the sub-query in SQL form to the local database engine thread and emits the results returned by the database. The Map instance does not need to aggregate the local intermediate results, because the sub-query executed by the database already does this. The most critical part of this process is the customized storage engine, which provides the ability to access HDFS data at the block level.

1) Customized Storage Engine

The query engine accesses the data through the storage engine using a collection of routines in an iterator manner. init() is invoked first whenever the executor wants to access the data. After that, the get_next() routine is called repeatedly by the executor; it returns one tuple each time, and the executor applies the operations to the stream of tuples. When the query is finished, a close() function is called, which cleans up the context.

We store the dataset on HDFS, so the first thing we do is implement these routines using the HDFS API. The implementation is almost the same as for a local file system, except that all file system calls are replaced by their HDFS counterparts. The data are page-formatted, and reading is done on a page basis. The predefined data format is obeyed when parsing tuples out of a page.
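The iterator interface just described can be sketched in miniature (illustrative Python over an in-memory byte buffer; the real engine reads 32KB pages from HDFS through the HDFS API, and the fixed-width tuple layout used here is an assumption):

```python
import struct

# Miniature sketch of the storage engine's init / get_next / close
# iterator interface described above. An in-memory bytes object stands
# in for the HDFS file; the page size and the tuple layout (two uint32
# fields) are illustrative assumptions, not the paper's actual format.

PAGE_SIZE = 16                 # bytes per page (32 KB in the paper)
TUPLE = struct.Struct("<II")   # assumed record format: two uint32s

class HdfsScanner:
    def init(self, data, offset, length):
        # Position the scanner on one assigned block of the file.
        self.data, self.pos, self.end = data, offset, offset + length
        self.page, self.page_pos = b"", 0

    def get_next(self):
        # Return the next parsed tuple, fetching a new page when needed.
        if self.page_pos + TUPLE.size > len(self.page):
            if self.pos >= self.end:
                return None        # assigned block exhausted
            self.page = self.data[self.pos:self.pos + PAGE_SIZE]
            self.pos += PAGE_SIZE
            self.page_pos = 0
        t = TUPLE.unpack_from(self.page, self.page_pos)
        self.page_pos += TUPLE.size
        return t

    def close(self):
        self.page = b""            # release the per-query context

# Pack four tuples into two pages and scan them back.
data = b"".join(TUPLE.pack(i, i * 10) for i in range(4))
s = HdfsScanner()
s.init(data, offset=0, length=len(data))
rows = []
while (row := s.get_next()) is not None:
    rows.append(row)
s.close()
print(rows)   # [(0, 0), (1, 10), (2, 20), (3, 30)]
```

Because the scanner is confined to the (offset, length) range given at init(), the same code serves every Map instance: each one scans only its own block of the shared HDFS file.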
A data cache is also implemented in the storage engine, using the usual LRU eviction policy.

The schema definition of the HDFS dataset must be registered in the database so that queries can be executed without syntax or semantics errors. This is achieved through the cooperation of the data loader, the database query engine, and the storage engine. When a new table is to be created, the data loader creates the necessary data files for the table and issues a 'create table' statement to all nodes. The database query engine processes this DDL (Data Definition Language) statement and records the information about the table in its metadata. The query engine then calls the create() function of the storage engine with the necessary parameters. In create(), the normal routine would be to create data files and other required data structures, but in our implementation it just opens the data files already created by the data loader and initializes the context. After executing the command, the data loader is ready to load the incoming data into the new table's data file, and all databases can serve queries on this table.

Every database query engine is now able to see the whole dataset on HDFS. However, it can only execute queries at the table level, so one cannot specify which part of the table a query should process. This does not suit MapReduce, which logically splits the dataset into blocks and assigns each Map instance a block to process. To fit this paradigm, the database must be able to process data at the block level rather than the table level. We use a pseudo column to achieve this goal. Besides the columns for the data, we introduce an additional pseudo column blk, which exists in the metadata of the database but is not actually stored. This column is used to pass parameters describing the position of the data block that needs processing. Given a data block, the Map instance adds a predicate on the pseudo column to the where clause of the SQL query. When this modified query is executed, the position constants are passed to the storage engine, so only the tuples in the indicated storage range are read in. Throughout this process, the query engine works as before, unaware of the mechanism. With this in place, the database engines are fully integrated into the Hadoop framework.

C. Global Index Mechanism
One of the most useful auxiliary data structures in a DBMS is the index. For certain queries, index-assisted execution improves efficiency. For example, with a predicate of high selectivity, the tuples qualifying for the query are only a small part of the whole; a brute-force scan of the entire dataset wastes a great deal of work, whereas an index on the predicate attribute allows the qualifying tuples to be retrieved directly.

In a parallel DBMS, there are two kinds of indexes in terms of locality. A local index resides on one node and searches only the local dataset on that node; a global index references data across the whole cluster, and the global index itself is usually distributed across all the nodes. When using a local index, all nodes must be launched, because all the local indexes search the same value domain. Control overhead, such as the setup and cleanup of sub-tasks, then occupies a large part of the total running time. In contrast, when using a global index, only the nodes possessing qualifying index entries need to be started. However, global index access incurs global communication, which should not be neglected. HadoopDB consists of single-machine DBMSs that cannot make use of a global index, so it supports only the local index mechanism.

In our integrated system, the data are replicated and distributed by HDFS, so the DBMS layer has no knowledge of data locality, which means a local index makes no sense. We therefore implement a global index mechanism. The index mechanism must take the MapReduce execution style into account, so that index access can be parallelized within this framework.

1) Index Creation
The indexer is responsible for building a B+-tree index on the dataset, and the index file is stored in HDFS so that it can be accessed by all nodes. Because there will only be read requests, the index does not need to support update operations, so the entries in each B+-tree node are densely packed, leaving no free space for later inserts. We create the B+-tree index as follows: first, sort the (value_on_index_attribute, offset) pairs from all the records and write them sequentially into the index file, directly forming the leaf nodes of the tree; then scan the leaf nodes, create all the intermediate nodes and the root node in a bottom-up fashion, and append them to the index file. A traditional B+-tree index may be created through insert operations in an online mode, which leaves the leaf nodes not physically contiguous in the file but connected by pointers. Our approach guarantees that the leaf nodes occupy a contiguous range of the index file, which facilitates parallel access to the leaf nodes during a query. The structure of the index is illustrated in Fig. 4.

Figure 4. Leaf nodes occupy continuous space in the index file, and Maps work on selected leaf node data blocks.
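The bulk-loading procedure above — sort the (value, offset) pairs, write them as a contiguous run of dense leaf nodes, then build the upper levels bottom-up and append them — can be sketched as follows. This is an illustrative in-memory Python model, not the on-HDFS format: the fan-out, the flat list standing in for the index file, and all function names are invented for the example.

```python
FANOUT = 4  # illustrative fan-out; real nodes would be page-sized

def bulk_build(pairs):
    """Build a dense B+-tree as a flat node list (the 'index file').

    Leaves occupy positions 0 .. n_leaves-1, i.e. a contiguous prefix
    of the file; internal levels are appended after them bottom-up.
    Returns (nodes, root_position, n_leaves).
    """
    pairs = sorted(pairs)                  # step 1: sort (value, offset) pairs
    nodes, max_key = [], []                # max_key[p]: largest key under node p
    for i in range(0, len(pairs), FANOUT): # step 2: dense, contiguous leaves
        chunk = pairs[i:i + FANOUT]
        nodes.append(chunk)
        max_key.append(chunk[-1][0])
    n_leaves = len(nodes)
    level = list(range(n_leaves))
    while len(level) > 1:                  # step 3: upper levels, appended
        next_level = []
        for i in range(0, len(level), FANOUT):
            children = level[i:i + FANOUT]
            next_level.append(len(nodes))
            nodes.append([(max_key[c], c) for c in children])
            max_key.append(max_key[children[-1]])
        level = next_level
    return nodes, level[0], n_leaves

def search(nodes, root, n_leaves, key):
    """Descend from the root to the leaf that may contain key."""
    pos = root
    while pos >= n_leaves:                 # internal nodes sit after the leaves
        for k, child in nodes[pos]:
            if key <= k:
                pos = child
                break
        else:
            return None                    # key exceeds every key in the tree
    for k, off in nodes[pos]:
        if k == key:
            return off
    return None
```

Because the leaves are written first and in key order, any subrange of qualifying entries maps to a contiguous run of leaf positions, which is exactly what the parallel index access in Section IV.C.2 relies on.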
2) Index Access
To support the index access method, we add another pseudo column, idx, to the table schema and modify the storage engine implementation accordingly. In the case of index access, the idx and blk attributes are used together to instruct the storage engine.

When a query carries a predicate of high selectivity on the indexed attribute, the index access method is chosen. Before the MapReduce tasks start, several traversals of the index are performed to locate the start and end positions in the leaf nodes for each predicate value. The index entries in the leaf nodes between the start and end positions are those pointing to records satisfying the predicate. If the predicate is a range, only two traversals are needed. Because the height of the tree is usually very small, this step takes little time compared to the subsequent MapReduce processing.

According to the start and end positions, Maps are generated and attached to the leaf node pages in the selected ranges, as shown in Fig. 4. Each Map adds predicates on blk and idx to the where clause of the SQL query. The idx parameter specifies the index to be used, and the blk parameters now indicate the start and end offsets of the leaf nodes. During the execution of the sub-query, the storage engine scans the index entries in the selected range of leaf nodes and retrieves each tuple using the offset in the index entry. The other phases of the execution are the same as in the full scan case.

In this execution mode, the number of Maps is related to the selectivity of the predicate. Using the index, a minimal number of Maps are generated, and only the qualifying records are read in.

D. Summary
We have proposed a new system architecture that integrates modified database engines into Hadoop as a read-only execution layer. Data replication is handled naturally by HDFS. The modified database engine can process the data from an HDFS file at the block level, so it fits well into MapReduce. A global index access mechanism is added in accordance with the MapReduce paradigm. The loading speed can also be guaranteed using HDFS. This integrated system serves our application well. The essential difference from HadoopDB is that we construct our system on a Hadoop basis rather than a parallel DBMS basis: the DBMS provides efficient operators, while data management is handled by the other Hadoop-based components.

V. EXPERIMENTS

A. Configurations
The experiments are conducted on a cluster of 15 nodes connected by gigabit Ethernet, which is part of a public computing platform. Each node has two dual-core AMD Opteron™ 275 processors, 8GB of DRAM, and a 136GB SCSI disk. The operating system kernel is Linux 2.6.9-4.2.ELsmp x86_64. The sequential I/O bandwidth of the local file system is about 60MB/s. Hadoop 0.19.2 is set up on the cluster, and a MySQL 5.0 server runs on each node.

The benchmark comes from our application. Although it is from a specific domain, its data schema and operations are common to many other applications. The data schema is a table with 9 integer columns: time, systemID, deviceID, eventType, port, inBytes, outBytes, inPackets, and outPackets. time is the number of seconds since the Epoch; it starts at 1235750430 in the benchmark and increases by 1 every 131072 records. systemID is uniformly distributed in the integer range [1, 15]. deviceID and eventType are uniformly distributed in the integer ranges [1, 50] and [100000000, 100000014], respectively. port, inBytes, outBytes, inPackets, and outPackets are all uniformly distributed in the integer range [0, 65535]. The whole dataset contains about 471,859,200 records spanning one hour.

The queries we use are generated from

    SELECT truncate(time/60, 0), systemID, deviceID, eventType,
           sum(inBytes), sum(outBytes), sum(inPackets), sum(outPackets)
    FROM table [WHERE port IN (port_list)]
    GROUP BY truncate(time/60, 0), systemID, deviceID, eventType

where the where clause may vary across experiments. This query groups and summarizes the data by the systemID, deviceID, and eventType attributes at minute granularity, with a predicate on the port attribute.

We compare three systems: Hadoop, a HadoopDB-like system (HadoopDB-L for short) implemented on top of MySQL, and our database-engine-integrated Hadoop system (DBEHadoop for short). The data in Hadoop are in text format with columns separated by spaces, and the whole benchmark dataset occupies about 25GB. The data in HadoopDB-L and DBEHadoop are in paged binary format with each attribute value occupying four bytes, and the whole dataset occupies about 15GB. All systems are configured with 2 replicas at 64MB block granularity, so the actual storage space for the dataset doubles. The data of HadoopDB-L are manually replicated across all nodes, and each 64MB block of data is simulated using a separate table. For each block (table), there is a local index on the port attribute. In DBEHadoop, a global index on port is created. Because we run on a public platform, we set the maximum number of parallel Maps per node to 3, which is fewer than the number of cores per node.

B. Query without Predicate
The first experiment uses the query without the where clause, so all the data must be scanned. The result set contains 2,287,500 records. The system buffer is cleared before execution.

Fig. 5 shows the running time of the three systems for this query. The execution time of Hadoop is much longer than that of HadoopDB-L and DBEHadoop. Two factors affect the performance of Hadoop: the raw size of the data file, and the CPU efficiency during execution.
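The data-size factor translates directly into Map counts: one Map instance is launched per 64MB HDFS block of input, so the 25GB text dataset and the 15GB paged binary dataset imply different numbers of Maps. The arithmetic below is a back-of-the-envelope check based only on the sizes and block granularity given above, not on figures reported in the paper.

```python
import math

BLOCK = 64 * 1024 ** 2   # 64MB HDFS block size, as configured above

def num_maps(dataset_bytes, block=BLOCK):
    """One Map instance per HDFS block of input data."""
    return math.ceil(dataset_bytes / block)

# Dataset sizes from the benchmark: ~25GB text (Hadoop),
# ~15GB paged binary (HadoopDB-L and DBEHadoop).
text_maps = num_maps(25 * 1024 ** 3)     # Maps for the text-format dataset
binary_maps = num_maps(15 * 1024 ** 3)   # Maps for the binary-format dataset
```

At roughly 400 versus 240 blocks, the text-format dataset carries both more disk I/O and more per-Map control overhead.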
The raw data file in Hadoop is larger than in the other two systems, so the disk I/O is larger in Hadoop. In addition, a larger data file needs more Maps to process it, which incurs more control overhead. In terms of CPU efficiency, partly due to the Java implementation, Hadoop is less I/O bound when processing a large number of records with many columns, as in our case, especially in text format. For this query, Hadoop consumes much more CPU time.

Figure 5. Running time for the query without predicate.

The performance of HadoopDB-L and DBEHadoop is almost the same. They have similar data formats, so the data sizes are similar. Reading data through HDFS does not cause much overhead for DBEHadoop because of the paged sequential I/O, which makes the two systems behave essentially the same in query execution. Using the database for the underlying execution improves CPU efficiency, so these systems are more I/O bound than Hadoop.

C. Query with Predicate
We now add a predicate to the query: only the records with a port value in port_list are processed. With a predicate of high selectivity, there is an opportunity to use an index to accelerate execution, so we conduct this set of experiments under high-selectivity predicates, where an index makes sense. We select 1, 10, and 20 ports, respectively, and repeat the experiment. Because the value of port is uniformly distributed, the number of selected ports rather than the specific values determines the performance. The system buffer is cleared before execution.

Fig. 6 shows the results for the three systems using full scan execution, which scans all the data and applies the predicate to qualify the records. The time is shorter than without the predicate for each system, because fewer records need processing after the predicate is applied. Full scan shows the same performance in all cases for each system, because with only a small number of qualifying records, the I/O and control cost dominate the total time. Hadoop takes more time than the other systems, mainly because of the larger data size (more I/O and more Map instances). HadoopDB-L and DBEHadoop show similar performance.

Figure 6. Running time for full scan execution.

Next we compare the performance of HadoopDB-L and DBEHadoop with index assistance; Fig. 7 shows the results. For the case of one port, index access beats full scan, because random reads of a very small number of records outperform a sequential scan of the large dataset. For HadoopDB-L, index access is only slightly better than full scan, because in both modes each data block needs a Map that does very little actual work due to the high selectivity, so control overhead takes a considerable part of the total running time. DBEHadoop is much more efficient than HadoopDB-L because of the very small number of Maps needed. For the case of 10 ports, index access performs almost the same as full scan execution, which indicates that index access is no longer superior beyond this point. For HadoopDB-L, the cost of random reads offsets the benefit of the smaller I/O volume; for DBEHadoop, random reads and communication cost together offset the benefit. For the case of 20 ports, DBEHadoop's index access method is much more expensive than full scan. Compared to a local index, the cost of global index access grows more steeply because of the network communication added to the random disk I/O, which offsets the benefit of the small number of Maps. However, under these conditions the full scan method would be chosen anyway.

Figure 7. Full scan vs. index access.
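The tradeoff just observed — index access wins at very high selectivity but loses to a full scan as more ports are selected, because random reads, Map startup, and (for a global index) network traffic all grow with the number of qualifying records — can be illustrated with a toy cost model. Every constant below is hypothetical, chosen only to exhibit the crossover; none of them are measurements from the experiments.

```python
# Hypothetical per-unit costs in arbitrary time units (NOT measured values).
SEQ_PAGE = 1.0       # sequential read of one page (full scan)
RAND_READ = 8.0      # one random record fetch via the index
NET_PER_REC = 2.0    # global-index network cost per qualifying record
MAP_SETUP = 50.0     # startup/cleanup overhead per Map instance
RECS_PER_PAGE = 100  # assumed records per page

def full_scan_cost(total_pages, maps):
    """Sequential scan of everything, one Map per data block."""
    return total_pages * SEQ_PAGE + maps * MAP_SETUP

def index_access_cost(qualifying_records, maps):
    """Random fetch (plus network) per qualifying record, few Maps."""
    return qualifying_records * (RAND_READ + NET_PER_REC) + maps * MAP_SETUP

def cheaper_plan(total_pages, selectivity, scan_maps, index_maps):
    q = int(total_pages * RECS_PER_PAGE * selectivity)
    if index_access_cost(q, index_maps) < full_scan_cost(total_pages, scan_maps):
        return "index"
    return "scan"
```

With these toy constants, a selectivity of one in ten thousand favors the index, while a few percent already favors the scan, mirroring the 1-port versus 20-port behavior above.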
D. Execution under Warm Buffer
In many cases, the same set of data may be queried by different users, so it is necessary to evaluate performance under a warm buffer. The previous experiments are repeated in the same way, except that the data are already in the system buffer cache.

Fig. 8 shows the results for full scan execution. Without the predicate, the time for Hadoop is similar to that under a cold buffer, due to its CPU-bound behavior; HadoopDB-L and DBEHadoop are faster than under a cold buffer. With a predicate of high selectivity, all three systems take much less time than in the cold-buffer case, a result of the saved I/O cost. The performance gap between Hadoop and the other two systems becomes smaller, because the reduced amount of work hides the inefficiency of Hadoop to some degree. HadoopDB-L and DBEHadoop show similar performance in all cases.

Figure 8. Running time for full scan execution under warm buffer.

Fig. 9 shows the results for index access execution. Unlike the cold-buffer case, DBEHadoop's index access method matches full scan when 20 ports are selected, while HadoopDB-L's index access method performs almost the same as full scan in all cases. Without I/O cost, the small number of Maps is the main contributor to the performance improvement.

Figure 9. Running time for index access execution under warm buffer.

E. Data Loading
To test the streaming data loading speed, we start several loaders in the cluster, one per node. Each loader generates the data itself and loads it into the system. This ignores the overhead of communication between the data source and the loader, and thus evaluates the raw loading speed of the underlying system. For HadoopDB-L, the loader inserts the records into the local MySQL server through prepared batch inserts using JDBC, and no replication is considered. For DBEHadoop, the loader loads records through routines that format the data into pages and write them to HDFS, with replication of degree 2 maintained automatically by HDFS.

Fig. 10 gives the results, which show that DBEHadoop is much faster than HadoopDB-L. HadoopDB-L must load the data through the DBMS logic, so its speed is very low, especially in the online mode for streaming data. Although only MySQL is used here, the speed is of the same order of magnitude for other DBMSs, as far as we know. DBEHadoop retains Hadoop's high loading speed through direct writes to HDFS. The replication mechanism causes extra overhead, and when multiple loaders work in parallel the overhead grows because of the increased network traffic; even so, the loading speed is fast enough for common applications.

Figure 10. Data loading speed.

F. Summary
These experiments show that DBEHadoop is as efficient as HadoopDB-L for full scan queries, and that for queries with high-selectivity predicates, the global index access mechanism adopted in DBEHadoop is much more efficient than HadoopDB-L's local indexes. For data loading, DBEHadoop achieves very good performance, far better than HadoopDB-L.

VI. CONCLUSION
Neither Hadoop nor DBMSs are ideal for large-dataset analysis. HadoopDB, as an integrated system merging techniques from both, is promising, but it remains limited for reasons that are difficult to overcome. We believe that it is the way HadoopDB is constructed that makes it hard to satisfy emerging applications. Taking a Hadoop basis, rather than a DBMS basis, and incorporating DBMS techniques is the right way to construct systems for large-scale data processing applications.
We propose a new system architecture that integrates modified DBMS engines into Hadoop as a read-only execution layer, where the DBMS plays the role of providing efficient operators instead of managing the data. Besides sharing HadoopDB's advantages, our system removes the limitations HadoopDB poses in real scenarios. The HDFS-based storage solves the fault tolerance problem in the data layer. The modified database engine can process data from HDFS files at the block level, so it fits well into MapReduce. A global index access mechanism adapted to the MapReduce paradigm is added and shows better performance than HadoopDB for certain queries. The proposed system retains Hadoop's advantage in data loading speed, which is far better than HadoopDB's. All these properties make the system well suited to large-scale dataset analysis applications.

ACKNOWLEDGMENT
We would like to thank the anonymous reviewers for their valuable feedback on this work. This research is supported by the National Natural Science Foundation of China (Grant No. 60903047).