HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple Twig Pattern Queries

Hyebong Choi (1), Kyong-Ha Lee (1), Soo-Hyong Kim (1), Yoon-Joon Lee (1) and Bongki Moon (2)
(1) Computer Science Department, KAIST, Korea
(2) Computer Science Department, University of Arizona, USA
hbchoi@dbserver.kaist.ac.kr, bart7449@gmail.com, kimsh@dbserver.kaist.ac.kr, yoonjoon.lee@kaist.ac.kr, bkmoon@cs.arizona.edu




Motivation

Big data in XML
  ▶ More than 100GB of protein sequences and their functional information are provided in XML format and updated every four weeks [2].
  ▶ Conventional XML tools such as single-site XML DBMSes and XML pub/sub systems fail to process XML data of that size.

XML DB: eXist [9] and BaseX [10]

                 eXist [9]                                BaseX [10]
  Data size      Loading time    Query processing         Loading time    Query processing
                                 w/ 4000 twig queries                     w/ 4000 twig queries
  1GB            5m 54s          failed                   2m 1s           2h 48m 7s
  10GB           1h 5m 21s       failed                   19m 36s         30h 11m 34s
  100GB          failed          -                        failed          -

YFilter [5]

  Data size      Filtering time    Postprocessing time (twig pattern join)
  1MB            2m 4s             0.264s
  10MB           20s 14s           16s
  100MB          3h 22m 6s         1h 1m 37s
  1GB            failed            -

HadoopXML
  ▶ HadoopXML efficiently processes many twig pattern queries over a massive volume of XML data in parallel.

System Architecture

[Figure: system overview]
  A large XML file, XPath queries -> Preprocessing -> XML blocks, query index -> 1st M/R job -> Path solutions -> 2nd M/R job -> Final answers

[Figure: preprocessing]
  XPath queries -> Query decomposition -> Path patterns, relationship btw. paths and twigs -> Query index builder -> Query index
  A large XML file -> Partitioning & labeling -> XML blocks, label blocks -> Copy to HDFS
  HDFS: XML block1 / Label block1, XML block2 / Label block2, ..., XML blockn / Label blockn (block collocation)

[Figure: path query processing (1st M/R job)]
  Input: XML block and label block pairs; the query index is shared through the distributed cache
  Mappers: Path filtering -> Path solutions <Path ID, label>
  Reducers: Counting solutions -> Path solutions <Path ID, a list of labels>
  Output: path solutions and size information for path solutions

[Figure: twig pattern join (2nd M/R job)]
  Size information for path solutions, relationship btw. path patterns & twig patterns -> Multi query optimizer -> Distributed cache
  Mappers: read path solutions and tag each with a Reducer ID
  Shuffle by ReducerId
  Reducers: Holistic twig join -> Final answers

Performance

Experimental environment
  Hadoop 0.21.0 [1], CentOS 6.2, 1Gb switching hub
  1 master: AMD Athlon II x4 620 (4 cores), 8GB memory, 7200 RPM HDD
  8 slaves: i5-2500k processor (4 cores), 8GB memory, 7200 RPM HDD

XML dataset statistics

  Filename            UniRef100    UniParc    UniProtKB    XMark1000
  File size (MB)      24,500       37,436     105,745      114,414
  # of elements       335M         360M       2,110M       1,670M
  # of attributes     589M         1,215M     2,783M       383M
  Depth in avg.       4.5649       3.7753     4.3326       4.7375
  Max depth           6            5          7            12
  # distinct paths    30           24         149          548

[Figure: Loading time]
[Figure: Overall execution time (synthetic dataset, real-world dataset)]
[Figure: Effect of converting paths to distinct paths]
[Figure: Effect of block collocation]
[Figure: Effect of multi query optimization]

Working Example

Example.xml

  <region>
    <Africa>
      <item id="item0">
        <quantity>1</quantity>
        <payment>Creditcard</payment>
      </item>
      <item id="item1">
        <quantity>1</quantity>
        <payment>Money order</payment>
      </item>
    </Africa>
    <Asia>
      <item id="item135">

Preprocessing (partitioning & labeling)

  XML block    Begins with          Label block    Path offset       Labels
  block_1      <region>             Label_1        /                 (1, 24, 1), (2, 15, 2), (3, 8, 3), (4, 5, 4), (6, 7, 4)
  block_2      <item id="item1">    Label_2        /region/Africa    (9, 14, 3), (10, 11, 4), (12, 13, 4), (16, 23, 2)

  The label values are consistent with (start, end, level) region numbering: each start tag and each end tag advances a single counter, and level is the element's depth with the root at level 1.

1st M/R (path filtering)

  Path query ID    Path solutions
  1.1              (3, 8, 3), (9, 14, 3)
  1.2              (4, 5, 4), (10, 11, 4)
  1.3              (6, 7, 4), (12, 13, 4)
  2                (17, 22, 3)

2nd M/R (twig pattern join)

  Twig query ID    Path solutions
  1                (6, 7, 4), (12, 13, 4)
  2                (17, 22, 3)
                                                                                                                     <quantity>2</quantity>
                                                                                                                                                                                                                        block_3          Label_3
 ‐ Block partitioning with no loss of structural information                                                         <payment>Personal Check</payment>
                                                                                                                                                                                                 <quantity>2</quantity>          /region/Asia
                                                                                                                                                                                                 <payment>Personal Check</payment>
                                                                                                                                                                                                                                 17, 22, 3
                                                                                                                  </item>                                                                      </item>                                                                                Path query        Count
                                                                                                                                                                                                                                 18, 19, 4
                                                                                                                </Asia>                                                                      </Asia>                                                                                      ID
 ‐ Path filtering with NFA‐style query indexes [5]                                                            </region>                                                                    </region>
                                                                                                                                                                                                                                 20, 21, 4
                                                                                                                                                                                                                                                                                               1.1               2           Multi query
                                                                                                                                                                                                  Path query ID               Path query                                                       1.2               2            optimizer
 ‐ I/O optimal Holistic twig pattern joins [3]                                                          Twig query ID
                                                                                                             1
                                                                                                                                            Twig query
                                                                                                                        /region/Africa/item[quantity]/payment
                                                                                                                                                                        Query decomposition            1.1         /region/Africa/item                                                         1.3               2
                                                                                                                                                                                                       1.2         /region/Africa/item/quantity          A  query index                         2                1
                                                                                                             2          //Asia/item
                                                                                                                                                                           & Converting to 
▶ Simultaneous processing of multiple twig pattern 
                                                                                                                                                                                                       1.3         /region/Africa/item/payment
                                                                                                             …           .  . . .  .
                                                                                                                                                                          root‐to‐leaf paths            2          /region/Asia/item



  queries                                                                                                                                                                                                                                                                  Load Balancing & 
 ‐ Many twig pattern joins are distributed across nodes and 
                                                                                                                                                  Path Filtering
                                                                                                                                                                                                                                                                        Multi Query Optimization                                                                                                                     References
                                                                                                             <item id=“item1”>
                                                                                                                   <quantity>1</quantity>
                                                                                                                                           block_2                                 /region/Africa      Label_2                         ▶ Twig pattern join, a specialized multi‐way join that reads multiple                                                                         [1] Hadoop. http://hadoop.apache.org, Apache Software Foundation.
    executed in parallel                                                                                           <payment>Money order</payment>                                  9, 14, 3
                                                                                                                                                                                                                                         path solutions                                                                                                                              [2] A. Bairoch et al. The universal protein resource (uniprot). Nucleic acids 
                                                                                                                 </item>                                                           10, 11, 4
                                                                                                                                                                                   12, 13, 4                                              ‐ With static one‐to‐one shuffling scheme, i.e. given partitioned path solutions, reducers                                                    research, 33(suppl 1):D154–D159, 2005.
▶ Optimization of the I/O cost in MapReduce jobs                                                               </Africa>
                                                                                                               <Asia>                                                              16, 23, 2                                               generate incomplete join results                                                                                                          [3] N. Bruno et al. Holistic twig joins: optimal xml pattern matching. In 
                                                                                                                                                                                                                                                                                                     Reducer1                                Missing results!
                                                                                                                                                                                                                                                                                                      Q1: A1 join B1 join C1                 A1 join B2 join C2                         Proceedings of ACM SIGMOD, pages 310–321. ACM, 2002.
 ‐ Sharing input scans and intermediate path solutions                                                                                           startElement(“region”)                                                                                 A1                                            Q2: A1 join C1 join D1                 A2 join B1 join C2
                                                                                                                                                 startElement(“Africa”)                                                                                                                                                                                                              [4] J. Dean et al. Mapreduce: Simplified data processing on large clusters. 
                                                                                                                                                 & SAX events from block_2                                                                                         B1                                 Q3: A1 join B1 join D1
 ‐ Converting redundant path patterns  with {//, *} to a few                                                                                                                                                                                            A2
                                                                                                                                                                                                                                                                                                                                                        …
                                                                                                                                                                                                                                                                                                                                                                                        Communications of the ACM, 51(1):107–113, 2008.
                                                                                                                                                                                                                                                                             C1
                                                                                                                    NFA style                                                                                                            Path solutions  A         B2                 D1             Reducer2                                       Input queries                    [5] Y. Diao et al. Path sharing and predicate evaluation for high‐performance xml 
    distinct root‐to‐leaf paths                                                                                    Query index                                region              1st Mapper                                                                       B         C2                       Q1: A2 join B2 join C2                      Q1: A join B join C                   filtering. ACM Transactions on Database Systems, 28(4):467–516, 2003. 
                                                                                                                                                         &1                                                                                                                           D2              Q2: A2 join C2 join D2                      Q2: A join C join D
                                                                                                                                            Africa                                                                                                                           C
                                                                                                                                                                                                                                                                                                      Q3: A2 join B2 join D2                      Q3: A join B join D                [6] K. Lee et al. Parallel data processing with MapReduce: a survey. ACM 
 ‐ Collocation of XML blocks and corresponding label blocks                                                                                                        Asia
                                                                                                                                                                                                                                                                                      D                                                                                                 SIGMOD Record, 40(4):11–20, 2011.
                                                                                                                                                &2               &3
                                                                                                                                        item                                                                                           ▶ Runtime one‐to‐many data shuffling                                                                                                          [7] Q. Li et al. Indexing and querying xml data for regular path expressions. In 
▶ Runtime load balancing & multi query optimization                                                                         1.1
                                                                                                                                                                       item
                                                                                                                                                                                                                                          ‐ It distributes both queries and data at runtime                                                                                             Proceedings of VLDB, pages 361–370, 2001.
                                                                                                                                          &4                     &5
                                                                                                                     quantity                   payment                                                                                   ‐ Path solutions can be redundantly copied to reducers, involving redundant I/Os
                                                                                                                                                                 2                                                                                                                                                                                                                   [8] T. Nykiel et al. MRshare: Sharing across multiple queries in MapReduce. 
 ‐ XML twig queries may share path patterns each other                                                                                                                              Runtime stack                                         ‐ a straggling reduce task dominates the overall performance of M/R jobs
                                                                                                                                                                                                                                                                                                                                                                                        Proceedings of the VLDB Endowment, 3(1‐2):494–505, 2010
                                                                                                                                                                                                                                          ‐ Optimization problem : find  the optimal way that distributes queries and path solutions across 
                                                                                                                                 &6             &7                                                                                         reducers  so that every reducer is assigned even workload
 ‐ For I/O reduction and workload balance, twig pattern                                                                                                                                                                                                                                                                                                                              [9] W. Meier. eXist: An open source native XML database. Web, Web‐Services, 
                                                                                                                                1.2            1.3                                                                                                                                          Reducer1
                                                                                                                                                                                                                                                         30                                 Q1: A join B join C                                                                         and Database Systems 2002, LNCS 2593, Springer, Berlin (2002), pp. 169–183
    queries that share path patterns are grouped together                                                                                 Path solution                                                                                  Path solutions A     80                         cost = |A|+|B|+|C| = 200        Input queries                                               [10] C. Grün et al. BaseX ‐ Processing and Visualizing XML with a native XML 
                                                                                                                                          For block_2                                                                                                                                                                                               Q1: A join B join C                Database, http://www.basex.org/, 2010.
                                                                                                                                                                                                                                                                        90                               Reducer2
                                                                                                                                                 Path query ID Path solution                                                                                                                                                                        Q2: A join C join D
 ‐ The twig query groups are assigned to reducers at                                                                                                             1.1           9, 14, 3
                                                                                                                                                                                                                                                               B                                         Q2: A join C join D
                                                                                                                                                                                                                                                                                                                                                    Q3: A join B join D
                                                                                                                                                                                                                                                                                                         Q3: A join B Join D
                                                                                                                                                                 1.2          10, 11, 4                                                                                 C
    runtime such that every reducer has the same overall                                                                                                                                                                                                                          5                   cost = |A|+|C|+|D| +
                                                                                                                                                                 1.3          12, 13, 4                                                                                           D                         |A|+|B|+|D| = 240 

    cost of join operations
                                                                                                                                                      This work was partly supported by NRF grant funded by the Korea government (MEST) (no. 2011‐0016282)  
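
The cost figures in the load‐balancing example can be reproduced with a short sketch. It assumes the path‐solution sizes are |A| = 30, |B| = 80, |C| = 90 and |D| = 5 (the poster shows the four numbers; which of 80 and 90 belongs to B or C is a guess and does not change the totals). The function names and the brute‐force search are hypothetical illustrations, not the optimizer used by HadoopXML.

```python
from itertools import product

# Sizes of the path solutions as read from the poster figure.
# Assumption: B = 80 and C = 90 (swapping them leaves every total unchanged).
SIZES = {"A": 30, "B": 80, "C": 90, "D": 5}

# Each twig query joins the path solutions listed for it on the poster.
QUERIES = {
    "Q1": ("A", "B", "C"),
    "Q2": ("A", "C", "D"),
    "Q3": ("A", "B", "D"),
}


def query_cost(query):
    """Cost of one join: the total size of the path solutions it reads."""
    return sum(SIZES[name] for name in QUERIES[query])


def reducer_costs(assignment, num_reducers):
    """Sum the query costs per reducer, as in the poster's cost formula."""
    costs = [0] * num_reducers
    for query, reducer in assignment.items():
        costs[reducer] += query_cost(query)
    return costs


def best_assignment(num_reducers):
    """Brute-force search for the assignment with the cheapest slowest reducer."""
    names = sorted(QUERIES)
    best = None
    for combo in product(range(num_reducers), repeat=len(names)):
        assignment = dict(zip(names, combo))
        worst = max(reducer_costs(assignment, num_reducers))
        if best is None or worst < best[0]:
            best = (worst, assignment)
    return best


# Assignment drawn on the poster: Q1 on Reducer1, Q2 and Q3 on Reducer2.
poster_assignment = {"Q1": 0, "Q2": 1, "Q3": 1}
print(reducer_costs(poster_assignment, 2))  # [200, 240]
print(best_assignment(2))
```

Under this simple cost model no other split of the three queries over two reducers has a cheaper slowest reducer than 240. The exhaustive search only works for a handful of queries; a real workload would need a heuristic.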

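
The label triples in the figures can be read as (start, end, level) region labels. That reading is an assumption: the poster does not spell it out, but it matches the numbers (9, 14, 3 encloses 10, 11, 4 and 12, 13, 4 one level deeper). Under that assumption, the sketch below evaluates twig query 1, /region/Africa/item[quantity]/payment, over path solutions taken from the figures. It is a naive nested‐loop join for illustration only, not the holistic twig join of [3], and all names in it are hypothetical.

```python
from typing import List, NamedTuple


class Label(NamedTuple):
    """Region label of one XML element (assumed meaning of the triples)."""
    start: int
    end: int
    level: int


def is_ancestor(a: Label, d: Label) -> bool:
    """True if the region of a strictly encloses the region of d."""
    return a.start < d.start and d.end < a.end


def is_parent(p: Label, c: Label) -> bool:
    """True if p encloses c and c is exactly one level deeper."""
    return is_ancestor(p, c) and c.level == p.level + 1


def naive_twig_join(items: List[Label],
                    quantities: List[Label],
                    payments: List[Label]) -> List[Label]:
    """Return payments whose parent item also has a quantity child."""
    results = []
    for item in items:
        if any(is_parent(item, q) for q in quantities):
            results.extend(p for p in payments if is_parent(item, p))
    return results


# Path solutions from the figures: 1.1 for block_2, 1.2 and 1.3 as listed.
items = [Label(9, 14, 3)]
quantities = [Label(4, 5, 4), Label(10, 11, 4)]
payments = [Label(6, 7, 4), Label(12, 13, 4)]

print(naive_twig_join(items, quantities, payments))
# [Label(start=12, end=13, level=4)]
```

Because the structural test needs only the three integers, a reducer can join path solutions without re‐reading the XML text, which is the point of keeping label blocks alongside the XML blocks.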
Weitere ähnliche Inhalte

Ähnlich wie A poster version of HadoopXML

Game cloud reference architecture
Game cloud reference architectureGame cloud reference architecture
Game cloud reference architecture
Jonathan Spindel
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
Mahantesh Angadi
 
Myanmar Alphabet Recognition System Based on Artificial Neural Network
Myanmar Alphabet Recognition System Based on Artificial Neural NetworkMyanmar Alphabet Recognition System Based on Artificial Neural Network
Myanmar Alphabet Recognition System Based on Artificial Neural Network
ijtsrd
 
Cloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsCloudy with a Touch of Cheminformatics
Cloudy with a Touch of Cheminformatics
Rajarshi Guha
 
PAGE: A Partition Aware Engine for Parallel Graph Computation
PAGE: A Partition Aware Engine for Parallel Graph ComputationPAGE: A Partition Aware Engine for Parallel Graph Computation
PAGE: A Partition Aware Engine for Parallel Graph Computation
1crore projects
 

Ähnlich wie A poster version of HadoopXML (16)

Mastering Differentiated MDSD Requirements at Deutsche Boerse AG
Mastering Differentiated MDSD Requirements at Deutsche Boerse AGMastering Differentiated MDSD Requirements at Deutsche Boerse AG
Mastering Differentiated MDSD Requirements at Deutsche Boerse AG
 
제1회 Korea Community Day 발표자료 Bigdata
제1회 Korea Community Day 발표자료 Bigdata 제1회 Korea Community Day 발표자료 Bigdata
제1회 Korea Community Day 발표자료 Bigdata
 
NASA Facilities GIS
NASA Facilities GISNASA Facilities GIS
NASA Facilities GIS
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce FrameworkBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
 
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...
 
Game cloud reference architecture
Game cloud reference architectureGame cloud reference architecture
Game cloud reference architecture
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
 
Myanmar Alphabet Recognition System Based on Artificial Neural Network
Myanmar Alphabet Recognition System Based on Artificial Neural NetworkMyanmar Alphabet Recognition System Based on Artificial Neural Network
Myanmar Alphabet Recognition System Based on Artificial Neural Network
 
Cloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsCloudy with a Touch of Cheminformatics
Cloudy with a Touch of Cheminformatics
 
PAGE: A Partition Aware Engine for Parallel Graph Computation
PAGE: A Partition Aware Engine for Parallel Graph ComputationPAGE: A Partition Aware Engine for Parallel Graph Computation
PAGE: A Partition Aware Engine for Parallel Graph Computation
 
PAGE: A PARTITION AWARE ENGINE FOR PARALLEL GRAPH COMPUTATION
PAGE: A PARTITION AWARE ENGINE FOR PARALLEL GRAPH COMPUTATIONPAGE: A PARTITION AWARE ENGINE FOR PARALLEL GRAPH COMPUTATION
PAGE: A PARTITION AWARE ENGINE FOR PARALLEL GRAPH COMPUTATION
 
Page a partition aware engine
Page a partition aware enginePage a partition aware engine
Page a partition aware engine
 
PAGE: A PARTITION AWARE ENGINE FOR PARALLEL GRAPH COMPUTATION
 PAGE: A PARTITION AWARE ENGINE FOR PARALLEL GRAPH COMPUTATION PAGE: A PARTITION AWARE ENGINE FOR PARALLEL GRAPH COMPUTATION
PAGE: A PARTITION AWARE ENGINE FOR PARALLEL GRAPH COMPUTATION
 
Business Service Semantics: Ontological Representation & Governance of Busine...
Business Service Semantics: Ontological Representation & Governance of Busine...Business Service Semantics: Ontological Representation & Governance of Busine...
Business Service Semantics: Ontological Representation & Governance of Busine...
 
Visualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine LearningVisualizing and Forecasting Stocks Using Machine Learning
Visualizing and Forecasting Stocks Using Machine Learning
 
Overview of the TriBITS Lifecycle Model
Overview of the TriBITS Lifecycle ModelOverview of the TriBITS Lifecycle Model
Overview of the TriBITS Lifecycle Model
 

Mehr von Kyong-Ha Lee

SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
Kyong-Ha Lee
 

Mehr von Kyong-Ha Lee (7)

SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
 
좋은 논문 찾기
좋은 논문 찾기좋은 논문 찾기
좋은 논문 찾기
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A Survey
 
Database Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureDatabase Research on Modern Computing Architecture
Database Research on Modern Computing Architecture
 
Bitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingBitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query Processing
 
Bitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingBitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query Processing
 

A poster version of HadoopXML

  • 1. HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple Twig Pattern Queries

    Hyebong Choi (1), Kyong-Ha Lee (1), Soo-Hyong Kim (1), Yoon-Joon Lee (1) and Bongki Moon (2)
    (1) Computer Science Department, KAIST, Korea
    (2) Computer Science Department, University of Arizona, USA
    hbchoi@dbserver.kaist.ac.kr, bart7449@gmail.com, kimsh@dbserver.kaist.ac.kr, yoonjoon.lee@kaist.ac.kr, bkmoon@cs.arizona.edu

    Motivation

    ▶ Big data in XML
      - More than 100GB of protein sequences and their functional information are provided in XML format and are updated every four weeks [2]
    ▶ Conventional XML tools such as single-site XML DBMSes and XML pub/sub systems failed to process XML data of that size

    XML DBs: loading time and query processing time with 4000 twig queries

      Data size   eXist [9]                        BaseX [10]
                  Loading      Query processing    Loading      Query processing
      1GB         5m 54s       failed              2m 1s        2h 48m 7s
      10GB        1h 5m 21s    failed              19m 36s      30h 11m 34s
      100GB       failed       -                   failed       -

    YFilter [5]: data size, filtering time, postprocessing time (twig pattern join)

      1MB     2m 4s 0.264s
      10MB    20s 14s 16s
      100MB   3h 22m 6s   1h 1m 37s
      1GB     failed   -

    HadoopXML

    ▶ It efficiently processes many twig pattern queries over a massive volume of XML data in parallel
      - Block partitioning with no loss of structural information
      - Path filtering with NFA-style query indexes [5]
      - I/O-optimal holistic twig pattern joins [3]
    ▶ Simultaneous processing of multiple twig pattern queries
      - Many twig pattern joins are distributed across nodes and executed in parallel
    ▶ Optimization of the I/O cost in MapReduce jobs
      - Sharing input scans and intermediate path solutions
      - Converting redundant path patterns with {//, *} to a few distinct root-to-leaf paths
      - Collocation of XML blocks and their corresponding label blocks
    ▶ Runtime load balancing & multi query optimization
      - It distributes both queries and data at runtime
      - XML twig queries may share path patterns with each other
      - For I/O reduction and workload balance, twig pattern queries that share path patterns are grouped together
      - The twig query groups are assigned to reducers at runtime such that every reducer has the same overall cost of join operations

    System Architecture

    [Figure: system architecture. Recoverable labels:]
      - Preprocessing: a large XML file is partitioned and labeled into XML blocks (block1 ... blockn) and label blocks, which are copied to HDFS with block collocation
      - Query processing: XPath queries go through query decomposition into path queries and twig patterns; the query index builder produces a query index that is placed in the distributed cache
      - 1st M/R job: mappers perform path filtering and emit path solutions as <Path ID, label>; reducers count path solutions and emit <Path ID, a list of labels> together with size information for the path solutions
      - Multi query optimizer: uses the size information for path solutions (via the distributed cache) and the relationship between path patterns and twig patterns
      - 2nd M/R job: mappers tag path solutions with a Reducer ID; records are shuffled by ReducerId; reducers run holistic twig joins and emit the final answers

    Working Example

    Example.xml with its (start, end, level) labels, partitioned into three XML blocks and label blocks:

      block_1 / Label_1 (path offset: /)
        <region>                                    1, 24, 1
        <Africa>                                    2, 15, 2
        <item id="item0">                           3, 8, 3
          <quantity>1</quantity>                    4, 5, 4
          <payment>Creditcard</payment>             6, 7, 4
        </item>

      block_2 / Label_2 (path offset: /region/Africa)
        <item id="item1">                           9, 14, 3
          <quantity>1</quantity>                    10, 11, 4
          <payment>Money order</payment>            12, 13, 4
        </item>
        </Africa>
        <Asia>                                      16, 23, 2

      block_3 / Label_3 (path offset: /region/Asia)
        <item id="item135">                         17, 22, 3
          <quantity>2</quantity>                    18, 19, 4
          <payment>Personal Check</payment>         20, 21, 4
        </item>
        </Asia>
        </region>

    Query decomposition & converting to root-to-leaf paths:

      Twig query ID   Twig query
      1               /region/Africa/item[quantity]/payment
      2               //Asia/item

      Path query ID   Path query
      1.1             /region/Africa/item
      1.2             /region/Africa/item/quantity
      1.3             /region/Africa/item/payment
      2               /region/Asia/item

    1st M/R (path filtering), path solutions and their counts:

      Path query ID   Path solutions            Count
      1.1             3, 8, 3;  9, 14, 3        2
      1.2             4, 5, 4;  10, 11, 4       2
      1.3             6, 7, 4;  12, 13, 4       2
      2               17, 22, 3                 1

    2nd M/R (twig pattern join), final answers:

      Twig query ID   Path solution
      1               6, 7, 4;  12, 13, 4
      2               17, 22, 3

    Path Filtering

    The 1st mapper for block_2 receives startElement("region") and startElement("Africa"), reconstructed from the path offset /region/Africa, followed by the SAX events from block_2. It evaluates them against the NFA-style query index using a runtime stack.

    [Figure: NFA-style query index with transitions on region, Africa, Asia, item, quantity and payment (states &1 to &7), accepting path queries 1.1, 1.2, 1.3 and 2]

    Path solutions for block_2:

      Path query ID   Path solution
      1.1             9, 14, 3
      1.2             10, 11, 4
      1.3             12, 13, 4

    Load Balancing & Multi Query Optimization

    ▶ Twig pattern join is a specialized multi-way join that reads multiple path solutions
      - With a static one-to-one shuffling scheme, i.e. given partitioned path solutions, reducers generate incomplete join results

      Input queries: Q1: A join B join C;  Q2: A join C join D;  Q3: A join B join D
      Path solutions: A1, A2, B1, B2, C1, C2, D1, D2
      Reducer1: Q1: A1 join B1 join C1;  Q2: A1 join C1 join D1;  Q3: A1 join B1 join D1
      Reducer2: Q1: A2 join B2 join C2;  Q2: A2 join C2 join D2;  Q3: A2 join B2 join D2
      Missing results! A1 join B2 join C2, A2 join B1 join C2, …

    ▶ Runtime one-to-many data shuffling
      - Path solutions can be redundantly copied to reducers, involving redundant I/Os
      - A straggling reduce task dominates the overall performance of M/R jobs
      - Optimization problem: find the optimal way to distribute queries and path solutions across reducers so that every reducer is assigned an even workload

      Input queries: Q1: A join B join C;  Q2: A join C join D;  Q3: A join B join D
      Path solution sizes: |A| = 30, |B| = 80, |C| = 90, |D| = 5
      Reducer1: Q1: A join B join C;  cost = |A|+|B|+|C| = 200
      Reducer2: Q2: A join C join D,  Q3: A join B join D;  cost = |A|+|C|+|D| + |A|+|B|+|D| = 240

    Performance

    Experimental environment: Hadoop 0.21.0 [1], CentOS 6.2, 1Gb switching hub
      - 1 master: AMD Athlon II x4 620 (4 cores), 8GB memory, 7200 RPM HDD
      - 8 slaves: i5-2500k processor (4 cores), 8GB memory, 7200 RPM HDD

    XML dataset statistics:

      Filename            UniRef100   UniParc   UniProtKB   XMark1000
      File size (MB)      24,500      37,436    105,745     114,414
      # of elements       335M        360M      2,110M      1,670M
      # of attributes     589M        1,215M    2,783M      383M
      Depth in avg.       4.5649      3.7753    4.3326      4.7375
      Max depth           6           5         7           12
      # distinct paths    30          24        149         548

    [Figures: Loading time; Overall execution time (synthetic dataset, real-world dataset); Effect of converting paths to distinct paths; Effect of block collocation; Effect of multi query optimization]

    References

    [1] Hadoop. http://hadoop.apache.org, Apache Software Foundation.
    [2] A. Bairoch et al. The universal protein resource (UniProt). Nucleic Acids Research, 33(suppl 1):D154–D159, 2005.
    [3] N. Bruno et al. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, pages 310–321. ACM, 2002.
    [4] J. Dean et al. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
    [5] Y. Diao et al. Path sharing and predicate evaluation for high-performance XML filtering. ACM Transactions on Database Systems, 28(4):467–516, 2003.
    [6] K. Lee et al. Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4):11–20, 2011.
    [7] Q. Li et al. Indexing and querying XML data for regular path expressions. In Proceedings of VLDB, pages 361–370, 2001.
    [8] T. Nykiel et al. MRShare: Sharing across multiple queries in MapReduce. Proceedings of the VLDB Endowment, 3(1-2):494–505, 2010.
    [9] W. Meier. eXist: An open source native XML database. Web, Web-Services, and Database Systems 2002, LNCS 2593, Springer, Berlin (2002), pp. 169–183.
    [10] C. Grün et al. BaseX - Processing and Visualizing XML with a native XML Database, http://www.basex.org/, 2010.

    This work was partly supported by an NRF grant funded by the Korea government (MEST) (no. 2011-0016282).
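
The (start, end, level) labels and the path solutions in the poster's working example can be reproduced with a small single-process sketch. It assumes a plain SAX pass over the whole of Example.xml and matches root-to-leaf paths by exact string comparison; it does not implement block partitioning, the NFA-style query index or MapReduce. The names `RegionLabeler`, `label_document` and `path_solutions` are hypothetical and not part of HadoopXML.

```python
import xml.sax

# Example.xml from the poster's working example.
EXAMPLE_XML = """<region>
<Africa>
<item id="item0"><quantity>1</quantity><payment>Creditcard</payment></item>
<item id="item1"><quantity>1</quantity><payment>Money order</payment></item>
</Africa>
<Asia>
<item id="item135"><quantity>2</quantity><payment>Personal Check</payment></item>
</Asia>
</region>"""

# Root-to-leaf path queries from the poster's query decomposition.
PATH_QUERIES = {
    "1.1": "/region/Africa/item",
    "1.2": "/region/Africa/item/quantity",
    "1.3": "/region/Africa/item/payment",
    "2": "/region/Asia/item",
}


class RegionLabeler(xml.sax.ContentHandler):
    """Assigns (start, end, level) labels; one counter tick per tag."""

    def __init__(self):
        super().__init__()
        self.counter = 0
        self.stack = []   # open elements as (name, start)
        self.labels = []  # (root-to-node path, start, end, level)

    def startElement(self, name, attrs):
        self.counter += 1
        self.stack.append((name, self.counter))

    def endElement(self, name):
        self.counter += 1
        path = "/" + "/".join(n for n, _ in self.stack)
        level = len(self.stack)  # the root element is level 1
        _, start = self.stack.pop()
        self.labels.append((path, start, self.counter, level))


def label_document(text):
    """Return all labels in document order (sorted by start position)."""
    handler = RegionLabeler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return sorted(handler.labels, key=lambda label: label[1])


def path_solutions(labels, queries):
    """Exact-match stand-in for path filtering (assumption: no //, no *)."""
    return {
        qid: [(s, e, lv) for p, s, e, lv in labels if p == path]
        for qid, path in queries.items()
    }


if __name__ == "__main__":
    for qid, sols in path_solutions(label_document(EXAMPLE_XML), PATH_QUERIES).items():
        print(qid, sols)
```

Because each element carries its start, end and level, ancestor/descendant checks between two labels reduce to interval containment, which is what lets blocks be processed independently once the path offset is known.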
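
The reducer costs in the load-balancing example (200 and 240) follow from summing the sizes of the path solutions each assigned query reads. The sketch below recomputes them and, purely as an illustration, brute-forces the assignment of queries to two reducers that minimizes the largest reducer cost. The exhaustive search is an assumption made for this small example, not the poster's optimizer, and `reducer_cost` and `best_assignment` are hypothetical names.

```python
from itertools import product

# Path solution sizes and input queries from the poster's example.
SIZES = {"A": 30, "B": 80, "C": 90, "D": 5}
QUERIES = {
    "Q1": ("A", "B", "C"),
    "Q2": ("A", "C", "D"),
    "Q3": ("A", "B", "D"),
}


def reducer_cost(query_ids, queries, sizes):
    """Sum of input sizes over all queries on one reducer.

    As in the poster's example, a path solution shared by two queries
    on the same reducer is counted once per query.
    """
    return sum(sizes[p] for q in query_ids for p in queries[q])


def best_assignment(queries, sizes, n_reducers):
    """Exhaustively try every query-to-reducer mapping (small inputs only)
    and keep the first one with the smallest maximum reducer cost."""
    qids = sorted(queries)
    best = None
    for choice in product(range(n_reducers), repeat=len(qids)):
        groups = [
            [q for q, r in zip(qids, choice) if r == i]
            for i in range(n_reducers)
        ]
        costs = [reducer_cost(g, queries, sizes) for g in groups]
        if best is None or max(costs) < max(best[1]):
            best = (groups, costs)
    return best


if __name__ == "__main__":
    groups, costs = best_assignment(QUERIES, SIZES, 2)
    for i, (group, cost) in enumerate(zip(groups, costs), start=1):
        print(f"Reducer{i}: {group} cost={cost}")
```

For these sizes the search settles on the same grouping the poster shows: Q1 alone on one reducer, Q2 and Q3 together on the other. The search is exponential in the number of queries, so it only serves to check small cases by hand.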