XML::XParent
Another way to store XML elements...

Marco Masetti (grubert) - masetti@linux.it
                          grubert65@gmail.com
Ways of storing XML files
• Plain files, simple scripts to perform XPath
  queries
 – trivial, but very limited in scalability, search, and element handling
• DBMS as BLOBs (text)
 – Limited search features, performance, and scalability. No
   inherent element handling.
• DBMS with XML support
 – Document oriented. Not supported by all DBMSs. Feature sets differ.
• Native XML databases (Tamino, BaseX, eXist,...)
 – Ok... but then I'd need something else to talk about...
• Custom DBMS schemas
 – Data oriented, element handling trivial, scales very well
Custom DBMS schemas

• Structure mapping:
 – the design of the database schema is based on an
   understanding of the XML Schema or DTD
• Model mapping:
 – a fixed database schema for all XML documents,
   without the assistance of a DTD or XML Schema
Structure-mapping schema: XML::RDB!
• Perl module to convert XML files into RDB schemas and
  populate and unpopulate them. You end up with one table
  per XML element type.
• Pros:
  ● Does what it promises
  ● Quite fast
  ● Works with XML Schemas too
  ● Could potentially treat value types properly
• Cons:
  ● Inherent hierarchical structure is lost
  ● Not good if XML files belong to different schemas
  ● Does only what it promises...
  ● Not very well maintained...
  ● SQL schemas can easily become unreadable...
Model-mapping schema: XParent !

• XParent is a very simple DBMS schema that can be
  used to store XML elements
• Does not require the XML Schema (schema-oblivious)
• Highly normalized
• Cons:
  ● Values are stored as text
XParent: how it works...

<?xml version="1.0" encoding="ISO-8859-1"?>
<Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG-7_Schema"
       xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
  <DescriptionUnit xsi:type="DescriptorCollectionType">
    <Descriptor size="5" xsi:type="DominantColorType">
      <ColorSpace type="HSV" colorReferenceFlag="false"/>
      <SpatialCoherency>0</SpatialCoherency>
      <Values>
        <Percentage>2</Percentage>
        <Index>10 6 0</Index>
      </Values>
      <Values>
        <Percentage>15</Percentage>
        <Index>6 16 9</Index>
      </Values>
      <Values>
        <Percentage>3</Percentage>
        <Index>7 18 4</Index>
      </Values>
    </Descriptor>
  </DescriptionUnit>
</Mpeg7>

Table LabelPath
 id | len | path
----+-----+------------------------------------------------------------------
  1 |   4 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace
  2 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@colorReferenceFlag
  3 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@type

Table Element
 did | pathid | ordinal
-----+--------+---------
   1 |      1 |       1
   2 |      2 |       1
   3 |      3 |       2

Table Data
 did | pathid | ordinal | value
-----+--------+---------+-------
   2 |      2 |       1 | false
   3 |      3 |       2 | HSV

Table DataPath
 pid | cid
-----+-----
   1 |   2
   1 |   3
The XML::XParent module
• Perl module to handle XML documents on an XParent
  schema
• Can load any XML file into the same SQL schema
• Plugins can be registered for custom logic on elements
• Provides utilities to:
  ● Create the XParent schema for SQLite and PostgreSQL
  ● Parse and load an XML file ( xparent-parse.pl )
  ● Query the XParent schema ( xparent-search.pl )
• Classes:
  ● XML::XParent::Parser: XML parser based on XML::Twig
  ● XML::XParent::Parser::Plugin: base interface class to
    be implemented by any plugin
  ● XML::XParent::Schema: base class (interface) to the
    XParent schema
  ● XML::XParent::Elem: class that describes an XML
    element
XML::XParent::Schema drivers

• The XML::XParent::Schema class implements the
  Driver/Interface pattern: in this way custom drivers can
  be implemented for specific data stores
• Two generic drivers implemented so far:
  ● XML::XParent::Schema::DBIx: driver implementation based on
    DBIx::Class
    ● All the advantages of an ORM (but who cares?)
    ● Quite slow!
  ● XML::XParent::Schema::DBI: driver implementation
    based on DBI
    ● Direct integration with the data store
    ● Much faster...
The quest for speed...

● Tests performed on my laptop:
  ● CPU0: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
  ● CPU1: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
● Reference XML file:
  ● Size: 45 MB
  ● XML elements: ~600,000
● Reference DBMS: PostgreSQL 8.4.13
● Parsing the reference file with the DBIx driver:
  ● perl xparent-parse.pl -i <ref.xml> --driver DBIx
  ● Execution time: > 3000 mins !!!
● Parsing the reference file with the DBI driver:
  ● perl xparent-parse.pl -i <ref.xml> --driver DBI
  ● Execution time: ~400 mins.
...But then...

● I realized loading times were divergent!
● I realized there was a stupid error in the implementation of
  the algorithm...
[Chart: execution time (log scale, mins) per stage — Ref. implem.: 3000 / 400; Algo patched: 177 / 28 (DBIx / DBI).]
...But then...

● I realized that records in the Data and DataPath tables are not
  referenced by anybody...
● They do not need to be inserted one at a time...
● => Bulk Loading!!!
● ...given N elements, how many records do we have in the
  DataPath table?
Bulk Loading
• Saves a lot of time storing data:
  --- DBI: Bulk loading of 1000000 records ---
  All at once:     50.462398 wallclock seconds
  Chunks of 1000:  31.157044 wallclock seconds
  Chunks of 2000:  27.747248 wallclock seconds
  Chunks of 5000:  28.209256 wallclock seconds
  Chunks of 10000: 26.334099 wallclock seconds
• Distinct inserts of 1000000 records:
  Elapsed time: 250.563282 wallclock seconds

[Chart: execution time (log scale, mins) — Ref. implem.: 3000 / 400; Algo patched: 177 / 28; Bulk loading: 98 / 16 (DBIx / DBI).]
...But then...
• For each element we have to check if its path
  already exists...
• Much better to cache it in a hash than go back
  and forth to the DB...

[Chart: execution time (log scale, mins) — Ref. implem.: 3000 / 400; Algo patched: 177 / 28; Bulk loading: 98 / 16; Cached paths: 41 / 12 (DBIx / DBI).]
...But then...
• Added some indexes:
  CREATE INDEX LabelPath_Path ON LabelPath (Path);
  CREATE INDEX Element_PathID ON Element (PathID);
  CREATE INDEX DataPath_Cid ON DataPath (Cid);
  CREATE INDEX DataPath_Pid ON DataPath (Pid);
  CREATE INDEX Data_Did ON Data (Did);

[Chart: execution time (log scale, mins) — Ref. implem.: 3000 / 400; Algo patched: 177 / 28; Bulk loading: 98 / 16; Cached paths: 41 / 12; + indexes: 29 / 8 (DBIx / DBI).]
...But then...
• Realized I could "compact" records...

<?xml version="1.0" encoding="ISO-8859-1"?>
<Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG-7_Schema"
       xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
  <DescriptionUnit xsi:type="DescriptorCollectionType">
    <Descriptor size="5" xsi:type="DominantColorType">
      <ColorSpace type="HSV" colorReferenceFlag="false"/>
      <SpatialCoherency>0</SpatialCoherency>
      <Values>
        <Percentage>2</Percentage>
        <Index>10 6 0</Index>
      </Values>
      <Values>
        <Percentage>15</Percentage>
        <Index>6 16 9</Index>
      </Values>
      <Values>
        <Percentage>3</Percentage>
        <Index>7 18 4</Index>
      </Values>
    </Descriptor>
  </DescriptionUnit>
</Mpeg7>

Saves another 20%-30%...
Needs some logic at query time (experimental)...
To cut a very long story short...
Time (mins) to load ~600,000 XML elems

       | Reference | Algo patched | Bulk loading | Cached Paths | + indexes | Compact
  DBIx | > 3000    | 177          | 98           | 41           | 29        | 22
  DBI  | ~400      | 28           | 16           | 12           | 8         | 6

● ...and we still have to:
  ● Do code profiling...
  ● Apply DBMS-specific techniques...
  ● Use MapReduce to split jobs among several
    workers...
About retrieval...

• At first I tried implementing an XPath-to-SQL
  translator
• Found it very, very hard...
• ...and almost useless
• ...use the power of SQL to express what you
  want!
• XML::XParent provides an API (get_elem) to
  query for the set of elements whose paths match
  a given SQL regex. The API returns a set of
  XML::XParent::Elem objects.
XML::XParent utilities: how to use them
• Configure parameters in the xparent.yml file:

  ---
  schema_params:
      - 'dbi:Pg:dbname=xparent'
  #    - 'dbi:SQLite:xparent.db'
      - grubert
      - grubert
      -
          AutoCommit: 1
  #plugins:
  #    'SLMS::Redis::ParserPlugin':
  #        'tag': 'MovingRegion'

• To load an XML file:
  perl xparent-parse.pl
      -i <input file>
      --driver <the Schema driver to use>
      [--config_file <the config file>]
      [--verbose]
      [--clean]
      [--compact]

• To query the XParent data store:
  perl xparent-search.pl
      --path <path regex>
      --driver <the Schema driver to use>
      [--config_file <the config file>]

• To clean the data store:
  perl xparent-clean.pl
      --driver <the Schema driver to use>
      [--config_file <the config file>]
Contribute!

https://github.com/grubert65/XParent-Perl.git
Thank You !!!!
