XParent is a simple SQL schema to store XML elements. XML::XParent is a perl module that provides API to store XML files and retrieve XML elements from a XParent data store.
1. XML::XParent
Another way to store XML elements...
Marco Masetti(grubert) - masetti@linux.it
grubert65@gmail.com
2. Ways of storing XML files
• Plain files, simple scripts to perform XPath
queries
– trivial, very limited scalability, search and element handling
• DBMS as BLOBs (text)
– Limited search features, performance and scalability. No
inherent element handling.
• DBMS with XML support
– Document oriented. Not supported by all. Different features
provided.
• Native XML databases (Tamino, Basex, eXist,...)
– Ok…but then I need something else to talk of…
• Custom DBMS schemas
– Data oriented, element handling trivial, scale very well
3. Custom DBMS schemas
• Structure mapping:
– the design of the database schema is based on the
understanding of XML Schema or DTDs
• Model mapping:
– A fixed database schema for all XML documents
without assistance of DTD or XML schemes
4. Structure-mapping schema: XML::RDB!
• Perl module to convert XML files into RDB schemas and
populate, and unpopulate them. You end up with 1 table
per each xml element type.
• Pros:
●
Does what he means
●
Quite fast
●
Works with XML Schemas too
●
Could eventually treat value types properly
• Cons:
●
Inherent hierarchical structure lost
●
Not good if XML files belongs to different schemas
●
Does only what he means...
●
Not very well maintained...
●
SQL schemas can easily become unreadable...
5. Model-mapping schema: XParent !
• XParent is a very simple DBMS schema that can be
used to store XML elements
• Does not require the XML schema (Schema-oblivious)
• Highly normalized
• Cons:
Values are stored as text
7. The XML::XParent module
• Perl module to handle XML documents on a XParent
schema
• Can load any XML file into the same SQL schema
• Plugins can be registered for custom logic on elements
• Provides utilities to:
●
Create the XParent schema for SQLite and Postgresql
●
Parse and load an XML file ( xparent-parse.pl )
●
Query the XParent schema ( xparent-search.pl )
• Classes:
●
XML::XParent::Parser: XML parser based on XML::Twig
●
XML::XParent::Parser::Plugin: base interface class to
be implemented by any plugin
●
XML::XParent::Schema: base class (interface) to the
XParent schema
●
XML::XParent::Elem: class that describes an XML
element
8. XML::XParent::Schema drivers
• The XML::XParent::Schema class implements the
Driver/Interface pattern: in this way custom drivers can
be implemented for specific data stores
• 2 generic drivers implemented so far:
XML::XParent::Schema::DBIx: driver implementation based on
DBIx::Class
●
All advantages of an ORM (but who cares ?)
●
Quite slow!
XML::XParent::Schema::DBI: driver implementation
based on DBI
●
Direct integration with the data store
●
Much faster...
9. The quest for speed...
●
Tests performed on my laptop:
●
CPU0: Intel(R) Core(TM) i5 CPU M 540@ 2.53GHz stepping 05
●
CPU1: Intel(R) Core(TM) i5 CPU M 540@ 2.53GHz stepping 05
●
Reference XML file:
●
Size: 45 MB
●
XML elements: ~600.000
●
Reference DBMS: PostgreSQL 8.4.13
●
Parsing of the reference file with the DBIx driver:
●
perl xparentparse.pl i <ref.xml> driver DBIx
●
Execution time: > 3000 mins !!!
●
Parsing of the reference file with the DBI driver:
●
perl xparentparse.pl i <ref.xml> driver DBI
●
Execution time: ~ 400 mins.
10. ...But then...
●
I realized loading times were divergent!
●
I realized there was a stupid error in the implementation of
the algorith...
Exec Time
(log t)
4
3000
3
400
177
2
28
1
...
m
. ed.
le ch
Im
p
pat
f. go
Re Al
11. ...But then...
●
I realized that records in Data and DataPath tables are not
referenced by anybody...
●
They do not need to be inserted one each...
●
=> Bulk Loading!!!
●
...given N elements, how many records we have in the
DataPath table ?
12. Bulk Loading
• Saves a lot of time storing data:
DBI: Bulk loading of 1000000 records
All in once: 50.462398 wallclock seconds
Chunks of 1000: 31.157044 wallclock seconds
Chunks of 2000: 27.747248 wallclock seconds
Chunks of 5000: 28.209256 wallclock seconds
Exec Time Chunks of 10000:26.334099 wallclock seconds
(log t)
4 • Distinct inserts of 1000000 records:
3000
Elapsed time: 250.563282 wallclock seconds
3
400
177
2 98
28
1 16
... ...
. d. g.
em he in
pl tc ad
Im pa Lo
f. go lk
Re Al Bu
13. ...But then...
• For each element we have to check if path
already exists...
• Much better cache it in an hash than go back
and forth into the DB...
Exec Time
(log t)
4
3000
3
400
177
2 98
41
28
16
1 12
... ... ...
.
. d. g.
m e
di
n t hs
le ch Pa
Im
p
pat L oa
f. go lk ed
Re Al Bu ch
Ca
14. ...But then...
• Added some indexes:
• CREATE INDEX LabelPath_Path ON LabelPath (Path);
• CREATE INDEX Element_PathID ON Element (PathID);
• CREATE INDEX DataPath_Cid ON DataPath (Cid);
• CREATE INDEX DataPath_Pid ON DataPath (Pid);
• CREATE INDEX Data_Did ON Data (Did);
Exec Time
(log t)
4
3000
3
400
177
2 98
41
28
16 29
1 12
8
. ... .
... g. .. ...
m
. ed n s. s.
le h di th xe
p tc oa Pa
m pa L d de
f .I go lk he In
Re Al Bu Ca
c +
15. ...But then...
• Realized I could “compact” records...
<?xml version="1.0" encoding="ISO88591"?>
<Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG7_Schema"
xmlns:xsi="http://www.w3.org/2000/10/XMLSchemainstance">
<DescriptionUnit xsi:type="DescriptorCollectionType">
<Descriptor size="5" xsi:type="DominantColorType">
<ColorSpace type="HSV" colorReferenceFlag="false"/>
<SpatialCoherency>0</SpatialCoherency>
<Values>
<Percentage>2</Percentage>
<Index>10 6 0</Index>
</Values>
<Values>
<Percentage>15</Percentage>
<Index>6 16 9</Index>
</Values>
<Values>
<Percentage>3</Percentage>
<Index>7 18 4</Index>
</Values>
</Descriptor>
</DescriptionUnit>
</Mpeg7>
Saves another 20%-30%...
Needs some logic at query time (experimental)...
16. To cut a very long story short...
Time (mins) to load ~600.000 XML elems
Reference Algo Bulk Cached indexes Compact
patched loading Paths
DBIx > 3000 177 98 41 29 22
DBI ~400 28 16 12 8 6
●
..and we have still to do:
●
Code profiling...
●
Specific DBMS techniques...
●
Use MapReduce to split jobs among several
workers...
17. About retrieval...
• At first I tried implementing an Xpath-to-sql
translator
• Found it very very hard...
• ...and almost useless
• ...use the power of SQL to express what you
want!
• XML::XParent provides an API (get_elem) to
query for a set of elements whose paths match
a given SQL regex. The API returns a set of
XML::XParent::Elem objects.
18. XML::XParent utilities: how to use them
• Configure parameters into xparent.yml file:
• To load an XML file: schema_params:
perl xparentparse.pl 'dbi:Pg:dbname=xparent'
i <input file> # 'dbi:SQLite:xparent.db'
driver <the Schema driver to use>
grubert
grubert
[config_file <the config file>]
[verbose]
AutoCommit: 1
[clean] #plugins:
[compact] # 'SLMS::Redis::ParserPlugin':
• To query the Xparent data store:# 'tag': 'MovingRegion'
perl xparentsearch.pl
path <path regex>
driver <the Schema driver to use>
[config_file <the config file>]
• To clean the data store:
perl xparentclean.pl
driver <the Schema driver to use>
[config_file <the config file>]