A Main Memory Index Structure
    to Query Linked Data
                                        Olaf Hartig
                       http://olafhartig.de/foaf.rdf#olaf

                                      Frank Huber
   Database and Information Systems Research Group
                      Humboldt-Universität zu Berlin
The Issue
 [Figure: for each of the queries ContactInfoPhillipe (Query No. 36), UnsetPropsPhillipe (Query No. 37), 2ndDegree1Phillipe (Query No. 38), 2ndDegree2Phillipe (Query No. 39), and IncomingPhillipe (Query No. 40), three bar charts compare the series "no reuse" and "given order": hit rate (0 to 1), number of query results (0 to 30), and query execution time in seconds (0 to 80).]
Olaf Hartig - A Main Memory Index Structure to Query Linked Data                                             2
The Issue
 [Same figure as on the previous slide, with an annotation: descriptor objects in the query-local dataset after query execution: 172 and 533.]

 ●   Logical representation of Linked Data from the Web: the query-local dataset
 ●   Physical representation of Linked Data from the Web: ?

 What data structure do we use to physically represent the query-local dataset?

Outline

        1. Requirements + Existing Work


        2. Data Structures


        3. Evaluation


Requirements
 ●   (Consecutively) build and use ad hoc collections of
     many small sets of RDF triples
 ●   Four main operations:
     ●   Find … matching triples for a triple pattern in all descriptor objects
     ●   Add, Remove, Replace … descriptor objects
 ●   Support of concurrent access (i.e. isolation)

 ●   Non-relevant properties:
     ●   Querying descriptor objects individually is not necessary
     ●   No need to write data back to the Web
     ●   ACID properties not required for complete queries
Existing Work
 ●   Disk-based storage solutions for RDF data
     ●   Unsuitable due to very costly I/O operations
 ●   Main-memory-based data structures in the literature
     ●   Focus on a large, single set of RDF triples
     ●   Optimized for complete graph pattern queries or path queries
 ●   Main-memory-based data structures in RDF frameworks
     ●   Focus on Jena, ARQ and NG4J
     ●   Inefficient (see evaluation)




Hash-Based Index for RDF Data
 [Diagram: the logical representation (a set of RDF triples) is mapped to a physical representation consisting of a dictionary (Dict) and six hash tables labeled S, P, O, SP, PO, and SO.]
 ●   Dictionary:
     ●   Two-way mapping between RDF terms and numerical identifiers
 ●   6 hash tables:
     ●   Each hash table contains all ID-encoded triples t_id = ( id_s, id_p, id_o )
     ●   Efficient support for all types of triple patterns
 ●   Example: Find for the triple pattern ( ?acq, knows, http://bob.name )
 *Similar to Harth and Decker, 2005

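The dictionary and the six hash tables can be sketched roughly as follows. This is only an illustration, not the SQUIN implementation: Python dicts stand in for the hash tables, the class and method names (`Dictionary`, `HashIndex`, `encode`, `find`) are hypothetical, and the URIs other than http://bob.name are made up for the example.

```python
from collections import defaultdict


class Dictionary:
    """Two-way mapping between RDF terms and numerical identifiers."""

    def __init__(self):
        self._ids = {}     # term -> identifier
        self._terms = []   # identifier -> term

    def encode(self, term):
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def decode(self, ident):
        return self._terms[ident]


class HashIndex:
    """Six tables (S, P, O, SP, PO, SO); each one holds all ID-encoded triples."""

    KEYS = {"S": (0,), "P": (1,), "O": (2,),
            "SP": (0, 1), "PO": (1, 2), "SO": (0, 2)}

    def __init__(self):
        self.tables = {name: defaultdict(set) for name in self.KEYS}
        self.triples = set()

    def add(self, t):
        self.triples.add(t)
        for name, pos in self.KEYS.items():
            self.tables[name][tuple(t[i] for i in pos)].add(t)

    def find(self, s=None, p=None, o=None):
        """Return matching triples; None marks a variable in the pattern."""
        pattern = (s, p, o)
        bound = tuple(i for i in range(3) if pattern[i] is not None)
        if not bound:                      # (?s, ?p, ?o): everything matches
            return set(self.triples)
        if len(bound) == 3:                # fully bound: membership test
            return {pattern} & self.triples
        name = next(n for n, pos in self.KEYS.items() if pos == bound)
        key = tuple(pattern[i] for i in bound)
        return set(self.tables[name].get(key, ()))


d = Dictionary()
index = HashIndex()
for s, p, o in [("http://alice.name", "knows", "http://bob.name"),
                ("http://carol.name", "knows", "http://bob.name"),
                ("http://alice.name", "name", '"Alice"')]:
    index.add((d.encode(s), d.encode(p), d.encode(o)))

# Triple pattern ( ?acq, knows, http://bob.name ): P and O are bound, so the
# PO table answers it with a single lookup.
matches = index.find(p=d.encode("knows"), o=d.encode("http://bob.name"))
acquaintances = sorted(d.decode(s) for s, _, _ in matches)
```

Each of the seven possible combinations of bound positions (other than the fully bound case) maps to exactly one table, which is why no scan is needed for any pattern type.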
Individual Indexing
 [Diagram: the query-local dataset (logical representation) is physically represented by a dictionary (Dict) and one hash-based index (S, P, O, SP, PO, SO) per descriptor object.]
 ●   Idea: Index each descriptor object separately
 ●   Implementation of the four operations:
     ●   Add, Remove, and Replace are straightforward
     ●   Find requires iterating over all indexes
 ●   Example: Find for the triple pattern ( ?acq, knows, http://bob.name )

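A minimal sketch of individual indexing, assuming triples are already ID-encoded as integer 3-tuples and that `None` marks a variable in a pattern. The names `build_index`, `lookup`, and `IndividualIndexes` are hypothetical, and Python dicts again stand in for the hash tables.

```python
from collections import defaultdict

POSITIONS = {"S": (0,), "P": (1,), "O": (2,),
             "SP": (0, 1), "PO": (1, 2), "SO": (0, 2)}


def build_index(triples):
    """Build the six tables for one descriptor object."""
    tables = {name: defaultdict(set) for name in POSITIONS}
    for t in triples:
        for name, pos in POSITIONS.items():
            tables[name][tuple(t[i] for i in pos)].add(t)
    return tables


def lookup(tables, pattern):
    """Match a pattern such as (None, 2, 3) against one index."""
    bound = tuple(i for i in range(3) if pattern[i] is not None)
    if len(bound) in (0, 3):
        # No dedicated table for these cases in this sketch: collect all
        # triples from the S table and filter them.
        everything = set().union(*tables["S"].values())
        return {t for t in everything
                if all(pattern[i] in (None, t[i]) for i in range(3))}
    name = next(n for n, pos in POSITIONS.items() if pos == bound)
    return set(tables[name].get(tuple(pattern[i] for i in bound), ()))


class IndividualIndexes:
    """One index per descriptor object, keyed by descriptor object ID."""

    def __init__(self):
        self.indexes = {}

    def add(self, descriptor_id, triples):
        self.indexes[descriptor_id] = build_index(triples)

    def remove(self, descriptor_id):
        self.indexes.pop(descriptor_id, None)

    replace = add  # replacing swaps in a freshly built index

    def find(self, pattern):
        result = set()
        for tables in self.indexes.values():   # iterate over all indexes
            result |= lookup(tables, pattern)
        return result
```

Add, Remove, and Replace each touch a single entry, whereas the cost of Find grows with the number of descriptor objects, which is the trade-off the slide points at.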
Combined Indexing
 [Diagram: the query-local dataset (logical representation) is physically represented by a dictionary (Dict) and a single index with the hash tables S, P, O, SP, PO, and SO.]
 ●   Idea: Use a single index for all descriptor objects
     ●   src – maps each triple to a set of descriptor object IDs
 ●   Example: for an indexed triple t_id = ( id_s, id_p, id_o ), src( t_id ) is the set of IDs of the descriptor objects that contain the triple (two descriptor objects in the slide's illustration)

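The role of src can be illustrated with the following sketch. It assumes ID-encoded integer triples and uses hypothetical names (`CombinedIndex`, `add`, `remove`, `find`); it ignores concurrency and the status bookkeeping described in the backup slides.

```python
from collections import defaultdict

POSITIONS = {"S": (0,), "P": (1,), "O": (2,),
             "SP": (0, 1), "PO": (1, 2), "SO": (0, 2)}


class CombinedIndex:
    """A single set of six tables shared by all descriptor objects."""

    def __init__(self):
        self.tables = {name: defaultdict(set) for name in POSITIONS}
        self.src = defaultdict(set)   # triple -> set of descriptor object IDs

    def add(self, descriptor_id, triples):
        for t in triples:
            if t not in self.src:     # first occurrence: enter into every table
                for name, pos in POSITIONS.items():
                    self.tables[name][tuple(t[i] for i in pos)].add(t)
            self.src[t].add(descriptor_id)

    def remove(self, descriptor_id):
        # Linear scan over src for brevity; a real structure would avoid it.
        for t in [t for t, ids in self.src.items() if descriptor_id in ids]:
            self.src[t].discard(descriptor_id)
            if not self.src[t]:       # no descriptor object contains t any more
                del self.src[t]
                for name, pos in POSITIONS.items():
                    self.tables[name][tuple(t[i] for i in pos)].discard(t)

    def find(self, pattern):
        bound = tuple(i for i in range(3) if pattern[i] is not None)
        if len(bound) in (0, 3):
            return {t for t in self.src
                    if all(pattern[i] in (None, t[i]) for i in range(3))}
        name = next(n for n, pos in POSITIONS.items() if pos == bound)
        return set(self.tables[name].get(tuple(pattern[i] for i in bound), ()))
```

A triple shared by several descriptor objects is stored once; Find needs one lookup regardless of how many descriptor objects exist, while Remove has to consult src to decide whether a triple may leave the tables.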
Quad Indexing
 [Diagram: the query-local dataset (logical representation) is physically represented by a dictionary (Dict) and a single quad index with the hash tables S, P, O, SP, PO, and SO.]
 ●   Idea: Use a single quad index for all descriptor objects
     ●   quad = ID-encoded triple + descriptor object ID
     ●   q = ( ( id_s, id_p, id_o ), descriptor object ID )

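A rough sketch of the quad variant, under the same assumptions as before (ID-encoded integer triples, `None` as variable, Python dicts instead of fixed-size hash tables). `QuadIndex` and its method names are hypothetical.

```python
from collections import defaultdict

POSITIONS = {"S": (0,), "P": (1,), "O": (2,),
             "SP": (0, 1), "PO": (1, 2), "SO": (0, 2)}


class QuadIndex:
    """Single index whose entries are quads: (ID-encoded triple, descriptor ID)."""

    def __init__(self):
        self.tables = {name: defaultdict(set) for name in POSITIONS}

    def add(self, descriptor_id, triples):
        for t in triples:
            quad = (t, descriptor_id)
            for name, pos in POSITIONS.items():
                self.tables[name][tuple(t[i] for i in pos)].add(quad)

    def remove(self, descriptor_id):
        # Full scan for brevity: drop every quad that belongs to the descriptor.
        for table in self.tables.values():
            for bucket in table.values():
                bucket.difference_update(
                    {q for q in bucket if q[1] == descriptor_id})

    def find(self, pattern):
        bound = tuple(i for i in range(3) if pattern[i] is not None)
        if len(bound) in (0, 3):
            quads = set().union(*self.tables["S"].values())
            return {t for t, _ in quads
                    if all(pattern[i] in (None, t[i]) for i in range(3))}
        name = next(n for n, pos in POSITIONS.items() if pos == bound)
        quads = self.tables[name].get(tuple(pattern[i] for i in bound), ())
        return {t for t, _ in quads}   # the same triple may occur in several quads
```

Compared with the combined index, adding a descriptor object never has to check whether a triple is already present, at the price of storing a shared triple once per descriptor object.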
Experiment Setup
 Does this affect the overall execution time for link traversal based query executions?
 ●   Simulation of the Web of Data
     ●   Linked Data server publishes BSBM dataset (scaling factor: 50)
     ●   Adjusted BSBM queries link to the simulation server
 ●   Experiment:
     ●   Sequence of 200 query mixes
     ●   Reuse of the query-local dataset for the whole sequence
     ●   IndIR, CombIR, and QuadIR (as presented), engine: SQUIN
     ●   NamedGraphSetImpl (NG4J/Jena), engine: SemWeb Client

Execution Time
 [Figure, two panels over the sequence of 200 query mixes. Left: overall number of descriptor objects in the queried dataset (0 to 2500) per query mix. Right: execution time in seconds (0 to 80) per query mix for NG4J (SWClLib); IndIR, m=4; CombIR, m=12; CombQuadIR, m=12.]

Summary
 ●   Three hash index based data structures:
     ●   Individual indexing
     ●   Combined indexing
     ●   Quad indexing
 ●   Findings:
     ●   A single index improves query performance significantly
     ●   Shorter load times with quads
 ●   Also suitable for other use cases of ad hoc storage of Linked Data that is
     ●   Consecutively retrieved from remote sources
     ●   Used for immediate local processing

Backup Slides




Combined Indexing
 [Diagram: the query-local dataset (logical representation) is physically represented by a dictionary (Dict) and a single index with the hash tables S, P, O, SP, PO, and SO.]
 ●   Idea: Use a single index for all descriptor objects
     ●   src – maps each triple to a set of descriptor object IDs
 ●   Implementation of the four operations requires:
     ●   status – maps each descriptor object ID to a status ( BeingIndexed, Indexed, ToBeRemoved, BeingRemoved )
     ●   Find reports a triple t only if: ∃d ∈ src(t) : status(d) = Indexed

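The visibility rule for Find can be written down directly. The sketch below assumes src and status are plain dicts and uses hypothetical helper names (`visible`, `filter_matches`); only the four status values and the rule itself come from the slide.

```python
from enum import Enum


class Status(Enum):
    BEING_INDEXED = "BeingIndexed"
    INDEXED = "Indexed"
    TO_BE_REMOVED = "ToBeRemoved"
    BEING_REMOVED = "BeingRemoved"


def visible(triple, src, status):
    """True iff some d in src(triple) has status(d) = Indexed."""
    return any(status[d] is Status.INDEXED for d in src.get(triple, ()))


def filter_matches(candidates, src, status):
    """Keep only those candidate triples that Find is allowed to report."""
    return {t for t in candidates if visible(t, src, status)}


# (1, 2, 3) occurs in d1 and d2; (4, 2, 3) only in d2, which is still being indexed.
src = {(1, 2, 3): {"d1", "d2"}, (4, 2, 3): {"d2"}}
status = {"d1": Status.INDEXED, "d2": Status.BEING_INDEXED}
reported = filter_matches({(1, 2, 3), (4, 2, 3)}, src, status)
```

Here only (1, 2, 3) is reported, because (4, 2, 3) has no source descriptor object in status Indexed yet. This is how a half-added or half-removed descriptor object stays invisible to concurrent queries.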
Implementation
 ●   Available as Free Software (http://squin.org)
 ●   Hash tables with n = 2^m buckets
 ●   Hash functions:
     ●   h_S( i_s, i_p, i_o ) = i_s & bitmask[m]
     ●   h_SP( i_s, i_p, i_o ) = ( i_s • i_p ) & bitmask[m]
     ●   etc.
 ●   Which m? (see paper)
     ●   m = 4 for the individual indexes
     ●   m = 12 for the combined index
     ●   m = 12 for the quad index

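A small worked example of the two hash functions. Two points are assumptions, not taken from the slides: bitmask[m] is taken to be the integer with the m lowest bits set (which fits tables of 2^m buckets), and the operator "•" is read as integer multiplication. The function names are hypothetical.

```python
# Assumption: bitmask[m] has the m lowest bits set, i.e. 2**m - 1.
BITMASK = [(1 << m) - 1 for m in range(32)]


def num_buckets(m):
    """Number of buckets of a table with parameter m."""
    return 1 << m


def h_s(i_s, i_p, i_o, m):
    """Bucket for the S table: the m lowest bits of the subject ID."""
    return i_s & BITMASK[m]


def h_sp(i_s, i_p, i_o, m):
    """Bucket for the SP table.

    Assumption: the slide's "•" is interpreted as multiplication here.
    """
    return (i_s * i_p) & BITMASK[m]


# Individual indexes use m = 4 (16 buckets); combined and quad use m = 12.
small = h_s(37, 5, 9, m=4)      # 37 = 0b100101, lowest four bits are 0b0101
pair = h_sp(37, 5, 9, m=4)      # 37 * 5 = 185 = 0b10111001, lowest four bits 0b1001
large = h_sp(37, 5, 9, m=12)    # 185 fits into 12 bits unchanged
```

Masking with a power-of-two bitmask replaces a modulo operation, so the bucket number is always in the range 0 to 2^m - 1.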
Experiment Setup
 ●   Comparison without link traversal based query execution
 ●   Compared data structures:
     ●   Our implementation of IndIR, CombIR, CombQuadIR
     ●   NamedGraphSetImpl in NG4J (Jena)
     ●   DatasetGraphMap in ARQ (Jena)
 ●   Berlin SPARQL Benchmark (BSBM)
     ●   BSBM datasets partitioned into query-local datasets
     ●   BSBM (v2.0) query mixes executed over these datasets

Required Memory
 [Figure: estimated required memory in MB (0 to 120) over pc (BSBM scaling factor, 0 to 500) for ARQ, NG4J, IndIR (m=4), CombIR (m=12), and CombQuad (m=12).]

 BSBM scaling factor   number of descriptor objects   overall number of triples
          50                      2,599                        22,616
         100                      4,178                        40,133
         150                      5,756                        57,524
         200                      7,329                        75,062
         250                      9,873                        97,613
         300                     11,455                       115,217
         350                     13,954                       137,567
         500                     18,687                       190,502

Execution Time
 [Figure: average execution time per query mix in seconds (0 to 2500) over pc (BSBM scaling factor, 0 to 500) for ARQ, NG4J, IndIR (m=4), CombIR (m=12), and CombQuad (m=12). Datasets as in the table under Required Memory.]

Load Time
 [Figure: average creation time in seconds (0 to 10) over pc (BSBM scaling factor, 0 to 500) for CombIR (m=12), IndIR (m=4), and CombQuad (m=12). Datasets as in the table under Required Memory.]

Execution Time
 [Figure: execution time in seconds (0 to 80) per query mix (0 to 200) for NG4J (SWClLib); IndIR, m=4; CombIR, m=12; CombQuadIR, m=12. An inset shows query mixes 0 to 25 with execution times from 0 to 60 seconds.]

These slides have been created by
                                      Olaf Hartig

                                              http://olafhartig.de


                     This work is licensed under a
       Creative Commons Attribution-Share Alike 3.0 License
           (http://creativecommons.org/licenses/by-sa/3.0/)




Olaf Hartig - A Main Memory Index Structure to Query Linked Data     33


A Main Memory Index Structure to Query Linked Data

  • 1. A Main Memory Index Structure to Query Linked Data
        Olaf Hartig (http://olafhartig.de/foaf.rdf#olaf), Frank Huber
        Database and Information Systems Research Group, Humboldt-Universität zu Berlin
  • 2-3. The Issue
        [Figure: bar charts of hit rate (0 to 1), number of query results (0 to 30), and query execution time in seconds (0 to 80), comparing "no reuse" with "given order" for ContactInfoPhillipe (Query No. 36), UnsetPropsPhillipe (Query No. 37), 2ndDegree1Phillipe (Query No. 38), 2ndDegree2Phillipe (Query No. 39), and IncomingPhillipe (Query No. 40)]
        Slide 3 adds a callout: descriptor objects in the query-local dataset after query execution (values shown: 172, 533)
  • 4. [Diagram: the query-local dataset as the logical representation of Linked Data from the Web; the physical representation of Linked Data from the Web is marked with a question mark]
        What data structure do we use to physically represent the query-local dataset?
  • 5. Outline
        1. Requirements + Existing Work
        2. Data Structures
        3. Evaluation
  • 6-7. Requirements
        ● (Consecutively) build and use ad hoc collections of many small sets of RDF triples
        ● Four main operations:
            ● Find … matching triples for a triple pattern in all descriptor objects
            ● Add, Remove, Replace … descriptor objects
        ● Support of concurrent access (i.e. isolation)
        ● Non-relevant properties:
            ● Querying descriptor objects individually is not necessary
            ● No need to write data back to the Web
            ● ACID properties not required for complete queries
  • 8. Existing Work
        ● Disk based storage solutions for RDF data
            ● Unsuitable due to very costly I/O operations
        ● Main memory based data structures in the literature
            ● Focus on a large, single set of RDF triples
            ● Optimized for complete graph pattern queries or path queries
        ● Main memory based data structures in RDF frameworks
            ● Focus on Jena, ARQ and NG4J
            ● Inefficient (see evaluation)
  • 10-12. Hash-Based Index for RDF Data
        [Diagram: the logical representation is mapped to a physical representation consisting of a dictionary (Dict) and six hash tables (S, P, O, SP, PO, SO)]
        ● Dictionary:
            ● Two-way mapping between RDF terms and numerical identifiers
        ● 6 hash tables:
            ● Each hash table contains all ID-encoded triples t_id = (id_s, id_p, id_o)
            ● Efficient support for all types of triple patterns
        ● Example lookup: Find (?acq, knows, http://bob.name)
        * Similar to Harth and Decker, 2005
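The dictionary and the six hash tables described on these slides can be sketched as follows. This is a minimal illustration rather than the SQUIN implementation: Python dicts stand in for the bucketed hash tables, the class names `Dictionary` and `HashIndex` are made up for this example, and the URI for Alice is hypothetical.

```python
from collections import defaultdict


class Dictionary:
    """Two-way mapping between RDF terms and numerical identifiers."""

    def __init__(self):
        self._ids = {}      # term -> id
        self._terms = []    # id -> term

    def encode(self, term):
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def decode(self, term_id):
        return self._terms[term_id]


class HashIndex:
    """Six hash tables (S, P, O, SP, PO, SO); each one holds every triple."""

    KEYS = {"S": (0,), "P": (1,), "O": (2,),
            "SP": (0, 1), "PO": (1, 2), "SO": (0, 2)}

    def __init__(self):
        self.tables = {name: defaultdict(set) for name in self.KEYS}

    def add(self, triple):
        for name, positions in self.KEYS.items():
            self.tables[name][tuple(triple[i] for i in positions)].add(triple)

    def find(self, pattern):
        """pattern is an (s, p, o) tuple of ids; None marks a variable."""
        bound = tuple(i for i in range(3) if pattern[i] is not None)
        if not bound:            # (?s, ?p, ?o): scan one table completely
            return {t for bucket in self.tables["S"].values() for t in bucket}
        if len(bound) == 3:      # fully bound: look up SP, then check O
            return {t for t in self.tables["SP"].get(pattern[:2], ())
                    if t == pattern}
        name = "".join("SPO"[i] for i in bound)
        key = tuple(pattern[i] for i in bound)
        return set(self.tables[name].get(key, ()))


terms = Dictionary()
index = HashIndex()
alice = terms.encode("http://alice.example/#me")   # hypothetical URI
bob = terms.encode("http://bob.name")
knows = terms.encode("knows")
index.add((alice, knows, bob))
index.add((bob, knows, alice))

# Find (?acq, knows, http://bob.name): the subject is a variable,
# so the lookup goes to the PO table.
matches = index.find((None, knows, bob))
acquaintances = [terms.decode(s) for (s, _, _) in matches]
```

Storing every triple six times trades memory for lookups that need a single hash probe whenever one or two components of the pattern are bound.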
  • 13-14. Individual Indexing
        [Diagram: each descriptor object of the query-local dataset gets its own index (S, P, O, SP, PO, SO); one dictionary (Dict) is shown alongside the indexes]
        ● Idea: Index each descriptor object separately
        ● Implementation of the four operations:
            ● Add, Remove, and Replace are straightforward
            ● Find requires iterating over all indexes
        ● Example lookup: Find (?acq, knows, http://bob.name)
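A rough sketch of individual indexing, assuming triples are already ID-encoded as integer tuples. The names `build_index` and `IndividualIndexing` and the descriptor object IDs `doc-a` and `doc-b` are invented for illustration, and only patterns with one or two bound components are handled.

```python
from collections import defaultdict

# Bound-position combinations corresponding to the S, P, O, SP, PO, SO tables.
TABLES = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)]


def build_index(triples):
    """Build one small index for a single descriptor object."""
    index = defaultdict(set)
    for t in triples:
        for positions in TABLES:
            index[(positions, tuple(t[i] for i in positions))].add(t)
    return index


class IndividualIndexing:
    def __init__(self):
        self.indexes = {}   # descriptor object id -> its own index

    def add(self, object_id, triples):
        self.indexes[object_id] = build_index(triples)

    def remove(self, object_id):
        self.indexes.pop(object_id, None)

    def replace(self, object_id, triples):
        # Swapping in a freshly built index replaces the old one in one step.
        self.indexes[object_id] = build_index(triples)

    def find(self, pattern):
        positions = tuple(i for i in range(3) if pattern[i] is not None)
        if positions not in TABLES:
            raise ValueError("sketch supports one or two bound components")
        key = (positions, tuple(pattern[i] for i in positions))
        result = set()
        for index in self.indexes.values():   # one probe per descriptor object
            result |= index.get(key, set())
        return result


store = IndividualIndexing()
store.add("doc-a", [(1, 10, 2)])
store.add("doc-b", [(3, 10, 2), (1, 10, 2)])
before = store.find((None, 10, 2))
store.replace("doc-b", [(3, 10, 1)])
after_replace = store.find((None, 10, 2))
store.remove("doc-a")
after_remove = store.find((None, 10, 2))
```

Add, Remove, and Replace each touch exactly one entry of `indexes`, while the cost of `find` grows with the number of descriptor objects, which matches the trade-off named on the slide.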
  • 15-17. Combined Indexing
        [Diagram: the whole query-local dataset is mapped to a single index (S, P, O, SP, PO, SO) with one dictionary (Dict)]
        ● Idea: Use a single index for all descriptor objects
        ● src: maps each triple to a set of descriptor object IDs
        ● Each ID-encoded triple t_id = (id_s, id_p, id_o) is stored together with src(t_id), the set of IDs of the descriptor objects that contain it
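Combined indexing might look like the sketch below, again assuming integer-encoded triples. `CombinedIndexing`, `keys`, and the IDs `doc-a` and `doc-b` are hypothetical names, and `remove` scans `src` linearly only to keep the example short.

```python
from collections import defaultdict

# Bound-position combinations corresponding to the S, P, O, SP, PO, SO tables.
TABLES = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)]


def keys(t):
    """All index keys under which triple t is stored."""
    return [(positions, tuple(t[i] for i in positions)) for positions in TABLES]


class CombinedIndexing:
    def __init__(self):
        self.index = defaultdict(set)   # one index for all descriptor objects
        self.src = defaultdict(set)     # triple -> ids of objects containing it

    def add(self, object_id, triples):
        for t in triples:
            self.src[t].add(object_id)
            for k in keys(t):
                self.index[k].add(t)

    def remove(self, object_id):
        owned = [t for t, owners in self.src.items() if object_id in owners]
        for t in owned:
            self.src[t].discard(object_id)
            if not self.src[t]:          # no descriptor object left: drop triple
                del self.src[t]
                for k in keys(t):
                    self.index[k].discard(t)

    def find(self, pattern):
        """Single lookup; supports one or two bound components."""
        positions = tuple(i for i in range(3) if pattern[i] is not None)
        key = (positions, tuple(pattern[i] for i in positions))
        return set(self.index.get(key, ()))


store = CombinedIndexing()
store.add("doc-a", [(1, 10, 2)])
store.add("doc-b", [(1, 10, 2), (3, 10, 2)])
shared_sources = set(store.src[(1, 10, 2)])
before = store.find((None, 10, 2))
store.remove("doc-b")
after = store.find((None, 10, 2))
```

A triple that occurs in several descriptor objects is stored once, and it disappears from the index only when its `src` set becomes empty.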
  • 18. Quad Indexing
        [Diagram: the whole query-local dataset is mapped to a single index (S, P, O, SP, PO, SO) with one dictionary (Dict)]
        ● Idea: Use a single quad index for all descriptor objects
        ● quad = ID-encoded triple + descriptor object ID: q = ( (id_s, id_p, id_o), descriptor object ID )
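For comparison, a quad index can be sketched by storing (triple, descriptor object ID) pairs directly in the single index, so that no separate `src` map is needed. The class name `QuadIndexing` and the object IDs are assumptions made for this example.

```python
from collections import defaultdict

# Bound-position combinations corresponding to the S, P, O, SP, PO, SO tables.
TABLES = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)]


def keys(t):
    """All index keys under which triple t is stored."""
    return [(positions, tuple(t[i] for i in positions)) for positions in TABLES]


class QuadIndexing:
    def __init__(self):
        self.index = defaultdict(set)   # key -> set of quads (triple, object id)

    def add(self, object_id, triples):
        for t in triples:
            for k in keys(t):
                self.index[k].add((t, object_id))

    def remove(self, object_id):
        # Full scan for brevity; drops every quad of the given object.
        for bucket in self.index.values():
            bucket.difference_update({q for q in bucket if q[1] == object_id})

    def find_quads(self, pattern):
        positions = tuple(i for i in range(3) if pattern[i] is not None)
        key = (positions, tuple(pattern[i] for i in positions))
        return set(self.index.get(key, ()))

    def find(self, pattern):
        # The same triple may appear in several quads; report it once.
        return {t for (t, _) in self.find_quads(pattern)}


store = QuadIndexing()
store.add("doc-a", [(1, 10, 2)])
store.add("doc-b", [(1, 10, 2), (3, 10, 2)])
quads = store.find_quads((None, 10, 2))
before = store.find((None, 10, 2))
store.remove("doc-b")
after = store.find((None, 10, 2))
```

In this sketch `add` never has to check whether a triple is already present, which is one plausible reason for shorter load times, while `find` has to collapse quads that share a triple.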
  • 20-21. Experiment Setup
        Does this affect the overall execution time for link traversal based query executions?
        ● Simulation of the Web of Data
            ● Linked Data server publishes BSBM dataset (scaling factor: 50)
            ● Adjusted BSBM queries link to the simulation server
        ● Experiment:
            ● Sequence of 200 query mixes
            ● Reuse of the query-local dataset for the whole sequence
            ● IndIR, CombIR, and QuadIR (as presented), engine: SQUIN
            ● NamedGraphSetImpl (NG4J/Jena), engine: SemWeb Client
  • 22-23. Execution Time
        [Figure: two plots over the sequence of query mixes (0 to 200). First plot: overall number of descriptor objects in the queried dataset (0 to 2500). Second plot: execution time in seconds (0 to 80) for NG4J (SWClLib), IndIR (m=4), CombIR (m=12), and CombQuadIR (m=12)]
  • 24. Summary
        ● Three hash index based data structures:
            ● Individual indexing
            ● Combined indexing
            ● Quad indexing
        ● Findings:
            ● A single index improves query performance significantly
            ● Smaller load times with quads
        ● Also suitable for other use cases that store Linked Data ad hoc, where the data is:
            ● Consecutively retrieved from remote sources
            ● Used for immediate local processing
  • 25. Backup Slides
  • 26. Combined Indexing
        [Diagram: the whole query-local dataset is mapped to a single index (S, P, O, SP, PO, SO) with one dictionary (Dict)]
        ● Idea: Use a single index for all descriptor objects
        ● src: maps each triple to a set of descriptor object IDs
        ● Implementation of the four operations requires:
            ● status: maps each descriptor object ID to a status (BeingIndexed, Indexed, ToBeRemoved, BeingRemoved)
        ● Find reports a triple t only if: ∃d ∈ src(t) : status(d) = Indexed
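The visibility rule for Find can be illustrated with a small filter. The four status values come from the slide; the function names `is_reportable` and `find`, the plain-dict representation of `src` and `status`, and the object IDs are assumptions of this sketch.

```python
from enum import Enum


class Status(Enum):
    BEING_INDEXED = "BeingIndexed"
    INDEXED = "Indexed"
    TO_BE_REMOVED = "ToBeRemoved"
    BEING_REMOVED = "BeingRemoved"


def is_reportable(triple, src, status):
    """True iff some descriptor object d in src(triple) has status Indexed."""
    return any(status[d] is Status.INDEXED for d in src.get(triple, ()))


def find(candidates, src, status):
    """Filter index matches down to the triples that Find may report."""
    return [t for t in candidates if is_reportable(t, src, status)]


src = {
    (1, 10, 2): {"doc-a", "doc-b"},
    (3, 10, 2): {"doc-b"},
    (4, 10, 2): {"doc-c"},
}
status = {
    "doc-a": Status.INDEXED,
    "doc-b": Status.TO_BE_REMOVED,
    "doc-c": Status.BEING_INDEXED,
}
visible = find([(1, 10, 2), (3, 10, 2), (4, 10, 2)], src, status)
```

Only the first triple is reported: it is the only one with a source in status Indexed, whereas the other two belong solely to objects that are being removed or are still being indexed.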
  • 27. Implementation
        ● Available as Free Software (http://squin.org)
        ● Hash tables with n = 2^m buckets
        ● Hash functions:
            ● h_S(i_s, i_p, i_o) = i_s & bitmask[m]
            ● h_SP(i_s, i_p, i_o) = (i_s • i_p) & bitmask[m]
            ● etc.
        ● Which m? (see paper)
            ● m = 4 for the individual indexes
            ● m = 12 for the combined index
            ● m = 12 for the quad index
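As a worked example of these hash functions: the sketch below assumes that bitmask[m] has the m lowest bits set (which fits n = 2^m buckets) and reads the "•" in h_SP as integer multiplication. Both readings are assumptions, since the slide does not spell them out.

```python
def bitmask(m):
    """Mask with the m lowest bits set, e.g. m = 4 gives 0b1111 (assumed)."""
    return (1 << m) - 1


def h_s(i_s, i_p, i_o, m):
    """Bucket number in the S table: low m bits of the subject id."""
    return i_s & bitmask(m)


def h_sp(i_s, i_p, i_o, m):
    """Bucket number in the SP table; '•' is read as multiplication here."""
    return (i_s * i_p) & bitmask(m)


buckets_individual = 1 << 4    # m = 4  gives n = 16 buckets
buckets_combined = 1 << 12     # m = 12 gives n = 4096 buckets

bucket_s = h_s(0b10110, 7, 9, m=4)    # 0b10110 & 0b1111 = 0b0110
bucket_sp = h_sp(5, 6, 0, m=4)        # 30 & 15
```

Because n is a power of two, the bitwise AND yields a bucket number in the range 0 to n - 1 without a modulo operation.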
  • 28. Experiment Setup
        ● Comparison without link traversal based query execution
        ● Compared data structures:
            ● Our implementation of IndIR, CombIR, CombQuadIR
            ● NamedGraphSetImpl in NG4J (Jena)
            ● DatasetGraphMap in ARQ (Jena)
        ● Berlin SPARQL Benchmark (BSBM)
            ● BSBM datasets partitioned into query-local datasets
            ● BSBM (v2.0) query mixes executed over these datasets
  • 29. Required Memory
        [Figure: estimated required memory in MB (0 to 120) against pc, the BSBM scaling factor (0 to 500), for ARQ, NG4J, IndIR (m=4), CombIR (m=12), and CombQuad (m=12)]
        Datasets (the same table is shown on slides 30 and 31):

        BSBM scaling factor   number of descriptor objects   overall number of triples
                         50                          2,599                      22,616
                        100                          4,178                      40,133
                        150                          5,756                      57,524
                        200                          7,329                      75,062
                        250                          9,873                      97,613
                        300                         11,455                     115,217
                        350                         13,954                     137,567
                        500                         18,687                     190,502

  • 30. Execution Time
        [Figure: average execution time per query mix in seconds (0 to 2500) against pc, the BSBM scaling factor (0 to 500), for ARQ, NG4J, IndIR (m=4), CombIR (m=12), and CombQuad (m=12)]
  • 31. Load Time
        [Figure: average creation time in seconds (0 to 10) against pc, the BSBM scaling factor (0 to 500), for CombIR (m=12), IndIR (m=4), and CombQuad (m=12)]
  • 32. Execution Time
        [Figure: execution time in seconds (0 to 80) per query mix (0 to 200) for NG4J (SWClLib), IndIR (m=4), CombIR (m=12), and CombQuadIR (m=12); inset: execution time in seconds (0 to 60) for query mixes 0 to 25 for CombIR (m=12) and CombQuadIR (m=12)]
  • 33. These slides have been created by Olaf Hartig (http://olafhartig.de). This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License (http://creativecommons.org/licenses/by-sa/3.0/).