Towards Flexible Indices for Distributed Graph Data: The Formal Schema-level Index Model FLuID

This paper was presented at the Foundations of Databases (GvDB) workshop.
https://dbs.cs.uni-duesseldorf.de/gvdb2018/wp-content/uploads/2018/05/GvDB2018_paper_4.pdf

  1. Towards Flexible Indices for Distributed Graph Data: The Formal Schema-level Index Model FLuID
     Till Blume and Ansgar Scherp
     ZBW – Leibniz Information Centre for Economics, Christian-Albrechts-Universität zu Kiel
     May 23rd, 2018, 30th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 22.05.2018 - 25.05.2018, Wuppertal, Germany.
     MOVING (www.moving-project.eu): TraininG towards a society of data-saVvy inforMation prOfessionals to enable open leadership INnovation
  2. Why use a Schema-level Index?
     Problem:
     • We are looking for a specific kind of metadata, e.g., about books.
     • We do not know in which databases we can find such metadata.
     • We need an index that can be queried to find matching databases.
     Solution:
     • A schema-level index (SLI) summarizes data by storing information about how the data is modelled in a specific database.
     • We formulate a structural query to find matching databases.
     [Figure: index lookup example with the labels foaf:Agent, bibo:Book, dct:subject, and dct:creator, and a sample record ("Towards a clean air policy", "Great Britain. Central Electricity"). Caption: "I want more metadata! Where to get it from?"]
  3. Real World Application Scenario
     MOVING search scenario:
     • The MOVING platform (http://platform.moving-project.eu) provides a search for bibliographic resources.
     • We harvest bibliographic metadata using different SLIs.
     • Such metadata is of great value since:
       • We can obtain good search results solely relying on the title [3].
       • We can complement existing metadata.
       • We can train machine learning models to further improve the search [4].
     [Figure: the MOVING platform querying an index (foaf:Agent, dct:subject, bibo:Book, dct:creator) that points to a sample record ("Towards a clean air policy", "Great Britain. Central Electricity").]
  4. Real World Application Scenario
     MOVING search scenario:
     • Which SLIs are best suited to find bibliographic metadata in the Web of Data?
     • Can we find semantically similar databases as well?
     [Figure: the same index as before, now with a second sample record of type bibo:Proceedings ("Proceedings of the …", "Benjamin Elizalde") next to the bibo:Book record ("Towards a clean air policy", "Great Britain. Central Electricity").]
  5. Motivation for FLuID
     • All schema-level indices (SLI) summarize data differently, for different purposes, and lack a common formalization [1,2,5,7-11], for example:
       • Consider incoming and outgoing properties (edges)
       • Consider properties (edge label) and objects (target node)
       • Consider types
       • Consider types and properties
       • …
     • Without a common ground, it is difficult to develop new indices and compare them to existing ones.
     • Even for a single application scenario, a single SLI may not be sufficient, since how the data is modelled can vary a lot [6].
  6. The FLuID Model
     Approach
     • Abstract from the related work (bottom-up): find generic, simple patterns in existing SLIs and use them as basic building blocks to define all (complex) schema structures that exist in previous SLIs.
     • MOVING search scenario (top-down): flexibly define indices that can reflect semantic information and can be computed efficiently.
     Solution
     1. We formalized our building blocks using equivalence relations over a directed, edge-labeled multigraph (RDF graph).
     2. We demonstrated how to model existing works and beyond.
     3. We showed the scalability by conducting a complexity analysis.
  7. The FLuID Model
     • FLuID provides 7 schema elements:
       • 3 simple elements: Object Cluster (OC), Property Cluster (PC), and Property-Object Cluster (POC)
       • 3 undirected elements: u-OC, u-PC, and u-POC
       • 1 Complex Schema Element (CSE)
     • FLuID provides 4 parameterizations:
       • Label parameterization
       • Chaining parameterization
       • Ontology parameterization
       • Instance parameterization
     • In total, FLuID provides 11 building blocks sufficient to model all existing approaches and beyond.
  8. FLuID: Equivalence Relation Approach
     • Instances: edges <s,p,o> with the same subject node s, i.e., ((i1, p1, o1), (i2, p2, o2)) ∈ I ⇔ i1 = i2.
     • Each edge belongs to exactly one instance; a node does not necessarily belong to exactly one.
     • Since instances partition the data graph, a set of instances also partitions the data graph.
     [Figure: example data graph with nodes i1 to i10 and edges labeled p1, p2, and p3.]
  9. The FLuID Model
     • Object Cluster: summarize instances that share a set of connected objects, i.e.,
       ([i1]_I, [i2]_I) ∈ OC ⇔ ∀(i1, p1, o1) ∃(i2, p2, o2) : o1 = o2 ∧ ∀(i2, p2, o2) ∃(i1, p1, o1) : o1 = o2
     [Figure: example data graph with nodes i1 to i10 and edges labeled p1, p2, and p3.]
  10. The FLuID Model
     • Label Parameterized Object Cluster: summarize instances that share the same set of connected objects, considering only edges whose property is p1.
     [Figure: example data graph with nodes i1 to i10 and edges labeled p1, p2, and p3.]
  11. The FLuID Model
     • Label Parameterized Object Cluster: summarize instances that share the same set of connected objects, considering only edges whose property is rdf:type.
     [Figure: example data graph with nodes i1 to i10, edges labeled rdf:type, p2, and p3, and the types bibo:Book, foaf:Agent, and bibo:Proceedings.]
  12. The FLuID Model
     • Label Parameterized Object Cluster: summarize instances that share the same set of connected objects, considering only edges whose property is rdf:type.
     • Ontology parameterization: RDFS Schema Graph
     [Figure: example data graph with nodes i1 to i10, edges labeled rdf:type, p2, and p3, and the types bibo:Book, foaf:Agent, and bibo:Proceedings.]
  13. The FLuID Model
     • Label Parameterized Object Cluster: summarize instances that share the same set of connected objects, considering only edges whose property is rdf:type.
     • Ontology parameterization: RDFS Schema Graph
     • Instance parameterization: owl:sameAs
     [Figure: example data graph with nodes i1 to i10, edges labeled rdf:type, dct:creator, and owl:sameAs, and the types bibo:Book, foaf:Agent, and bibo:Proceedings.]
  14. A Semantic Schema-level Index
     [Figure: an index (foaf:Agent, dct:subject, bibo:Book, dct:creator) pointing to three sample records: "Proceedings of the …" (bibo:Proceedings, "Benjamin Elizalde"), "Towards a clean air policy" (bibo:Book, "Great Britain. Central Electricity"), and "Family planning programmes in Africa" (bibo:Book, dct:creator "Pierre Prader"). The figure also shows an owl:sameAs edge and a foaf:Agent node labeled "Pierre Prader".]
  15. Evaluation
     • Complexity analysis
       • We show that every SLI modeled with FLuID can be computed in O(n).
       • Threat: the on-the-fly inferencing! If the number of RDFS triples depended linearly on the dataset size, we would have quadratic complexity.
     • Empirical evaluation to estimate the impact of inferencing
       • We analyzed two real-world datasets from the Web of Data.
       • TimBL-11M: 11 million triples (edges) crawled from one seed URI.
       • DyLDO-127M: 127 million triples (edges) crawled from 95,000 seed URIs.
       • Practical impact of the on-the-fly inferencing: g < 1.001.
       • Thus, we did not find a linear dependency but rather a constant factor.
  16. Conclusion & Outlook
     • Conclusion
       • We have presented the novel, parameterized schema-level index model FLuID, which is sufficient to express the functionalities of existing SLIs and beyond.
       • We showed that the build-time and space complexity of any SLI developed with FLuID scales linearly with respect to the number of triples indexed.
     • Outlook
       • Implementing FLuID in a single computation and query framework
         • https://github.com/t-blume/fluid-framework
         • http://lodatio.informatik.uni-kiel.de/
       • Qualitatively comparing existing and new approaches.
  17. Thank you for your attention! Any questions?
     Project consortium and funding agency: MOVING is funded by the EU Horizon 2020 Programme under the project number INSO-4-2015: 693092
  18. References
     1. F. Benedetti, S. Bergamaschi, and L. Po. Exposing the underlying schema of LOD sources. In Joint IEEE/WIC/ACM WI and IAT, 2015.
     2. M. Ciglan, K. Nørvåg, and L. Hluchý. The SemSets model for ad-hoc semantic list search. In WWW, 2012.
     3. L. Galke, F. Mai, A. Schelten, D. Brunsch, and A. Scherp. Using titles vs. full-text as source for automated semantic document annotation. In K-CAP, 2017.
     4. L. Galke, A. Saleh, and A. Scherp. Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical Information Retrieval. In INFORMATIK, 2017.
     5. R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In VLDB, 1997.
     6. J. Jett, T. Nurmikko-Fuller, T.W. Cole, K.R. Page, and J.S. Downie. Enhancing scholarly use of digital libraries: A comparative survey and review of bibliographic metadata ontologies. In JCDL, 2016.
     7. M. Konrath, T. Gottron, S. Staab, and A. Scherp. SchemEX - efficient construction of a data catalogue by stream-based indexing of Linked Data. J. Web Sem., 16:52–58, 2012.
     8. J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: a database management system for semistructured data. SIGMOD Record, 26(3):54–66, 1997.
     9. T. Neumann and G. Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In ICDE, 2011.
     10. J. Schaible, T. Gottron, and A. Scherp. TermPicker: Enabling the reuse of vocabulary terms by exploiting data from the Linked Open Data cloud. In ESWC, 2016.
     11. B. Spahiu, R. Porrini, M. Palmonari, A. Rula, and A. Maurino. ABSTAT: ontology-driven Linked Data summaries with pattern minimalization. In ESWC Satellite Events, Revised Selected Papers, 2016.
  19. Search Engine Prototype: LODatio+
     http://lodatio.informatik.uni-kiel.de
  20. Real World Application Scenario
     http://platform.moving-project.eu
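
The equivalence-relation view of instances and Object Clusters (slides 8 to 11) can be illustrated with a small Python sketch. This is not code from the fluid-framework: the toy triples and the helper names `instances` and `object_cluster` are assumptions made purely for illustration. Instances are formed by grouping edges on their subject, and two instances fall into the same Object Cluster when their sets of objects are equal; passing a label restricts the comparison to edges with that property.

```python
from collections import defaultdict

# Toy data graph as (subject, property, object) triples.
# All identifiers are made up for this example.
TRIPLES = [
    ("i1", "rdf:type", "bibo:Book"),
    ("i1", "dct:creator", "i3"),
    ("i2", "rdf:type", "bibo:Book"),
    ("i2", "dct:creator", "i4"),
    ("i3", "rdf:type", "foaf:Agent"),
    ("i4", "rdf:type", "foaf:Agent"),
]


def instances(triples):
    """Group edges by subject, so each edge lands in exactly one instance."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        grouped[s].append((s, p, o))
    return dict(grouped)


def object_cluster(triples, label=None):
    """Group instances by their set of connected objects.

    With label=None this mirrors the plain Object Cluster. With a label,
    only edges carrying that property are compared (label parameterization);
    instances without such an edge share the empty object set.
    """
    clusters = defaultdict(set)
    for subject, edges in instances(triples).items():
        objects = frozenset(o for _, p, o in edges if label is None or p == label)
        clusters[objects].add(subject)
    return dict(clusters)


if __name__ == "__main__":
    print(object_cluster(TRIPLES))
    print(object_cluster(TRIPLES, label="rdf:type"))
```

On this toy graph, the plain Object Cluster keeps i1 and i2 apart because their creators differ, while the rdf:type parameterization places both in the same class.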

Editor's notes

  • Structural query = a query without instance information such as the title or the author.

    The index is computed from the data; for example, we crawl open databases on the Web.
  • Colors indicate partitions on the data graph
  • P1 = rdf:type
    P2 = dct:creator
    P3 = owl:sameAs
    P4 = rdfs:subClassOf
  • Build time is important for the computation of the index

    Index size influences the query time
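
To make the idea of a structural query concrete, here is a minimal Python sketch of a schema-level index that maps schema elements to the databases using them. It assumes a deliberately simplified schema element (the set of rdf:type objects plus the set of other properties of each subject); the quads, the example.org source URLs, and the function names `build_index` and `structural_query` are hypothetical and do not come from the presentation or the fluid-framework. The index is built in a single pass over the edges with hash-based grouping, which illustrates in spirit, but does not prove, the linear build time claimed on the slides.

```python
from collections import defaultdict

# Hypothetical quads: (subject, property, object, source database).
QUADS = [
    ("URI-1", "rdf:type", "bibo:Book", "http://example.org/db-a"),
    ("URI-1", "dct:creator", "URI-2", "http://example.org/db-a"),
    ("URI-2", "rdf:type", "foaf:Agent", "http://example.org/db-a"),
    ("URI-8", "rdf:type", "bibo:Proceedings", "http://example.org/db-b"),
    ("URI-8", "dct:creator", "URI-9", "http://example.org/db-b"),
]


def build_index(quads):
    """Map each schema element (types, properties) to the sources that use it."""
    types = defaultdict(set)
    properties = defaultdict(set)
    sources = defaultdict(set)
    for s, p, o, source in quads:  # one pass over all edges
        if p == "rdf:type":
            types[s].add(o)
        else:
            properties[s].add(p)
        sources[s].add(source)
    index = defaultdict(set)
    for s in sources:
        key = (frozenset(types[s]), frozenset(properties[s]))
        index[key] |= sources[s]
    return dict(index)


def structural_query(index, types=(), properties=()):
    """Return the sources whose schema elements contain the requested
    types and properties. No instance data (titles, names) is involved."""
    wanted_types, wanted_props = set(types), set(properties)
    hits = set()
    for (elem_types, elem_props), srcs in index.items():
        if wanted_types <= elem_types and wanted_props <= elem_props:
            hits |= srcs
    return hits


if __name__ == "__main__":
    idx = build_index(QUADS)
    print(structural_query(idx, types=["bibo:Book"], properties=["dct:creator"]))
```

The query asks only for structure (a bibo:Book with a dct:creator), and the answer is a set of databases rather than a set of records, which matches the role of the index described in the notes.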
