
Complex Matching of RDF Datatype Properties

Paper about complex matching of RDF datatype properties.

DEXA Conference, 2013, Prague, Czech Republic.


  1. Complex Matching of RDF Datatype Properties
     Bernardo Pereira Nunes (1,2), Alexander Mera (1), Marco Antonio Casanova (1), Besnik Fetahu (2), Luiz André P. Paes Leme (3), Stefan Dietze (2)
     1) Department of Informatics, PUC-Rio; 2) L3S Research Center, Leibniz University Hannover; 3) Computer Science Institute, Fluminense Federal University
     DEXA 2013 – Prague, Czech Republic
  2. Outline
     • Introduction
     • Motivation
     • Related Work
     • Schema Matching Principles
     • Our approach:
         • Phase 1) Estimated Mutual Information (EMI)
         • Phase 2) Genetic Programming (GP)
     • Evaluation
     • Results
     • Conclusions
     Besnik Fetahu, DEXA 2013 – Prague, Czech Republic
  3. Introduction
     • Data Integration
         • Combine different data sources into a unified view of the data
         • Originally driven by large organizations:
             • Merging company databases after acquisitions
         • Currently driven by new Web trends such as:
             • Improvement of Web-based search
             • Proliferation of Web applications
             • e-business
         • Examples: momondo.de, semantic search, price-watching sites, etc.
  4. Introduction
     • Challenges
         • Heterogeneous data
         • Different data formats
         • Data quality (data impurities, corrupted information)
         • Scalability
         • Adaptability
         • Cost
  5. Introduction
     • Initiatives to address data integration problems
         • Linked Data Principles
         • Ontology Alignment Initiatives (OAI)
         • Schema Matching tools
  6. Motivation
     • Given two schemas S and T, a matching from S to T exists if an element e of S is mapped to an element e' of T by some expression that relates the two elements.
  7. Related Work
     • Methods
         • RiMOM, iMAP, S-Match, DSSim, ATOM, etc.
         • Schema-based approach
         • Instance-based approach
         • Hybrid approach
     • Cardinality
         • 1:1
         • 1:n
         • n:m
     Rahm, E. and Bernstein, P. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10, 4 (Dec. 2001), 334-350.
  8. Cardinality
     • Simple match
         • 1:1 – direct matching
         • Example: ISBN "0-671-72287-5" matches ISBN "0-671-72287-5"
     • Complex match
         • 1:1 / n:1 (mapping functions)
         • Example: Fullname "William Shakespeare" matches Firstname "William" and Last name "Shakespeare" through split(fullname) and concatenate(f,l)
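The two mapping functions named on this slide can be sketched in a few lines of Python. This is only an illustration: the function names follow the slide, but the signatures and the choice of splitting on the first space are assumptions, not the authors' implementation.

```python
from typing import Tuple


def split(fullname: str) -> Tuple[str, str]:
    """1:n mapping: one full-name value becomes (first name, last name).

    Assumption: the first space separates the first name from the rest.
    """
    first, _, last = fullname.strip().partition(" ")
    return first, last


def concatenate(f: str, l: str) -> str:
    """n:1 mapping: first name and last name become one full-name value."""
    return f"{f} {l}".strip()


first, last = split("William Shakespeare")
fullname = concatenate(first, last)
```

Here `split("William Shakespeare")` yields the pair `("William", "Shakespeare")`, and `concatenate` reverses it.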
  9. Our approach
     • Two-phase approach:
         • Estimated Mutual Information
             • Suggests 1:1 and 1:n mappings
             • Serves as a filtering step (filters out data properties that have no mutual information)
             • Reduces the search space for the next phase (speeds up the process)
         • Genetic Programming
             • Automatic process for creating mapping functions
             • Reduces the cost of traversing the search space
  10. Estimated Mutual Information (EMI)
     • EMI Matrix
         • p = (p1, …, pu) and q = (q1, …, qv) are two lists of sets (i.e. sets of datatype properties)
         • Co-occurrence measures named on the slide: Cosine Similarity, Jaccard Index
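The EMI formula itself did not survive in this transcript. The sketch below assumes the common form in which each cell m[i][j] of a co-occurrence matrix contributes (m_ij / M) * log(M * m_ij / (r_i * c_j)), where M is the sum of all cells and r_i, c_j are the row and column sums. The function name `emi_matrix` and the handling of zero cells are illustrative assumptions, not taken from the slides.

```python
import math
from typing import List


def emi_matrix(m: List[List[float]]) -> List[List[float]]:
    """Estimated mutual information per cell of a co-occurrence matrix.

    Assumed formula (not shown in the transcript):
        EMI_ij = (m_ij / M) * log(M * m_ij / (row_sum_i * col_sum_j))
    Cells with m_ij == 0 are given an EMI of 0.
    """
    total = sum(sum(row) for row in m)
    row_sums = [sum(row) for row in m]
    col_sums = [sum(col) for col in zip(*m)]
    result = []
    for i, row in enumerate(m):
        out = []
        for j, mij in enumerate(row):
            if mij == 0 or total == 0:
                out.append(0.0)
            else:
                ratio = total * mij / (row_sums[i] * col_sums[j])
                out.append((mij / total) * math.log(ratio))
        result.append(out)
    return result


# Rows: properties p1, p2 of one schema; columns: q1, q2 of the other.
emi = emi_matrix([[4, 0], [0, 4]])
```

In this toy matrix only the diagonal cells get a positive score, so (p1, q1) and (p2, q2) would be kept as candidate matches and the other pairs filtered out.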
  11. Estimated Mutual Information (EMI)
     • Computing the mutual information:
         • Cosine Similarity
             • Simple matches: William Shakespeare → William Shakespeare
         • Jaccard Similarity
             • Simple and complex matches: William → William Shakespeare
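One way to read the contrast on this slide is that cosine similarity compares whole property values while the Jaccard index compares token sets, so only the latter sees the overlap between "William" and "William Shakespeare". The sketch below works under that assumption; the helper `tokens` and the choice of whitespace tokenization are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Set


def cosine_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """Cosine similarity between two bags of whole property values."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def jaccard_index(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tokens(values: Iterable[str]) -> Set[str]:
    """Assumption: values are tokenized on whitespace and lowercased."""
    return {tok.lower() for value in values for tok in value.split()}


full_names = ["William Shakespeare"]
first_names = ["William"]

same = cosine_similarity(full_names, full_names)                # 1.0
whole = cosine_similarity(first_names, full_names)              # 0.0
partial = jaccard_index(tokens(first_names), tokens(full_names))  # 0.5
```

The whole-value comparison misses the partial overlap entirely, while the token-based Jaccard index gives it a non-zero score, which is what makes a complex match a candidate.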
  12. Genetic Programming (GP)
     • Genetic programming refers to an automated method to create and evolve programs to solve a problem.
     • A solution is represented by a tree whose nodes are labeled with functions (concatenate, split, sum) or with values (strings, numbers, etc.).
     • New individuals are generated by applying genetic operations to the current population of individuals.
     • Individuals that should breed are selected by an evolutionary process.
  13. Genetic Programming (GP)
     • GP Functions:
         • Crossover
             • The act of swapping gene values between two potential solutions, simulating the "mating" of the two solutions.
         • Mutation
             • The act of randomly altering the value of a gene in a potential solution.
         • Reproduction
             • The act of making a copy of a potential solution.
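The three operations can be sketched on small expression trees. In this illustration a tree is either a leaf (a property name) or a list `[function, child, child, ...]`; that representation, the helper names, and the uniform random choice of subtrees are assumptions made for the example, not the authors' implementation.

```python
import copy
import random

# A tree is a leaf string (property name) or a list: [function, child, ...].


def subtree_paths(tree, path=()):
    """Paths (tuples of child indexes) to every subtree, root included."""
    paths = [path]
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            paths.extend(subtree_paths(child, path + (i,)))
    return paths


def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return copy.deepcopy(new)
    result = copy.deepcopy(tree)
    parent = get_subtree(result, path[:-1])
    parent[path[-1]] = copy.deepcopy(new)
    return result


def crossover(a, b, rng):
    """Swap one randomly chosen subtree of a with one of b."""
    pa = rng.choice(subtree_paths(a))
    pb = rng.choice(subtree_paths(b))
    return (replace_subtree(a, pa, get_subtree(b, pb)),
            replace_subtree(b, pb, get_subtree(a, pa)))


def mutation(tree, terminals, rng):
    """Replace one randomly chosen leaf with a random terminal."""
    leaves = [p for p in subtree_paths(tree)
              if isinstance(get_subtree(tree, p), str)]
    return replace_subtree(tree, rng.choice(leaves), rng.choice(terminals))


def reproduction(tree):
    """Copy an individual unchanged into the next generation."""
    return copy.deepcopy(tree)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
parent_a = ["+", "Complement", "Number"]
parent_b = ["+", "Address", "Number"]
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = mutation(parent_a, ["Neighborhood"], rng)
clone = reproduction(parent_a)
```

All three operations return new trees and leave the parents untouched, so the current population can still be ranked by fitness after offspring are generated.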
  14. Genetic Programming (GP)
     • Fitness function
         • Levenshtein similarity function for string values
         • KL-divergence measure for numeric values
         • Different measures are applied because datatype property values can share many common values (such as 0), which can lead to a wrong match. Thus, we use KL divergence to measure the probability that two sets of values are the same.
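The two measures named on this slide can be sketched as follows. The normalization of the Levenshtein distance into a similarity, the histogram binning, and the smoothing constant are all assumptions chosen for the example; the slides do not specify them.

```python
import math
from typing import List, Sequence


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def histogram(values: Sequence[float], lo: float, hi: float,
              bins: int, eps: float = 1e-9) -> List[float]:
    """Smoothed, normalized histogram (eps avoids zero probabilities)."""
    counts = [eps] * bins
    width = (hi - lo) / bins or 1.0
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def numeric_kl(xs: Sequence[float], ys: Sequence[float], bins: int = 10) -> float:
    """KL divergence between the value distributions of two numeric sets."""
    lo, hi = min(min(xs), min(ys)), max(max(xs), max(ys))
    return kl_divergence(histogram(xs, lo, hi, bins), histogram(ys, lo, hi, bins))


string_score = levenshtein_similarity("William Shakespeare", "William Shakespear")
same_numbers = numeric_kl([1, 2, 3], [1, 2, 3])
different_numbers = numeric_kl([0, 0, 0], [10, 10, 10])
```

A divergence near zero indicates that the two numeric value sets follow the same distribution, while a large value indicates that they do not, even when both happen to contain a few shared values such as 0.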
  15. An Example of Implementation
  16. An Example of Implementation
     Phase 1 – Co-occurrence matrix
     1. Difference between Cosine/Jaccard similarity metrics.
  17. An Example of Implementation
     Phase 1 – EMI matrix
     2. Possible matchings:
  18. An Example of Implementation
  19. An Example of Implementation
     [Figure: expression trees such as "Complement + Number" and "Address + Number", combined with Neighborhood, illustrating crossover, mutation, and reproduction]
  20. An Example of Implementation
     [Figure labels: Correct; Repetitive and Incorrect; mutation]
  21. Evaluation
     • Datasets
         • "Personal Information" dataset lists information about people
         • "Real Estate" dataset lists information about houses for sale
         • "Inventory" dataset describes product inventories
     With the exception of the "Personal Information" dataset, which is withheld for privacy reasons, the datasets are available at: http://pages.cs.wisc.edu/ anhai/wisc-si-archive/domains/
  22. Results
  23. Results
  24. Conclusions
     • Complex schema matching approach
         • Simple + complex matching:
             • Estimated Mutual Information + Genetic Programming
         • Reduced search space for matching properties
         • Adaptive to variations of 1:1 and n:1 matching instances
         • High accuracy and coverage of the generated matches
     27/08/13, Ricardo Kawase
  25. Questions? Thank you!
