1. Complex Matching of RDF Datatype Properties
Bernardo Pereira Nunes1,2, Alexander Mera1, Marco Antonio Casanova1, Besnik Fetahu2, Luiz André P. Paes Leme3, Stefan Dietze2
1) Department of Informatics, PUC-Rio, 2) L3S Research Center, Leibniz University Hannover,
3) Computer Science Institute, Fluminense Federal University
DEXA 2013 – Prague, Czech Republic
3. Introduction
• Data Integration
• Combine different data sources into a unified view of the data
• Originally driven by large organizations:
• Merging company databases after acquisitions
• Currently, driven by new Web trends such as:
• Improvement of Web-based search
• Proliferation of Web applications
• e-business
• Examples: momondo.de, semantic search, price-watch sites, etc.
Besnik Fetahu, DEXA 2013 – Prague, Czech Republic
4. Introduction
• Challenges
• Heterogeneous data
• Different data formats
• Data quality (data impurities, corrupted information)
• Scalability
• Adaptability
• Cost
5. Introduction
• Initiatives to address data integration problems
• Linked Data Principles
• Ontology Alignment Initiatives (OAI)
• Schema Matching tools
6. Motivation
• Given two schemas S and T, a matching from S to T exists when an
element e of S is mapped to an element e’ of T by some expression
that relates the two elements.
7. Related Work
• Methods
• RiMOM, iMAP, S-Match, DSSim, ATOM, etc.
• Schema-based approach
• Instance-based approach
• Hybrid approach
• Cardinality
• 1:1
• 1:n
• n:m
Rahm, E. and Bernstein, P. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10, 4 (Dec. 2001), 334-350.
8. Cardinality
• Simple match
• 1:1 – direct matching
• Complex match
• 1:1 / n:1 (mapping functions)
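• Example (simple match): ISBN "0-671-72287-5" ↔ ISBN "0-671-72287-5"
• Example (complex match): Fullname "William Shakespeare" ↔ Firstname "William", Last name "Shakespeare", using split(fullname) in one direction and concatenate(f,l) in the other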
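The two mapping functions named on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the rule of splitting at the first space are assumptions, not the authors' implementation.

```python
from typing import Tuple


def split_fullname(fullname: str) -> Tuple[str, str]:
    """Split a full name into (first, last) at the first space (assumed rule)."""
    first, _, last = fullname.partition(" ")
    return first, last


def concatenate(first: str, last: str) -> str:
    """Join first and last name with a single space."""
    return f"{first} {last}".strip()


first, last = split_fullname("William Shakespeare")
print(first, "|", last)
print(concatenate(first, last))
```

A 1:1 simple match such as ISBN needs no function at all, while the Fullname case only lines up once one of these functions is applied.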
9. Our approach
• Two-phase approach:
• Estimated Mutual Information
• Suggest 1:1 and 1:n mappings
• Serve as a filtering step (filter out data properties that have no mutual information)
• Reduce search space for the next phase (speed up the process)
• Genetic Programming
• Automatic process for creating mapping functions
• Reduces the cost of traversing the search space
10. Estimated Mutual Information (EMI)
• EMI Matrix
• Let p = (p1,…,pu) and q = (q1,…,qv) be two lists of sets (i.e., sets of datatype properties)
[Figure: EMI matrix; similarity measures shown: Cosine Similarity, Jaccard Index, …]
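The formula on this slide did not survive extraction, so the sketch below is an assumption rather than the authors' definition. It builds a co-occurrence matrix from plain set intersections, m_ij = |p_i ∩ q_j|, and then applies one common formulation of estimated mutual information, EMI_ij = (m_ij / M) · log(M · m_ij / (row_i · col_j)), where M is the sum of all entries. The sample values are made up for illustration.

```python
import math


def cooccurrence_matrix(p, q):
    """m[i][j] = number of values shared by p[i] and q[j] (assumed measure)."""
    return [[len(pi & qj) for qj in q] for pi in p]


def emi_matrix(m):
    """Estimated mutual information per cell, computed from co-occurrence counts."""
    total = sum(sum(row) for row in m)
    row_sums = [sum(row) for row in m]
    col_sums = [sum(col) for col in zip(*m)]
    emi = []
    for i, row in enumerate(m):
        emi_row = []
        for j, m_ij in enumerate(row):
            if m_ij == 0:
                emi_row.append(0.0)  # no shared values, no information
            else:
                ratio = total * m_ij / (row_sums[i] * col_sums[j])
                emi_row.append((m_ij / total) * math.log(ratio))
        emi.append(emi_row)
    return emi


# Illustrative value sets for two properties on each side.
p = [{"William Shakespeare", "Jane Austen"}, {"0-671-72287-5", "0-14-143951-3"}]
q = [{"0-671-72287-5", "0-14-143951-3"}, {"William Shakespeare", "Jane Austen"}]

m = cooccurrence_matrix(p, q)
for row in emi_matrix(m):
    print(row)
```

Cells with a high positive score are candidate matches; cells at zero can be filtered out before the genetic programming phase.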
11. Estimated Mutual Information (EMI)
• Computing the mutual information:
• Cosine Similarity
• Simple matches: William Shakespeare → William Shakespeare
• Jaccard Similarity
• Simple and Complex matches: William → William Shakespeare
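As a rough illustration of the two measures, the sketch below compares property values as bags or sets of whitespace-separated tokens. The slides do not say how values are tokenized, so that choice and the function names are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between token-count vectors of two strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    if not ca or not cb:
        return 0.0
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


print(cosine_similarity("William Shakespeare", "William Shakespeare"))
print(jaccard_similarity("William", "William Shakespeare"))
```

Under these assumptions, a partial overlap such as "William" against "William Shakespeare" still yields a nonzero Jaccard score, which is what lets a candidate complex match survive the filtering step.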
12. Genetic Programming (GP)
• Genetic programming refers to an automated method to create and evolve
programs to solve a problem.
• A solution is represented by a tree, whose nodes are labeled with functions
(concatenate, split, sum) or with values (strings, numbers, etc).
• New individuals are generated by applying genetic operations to the current
population of individuals.
• Individuals that should breed are selected by an evolutionary process.
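The tree representation described above can be sketched as follows. The `Node` class, the `FUNCTIONS` table, and the rule that a leaf is first looked up as a property name and otherwise treated as a constant are all hypothetical choices made for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str                                   # function name, property name, or constant
    children: List["Node"] = field(default_factory=list)


# Functions allowed on inner nodes (a small assumed set).
FUNCTIONS = {
    "concatenate": lambda args: " ".join(args),
}


def evaluate(node: Node, record: Dict[str, str]) -> str:
    """Evaluate a candidate mapping function on one source record."""
    if node.children:
        args = [evaluate(child, record) for child in node.children]
        return FUNCTIONS[node.label](args)
    # Leaf: a property of the record if present, otherwise a constant value.
    return record.get(node.label, node.label)


tree = Node("concatenate", [Node("Firstname"), Node("Lastname")])
record = {"Firstname": "William", "Lastname": "Shakespeare"}
print(evaluate(tree, record))
```

Each individual in the population is one such tree, and its output on the source records is what the fitness function compares against the target property values.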
13. Genetic Programming (GP)
• GP Functions:
• Crossover
• The act of swapping gene values between two potential solutions,
simulating the "mating" of the two solutions.
• Mutation
• The act of randomly altering the value of a gene in a potential solution.
• Reproduction
• The act of making a copy of a potential solution
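The three operations can be sketched on trees stored as nested lists, where `["+", left, right]` is an inner node and a plain string is a leaf. The representation, the helper names, and the choice to mutate only leaves are assumptions for this example, not the authors' implementation.

```python
import copy
import random


def subtree_paths(tree, path=()):
    """Yield the index path to every node of a nested-list tree."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (i,))


def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return copy.deepcopy(new)
    tree = copy.deepcopy(tree)
    parent = get_subtree(tree, path[:-1])
    parent[path[-1]] = copy.deepcopy(new)
    return tree


def crossover(a, b, rng):
    """Swap one randomly chosen subtree of a with one of b."""
    pa = rng.choice(list(subtree_paths(a)))
    pb = rng.choice(list(subtree_paths(b)))
    return (replace_subtree(a, pa, get_subtree(b, pb)),
            replace_subtree(b, pb, get_subtree(a, pa)))


def mutation(tree, terminals, rng):
    """Replace one randomly chosen leaf with a random terminal."""
    leaf_paths = [p for p in subtree_paths(tree)
                  if not isinstance(get_subtree(tree, p), list)]
    return replace_subtree(tree, rng.choice(leaf_paths), rng.choice(terminals))


def reproduction(tree):
    """Copy an individual unchanged into the next generation."""
    return copy.deepcopy(tree)


rng = random.Random(42)
a = ["+", "Address", ["+", "Number", "Complement"]]
b = ["+", "Neighborhood", "Number"]
print(crossover(a, b, rng))
print(mutation(a, ["Address", "Number", "Complement", "Neighborhood"], rng))
print(reproduction(a))
```

All three functions return new trees and leave the parents untouched, so the current population can be reused while the next generation is built.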
14. Genetic Programming (GP)
• Fitness function
• Levenshtein similarity function for string values
• KL-divergence measure for numeric values
• Different measures are applied because numeric data property values often
share common values (such as 0), which can lead to a wrong match. Thus, for
numeric values we use KL-divergence to measure how likely it is that two sets of values are the same.
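A minimal sketch of the two fitness measures follows. The normalization of the Levenshtein distance into a similarity in [0, 1] and the smoothed histogram estimate used for the KL-divergence are assumptions; the slides do not specify either.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for completely different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def kl_divergence(xs, ys, eps=1e-9):
    """KL(P || Q) between the smoothed value histograms of xs and ys."""
    support = set(xs) | set(ys)
    cx, cy = Counter(xs), Counter(ys)
    total_x = len(xs) + eps * len(support)
    total_y = len(ys) + eps * len(support)
    result = 0.0
    for v in support:
        p = (cx[v] + eps) / total_x
        q = (cy[v] + eps) / total_y
        result += p * math.log(p / q)
    return result


print(levenshtein_similarity("William Shakespeare", "William Shakespear"))
print(kl_divergence([0, 0, 1, 2], [0, 0, 1, 2]))
print(kl_divergence([0, 0, 1, 2], [0, 5, 6, 7]))
```

Comparing whole distributions means that two numeric properties sharing only a frequent value such as 0 still score poorly, whereas a value-by-value overlap test would be misled by it.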
15. An Example of Implementation
16. An Example of Implementation
Phase 1 – Co-occurrence matrix
1. Difference between Cosine/Jaccard similarity metrics.
17. An Example of Implementation
Phase 1 – EMI matrix
2. Possible matchings:
18. An Example of Implementation
19. An Example of Implementation
[Figure: expression trees that combine the properties Address, Number, Complement, and Neighborhood with the + operator, illustrating crossover, mutation, and reproduction]
20. An Example of Implementation
[Figure: resulting individuals, annotated as "Correct" or "Repetitive and Incorrect mutation"]
21. Evaluation
• Datasets
• “Personal Information” dataset lists information about people
• “Real Estate” dataset lists information about houses for sale
• “Inventory” dataset describes product inventories
Except for the “Personal Information” dataset, which is withheld for privacy reasons, the datasets are available at:
http://pages.cs.wisc.edu/~anhai/wisc-si-archive/domains/