Complex Matching of RDF Datatype Properties

Bernardo Pereira Nunes (1, 2), Alexander Mera (1), Marco Antonio Casanova (1), Besnik Fetahu (2), Luiz André P. Paes Leme (3), Stefan Dietze (2)
1) Department of Informatics, PUC-Rio; 2) L3S Research Center, Leibniz University Hannover; 3) Computer Science Institute, Fluminense Federal University
DEXA 2013 – Prague, Czech Republic

Outline
• Introduction
• Motivation
• Related Work
• Schema Matching Principles
• Our approach:
• Phase 1) Estimated Mutual Information – EMI
• Phase 2) Genetic Programming – GP
• Evaluation
• Results
• Conclusions

Introduction
• Data Integration
• Combine different data sources into a unified view of the data
• Originally driven by large organizations:
• Merging company databases after acquisitions
• Currently, driven by new Web trends such as:
• Improvement of Web-based search
• Proliferation of Web applications
• e-business
• Examples: momondo.de, semantic search, price-watching sites, etc.

Introduction
• Challenges
• Heterogeneous data
• Different data formats
• Data quality (data impurities, corrupted information)
• Scalability
• Adaptability
• Cost

Introduction
• Initiatives to address data integration problems
• Linked Data Principles
• Ontology Alignment Initiatives (OAI)
• Schema Matching tools

Motivation
• Given two schemas S and T, a matching from S to T exists when an element e of S is mapped to an element e’ of T by some expression that relates the two elements.

Related Work
• Methods
• RiMOM, iMAP, S-Match, DSSim, ATOM, etc.
• Schema-based approach
• Instance-based approach
• Hybrid approach
• Cardinality
• 1:1
• 1:n
• n:m
Rahm, E. and Bernstein, P. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10, 4 (Dec. 2001), 334-350.

Cardinality
• Simple match
• 1:1 – direct matching
• Complex match
• 1:1 / n:1 (mapping functions)
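• Simple (1:1): ISBN "0-671-72287-5" ↔ ISBN "0-671-72287-5"
• Complex (1:n): Fullname "William Shakespeare" → split(fullname) → Firstname "William", Last name "Shakespeare"
• Complex (n:1): Firstname "William", Last name "Shakespeare" → concatenate(f,l) → Fullname "William Shakespeare"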
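
The two mapping functions in the example can be sketched in Python as follows. This is only an illustration, not the authors' implementation: the function names mirror the slide, while the rule of splitting on the last space is an assumption.

```python
def split(fullname):
    """1:n mapping: split a full name into (first name, last name).

    Assumption: the last whitespace-separated token is the last name.
    """
    first, _, last = fullname.strip().rpartition(" ")
    return first, last


def concatenate(first, last):
    """n:1 mapping: join first and last name into a full name."""
    return f"{first} {last}".strip()


first, last = split("William Shakespeare")
fullname = concatenate(first, last)
```

Applied one after the other, the two functions are inverses of each other on names with exactly two parts.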

Our approach
• Two-phase approach:
• Estimated Mutual Information
• Suggests 1:1 and 1:n mappings
• Serves as a filtering step (filters out data properties that have no mutual information)
• Reduce search space for the next phase (speed up the process)
• Genetic Programming
• Automatic process for creating mapping functions
• Reduces the cost of traversing the search space

Estimated Mutual Information (EMI)
• EMI Matrix
• p = (p1, …, pu), q = (q1, …, qv): two lists of sets (i.e., sets of datatype properties)
• Similarity measures: Cosine Similarity, Jaccard Index, …
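
The formula on the original slide did not survive extraction. As a hedged sketch, one common way to estimate the mutual-information contribution of each cell of a co-occurrence matrix m is (m_ij / M) · log(M · m_ij / (row_i · col_j)), where M is the sum of all cells and row_i, col_j are the row and column totals. The function name `emi_matrix` and this exact formula are assumptions, not taken from the slides.

```python
import math


def emi_matrix(m):
    """Per-cell mutual-information estimate from a co-occurrence matrix.

    m is a list of rows; m[i][j] holds the co-occurrence score of p_i and q_j.
    Assumed formula: (m_ij / M) * log(M * m_ij / (row_i * col_j)).
    """
    total = sum(sum(row) for row in m)
    row_sums = [sum(row) for row in m]
    col_sums = [sum(col) for col in zip(*m)]
    emi = []
    for i, row in enumerate(m):
        out = []
        for j, mij in enumerate(row):
            if mij == 0:
                out.append(0.0)  # no co-occurrence, no contribution
            else:
                ratio = (total * mij) / (row_sums[i] * col_sums[j])
                out.append((mij / total) * math.log(ratio))
        emi.append(out)
    return emi


# p_1 co-occurs only with q_1, and p_2 only with q_2.
scores = emi_matrix([[10, 0], [0, 10]])
```

Under this estimate, cells with a positive score are candidate matches, and pairs with no shared information can be dropped before the next phase.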

Estimated Mutual Information (EMI)
• Computing the mutual information:
• Cosine Similarity
• Simple matches: William Shakespeare → William Shakespeare
• Jaccard Similarity
• Simple and Complex matches: William → William Shakespeare
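
The difference between the two measures can be illustrated with a small Python sketch. The granularity used here is an assumption made for illustration and is not stated on the slide: cosine similarity is computed over whole property values, and the Jaccard index over whitespace-separated tokens.

```python
import math
from collections import Counter


def cosine_similarity(items_a, items_b):
    """Cosine similarity between the count vectors of two item collections."""
    va, vb = Counter(items_a), Counter(items_b)
    dot = sum(count * vb[item] for item, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(items_a, items_b):
    """Jaccard index: size of the intersection over size of the union."""
    sa, sb = set(items_a), set(items_b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def tokens(values):
    """Break every value into whitespace-separated tokens."""
    return [tok for value in values for tok in value.split()]


fullname = ["William Shakespeare"]
firstname = ["William"]

simple = cosine_similarity(fullname, fullname)          # identical values
partial_whole = cosine_similarity(firstname, fullname)  # no shared whole value
partial_tokens = jaccard_similarity(tokens(firstname), tokens(fullname))
```

With these inputs the whole-value cosine only rewards the identical pair, while the token-level Jaccard index also gives a non-zero score to the partial overlap that signals a possible complex match.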

Genetic Programming (GP)
• Genetic programming refers to an automated method to create and evolve
programs to solve a problem.
• A solution is represented by a tree, whose nodes are labeled with functions
(concatenate, split, sum) or with values (strings, numbers, etc).
• New individuals are generated by applying genetic operations to the current
population of individuals.
• An evolutionary selection process chooses which individuals should breed.
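
The tree representation of a candidate mapping can be sketched as follows. The class names (`Prop`, `Const`, `Func`) and the use of leaves that read a property from a source record are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Prop:
    name: str  # leaf: reads a property value from the source record


@dataclass(frozen=True)
class Const:
    value: str  # leaf: a constant value such as a separator


@dataclass(frozen=True)
class Func:
    name: str  # inner node: a function applied to its children
    args: Tuple["Node", ...]


Node = Union[Prop, Const, Func]

# Hypothetical function set; only concatenate is implemented here.
FUNCTIONS = {
    "concatenate": lambda *parts: "".join(parts),
}


def evaluate(node, record):
    """Evaluate a mapping tree bottom-up against one source record."""
    if isinstance(node, Prop):
        return record[node.name]
    if isinstance(node, Const):
        return node.value
    args = [evaluate(child, record) for child in node.args]
    return FUNCTIONS[node.name](*args)


tree = Func("concatenate", (Prop("firstname"), Const(" "), Prop("lastname")))
result = evaluate(tree, {"firstname": "William", "lastname": "Shakespeare"})
```

Evaluating a tree on source records produces the values that are then compared with the target property.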

• GP operations:
• Crossover
• The act of swapping gene values between two potential solutions,
simulating the "mating" of the two solutions.
• Mutation
• The act of randomly altering the value of a gene in a potential solution.
• Reproduction
• The act of making a copy of a potential solution
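
The three operations can be sketched on small trees as follows. The encoding is an assumption chosen for brevity: a tree is either a leaf (a property name) or a list `[function, left, right]`. The property names are borrowed from the address example later in the deck.

```python
import copy
import random


def paths(tree, prefix=()):
    """Yield the path (tuple of child indexes) of every node in the tree."""
    yield prefix
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))


def get(tree, path):
    """Return the subtree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree


def replace(tree, path, subtree):
    """Return a copy of tree with the node at path replaced by subtree."""
    if not path:
        return copy.deepcopy(subtree)
    new = copy.deepcopy(tree)
    get(new, path[:-1])[path[-1]] = copy.deepcopy(subtree)
    return new


def crossover(a, b, rng):
    """Swap one randomly chosen subtree between two parents."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    return replace(a, pa, get(b, pb)), replace(b, pb, get(a, pa))


def mutate(tree, leaves, rng):
    """Replace one randomly chosen leaf with a random property name."""
    leaf_paths = [p for p in paths(tree) if not isinstance(get(tree, p), list)]
    return replace(tree, rng.choice(leaf_paths), rng.choice(leaves))


def reproduce(tree):
    """Copy an individual unchanged into the next generation."""
    return copy.deepcopy(tree)


rng = random.Random(0)
a = ["+", "Number", "Complement"]
b = ["+", "Address", ["+", "Neighborhood", "Number"]]
child_a, child_b = crossover(a, b, rng)
mutant = mutate(a, ["Address", "Neighborhood"], rng)
clone = reproduce(a)
```

All three functions return new trees and leave the parents untouched, so the current population can still be used for selection.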

• Fitness function
• Levenshtein similarity function for string values
• KL-divergence measure for numeric values
• Different measures are applied because data property values can share many common values (such as 0), which can lead to a wrong match. Thus, we use KL-divergence to estimate how likely it is that two sets of values are the same.
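
Both fitness measures can be sketched in plain Python. The normalisation of the Levenshtein distance into a similarity in [0, 1] and the additive smoothing constant `eps` in the KL-divergence are assumptions of this sketch, not details given on the slide.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Map the edit distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def kl_divergence(xs, ys, eps=1e-9):
    """KL(P || Q) between the empirical distributions of two samples.

    eps is a small smoothing constant (an assumption) that avoids log(0)
    when a value occurs in only one of the samples.
    """
    support = set(xs) | set(ys)
    cx, cy = Counter(xs), Counter(ys)
    total = 0.0
    for v in support:
        p = (cx[v] + eps) / (len(xs) + eps * len(support))
        q = (cy[v] + eps) / (len(ys) + eps * len(support))
        total += p * math.log(p / q)
    return total
```

A divergence near 0 indicates that two numeric value sets have nearly the same distribution, whereas sharing a single frequent value such as 0 is not enough to bring the divergence down.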

An Example of Implementation

Phase 1 – Co-occurrence matrix
1. Difference between Cosine/Jaccard similarity metrics.

Phase 1 – EMI matrix
2. Possible matchings:

[Figure: GP operations (crossover, mutation, reproduction) applied to trees that combine the properties Address, Number, Complement and Neighborhood with +. The resulting individuals are labeled "Correct" and "Repetitive and incorrect mutation".]

Evaluation
• Datasets
• “Personal Information” dataset lists information about people
• “Real Estate” dataset lists information about houses for sale
• “Inventory” dataset describes product inventories
With the exception of the “Personal Information” dataset, which is withheld for privacy reasons, the datasets are available at:
http://pages.cs.wisc.edu/~anhai/wisc-si-archive/domains/

Results

Conclusions
• Complex schema matching approach
• Simple + Complex matching:
• Estimated Mutual Information + Genetic Programming
• Reduced search space for matching properties
• Adaptive to variations of 1:1 and n:1 matching instances
• High accuracy and coverage for the generated matches

Questions?
Thank you!
