Complex Matching of RDF Datatype Properties
Bernardo Pereira Nunes (1,2), Alexander Mera (1), Marco Antonio Casanova (1), Besnik Fetahu (2), Luiz André P. Paes Leme (3), Stefan Dietze (2)
1) Department of Informatics, PUC-Rio; 2) L3S Research Center, Leibniz University Hannover; 3) Computer Science Institute, Fluminense Federal University
DEXA 2013 – Prague, Czech Republic
Outline
• Introduction
• Motivation
• Related Work
• Schema Matching Principles
• Our approach:
• Phase 1) Estimated Mutual Information – EMI
• Phase 2) Genetic Programming – GP
• Evaluation
• Results
• Conclusions
Besnik Fetahu, DEXA 2013 – Prague, Czech Republic
Introduction
• Data Integration
• Combine different data sources into a unified view of the data
• Originally driven by large organizations:
• Merging company databases after acquisitions
• Currently driven by new Web trends such as:
• Improvement of Web-based search
• Proliferation of Web applications
• e-business
• Examples: momondo.de, semantic search, price-watch sites, etc.
• Challenges
• Heterogeneous data
• Different data formats
• Data quality (data impurities, corrupted information)
• Scalability
• Adaptability
• Cost
• Initiatives to address data integration problems
• Linked Data Principles
• Ontology Alignment Initiatives (OAI)
• Schema Matching tools
Motivation
• Given two schemas S and T, a matching from S to T exists when an element e of S is mapped to an element e’ of T by some expression that relates the two elements.
Related Work
• Methods
• RiMOM, iMAP, S-Match, DSSim, ATOM, etc.
• Schema-based approach
• Instance-based approach
• Hybrid approach
• Cardinality
• 1:1
• 1:n
• n:m
Rahm, E. and Bernstein, P. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10, 4 (Dec. 2001), 334-350.
Cardinality
• Simple match
• 1:1 – direct matching
• Complex match
• 1:1 / n:1 (mapping functions)
• Examples:
• Simple: ISBN (0-671-72287-5) → ISBN (0-671-72287-5)
• Complex: Fullname (William Shakespeare) → Firstname (William), Last name (Shakespeare) via split(fullname)
• Complex: Firstname (William), Last name (Shakespeare) → Fullname (William Shakespeare) via concatenate(f,l)
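The two mapping functions named on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `split_fullname` and `concatenate`, and the choice to split on the first space, are assumptions and are not taken from the authors' implementation.

```python
def split_fullname(fullname: str) -> tuple[str, str]:
    """Derive two target values (first name, last name) from one source value."""
    # Assumption: the first space separates the first name from the rest.
    first, _, last = fullname.strip().partition(" ")
    return first, last


def concatenate(first: str, last: str) -> str:
    """Derive one target value (full name) from two source values."""
    return f"{first} {last}".strip()


first, last = split_fullname("William Shakespeare")
print(first, "|", last)
print(concatenate(first, last))
```

A simple 1:1 match such as ISBN to ISBN needs no function at all; the complex matches are the ones for which a function like these has to be discovered.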
Our approach
• Two-phase approach:
• Estimated Mutual Information
• Suggest 1:1 and 1:n mappings
• Serve as a filtering step (filter out data properties that have no mutual information)
• Reduce search space for the next phase (speed up the process)
• Genetic Programming
• Automatic process for creating mapping functions
• Reduces the cost of traversing the search space
Estimated Mutual Information (EMI)
• EMI Matrix
• p=(p1,…,pu), q=(q1,…,qv) two lists of sets (i.e. sets of data type properties)
• Co-occurrence measures: Cosine Similarity, Jaccard Index, ...
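The formula shown on the original slide did not survive in this transcript. As a rough sketch, the code below uses the common definition of estimated mutual information over a co-occurrence matrix m, namely EMI_ij = (m_ij / M) · log(M · m_ij / (Σ_j m_ij · Σ_i m_ij)) with M the sum of all entries. That this is exactly the variant used in the talk is an assumption, and the function name `emi_matrix` is hypothetical.

```python
import math


def emi_matrix(m):
    """Compute an EMI matrix from a co-occurrence matrix given as a list of rows."""
    total = sum(sum(row) for row in m)          # M: sum of all entries
    row_sums = [sum(row) for row in m]          # one sum per property p_i
    col_sums = [sum(col) for col in zip(*m)]    # one sum per property q_j
    emi = []
    for i, row in enumerate(m):
        emi_row = []
        for j, m_ij in enumerate(row):
            if m_ij == 0:
                # No co-occurrence: the term is defined as zero.
                emi_row.append(0.0)
            else:
                ratio = total * m_ij / (row_sums[i] * col_sums[j])
                emi_row.append((m_ij / total) * math.log(ratio))
        emi.append(emi_row)
    return emi


# Toy co-occurrence counts for two properties on each side.
co_occurrence = [[8, 1],
                 [1, 6]]
for row in emi_matrix(co_occurrence):
    print([round(x, 3) for x in row])
```

In this toy matrix the diagonal entries come out positive and the off-diagonal ones negative, so a filter that keeps only pairs with positive EMI would retain (p1, q1) and (p2, q2) as candidates for the next phase.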
Estimated Mutual Information (EMI)
• Computing the mutual information:
• Cosine Similarity
• Simple matches: William Shakespeare → William Shakespeare
• Jaccard Similarity
• Simple and Complex matches: William → William Shakespeare
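The contrast between the two measures can be illustrated as follows. The sketch assumes that cosine similarity is computed over whole property values and that the Jaccard index is computed over the tokens of those values; this representation is an assumption made for the example, not something stated on the slide.

```python
import math
from collections import Counter


def cosine_whole_values(values_p, values_q):
    """Cosine similarity between two properties, treating each value as one unit."""
    a, b = Counter(values_p), Counter(values_q)
    dot = sum(a[v] * b[v] for v in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def jaccard_tokens(values_p, values_q):
    """Jaccard index between the token sets of two properties."""
    tokens_p = {t for v in values_p for t in v.split()}
    tokens_q = {t for v in values_q for t in v.split()}
    union = tokens_p | tokens_q
    return len(tokens_p & tokens_q) / len(union) if union else 0.0


full = ["William Shakespeare"]
first = ["William"]
print(cosine_whole_values(full, full))   # identical values
print(cosine_whole_values(first, full))  # no identical whole value
print(jaccard_tokens(first, full))       # partial token overlap
```

Under these assumptions the whole-value cosine only rewards identical values, which suits simple matches, while the token-level Jaccard index still gives credit when one value is a part of the other.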
Genetic Programming (GP)
• Genetic programming refers to an automated method to create and evolve
programs to solve a problem.
• A solution is represented by a tree, whose nodes are labeled with functions
(concatenate, split, sum) or with values (strings, numbers, etc).
• New individuals are generated by applying genetic operations to the current
population of individuals.
• Individuals that should breed are selected through an evolutionary process.
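The tree representation described above can be sketched as a small data structure with a recursive evaluator. The `Node` class, the `FUNCTIONS` table and the `evaluate` helper are hypothetical names introduced for this example; the authors used the JGAP package, whose API is not shown here.

```python
from dataclasses import dataclass, field

# Hypothetical function set; the slide names concatenate, split and sum.
FUNCTIONS = {
    "concatenate": lambda *parts: " ".join(str(p) for p in parts),
    "sum": lambda *nums: sum(nums),
}


@dataclass
class Node:
    label: str                                  # function name, property name, or constant
    children: list = field(default_factory=list)


def evaluate(node, record):
    """Evaluate a tree against one record (a dict mapping property names to values)."""
    if node.children:
        args = [evaluate(child, record) for child in node.children]
        return FUNCTIONS[node.label](*args)
    # Leaf: a property name is looked up in the record; anything else is a constant.
    return record.get(node.label, node.label)


tree = Node("concatenate", [Node("Firstname"), Node("Last name")])
print(evaluate(tree, {"Firstname": "William", "Last name": "Shakespeare"}))
```

Each individual in the population is one such tree, and evaluating it on the source records produces the values that are compared with the target property.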
• Genetic operations:
• Crossover
• The act of swapping gene values between two potential solutions,
simulating the "mating" of the two solutions.
• Mutation
• The act of randomly altering the value of a gene in a potential solution.
• Reproduction
• The act of making a copy of a potential solution
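A minimal sketch of the three operations on expression trees follows. Trees are written as nested lists of the form `["+", left, right]`, and the terminal names are borrowed from the example slides later in the deck; the list encoding and all helper names are assumptions made for illustration.

```python
import copy
import random

# Terminal set, borrowed from the property names in the example slides.
TERMINALS = ["Address", "Number", "Complement", "Neighborhood"]


def subtree_paths(tree, path=()):
    """Yield the index path of every subtree below the root."""
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield path + (i,)
            yield from subtree_paths(child, path + (i,))


def get_at(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def set_at(tree, path, value):
    for i in path[:-1]:
        tree = tree[i]
    tree[path[-1]] = value


def reproduction(parent):
    """Copy an individual unchanged into the next generation."""
    return copy.deepcopy(parent)


def mutation(parent, rng):
    """Replace one randomly chosen leaf with a random terminal."""
    child = copy.deepcopy(parent)
    leaves = [p for p in subtree_paths(child) if not isinstance(get_at(child, p), list)]
    set_at(child, rng.choice(leaves), rng.choice(TERMINALS))
    return child


def crossover(parent_a, parent_b, rng):
    """Swap one randomly chosen subtree between copies of the two parents."""
    a, b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    path_a = rng.choice(list(subtree_paths(a)))
    path_b = rng.choice(list(subtree_paths(b)))
    sub_a, sub_b = get_at(a, path_a), get_at(b, path_b)
    set_at(a, path_a, sub_b)
    set_at(b, path_b, sub_a)
    return a, b


rng = random.Random(0)
p1 = ["+", "Address", "Number"]
p2 = ["+", "Number", ["+", "Complement", "Neighborhood"]]
print(reproduction(p1))
print(mutation(p1, rng))
print(crossover(p1, p2, rng))
```

All three operations work on copies, so the parents stay available for further selection within the same generation.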
• Fitness function
• Levenshtein similarity function for string values
• KL-divergence measure for numeric values
• Different measures are applied because datatype property values can share many common values (such as 0), which can lead to a wrong match. Thus, for numeric values we use KL-divergence to measure how likely it is that two sets of values follow the same distribution.
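Both fitness ingredients can be sketched with the standard library alone. The normalisation of the Levenshtein distance into a similarity, the treatment of numeric values as a discrete distribution over distinct values, and the smoothing constant `eps` are assumptions chosen for this example rather than details given in the talk.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b):
    """Map the edit distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def kl_divergence(xs, ys, eps=1e-9):
    """KL(P || Q) between the smoothed empirical distributions of two value lists."""
    px, py = Counter(xs), Counter(ys)
    support = set(px) | set(py)
    total_x = len(xs) + eps * len(support)
    total_y = len(ys) + eps * len(support)
    kl = 0.0
    for v in support:
        p = (px[v] + eps) / total_x
        q = (py[v] + eps) / total_y
        kl += p * math.log(p / q)
    return kl


print(levenshtein_similarity("William Shakespeare", "William Shakespear"))
print(kl_divergence([0, 0, 1], [0, 0, 1]))
print(kl_divergence([0, 0, 1], [5, 5, 6]))
```

A divergence near zero indicates that the two numeric columns have nearly the same value distribution, while a large value indicates that they do not; comparing distributions rather than individual values is what avoids being misled by frequent shared values such as 0.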
An Example of Implementation
Phase 1 – Co-occurrence matrix
1. Difference between Cosine/Jaccard similarity metrics.
Phase 1 – EMI matrix
2. Possible matchings:
[Figure: GP expression trees over the properties Address, Number, Complement and Neighborhood, joined by + nodes, illustrating crossover, mutation and reproduction.]
[Figure: evolved individuals labeled correct, repetitive, and incorrect mutation.]
Evaluation
• Datasets
• “Personal Information” dataset lists information about people
• “Real Estate” dataset lists information about houses for sale
• “Inventory” dataset describes product inventories
With the exception of the “Personal Information” dataset, withheld for privacy reasons, the datasets are available at:
http://pages.cs.wisc.edu/~anhai/wisc-si-archive/domains/
Results
27/08/13, Ricardo Kawase
Conclusions
• Complex schema matching approach
• Simple + Complex matching:
• Estimated Mutual Information + Genetic Programming
• Reduced search space for matching properties
• Adaptive to variations of 1:1 and n:1 matching instances
• High accuracy on generated matches and coverage
Questions?
Thank you!

Editor's Notes

  1. In case you are not familiar with GP, see http://www.geneticprogramming.com/Tutorial/ or http://jgap.sourceforge.net (we used this package).