05/26/14 Heiko Paulheim 1
Identifying Wrong Links between Datasets
by Multi-dimensional Outlier Detection
Heiko Paulheim
Motivation
• Dataset interlinks can be wrong for many reasons
– Oversimplified heuristic generation (e.g., label equality)
– owl:sameAs abuse (a Starbucks coffee shop ↔ Starbucks Inc.)
– Concept drift of link targets
• e.g., dbpedia:Prong used to denote a band until DBpedia 3.1
• now it's a disambiguation page
[Figure: two snapshots, labeled 04/08/08 and 12/04/07]
<http://dbtune.org/bbc/peel/artist/1495> owl:sameAs <http://dbpedia.org/resource/Prong> .
Overall Idea
• Links between datasets follow certain patterns
– e.g., linking a mo:MusicArtist to a dbo:Artist,
and a mo:MusicalWork to a dbo:Album or a dbo:Song
• Wrong links violate those patterns
• Hence, outlier detection should find wrong links
– Definition: “finding patterns in data that do not conform to the expected
normal behavior” (Chandola et al., 2009)
• Differences from related approaches
– does not require the same schema to be used in both datasets
– nor schema mappings
– no external/human knowledge required
Projection of Links into Vector Space
• Represent each link as a point in an n-dimensional vector space
– e.g., using their direct types
• Outliers are found in sparse areas
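The idea that outliers sit in sparse areas can be sketched with a plain k-nearest-neighbour distance score. This is only an illustrative stand-in, not necessarily one of the algorithms evaluated in this talk; the function name `knn_outlier_scores`, the value of `k`, and the toy vectors are assumptions made for the example.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbours.

    A large score means the point lies in a sparse region.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical toy data: each tuple is one link as a binary feature vector.
links = [
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (0, 1, 0, 1),  # deviates from the dominant pattern
]
scores = knn_outlier_scores(links, k=2)
```

In this toy set the last link shares no feature with the other three, so it receives the highest score.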
Projection of Links into Vector Space
• Types
– each type of LHS and RHS resource becomes a binary (0/1) feature
– types on both sides are treated separately
• e.g., LHS_foaf:Person and RHS_foaf:Person
are distinct features
• Properties
– each ingoing/outgoing property of LHS and RHS resource
becomes a binary (0/1) feature
– properties on both sides are treated separately
– ingoing and outgoing properties are treated separately
• e.g., LHS_foaf:based_near, RHS_foaf:based_near,
foaf:based_near_LHS and foaf:based_near_RHS
are all distinct features
• Joint feature set of types and properties
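The projection described above could be sketched as follows. The dictionary layout (`types`, `out`, `in`) and the function names are hypothetical. The slides do not state which naming form denotes ingoing and which outgoing properties, so the sketch assumes that a side prefix (`LHS_p`) marks an outgoing property and a side suffix (`p_LHS`) marks an ingoing one.

```python
def describe_link(lhs, rhs):
    """Return the set of feature names for one link.

    lhs and rhs are dicts with optional keys 'types', 'out', 'in'
    (a hypothetical layout chosen for this sketch).
    """
    names = set()
    for side, res in (("LHS", lhs), ("RHS", rhs)):
        # Types are kept separate per side.
        names.update(f"{side}_{t}" for t in res.get("types", ()))
        # Assumption: prefix = outgoing property, suffix = ingoing property.
        names.update(f"{side}_{p}" for p in res.get("out", ()))
        names.update(f"{p}_{side}" for p in res.get("in", ()))
    return names

def to_binary_vectors(feature_sets):
    """Turn per-link feature sets into 0/1 vectors over a joint vocabulary."""
    vocab = sorted(set().union(*feature_sets))
    vectors = [[1 if name in fs else 0 for name in vocab] for fs in feature_sets]
    return vocab, vectors

# Hypothetical example links.
link_a = describe_link(
    {"types": ["mo:MusicArtist"], "out": ["foaf:made"]},
    {"types": ["dbo:Artist"], "in": ["dbo:artist"]},
)
link_b = describe_link({"types": ["mo:MusicArtist"]}, {"types": ["dbo:Album"]})
vocab, vectors = to_binary_vectors([link_a, link_b])
```

Each row of `vectors` is one link; restricting `describe_link` to the `types` key alone would give the types-only projection.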
Experiments
• Datasets: link sets between
– BBC Peel Sessions and DBpedia (2,087 links)
– DBTropes and DBpedia (4,229 links)
• Gold standard
– 100 randomly sampled links from each set, manually evaluated
– Peel: 90 out of 100 are correct
– Tropes: 76 out of 100 are correct
• We run outlier detection on the whole link set
– and validate the output only on the gold standard
Experiments
• Outlier Detection Approaches
– assign a score (or label) to each data point
– the higher the score, the likelier it is an outlier
• Evaluation
– Order the links by descending outlier score
– Ideally, all outliers are ranked above all non-outliers
– Plot a ROC curve to measure the ranking quality
• summarized as the area under the curve (AUC)
– F-Measure
• with best possible threshold
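Both evaluation measures can be computed directly from the scores and the gold-standard labels. The following is a minimal sketch in plain Python; the function names and the toy scores are assumptions for illustration, and the AUC is computed through its pairwise interpretation (the probability that a wrong link is scored higher than a correct one, ties counting half).

```python
def roc_auc(scores, is_wrong):
    """AUC via pairwise comparison of wrong links against correct links."""
    pos = [s for s, w in zip(scores, is_wrong) if w]
    neg = [s for s, w in zip(scores, is_wrong) if not w]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_f1(scores, is_wrong):
    """Best F1 over all thresholds; links scoring >= threshold are flagged."""
    best = 0.0
    for t in set(scores):
        tp = sum(1 for s, w in zip(scores, is_wrong) if s >= t and w)
        fp = sum(1 for s, w in zip(scores, is_wrong) if s >= t and not w)
        fn = sum(1 for s, w in zip(scores, is_wrong) if s < t and w)
        if tp:
            precision, recall = tp / (tp + fp), tp / (tp + fn)
            best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Hypothetical outlier scores and gold-standard labels (True = wrong link).
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
is_wrong = [True, False, True, False, False]
auc = roc_auc(scores, is_wrong)
f1 = best_f1(scores, is_wrong)
```

Because the threshold is chosen with knowledge of the labels, the resulting F1 is an upper bound on what a fixed threshold would achieve.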
Results
• Type features work better than property features
• LoOP delivers reliably good results
– though not the best
• Best performance on Peel dataset
– CBLOF (F1 = 0.537), 1-class SVM (AUC = 0.857)
• Best performance on DBTropes dataset
– LOF (F1 = 0.5, AUC = 0.619)
Results
• ROC curves for Peel dataset
[Figure: ROC plot, both axes from 0 to 1. Curves: GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, 1-class SVM]
Note: GAS k=10,25,50 identical, LoOP k=25,50 identical
Results
• ROC curves for DBTropes dataset
[Figure: ROC plot, both axes from 0 to 1. Curves: GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, 1-class SVM]
Note: GAS k=25,50 mostly identical; LoOP k=25,50 identical;
CBLOF and LDCOF mostly identical
Runtimes
• Most outlier detection algorithms are reasonably fast
– both linksets processed in less than 10 seconds on a normal laptop
• Exceptions:
– clustering (for CBLOF/LDCOF) takes up to 30 seconds
– 1-class SVM takes up to 15 minutes
• ...but creating the feature vector representation
takes much more time
– some hours against public SPARQL endpoint(s)
– reasonably fast with downloaded dumps
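To illustrate why feature creation against a public endpoint is slow, here is a sketch of the kind of per-resource SPARQL queries such a step might issue. The slides do not show the actual queries; the helper names and the query shapes below are assumptions, and sending the queries to an endpoint is left out.

```python
def types_query(resource_uri):
    """Build a SPARQL query for the direct types of one resource."""
    return f"SELECT DISTINCT ?type WHERE {{ <{resource_uri}> a ?type . }}"

def properties_query(resource_uri, outgoing=True):
    """Build a SPARQL query for the outgoing or ingoing properties of one resource."""
    if outgoing:
        pattern = f"<{resource_uri}> ?p ?o ."
    else:
        pattern = f"?s ?p <{resource_uri}> ."
    return "SELECT DISTINCT ?p WHERE { " + pattern + " }"

# One link needs several queries per side, so a link set with
# thousands of links leads to many thousands of endpoint requests.
uri = "http://dbpedia.org/resource/Prong"
queries = [
    types_query(uri),
    properties_query(uri, outgoing=True),
    properties_query(uri, outgoing=False),
]
```

With a downloaded dump, the same lookups become local index reads, which fits the observation that this route is reasonably fast.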
Discussion of Results
• Results on Peel dataset better than on DBTropes dataset
• Projection based on types better than on properties
– most likely due to lower dimensionality of vector space
– Peel: #types = 34, #properties = 60
– DBTropes: #types = 81, #properties = 142
• Variation of outlier detection algorithms across datasets
– also observed in other experiments
– general rules of thumb are hard to come up with
Possible Improvements & Future Work
• Other projection methods
– e.g., using numeric counts of relations
• Other outlier detection algorithms
– e.g., Replicating Neural Networks and their generalizations
• Preprocessing
– e.g., Feature Subset Selection
– caveat: the valuable features are often sparse
Possible Improvements & Future Work
• So far, we have looked at owl:sameAs links
• The approach is not limited to that
– should work for other link predicates as well
– e.g., a dataset of persons and a dataset of places
– linked by foaf:based_near
• It is not even limited to linksets
– also for debugging statements inside a knowledge base
– e.g., dbpedia-owl:deathPlace

Weitere ähnliche Inhalte

Ähnlich wie Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashingJ Singh
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data ScienceDonald Miner
 
SoundSoftware.ac.uk: Sustainable software for audio and music research (DMRN 5+)
SoundSoftware.ac.uk: Sustainable software for audio and music research (DMRN 5+)SoundSoftware.ac.uk: Sustainable software for audio and music research (DMRN 5+)
SoundSoftware.ac.uk: Sustainable software for audio and music research (DMRN 5+)SoundSoftware ac.uk
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsRussell Jurney
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsRussell Jurney
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Combining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionCombining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionHeiko Paulheim
 
Querying Cultural Heritage
Querying Cultural HeritageQuerying Cultural Heritage
Querying Cultural HeritageBarry Norton
 
Shared Print in the Orbis Cascade Alliance and Colorado Alliance (Levine-Clark)
Shared Print in the Orbis Cascade Alliance and Colorado Alliance (Levine-Clark)Shared Print in the Orbis Cascade Alliance and Colorado Alliance (Levine-Clark)
Shared Print in the Orbis Cascade Alliance and Colorado Alliance (Levine-Clark)Charleston Conference
 
The Road to Lambda - Mike Duigou
The Road to Lambda - Mike DuigouThe Road to Lambda - Mike Duigou
The Road to Lambda - Mike Duigoujaxconf
 
How is research conducted in my field
How is research conducted in my fieldHow is research conducted in my field
How is research conducted in my fieldCristian Klein
 
RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)Daniele Dell'Aglio
 
Morphosyntactic analysis for stylometry
Morphosyntactic analysis for stylometryMorphosyntactic analysis for stylometry
Morphosyntactic analysis for stylometrySilvie Cinková
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsDataWorks Summit
 
"The BG collaboration, Past, Present, Future. The new available resources". P...
"The BG collaboration, Past, Present, Future. The new available resources". P..."The BG collaboration, Past, Present, Future. The new available resources". P...
"The BG collaboration, Past, Present, Future. The new available resources". P...lccausp
 
Data analysis patterns, tools and data types in genomics
Data analysis patterns, tools and data types in genomicsData analysis patterns, tools and data types in genomics
Data analysis patterns, tools and data types in genomicsAltuna Akalin
 
Benchmarking Domain-specific Expert Search using Workshop Program Committees
Benchmarking Domain-specific Expert Search using Workshop Program CommitteesBenchmarking Domain-specific Expert Search using Workshop Program Committees
Benchmarking Domain-specific Expert Search using Workshop Program CommitteesToine Bogers
 
What the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataWhat the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataHeiko Paulheim
 
OpenDiscovery
OpenDiscoveryOpenDiscovery
OpenDiscoverygwprice
 

Ähnlich wie Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection (20)

OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashing
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
 
SoundSoftware.ac.uk: Sustainable software for audio and music research (DMRN 5+)
SoundSoftware.ac.uk: Sustainable software for audio and music research (DMRN 5+)SoundSoftware.ac.uk: Sustainable software for audio and music research (DMRN 5+)
SoundSoftware.ac.uk: Sustainable software for audio and music research (DMRN 5+)
 
Big data elasticsearch practical
Big data  elasticsearch practicalBig data  elasticsearch practical
Big data elasticsearch practical
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics Applications
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Combining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionCombining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly Detection
 
Querying Cultural Heritage
Querying Cultural HeritageQuerying Cultural Heritage
Querying Cultural Heritage
 
Shared Print in the Orbis Cascade Alliance and Colorado Alliance (Levine-Clark)
Shared Print in the Orbis Cascade Alliance and Colorado Alliance (Levine-Clark)Shared Print in the Orbis Cascade Alliance and Colorado Alliance (Levine-Clark)
Shared Print in the Orbis Cascade Alliance and Colorado Alliance (Levine-Clark)
 
The Road to Lambda - Mike Duigou
The Road to Lambda - Mike DuigouThe Road to Lambda - Mike Duigou
The Road to Lambda - Mike Duigou
 
How is research conducted in my field
How is research conducted in my fieldHow is research conducted in my field
How is research conducted in my field
 
RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)
 
Morphosyntactic analysis for stylometry
Morphosyntactic analysis for stylometryMorphosyntactic analysis for stylometry
Morphosyntactic analysis for stylometry
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
 
"The BG collaboration, Past, Present, Future. The new available resources". P...
"The BG collaboration, Past, Present, Future. The new available resources". P..."The BG collaboration, Past, Present, Future. The new available resources". P...
"The BG collaboration, Past, Present, Future. The new available resources". P...
 
Data analysis patterns, tools and data types in genomics
Data analysis patterns, tools and data types in genomicsData analysis patterns, tools and data types in genomics
Data analysis patterns, tools and data types in genomics
 
Benchmarking Domain-specific Expert Search using Workshop Program Committees
Benchmarking Domain-specific Expert Search using Workshop Program CommitteesBenchmarking Domain-specific Expert Search using Workshop Program Committees
Benchmarking Domain-specific Expert Search using Workshop Program Committees
 
What the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataWhat the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open Data
 
OpenDiscovery
OpenDiscoveryOpenDiscovery
OpenDiscovery
 

Mehr von Heiko Paulheim

Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...Heiko Paulheim
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vecHeiko Paulheim
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vecHeiko Paulheim
 
Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI SystemsKnowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI SystemsHeiko Paulheim
 
From Wikis to Knowledge Graphs
From Wikis to Knowledge GraphsFrom Wikis to Knowledge Graphs
From Wikis to Knowledge GraphsHeiko Paulheim
 
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...Heiko Paulheim
 
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block
Beyond DBpedia and YAGO – The New Kids  on the Knowledge Graph BlockBeyond DBpedia and YAGO – The New Kids  on the Knowledge Graph Block
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph BlockHeiko Paulheim
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Heiko Paulheim
 
Machine Learning & Embeddings for Large Knowledge Graphs
Machine Learning & Embeddings  for Large Knowledge GraphsMachine Learning & Embeddings  for Large Knowledge Graphs
Machine Learning & Embeddings for Large Knowledge GraphsHeiko Paulheim
 
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge GraphFrom Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge GraphHeiko Paulheim
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Heiko Paulheim
 
Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Heiko Paulheim
 
Machine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge GraphsMachine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge GraphsHeiko Paulheim
 
Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterHeiko Paulheim
 
Towards Knowledge Graph Profiling
Towards Knowledge Graph ProfilingTowards Knowledge Graph Profiling
Towards Knowledge Graph ProfilingHeiko Paulheim
 
Knowledge Graphs on the Web
Knowledge Graphs on the WebKnowledge Graphs on the Web
Knowledge Graphs on the WebHeiko Paulheim
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesHeiko Paulheim
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge DiscoveryHeiko Paulheim
 
Mining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerMining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerHeiko Paulheim
 

Mehr von Heiko Paulheim (20)

Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vec
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vec
 
Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI SystemsKnowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
 
From Wikis to Knowledge Graphs
From Wikis to Knowledge GraphsFrom Wikis to Knowledge Graphs
From Wikis to Knowledge Graphs
 
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
 
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block
Beyond DBpedia and YAGO – The New Kids  on the Knowledge Graph BlockBeyond DBpedia and YAGO – The New Kids  on the Knowledge Graph Block
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
 
Machine Learning & Embeddings for Large Knowledge Graphs
Machine Learning & Embeddings  for Large Knowledge GraphsMachine Learning & Embeddings  for Large Knowledge Graphs
Machine Learning & Embeddings for Large Knowledge Graphs
 
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge GraphFrom Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
 
Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!
 
How much is a Triple?
How much is a Triple?How much is a Triple?
How much is a Triple?
 
Machine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge GraphsMachine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge Graphs
 
Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on Twitter
 
Towards Knowledge Graph Profiling
Towards Knowledge Graph ProfilingTowards Knowledge Graph Profiling
Towards Knowledge Graph Profiling
 
Knowledge Graphs on the Web
Knowledge Graphs on the WebKnowledge Graphs on the Web
Knowledge Graphs on the Web
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia Entities
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge Discovery
 
Mining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerMining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMiner
 

Kürzlich hochgeladen

怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制vexqp
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制vexqp
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制vexqp
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........EfruzAsilolu
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxVivek487417
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格q6pzkpark
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1ranjankumarbehera14
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...gajnagarg
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 

Kürzlich hochgeladen (20)

怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 

Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

  • 1. 05/26/14 Heiko Paulheim 1 Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection Heiko Paulheim
  • 2. 05/26/14 Heiko Paulheim 2 Motivation • Dataset interlinks can be wrong for many reasons – Oversimplified heuristic generation (e.g., label equality) – owl:sameAs abuse (a Starbucks coffee shop ↔ Starbucks Inc.) – Concept drift of link targets • e.g., dbpedia:Prong used to denote a band until DBpedia 3.1 • now it's a disambiguation page 04/08/0812/04/07 <http://dbtune.org/bbc/peel/artist/1495> owl:sameAs <http://dbpedia.org/resource/Prong> .
  • 3. 05/26/14 Heiko Paulheim 3 Overall Idea • Links between datasets follow certain patterns – e.g., linking a mo:MusicArtist to a dbo:Artist, and a mo:MusicalWork to a dbo:Album or a dbo:Song • Wrong links violate those patterns • Hence, outlier detection should find wrong links – Definition: “finding patterns in data that do not conform to the expected normal behavior” (Chandola et al., 2009) • Difference over related approaches – does not require the same schema used in both datasets – nor schema mappings – no external/human knowledge required
  • 4. Projection of Links into Vector Space
      • Represent each link as a point in an n-dimensional vector space
          – e.g., using their direct types
      • Outliers are found in sparse areas
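The idea that outliers lie in sparse areas of the vector space can be illustrated with a distance-based score. The following is a minimal sketch, not one of the algorithms evaluated in the deck: it assumes links are already encoded as binary vectors and scores each one by its mean Euclidean distance to its k nearest neighbours. The function name `knn_outlier_scores` and the toy vectors are made up for illustration.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbours.

    A high score means the point sits in a sparse region of the space.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Toy binary vectors: four links follow the same pattern, the last one does not.
links = [
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (0, 1, 0, 1),
]
scores = knn_outlier_scores(links, k=2)
most_suspicious = max(range(len(links)), key=scores.__getitem__)
```

Here the fifth link has no close neighbours and therefore receives the highest score, which is the behaviour the slide describes for wrong links.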
  • 5. Projection of Links into Vector Space
      • Types
          – each type of LHS and RHS resource becomes a binary (0/1) feature
          – types on both sides are treated separately
              • i.e., LHS_foaf:person and RHS_foaf:person are distinct features
      • Properties
          – each ingoing/outgoing property of LHS and RHS resource becomes a binary (0/1) feature
          – properties on both sides are treated separately
          – ingoing and outgoing properties are treated separately
              • i.e., LHS_foaf:based_near, RHS_foaf:based_near, foaf:based_near_LHS and foaf:based_near_RHS are all distinct features
      • Joint feature set of types and properties
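The feature construction on slide 5 could be sketched roughly as follows. This is an illustration under assumptions, not the author's implementation: each resource is given as a plain dict with the hypothetical keys `types`, `out` and `in`, and the slide does not say which naming direction denotes ingoing properties, so the sketch assumes a side prefix for outgoing and a side suffix for ingoing ones.

```python
def link_features(lhs, rhs):
    """Return the names of all binary features that are 1 for the link (lhs, rhs).

    Each resource is a dict with the hypothetical keys "types", "out" and "in".
    """
    features = set()
    for side, res in (("LHS", lhs), ("RHS", rhs)):
        features |= {f"{side}_{t}" for t in res["types"]}  # type features
        features |= {f"{side}_{p}" for p in res["out"]}    # outgoing (assumed: prefix)
        features |= {f"{p}_{side}" for p in res["in"]}     # ingoing (assumed: suffix)
    return features

def to_matrix(links):
    """Turn a list of (lhs, rhs) pairs into sorted column names and 0/1 rows."""
    per_link = [link_features(lhs, rhs) for lhs, rhs in links]
    columns = sorted(set().union(*per_link))
    rows = [[1 if c in feats else 0 for c in columns] for feats in per_link]
    return columns, rows

# Toy link between a music artist and a DBpedia artist.
artist = {"types": ["mo:MusicArtist"], "out": ["foaf:based_near"], "in": []}
dbp = {"types": ["dbo:Artist"], "out": [], "in": ["foaf:based_near"]}
columns, rows = to_matrix([(artist, dbp)])
```

Because every feature name carries its side, the same type or property on the left and right resource ends up in two different columns, as the slide requires.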
  • 6. Experiments
      • Datasets: link sets between
          – BBC Peel Sessions and DBpedia (2,087 links)
          – DBTropes and DBpedia (4,229 links)
      • Gold standard
          – 100 randomly sampled links from each set, manually evaluated
          – Peel: 90 out of 100 are correct
          – Tropes: 76 out of 100 are correct
      • We run outlier detection on the whole link set
          – and validate the output only on the gold standard
  • 7.
  • 8. Experiments
      • Outlier Detection Approaches
          – assign a score (or label) to each data point
          – the higher the score, the likelier it is an outlier
      • Evaluation
          – Ordering descending by outlier score
          – Ideally, all outliers are above all non-outliers
          – Plot a ROC curve to measure the quality
              • i.e., AUC
          – F-Measure
              • with best possible threshold
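The two evaluation measures can be computed directly from outlier scores and gold-standard labels. The sketch below is one possible way to do it and not the evaluation code behind the reported numbers: AUC is computed as the fraction of (wrong, correct) link pairs in which the wrong link has the higher score, and the best F-measure is found by trying every observed score as a threshold. The scores and labels are toy values.

```python
def roc_auc(scores, is_wrong):
    """AUC as the chance that a wrong link outranks a correct one (ties count 0.5)."""
    pos = [s for s, w in zip(scores, is_wrong) if w]
    neg = [s for s, w in zip(scores, is_wrong) if not w]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_f1(scores, is_wrong):
    """Best F1 over all thresholds; links with score >= threshold are flagged as wrong."""
    best = 0.0
    for t in set(scores):
        tp = sum(1 for s, w in zip(scores, is_wrong) if s >= t and w)
        fp = sum(1 for s, w in zip(scores, is_wrong) if s >= t and not w)
        fn = sum(1 for s, w in zip(scores, is_wrong) if s < t and w)
        if tp:
            best = max(best, 2 * tp / (2 * tp + fp + fn))
    return best

# Toy gold standard: two wrong links among five scored links.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
is_wrong = [True, False, True, False, False]
auc = roc_auc(scores, is_wrong)
f1 = best_f1(scores, is_wrong)
```

An AUC of 1.0 corresponds to the ideal case on the slide, where all outliers are ranked above all non-outliers.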
  • 9.
  • 10. Results
      • Type features work better than property features
      • LoOP delivers reliably good results
          – though not the best
      • Best performance on Peel dataset
          – CBLOF (F1 = 0.537), 1-class SVM (AUC = 0.857)
      • Best performance on DBTropes dataset
          – LOF (F1 = 0.5, AUC = 0.619)
  • 11. Results
      • ROC curves for Peel dataset
      • [Figure: ROC curves, both axes from 0 to 1. Legend: GAS k=10, GAS k=25, GAS k=50, LOF, LoOP k=10, LoOP k=25, LoOP k=50, CBLOF, LDCOF, 1-class SVM]
      • Note: GAS k=10,25,50 identical, LoOP k=25,50 identical
  • 12. Results
      • ROC curves for DBTropes dataset
      • [Figure: ROC curves, both axes from 0 to 1. Legend: GAS k=10, GAS k=25, GAS k=50, LOF, LoOP k=10, LoOP k=25, LoOP k=50, CBLOF, LDCOF, 1-class SVM]
      • Note: GAS k=25,50 mostly identical; LoOP k=25,50 identical; CBLOF and LDCOF mostly identical
  • 13. Runtimes
      • Most outlier detection algorithms are reasonably fast
          – both linksets processed in less than 10 seconds on a normal laptop
      • Exceptions:
          – clustering (for CBLOF/LDCOF) takes up to 30 seconds
          – 1-class SVM takes up to 15 minutes
      • ...but creating the feature vector representation takes much more time
          – some hours against public SPARQL endpoint(s)
          – reasonably fast with downloaded dumps
  • 14. Discussion of Results
      • Results on Peel dataset better than on DBTropes dataset
      • Projection based on types better than on properties
          – most likely due to lower dimensionality of vector space
              • Peel: #types = 34, #properties = 60
              • DBTropes: #types = 81, #properties = 142
      • Variation of outlier detection algorithms across datasets
          – also observed in other experiments
          – general rules of thumb are hard to come up with
  • 15. Possible Improvements & Future Work
      • Other projection methods
          – e.g., using numeric counts of relations
      • Other outlier detection algorithms
          – e.g., Replicating Neural Networks and their generalizations
      • Preprocessing
          – e.g., Feature Subset Selection
          – caveat: the valuable features are often sparse
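The count-based projection mentioned on slide 15 is only named as future work, so the following is a guess at what it might look like rather than a described method. Instead of a 0/1 flag per property, each feature holds how often the property occurs on the resource. The helper name `count_features`, the prefix/suffix naming and the example properties are assumptions for illustration.

```python
from collections import Counter

def count_features(side, out_props, in_props):
    """Count how often each outgoing/ingoing property occurs on one side of a link.

    out_props and in_props are lists with one entry per statement, so a
    property used in three statements appears three times.
    """
    counts = Counter()
    for p in out_props:
        counts[f"{side}_{p}"] += 1   # outgoing (assumed: side prefix)
    for p in in_props:
        counts[f"{p}_{side}"] += 1   # ingoing (assumed: side suffix)
    return counts

# Hypothetical left-hand resource with two foaf:made statements.
lhs_counts = count_features("LHS", ["foaf:made", "foaf:made", "foaf:based_near"], [])
```

Numeric features like these would typically need scaling before distance-based outlier detection, since a few very frequent properties could otherwise dominate the distances.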
  • 16. Possible Improvements & Future Work
      • So far, we have looked at owl:sameAs links
      • The approach is not limited to that
          – should work for other link predicates as well
          – e.g., a dataset of persons and a dataset of places
          – linked by foaf:based_near
      • It is not even limited to linksets
          – also for debugging statements inside a knowledge base
          – e.g., dbpedia-owl:deathPlace
  • 17. Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
      Heiko Paulheim