Type Inference on Noisy RDF Data

10/31/13
Heiko Paulheim, Christian Bizer
The Problem
• One promise of the Semantic Web:
  – You can issue structured queries
  – e.g., "List all presidents that graduated from Harvard Law School"
  – SELECT ?x WHERE {
      ?x a dbpedia-owl:President .
      ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }

The Problem
• SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
• ...if we run this against DBpedia, we get only one result
  – i.e., Elwell Stephen Otis
• But...

The Problem
• So what is going wrong?
• SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
• In DBpedia, Barack Obama is not of type President!
• How can we add missing types?

Is It a Big Problem?
• DBpedia has at least 2.7 million missing type statements
  – w.r.t. the DBpedia ontology
  – found using co-occurrence analysis of matching classes in YAGO and DBpedia
  – a very optimistic lower bound
• Highly incomplete classes:
  – Species: >870,000 missing statements
  – Person: >510,000 missing statements
  – Event: >150,000 missing statements

A Naive Approach
• Idea: exploit properties with domain and range
• Pseudo RDFS reasoning:
  – CONSTRUCT { ?x a ?t }
    WHERE { { ?x ?r ?y . ?r rdfs:domain ?t }
            UNION
            { ?y ?r ?x . ?r rdfs:range ?t } }
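The CONSTRUCT query above can be mimicked on a handful of in-memory triples. The following is only a minimal sketch: the triples, the `domain_of` / `range_of` tables and the function name `infer_types` are made up for illustration and are not taken from the DBpedia ontology.

```python
# Toy data; every name here is a hypothetical stand-in, not real DBpedia content.
triples = [
    ("Kurt_H._Debus", "award", "Germany"),   # a single noisy statement
    ("Berlin", "country", "Germany"),
]
domain_of = {"award": "Person", "country": "Place"}    # assumed rdfs:domain
range_of = {"award": "Award", "country": "Country"}    # assumed rdfs:range

def infer_types(triples, domain_of, range_of):
    """Type subjects by rdfs:domain and objects by rdfs:range."""
    inferred = {}
    for s, p, o in triples:
        if p in domain_of:
            inferred.setdefault(s, set()).add(domain_of[p])
        if p in range_of:
            inferred.setdefault(o, set()).add(range_of[p])
    return inferred

inferred = infer_types(triples, domain_of, range_of)
print(sorted(inferred["Germany"]))  # ['Award', 'Country']
```

Every statement contributes a type unconditionally, so one wrong triple is enough to attach a wrong type to a resource.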

A Naive Approach
• Experiment with Barack Obama; inferred types:
  – Person, PersonFunction, Actor, Organisation
• Experiment with Germany; inferred types:
  – Place, Award, PopulatedPlace, City, SportsTeam, Mountain, Agent,
    Organisation, Country, Stadium, RecordLabel, MilitaryUnit, Company,
    EducationalInstitution, PersonFunction, EthnicGroup, Architect, WineRegion,
    Language, MilitaryConflict, Settlement, RouteOfTransportation

A Naive Approach
• What is going on here?
  – DBpedia data is noisy
  – One wrong statement is enough for a wrong conclusion
  – e.g.: dbpedia:Kurt_H._Debus dbpedia-owl:award dbpedia:Germany
• Germany example: 69,000 statements
  – 20 wrong types can come from 20 wrong statements
  – i.e., an error rate of 0.03% is enough for a totally screwed result
  – ...but that would be excellent data quality for a LOD source!
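The 0.03% figure follows directly from the two numbers on this slide; as a quick check (variable names are illustrative only):

```python
total_statements = 69_000   # statements involving Germany, from the slide
wrong_statements = 20       # one wrong statement per wrong type

error_rate = wrong_statements / total_statements
print(f"{error_rate:.2%}")  # 0.03%
```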

SDType Approach
• Idea: outgoing/incoming properties are indicators for a resource's type
  – e.g.: starring → Movie
  – e.g.: author⁻¹ → Writer (⁻¹ marks an incoming property)
• Basic compiled statistics
  – P(C|p) := probability of class C in presence of property p
  – e.g.: P(dbpedia:Film|starring) = 0.79
  – e.g.: P(dbpedia:Writer|author⁻¹) = 0.44
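One way such statistics could be compiled is by counting, for every property, how often the resources carrying it have each class. The sketch below assumes that each (resource, property) observation counts once and that untyped resources are skipped. The data and the name `conditional_type_probs` are hypothetical, and the slides do not spell out the exact counting scheme.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: known types, and (resource, property) observations.
# An incoming property is written with a "^-1" suffix.
types = {
    "Film_A": {"Film"},
    "Film_B": {"Film"},
    "Show_C": {"TelevisionShow"},
}
observations = [
    ("Film_A", "starring"),
    ("Film_B", "starring"),
    ("Show_C", "starring"),
    ("Untyped_D", "starring"),   # no known type, so it carries no evidence
]

def conditional_type_probs(observations, types):
    """Estimate P(C|p) from the typed resources that carry property p."""
    counts = defaultdict(Counter)   # property -> class -> count
    totals = Counter()              # property -> number of typed observations
    for resource, prop in observations:
        classes = types.get(resource)
        if not classes:
            continue
        totals[prop] += 1
        for c in classes:
            counts[prop][c] += 1
    return {p: {c: n / totals[p] for c, n in cs.items()}
            for p, cs in counts.items()}

probs = conditional_type_probs(observations, types)
print(probs["starring"])
```

In this toy set, two of the three typed resources with a starring property are films, so the estimate for P(Film|starring) is 2/3.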

SDType Approach
• Based on precompiled statistics
  – Find types of instances
  – Using voting
• score(C) = avg(all properties p) P(C|p)
• Refinement:
  – Weight for properties: discriminative power
  – weight(p) = sum(all classes c) (P(c) - P(c|p))²
  – i.e., how strongly this property's class distribution deviates from the overall class distribution
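The voting step with the weight refinement might look as follows. This is a sketch under one assumption not stated on the slide: the weights turn the plain average into a weighted average, normalised by the sum of the weights. All probabilities below are invented, and `property_weight` and `score_types` are hypothetical names.

```python
def property_weight(prior, cond):
    """weight(p) = sum over classes c of (P(c) - P(c|p))^2."""
    classes = set(prior) | set(cond)
    return sum((prior.get(c, 0.0) - cond.get(c, 0.0)) ** 2 for c in classes)

def score_types(props, prior, cond_probs):
    """Weighted average of P(C|p) over the properties seen for one resource."""
    weights = {p: property_weight(prior, cond_probs[p])
               for p in props if p in cond_probs}
    total = sum(weights.values())
    if total == 0:
        return {}
    scores = {}
    for p, w in weights.items():
        for c, prob in cond_probs[p].items():
            scores[c] = scores.get(c, 0.0) + w * prob / total
    return scores

# Invented numbers: "seeAlso" mirrors the overall distribution, "starring" does not.
prior = {"Film": 0.1, "Person": 0.5, "Place": 0.4}
cond_probs = {
    "starring": {"Film": 0.8, "Person": 0.2},
    "seeAlso": {"Film": 0.1, "Person": 0.5, "Place": 0.4},
}
scores = score_types(["starring", "seeAlso"], prior, cond_probs)
print(scores)
```

Because seeAlso has the same class distribution as the whole dataset, its weight is zero and the result is driven entirely by starring.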

Evaluation
• Two-fold evaluation
  – On DBpedia and OpenCyc as "silver standard" (automatic, 10,000 random instances)
  – On untyped DBpedia resources (manual, 100 instances)
• Using only incoming properties
  – Using outgoing properties is trivial!
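An automatic comparison against a silver standard could be sketched as below: types scoring at or above a threshold are accepted and compared with the types already present. Micro-averaging over all (resource, type) pairs is an assumption made here, and the scores, the gold types and the name `precision_recall` are hypothetical.

```python
def precision_recall(predicted, gold, threshold):
    """Micro-averaged precision and recall of types scoring >= threshold."""
    tp = fp = fn = 0
    for resource, gold_types in gold.items():
        scored = predicted.get(resource, {})
        accepted = {c for c, s in scored.items() if s >= threshold}
        tp += len(accepted & gold_types)
        fp += len(accepted - gold_types)
        fn += len(gold_types - accepted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented scores and silver-standard types for two resources.
predicted = {
    "r1": {"Film": 0.9, "Work": 0.7, "Person": 0.2},
    "r2": {"Place": 0.6, "City": 0.3},
}
gold = {"r1": {"Film", "Work"}, "r2": {"Place", "City"}}

for threshold in (0.5, 0.2):
    print(threshold, precision_recall(predicted, gold, threshold))
```

Lowering the threshold from 0.5 to 0.2 raises recall from 0.75 to 1.0 and lowers precision from 1.0 to 0.8 on this toy data, which is the kind of trade-off a precision/recall curve traces.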

Evaluation Results
• On DBpedia
[Figure: precision (y-axis, 0 to 1) over recall (x-axis, 0 to 1), with one curve each for min. 1 link, min. 10 links, and min. 25 links]

Evaluation Results
• On OpenCyc
[Figure: precision (y-axis, 0 to 1) over recall (x-axis, 0 to 1), with one curve each for min. 1 link, min. 10 links, and min. 25 links]

Evaluation Results
• Evaluation on untyped resources
  – Random sample of 100 untyped resources
  – Manual checking of precision
[Figure: precision (left axis, 0 to 1) and # found types (right axis, 0 to 12) over the lower bound for the threshold (x-axis, 0.9 down to 0)]

Evaluation Results
• DBpedia:
  – works reasonably well (F-measure 0.89)
• OpenCyc:
  – harder because of deeper class hierarchy (F-measure 0.60)
• General:
  – having more links increases precision (in contrast to RDFS reasoning)
  – more general types (e.g., Band) are easier than specific ones (e.g., PunkRockBand)

Deployment
• Heuristic types have been included in DBpedia 3.9
  – for previously untyped instances
  – 3.4 million type statements at precision ~0.95
• Also includes many resources without a Wikipedia page
  – i.e., generated from a red link
• Runtime
  – Complexity O(PT), where P is the number of property assertions and T the number of type assertions
  – ~24h for processing DBpedia

Conclusion and Outlook
• SDType approach works at high quality
  – outperforms most state-of-the-art approaches on DBpedia
  – deployed for DBpedia 3.9
• Same approach can be used for validating links
  – within dataset: deployed for DBpedia 3.9 (removed ~13,000 wrong statements)
  – across datasets: to be done


 

Type Inference on Noisy RDF Data

  • 1. Type Inference on Noisy RDF Data
      Heiko Paulheim, Christian Bizer, 10/31/13
  • 2. The Problem
      • One promise of the Semantic Web:
          – You can issue structured queries
          – e.g., "List all presidents that graduated from Harvard Law School"
          – SELECT ?x WHERE { ?x a dbpedia-owl:President . ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
  • 3. The Problem
      • SELECT ?x WHERE { ?x a dbpedia-owl:President . ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
      • ...if we run this against DBpedia, we get one result
          – i.e., Elwell Stephen Otis
      • But...
  • 5. The Problem
      • So what is going wrong?
      • SELECT ?x WHERE { ?x a dbpedia-owl:President . ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
      • In DBpedia, Barack Obama is not of type President!
      • How can we add missing types?
  • 6. Is It a Big Problem?
      • DBpedia has at least 2.7 million missing type statements
          – w.r.t. the DBpedia ontology
          – found using co-occurrence analysis of matching classes in YAGO and DBpedia
          – a very optimistic lower bound
      • Highly incomplete classes:
          – Species: >870,000 missing statements
          – Person: >510,000 missing statements
          – Event: >150,000 missing statements
  • 7. A Naive Approach
      • Idea: exploit properties with domain and range
      • Pseudo RDFS reasoning:
          – CONSTRUCT {?x a ?t} WHERE { {?x ?r ?y . ?r rdfs:domain ?t} UNION {?y ?r ?x . ?r rdfs:range ?t} }
  • 8. A Naive Approach
      • Experiment with Barack Obama:
          – Person, PersonFunction, Actor, Organisation
      • Experiment with Germany:
          – Place, Award, PopulatedPlace, City, SportsTeam, Mountain, Agent, Organisation, Country, Stadium, RecordLabel, MilitaryUnit, Company, EducationalInstitution, PersonFunction, EthnicGroup, Architect, WineRegion, Language, MilitaryConflict, Settlement, RouteOfTransportation
  • 9. A Naive Approach
      • What is going on here?
          – DBpedia data is noisy
          – One wrong statement is enough for a wrong conclusion
          – e.g.: dbpedia:Kurt_H._Debus dbpedia-owl:award dbpedia:Germany
      • Germany example: 69,000 statements
          – 20 wrong types can come from 20 wrong statements
          – i.e., an error rate of 0.03% is enough for a totally screwed result
          – ...but that would be an excellent data quality for a LOD source!
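
The effect of a single noisy statement on domain/range reasoning can be reproduced on a few toy triples. The following is a minimal sketch, not the authors' code: the `DOMAIN` and `RANGE` mappings and the function name `naive_types` are illustrative assumptions, and only the `Kurt_H._Debus` statement is taken from the slide.

```python
from collections import defaultdict

# Toy schema (assumed for illustration, not copied from the DBpedia ontology).
DOMAIN = {"almaMater": "Person", "capital": "Country"}
RANGE = {"almaMater": "EducationalInstitution", "capital": "City", "award": "Award"}


def naive_types(triples):
    """Mimic the CONSTRUCT query: type subjects by domain, objects by range."""
    inferred = defaultdict(set)
    for s, p, o in triples:
        if p in DOMAIN:
            inferred[s].add(DOMAIN[p])
        if p in RANGE:
            inferred[o].add(RANGE[p])
    return inferred


triples = [
    ("Germany", "capital", "Berlin"),
    ("Barack_Obama", "almaMater", "Harvard_Law_School"),
    # The noisy statement quoted on the slide: Germany used as an award.
    ("Kurt_H._Debus", "award", "Germany"),
]

types = naive_types(triples)
print(sorted(types["Germany"]))  # ['Award', 'Country']

# 20 wrong statements out of 69,000 are enough for 20 wrong types.
print(f"{20 / 69000:.2%}")  # 0.03%
```

One wrong `award` statement is enough to make Germany an `Award`, no matter how many correct statements surround it.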
  • 10. SDType Approach
      • Idea: outgoing/incoming properties are indicators for a resource's type
          – e.g.: starring → Movie
          – e.g.: author⁻¹ → Writer (⁻¹ marks an incoming property)
      • Basic compiled statistics
          – P(C|p) := probability of class C in presence of property p
          – e.g.: P(dbpedia:Film|starring) = 0.79
          – e.g.: P(dbpedia:Writer|author⁻¹) = 0.44
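
A rough sketch of how such statistics could be compiled from already-typed resources. The function name `conditional_type_probs`, the `^-1` suffix for incoming properties, and the toy data are assumptions made for illustration. Counting per statement and skipping untyped resources are also assumptions, since the slide does not say how the counts are normalised.

```python
from collections import Counter, defaultdict


def conditional_type_probs(triples, types):
    """Estimate P(C|p) from typed resources.

    Outgoing properties are keyed by name, incoming ones by name + '^-1'.
    """
    prop_count = Counter()
    joint = defaultdict(Counter)
    for s, p, o in triples:
        for resource, prop in ((s, p), (o, p + "^-1")):
            if resource not in types:
                continue  # assumption: only typed resources contribute
            prop_count[prop] += 1
            for cls in types[resource]:
                joint[prop][cls] += 1
    return {
        prop: {cls: n / prop_count[prop] for cls, n in classes.items()}
        for prop, classes in joint.items()
    }


# Made-up toy data; Actor_Y is deliberately left untyped.
triples = [
    ("Film_A", "starring", "Actor_X"),
    ("Film_B", "starring", "Actor_Y"),
    ("Show_C", "starring", "Actor_X"),
    ("Book_D", "author", "Writer_Z"),
]
types = {
    "Film_A": {"Film"},
    "Film_B": {"Film"},
    "Show_C": {"TelevisionShow"},
    "Actor_X": {"Actor"},
    "Writer_Z": {"Writer"},
}

probs = conditional_type_probs(triples, types)
print(probs["starring"])   # Film in 2 of 3 cases, TelevisionShow in 1 of 3
print(probs["author^-1"])  # Writer in every observed case
```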
  • 11. SDType Approach
      • Based on precompiled statistics
          – Find types of instances
          – Using voting
      • score(C) = average of P(C|p) over all properties p of the resource
      • Refinement:
          – Weight for properties: discriminative power
          – weight(p) = sum over all classes C of (P(C) − P(C|p))²
          – i.e., how strongly this property's class distribution deviates from the overall class distribution
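
The voting step with property weights might look as follows. This is a sketch under assumptions: the slide gives the weight formula and the plain average, but not how the two are combined, so the weighted average used here is a guess, and the names `property_weight` and `sdtype_scores` as well as all numbers are hypothetical.

```python
from collections import defaultdict


def property_weight(prior, cond):
    """weight(p) = sum over all classes C of (P(C) - P(C|p))^2."""
    classes = set(prior) | set(cond)
    return sum((prior.get(c, 0.0) - cond.get(c, 0.0)) ** 2 for c in classes)


def sdtype_scores(properties, prior, cond_probs):
    """Score each class by a weighted average of P(C|p) (assumed combination)."""
    weights = {
        p: property_weight(prior, cond_probs[p])
        for p in properties
        if p in cond_probs
    }
    total = sum(weights.values())
    if total == 0:
        return {}
    scores = defaultdict(float)
    for p, w in weights.items():
        for cls, prob in cond_probs[p].items():
            scores[cls] += w * prob / total
    return dict(scores)


# Made-up numbers: 'label' follows the overall class distribution exactly,
# so it carries no information; 'starring' is maximally discriminative.
prior = {"Film": 0.5, "Writer": 0.5}
cond_probs = {
    "starring": {"Film": 1.0},
    "label": {"Film": 0.5, "Writer": 0.5},
}

scores = sdtype_scores(["starring", "label"], prior, cond_probs)
print(scores)  # {'Film': 1.0, 'Writer': 0.0}
```

With an unweighted average, `label` would pull the `Film` score down to 0.75 and lift `Writer` to 0.25. The weight removes that influence because `label` does not deviate from the prior.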
  • 12. Evaluation
      • Two-fold evaluation
          – On DBpedia and OpenCyc as "silver standard" (automatic, 10,000 random instances)
          – On untyped DBpedia resources (manual, 100 instances)
      • Using only incoming properties
          – Using outgoing properties is trivial!
  • 13. Evaluation Results
      • On DBpedia
      [Figure: precision over recall, with curves for min. 1 link, min. 10 links, and min. 25 links]
  • 14. Evaluation Results
      • On OpenCyc
      [Figure: precision over recall, with curves for min. 1 link, min. 10 links, and min. 25 links]
  • 15. Evaluation Results
      • Evaluation on untyped resources
          – Random sample of 100 untyped resources
          – Manual checking of precision
      [Figure: precision and # found types plotted against the lower bound for the threshold]
  • 16. Evaluation Results
      • DBpedia:
          – works reasonably well (F-measure 0.89)
      • OpenCyc:
          – harder because of deeper class hierarchy (F-measure 0.60)
      • General:
          – having more links increases precision (in contrast to RDFS reasoning)
          – more general types (e.g., Band) are easier than specific ones (e.g., PunkRockBand)
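
For reference, precision, recall, and F-measure at a given score threshold can be computed along these lines. This is a sketch only: it assumes the reported F-measure is the balanced F1, and the helper names, scores, and gold types are made up.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted (resource, class) pairs against a gold standard."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def types_above(scores, threshold):
    """Keep every (resource, class) pair whose score reaches the threshold."""
    return {
        (resource, cls)
        for resource, classes in scores.items()
        for cls, score in classes.items()
        if score >= threshold
    }


# Made-up scores and gold types for two resources.
scores = {
    "r1": {"Band": 0.9, "PunkRockBand": 0.5},
    "r2": {"Person": 0.8, "Band": 0.3},
}
gold = {("r1", "Band"), ("r1", "PunkRockBand"), ("r2", "Person")}

for threshold in (0.2, 0.4, 0.6):
    p, r, f = precision_recall_f1(types_above(scores, threshold), gold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Raising the threshold trades recall for precision. In this toy data, the specific type `PunkRockBand` is the first correct type to be lost.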
  • 17. Deployment
      • Heuristic types have been included in DBpedia 3.9
          – for previously untyped instances
          – 3.4 million type statements at precision ~0.95
      • Also includes many resources without a Wikipedia page
          – i.e., generated from a red link
      • Runtime
          – Complexity O(PT), where P is the number of property assertions and T is the number of type assertions
          – ~24h for processing DBpedia
  • 18. Conclusion and Outlook
      • SDType approach works at high quality
          – outperforms most state-of-the-art approaches on DBpedia
          – deployed for DBpedia 3.9
      • Same approach can be used for validating links
          – within a dataset: deployed for DBpedia 3.9 (removed ~13,000 wrong statements)
          – across datasets: to be done
  • 19. Type Inference on Noisy RDF Data