Type Inference on Noisy RDF Data

Heiko Paulheim, Christian Bizer
10/31/13

The Problem
• One promise of the Semantic Web:
– You can issue structured queries
– e.g., "List all presidents that graduated from Harvard Law School"
– SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }

The Problem
• SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
• ...if we run this against DBpedia, we get one result
– i.e., Elwell Stephen Otis
• But...

The Problem
• So what is going wrong?
• SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
• In DBpedia, Barack Obama is not of type President!
• How can we add missing types?

Is It a Big Problem?
• DBpedia has at least 2.7 million missing type statements
– w.r.t. the DBpedia ontology
– found using co-occurrence analysis of matching classes in YAGO and DBpedia
– a very optimistic lower bound
• Highly incomplete classes:
– Species: >870,000 missing statements
– Person: >510,000 missing statements
– Event: >150,000 missing statements

A Naive Approach
• Idea: exploit properties with domain and range
• Pseudo RDFS Reasoning:
– CONSTRUCT { ?x a ?t }
  WHERE { { ?x ?r ?y . ?r rdfs:domain ?t }
          UNION
          { ?y ?r ?x . ?r rdfs:range ?t } }
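
The domain/range idea can be sketched on a handful of toy triples. This is only an illustration in Python, not the authors' implementation: the triples and the DOMAIN and RANGE tables are made-up assumptions, and the function name naive_types is hypothetical. Every subject receives the domain class of its property and every object receives the range class.

```python
# Toy triples (subject, property, object). The data is illustrative only.
TRIPLES = [
    ("Barack_Obama", "almaMater", "Harvard_Law_School"),
    ("Kurt_H._Debus", "birthPlace", "Germany"),
    ("Kurt_H._Debus", "award", "Germany"),  # a single noisy statement
]

# Hypothetical schema: property -> domain class / range class (assumed values).
DOMAIN = {"almaMater": "Person", "birthPlace": "Person", "award": "Person"}
RANGE = {
    "almaMater": "EducationalInstitution",
    "birthPlace": "Place",
    "award": "Award",
}


def naive_types(triples, domain, rng):
    """Give each subject the property's domain and each object its range."""
    inferred = {}
    for s, p, o in triples:
        if p in domain:
            inferred.setdefault(s, set()).add(domain[p])
        if p in rng:
            inferred.setdefault(o, set()).add(rng[p])
    return inferred


types = naive_types(TRIPLES, DOMAIN, RANGE)
print(sorted(types["Germany"]))  # ['Award', 'Place']
```

In this toy run, the one noisy award statement is enough to make Germany an Award, because every statement counts as full evidence.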

A Naive Approach
• Experiment with Barack Obama:
– Person, PersonFunction, Actor, Organisation
• Experiment with Germany:
– Place, Award, PopulatedPlace, City, SportsTeam, Mountain, Agent,
  Organisation, Country, Stadium, RecordLabel, MilitaryUnit, Company,
  EducationalInstitution, PersonFunction, EthnicGroup, Architect, WineRegion,
  Language, MilitaryConflict, Settlement, RouteOfTransportation

A Naive Approach
• What is going on here?
– DBpedia data is noisy
– One wrong statement is enough for a wrong conclusion
– e.g.: dbpedia:Kurt_H._Debus dbpedia-owl:award dbpedia:Germany
• Germany example: 69,000 statements
– 20 wrong types can come from 20 wrong statements
– i.e., an error rate of 0.03% is enough for a totally screwed result
– ...but that would be excellent data quality for a LOD source!
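
The 0.03% figure follows directly from the two numbers on the slide. A quick check in Python (the variable names are illustrative):

```python
total_statements = 69_000  # statements involving Germany (from the slide)
wrong_statements = 20      # one wrong statement per wrong type

error_rate = wrong_statements / total_statements
print(f"{error_rate:.2%}")  # 0.03%
```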

SDType Approach
• Idea: outgoing/incoming properties are indicators for a resource's type
– e.g.: starring → Movie
– e.g.: author⁻¹ → Writer (⁻¹ marks an incoming property)
• Basic compiled statistics
– P(C|p) := probability of class C in presence of property p
– e.g.: P(dbpedia:Film|starring) = 0.79
– e.g.: P(dbpedia:Writer|author⁻¹) = 0.44
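
A minimal sketch of how such statistics could be compiled, assuming a toy dataset. The data, the "^-1" suffix for incoming properties, and the choice to count each distinct typed resource once per property are all assumptions made for illustration; the slides do not specify the counting scheme.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: known types and (subject, property, object) triples.
TYPES = {
    "Inception": {"Film"},
    "Titanic": {"Film"},
    "Friends": {"TelevisionShow"},
}
TRIPLES = [
    ("Inception", "starring", "Leonardo_DiCaprio"),
    ("Titanic", "starring", "Leonardo_DiCaprio"),
    ("Titanic", "starring", "Kate_Winslet"),
    ("Friends", "starring", "Jennifer_Aniston"),
]


def conditional_type_probs(triples, types):
    """Estimate P(C|p) for outgoing properties and P(C|p^-1) for incoming ones."""
    with_prop = defaultdict(set)  # property -> resources exhibiting it
    for s, p, o in triples:
        with_prop[p].add(s)          # s has outgoing property p
        with_prop[p + "^-1"].add(o)  # o has incoming property p
    probs = {}
    for p, resources in with_prop.items():
        typed = [r for r in resources if r in types]
        if not typed:
            continue  # no typed resource to learn from
        counts = Counter(c for r in typed for c in types[r])
        probs[p] = {c: n / len(typed) for c, n in counts.items()}
    return probs


probs = conditional_type_probs(TRIPLES, TYPES)
print(probs["starring"])
```

In this toy data two of the three typed resources with an outgoing starring property are films, so the estimate for P(Film|starring) is 2/3. The actors are untyped, so no statistics are produced for starring^-1.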

SDType Approach
• Based on precompiled statistics
– Find types of instances
– Using voting
• score(C) = average over all properties p of P(C|p)
• Refinement:
– Weight for properties: discriminative power
– weight(p) = sum over all classes C of (P(C) − P(C|p))²
– i.e., how strongly this property's class distribution deviates from the overall class distribution
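
The voting step and the weighting can be sketched together. The probabilities below are invented for illustration, and combining the two formulas into a weighted average (weights normalized by their sum) is an assumption about how the refinement plugs into the score, not something the slide spells out.

```python
# Hypothetical statistics: prior P(C) and conditionals P(C|p).
PRIOR = {"Film": 0.1, "Writer": 0.1, "Place": 0.8}
COND = {
    "starring": {"Film": 0.8, "Writer": 0.1, "Place": 0.1},
    "author^-1": {"Film": 0.1, "Writer": 0.5, "Place": 0.4},
    "label": {"Film": 0.1, "Writer": 0.1, "Place": 0.8},  # mirrors the prior
}


def weight(p):
    """Discriminative power: squared deviation of P(C|p) from P(C), summed over C."""
    return sum((PRIOR[c] - COND[p].get(c, 0.0)) ** 2 for c in PRIOR)


def scores(properties):
    """Weighted vote over the properties observed for one resource."""
    weights = {p: weight(p) for p in properties}
    total = sum(weights.values())
    if total == 0:
        # No property is informative: fall back to the plain average.
        return {
            c: sum(COND[p].get(c, 0.0) for p in properties) / len(properties)
            for c in PRIOR
        }
    return {
        c: sum(weights[p] * COND[p].get(c, 0.0) for p in properties) / total
        for c in PRIOR
    }


result = scores(["starring", "label"])
print(max(result, key=result.get))  # Film
```

Here label has weight 0 because its class distribution equals the prior, so it does not dilute the vote of starring. With a plain average, the score for Film would drop from 0.8 to 0.45.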

Evaluation
• Two-fold evaluation
– On DBpedia and OpenCyc as "Silver Standard" (automatic, 10,000 random instances)
– On untyped DBpedia resources (manual, 100 instances)
• Using only incoming properties
– Using outgoing properties is trivial!
Evaluation Results
• On DBpedia
[Figure: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1); curves for min. 1 link, min. 10 links, and min. 25 links]

Evaluation Results
• On OpenCyc
[Figure: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1); curves for min. 1 link, min. 10 links, and min. 25 links]

Evaluation Results
• Evaluation on untyped resources
– Random sample of 100 untyped resources
– Manual checking of precision
[Figure: precision (0 to 1) and # found types (0 to 12) plotted against the lower bound for the threshold (0.9 down to 0)]

Evaluation Results
• DBpedia:
– works reasonably well (F-measure 0.89)
• OpenCyc:
– harder because of deeper class hierarchy (F-measure 0.60)
• General:
– having more links increases precision (in contrast to RDFS reasoning)
– more general types (e.g., Band) are easier than specific ones (e.g., PunkRockBand)
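
For reference, the F-measure is the harmonic mean of precision and recall. The sketch below shows how a score threshold trades one against the other for a single resource. The scores and gold types are invented for illustration, and the helper names predict and precision_recall_f1 are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F-measure of a predicted type set against a gold set."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def predict(type_scores, threshold):
    """Keep every type whose score reaches the threshold."""
    return {c for c, s in type_scores.items() if s >= threshold}


# Hypothetical scores for one resource and its silver-standard types.
SCORES = {"Band": 0.9, "Organisation": 0.7, "PunkRockBand": 0.3, "Place": 0.05}
GOLD = {"Band", "Organisation", "PunkRockBand"}

results = {t: precision_recall_f1(predict(SCORES, t), GOLD) for t in (0.8, 0.2, 0.0)}
for t, (p, r, f) in results.items():
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In this toy case the general type Band survives even a strict threshold, while the specific type PunkRockBand is only recovered once the threshold is lowered, which at 0.0 also lets the wrong type Place through.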

Deployment
• Heuristic types have been included in DBpedia 3.9
– for previously untyped instances
– 3.4 million type statements at precision ~0.95
• Also includes many resources without a Wikipedia page
– i.e., generated from a red link
• Runtime
– Complexity O(P·T), where P is the number of property assertions and T is the number of type assertions
– ~24h for processing DBpedia

Conclusion and Outlook
• SDType approach works at high quality
– outperforms most state-of-the-art approaches on DBpedia
– deployed for DBpedia 3.9
• Same approach can be used for validating links
– within a dataset: deployed for DBpedia 3.9 (removed ~13,000 wrong statements)
– across datasets: to be done

