Type Inference on Noisy RDF Data

Heiko Paulheim, Christian Bizer
10/31/13

Talk at ISWC 2013


  1. Type Inference on Noisy RDF Data
     Heiko Paulheim, Christian Bizer
     10/31/13
  2. The Problem
     • One promise of the Semantic Web:
       – You can issue structured queries
       – e.g., "List all presidents that graduated from Harvard Law School"
       – SELECT ?x WHERE {
           ?x a dbpedia-owl:President .
           ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
  3. The Problem
     • SELECT ?x WHERE {
         ?x a dbpedia-owl:President .
         ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
     • ...if we run this against DBpedia, we get one result
       – i.e., Elwell Stephen Otis
     • But...
  4. The Problem
  5. The Problem
     • So what is going wrong?
     • SELECT ?x WHERE {
         ?x a dbpedia-owl:President .
         ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
     • In DBpedia, Barack Obama is not of type President!
     • How can we add missing types?
  6. Is It a Big Problem?
     • DBpedia has at least 2.7 million missing type statements
       – w.r.t. the DBpedia ontology
       – found using co-occurrence analysis of matching classes in YAGO and DBpedia
       – a very optimistic lower bound
     • Highly incomplete classes:
       – Species: >870,000 missing statements
       – Person: >510,000 missing statements
       – Event: >150,000 missing statements
  7. A Naive Approach
     • Idea: exploit properties with domain and range
     • Pseudo RDFS Reasoning:
       – CONSTRUCT {?x a ?t}
         WHERE {
           {?x ?r ?y . ?r rdfs:domain ?t}
           UNION
           {?y ?r ?x . ?r rdfs:range ?t} }
  8. A Naive Approach
     • Experiment with Barack Obama:
       – Person, PersonFunction, Actor, Organization
     • Experiment with Germany:
       – Place, Award, Populated Place, City, SportsTeam, Mountain, Agent, Organisation, Country, Stadium, RecordLabel, MilitaryUnit, Company, EducationalInstitution, PersonFunction, EthnicGroup, Architect, WineRegion, Language, MilitaryConflict, Settlement, RouteOfTransportation
  9. A Naive Approach
     • What is going on here?
       – DBpedia data is noisy
       – One wrong statement is enough for a wrong conclusion
       – e.g.: dbpedia:Kurt_H._Debus dbpedia-owl:award dbpedia:Germany
     • Germany example: 69,000 statements
       – 20 wrong types can come from 20 wrong statements
       – i.e., an error rate of 0.03% is enough for a totally screwed result
       – ...but that would be an excellent data quality for a LOD source!
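The failure mode described on the naive-approach slides can be reproduced on a few toy triples. The following is a minimal sketch, not the authors' code: the schema, the resources Person_A and Person_B, and the helper naive_types are hypothetical, and only the Kurt_H._Debus statement is taken from the slide.

```python
from collections import defaultdict

# Hypothetical toy schema: property -> (rdfs:domain class, rdfs:range class).
SCHEMA = {
    "birthPlace": ("Person", "Place"),
    "award": ("Person", "Award"),
}

# Toy triples. The last one mirrors the noisy statement quoted on the slide;
# Person_A and Person_B are made-up resources.
TRIPLES = [
    ("Person_A", "birthPlace", "Germany"),
    ("Person_B", "birthPlace", "Germany"),
    ("Kurt_H._Debus", "award", "Germany"),
]


def naive_types(triples, schema):
    """Give every subject the property's domain and every object its range."""
    types = defaultdict(set)
    for subject, prop, obj in triples:
        if prop in schema:
            domain, range_ = schema[prop]
            types[subject].add(domain)
            types[obj].add(range_)
    return types


inferred = naive_types(TRIPLES, SCHEMA)
# A single wrong 'award' statement is enough to type Germany as an Award.
print(sorted(inferred["Germany"]))  # ['Award', 'Place']

# The slide's arithmetic: 20 wrong statements out of 69,000.
error_rate = 20 / 69_000
print(f"{error_rate:.2%}")  # 0.03%
```

Because every matching statement contributes a type unconditionally, adding more correct statements never removes the wrong type; this is the behaviour the statistical approach on the following slides is meant to avoid.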
  10. SDType Approach
     • Idea: outgoing/incoming properties are indicators for a resource's type
       – e.g.: starring → Movie
       – e.g.: author⁻¹ → Writer (the superscript ⁻¹ marks an incoming property)
     • Basic compiled statistics
       – P(C|p) := probability of class C in presence of property p
       – e.g.: P(dbpedia:Film|starring) = 0.79
       – e.g.: P(dbpedia:Writer|author⁻¹) = 0.44
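One way such statistics could be compiled is by counting, for each property, how often the resources that use it carry each type. This is a sketch under stated assumptions: the toy resources, the "author^-1" spelling for an incoming author link, and the function conditional_probabilities are all hypothetical, and the estimate divides by all resources using the property, which the slides do not specify.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: resource -> set of already known types.
TYPES = {
    "Film_A": {"Film"},
    "Film_B": {"Film"},
    "Show_C": {"TelevisionShow"},
    "Writer_D": {"Writer"},
}

# (resource, property) pairs. "starring" is an outgoing property;
# "author^-1" stands for an incoming author link.
USES = [
    ("Film_A", "starring"),
    ("Film_B", "starring"),
    ("Show_C", "starring"),
    ("Writer_D", "author^-1"),
]


def conditional_probabilities(types, uses):
    """Estimate P(C|p) as the share of resources using p that have type C."""
    resources_with = defaultdict(set)
    for resource, prop in uses:
        resources_with[prop].add(resource)

    probs = {}
    for prop, resources in resources_with.items():
        counts = Counter(c for r in resources for c in types.get(r, ()))
        probs[prop] = {c: n / len(resources) for c, n in counts.items()}
    return probs


P = conditional_probabilities(TYPES, USES)
print(P["starring"])  # Film gets 2/3, TelevisionShow gets 1/3
```

In this toy data, two of the three resources with a starring property are films, so the estimate for Film given starring is 2/3, analogous in spirit to the 0.79 quoted on the slide.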
  11. SDType Approach
     • Based on precompiled statistics
       – Find types of instances
       – Using voting
     • score(C) = avg(all properties p) P(C|p)
     • Refinement:
       – Weight for properties: discriminative power
       – weight(p) = sum(all classes c) (P(c) - P(c|p))²
       – i.e., how strongly this property's class distribution deviates from the overall class distribution
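The voting step and its refinement can be sketched as follows. The numbers in PRIOR and COND are invented for illustration, and combining the weights as a normalized weighted average is an assumption; the slide only gives the unweighted average and the weight formula.

```python
# Hypothetical overall class distribution P(c).
PRIOR = {"Film": 0.2, "Writer": 0.1, "Place": 0.7}

# Hypothetical conditional distributions P(c|p). "location^-1" is chosen to
# match the prior exactly, so it carries no information about the type.
COND = {
    "starring": {"Film": 0.8, "Writer": 0.0, "Place": 0.2},
    "location^-1": {"Film": 0.2, "Writer": 0.1, "Place": 0.7},
}


def weight(prop):
    """weight(p) = sum over classes c of (P(c) - P(c|p))^2."""
    return sum((PRIOR[c] - COND[prop].get(c, 0.0)) ** 2 for c in PRIOR)


def unweighted_score(cls, props):
    """Plain average of P(cls|p) over the resource's properties."""
    return sum(COND[p].get(cls, 0.0) for p in props) / len(props)


def weighted_score(cls, props):
    """Average of P(cls|p) weighted by weight(p) (normalization is assumed)."""
    total = sum(weight(p) for p in props)
    if total == 0:
        return 0.0
    return sum(weight(p) * COND[p].get(cls, 0.0) for p in props) / total


props = ["starring", "location^-1"]
print(unweighted_score("Film", props))  # close to 0.5
print(weighted_score("Film", props))    # close to 0.8
```

In this example the uninformative property pulls the plain average down to about 0.5, while the weighted vote ignores it because its weight is zero, leaving the score near the 0.8 suggested by starring alone.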
  12. Evaluation
     • Two-fold evaluation
       – On DBpedia and OpenCyc as "Silver Standard" (automatic, 10,000 random instances)
       – On untyped DBpedia resources (manual, 100 instances)
     • Using only incoming properties
       – Using outgoing properties is trivial!
  13. Evaluation Results
     • On DBpedia
     [Figure: precision (y-axis, 0 to 1) over recall (x-axis, 0 to 1), with curves for min. 1 link, min. 10 links, and min. 25 links]
  14. Evaluation Results
     • On OpenCyc
     [Figure: precision (y-axis, 0 to 1) over recall (x-axis, 0 to 1), with curves for min. 1 link, min. 10 links, and min. 25 links]
  15. Evaluation Results
     • Evaluation on untyped resources
       – Random sample of 100 untyped resources
       – Manual checking of precision
     [Figure: precision (0 to 1) and # found types (0 to 12) plotted against the lower bound for the threshold (0.9 down to 0)]
  16. Evaluation Results
     • DBpedia:
       – works reasonably well (F-measure 0.89)
     • OpenCyc:
       – harder because of deeper class hierarchy (F-measure 0.60)
     • General:
       – having more links increases precision (in contrast to RDFS reasoning)
       – more general types (e.g., Band) are easier than specific ones (e.g., PunkRockBand)
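The trade-off behind these figures (a lower bound on the score decides which types are accepted, and precision, recall, and F-measure follow from that) can be illustrated with a small sketch. The scores, the gold standard, and the function evaluate are hypothetical and are not the evaluation data or code from the talk.

```python
def evaluate(scores, gold, threshold):
    """Accept every (resource, class) whose score reaches the threshold and
    compare the accepted set against the gold standard."""
    predicted = {
        (resource, cls)
        for resource, class_scores in scores.items()
        for cls, s in class_scores.items()
        if s >= threshold
    }
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical scores for two resources and a hypothetical gold standard.
SCORES = {
    "r1": {"Band": 0.9, "PunkRockBand": 0.5},
    "r2": {"Band": 0.7, "Person": 0.45},
}
GOLD = {("r1", "Band"), ("r1", "PunkRockBand"), ("r2", "Band")}

strict = evaluate(SCORES, GOLD, threshold=0.6)
lenient = evaluate(SCORES, GOLD, threshold=0.4)
print(strict)   # precision 1.0, recall about 0.67
print(lenient)  # precision 0.75, recall 1.0
```

In this toy setup, raising the threshold drops the specific type PunkRockBand while keeping the general type Band, which matches the observation that general types are easier to get right than specific ones.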
  17. Deployment
     • Heuristic types have been included in DBpedia 3.9
       – for previously untyped instances
       – 3.4 million type statements at precision ~0.95
     • Also includes many resources without a Wikipedia page
       – i.e., generated from a red link
     • Runtime
       – Complexity O(PT), where P is the number of property assertions and T is the number of type assertions
       – ~24h for processing DBpedia
  18. Conclusion and Outlook
     • SDType approach works at high quality
       – outperforms most state of the art on DBpedia
       – deployed for DBpedia 3.9
     • Same approach can be used for validating links
       – within dataset: deployed for DBpedia 3.9 (removed ~13,000 wrong statements)
       – across datasets: to be done
  19. Type Inference on Noisy RDF Data
