SlideShare ist ein Scribd-Unternehmen logo
1 von 47
Downloaden Sie, um offline zu lesen
12/10/2019 Heiko Paulheim 1
Beyond DBpedia and YAGO
–
The New Kids
on the Knowledge Graph Block
Heiko Paulheim
12/10/2019 Heiko Paulheim 2
The New Kids on the (Knowledge Graph) Block
Subjective age:
Measured by the fraction
of the audience
that understands a reference
to your young days’
pop culture...
12/10/2019 Heiko Paulheim 3
Knowledge Graphs: Out of the Dark
Google’s
Announcement
DBpedia
YAGO
ResearchCyc WikidataFreebase
NELL
12/10/2019 Heiko Paulheim 4
A Brief History of Knowledge Graphs
• The 1980s: Cyc
– Cyc: Encyclopedic collection of knowledge
– Started by Douglas Lenat in 1984
– Estimation: 350 person years and 250,000 rules
should do the job
of collecting the essence of the world’s knowledge
• The present (as of June 2017)
– ~1,000 person years, $120M total development cost
– 21M axioms and rules
12/10/2019 Heiko Paulheim 5
A Brief History of Knowledge Graphs
• The 2010s
– DBpedia: launched 2007
– YAGO: launched 2008
– Extraction from Wikipedia
using mappings & heuristics
• Present
– Two of the most used knowledge graphs
– ...with Wikidata catching up
12/10/2019 Heiko Paulheim 6
Getting the Most out of Wikipedia
• Study for KG-based
Recommender Systems*
– DBpedia has a coverage of
• 85% for movies
• 63% for music artists
• 31% for books
*) Di Noia, et al.: SPRank: Semantic Path-based Ranking for Top-n
Recommendations using Linked Open Data. In: ACM TIST, 2016
https://grouplens.org/datasets/
12/10/2019 Heiko Paulheim 7
Combining the Best of Three Worlds
• DBpedia: detailed instance / relation extraction
• YAGO: detailed classes
• Cyc: rich axiomatization
• Goal: get the best of all those worlds
public
private
Paulheim: Knowledge graph refinement: A survey of approaches and evaluation
methods. Semantic Web 8:3 (2017), pp. 489-508
12/10/2019 Heiko Paulheim 8
Towards CaLiGraph
• YAGO uses categories for types
– e.g., Category:American Industrial Groups
– but does not analyze them further
• :NineInchNails a :AmericanIndustrialGroup
– “Things, not Strings”?
• :NineInchNails a :MusicalGroup ;
hometown :United_States ;
genre :Industrial .
12/10/2019 Heiko Paulheim 9
Cat2Ax: Axiomatizing Wikipedia Categories
Albums
Albums 
by genre 
Albums 
by artist 
Nine Inch
Nails 
albums 
The Doors 
albums 
Rock 
albums 
Pop 
albums 
Reggae 
albums 
The
Beatles
albums 
... ...
...
 dbo:Album
 dbo:artist.{dbr:Nine_Inch_Nails}
 dbo:genre.{dbr:Rock_Music}
12/10/2019 Heiko Paulheim 10
Cat2Ax: Axiomatizing Wikipedia Categories
Albums
Albums 
by genre 
Albums 
by artist 
Nine Inch
Nails 
albums 
The Doors 
albums 
Rock 
albums 
Pop 
albums 
Reggae 
albums 
The
Beatles
albums 
... ...
...
 dbo:genre.{dbr:Rock_Music} ?
 dbo:artist.{dbr:Rock_(Rapper)} ?
12/10/2019 Heiko Paulheim 11
Cat2Ax: Axiomatizing Wikipedia Categories
– Frequency: how often does the pattern occur in a category?
• i.e.: share of instances that have dbo:genre.{dbr.Rock_Music}?
– Lexical score: likelihood of term as a surface form of object
• i.e.: how often is Rock used to refer to dbr:Rock_Music?
– Sibling score: how likely are sibling categories sharing similar patterns?
• i.e., are there sibling categories with a high score for dbo:genre?
Albums
Albums 
by genre 
Albums 
by artist 
Nine Inch
Nails 
albums 
The Doors 
albums 
Rock 
albums 
Pop 
albums 
Reggae 
albums 
The
Beatles
albums 
... ...
...
12/10/2019 Heiko Paulheim 12
Cat2Ax: Axiomatizing Wikipedia Categories
• Results
12/10/2019 Heiko Paulheim 13
Improving Instance Coverage:
Lists in Wikipedia
• Only existing pages have categories
– Lists may also link to non-existing pages
12/10/2019 Heiko Paulheim 14
CaLiGraph Example
Category: Musical Groups established
in 1987
List of symphonic metal bands
Category: Swedish death metal bands
List of Swedes in Music
12/10/2019 Heiko Paulheim 15
CaLiGraph Statistics
• Class hierarchy from categories and list pages
– 750k classes
• Axioms from categories and list pages
– 200k definitions
• Instances from list pages
– 870k new instances
12/10/2019 Heiko Paulheim 16
CaLiGraph Glitches
12/10/2019 Heiko Paulheim 17
The Future of CaLiGraph
• Going beyond red links • Going beyond explicit lists
12/10/2019 Heiko Paulheim 18
Knowledge Graph Creation Beyond Wikipedia
12/10/2019 Heiko Paulheim 19
A Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Wikipedia Dump
(+ mappings)
• Output:
– DBpedia
DBpedia
Extraction
Framework
12/10/2019 Heiko Paulheim 20
An Even Higher Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Media Wiki Dump
(+ mappings)
• Output:
– A Knowledge Graph
DBpedia
Extraction
Framework
12/10/2019 Heiko Paulheim 21
What if…?
• What if we went from Wikipedia every MediaWiki?
• According to WikiApiary, there’s thousands...
12/10/2019 Heiko Paulheim 22
Why?
• More is better (maybe)
12/10/2019 Heiko Paulheim 23
Why?
• Overcoming Wikipedia’s coverage bias
12/10/2019 Heiko Paulheim 24
A Brief History of DBkWik
• Started as a student project in 2017
• Task: run DBpedia EF on a large Wiki Farm
– ...and see what happens
12/10/2019 Heiko Paulheim 25
DBkWik vs. DBpedia
• Challenges
– Getting dumps: only a fraction of Fandom Wikis has dumps
– Downloadable from Fandom: 12,840 dumps
– Tried: auto-requesting dumps
12/10/2019 Heiko Paulheim 26
Obtaining Dumps
• We had to change our strategy: WikiTeam software
– Produces dumps by crawling Wikis
– Fandom has not blocked us so far :-)
– Current collection: 307,466 Wikis
→ will go into DBkWik 1.2 release
12/10/2019 Heiko Paulheim 27
DBkWik vs. DBpedia
• Mappings do not exist
– no central ontology
– i.e., only raw extraction possible
• Duplicates exist
– origin: pages about the same entity
in different Wikis
– unlike Wikipedia: often not explicitly linked
• Different configurations of MediaWiki
12/10/2019 Heiko Paulheim 28
Absence of Mappings and Ontology
• Every infobox becomes a class:
{infobox actor
→ mywiki:actor a owl:Class
• Every infobox key becomes a property
|role = Harry’s mother
→ mywiki:role a rdf:Property
• The resulting ontology is very shallow
– No class hierarchy
– No distinction of object and data properties
– No domains and ranges
12/10/2019 Heiko Paulheim 29
Duplicates
• Collecting Data from a Multitude of Wikis
12/10/2019 Heiko Paulheim 30
Representational Variety
• No conventions across Wikis (besides using MediaWiki syntax)
{{Person
|name = Trent Reznor
|image = TrentReznor.jpg
|caption - Reznor at the [[83rd Academy Awards]]
|nominations = 1
|wins = 1
|role = Composer
|birthdate = May 17, 1965
|birthloc = Mercer, Pennsylvania, USA}}
{{Infobox musician
| Name = Trent Reznor
| Birth_name = Michael Trent Reznor
| Born = May 17, [[1965]] (age 53)
| Origin = [[Mercer]],
[[Pennsylvania]], [[United States]]
...
}}
{{Infobox cast
|Name=Trent Reznor
|Image=
|ImageCaption=
|character=
|crew=
|Born={{d|May|17|1965}}{{-}}New Castle,
Pennsylvania, United States
...
}
12/10/2019 Heiko Paulheim 31
Data Fusion
12/10/2019 Heiko Paulheim 32
Naive Data Fusion and Linking to DBpedia
• String similarity for schema matching (classes/properties)
• doc2vec similarity on original pages for instance matching
• Results
– Classes and properties work OK
– Instances are trickier
– Internal linking seems easier
F1 score... Internal Linking Linking to DBpedia
Classes .979 .898
Properties .836 .865
Instances .879 .657
maybe...
12/10/2019 Heiko Paulheim 33
Improving Linking and Fusion
• Started a new track at OAEI in 2018
– annual benchmark for matching tools
• In 2019, some tools beat the baseline
– albeit by a small margin only
12/10/2019 Heiko Paulheim 34
Results Data Fusion
• Uneven distribution
– e.g., character appears 5k times
• Currently: no multi-linguality
– e.g., Main Page, Hauptseite
• Probably overloaded fusion (false positives)
– e.g., next, location
12/10/2019 Heiko Paulheim 35
Light-weight Schema Induction
• Class hierarchy and domain/range induction
– Using association rule mining
• e.g., Artist(x) → Person(x)
– 5k class subsumption axioms
– 59k domain restrictions
– 114k range restrictions
• Instance typing
– With a light-weight version of SDType
– Using the learned ranges as approximations
of actual distributions
• Result:
~100k new instance types
Person?
Artist
Person
12/10/2019 Heiko Paulheim 36
Big Picture
Dump
Downloader
DBpedia
Extraction
Framework
Interlinking
Instance
Matcher
Schema
Matcher
MediaWiki Dumps
Extracted
RDF
Internal Linking
Instance
Matcher
Schema
Matcher
Consolidated
Knowledge Graph
DBkWik
Linked
Data
Endpoint
Ontology
Knowledge
Graph
Fusion
Instance
Matcher
Domain/
Range
Type
SDType
Light
SubclassMaterialization
12/10/2019 Heiko Paulheim 37
DBkWik 1.1
• Source: ~15k Wiki dumps from Fandom
– 52.4GB of data (roughly the size of the English Wikipedia)
Raw Final
Instances 14,212,535 11,163,719
Typed instances 1,880,189 1,372,971
Triples 107,833,322 91,526,001
Avg. indegree 0.624 0.703
Avg. outdegree 7.506 8.169
Classes 71,580 12,029
Properties 506,487 128,566
12/10/2019 Heiko Paulheim 38
DBkWik 1.1
• Fused graphs from 15k Wikis
http://dbkwik.webdatacommons.org/
12/10/2019 Heiko Paulheim 39
DBkWik 1.1 vs. other Knowledge Graphs
• Caveat:
– Minus non-recognized duplicates!
12/10/2019 Heiko Paulheim 40
DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and Dbpedia?
• Challenge:
– We only have an incomplete and partly correct mapping M
– But: we know its precision P and recall R
• Trick (see KI paper 2017):
– O is the actual overlap (unknown),
T  M is the true part of M (unknown)
• By definition:
– P = |T| / |M|
→ |T| = P * |M|
– R = |T| / |O|
→ |T| = R * |O|
→ |O| = |M| * P / R
DBkWik DBpedia
12/10/2019 Heiko Paulheim 41
DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and Dbpedia?
– |O| = |M| * P / R
– Overlap: ~500k instances
• In other words:
– 95% of all entities in DBkWik
are not in DBpedia
– 90% of all entities in DBpedia
are not in DBkWik
DBkWik DBpedia
12/10/2019 Heiko Paulheim 42
NewNif
Extractor
Towards DBkWik 1.2
• Current crawl: 307,466 Wikis
• Extraction: more robust for non-infobox templates
– e.g., LyricWiki: 1.7M songs, 177k albums, 84k artists
• Robust abstract extraction
– using SWEBLE parser
– no local MediaWiki instance
• Better matching
• New gold standard
Source
Simple
WikiParser
LinkExtractor
Page
NifExtractor
AST
Destination
Graph
HTML
12/10/2019 Heiko Paulheim 43
Towards DBkWik 1.2
• What to expect?
– data from 307,466 wikis
– 38,985,266 articles
12/10/2019 Heiko Paulheim 44
Towards DBkWik 1.2
• What to expect?
– data from 307,466 wikis
– 38,985,266 articles
12/10/2019 Heiko Paulheim 45
Further Open Challenges
• More detailed profiling of knowledge graphs
– e.g., do we reduce or increase bias?
• Task-based downstream evaluations
– Does it improve, e.g., recommender systems?
• Fusion policies
– e.g., identify outdated information
12/10/2019 Heiko Paulheim 46
Contributors
• Contributors (past&present)
Sven Hertling Alexandra
Hofmann
Samresh
Perchani
Jan Portisch Nicolas
Heist
12/10/2019 Heiko Paulheim 47
Beyond DBpedia and YAGO
–
The New Kids
on the Knowledge Graph Block
Heiko Paulheim

Weitere ähnliche Inhalte

Was ist angesagt?

Machine Learning & Embeddings for Large Knowledge Graphs
Machine Learning & Embeddings  for Large Knowledge GraphsMachine Learning & Embeddings  for Large Knowledge Graphs
Machine Learning & Embeddings for Large Knowledge GraphsHeiko Paulheim
 
Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Heiko Paulheim
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vecHeiko Paulheim
 
Type Inference on Noisy RDF Data
Type Inference on Noisy RDF DataType Inference on Noisy RDF Data
Type Inference on Noisy RDF DataHeiko Paulheim
 
Data-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and OntologyData-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and OntologyHeiko Paulheim
 
What the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataWhat the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataHeiko Paulheim
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Heiko Paulheim
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Heiko Paulheim
 
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on TopServing DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on TopHeiko Paulheim
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge DiscoveryHeiko Paulheim
 
Researcher Pod: Scholarly Communication Using the Decentralized Web
Researcher Pod: Scholarly Communication Using the Decentralized WebResearcher Pod: Scholarly Communication Using the Decentralized Web
Researcher Pod: Scholarly Communication Using the Decentralized WebHerbert Van de Sompel
 
The web is rotting and what to do about it
The web is rotting and what to do about itThe web is rotting and what to do about it
The web is rotting and what to do about itHerbert Van de Sompel
 
The drawbridge to knowledge - Linking scholarly publications and research inf...
The drawbridge to knowledge - Linking scholarly publications and research inf...The drawbridge to knowledge - Linking scholarly publications and research inf...
The drawbridge to knowledge - Linking scholarly publications and research inf...Lukas Koster
 
Semantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years agoSemantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years agoFrank van Harmelen
 
Contextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationContextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationRichard Wallis
 
Linked data in the German National Library at the OCLC IFLA round table 2013
Linked data in the German National Library at the OCLC IFLA round table 2013Linked data in the German National Library at the OCLC IFLA round table 2013
Linked data in the German National Library at the OCLC IFLA round table 2013Lars G. Svensson
 
Linked Data for Abbreviations and Segmentation
Linked Data for Abbreviations and SegmentationLinked Data for Abbreviations and Segmentation
Linked Data for Abbreviations and SegmentationSebastian Hellmann
 
Schema.org where did that come from?
Schema.org where did that come from?Schema.org where did that come from?
Schema.org where did that come from?Richard Wallis
 

Was ist angesagt? (20)

Machine Learning & Embeddings for Large Knowledge Graphs
Machine Learning & Embeddings  for Large Knowledge GraphsMachine Learning & Embeddings  for Large Knowledge Graphs
Machine Learning & Embeddings for Large Knowledge Graphs
 
Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vec
 
Type Inference on Noisy RDF Data
Type Inference on Noisy RDF DataType Inference on Noisy RDF Data
Type Inference on Noisy RDF Data
 
How much is a Triple?
How much is a Triple?How much is a Triple?
How much is a Triple?
 
Data-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and OntologyData-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and Ontology
 
What the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataWhat the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open Data
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
 
Ld4 dh tutorial
Ld4 dh tutorialLd4 dh tutorial
Ld4 dh tutorial
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
 
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on TopServing DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge Discovery
 
Researcher Pod: Scholarly Communication Using the Decentralized Web
Researcher Pod: Scholarly Communication Using the Decentralized WebResearcher Pod: Scholarly Communication Using the Decentralized Web
Researcher Pod: Scholarly Communication Using the Decentralized Web
 
The web is rotting and what to do about it
The web is rotting and what to do about itThe web is rotting and what to do about it
The web is rotting and what to do about it
 
The drawbridge to knowledge - Linking scholarly publications and research inf...
The drawbridge to knowledge - Linking scholarly publications and research inf...The drawbridge to knowledge - Linking scholarly publications and research inf...
The drawbridge to knowledge - Linking scholarly publications and research inf...
 
Semantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years agoSemantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years ago
 
Contextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationContextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data Foundation
 
Linked data in the German National Library at the OCLC IFLA round table 2013
Linked data in the German National Library at the OCLC IFLA round table 2013Linked data in the German National Library at the OCLC IFLA round table 2013
Linked data in the German National Library at the OCLC IFLA round table 2013
 
Linked Data for Abbreviations and Segmentation
Linked Data for Abbreviations and SegmentationLinked Data for Abbreviations and Segmentation
Linked Data for Abbreviations and Segmentation
 
Schema.org where did that come from?
Schema.org where did that come from?Schema.org where did that come from?
Schema.org where did that come from?
 

Ähnlich wie Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block

Capturing the visual in collections metadata / Victoria Webb (Wellcome Collec...
Capturing the visual in collections metadata / Victoria Webb (Wellcome Collec...Capturing the visual in collections metadata / Victoria Webb (Wellcome Collec...
Capturing the visual in collections metadata / Victoria Webb (Wellcome Collec...CILIP MDG
 
Intro to PBCore Zen, PBCore workshop, AMIA 2011 - jack brighton
Intro to PBCore Zen, PBCore workshop, AMIA 2011 - jack brightonIntro to PBCore Zen, PBCore workshop, AMIA 2011 - jack brighton
Intro to PBCore Zen, PBCore workshop, AMIA 2011 - jack brightonJack Brighton
 
Introduction to Omeka
Introduction to OmekaIntroduction to Omeka
Introduction to OmekaShawn Day
 
LIS 653 Posters Fall 2014
LIS 653 Posters Fall 2014 LIS 653 Posters Fall 2014
LIS 653 Posters Fall 2014 PrattSILS
 
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...jessica666
 
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...jessica666
 
Rich Data? Poor Data? Depends on...
Rich Data? Poor Data? Depends on...Rich Data? Poor Data? Depends on...
Rich Data? Poor Data? Depends on...Lars G. Svensson
 
Creating Narrative with Digital Objects
Creating Narrative with Digital ObjectsCreating Narrative with Digital Objects
Creating Narrative with Digital ObjectsShawn Day
 
Buy Custom Essay Papers
Buy Custom Essay PapersBuy Custom Essay Papers
Buy Custom Essay PapersKrystal Fallin
 
Metadata / Linked Data
Metadata / Linked DataMetadata / Linked Data
Metadata / Linked DataRichard Wallis
 
The Continuted Evolution of DAMs in the Nonprofit Sector
The Continuted Evolution of DAMs in the Nonprofit SectorThe Continuted Evolution of DAMs in the Nonprofit Sector
The Continuted Evolution of DAMs in the Nonprofit SectorThe Metropolitan Museum of Art
 
Lessonplan history
Lessonplan historyLessonplan history
Lessonplan historySusan Ferdon
 
Linked Open Data Publications through Wikidata & Persistent Identification...
Linked Open Data  Publications through  Wikidata &  Persistent Identification...Linked Open Data  Publications through  Wikidata &  Persistent Identification...
Linked Open Data Publications through Wikidata & Persistent Identification...PACKED vzw
 
Open Culture - How Wiki loves art and data - Packed
 Open Culture - How Wiki loves art and data - Packed Open Culture - How Wiki loves art and data - Packed
Open Culture - How Wiki loves art and data - PackedOpen Knowledge Belgium
 
Linked Open Data Publications through Wikidata & Persistent Identification in...
Linked Open Data Publications through Wikidata & Persistent Identification in...Linked Open Data Publications through Wikidata & Persistent Identification in...
Linked Open Data Publications through Wikidata & Persistent Identification in...meemoo, Vlaams instituut voor het archief
 
20130527 library linkeddata
20130527 library linkeddata20130527 library linkeddata
20130527 library linkeddataStefan Gradmann
 

Ähnlich wie Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block (20)

Capturing the visual in collections metadata / Victoria Webb (Wellcome Collec...
Capturing the visual in collections metadata / Victoria Webb (Wellcome Collec...Capturing the visual in collections metadata / Victoria Webb (Wellcome Collec...
Capturing the visual in collections metadata / Victoria Webb (Wellcome Collec...
 
Intro to PBCore Zen, PBCore workshop, AMIA 2011 - jack brighton
Intro to PBCore Zen, PBCore workshop, AMIA 2011 - jack brightonIntro to PBCore Zen, PBCore workshop, AMIA 2011 - jack brighton
Intro to PBCore Zen, PBCore workshop, AMIA 2011 - jack brighton
 
Introduction to Omeka
Introduction to OmekaIntroduction to Omeka
Introduction to Omeka
 
Seminario Cristian Lai, 06-09-2012
Seminario Cristian Lai, 06-09-2012Seminario Cristian Lai, 06-09-2012
Seminario Cristian Lai, 06-09-2012
 
LIS 653 Posters Fall 2014
LIS 653 Posters Fall 2014 LIS 653 Posters Fall 2014
LIS 653 Posters Fall 2014
 
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
 
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
 
Rich Data? Poor Data? Depends on...
Rich Data? Poor Data? Depends on...Rich Data? Poor Data? Depends on...
Rich Data? Poor Data? Depends on...
 
Creating Narrative with Digital Objects
Creating Narrative with Digital ObjectsCreating Narrative with Digital Objects
Creating Narrative with Digital Objects
 
Factual proposal
Factual proposalFactual proposal
Factual proposal
 
Buy Custom Essay Papers
Buy Custom Essay PapersBuy Custom Essay Papers
Buy Custom Essay Papers
 
Metadata / Linked Data
Metadata / Linked DataMetadata / Linked Data
Metadata / Linked Data
 
The Continuted Evolution of DAMs in the Nonprofit Sector
The Continuted Evolution of DAMs in the Nonprofit SectorThe Continuted Evolution of DAMs in the Nonprofit Sector
The Continuted Evolution of DAMs in the Nonprofit Sector
 
Lessonplan history
Lessonplan historyLessonplan history
Lessonplan history
 
One Big Library
One Big LibraryOne Big Library
One Big Library
 
Carpenter "The Future of the Scholarly Record"
Carpenter "The Future of the Scholarly Record"Carpenter "The Future of the Scholarly Record"
Carpenter "The Future of the Scholarly Record"
 
Linked Open Data Publications through Wikidata & Persistent Identification...
Linked Open Data  Publications through  Wikidata &  Persistent Identification...Linked Open Data  Publications through  Wikidata &  Persistent Identification...
Linked Open Data Publications through Wikidata & Persistent Identification...
 
Open Culture - How Wiki loves art and data - Packed
 Open Culture - How Wiki loves art and data - Packed Open Culture - How Wiki loves art and data - Packed
Open Culture - How Wiki loves art and data - Packed
 
Linked Open Data Publications through Wikidata & Persistent Identification in...
Linked Open Data Publications through Wikidata & Persistent Identification in...Linked Open Data Publications through Wikidata & Persistent Identification in...
Linked Open Data Publications through Wikidata & Persistent Identification in...
 
20130527 library linkeddata
20130527 library linkeddata20130527 library linkeddata
20130527 library linkeddata
 

Mehr von Heiko Paulheim

Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...Heiko Paulheim
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfHeiko Paulheim
 
Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterHeiko Paulheim
 
Fast Approximate A-box Consistency Checking using Machine Learning
Fast Approximate  A-box Consistency Checking using Machine LearningFast Approximate  A-box Consistency Checking using Machine Learning
Fast Approximate A-box Consistency Checking using Machine LearningHeiko Paulheim
 
Combining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionCombining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionHeiko Paulheim
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesHeiko Paulheim
 
Mining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerMining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerHeiko Paulheim
 
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Heiko Paulheim
 
Detecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaDetecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaHeiko Paulheim
 
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier DetectionIdentifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier DetectionHeiko Paulheim
 
Extending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List PagesExtending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List PagesHeiko Paulheim
 

Mehr von Heiko Paulheim (11)

Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdf
 
Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on Twitter
 
Fast Approximate A-box Consistency Checking using Machine Learning
Fast Approximate  A-box Consistency Checking using Machine LearningFast Approximate  A-box Consistency Checking using Machine Learning
Fast Approximate A-box Consistency Checking using Machine Learning
 
Combining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionCombining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly Detection
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia Entities
 
Mining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerMining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMiner
 
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
 
Detecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaDetecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpedia
 
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier DetectionIdentifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
 
Extending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List PagesExtending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List Pages
 

Kürzlich hochgeladen

Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdf
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdfRabobank_Exploring the Impact of Graph Technology on Financial Services.pdf
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdfNeo4j
 
prediction of default payment next month using a logistic approach
prediction of default payment next month using a logistic approachprediction of default payment next month using a logistic approach
prediction of default payment next month using a logistic approachAdekunleJoseph4
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
testingsdadadadaaddadadadadadadadaad.pdf
testingsdadadadaaddadadadadadadadaad.pdftestingsdadadadaaddadadadadadadadaad.pdf
testingsdadadadaaddadadadadadadadaad.pdfDSP Mutual Fund
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...ThinkInnovation
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Inference rules in artificial intelligence
Inference rules in artificial intelligenceInference rules in artificial intelligence
Inference rules in artificial intelligencePriyadharshiniG41
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdfNeo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdfNeo4j
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsba
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsbaAdobe Scan 06-Mar-2024 (1).pdfwvsbbsbsba
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsbas73678sri
 
Data Discovery With Power Query in excel
Data Discovery With Power Query in excelData Discovery With Power Query in excel
Data Discovery With Power Query in excelKapilSidhpuria3
 

Kürzlich hochgeladen (20)

Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdf
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdfRabobank_Exploring the Impact of Graph Technology on Financial Services.pdf
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdf
 
prediction of default payment next month using a logistic approach
prediction of default payment next month using a logistic approachprediction of default payment next month using a logistic approach
prediction of default payment next month using a logistic approach
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
testingsdadadadaaddadadadadadadadaad.pdf
testingsdadadadaaddadadadadadadadaad.pdftestingsdadadadaaddadadadadadadadaad.pdf
testingsdadadadaaddadadadadadadadaad.pdf
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Inference rules in artificial intelligence
Inference rules in artificial intelligenceInference rules in artificial intelligence
Inference rules in artificial intelligence
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdfNeo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsba
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsbaAdobe Scan 06-Mar-2024 (1).pdfwvsbbsbsba
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsba
 
Data Discovery With Power Query in excel
Data Discovery With Power Query in excelData Discovery With Power Query in excel
Data Discovery With Power Query in excel
 

Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block

  • 1. 12/10/2019 Heiko Paulheim 1 Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block Heiko Paulheim
  • 2. 12/10/2019 Heiko Paulheim 2 The New Kids on the (Knowledge Graph) Block Subjective age: Measured by the fraction of the audience that understands a reference to your young days’ pop culture...
  • 3. 12/10/2019 Heiko Paulheim 3 Knowledge Graphs: Out of the Dark Google’s Announcement DBpedia YAGO ResearchCyc WikidataFreebase NELL
  • 4. 12/10/2019 Heiko Paulheim 4 A Brief History of Knowledge Graphs • The 1980s: Cyc – Cyc: Encyclopedic collection of knowledge – Started by Douglas Lenat in 1984 – Estimation: 350 person years and 250,000 rules should do the job of collecting the essence of the world’s knowledge • The present (as of June 2017) – ~1,000 person years, $120M total development cost – 21M axioms and rules
  • 5. 12/10/2019 Heiko Paulheim 5 A Brief History of Knowledge Graphs • The 2010s – DBpedia: launched 2007 – YAGO: launched 2008 – Extraction from Wikipedia using mappings & heuristics • Present – Two of the most used knowledge graphs – ...with Wikidata catching up
  • 6. 12/10/2019 Heiko Paulheim 6 Getting the Most out of Wikipedia • Study for KG-based Recommender Systems* – DBpedia has a coverage of • 85% for movies • 63% for music artists • 31% for books *) Di Noia, et al.: SPRank: Semantic Path-based Ranking for Top-n Recommendations using Linked Open Data. In: ACM TIST, 2016 https://grouplens.org/datasets/
  • 7. 12/10/2019 Heiko Paulheim 7 Combining the Best of Three Worlds • DBpedia: detailed instance / relation extraction • YAGO: detailed classes • Cyc: rich axiomatization • Goal: get the best of all those worlds public private Paulheim: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8:3 (2017), pp. 489-508
  • 8. 12/10/2019 Heiko Paulheim 8 Towards CaLiGraph • YAGO uses categories for types – e.g., Category:American Industrial Groups – but does not analyze them further • :NineInchNails a :AmericanIndustrialGroup – “Things, not Strings”? • :NineInchNails a :MusicalGroup ; hometown :United_States ; genre :Industrial .
  • 9. 12/10/2019 Heiko Paulheim 9 Cat2Ax: Axiomatizing Wikipedia Categories Albums Albums  by genre  Albums  by artist  Nine Inch Nails  albums  The Doors  albums  Rock  albums  Pop  albums  Reggae  albums  The Beatles albums  ... ... ...  dbo:Album  dbo:artist.{dbr:Nine_Inch_Nails}  dbo:genre.{dbr:Rock_Music}
  • 10. 12/10/2019 Heiko Paulheim 10 Cat2Ax: Axiomatizing Wikipedia Categories Albums Albums  by genre  Albums  by artist  Nine Inch Nails  albums  The Doors  albums  Rock  albums  Pop  albums  Reggae  albums  The Beatles albums  ... ... ...  dbo:genre.{dbr:Rock_Music} ?  dbo:artist.{dbr:Rock_(Rapper)} ?
  • 11. 12/10/2019 Heiko Paulheim 11 Cat2Ax: Axiomatizing Wikipedia Categories – Frequency: how often does the pattern occur in a category? • i.e.: share of instances that have dbo:genre.{dbr.Rock_Music}? – Lexical score: likelihood of term as a surface form of object • i.e.: how often is Rock used to refer to dbr:Rock_Music? – Sibling score: how likely are sibling categories sharing similar patterns? • i.e., are there sibling categories with a high score for dbo:genre? Albums Albums  by genre  Albums  by artist  Nine Inch Nails  albums  The Doors  albums  Rock  albums  Pop  albums  Reggae  albums  The Beatles albums  ... ... ...
  • 12. 12/10/2019 Heiko Paulheim 12 Cat2Ax: Axiomatizing Wikipedia Categories • Results
  • 13. 12/10/2019 Heiko Paulheim 13 Improving Instance Coverage: Lists in Wikipedia • Only existing pages have categories – Lists may also link to non-existing pages
  • 14. 12/10/2019 Heiko Paulheim 14 CaLiGraph Example Category: Musical Groups established in 1987 List of symphonic metal bands Category: Swedish death metal bands List of Swedes in Music
  • 15. 12/10/2019 Heiko Paulheim 15 CaLiGraph Statistics • Class hierarchy from categories and list pages – 750k classes • Axioms from categories and list pages – 200k definitions • Instances from list pages – 870k new instances
  • 16. 12/10/2019 Heiko Paulheim 16 CaLiGraph Glitches
  • 17. 12/10/2019 Heiko Paulheim 17 The Future of CaLiGraph • Going beyond red links • Going beyond explicit lists
  • 18. 12/10/2019 Heiko Paulheim 18 Knowledge Graph Creation Beyond Wikipedia
  • 19. 12/10/2019 Heiko Paulheim 19 A Bird’s Eye View on DBpedia EF • DBpedia Extraction Framework • Input: – A Wikipedia Dump (+ mappings) • Output: – DBpedia DBpedia Extraction Framework
  • 20. 12/10/2019 Heiko Paulheim 20 An Even Higher Bird’s Eye View on DBpedia EF • DBpedia Extraction Framework • Input: – A Media Wiki Dump (+ mappings) • Output: – A Knowledge Graph DBpedia Extraction Framework
  • 21. 12/10/2019 Heiko Paulheim 21 What if…? • What if we went from Wikipedia every MediaWiki? • According to WikiApiary, there’s thousands...
  • 22. 12/10/2019 Heiko Paulheim 22 Why? • More is better (maybe)
  • 23. 12/10/2019 Heiko Paulheim 23 Why? • Overcoming Wikipedia’s coverage bias
  • 24. 12/10/2019 Heiko Paulheim 24 A Brief History of DBkWik • Started as a student project in 2017 • Task: run DBpedia EF on a large Wiki Farm – ...and see what happens
  • 25. 12/10/2019 Heiko Paulheim 25 DBkWik vs. DBpedia • Challenges – Getting dumps: only a fraction of Fandom Wikis has dumps – Downloadable from Fandom: 12,840 dumps – Tried: auto-requesting dumps
  • 26. 12/10/2019 Heiko Paulheim 26 Obtaining Dumps • We had to change our strategy: WikiTeam software – Produces dumps by crawling Wikis – Fandom has not blocked us so far :-) – Current collection: 307,466 Wikis → will go into DBkWik 1.2 release
  • 27. 12/10/2019 Heiko Paulheim 27 DBkWik vs. DBpedia • Mappings do not exist – no central ontology – i.e., only raw extraction possible • Duplicates exist – origin: pages about the same entity in different Wikis – unlike Wikipedia: often not explicitly linked • Different configurations of MediaWiki
  • 28. 12/10/2019 Heiko Paulheim 28 Absence of Mappings and Ontology • Every infobox becomes a class: {infobox actor → mywiki:actor a owl:Class • Every infobox key becomes a property |role = Harry’s mother → mywiki:role a rdf:Property • The resulting ontology is very shallow – No class hierarchy – No distinction of object and data properties – No domains and ranges
  • 29. 12/10/2019 Heiko Paulheim 29 Duplicates • Collecting Data from a Multitude of Wikis
  • 30. 12/10/2019 Heiko Paulheim 30 Representational Variety • No conventions across Wikis (besides using MediaWiki syntax) {{Person |name = Trent Reznor |image = TrentReznor.jpg |caption - Reznor at the [[83rd Academy Awards]] |nominations = 1 |wins = 1 |role = Composer |birthdate = May 17, 1965 |birthloc = Mercer, Pennsylvania, USA}} {{Infobox musician | Name = Trent Reznor | Birth_name = Michael Trent Reznor | Born = May 17, [[1965]] (age 53) | Origin = [[Mercer]], [[Pennsylvania]], [[United States]] ... }} {{Infobox cast |Name=Trent Reznor |Image= |ImageCaption= |character= |crew= |Born={{d|May|17|1965}}{{-}}New Castle, Pennsylvania, United States ... }
  • 31. 12/10/2019 Heiko Paulheim 31 Data Fusion
  • 32. 12/10/2019 Heiko Paulheim 32 Naive Data Fusion and Linking to DBpedia • String similarity for schema matching (classes/properties) • doc2vec similarity on original pages for instance matching • Results – Classes and properties work OK – Instances are trickier – Internal linking seems easier F1 score... Internal Linking Linking to DBpedia Classes .979 .898 Properties .836 .865 Instances .879 .657 maybe...
  • 33. 12/10/2019 Heiko Paulheim 33 Improving Linking and Fusion • Started a new track at OAEI in 2018 – annual benchmark for matching tools • In 2019, some tools beat the baseline – albeit by a small margin only
  • 34. 12/10/2019 Heiko Paulheim 34 Results Data Fusion • Uneven distribution – e.g., character appears 5k times • Currently: no multi-linguality – e.g., Main Page, Hauptseite • Probably overloaded fusion (false positives) – e.g., next, location
  • 35. 12/10/2019 Heiko Paulheim 35 Light-weight Schema Induction • Class hierarchy and domain/range induction – Using association rule mining • e.g., Artist(x) → Person(x) – 5k class subsumption axioms – 59k domain restrictions – 114k range restrictions • Instance typing – With a light-weight version of SDType – Using the learned ranges as approximations of actual distributions • Result: ~100k new instance types Person? Artist Person
  • 36. 12/10/2019 Heiko Paulheim 36 Big Picture Dump Downloader DBpedia Extraction Framework Interlinking Instance Matcher Schema Matcher MediaWiki Dumps Extracted RDF Internal Linking Instance Matcher Schema Matcher Consolidated Knowledge Graph DBkWik Linked Data Endpoint Ontology Knowledge Graph Fusion Instance Matcher Domain/ Range Type SDType Light SubclassMaterialization
  • 37. 12/10/2019 Heiko Paulheim 37 DBkWik 1.1 • Source: ~15k Wiki dumps from Fandom – 52.4GB of data (roughly the size of the English Wikipedia) Raw Final Instances 14,212,535 11,163,719 Typed instances 1,880,189 1,372,971 Triples 107,833,322 91,526,001 Avg. indegree 0.624 0.703 Avg. outdegree 7.506 8.169 Classes 71,580 12,029 Properties 506,487 128,566
  • 38. 12/10/2019 Heiko Paulheim 38 DBkWik 1.1 • Fused graphs from 15k Wikis http://dbkwik.webdatacommons.org/
  • 39. 12/10/2019 Heiko Paulheim 39 DBkWik 1.1 vs. other Knowledge Graphs • Caveat: – Minus non-recognized duplicates!
  • 40. 12/10/2019 Heiko Paulheim 40 DBkWik 1.1 vs. DBpedia • How complementary are DBkWik and Dbpedia? • Challenge: – We only have an incomplete and partly correct mapping M – But: we know its precision P and recall R • Trick (see KI paper 2017): – O is the actual overlap (unknown), T  M is the true part of M (unknown) • By definition: – P = |T| / |M| → |T| = P * |M| – R = |T| / |O| → |T| = R * |O| → |O| = |M| * P / R DBkWik DBpedia
  • 41. 12/10/2019 Heiko Paulheim 41 DBkWik 1.1 vs. DBpedia • How complementary are DBkWik and Dbpedia? – |O| = |M| * P / R – Overlap: ~500k instances • In other words: – 95% of all entities in DBkWik are not in DBpedia – 90% of all entities in DBpedia are not in DBkWik DBkWik DBpedia
  • 42. 12/10/2019 Heiko Paulheim 42 NewNif Extractor Towards DBkWik 1.2 • Current crawl: 307,466 Wikis • Extraction: more robust for non-infobox templates – e.g., LyricWiki: 1.7M songs, 177k albums, 84k artists • Robust abstract extraction – using SWEBLE parser – no local MediaWiki instance • Better matching • New gold standard Source Simple WikiParser LinkExtractor Page NifExtractor AST Destination Graph HTML
  • 43. 12/10/2019 Heiko Paulheim 43 Towards DBkWik 1.2 • What to expect? – data from 307,466 wikis – 38,985,266 articles
  • 44. 12/10/2019 Heiko Paulheim 44 Towards DBkWik 1.2 • What to expect? – data from 307,466 wikis – 38,985,266 articles
  • 45. 12/10/2019 Heiko Paulheim 45 Further Open Challenges • More detailed profiling of knowledge graphs – e.g., do we reduce or increase bias? • Task-based downstream evaluations – Does it improve, e.g., recommender systems? • Fusion policies – e.g., identify outdated information
  • 46. 12/10/2019 Heiko Paulheim 46 Contributors • Contributors (past&present) Sven Hertling Alexandra Hofmann Samresh Perchani Jan Portisch Nicolas Heist
  • 47. 12/10/2019 Heiko Paulheim 47 Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block Heiko Paulheim