SlideShare ist ein Scribd-Unternehmen logo
1 von 58
Downloaden Sie, um offline zu lesen
9/30/21 Heiko Paulheim 1
From Wikis to Knowledge Graphs:
Approaches and Challenges
beyond DBpedia and YAGO
Heiko Paulheim
9/30/21 Heiko Paulheim 2
A Brief History of Knowledge Graphs
Google’s
Announcement
DBpedia
YAGO
ResearchCyc Wikidata
Freebase
NELL
9/30/21 Heiko Paulheim 3
Wikipedia as a Knowledge Graph
• Wikipedia based Knowledge Graphs
– DBpedia: launched 2007
– YAGO: launched 2008
– Extraction from Wikipedia
using mappings & heuristics
• Present
– Two of the most used knowledge graphs
– ...with Wikidata catching up
9/30/21 Heiko Paulheim 4
Wikipedia as a Knowledge Graph
9/30/21 Heiko Paulheim 5
Wikipedia as a Knowledge Graph
city
campus
state
c
i
t
y
9/30/21 Heiko Paulheim 6
Wikipedia as a Knowledge Graph
• Mapping to a central schema/ontology
University
chancellor Person
Organisation
Agent
campus Place
range
range
domain
domain
subclass of
subclass of
subclass of
9/30/21 Heiko Paulheim 7
Wikipedia as a Knowledge Graph
• General characteristics of DBpedia and YAGO:
– Central/schema ontology
• DBpedia: crowdsourcing
• YAGO: WordNet + categories
– Mapping of infobox keys
• DBpedia: crowdsourcing
• YAGO: engineering
– One page per entity
• i.e.: set of entities = set of Wikipedia pages
9/30/21 Heiko Paulheim 8
Getting the Most out of Wikipedia
• Study for KG-based
Recommender Systems*
– DBpedia has a coverage of
• 85% for movies
• 63% for music artists
• 31% for books
*) Di Noia, et al.: SPRank: Semantic Path-based Ranking for Top-n
Recommendations using Linked Open Data. In: ACM TIST, 2016
https://grouplens.org/datasets/
9/30/21 Heiko Paulheim 9
Why bother?
• Experiment w/
recommender
systems
(LDK 2021)
– Trained on five versions
of DBpedia
– Biases become evident
• examined: genre,
production country
Voit & Paulheim (2021): Bias in Knowledge Graphs.
9/30/21 Heiko Paulheim 10
Why bother?
• One key take away of that paper:
• Rethink parameter tuning and ablation studies!
– We see ablation studies on methods, parameters, etc.
– But rarely on knowledge graphs
– However, there are considerable differences
• observed in this work: factor of 2-3
• Especially:
– entity coverage and level of detail
Voit & Paulheim (2021): Bias in Knowledge Graphs.
9/30/21 Heiko Paulheim 11
Increasing Level of Detail
• YAGO uses categories for types
– e.g., Category:American Industrial Groups
– but does not analyze them further
• :NineInchNails a :AmericanIndustrialGroup
– “Things, not Strings”?
• :NineInchNails a :MusicalGroup ;
hometown :United_States ;
genre :Industrial .
9/30/21 Heiko Paulheim 12
Cat2Ax: Axiomatizing Wikipedia Categories
 dbo:Album
 dbo:artist.{dbr:Nine_Inch_Nails}
 dbo:genre.{dbr:Rock_Music}
See: ISWC 2019 Paper on Uncovering the Semantics of Wikipedia Categories
9/30/21 Heiko Paulheim 13
Cat2Ax: Axiomatizing Wikipedia Categories
 dbo:genre.{dbr:Rock_Music} ?
 dbo:artist.{dbr:Rock_(Rapper)} ?
9/30/21 Heiko Paulheim 14
Cat2Ax: Axiomatizing Wikipedia Categories
– Frequency: how often does the pattern occur in a category?
• i.e.: share of instances that have dbo:genre.{dbr.Rock_Music}?
– Lexical score: likelihood of term as a surface form of object
• i.e.: how often is Rock used to refer to dbr:Rock_Music?
– Sibling score: how likely are sibling categories sharing similar patterns?
• i.e., are there sibling categories with a high score for dbo:genre?
9/30/21 Heiko Paulheim 15
Cat2Ax: Axiomatizing Wikipedia Categories
• Results
9/30/21 Heiko Paulheim 16
CaLiGraph Example
Category: Musical Groups established
in 1987
List of symphonic metal bands
Category: Swedish death metal bands
List of Swedes in Music
9/30/21 Heiko Paulheim 17
Improving Entity Coverage:
Lists in Wikipedia
• Only existing pages have categories
– Lists may also link to non-existing pages
9/30/21 Heiko Paulheim 18
Pushing Entity Coverage Further
• Beyond red links (2020) • Beyond explicit lists (2021)
9/30/21 Heiko Paulheim 19
Entity Extraction from List Pages
• Red and grey links
– Red links point to entities
that do not exist
– “Grey links”
• are actually not links
• i.e., entities to be
discovered
9/30/21 Heiko Paulheim 20
Entity Extraction from List Pages
• Lists form (shallow) hierarchies
9/30/21 Heiko Paulheim 21
Entity Extraction from List Pages
• Idea: align with category graph
• Equivalence:
– “List of Japanese Writers”
↔ Category:Japanese Writers
• Subsumption:
– “List of Japanese
Speculative Fiction Writers”
→ Category:Japanese Writers
9/30/21 Heiko Paulheim 22
Classifying Red Links
• Not all entities on a list page
belong to the same category
• Idea:
– Learn classifier to tell subject
entities from non-subject entities
• Distant learning approach
– Positive examples:
• Entities that are in the
corresponding category
– Negative examples
• Entities that are in a category
which is disjoint
• e.g., Book <> Writer
9/30/21 Heiko Paulheim 23
Classifying Red Links
• Using a mix of features
– Page layout, position of entities, statistical, linguistics, …
9/30/21 Heiko Paulheim 24
Classifying Red Links
• Enumerations work slightly better than tables
• Unevenly balanced
– >70% place, species, and person
9/30/21 Heiko Paulheim 25
Beyond List Pages
• Many pages
contain list-like
constructs
• Usually
– small
– same type
– same relation
to page entity
– more grey links
…
9/30/21 Heiko Paulheim 26
Beyond List Pages
9/30/21 Heiko Paulheim 27
Beyond List Pages
• Learning descriptive rules for listings, e.g.
– topSection(“Discography”) → artist.{>PageEntity<}
– Learning across pages to mitigate small data problems
• Metrics:
– Support: no. of listings covered by rule antecedent
– Confidence: frequency of rule consequent over all covered listings
– Consistency: mean absolute deviation
of overall confidence and listing confidence
• i.e., does the rule work equally well across all covered listings
9/30/21 Heiko Paulheim 28
Beyond List Pages
• Entity detection:
– Specialize SpaCy tagger with entities on Wikipedia list pages
– Use SpaCy tags for filtering (e.g., PER for Person etc.)
• Based on majority vote per class
– tag fit (i.e., proportion of “fitting” tags for class axioms)
used for thresholding
9/30/21 Heiko Paulheim 29
Beyond List Pages
• We can learn
– ~5M rules for types
– ~3k rules for relations
• Identify ~2M new entities
– incl. type and relations
within KG
• Post hoc inspection of axioms:
– Accuracy >90%
Person
34%
Species
24%
Place
11%
Work
6%
Other
25%
9/30/21 Heiko Paulheim 30
CaLiGraph at a Glance
• Latest version 2.1
– 15M entities
• incl. 8M from listings
– Caveat:
• disambiguation!
9/30/21 Heiko Paulheim 31
Entity Disambiguation
• Examples: Wikipedia pages of Die Krupps and Eisbrecher
?
9/30/21 Heiko Paulheim 32
CaLiGraph Glitches
9/30/21 Heiko Paulheim 33
CaLiGraph Challenges & Open Issues
• Entity disambiguation
• Usage of formal ontologies (e.g., country is functional)
• Extracting information directly from the context
Achim_Färber
Drums
instrument
9/30/21 Heiko Paulheim 34
Knowledge Graph Creation Beyond Wikipedia
9/30/21 Heiko Paulheim 35
A Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Wikipedia Dump
(+ mappings)
• Output:
– DBpedia
DBpedia
Extraction
Framework
9/30/21 Heiko Paulheim 36
A Satellite View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Media Wiki Dump
(+ mappings)
• Output:
– A Knowledge Graph
DBpedia
Extraction
Framework
9/30/21 Heiko Paulheim 37
What if…?
• What if we went from Wikipedia every MediaWiki?
• According to WikiApiary, there’s thousands...
9/30/21 Heiko Paulheim 38
Why?
• More is better (maybe)
9/30/21 Heiko Paulheim 39
Why?
• Overcoming Wikipedia’s coverage bias
9/30/21 Heiko Paulheim 40
A Brief History of DBkWik
• Started as a student project in 2017
• Task: run DBpedia EF on a large Wiki Farm
– ...and see what happens
9/30/21 Heiko Paulheim 41
DBkWik vs. DBpedia
• Challenges
– Getting dumps: only a fraction of Fandom Wikis has dumps
– Downloadable from Fandom: 12,840 dumps
– Tried: auto-requesting dumps
9/30/21 Heiko Paulheim 42
Obtaining Dumps
• We had to change our strategy: WikiTeam software
– Produces dumps by crawling Wikis
– Fandom has not blocked us so far :-)
– Current collection: >300k Wikis
→ will go into DBkWik 1.2 release
9/30/21 Heiko Paulheim 43
DBkWik vs. DBpedia
• Mappings do not exist
– no central ontology
– i.e., only raw extraction possible
• Duplicates exist
– origin: pages about the same entity
in different Wikis
– unlike Wikipedia: often not explicitly linked
• Different configurations of MediaWiki
9/30/21 Heiko Paulheim 44
Absence of Mappings and Ontology
• Every infobox becomes a class:
{infobox actor
→ mywiki:actor a owl:Class
• Every infobox key becomes a property
|role = Harry’s mother
→ mywiki:role a rdf:Property
• The resulting ontology is very shallow
– No class hierarchy
– No distinction of object and data properties
– No domains and ranges
9/30/21 Heiko Paulheim 45
Duplicates
• Collecting Data from a Multitude of Wikis
9/30/21 Heiko Paulheim 46
Representational Variety
• No conventions across Wikis (besides using MediaWiki syntax)
{{Person
|name = Trent Reznor
|image = TrentReznor.jpg
|caption - Reznor at the [[83rd Academy Awards]]
|nominations = 1
|wins = 1
|role = Composer
|birthdate = May 17, 1965
|birthloc = Mercer, Pennsylvania, USA}}
{{Infobox musician
| Name = Trent Reznor
| Birth_name = Michael Trent Reznor
| Born = May 17, [[1965]] (age 53)
| Origin = [[Mercer]],
[[Pennsylvania]], [[United States]]
...
}}
{{Infobox cast
|Name=Trent Reznor
|Image=
|ImageCaption=
|character=
|crew=
|Born={{d|May|17|1965}}{{-}}New Castle,
Pennsylvania, United States
...
}
9/30/21 Heiko Paulheim 47
Data Fusion
9/30/21 Heiko Paulheim 48
Naive Data Fusion and Linking to DBpedia
• String similarity for schema matching (classes/properties)
• doc2vec similarity on original pages for instance matching
• Results
– Classes and properties work OK
– Instances are trickier
– Internal linking seems easier
F1 score... Internal Linking Linking to DBpedia
Classes .979 .898
Properties .836 .865
Instances .879 .657
maybe...
9/30/21 Heiko Paulheim 49
Improving Linking and Fusion
• Started a new track at OAEI in 2018
– annual benchmark for matching tools
• From 2019, some tools starting beating the baseline
– albeit by a small margin only
9/30/21 Heiko Paulheim 50
The Golden Hammer Bias
• Challenge:
– OAEI tools expect two related KGs
– but: 300k KGs can only be matched
without manual pre-inspection
See: ESWC 2020 Paper on OAEI Knowledge Graph Track
9/30/21 Heiko Paulheim 51
Big Picture
Dump
Downloader
DBpedia
Extraction
Framework
Interlinking
Instance
Matcher
Schema
Matcher
MediaWiki Dumps
Extracted
RDF
Internal Linking
Instance
Matcher
Schema
Matcher
Consolidated
Knowledge Graph
DBkWik
Linked
Data
Endpoint
Ontology
Knowledge
Graph
Fusion
Instance
Matcher
Domain/
Range
Type
SDType
Light
Subclass
Materialization
9/30/21 Heiko Paulheim 52
DBkWik 1.1
• Source: ~15k Wiki dumps from Fandom
– 52.4GB of data (roughly the size of the English Wikipedia)
Raw Final
Instances 14,212,535 11,163,719
Typed instances 1,880,189 1,372,971
Triples 107,833,322 91,526,001
Avg. indegree 0.624 0.703
Avg. outdegree 7.506 8.169
Classes 71,580 12,029
Properties 506,487 128,566
9/30/21 Heiko Paulheim 53
Towards DBkWik 1.2
• Again, we have an entity resolution problem
– with entities from 300k sources
• Strategies
• (i) pairwise (O(n²))
• (ii-iv) transitive pairs
• (iii-v) incremental merge
• Ordering by
• smallest/largest first
• source similarity
A B
C D
A B C D
A B
C
AB
D
ABC
A B C D
AB CD
(i)
(ii)
(v) (vi)
A B
C
D
(iii)
Incremental
Merge
Transitive
Pairs
A C
B D
(iv)
Similarity based
Order based
9/30/21 Heiko Paulheim 54
Towards DBkWik 1.2
• Preliminary results
– Incremental merge works best (quality close to pairwise)
• Main challenge: runtime
– One match&merge step takes ~20 minutes
– i.e., 300k steps are more than 11 years!
• Idea: parallelization
– best case: tree height of 18
→ best runtime (fully parallel): six hours
A B
C
AB
D
ABC
E
ABCD
F
ABCDE
G
ABCDEF
H
ABCDEFG
A B C D
AB CD
E F G H
EF GH
ABCD EFGH
9/30/21 Heiko Paulheim 55
Summary
• DBpedia and YAGO
– one source (Wikipedia, multiple languages)
– one entity per page paradigm
• CaLiGraph
– one source (Wikipedia, multiple languages would be possible)
– extraction from list-like constructs
• furtherpossible extension: list-like constructs outside of Wikipedia
– current open challenge: entity resolution
• DBkWik
– extraction from thousands of Wikis
– current open challenge: entity resolution
• in particular: scalability!
9/30/21 Heiko Paulheim 56
Further Open Challenges
• More detailed profiling of knowledge graphs
– e.g., do we reduce or increase bias?
– and: is that good or bad?
• Task-based downstream evaluations
– Does it improve, e.g., recommender systems?
• Fusion policies
– schema level,
e.g., many shallow ontologies
→ one deep ontology?
– instance level,
e.g., identify outdated information
9/30/21 Heiko Paulheim 57
Contributors
• Contributors (past&present)
Sven Hertling Alexandra
Hofmann
Samresh
Perchani
Jan Portisch Nicolas
Heist
9/30/21 Heiko Paulheim 58
From Wikis to Knowledge Graphs:
Approaches and Challenges
beyond DBpedia and YAGO
Heiko Paulheim

Weitere ähnliche Inhalte

Was ist angesagt?

New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vecHeiko Paulheim
 
Machine Learning & Embeddings for Large Knowledge Graphs
Machine Learning & Embeddings  for Large Knowledge GraphsMachine Learning & Embeddings  for Large Knowledge Graphs
Machine Learning & Embeddings for Large Knowledge GraphsHeiko Paulheim
 
Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Heiko Paulheim
 
Type Inference on Noisy RDF Data
Type Inference on Noisy RDF DataType Inference on Noisy RDF Data
Type Inference on Noisy RDF DataHeiko Paulheim
 
Data-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and OntologyData-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and OntologyHeiko Paulheim
 
What the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataWhat the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataHeiko Paulheim
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Heiko Paulheim
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Heiko Paulheim
 
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on TopServing DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on TopHeiko Paulheim
 
Linked Data in Libraries
Linked Data in LibrariesLinked Data in Libraries
Linked Data in LibrariesRichard Wallis
 
Automated interpretability of linked data ontologies: an evaluation within th...
Automated interpretability of linked data ontologies: an evaluation within th...Automated interpretability of linked data ontologies: an evaluation within th...
Automated interpretability of linked data ontologies: an evaluation within th...Nuno Freire
 
Schema.org: Where did that come from!
Schema.org: Where did that come from!Schema.org: Where did that come from!
Schema.org: Where did that come from!Richard Wallis
 
Linked Data Snowball, or Why We Need Reconciliation
Linked Data Snowball, or Why We Need ReconciliationLinked Data Snowball, or Why We Need Reconciliation
Linked Data Snowball, or Why We Need ReconciliationRobert Sanderson
 
Contextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationContextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationRichard Wallis
 
They have left the building: The Web Route to Library Users
They have left the building: The Web Route to Library UsersThey have left the building: The Web Route to Library Users
They have left the building: The Web Route to Library UsersRichard Wallis
 
Schema.org where did that come from?
Schema.org where did that come from?Schema.org where did that come from?
Schema.org where did that come from?Richard Wallis
 

Was ist angesagt? (20)

New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vec
 
Machine Learning & Embeddings for Large Knowledge Graphs
Machine Learning & Embeddings  for Large Knowledge GraphsMachine Learning & Embeddings  for Large Knowledge Graphs
Machine Learning & Embeddings for Large Knowledge Graphs
 
Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!
 
Type Inference on Noisy RDF Data
Type Inference on Noisy RDF DataType Inference on Noisy RDF Data
Type Inference on Noisy RDF Data
 
Data-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and OntologyData-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and Ontology
 
What the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataWhat the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open Data
 
How much is a Triple?
How much is a Triple?How much is a Triple?
How much is a Triple?
 
Ld4 dh tutorial
Ld4 dh tutorialLd4 dh tutorial
Ld4 dh tutorial
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
 
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on TopServing DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
 
Linked Data in Libraries
Linked Data in LibrariesLinked Data in Libraries
Linked Data in Libraries
 
Automated interpretability of linked data ontologies: an evaluation within th...
Automated interpretability of linked data ontologies: an evaluation within th...Automated interpretability of linked data ontologies: an evaluation within th...
Automated interpretability of linked data ontologies: an evaluation within th...
 
Schema.org: Where did that come from!
Schema.org: Where did that come from!Schema.org: Where did that come from!
Schema.org: Where did that come from!
 
Linked Data Snowball, or Why We Need Reconciliation
Linked Data Snowball, or Why We Need ReconciliationLinked Data Snowball, or Why We Need Reconciliation
Linked Data Snowball, or Why We Need Reconciliation
 
Contextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationContextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data Foundation
 
They have left the building: The Web Route to Library Users
They have left the building: The Web Route to Library UsersThey have left the building: The Web Route to Library Users
They have left the building: The Web Route to Library Users
 
Linked Open Data and Ontotext Projects
Linked Open Data and Ontotext ProjectsLinked Open Data and Ontotext Projects
Linked Open Data and Ontotext Projects
 
Schema.org where did that come from?
Schema.org where did that come from?Schema.org where did that come from?
Schema.org where did that come from?
 
Unlocking Doors: recent initiatives in open and linked data at the National L...
Unlocking Doors: recent initiatives in open and linked data at the National L...Unlocking Doors: recent initiatives in open and linked data at the National L...
Unlocking Doors: recent initiatives in open and linked data at the National L...
 

Ähnlich wie From Wikis to Knowledge Graphs

Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...Heiko Paulheim
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfHeiko Paulheim
 
Fast Approximate A-box Consistency Checking using Machine Learning
Fast Approximate  A-box Consistency Checking using Machine LearningFast Approximate  A-box Consistency Checking using Machine Learning
Fast Approximate A-box Consistency Checking using Machine LearningHeiko Paulheim
 
Maja Žumer: Library catalogues of the future: realising the old vision with n...
Maja Žumer: Library catalogues of the future: realising the old vision with n...Maja Žumer: Library catalogues of the future: realising the old vision with n...
Maja Žumer: Library catalogues of the future: realising the old vision with n...ÚISK FF UK
 
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...jessica666
 
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...jessica666
 
Visualizing the Transcribe Bentham Corpus
Visualizing the Transcribe Bentham CorpusVisualizing the Transcribe Bentham Corpus
Visualizing the Transcribe Bentham CorpusUCLDH
 
Extending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List PagesExtending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List PagesHeiko Paulheim
 
FST library session – part 2 2015
FST library session – part 2 2015FST library session – part 2 2015
FST library session – part 2 2015molloylibrarian
 
Intro to Editing Wikipedia - Women in the Arts Meetup & Edit-a-Thon
Intro to Editing Wikipedia - Women in the Arts Meetup & Edit-a-ThonIntro to Editing Wikipedia - Women in the Arts Meetup & Edit-a-Thon
Intro to Editing Wikipedia - Women in the Arts Meetup & Edit-a-ThonSara Snyder
 
DBT04-ER-Models-v1.pdf
DBT04-ER-Models-v1.pdfDBT04-ER-Models-v1.pdf
DBT04-ER-Models-v1.pdfNermeenKamel7
 

Ähnlich wie From Wikis to Knowledge Graphs (12)

Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdf
 
Fast Approximate A-box Consistency Checking using Machine Learning
Fast Approximate  A-box Consistency Checking using Machine LearningFast Approximate  A-box Consistency Checking using Machine Learning
Fast Approximate A-box Consistency Checking using Machine Learning
 
Maja Žumer: Library catalogues of the future: realising the old vision with n...
Maja Žumer: Library catalogues of the future: realising the old vision with n...Maja Žumer: Library catalogues of the future: realising the old vision with n...
Maja Žumer: Library catalogues of the future: realising the old vision with n...
 
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
 
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
Publishers' Bindings Online and The Artistic, Cultural, and Historical Signif...
 
Referencing session November 2016
Referencing session November 2016Referencing session November 2016
Referencing session November 2016
 
Visualizing the Transcribe Bentham Corpus
Visualizing the Transcribe Bentham CorpusVisualizing the Transcribe Bentham Corpus
Visualizing the Transcribe Bentham Corpus
 
Extending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List PagesExtending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List Pages
 
FST library session – part 2 2015
FST library session – part 2 2015FST library session – part 2 2015
FST library session – part 2 2015
 
Intro to Editing Wikipedia - Women in the Arts Meetup & Edit-a-Thon
Intro to Editing Wikipedia - Women in the Arts Meetup & Edit-a-ThonIntro to Editing Wikipedia - Women in the Arts Meetup & Edit-a-Thon
Intro to Editing Wikipedia - Women in the Arts Meetup & Edit-a-Thon
 
DBT04-ER-Models-v1.pdf
DBT04-ER-Models-v1.pdfDBT04-ER-Models-v1.pdf
DBT04-ER-Models-v1.pdf
 

Mehr von Heiko Paulheim

Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterHeiko Paulheim
 
Combining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionCombining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionHeiko Paulheim
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesHeiko Paulheim
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge DiscoveryHeiko Paulheim
 
Mining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerMining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerHeiko Paulheim
 
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Heiko Paulheim
 
Detecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaDetecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaHeiko Paulheim
 
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier DetectionIdentifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier DetectionHeiko Paulheim
 

Mehr von Heiko Paulheim (8)

Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on Twitter
 
Combining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionCombining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly Detection
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia Entities
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge Discovery
 
Mining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerMining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMiner
 
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
 
Detecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaDetecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpedia
 
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier DetectionIdentifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
 

Kürzlich hochgeladen

RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 

Kürzlich hochgeladen (20)

RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 

From Wikis to Knowledge Graphs

  • 1. 9/30/21 Heiko Paulheim 1 From Wikis to Knowledge Graphs: Approaches and Challenges beyond DBpedia and YAGO Heiko Paulheim
  • 2. 9/30/21 Heiko Paulheim 2 A Brief History of Knowledge Graphs Google’s Announcement DBpedia YAGO ResearchCyc Wikidata Freebase NELL
  • 3. 9/30/21 Heiko Paulheim 3 Wikipedia as a Knowledge Graph • Wikipedia based Knowledge Graphs – DBpedia: launched 2007 – YAGO: launched 2008 – Extraction from Wikipedia using mappings & heuristics • Present – Two of the most used knowledge graphs – ...with Wikidata catching up
  • 4. 9/30/21 Heiko Paulheim 4 Wikipedia as a Knowledge Graph
  • 5. 9/30/21 Heiko Paulheim 5 Wikipedia as a Knowledge Graph city campus state c i t y
  • 6. 9/30/21 Heiko Paulheim 6 Wikipedia as a Knowledge Graph • Mapping to a central schema/ontology University chancellor Person Organisation Agent campus Place range range domain domain subclass of subclass of subclass of
  • 7. 9/30/21 Heiko Paulheim 7 Wikipedia as a Knowledge Graph • General characteristics of DBpedia and YAGO: – Central/schema ontology • DBpedia: crowdsourcing • YAGO: WordNet + categories – Mapping of infobox keys • DBpedia: crowdsourcing • YAGO: engineering – One page per entity • i.e.: set of entities = set of Wikipedia pages
  • 8. 9/30/21 Heiko Paulheim 8 Getting the Most out of Wikipedia • Study for KG-based Recommender Systems* – DBpedia has a coverage of • 85% for movies • 63% for music artists • 31% for books *) Di Noia, et al.: SPRank: Semantic Path-based Ranking for Top-n Recommendations using Linked Open Data. In: ACM TIST, 2016 https://grouplens.org/datasets/
  • 9. 9/30/21 Heiko Paulheim 9 Why bother? • Experiment w/ recommender systems (LDK 2021) – Trained on five versions of DBpedia – Biases become evident • examined: genre, production country Voit & Paulheim (2021): Bias in Knowledge Graphs.
  • 10. 9/30/21 Heiko Paulheim 10 Why bother? • One key take away of that paper: • Rethink parameter tuning and ablation studies! – We see ablation studies on methods, parameters, etc. – But rarely on knowledge graphs – However, there are considerable differences • observed in this work: factor of 2-3 • Especially: – entity coverage and level of detail Voit & Paulheim (2021): Bias in Knowledge Graphs.
  • 11. 9/30/21 Heiko Paulheim 11 Increasing Level of Detail • YAGO uses categories for types – e.g., Category:American Industrial Groups – but does not analyze them further • :NineInchNails a :AmericanIndustrialGroup – “Things, not Strings”? • :NineInchNails a :MusicalGroup ; hometown :United_States ; genre :Industrial .
  • 12. 9/30/21 Heiko Paulheim 12 Cat2Ax: Axiomatizing Wikipedia Categories  dbo:Album  dbo:artist.{dbr:Nine_Inch_Nails}  dbo:genre.{dbr:Rock_Music} See: ISWC 2019 Paper on Uncovering the Semantics of Wikipedia Categories
  • 13. 9/30/21 Heiko Paulheim 13 Cat2Ax: Axiomatizing Wikipedia Categories  dbo:genre.{dbr:Rock_Music} ?  dbo:artist.{dbr:Rock_(Rapper)} ?
  • 14. 9/30/21 Heiko Paulheim 14 Cat2Ax: Axiomatizing Wikipedia Categories – Frequency: how often does the pattern occur in a category? • i.e.: share of instances that have dbo:genre.{dbr.Rock_Music}? – Lexical score: likelihood of term as a surface form of object • i.e.: how often is Rock used to refer to dbr:Rock_Music? – Sibling score: how likely are sibling categories sharing similar patterns? • i.e., are there sibling categories with a high score for dbo:genre?
  • 15. 9/30/21 Heiko Paulheim 15 Cat2Ax: Axiomatizing Wikipedia Categories • Results
  • 16. 9/30/21 Heiko Paulheim 16 CaLiGraph Example Category: Musical Groups established in 1987 List of symphonic metal bands Category: Swedish death metal bands List of Swedes in Music
  • 17. 9/30/21 Heiko Paulheim 17 Improving Entity Coverage: Lists in Wikipedia • Only existing pages have categories – Lists may also link to non-existing pages
  • 18. 9/30/21 Heiko Paulheim 18 Pushing Entity Coverage Further • Beyond red links (2020) • Beyond explicit lists (2021)
  • 19. 9/30/21 Heiko Paulheim 19 Entity Extraction from List Pages • Red and grey links – Red links point to entities that do not exist – “Grey links” • are actually not links • i.e., entities to be discovered
  • 20. 9/30/21 Heiko Paulheim 20 Entity Extraction from List Pages • Lists form (shallow) hierarchies
  • 21. 9/30/21 Heiko Paulheim 21 Entity Extraction from List Pages • Idea: align with category graph • Equivalence: – “List of Japanese Writers” ↔ Category:Japanese Writers • Subsumption: – “List of Japanese Speculative Fiction Writers” → Category:Japanese Writers
  • 22. 9/30/21 Heiko Paulheim 22 Classifying Red Links • Not all entities on a list page belong to the same category • Idea: – Learn classifier to tell subject entities from non-subject entities • Distant learning approach – Positive examples: • Entities that are in the corresponding category – Negative examples • Entities that are in a category which is disjoint • e.g., Book <> Writer
  • 23. 9/30/21 Heiko Paulheim 23 Classifying Red Links • Using a mix of features – Page layout, position of entities, statistical, linguistics, …
  • 24. 9/30/21 Heiko Paulheim 24 Classifying Red Links • Enumerations work slightly better than tables • Unevenly balanced – >70% place, species, and person
  • 25. 9/30/21 Heiko Paulheim 25 Beyond List Pages • Many pages contain list-like constructs • Usually – small – same type – same relation to page entity – more grey links …
  • 26. 9/30/21 Heiko Paulheim 26 Beyond List Pages
  • 27. 9/30/21 Heiko Paulheim 27 Beyond List Pages • Learning descriptive rules for listings, e.g. – topSection(“Discography”) → artist.{>PageEntity<} – Learning across pages to mitigate small data problems • Metrics: – Support: no. of listings covered by rule antecedent – Confidence: frequency of rule consequent over all covered listings – Consistency: mean absolute deviation of overall confidence and listing confidence • i.e., does the rule work equally well across all covered listings
  • 28. 9/30/21 Heiko Paulheim 28 Beyond List Pages • Entity detection: – Specialize SpaCy tagger with entities on Wikipedia list pages – Use SpaCy tags for filtering (e.g., PER for Person etc.) • Based on majority vote per class – tag fit (i.e., proportion of “fitting” tags for class axioms) used for thresholding
  • 29. 9/30/21 Heiko Paulheim 29 Beyond List Pages • We can learn – ~5M rules for types – ~3k rules for relations • Identify ~2M new entities – incl. type and relations within KG • Post hoc inspection of axioms: – Accuracy >90% Person 34% Species 24% Place 11% Work 6% Other 25%
  • 30. 9/30/21 Heiko Paulheim 30 CaLiGraph at a Glance • Latest version 2.1 – 15M entities • incl. 8M from listings – Caveat: • disambiguation!
  • 31. 9/30/21 Heiko Paulheim 31 Entity Disambiguation • Examples: Wikipedia pages of Die Krupps and Eisbrecher ?
  • 32. 9/30/21 Heiko Paulheim 32 CaLiGraph Glitches
  • 33. 9/30/21 Heiko Paulheim 33 CaLiGraph Challenges & Open Issues • Entity disambiguation • Usage of formal ontologies (e.g., country is functional) • Extracting information directly from the context Achim_Färber Drums instrument
  • 34. 9/30/21 Heiko Paulheim 34 Knowledge Graph Creation Beyond Wikipedia
  • 35. 9/30/21 Heiko Paulheim 35 A Bird’s Eye View on DBpedia EF • DBpedia Extraction Framework • Input: – A Wikipedia Dump (+ mappings) • Output: – DBpedia DBpedia Extraction Framework
  • 36. 9/30/21 Heiko Paulheim 36 A Satellite View on DBpedia EF • DBpedia Extraction Framework • Input: – A Media Wiki Dump (+ mappings) • Output: – A Knowledge Graph DBpedia Extraction Framework
  • 37. 9/30/21 Heiko Paulheim 37 What if…? • What if we went from Wikipedia every MediaWiki? • According to WikiApiary, there’s thousands...
  • 38. 9/30/21 Heiko Paulheim 38 Why? • More is better (maybe)
  • 39. 9/30/21 Heiko Paulheim 39 Why? • Overcoming Wikipedia’s coverage bias
  • 40. 9/30/21 Heiko Paulheim 40 A Brief History of DBkWik • Started as a student project in 2017 • Task: run DBpedia EF on a large Wiki Farm – ...and see what happens
  • 41. 9/30/21 Heiko Paulheim 41 DBkWik vs. DBpedia • Challenges – Getting dumps: only a fraction of Fandom Wikis has dumps – Downloadable from Fandom: 12,840 dumps – Tried: auto-requesting dumps
  • 42. 9/30/21 Heiko Paulheim 42 Obtaining Dumps • We had to change our strategy: WikiTeam software – Produces dumps by crawling Wikis – Fandom has not blocked us so far :-) – Current collection: >300k Wikis → will go into DBkWik 1.2 release
  • 43. 9/30/21 Heiko Paulheim 43 DBkWik vs. DBpedia • Mappings do not exist – no central ontology – i.e., only raw extraction possible • Duplicates exist – origin: pages about the same entity in different Wikis – unlike Wikipedia: often not explicitly linked • Different configurations of MediaWiki
  • 44. 9/30/21 Heiko Paulheim 44 Absence of Mappings and Ontology • Every infobox becomes a class: {infobox actor → mywiki:actor a owl:Class • Every infobox key becomes a property |role = Harry’s mother → mywiki:role a rdf:Property • The resulting ontology is very shallow – No class hierarchy – No distinction of object and data properties – No domains and ranges
  • 45. 9/30/21 Heiko Paulheim 45 Duplicates • Collecting Data from a Multitude of Wikis
  • 46. 9/30/21 Heiko Paulheim 46 Representational Variety • No conventions across Wikis (besides using MediaWiki syntax) {{Person |name = Trent Reznor |image = TrentReznor.jpg |caption - Reznor at the [[83rd Academy Awards]] |nominations = 1 |wins = 1 |role = Composer |birthdate = May 17, 1965 |birthloc = Mercer, Pennsylvania, USA}} {{Infobox musician | Name = Trent Reznor | Birth_name = Michael Trent Reznor | Born = May 17, [[1965]] (age 53) | Origin = [[Mercer]], [[Pennsylvania]], [[United States]] ... }} {{Infobox cast |Name=Trent Reznor |Image= |ImageCaption= |character= |crew= |Born={{d|May|17|1965}}{{-}}New Castle, Pennsylvania, United States ... }
  • 47. 9/30/21 Heiko Paulheim 47 Data Fusion
  • 48. 9/30/21 Heiko Paulheim 48 Naive Data Fusion and Linking to DBpedia • String similarity for schema matching (classes/properties) • doc2vec similarity on original pages for instance matching • Results – Classes and properties work OK – Instances are trickier – Internal linking seems easier F1 score... Internal Linking Linking to DBpedia Classes .979 .898 Properties .836 .865 Instances .879 .657 maybe...
  • 49. 9/30/21 Heiko Paulheim 49 Improving Linking and Fusion • Started a new track at OAEI in 2018 – annual benchmark for matching tools • From 2019, some tools starting beating the baseline – albeit by a small margin only
  • 50. 9/30/21 Heiko Paulheim 50 The Golden Hammer Bias • Challenge: – OAEI tools expect two related KGs – but: 300k KGs can only be matched without manual pre-inspection See: ESWC 2020 Paper on OAEI Knowledge Graph Track
  • 51. 9/30/21 Heiko Paulheim 51 Big Picture Dump Downloader DBpedia Extraction Framework Interlinking Instance Matcher Schema Matcher MediaWiki Dumps Extracted RDF Internal Linking Instance Matcher Schema Matcher Consolidated Knowledge Graph DBkWik Linked Data Endpoint Ontology Knowledge Graph Fusion Instance Matcher Domain/ Range Type SDType Light Subclass Materialization
  • 52. 9/30/21 Heiko Paulheim 52 DBkWik 1.1 • Source: ~15k Wiki dumps from Fandom – 52.4GB of data (roughly the size of the English Wikipedia) Raw Final Instances 14,212,535 11,163,719 Typed instances 1,880,189 1,372,971 Triples 107,833,322 91,526,001 Avg. indegree 0.624 0.703 Avg. outdegree 7.506 8.169 Classes 71,580 12,029 Properties 506,487 128,566
  • 53. 9/30/21 Heiko Paulheim 53 Towards DBkWik 1.2 • Again, we have an entity resolution problem – with entities from 300k sources • Strategies • (i) pairwise (O(n²)) • (ii-iv) transitive pairs • (iii-v) incremental merge • Ordering by • smallest/largest first • source similarity A B C D A B C D A B C AB D ABC A B C D AB CD (i) (ii) (v) (vi) A B C D (iii) Incremental Merge Transitive Pairs A C B D (iv) Similarity based Order based
  • 54. 9/30/21 Heiko Paulheim 54 Towards DBkWik 1.2 • Preliminary results – Incremental merge works best (quality close to pairwise) • Main challenge: runtime – One match&merge step takes ~20 minutes – i.e., 300k steps are more than 11 years! • Idea: parallelization – best case: tree height of 18 → best runtime (fully parallel): six hours A B C AB D ABC E ABCD F ABCDE G ABCDEF H ABCDEFG A B C D AB CD E F G H EF GH ABCD EFGH
  • 55. 9/30/21 Heiko Paulheim 55 Summary • DBpedia and YAGO – one source (Wikipedia, multiple languages) – one entity per page paradigm • CaLiGraph – one source (Wikipedia, multiple languages would be possible) – extraction from list-like constructs • furtherpossible extension: list-like constructs outside of Wikipedia – current open challenge: entity resolution • DBkWik – extraction from thousands of Wikis – current open challenge: entity resolution • in particular: scalability!
  • 56. 9/30/21 Heiko Paulheim 56 Further Open Challenges • More detailed profiling of knowledge graphs – e.g., do we reduce or increase bias? – and: is that good or bad? • Task-based downstream evaluations – Does it improve, e.g., recommender systems? • Fusion policies – schema level, e.g., many shallow ontologies → one deep ontology? – instance level, e.g., identify outdated information
  • 57. 9/30/21 Heiko Paulheim 57 Contributors • Contributors (past&present) Sven Hertling Alexandra Hofmann Samresh Perchani Jan Portisch Nicolas Heist
  • 58. 9/30/21 Heiko Paulheim 58 From Wikis to Knowledge Graphs: Approaches and Challenges beyond DBpedia and YAGO Heiko Paulheim