SlideShare ist ein Scribd-Unternehmen logo
1 von 166
Downloaden Sie, um offline zu lesen
9/26/18 Heiko Paulheim 1
Machine Learning
with and from
Semantic Web Knowledge Graphs
Heiko Paulheim
9/26/18 Heiko Paulheim 2
Introduction
• Heiko Paulheim
• Professor for Data Science
University of Mannheim, Germany
• Research on
– Knowledge Graphs
– Linked Data
– Machine Learning
– …
• Participant at Reasoning Web Summer School
as a PhD student in 2010 :-)
9/26/18 Heiko Paulheim 3
Outline
• Knowledge Graphs
– brief introduction
– overview of existing ones
– recent developments
– comparisons
• Using Knowledge Graphs in Machine Learning
– approaches
– applications
• Using Machine Learning for Knowledge Graphs
– quality of knowledge graphs
– internal and external approaches
9/26/18 Heiko Paulheim 4
Introduction
• You’ve seen this, haven’t you?
Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae,
Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
9/26/18 Heiko Paulheim 5
Introduction
• Knowledge Graphs on the Web
• Everybody talks about them, but what is a Knowledge Graph?
– I don’t have a definition either...
Journal Paper Review, (Natasha Noy, Google, June 2015):
“Please define what a knowledge graph is – and what it is not.”
9/26/18 Heiko Paulheim 6
Introduction
• Knowledge Graph definitions
• Many people talk about KGs, few give definitions
• Working definition: a Knowledge Graph
– mainly describes instances and their relations in a graph
●
Unlike an ontology
●
Unlike, e.g., WordNet
– Defines possible classes and relations in a schema or ontology
●
Unlike schema-free output of some IE tools
– Allows for interlinking arbitrary entities with each other
●
Unlike a relational database
– Covers various domains
●
Unlike, e.g., Geonames
9/26/18 Heiko Paulheim 7
Introduction
Dagstuhl Seminar “Knowledge Graphs: New Directions for Knowledge
Representation on the Semantic Web”, Sep 09-14, 2018
9/26/18 Heiko Paulheim 8
Introduction
• Knowledge Graphs out there (not guaranteed to be complete)
public
private
Paulheim: Knowledge graph refinement: A survey of approaches and evaluation
methods. Semantic Web 8:3 (2017), pp. 489-508
9/26/18 Heiko Paulheim 9
Motivation
• At ISWC and similar conferences, you see a couple of works...
– that use DBpedia for doing X
– that use Wikidata for doing Y
– ...
• If you see them, do you ever ask yourselves:
– Why DBpedia and not Wikidata?
(or the other way round?)
9/26/18 Heiko Paulheim 10
Motivation
• Questions:
– which knowledge graph should I use for which purpose?
– are there significant differences?
– do they come with strengths and weaknesses?
– would it help to combine them?
• For answering those questions, we need to
– make empirical statements about knowledge graphs
– define measures
– define setups in which to measure them
9/26/18 Heiko Paulheim 11
Knowledge Graph Creation: CyC
• The beginning
– Encyclopedic collection of knowledge
– Started by Douglas Lenat in 1984
– Estimation: 350 person years and 250,000 rules
should do the job
of collecting the essence of the world’s knowledge
• The present (as of June 2017)
– ~1,000 person years, $120M total development cost
– 21M axioms and rules
– Used to exist until 2017
9/26/18 Heiko Paulheim 12
Knowledge Graph Creation
• Lesson learned no. 1:
– Trading efforts against accuracy
Min. efforts Max. accuracy
9/26/18 Heiko Paulheim 13
Knowledge Graph Creation: Freebase
• The 2000s
– Freebase: collaborative editing
– Schema not fixed
• Present
– Acquired by Google in 2010
– Powered first version of Google’s Knowledge Graph
– Shut down in 2016
– Partly lives on in Wikidata (see in a minute)
coming up soon:
was it a good deal or not?
9/26/18 Heiko Paulheim 14
Knowledge Graph Creation: Freebase
• Community based
• Like Wikipedia,
but more structured
9/26/18 Heiko Paulheim 15
Knowledge Graph Creation
• Lesson learned no. 2:
– Trading formality against number of users
Max. user involvement Max. degree of formality
9/26/18 Heiko Paulheim 16
Knowledge Graph Creation: Wikidata
• The 2010s
– Wikidata: launched 2012
– Goal: centralize data from Wikipedia languages
– Collaborative
– Imports other datasets
• Present
– One of the largest public knowledge graphs
(see later)
– Includes rich provenance
9/26/18 Heiko Paulheim 17
Knowledge Graph Creation: Wikidata
• Collaborative
editing
9/26/18 Heiko Paulheim 18
Knowledge Graph Creation: Wikidata
• Provenance
9/26/18 Heiko Paulheim 19
Knowledge Graph Creation
• Lesson learned no. 3:
– There is not one truth (but allowing for plurality adds complexity)
Max. simplicity Max. support for plurality
9/26/18 Heiko Paulheim 20
Knowledge Graph Creation: DBpedia & YAGO
• The 2010s
– DBpedia: launched 2007
– YAGO: launched 2008
– Extraction from Wikipedia
using mappings & heuristics
• Present
– Two of the most used knowledge graphs
– ...with Wikidata catching up
9/26/18 Heiko Paulheim 21
DBpedia
9/26/18 Heiko Paulheim 22
YAGO
9/26/18 Heiko Paulheim 23
Knowledge Graph Creation
• Lesson learned no. 4:
– Heuristics help increasing coverage (at the cost of accuracy)
Max. accuracy Max. coverage
9/26/18 Heiko Paulheim 24
Knowledge Graph Creation: NELL
• The 2010s
– NELL: Never ending language learner
– Input: ontology, seed examples, text corpus
– Output: facts, text patterns
– Large degree of automation, occasional human feedback
• Today
– Still running
– New release every few days
9/26/18 Heiko Paulheim 25
Knowledge Graph Creation: NELL
• Extraction of a Knowledge Graph from a Text Corpus
Nine Inch Nails
singer Trent Reznor,
born
1965
...as stated by Filter
singer Richard
Patrick
...says Slipknot
singer Corey Taylor,
44, in the interview.
“X singer Y”
→ band_member(X,Y)
band_member(Nine_Inch_Nails, Trent_Reznor)
band_member(Filter,Richard_Patrick)
band_member(Slipknot,Corey_Taylor)
patterns
facts
9/26/18 Heiko Paulheim 26
Knowledge Graph Creation
• Lesson learned no. 5:
– Quality cannot be maximized without human intervention
Min. human intervention Max. accuracy
9/26/18 Heiko Paulheim 27
Summary of Trade Offs
• (Manual) effort vs. accuracy and completeness
• User involvement (or usability) vs. degree of formality
• Simplicity vs. support for plurality and provenance
→ all those decisions influence the shape of a knowledge graph!
9/26/18 Heiko Paulheim 28
Non-Public Knowledge Graphs
• Many companies have their
own private knowledge graphs
– Google: Knowledge Graph,
Knowledge Vault
– Yahoo!: Knowledge Graph
– Microsoft: Satori
– Facebook: Entities Graph
– Thomson Reuters: permid.org
(partly public)
• However, we usually know only little about them
9/26/18 Heiko Paulheim 29
Comparison of Knowledge Graphs
• Release cycles
Instant updates:
DBpedia live,
Freebase
Wikidata
Days:
NELL
Months:
DBpedia
Years:
YAGO
Cyc
• Size and density
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
Caution!
9/26/18 Heiko Paulheim 30
Comparison of Knowledge Graphs
• What do they actually contain?
• Experiment: pick 25 classes of interest
– And find them in respective ontologies
• Count instances (coverage)
• Determine in and out degree (level of detail)
9/26/18 Heiko Paulheim 31
Comparison of Knowledge Graphs
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
9/26/18 Heiko Paulheim 32
Comparison of Knowledge Graphs
• Summary findings:
– Persons: more in Wikidata
(twice as many persons as DBpedia and YAGO)
– Countries: more details in Wikidata
– Places: most in DBpedia
– Organizations: most in YAGO
– Events: most in YAGO
– Artistic works:
●
Wikidata contains more movies and albums
●
YAGO contains more songs
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
9/26/18 Heiko Paulheim 33
Caveats
• Reading the diagrams right…
• So, Wikidata contains more persons
– but less instances of all the interesting subclasses?
• There are classes like Actor in Wikidata
– but they are hardly used
– rather: modeled using profession relation
9/26/18 Heiko Paulheim 34
Caveats
• Reading the diagrams right… (ctd.)
• So, Wikidata contains more data on countries, but less countries?
• First: Wikidata only counts current, actual countries
– DBpedia and YAGO also count historical countries
• “KG1 contains less of X than KG2” can mean
– it actually contains less instances of X
– it contains equally many or more instances,
but they are not typed with X (see later)
• Second: we count single facts about countries
– Wikidata records some time indexed information, e.g., population
– Each point in time contributes a fact
9/26/18 Heiko Paulheim 35
Overlap of Knowledge Graphs
• How largely do knowledge graphs overlap?
• They are interlinked, so we can simply count links
– For NELL, we use links to Wikipedia as a proxy
DBpedia
YAGO
Wikidata
NELL Open
Cyc
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
9/26/18 Heiko Paulheim 36
Overlap of Knowledge Graphs
• How largely do knowledge graphs overlap?
• They are interlinked, so we can simply count links
– For NELL, we use links to Wikipedia as a proxy
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
9/26/18 Heiko Paulheim 37
Overlap of Knowledge Graphs
• Links between Knowledge Graphs are incomplete
– The Open World Assumption also holds for interlinks
• But we can estimate their number
• Approach:
– find link set automatically with different heuristics
– determine precision and recall on existing interlinks
– estimate actual number of links
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
9/26/18 Heiko Paulheim 38
Overlap of Knowledge Graphs
• Idea:
– Given that the link set F is found
– And the (unknown) actual link set would be C
• Precision P: Fraction of F which is actually correct
– i.e., measures how much |F| is over-estimating |C|
• Recall R: Fraction of C which is contained in F
– i.e., measures how much |F| is under-estimating |C|
• From that, we estimate
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
9/26/18 Heiko Paulheim 39
Overlap of Knowledge Graphs
• Mathematical derivation:
– Definition of recall:
– Definition of precision:
• Resolve both to , substitute, and resolve to
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
9/26/18 Heiko Paulheim 40
Overlap of Knowledge Graphs
• Experiment:
– We use the same 25 classes as before
– Measure 1: overlap relative to smaller KG (i.e., potential gain)
– Measure 2: overlap relative to explicit links
(i.e., importance of improving links)
• Link generation with 16 different metrics and thresholds
– Intra-class correlation coefficient for |C|: 0.969
– Intra-class correlation coefficient for |F|: 0.646
• Bottom line:
– Despite variety in link sets generated, the overlap is estimated reliably
– The link generation mechanisms do not need to be overly accurate
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
9/26/18 Heiko Paulheim 41
Overlap of Knowledge Graphs
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
9/26/18 Heiko Paulheim 42
Overlap of Knowledge Graphs
• Summary findings:
– DBpedia and YAGO cover roughly the same instances
(not much surprising)
– NELL is the most complementary to the others
– Existing interlinks are insufficient for out-of-the-box parallel usage
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
9/26/18 Heiko Paulheim 43
Intermezzo: Knowledge Graph Creation Cost
• There are quite a few metrics for evaluating KGs
– size, degree, interlinking, quality, licensing, ...
Färber et al.: Linked data quality of
DBpedia, Freebase, OpenCyc,
Wikidata, and YAGO SWJ 9(1), 2018
Zaveri et al.: Quality Assessment for
Linked Open Data: A Survey. SWJ 7(1),
2016
9/26/18 Heiko Paulheim 44
Intermezzo: Knowledge Graph Creation Cost
• ...but what is the cost of a single statement?
• Some back of the envelope calculations...
9/26/18 Heiko Paulheim 45
Intermezzo: Knowledge Graph Creation Cost
• Case 1: manual curation
– Cyc: created by experts
Total development cost: $120M
Total #statements: 21M
→ $5.71 per statement
– Freebase: created by laymen
Assumption: adding a statement to Freebase
equals adding a sentence to Wikipedia
●
English Wikipedia up to April 2011: 41M working hours
(Geiger and Halfaker, 2013),
size in April 2011: 3.6M pages, avg. 36.4 sentences each
●
Using US minimum wage: $2.25 per sentence
→ $2.25 per statement
(Footnote: total cost of creating Freebase would be $6.75B)
acquisition by Google
estimated as $60-300M
9/26/18 Heiko Paulheim 46
Intermezzo: Knowledge Graph Creation Cost
• Case 2: automatic/heuristic creation
– DBpedia: 4.9M LOC, 2.2M LOC for mappings
software project development: ~37 LOC per hour
(Devanbu et al., 1996)
we use German PhD salaries as a cost estimate
→ 1.85c per statement
– YAGO: made from 1.6M LOC
uses WordNet: 117k synsets, we treat each synset like a Wiki page
→ 0.83c per statement
– NELL: 103k LOC
→ 14.25c per statement
• Compared to manual curation: saving factor 16-250
9/26/18 Heiko Paulheim 47
Intermezzo: Knowledge Graph Creation Cost
• Graph error rate against cost
– we can pay for accuracy
– NELL is a bit of an outlier
9/26/18 Heiko Paulheim 48
New Kids on the Block
Subjective age:
Measured by the fraction
of the audience
that understands a reference
to your young days’
pop culture...
9/26/18 Heiko Paulheim 49
WebDataCommons Microdata Corpus
• Web pages can use different markup
– Microdata
– Microformats
– RDFa
• Microdata with schema.org is promoted
by major search engines (Google, Bing, Yahoo!, Yandex)
→ direct business incentive of using markup
→ quick and wide adoption during the last years
9/26/18 Heiko Paulheim 50
WebDataCommons Microdata Corpus
• Structured markup on the Web can be extracted to RDF
– See W3C Interest Group Note: Microdata to RDF [1]
[1] http://www.w3.org/TR/microdata-rdf/
<div itemscope
itemtype="http://schema.org/PostalAddress">
<span itemprop="name">Data and Web Science Group</span>
<span itemprop="addressLocality">Mannheim</span>,
<span itemprop="postalCode">68131</span>
<span itemprop="addressCountry">Germany</span>
</div>
_:1 a <http://schema.org/PostalAddress> .
_:1 <http://schema.org/name> "Data and Web Science Group" .
_:1 <http://schema.org/addressLocality> "Mannheim" .
_:1 <http://schema.org/postalCode> "68131" .
_:1 <http://schema.org/adressCounty> "Germany" .
9/26/18 Heiko Paulheim 51
WebDataCommons Microdata Corpus
• Main topics covered in the corpus
– Meta information on web page content (web page, blog...)
– Business data (products, offers, …)
– Contact data (businesses, persons, ...)
– (Product) reviews and ratings
• ...and a massive long tail
9/26/18 Heiko Paulheim 52
WebDataCommons Microdata Corpus
• Is the Microdata corpus a knowledge graph?
• Recap Working definition: a Knowledge Graph
– mainly describes instances and their relations in a graph
– defines possible classes and relations in a schema or ontology
– allows for interlinking arbitrary entities with each other
– covers various domains
• Yes, but...
9/26/18 Heiko Paulheim 53
Further Sources of Knowledge in Wikipedia
• show: list pages, categories, tables, ...
9/26/18 Heiko Paulheim 54
CaLiGraph Idea
• Entities co-occur in surface patterns
– e.g., enumerations, table columns, …
• Co-occurring entities share semantic patterns
– e.g., types, relations, attribute values
• Existing entities co-occur with new entities
9/26/18 Heiko Paulheim 55
CaLiGraph Idea
• Surface patterns and semantic patterns
also exist outside of Wikipedia
9/26/18 Heiko Paulheim 56
New Kids on the Block
• Wikipedia-based Knowledge Graphs will remain
an essential building block of Semantic Web applications
• But they suffer from...
– ...a coverage bias
– ...limitations of the creating heuristics
9/26/18 Heiko Paulheim 57
Wikipedia’s Coverage Bias
• One (but not the only!) possible source of coverage bias
– Articles about long-tail entities become deleted
9/26/18 Heiko Paulheim 58
Work in Progress: DBkWik
• Why stop at Wikipedia?
• Wikipedia is based on the MediaWiki software
– ...and so are thousands of Wikis
– Fandom by Wikia: >385,000 Wikis on special topics
– WikiApiary: reports >20,000 installations of MediaWiki on the Web
9/26/18 Heiko Paulheim 59
Work in Progress: DBkWik
• Collecting Data from a Multitude of Wikis
9/26/18 Heiko Paulheim 60
Work in Progress: DBkWik
• The DBpedia Extraction Framework consumes MediaWiki dumps
• Experiment
– Can we process dumps from arbitrary Wikis with it?
– Are the results somewhat meaningful?
9/26/18 Heiko Paulheim 61
Work in Progress: DBkWik
• Example from Harry Potter Wiki
http://dbkwik.webdatacommons.org/
9/26/18 Heiko Paulheim 62
Work in Progress: DBkWik
• Differences to DBpedia
– DBpedia has manually created mappings to an ontology
– Wikipedia has one page per subject
– Wikipedia has global infobox conventions (more or less)
• Challenges
– On-the-fly ontology creation
– Instance matching
– Schema matching
Hertling & Paulheim: DBkWik: A Consolidated Knowledge Graph from
Thousands of Wikis. ICBK 2018
9/26/18 Heiko Paulheim 63
Work in Progress: DBkWik
Dump
Downloader
DBpedia
Extraction
Framework
Interlinking
Instance
Matcher
Schema
Matcher
MediaWiki
Dumps
Extracted
RDF
Internal Linking
Instance
Matcher
Schema
Matcher
Consolidated
Knowledge Graph
DBkWik
Linked
Data
Endpoi
nt
Ontology
Knowledge
Graph
Fusion
Instance
Matcher
Domain/
Range
Type
SDType
Light
SubclassMaterialization
• Avoiding O(n²) internal linking:
– Match to DBpedia first
– Use common links to DBpedia as blocking keys for internal matching
Hertling & Paulheim: DBkWik: A Consolidated Knowledge Graph from
Thousands of Wikis. ICBK 2018
9/26/18 Heiko Paulheim 64
Work in Progress: DBkWik
• Downloaded ~15k Wiki dumps from Fandom
– 52.4GB of data, roughly the size of the English Wikipedia
• Prototype: extracted data for ~250 Wikis
– 4.3M instances, ~750k linked to DBpedia
– 7k classes, ~1k linked to DBpedia
– 43k properties, ~20k linked to DBpedia
– ...including duplicates!
• Link quality
– Good for classes, OK for properties (F1 of .957 and .852)
– Needs improvement for instances (F1 of .641)
9/26/18 Heiko Paulheim 65
Work in Progress: WebIsALOD
• Background: Web table interpretation
• Most approaches need typing information
– DBpedia etc. have too little coverage
on the long tail
– Wanted: extensive type database
Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted
from the Web as Linked Open Data. ISWC 2017
9/26/18 Heiko Paulheim 66
Work in Progress: WebIsALOD
• Extraction of type information using Hearst-like patterns, e.g.,
– T, such as X
– X, Y, and other T
• Text corpus: common crawl
– ~2 TB crawled web pages
– Fast implementation: regex over text
– “Expensive” operations only applied once regex has fired
• Resulting database
– 400M hypernymy relations
Seitner et al.: A large DataBase of hypernymy relations extracted from the Web.
LREC 2016
9/26/18 Heiko Paulheim 67
Work in Progress: WebIsALOD
• Example:
http://webisa.webdatacommons.org/
9/26/18 Heiko Paulheim 68
Work in Progress: WebIsALOD
• Initial effort: transformation to a LOD dataset
– including rich provenance information
Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted
from the Web as Linked Open Data. ISWC 2017
9/26/18 Heiko Paulheim 69
Work in Progress: WebIsALOD
• Estimated contents breakdown
Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted
from the Web as Linked Open Data. ISWC 2017
9/26/18 Heiko Paulheim 70
Work in Progress: WebIsALOD
• Main challenge
– Original dataset is quite noisy (<10% correct statements)
– Recap: coverage vs. accuracy
– Simple thresholding removes too much knowledge
• Approach
– Train RandomForest model for predicting correct vs. wrong statements
– Using all the provenance information we have
– Use model to compute confidence scores
Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted
from the Web as Linked Open Data. ISWC 2017
9/26/18 Heiko Paulheim 71
Work in Progress: WebIsALOD
• Current challenges and works in progress
– Distinguishing instances and classes
●
i.e.: subclass vs. instance of relations
– Splitting instances
●
Bauhaus is a goth band
●
Bauhaus is a German school
– Knowledge extraction from pre and post modifiers
●
Bauhaus is a goth band → genre(Bauhaus, Goth)
●
Bauhaus is a German school → location(Bauhaus, Germany)
Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted
from the Web as Linked Open Data. ISWC 2017
9/26/18 Heiko Paulheim 72
Brief Interlude: Machine Learning Basics
• An essential ingredient to intelligent applications
– Dealing with new pieces of knowledge
– Handling unknown situations
– Adapting to users' needs
– Making predictions for the future
• Inductive vs. Deductive Reasoning:
– Deductive: rules + facts → facts
– Inductive: facts → rules
9/26/18 Heiko Paulheim 73
Brief Interlude: Machine Learning Basics
• Example: learning a new concept, e.g., "Tree"
"tree" "tree" "tree"
"not a tree" "not a tree" "not a tree"
9/26/18 Heiko Paulheim 74
Brief Interlude: Machine Learning Basics
• Example: learning a new concept, e.g., "Tree"
– we look at (positive and negative)
examples
– ...and derive a model
• e.g., "Trees are big, green plants"
• Goal: Classification of new instances
"tree?"
Warning:
Models are only
approximating examples!
Not guaranteed to be
correct or complete!
9/26/18 Heiko Paulheim 75
Brief Interlude: Machine Learning Basics
• Typical tasks:
– Classification (binary or multi-label)
– Regression (i.e., predicting numerical values)
– Clustering (finding groups of objects)
– Anomaly Detection (identifying special data points)
– Frequent Pattern Mining
– ...
• Methods:
– Statistical approaches (Naive Bayes, Neural Nets, SVMs, …)
– Symbolic approaches (Rules, Decision Trees, …)
– ...
9/26/18 Heiko Paulheim 76
Background Knowledge for Machine Learning
• Example machine learning task: predicting book sales
ISBN City Sold
3-2347-3427-1 Darmstadt 124
3-43784-324-2 Mannheim 493
3-145-34587-0 Roßdorf 14
...
ISBN City Population ... Genre Publisher ... Sold
3-2347-3427-1 Darm-
stadt
144402 ... Crime Bloody
Books
... 124
3-43784-324-2 Mann-
heim
291458 … Crime Guns Ltd. … 493
3-145-34587-0 Roß-
dorf
12019 ... Travel Up&Away ... 14
...
→ Crime novels sell better in larger cities
We can use Semantic Web Knowledge Graphs for augmenting ML problems
with background knowledge!
9/26/18 Heiko Paulheim 77
The FeGeLOD Framework
IS B N
3 -2 3 4 7 -3 4 2 7 -1
C ity
D a r m s ta d t
# s o ld
1 2 4
N a m e d E n t it y
R e c o g n it io n
IS B N
3 -2 3 4 7 -3 4 2 7 - 1
C ity
D a r m s ta d t
# s o ld
1 2 4
C ity _ U R I
h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t
F e a t u r e
G e n e r a t io n
IS B N
3 -2 3 4 7 -3 4 2 7 -1
C ity
D a r m s ta d t
# s o ld
1 2 4
C ity _ U R I
h ttp : / / d b p e d ia .o r g / r e s o u r c e / D a r m s ta d t
C ity _ U R I_ d b p e d ia -o w l: p o p u la tio n T o ta l
1 4 1 4 7 1
C ity _ U R I_ ...
...
F e a t u r e
S e le c t io n
IS B N
3 -2 3 4 7 -3 4 2 7 - 1
C ity
D a r m s ta d t
# s o ld
1 2 4
C ity _ U R I
h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t
C ity _ U R I_ d b p e d ia -o w l:p o p u la tio n T o ta l
1 4 1 4 7 1
9/26/18 Heiko Paulheim 78
The FeGeLOD Framework
• Entity Recognition
– Simple approach: guess DBpedia URIs
– Hit rate >95% for cities and countries (by English name)
• Feature Generation
– augmenting the dataset with additional attributes from KG
• Feature Selection
– Filter noise: >95% unknown, identical, or different nominals
9/26/18 Heiko Paulheim 79
Propositionalization
• Problem: Knowledge Graphs
vs. ML algorithms expecting Feature Vectors
→ wanted: a transformation from nodes to sets of features
• Basic strategies:
– literal values (e.g., population) are used directly
– instance types become binary vectors
– relations are counted (absolute, relative, TF-IDF)
– combinations of relations and object types are counted
(absolute, relative, TF-IDF)
– ...
9/26/18 Heiko Paulheim 80
Propositionalization
• Advanced strategies: vector space embeddings
– word2vec: takes sentences as input, computes vector for each word
• RDF2vec: create “sentences” by random graph walks
9/26/18 Heiko Paulheim 81
Propositionalization
• RDF2vec example
– similar instances form classes
– direction of relations is stable
9/26/18 Heiko Paulheim 82
The FeGeLOD Prototype
(now: RapidMiner Linked Open Data Extension)
caution: works
only until RM6! :-(
9/26/18 Heiko Paulheim 83
The FeGeLOD Prototype
(now: RapidMiner Linked Open Data Extension)
9/26/18 Heiko Paulheim 84
Yet Another Knowledge Graph...
Presenter
Coffee
Break
Audience
served during
Tea
may need
served during
needs
9/26/18 Heiko Paulheim 85
Recommender Systems
• Content-based recommender systems:
– recommend similar items to user
– similarity can be computed using KGs
9/26/18 Heiko Paulheim 86
Recommender Systems
• Using proximity in latent RDF2vec feature space
9/26/18 Heiko Paulheim 87
Recommender Systems
• Using proximity in latent RDF2vec feature space
• RDF2vec based features constantly outperform
other propositionalization techniques
• DBpedia vectors outperform Wikidata vectors
9/26/18 Heiko Paulheim 88
Incident Detection from Twitter
• Tweets about random things come in via Twitter
– How can we detect incidents from those automatically?
• Obviously, keywords are not enough
– and hand-crafted rules are not practical
fire at #mannheim
#universityomg two cars on
fire #A5 #accident
fire at train station
still burning
my heart
is on fire!!!come on baby
light my fire
boss should fire
that stupid moron
9/26/18 Heiko Paulheim 89
Detecting Incidents from Social Media
• Social media contains data on many incidents
– But keyword search is not enough
– Detecting small incidents is hard
– Manual inspection is too expensive (and slow)
• Machine learning could help
– Train a model to classify incident/non incident tweets
– Apply model for detecting incident related tweets
• Training data:
– Traffic accidents
– ~2,000 tweets containing relevant keywords (“car”, “crash”, etc.),
hand labeled (50% related to traffic incidents)
9/26/18 Heiko Paulheim 90
Detecting Incidents from Social Media
• Learning to classify tweets:
– Positive and negative examples
– Features:
• Stemming
• POS tagging
• Word n-grams
• …
• Accuracy ~90%
• But
– Accuracy drops to ~85% when applying the model to a different city
9/26/18 Heiko Paulheim 91
Brief Interlude: Model Overfitting
• What happens here?
– Model is trained on a sample of labeled data
– Tries to identify the characteristics of that data
• Possible effect:
– Model is too close to training data
9/26/18 Heiko Paulheim 92
Brief Interlude: Model Overfitting
• Extreme example
– Predict credit rating
• Possible (useful) model:
– (job status = employed) && (debts<5000) → rating=positive
Name Net Income Job status Debts Rating
John 40000 employed 0 +
Mary 38000 employed 10000 -
Stephen 21000 self-employed 20000 -
Eric 2000 student 10000 -
Alice 35000 employed 4000 +
9/26/18 Heiko Paulheim 93
Brief Interlude: Model Overfitting
• Extreme example
– Predict credit rating
• Possible overfit models:
– (34000<income<36000) || (income>39000) → rating=positive
– (name=John) || (name=Alice) → rating=positive
Name Net Income Job status Debts Rating
John 40000 employed 0 +
Mary 38000 employed 10000 -
Stephen 21000 self-employed 20000 -
Eric 2000 student 10000 -
Alice 35000 employed 4000 +
9/26/18 Heiko Paulheim 94
Brief Interlude: Model Overfitting
• All three models perfectly describe the training data
– But only one is a useful generalization
• Two goals of machine learning (sometimes contradicting):
– Explain training data as good as possible
– Find a model that is as general as possible
• Strategies for preventing overfitting:
– Cross validation
– Model pruning
– Stopping criteria in model building
– Occam's Razor
– ...
9/26/18 Heiko Paulheim 95
Detecting Incidents from Social Media
• Accuracy ~90%
• But
– Accuracy drops to ~85% when applying the model to a different city
– Model overfitting?
• Example set:
– “Again crash on I90”
– “Accident on I90”
• Model:
– “I90” → indicates traffic accident
• Applying the model:
– “Two cars crashed on I51” → not related to traffic accident
9/26/18 Heiko Paulheim 96
Using KGs for Preventing Overfitting
• Example set:
– “Again crash on I90”
– “Accident on I90”
dbpedia:Interstate_90
dbpedia-owl:Road
rdf:type
dbpedia:Interstate_51
rdf:type
• Model:
– dbpedia-owl:Road → indicates traffic accident
• Applying the model:
– “Two cars crashed on I51” → indicates traffic accident
• Using DBpedia Spotlight + FeGeLOD
– Accuracy keeps up at 90%
– Overfitting is avoided
9/26/18 Heiko Paulheim 97
What's that Noise?
• A slightly different scenario:
– Noise measurements from cities
• Create predictions
– For the rest of the city
– For new infrastructure
• Note:
– No handcrafted rules
– But a fully automatic prediction
●
Based on example
measurements
9/26/18 Heiko Paulheim 98
What's that Noise?
• Additional data comes from
– Linked Geo Data
– Open Street Maps (Streets: types, lanes and speed limits)
– German weather observations (DWD)
9/26/18 Heiko Paulheim 99
What's that Noise?
• Results:
– Machine trained model for noise
– ~80% accuracy in predicting the noise level (six classes)
●
Mean absolute error only 0.077
• This allows to...
– generate a whole noise map from a small set of observations
– play “what if” with hypothetical changes
●
e.g., how do speed limits affect noise levels?
9/26/18 Heiko Paulheim 100
And now for Something Completely Different
• Who are these people?
9/26/18 Heiko Paulheim 101
Statistical Data
• Statistics are very wide spread
– Quality of living in cities
– Corruption by country
– Fertility rate by country
– Suicide rate by country
– Box office revenue of films
– ...
9/26/18 Heiko Paulheim 102
Statistical Data
• Questions we are often interested in
– Why does city X have a high/low quality of living?
– Why is the corruption higher in country A than in country B?
– Will a new film create a high/low box office revenue?
• i.e., we are looking for
– explanations
– forecasts (e.g., extrapolations)
9/26/18 Heiko Paulheim 103
Statistical Data
http://xkcd.com/605/
9/26/18 Heiko Paulheim 104
Statistical Data
• What statistics typically look like
9/26/18 Heiko Paulheim 105
Statistical Data
• There are powerful tools for finding correlations etc.
– but many statistics cannot be interpreted directly
– background knowledge is missing
• So where do we get background knowledge from?
– with as little efforts as possible
9/26/18 Heiko Paulheim 106
Statistical Data
• What we have
9/26/18 Heiko Paulheim 107
Statistical Data
• What we need
9/26/18 Heiko Paulheim 108
Possible Sources for Background Knowledge
9/26/18 Heiko Paulheim 109
...and we've already seen FeGeLOD
IS B N
3 -2 3 4 7 -3 4 2 7 -1
C ity
D a r m s ta d t
# s o ld
1 2 4
N a m e d E n t it y
R e c o g n it io n
IS B N
3 -2 3 4 7 -3 4 2 7 - 1
C ity
D a r m s ta d t
# s o ld
1 2 4
C ity _ U R I
h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t
F e a t u r e
G e n e r a t io n
IS B N
3 -2 3 4 7 -3 4 2 7 -1
C ity
D a r m s ta d t
# s o ld
1 2 4
C ity _ U R I
h ttp : / / d b p e d ia .o r g / r e s o u r c e / D a r m s ta d t
C ity _ U R I_ d b p e d ia -o w l: p o p u la tio n T o ta l
1 4 1 4 7 1
C ity _ U R I_ ...
...
F e a t u r e
S e le c t io n
IS B N
3 -2 3 4 7 -3 4 2 7 - 1
C ity
D a r m s ta d t
# s o ld
1 2 4
C ity _ U R I
h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t
C ity _ U R I_ d b p e d ia -o w l:p o p u la tio n T o ta l
1 4 1 4 7 1
9/26/18 Heiko Paulheim 110
Statistical Data
• Adding background knowledge
– FeGeLOD framework
• Correlation analysis
– e.g., Pearson Correlation Coefficient
• Rule learning
– e.g., Association Rule Mining
– e.g., Subgroup Discovery
• Further data preprocessing
– depending on approach
– e.g., discretization
9/26/18 Heiko Paulheim 111
Prototype Tool: Explain-a-LOD
• Loads a statistics file (e.g., CSV)
• Adds background knowledge
• Presents explanations
9/26/18 Heiko Paulheim 112
Presenting Explanations
• Verbalization with simple patterns
– e.g., negative correlation between population and quality of living
– "A city which has a low population has a high quality of living"
• Color coding
– By correlation coefficient, confidence/support of rules, etc.
9/26/18 Heiko Paulheim 113
Statistical Data: Examples
• Data Set: Mercer Quality of Living
– Quality of living in more than cities word wide
– norm: NYC=100 (value range 23-109)
– As of 1999
– http://across.co.nz/qualityofliving.htm
• Knowledge graphs used in the examples:
– DBpedia
– CIA World Factbook for statistics by country
9/26/18 Heiko Paulheim 114
Statistical Data: Examples
• Examples for low quality cities
– big hot cities (junHighC >= 27 and areaTotalKm >= 334)
– cold cities where no music has ever been recorded
(recordedIn_in = false and janHighC <= 16)
– latitude <= 24 and longitude <= 47
●
a very accurate rule
●
but what's the interpretation?
9/26/18 Heiko Paulheim 115
Statistical Data: Examples
9/26/18 Heiko Paulheim 116
Statistical Data: Examples
• Data Set: Transparency International
– 177 Countries and a corruption perception indicator
(between 1 and 10)
– As of 2010
– http://www.transparency.org/cpi2010/results
9/26/18 Heiko Paulheim 117
Statistical Data: Examples
• Example rules for countries with low corruption
– HDI > 78%
●
Human Development Index, calculated from
live expectancy, education level, economic performance
– OECD member states
– Foundation place of more than nine organizations
– More than ten mountains
– More than ten companies with their headquarter in that state,
but less than two cargo airlines
9/26/18 Heiko Paulheim 118
Statistical Data: Examples
• Data Set: Burnout rates
– 16 German DAX companies
– Absolute and relative numbers
– As of 2011
– http://de.statista.com/statistik/daten/studie/226959/umfrage/burn-out-
erkrankungen-unter-mitarbeitern-ausgewaehlter-dax-unternehmen/
9/26/18 Heiko Paulheim 119
Statistical Data: Examples
• Findings for burnout rates
– Positive correlation between turnover and burnout rates
– Car manufacturers are less prone to burnout
– German companies are less prone to burnout than international ones
●
Exception: Frankfurt
9/26/18 Heiko Paulheim 120
Statistical Data: Examples
• Data Set: Antidepressives consumption
– In European countries
– Source: OECD
– http://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a-
glance-2011/pharmaceutical-consumption_health_glance-2011-39-en
9/26/18 Heiko Paulheim 121
Statistical Data: Examples
• Findings for antidepressives consumption
– Larger countries have higher consumption
– Low HDI → high consumption
– By geography:
●
Nordic countries, countries at the Atlantic: high
●
Mediterranean: medium
●
Alpine countries: low
– High average age → high consupmtion
– High birth rates → high consumption
9/26/18 Heiko Paulheim 122
Statistical Data: Examples
• Data Set: Suicide rates
– By country
– OECD states
– As of 2005
– http://www.washingtonpost.com/wp-srv/world/suiciderate.html
9/26/18 Heiko Paulheim 123
Statistical Data: Examples
• Findings for suicide rates
– Democraties have lower suicide rates than other forms of government
– High HDI → low suicide rate
– High population density → high suicide rate
– By geography:
●
At the sea → low
●
In the mountains → high
– High Gini index → low suicide rate
●
High Gini index ↔ unequal distribution of wealth
– High usage of nuclear power → high suicide rates
9/26/18 Heiko Paulheim 124
Statistical Data: Examples
• Data set: sexual activity
– Percentage of people having sex weekly
– By country
– Survey by Durex 2005-2009
– http://chartsbin.com/view/uya
9/26/18 Heiko Paulheim 125
Statistical Data: Examples
• Findings on sexual activity
– By geography:
●
High in Europe, low in Asia
●
Low in Island states
– By language:
●
English speaking: low
●
French speaking: high
– Low average age → high activity
– High GDP per capita → low activity
– High unemployment rate → high activity
– High number of ISP providers → low activity
9/26/18 Heiko Paulheim 126
Try it... but be careful!
• Download from
http://www.ke.tu-darmstadt.de/resources/explain-a-lod
• including a demo video, papers, etc.
• Pitfalls
– Open world assumption
– LOD may be noisy
– Biases
– ...
9/26/18 Heiko Paulheim 127
Try it... but be careful!
• Download from
http://www.ke.tu-darmstadt.de/resources/explain-a-lod
• including a demo video, papers, etc.
http://xkcd.com/552/
9/26/18 Heiko Paulheim 128
Intermediate Conclusions
• Many tasks require massive background knowledge
– Can be acquired from knowledge graphs
– e.g., FeGeLOD framework, RapidMiner LOD extension
• Machine learning is often useful to...
– make sense using knowledge graphs
– answer non-trivial questions
– Add additional knowledge dimensions (e.g., similarity)
9/26/18 Heiko Paulheim 129
But...
• ...what if the Knowledge Graphs we have are just not good enough?
9/26/18 Heiko Paulheim 130
Problem: Incompleteness
• Example: get a list of science fiction writers in DBpedia
select ?x where
{?x a dbo:Writer .
?x dbo:genre dbr:Science_Fiction}
order by ?x
9/26/18 Heiko Paulheim 131
Problem: Incompleteness
• Results from DBpedia
Arthur C. Clarke?
H.G. Wells?
Isaac Asimov?
9/26/18 Heiko Paulheim 132
Problem: Incompleteness
• What reasons can cause incomplete results?
• Two possible problems:
– The resource at hand is not of type dbo:Writer
– The genre relation to dbr:Science_Fiction is missing
select ?x where
{?x a dbo:Writer .
?x dbo:genre dbr:Science_Fiction}
order by ?x
9/26/18 Heiko Paulheim 133
Common Errors in Knowledge Graphs
• Various works on Knowledge Graph Refinement
– Knowledge Graph completion
– Error detection
• See, e.g., 2017 survey in
Semantic Web Journal
Paulheim: Knowledge Graph Refinement – A Survey
of Approaches and Evaluation Methods. SWJ 8(3), 2017
9/26/18 Heiko Paulheim 134
Common Errors in Knowledge Graphs
• Missing types
– Estimate (2013) for DBpedia: at least 2.6M type statements are
missing
– Using YAGO as “ground truth”
• “Well, we’re semantics folks, we have ontologies!”
– CONSTRUCT {?x a ?t}
WHERE { {?x ?r ?y . ?r rdfs:domain ?t}
UNION {?y ?r ?x . ?r rdfs:range ?t} }
Bizer & Paulheim: Type Inference on Noisy RDF Data. In: ISWC 2013
9/26/18 Heiko Paulheim 135
Common Errors in Knowledge Graphs
• Experiment of RDFS reasoning for typing Germany
• Results:
– Place, PopulatedPlace, Award, MilitaryConflict,
City, Country, EthnicGroup, Genre, Stadium,
Settlement, Language, MontainRange, PersonFunction,
Race, RouteOfTransportation, Building, Mountain,
Airport, WineRegion
• Bottom line: RDFS reasoning accumulates errors
– Germany is the object of 44,433 statements
– 15 single wrong statements can cause those 15 errors
– i.e., an error rate of only 0.03% (that is unlikely to achieve)
Bizer & Paulheim: Type Inference on Noisy RDF Data. In: ISWC 2013
9/26/18 Heiko Paulheim 136
Common Errors in Knowledge Graphs
• Required: a noise-tolerant approach
• SDType (meanwhile included in DBpedia)
– Use statistical distributions of properties and object types
●
P(C|p) → probability of object being of type C when observing
property p in a statement
– Averaging scores for all statements of a resource
– Weighting properties by discriminative power
• Since DBpedia 3.9: typing ~1M untyped resources
at precision >0.95
• Refinement:
– Filtering resources of non-instance pages and list pages
Bizer & Paulheim: Type Inference on Noisy RDF Data. In: ISWC 2013
9/26/18 Heiko Paulheim 137
Common Errors in Knowledge Graphs
• Depiction of characteristic distributions
– example: director
9/26/18 Heiko Paulheim 138
Common Errors in Knowledge Graphs
• Recap
– Trade-off coverage vs. accuracy
0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
2
4
6
8
10
12
# found
types
precision
Lower bound for threshold
Precision
#foundtypes
Bizer & Paulheim: Type Inference on Noisy RDF Data. In: ISWC 2013
9/26/18 Heiko Paulheim 139
Hierarchical Type Prediction
• SLCN: Decomposing the type prediction problem:
– we learn a classifier for each class
– much simpler learning problem, less instances and features required
– different variants for sampling negative instances
Melo, Völker & Paulheim: Type prediction in noisy RDF knowledge bases using
hierarchical multilabel classification with graph and latent features. IJAIT, 2017
9/26/18 Heiko Paulheim 140
Common Errors in Knowledge Graphs
• The same idea applied to identification of noisy statements
– i.e., a statement is implausible if the distribution
of its object’s types deviates from
the overall distribution for the predicate
• Removing ~20,000 erroneous statements from DBpedia
• Error analysis
– Errors in Wikipedia account for ~30%
– Other typical problems: see following slides
Bizer & Paulheim: Improving the quality of linked data using statistical distributions.
In: IJSWIS 10(2), 2014
9/26/18 Heiko Paulheim 141
Common Errors in Knowledge Graphs
• Typical errors
– links in longer texts are not interpreted correctly
– dbr:Carole_Goble dbo:award dbr:Jim_Gray
Paulheim & Gangemi: Serving DBpedia with DOLCE –
More than Just Adding a Cherry on Top. ISWC 2015
9/26/18 Heiko Paulheim 142
Common Errors in Knowledge Graphs
• Typical errors
– Misinterpretation of redirects
– dbr:Ben_Casey dbo:company dbr:Bing_Cosby
Paulheim & Gangemi: Serving DBpedia with DOLCE –
More than Just Adding a Cherry on Top. ISWC 2015
9/26/18 Heiko Paulheim 143
Common Errors in Knowledge Graphs
• Typical errors
– Metonymy
– dbr:Human_Nature_(band) dbo:genre dbr:Motown,
– Links with anchors pointing to subsections in a page
– First_Army_(France)#1944-1945
Paulheim & Gangemi: Serving DBpedia with DOLCE –
More than Just Adding a Cherry on Top. ISWC 2015
9/26/18 Heiko Paulheim 144
Common Errors in Knowledge Graphs
• Identifying individual errors is possible with many techniques
– e.g., statistics, reasoning, exploiting upper ontologies, …
• ...but what do we do with those efforts?
– they typically end up in drawers and abandoned GitHub repositories
Paulheim & Gangemi: Serving DBpedia with DOLCE – More than Just Adding a
Cherry on Top. ISWC 2015
Paulheim: Data-driven Joint Debugging of the DBpedia Mappings and Ontology.
ESWC 2017
9/26/18 Heiko Paulheim 145
Motivation
• Possible option 1: Remove erroneous triples from DBpedia
• Challenges
– May remove correct axioms, may need thresholding
– Needs to be repeated for each release
– Needs to be materialized on all of DBpedia
DBpedia
Extraction
FrameworkWikipedia
DBpedia Mappings Wiki
Post
Filter
9/26/18 Heiko Paulheim 146
Motivation
• Possible option 2: Integrate into DBpedia Extraction Framework
• Challenges
– Development workload
– Some approaches are not fully automated (technically or conceptually)
– Scalability
DBpedia
Extraction
Framework
plus filter
module
Wikipedia
DBpedia Mappings Wiki
DBpedia
Extraction
Framework
9/26/18 Heiko Paulheim 147
Common Errors in Knowledge Graphs
• Goal: a third option
– Find the root of the error and fix it!
Wikipedia
DBpedia Mappings Wiki
DBpedia
Extraction
Framework
Inconsistency
DetectionIdentification
of suspicious
mappings and
ontology
constructs
Paulheim: Data-driven Joint Debugging of the DBpedia Mappings and Ontology. ESWC 2017
9/26/18 Heiko Paulheim 148
Common Errors in Knowledge Graphs
• All the work so far has been focused on individual errors
• Hypothesis: few error patterns with a common root
cause a large fraction of errors
– i.e., finding those patterns boosts the quality of a knowledge graph
• Observation:
– Top 40 patterns constitute 96% of inconsistent statements in DBpedia
Paulheim, Gangemi: Serving DBpedia with DOLCE – More than Just Adding a Cherry on Top. ISWC 2015
9/26/18 Heiko Paulheim 149
Common Errors in Knowledge Graphs
• Case 1: Wrong mapping
• Example:
– branch in infobox military unit
is mapped to dbo:militaryBranch
●
but dbo:militaryBranch
has dbo:Person as its domain
– correction: dbo:commandStructure
– Affects 12,172 statements
(31% of all dbo:militaryBranch)
9/26/18 Heiko Paulheim 150
Common Errors in Knowledge Graphs
• Case 2: Mappings that should be removed
• Example:
– dbo:picture
– Most of the are inconsistent (64.5% places, 23.0% persons)
– Reason: statements are extracted from picture caption
dbr:Brixton_Academy
dbo:picture
dbr:Brixton .
dbr:Justify_My_Love
dbo:picture
dbr:Madonna_(entertainer) .
9/26/18 Heiko Paulheim 151
Common Errors in Knowledge Graphs
• Case 3: Ontology problems (domain/range)
• Example 1:
– Populated places (e.g., cities) are used both as place and organization
– For some properties, the range is either one of the two
●
e.g., dbo:operator
– Polysemy should be reflected in the ontology
• Example 2:
– dbo:architect, dbo:designer, dbo:engineer etc.
have dbo:Person as their range
– Significant fractions (8.6%, 7.6%, 58.4%, resp.)
have a dbo:Organization as object
– Range should be broadened
9/26/18 Heiko Paulheim 152
Common Errors in Knowledge Graphs
• Case 4: Missing properties
• Example 1:
– dbo:president links an organization to its president
– Majority use (8,354, or 76.2%):
link a person to the president s/he served for
• Example 2:
– dbo:instrument links an artist
to the instrument s/he plays
– Prominent alternative use (3,828, or 7.2%):
links a genre to its characteristic instrument
9/26/18 Heiko Paulheim 153
Common Errors in Knowledge Graphs
• Introductory example:
Arthur C. Clarke?
H.G. Wells?
Isaac Asimov?
select ?x where
{?x a dbo:Writer .
?x dbo:genre dbr:Science_Fiction}
order by ?x
9/26/18 Heiko Paulheim 154
Common Errors in Knowledge Graphs
• Incompleteness in relation assertions
• Example: Arthur C. Clarke, Isaac Asimov, ...
– There is no explicit link to Science Fiction in the infobox
– i.e., the statement for
... dbo:genre dbr:Science_Fiction
is not generated
9/26/18 Heiko Paulheim 155
Common Errors in Knowledge Graphs
• Example for recent work (ISWC 2017):
heuristic relation extraction from Wikipedia abstracts
• Idea:
– There are probably certain patterns:
●
e.g., all genres linked in an abstract about a writer
are that writer’s genres
●
e.g., the first place linked in an abstract about a person
is that person’s birthplace
– The types are already in DBpedia
– We can use existing relations as training data
●
Using a local closed world assumption for negative examples
– Learned models can be evaluated and only used at a certain precision
Heist & Paulheim: Language-agnostic relation extraction from Wikipedia Abstracts.
ISWC 2017
9/26/18 Heiko Paulheim 156
Common Errors in Knowledge Graphs
• Results:
– 1M additional assertions can be learned for 100 relations
at 95% precision
• Additional consideration:
– We use only links, types from DBpedia, and positional features
– No language-specific information (e.g., POS tags)
– Thus, we are not restricted to English!
Heist & Paulheim: Language-agnostic relation extraction from Wikipedia Abstracts.
ISWC 2017
9/26/18 Heiko Paulheim 157
Common Errors in Knowledge Graphs
• Cross-lingual experiment:
– Using the 12 largest language editions of Wikipedia
– Exploiting inter-language links
Heist & Paulheim: Language-agnostic relation extraction from Wikipedia Abstracts.
ISWC 2017
9/26/18 Heiko Paulheim 158
Common Errors in Knowledge Graphs
• Analysis
– Is there a relation between the language and the the country
(dbo:country) of the entities for which information is extracted?
Heist & Paulheim: Language-agnostic relation extraction from Wikipedia Abstracts.
ISWC 2017
9/26/18 Heiko Paulheim 159
Common Errors in Knowledge Graphs
• So far, we have looked at relation assertions
• Numerical values can also be problematic…
– Recap: Wikipedia is made for human consumption
• The following are all valid representations of the same height value
(and perfectly understandable by humans)
– 6 ft 6 in, 6ft 6in, 6'6'', 6'6”, 6´6´´, …
– 1.98m, 1,98m, 1m 98, 1m 98cm, 198cm, 198 cm, …
– 6 ft 6 in (198 cm), 6ft 6in (1.98m), 6'6'' (1.98 m), …
– 6 ft 6 in[1]
, 6 ft 6 in [citation needed]
, …
– ...
Wienand & Paulheim: Detecting Incorrect Numerical Data in DBpedia. ESWC 2014
Fleischhacker et al.: Detecting Errors in Numerical Linked Data Using Cross-Checked
Outlier Detection. ISWC 2014
9/26/18 Heiko Paulheim 160
Common Errors in Knowledge Graphs
• Approach: outlier detection
– With preprocessing: finding meaningful subpopulations
– With cross-checking: discarding natural outliers
• Findings: 85%-95% precision possible
– depending on predicate
– Identification of typical parsing problems
Wienand & Paulheim: Detecting Incorrect Numerical Data in DBpedia. ESWC 2014
Fleischhacker et al.: Detecting Errors in Numerical Linked Data Using Cross-Checked
Outlier Detection. ISWC 2014
9/26/18 Heiko Paulheim 161
Common Errors in Knowledge Graphs
• Footprint of a systematic error
Wienand & Paulheim: Detecting Incorrect Numerical Data in DBpedia. ESWC 2014
Fleischhacker et al.: Detecting Errors in Numerical Linked Data Using Cross-Checked
Outlier Detection. ISWC 2014
9/26/18 Heiko Paulheim 162
Common Errors in Knowledge Graphs
• Errors include
– Interpretation of imperial units
– Unusual decimal/thousands separators
– Concatenation (population 28,322,006)
Wienand & Paulheim: Detecting Incorrect Numerical Data in DBpedia. ESWC 2014
Fleischhacker et al.: Detecting Errors in Numerical Linked Data Using Cross-Checked
Outlier Detection. ISWC 2014
9/26/18 Heiko Paulheim 163
Take Aways
• A lot of knowledge graphs are out there
– built with various methods (manual, automatic, ...)
– can be used in many different tasks
– data interpretation, recommender systems, …
• Knowledge graphs are not perfect
– machine learning helps improving them
9/26/18 Heiko Paulheim 164
Hands On
• You have seen many knowledge graphs
• Now contribute to work in progress
– Go to WebIsALOD
– http://webisa.webdatacommons.org/
• Look at, e.g.,
http://webisa.webdatacommons.org/concept/_apple_
• You’ll observe a few issues here:
– wrong statements
– class/instance distinction
– disambiguation (fruit vs. company)
– knowledge implicit in classes
(e.g., silicon valley company)
9/26/18 Heiko Paulheim 165
Hands On
• Pick one (or more) of those issues
• Sketch a possible approach to solve this
– e.g., using machine learning
• Think:
– what does the learning problem look like?
– how could we acquire training data?
– what results do you expect?
– where could your approach fail?
• You’re developing slideware
– i.e., no implemented solutions are expected
– however, please look at actual data to demonstrate your approach
9/26/18 Heiko Paulheim 166
Machine Learning
with and from
Semantic Web Knowledge Graphs
Heiko Paulheim

Weitere ähnliche Inhalte

Was ist angesagt?

Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Heiko Paulheim
 
Machine Learning & Embeddings for Large Knowledge Graphs
Machine Learning & Embeddings  for Large Knowledge GraphsMachine Learning & Embeddings  for Large Knowledge Graphs
Machine Learning & Embeddings for Large Knowledge GraphsHeiko Paulheim
 
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...Heiko Paulheim
 
Type Inference on Noisy RDF Data
Type Inference on Noisy RDF DataType Inference on Noisy RDF Data
Type Inference on Noisy RDF DataHeiko Paulheim
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vecHeiko Paulheim
 
Data-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and OntologyData-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and OntologyHeiko Paulheim
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Heiko Paulheim
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Heiko Paulheim
 
What the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataWhat the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataHeiko Paulheim
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfHeiko Paulheim
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge DiscoveryHeiko Paulheim
 
Linked Data at the German National Library
Linked Data at the German National LibraryLinked Data at the German National Library
Linked Data at the German National LibraryReinhold Heuvelmann
 
Linked data in the German National Library at the OCLC IFLA round table 2013
Linked data in the German National Library at the OCLC IFLA round table 2013Linked data in the German National Library at the OCLC IFLA round table 2013
Linked data in the German National Library at the OCLC IFLA round table 2013Lars G. Svensson
 
Perseverance on persistence by Herbert Van de Sompel - EuropeanaTech Conferen...
Perseverance on persistence by Herbert Van de Sompel - EuropeanaTech Conferen...Perseverance on persistence by Herbert Van de Sompel - EuropeanaTech Conferen...
Perseverance on persistence by Herbert Van de Sompel - EuropeanaTech Conferen...Europeana
 
Researcher Pod: Scholarly Communication Using the Decentralized Web
Researcher Pod: Scholarly Communication Using the Decentralized WebResearcher Pod: Scholarly Communication Using the Decentralized Web
Researcher Pod: Scholarly Communication Using the Decentralized WebHerbert Van de Sompel
 
Adoption and Integration of Persistent Identifiers in European Research Infor...
Adoption and Integration of Persistent Identifiers in European Research Infor...Adoption and Integration of Persistent Identifiers in European Research Infor...
Adoption and Integration of Persistent Identifiers in European Research Infor...LIBER Europe
 
GIS Day 2015: Geoinformatics, Open Source and Videos - a library perspective
GIS Day 2015: Geoinformatics, Open Source and Videos - a library perspectiveGIS Day 2015: Geoinformatics, Open Source and Videos - a library perspective
GIS Day 2015: Geoinformatics, Open Source and Videos - a library perspectivePeter Löwe
 
TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...
TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...
TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...LIBER Europe
 

Was ist angesagt? (19)

Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!
 
Machine Learning & Embeddings for Large Knowledge Graphs
Machine Learning & Embeddings  for Large Knowledge GraphsMachine Learning & Embeddings  for Large Knowledge Graphs
Machine Learning & Embeddings for Large Knowledge Graphs
 
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
 
Type Inference on Noisy RDF Data
Type Inference on Noisy RDF DataType Inference on Noisy RDF Data
Type Inference on Noisy RDF Data
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vec
 
Data-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and OntologyData-driven Joint Debugging of the DBpedia Mappings and Ontology
Data-driven Joint Debugging of the DBpedia Mappings and Ontology
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
 
What the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataWhat the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open Data
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdf
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge Discovery
 
Linked Data at the German National Library
Linked Data at the German National LibraryLinked Data at the German National Library
Linked Data at the German National Library
 
Linked data in the German National Library at the OCLC IFLA round table 2013
Linked data in the German National Library at the OCLC IFLA round table 2013Linked data in the German National Library at the OCLC IFLA round table 2013
Linked data in the German National Library at the OCLC IFLA round table 2013
 
Perseverance on persistence by Herbert Van de Sompel - EuropeanaTech Conferen...
Perseverance on persistence by Herbert Van de Sompel - EuropeanaTech Conferen...Perseverance on persistence by Herbert Van de Sompel - EuropeanaTech Conferen...
Perseverance on persistence by Herbert Van de Sompel - EuropeanaTech Conferen...
 
Perseverance on Persistence
Perseverance on PersistencePerseverance on Persistence
Perseverance on Persistence
 
Researcher Pod: Scholarly Communication Using the Decentralized Web
Researcher Pod: Scholarly Communication Using the Decentralized WebResearcher Pod: Scholarly Communication Using the Decentralized Web
Researcher Pod: Scholarly Communication Using the Decentralized Web
 
Adoption and Integration of Persistent Identifiers in European Research Infor...
Adoption and Integration of Persistent Identifiers in European Research Infor...Adoption and Integration of Persistent Identifiers in European Research Infor...
Adoption and Integration of Persistent Identifiers in European Research Infor...
 
GIS Day 2015: Geoinformatics, Open Source and Videos - a library perspective
GIS Day 2015: Geoinformatics, Open Source and Videos - a library perspectiveGIS Day 2015: Geoinformatics, Open Source and Videos - a library perspective
GIS Day 2015: Geoinformatics, Open Source and Videos - a library perspective
 
TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...
TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...
TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...
 

Ähnlich wie Machine Learning with and for Semantic Web Knowledge Graphs

AMIA 2017 - Data visualisation
AMIA 2017 - Data visualisationAMIA 2017 - Data visualisation
AMIA 2017 - Data visualisationNickRichardson44
 
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...Heiko Paulheim
 
NDF 2017 - Data visualisation presentation
NDF 2017 - Data visualisation presentationNDF 2017 - Data visualisation presentation
NDF 2017 - Data visualisation presentationNickRichardson44
 
Distributed Deep Learning (And How to Get Involved)
Distributed Deep Learning (And How to Get Involved)Distributed Deep Learning (And How to Get Involved)
Distributed Deep Learning (And How to Get Involved)Sina Sheikholeslami
 
The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...
The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...
The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...Open Knowledge Maps
 
Pathways to Technology Transfer and Adoption: Achievements and Challenges
Pathways to Technology Transfer and Adoption: Achievements and ChallengesPathways to Technology Transfer and Adoption: Achievements and Challenges
Pathways to Technology Transfer and Adoption: Achievements and ChallengesTao Xie
 
Understanding the Past; Being Honest about the Present; Planning for the Future
Understanding the Past; Being Honest about the Present; Planning for the FutureUnderstanding the Past; Being Honest about the Present; Planning for the Future
Understanding the Past; Being Honest about the Present; Planning for the Futurelisbk
 
Fast Approximate A-box Consistency Checking using Machine Learning
Fast Approximate  A-box Consistency Checking using Machine LearningFast Approximate  A-box Consistency Checking using Machine Learning
Fast Approximate A-box Consistency Checking using Machine LearningHeiko Paulheim
 
Introduction to information visualisation for humanities PhDs
Introduction to information visualisation for humanities PhDsIntroduction to information visualisation for humanities PhDs
Introduction to information visualisation for humanities PhDsMia
 
Openglam20110917
Openglam20110917Openglam20110917
Openglam20110917okfn
 
Observations on Annotations – From Computational Linguistics and the World Wi...
Observations on Annotations – From Computational Linguistics and the World Wi...Observations on Annotations – From Computational Linguistics and the World Wi...
Observations on Annotations – From Computational Linguistics and the World Wi...Georg Rehm
 
Acquisition of audiovisual Scientific Technical Information from OSGeo: A wor...
Acquisition of audiovisual Scientific Technical Information from OSGeo: A wor...Acquisition of audiovisual Scientific Technical Information from OSGeo: A wor...
Acquisition of audiovisual Scientific Technical Information from OSGeo: A wor...Peter Löwe
 
W4P-Launch - Open Source Crowdsourcing platform
W4P-Launch - Open Source Crowdsourcing platformW4P-Launch - Open Source Crowdsourcing platform
W4P-Launch - Open Source Crowdsourcing platformOpen Knowledge Belgium
 
Osm presentation workshop 19 sept 2018
Osm presentation workshop 19 sept 2018Osm presentation workshop 19 sept 2018
Osm presentation workshop 19 sept 2018osimod
 
AliceVision : pipeline de reconstruction 3D open source
AliceVision : pipeline de reconstruction 3D open sourceAliceVision : pipeline de reconstruction 3D open source
AliceVision : pipeline de reconstruction 3D open sourceOpen Source Experience
 
OpenGLAM CH Hackathons
OpenGLAM CH HackathonsOpenGLAM CH Hackathons
OpenGLAM CH HackathonsBeat Estermann
 
discussion_3_project.pdf
discussion_3_project.pdfdiscussion_3_project.pdf
discussion_3_project.pdfKuan-Tsae Huang
 
OCLC Research update: Active engagement.
OCLC Research update: Active engagement.OCLC Research update: Active engagement.
OCLC Research update: Active engagement.Lynn Connaway
 
Exploiting Linked Open Data as Background Knowledge in Data Mining
Exploiting Linked Open Data as Background Knowledge in Data MiningExploiting Linked Open Data as Background Knowledge in Data Mining
Exploiting Linked Open Data as Background Knowledge in Data MiningHeiko Paulheim
 

Ähnlich wie Machine Learning with and for Semantic Web Knowledge Graphs (20)

AMIA 2017 - Data visualisation
AMIA 2017 - Data visualisationAMIA 2017 - Data visualisation
AMIA 2017 - Data visualisation
 
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
 
NDF 2017 - Data visualisation presentation
NDF 2017 - Data visualisation presentationNDF 2017 - Data visualisation presentation
NDF 2017 - Data visualisation presentation
 
Distributed Deep Learning (And How to Get Involved)
Distributed Deep Learning (And How to Get Involved)Distributed Deep Learning (And How to Get Involved)
Distributed Deep Learning (And How to Get Involved)
 
The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...
The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...
The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...
 
“Building Bridges” Workshop Summary
“Building Bridges” Workshop Summary“Building Bridges” Workshop Summary
“Building Bridges” Workshop Summary
 
Pathways to Technology Transfer and Adoption: Achievements and Challenges
Pathways to Technology Transfer and Adoption: Achievements and ChallengesPathways to Technology Transfer and Adoption: Achievements and Challenges
Pathways to Technology Transfer and Adoption: Achievements and Challenges
 
Understanding the Past; Being Honest about the Present; Planning for the Future
Understanding the Past; Being Honest about the Present; Planning for the FutureUnderstanding the Past; Being Honest about the Present; Planning for the Future
Understanding the Past; Being Honest about the Present; Planning for the Future
 
Fast Approximate A-box Consistency Checking using Machine Learning
Fast Approximate  A-box Consistency Checking using Machine LearningFast Approximate  A-box Consistency Checking using Machine Learning
Fast Approximate A-box Consistency Checking using Machine Learning
 
Introduction to information visualisation for humanities PhDs
Introduction to information visualisation for humanities PhDsIntroduction to information visualisation for humanities PhDs
Introduction to information visualisation for humanities PhDs
 
Openglam20110917
Openglam20110917Openglam20110917
Openglam20110917
 
Observations on Annotations – From Computational Linguistics and the World Wi...
Observations on Annotations – From Computational Linguistics and the World Wi...Observations on Annotations – From Computational Linguistics and the World Wi...
Observations on Annotations – From Computational Linguistics and the World Wi...
 
Acquisition of audiovisual Scientific Technical Information from OSGeo: A wor...
Acquisition of audiovisual Scientific Technical Information from OSGeo: A wor...Acquisition of audiovisual Scientific Technical Information from OSGeo: A wor...
Acquisition of audiovisual Scientific Technical Information from OSGeo: A wor...
 
W4P-Launch - Open Source Crowdsourcing platform
W4P-Launch - Open Source Crowdsourcing platformW4P-Launch - Open Source Crowdsourcing platform
W4P-Launch - Open Source Crowdsourcing platform
 
Osm presentation workshop 19 sept 2018
Osm presentation workshop 19 sept 2018Osm presentation workshop 19 sept 2018
Osm presentation workshop 19 sept 2018
 
AliceVision : pipeline de reconstruction 3D open source
AliceVision : pipeline de reconstruction 3D open sourceAliceVision : pipeline de reconstruction 3D open source
AliceVision : pipeline de reconstruction 3D open source
 
OpenGLAM CH Hackathons
OpenGLAM CH HackathonsOpenGLAM CH Hackathons
OpenGLAM CH Hackathons
 
discussion_3_project.pdf
discussion_3_project.pdfdiscussion_3_project.pdf
discussion_3_project.pdf
 
OCLC Research update: Active engagement.
OCLC Research update: Active engagement.OCLC Research update: Active engagement.
OCLC Research update: Active engagement.
 
Exploiting Linked Open Data as Background Knowledge in Data Mining
Exploiting Linked Open Data as Background Knowledge in Data MiningExploiting Linked Open Data as Background Knowledge in Data Mining
Exploiting Linked Open Data as Background Knowledge in Data Mining
 

Mehr von Heiko Paulheim

Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterHeiko Paulheim
 
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on TopServing DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on TopHeiko Paulheim
 
Combining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionCombining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionHeiko Paulheim
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesHeiko Paulheim
 
Mining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerMining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerHeiko Paulheim
 
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Heiko Paulheim
 
Detecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaDetecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaHeiko Paulheim
 
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier DetectionIdentifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier DetectionHeiko Paulheim
 
Extending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List PagesExtending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List PagesHeiko Paulheim
 

Mehr von Heiko Paulheim (9)

Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on Twitter
 
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on TopServing DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
 
Combining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionCombining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly Detection
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia Entities
 
Mining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerMining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMiner
 
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
 
Detecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaDetecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpedia
 
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier DetectionIdentifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
 
Extending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List PagesExtending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List Pages
 

Kürzlich hochgeladen

Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 

Kürzlich hochgeladen (17)

Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdf
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 

Machine Learning with and for Semantic Web Knowledge Graphs

  • 1. 9/26/18 Heiko Paulheim 1 Machine Learning with and from Semantic Web Knowledge Graphs Heiko Paulheim
  • 2. 9/26/18 Heiko Paulheim 2 Introduction • Heiko Paulheim • Professor for Data Science University of Mannheim, Germany • Research on – Knowledge Graphs – Linked Data – Machine Learning – … • Participant at Reasoning Web Summer School as a PhD student in 2010 :-)
  • 3. 9/26/18 Heiko Paulheim 3 Outline • Knowledge Graphs – brief introduction – overview of existing ones – recent developments – comparisons • Using Knowledge Graphs in Machine Learning – approaches – applications • Using Machine Learning for Knowledge Graphs – quality of knowledge graphs – internal and external approaches
  • 4. 9/26/18 Heiko Paulheim 4 Introduction • You’ve seen this, haven’t you? Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
  • 5. 9/26/18 Heiko Paulheim 5 Introduction • Knowledge Graphs on the Web • Everybody talks about them, but what is a Knowledge Graph? – I don’t have a definition either... Journal Paper Review, (Natasha Noy, Google, June 2015): “Please define what a knowledge graph is – and what it is not.”
  • 6. 9/26/18 Heiko Paulheim 6 Introduction • Knowledge Graph definitions • Many people talk about KGs, few give definitions • Working definition: a Knowledge Graph – mainly describes instances and their relations in a graph ● Unlike an ontology ● Unlike, e.g., WordNet – Defines possible classes and relations in a schema or ontology ● Unlike schema-free output of some IE tools – Allows for interlinking arbitrary entities with each other ● Unlike a relational database – Covers various domains ● Unlike, e.g., Geonames
  • 7. 9/26/18 Heiko Paulheim 7 Introduction Dagstuhl Seminar “Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web”, Sep 09-14, 2018
  • 8. 9/26/18 Heiko Paulheim 8 Introduction • Knowledge Graphs out there (not guaranteed to be complete) public private Paulheim: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8:3 (2017), pp. 489-508
  • 9. 9/26/18 Heiko Paulheim 9 Motivation • At ISWC and similar conferences, you see a couple of works... – that use DBpedia for doing X – that use Wikidata for doing Y – ... • If you see them, do you ever ask yourselves: – Why DBpedia and not Wikidata? (or the other way round?)
  • 10. 9/26/18 Heiko Paulheim 10 Motivation • Questions: – which knowledge graph should I use for which purpose? – are there significant differences? – do they come with strengths and weaknesses? – would it help to combine them? • For answering those questions, we need to – make empirical statements about knowledge graphs – define measures – define setups in which to measure them
  • 11. 9/26/18 Heiko Paulheim 11 Knowledge Graph Creation: CyC • The beginning – Encyclopedic collection of knowledge – Started by Douglas Lenat in 1984 – Estimation: 350 person years and 250,000 rules should do the job of collecting the essence of the world’s knowledge • The present (as of June 2017) – ~1,000 person years, $120M total development cost – 21M axioms and rules – Used to exist until 2017
  • 12. 9/26/18 Heiko Paulheim 12 Knowledge Graph Creation • Lesson learned no. 1: – Trading efforts against accuracy Min. efforts Max. accuracy
  • 13. 9/26/18 Heiko Paulheim 13 Knowledge Graph Creation: Freebase • The 2000s – Freebase: collaborative editing – Schema not fixed • Present – Acquired by Google in 2010 – Powered first version of Google’s Knowledge Graph – Shut down in 2016 – Partly lives on in Wikidata (see in a minute) coming up soon: was it a good deal or not?
  • 14. 9/26/18 Heiko Paulheim 14 Knowledge Graph Creation: Freebase • Community based • Like Wikipedia, but more structured
  • 15. 9/26/18 Heiko Paulheim 15 Knowledge Graph Creation • Lesson learned no. 2: – Trading formality against number of users Max. user involvement Max. degree of formality
  • 16. 9/26/18 Heiko Paulheim 16 Knowledge Graph Creation: Wikidata • The 2010s – Wikidata: launched 2012 – Goal: centralize data from Wikipedia languages – Collaborative – Imports other datasets • Present – One of the largest public knowledge graphs (see later) – Includes rich provenance
  • 17. 9/26/18 Heiko Paulheim 17 Knowledge Graph Creation: Wikidata • Collaborative editing
  • 18. 9/26/18 Heiko Paulheim 18 Knowledge Graph Creation: Wikidata • Provenance
  • 19. 9/26/18 Heiko Paulheim 19 Knowledge Graph Creation • Lesson learned no. 3: – There is not one truth (but allowing for plurality adds complexity) Max. simplicity Max. support for plurality
  • 20. 9/26/18 Heiko Paulheim 20 Knowledge Graph Creation: DBpedia & YAGO • The 2010s – DBpedia: launched 2007 – YAGO: launched 2008 – Extraction from Wikipedia using mappings & heuristics • Present – Two of the most used knowledge graphs – ...with Wikidata catching up
  • 23. 9/26/18 Heiko Paulheim 23 Knowledge Graph Creation • Lesson learned no. 4: – Heuristics help increasing coverage (at the cost of accuracy) Max. accuracy Max. coverage
  • 24. 9/26/18 Heiko Paulheim 24 Knowledge Graph Creation: NELL • The 2010s – NELL: Never ending language learner – Input: ontology, seed examples, text corpus – Output: facts, text patterns – Large degree of automation, occasional human feedback • Today – Still running – New release every few days
  • 25. 9/26/18 Heiko Paulheim 25 Knowledge Graph Creation: NELL • Extraction of a Knowledge Graph from a Text Corpus Nine Inch Nails singer Trent Reznor, born 1965 ...as stated by Filter singer Richard Patrick ...says Slipknot singer Corey Taylor, 44, in the interview. “X singer Y” → band_member(X,Y) band_member(Nine_Inch_Nails, Trent_Reznor) band_member(Filter,Richard_Patrick) band_member(Slipknot,Corey_Taylor) patterns facts
  • 26. 9/26/18 Heiko Paulheim 26 Knowledge Graph Creation • Lesson learned no. 5: – Quality cannot be maximized without human intervention Min. human intervention Max. accuracy
  • 27. 9/26/18 Heiko Paulheim 27 Summary of Trade Offs • (Manual) effort vs. accuracy and completeness • User involvement (or usability) vs. degree of formality • Simplicity vs. support for plurality and provenance → all those decisions influence the shape of a knowledge graph!
  • 28. 9/26/18 Heiko Paulheim 28 Non-Public Knowledge Graphs • Many companies have their own private knowledge graphs – Google: Knowledge Graph, Knowledge Vault – Yahoo!: Knowledge Graph – Microsoft: Satori – Facebook: Entities Graph – Thomson Reuters: permid.org (partly public) • However, we usually know only little about them
  • 29. 9/26/18 Heiko Paulheim 29 Comparison of Knowledge Graphs • Release cycles Instant updates: DBpedia live, Freebase Wikidata Days: NELL Months: DBpedia Years: YAGO Cyc • Size and density Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017 Caution!
  • 30. 9/26/18 Heiko Paulheim 30 Comparison of Knowledge Graphs • What do they actually contain? • Experiment: pick 25 classes of interest – And find them in respective ontologies • Count instances (coverage) • Determine in and out degree (level of detail)
  • 31. 9/26/18 Heiko Paulheim 31 Comparison of Knowledge Graphs Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 32. 9/26/18 Heiko Paulheim 32 Comparison of Knowledge Graphs • Summary findings: – Persons: more in Wikidata (twice as many persons as DBpedia and YAGO) – Countries: more details in Wikidata – Places: most in DBpedia – Organizations: most in YAGO – Events: most in YAGO – Artistic works: ● Wikidata contains more movies and albums ● YAGO contains more songs Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 33. 9/26/18 Heiko Paulheim 33 Caveats • Reading the diagrams right… • So, Wikidata contains more persons – but less instances of all the interesting subclasses? • There are classes like Actor in Wikidata – but they are hardly used – rather: modeled using profession relation
  • 34. 9/26/18 Heiko Paulheim 34 Caveats • Reading the diagrams right… (ctd.) • So, Wikidata contains more data on countries, but less countries? • First: Wikidata only counts current, actual countries – DBpedia and YAGO also count historical countries • “KG1 contains less of X than KG2” can mean – it actually contains less instances of X – it contains equally many or more instances, but they are not typed with X (see later) • Second: we count single facts about countries – Wikidata records some time indexed information, e.g., population – Each point in time contributes a fact
  • 35. 9/26/18 Heiko Paulheim 35 Overlap of Knowledge Graphs • How largely do knowledge graphs overlap? • They are interlinked, so we can simply count links – For NELL, we use links to Wikipedia as a proxy DBpedia YAGO Wikidata NELL Open Cyc Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 36. 9/26/18 Heiko Paulheim 36 Overlap of Knowledge Graphs • How largely do knowledge graphs overlap? • They are interlinked, so we can simply count links – For NELL, we use links to Wikipedia as a proxy Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 37. 9/26/18 Heiko Paulheim 37 Overlap of Knowledge Graphs • Links between Knowledge Graphs are incomplete – The Open World Assumption also holds for interlinks • But we can estimate their number • Approach: – find link set automatically with different heuristics – determine precision and recall on existing interlinks – estimate actual number of links Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 38. 9/26/18 Heiko Paulheim 38 Overlap of Knowledge Graphs • Idea: – Given that the link set F is found – And the (unknown) actual link set would be C • Precision P: Fraction of F which is actually correct – i.e., measures how much |F| is over-estimating |C| • Recall R: Fraction of C which is contained in F – i.e., measures how much |F| is under-estimating |C| • From that, we estimate Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 39. 9/26/18 Heiko Paulheim 39 Overlap of Knowledge Graphs • Mathematical derivation: – Definition of recall: – Definition of precision: • Resolve both to , substitute, and resolve to Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 40. 9/26/18 Heiko Paulheim 40 Overlap of Knowledge Graphs • Experiment: – We use the same 25 classes as before – Measure 1: overlap relative to smaller KG (i.e., potential gain) – Measure 2: overlap relative to explicit links (i.e., importance of improving links) • Link generation with 16 different metrics and thresholds – Intra-class correlation coefficient for |C|: 0.969 – Intra-class correlation coefficient for |F|: 0.646 • Bottom line: – Despite variety in link sets generated, the overlap is estimated reliably – The link generation mechanisms do not need to be overly accurate Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 41. 9/26/18 Heiko Paulheim 41 Overlap of Knowledge Graphs Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 42. 9/26/18 Heiko Paulheim 42 Overlap of Knowledge Graphs • Summary findings: – DBpedia and YAGO cover roughly the same instances (not much surprising) – NELL is the most complementary to the others – Existing interlinks are insufficient for out-of-the-box parallel usage Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 43. 9/26/18 Heiko Paulheim 43 Intermezzo: Knowledge Graph Creation Cost • There are quite a few metrics for evaluating KGs – size, degree, interlinking, quality, licensing, ... Färber et al.: Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO SWJ 9(1), 2018 Zaveri et al.: Quality Assessment for Linked Open Data: A Survey. SWJ 7(1), 2016
  • 44. 9/26/18 Heiko Paulheim 44 Intermezzo: Knowledge Graph Creation Cost • ...but what is the cost of a single statement? • Some back of the envelope calculations...
  • 45. 9/26/18 Heiko Paulheim 45 Intermezzo: Knowledge Graph Creation Cost • Case 1: manual curation – Cyc: created by experts Total development cost: $120M Total #statements: 21M → $5.71 per statement – Freebase: created by laymen Assumption: adding a statement to Freebase equals adding a sentence to Wikipedia ● English Wikipedia up to April 2011: 41M working hours (Geiger and Halfaker, 2013), size in April 2011: 3.6M pages, avg. 36.4 sentences each ● Using US minimum wage: $2.25 per sentence → $2.25 per statement (Footnote: total cost of creating Freebase would be $6.75B) acquisition by Google estimated as $60-300M
  • 46. 9/26/18 Heiko Paulheim 46 Intermezzo: Knowledge Graph Creation Cost • Case 2: automatic/heuristic creation – DBpedia: 4.9M LOC, 2.2M LOC for mappings software project development: ~37 LOC per hour (Devanbu et al., 1996) we use German PhD salaries as a cost estimate → 1.85c per statement – YAGO: made from 1.6M LOC uses WordNet: 117k synsets, we treat each synset like a Wiki page → 0.83c per statement – NELL: 103k LOC → 14.25c per statement • Compared to manual curation: saving factor 16-250
  • 47. 9/26/18 Heiko Paulheim 47 Intermezzo: Knowledge Graph Creation Cost • Graph error rate against cost – we can pay for accuracy – NELL is a bit of an outlier
  • 48. 9/26/18 Heiko Paulheim 48 New Kids on the Block Subjective age: Measured by the fraction of the audience that understands a reference to your young days’ pop culture...
  • 49. 9/26/18 Heiko Paulheim 49 WebDataCommons Microdata Corpus • Web pages can use different markup – Microdata – Microformats – RDFa • Microdata with schema.org is promoted by major search engines (Google, Bing, Yahoo!, Yandex) → direct business incentive of using markup → quick and wide adoption during the last years
  • 50. 9/26/18 Heiko Paulheim 50 WebDataCommons Microdata Corpus • Structured markup on the Web can be extracted to RDF – See W3C Interest Group Note: Microdata to RDF [1] [1] http://www.w3.org/TR/microdata-rdf/ <div itemscope itemtype="http://schema.org/PostalAddress"> <span itemprop="name">Data and Web Science Group</span> <span itemprop="addressLocality">Mannheim</span>, <span itemprop="postalCode">68131</span> <span itemprop="addressCountry">Germany</span> </div> _:1 a <http://schema.org/PostalAddress> . _:1 <http://schema.org/name> "Data and Web Science Group" . _:1 <http://schema.org/addressLocality> "Mannheim" . _:1 <http://schema.org/postalCode> "68131" . _:1 <http://schema.org/adressCounty> "Germany" .
  • 51. 9/26/18 Heiko Paulheim 51 WebDataCommons Microdata Corpus • Main topics covered in the corpus – Meta information on web page content (web page, blog...) – Business data (products, offers, …) – Contact data (businesses, persons, ...) – (Product) reviews and ratings • ...and a massive long tail
  • 52. 9/26/18 Heiko Paulheim 52 WebDataCommons Microdata Corpus • Is the Microdata corpus a knowledge graph? • Recap Working definition: a Knowledge Graph – mainly describes instances and their relations in a graph – defines possible classes and relations in a schema or ontology – allows for interlinking arbitrary entities with each other – covers various domains • Yes, but...
  • 53. 9/26/18 Heiko Paulheim 53 Further Sources of Knowledge in Wikipedia • show: list pages, categories, tables, ...
  • 54. 9/26/18 Heiko Paulheim 54 CaLiGraph Idea • Entities co-occur in surface patterns – e.g., enumerations, table columns, … • Co-occurring entities share semantic patterns – e.g., types, relations, attribute values • Existing entities co-occur with new entities
  • 55. 9/26/18 Heiko Paulheim 55 CaLiGraph Idea • Surface patterns and semantic patterns also exist outside of Wikipedia
  • 56. 9/26/18 Heiko Paulheim 56 New Kids on the Block • Wikipedia-based Knowledge Graphs will remain an essential building block of Semantic Web applications • But they suffer from... – ...a coverage bias – ...limitations of the creating heuristics
  • 57. 9/26/18 Heiko Paulheim 57 Wikipedia’s Coverage Bias • One (but not the only!) possible source of coverage bias – Articles about long-tail entities become deleted
  • 58. 9/26/18 Heiko Paulheim 58 Work in Progress: DBkWik • Why stop at Wikipedia? • Wikipedia is based on the MediaWiki software – ...and so are thousands of Wikis – Fandom by Wikia: >385,000 Wikis on special topics – WikiApiary: reports >20,000 installations of MediaWiki on the Web
  • 59. 9/26/18 Heiko Paulheim 59 Work in Progress: DBkWik • Collecting Data from a Multitude of Wikis
  • 60. 9/26/18 Heiko Paulheim 60 Work in Progress: DBkWik • The DBpedia Extraction Framework consumes MediaWiki dumps • Experiment – Can we process dumps from arbitrary Wikis with it? – Are the results somewhat meaningful?
  • 61. 9/26/18 Heiko Paulheim 61 Work in Progress: DBkWik • Example from Harry Potter Wiki http://dbkwik.webdatacommons.org/
  • 62. 9/26/18 Heiko Paulheim 62 Work in Progress: DBkWik • Differences to DBpedia – DBpedia has manually created mappings to an ontology – Wikipedia has one page per subject – Wikipedia has global infobox conventions (more or less) • Challenges – On-the-fly ontology creation – Instance matching – Schema matching Hertling & Paulheim: DBkWik: A Consolidated Knowledge Graph from Thousands of Wikis. ICBK 2018
  • 63. 9/26/18 Heiko Paulheim 63 Work in Progress: DBkWik Dump Downloader DBpedia Extraction Framework Interlinking Instance Matcher Schema Matcher MediaWiki Dumps Extracted RDF Internal Linking Instance Matcher Schema Matcher Consolidated Knowledge Graph DBkWik Linked Data Endpoi nt Ontology Knowledge Graph Fusion Instance Matcher Domain/ Range Type SDType Light SubclassMaterialization • Avoiding O(n²) internal linking: – Match to DBpedia first – Use common links to DBpedia as blocking keys for internal matching Hertling & Paulheim: DBkWik: A Consolidated Knowledge Graph from Thousands of Wikis. ICBK 2018
  • 64. 9/26/18 Heiko Paulheim 64 Work in Progress: DBkWik • Downloaded ~15k Wiki dumps from Fandom – 52.4GB of data, roughly the size of the English Wikipedia • Prototype: extracted data for ~250 Wikis – 4.3M instances, ~750k linked to DBpedia – 7k classes, ~1k linked to DBpedia – 43k properties, ~20k linked to DBpedia – ...including duplicates! • Link quality – Good for classes, OK for properties (F1 of .957 and .852) – Needs improvement for instances (F1 of .641)
  • 65. 9/26/18 Heiko Paulheim 65 Work in Progress: WebIsALOD • Background: Web table interpretation • Most approaches need typing information – DBpedia etc. have too little coverage on the long tail – Wanted: extensive type database Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
  • 66. 9/26/18 Heiko Paulheim 66 Work in Progress: WebIsALOD • Extraction of type information using Hearst-like patterns, e.g., – T, such as X – X, Y, and other T • Text corpus: common crawl – ~2 TB crawled web pages – Fast implementation: regex over text – “Expensive” operations only applied once regex has fired • Resulting database – 400M hypernymy relations Seitner et al.: A large DataBase of hypernymy relations extracted from the Web. LREC 2016
  • 67. 9/26/18 Heiko Paulheim 67 Work in Progress: WebIsALOD • Example: http://webisa.webdatacommons.org/
  • 68. 9/26/18 Heiko Paulheim 68 Work in Progress: WebIsALOD • Initial effort: transformation to a LOD dataset – including rich provenance information Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
  • 69. 9/26/18 Heiko Paulheim 69 Work in Progress: WebIsALOD • Estimated contents breakdown Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
  • 70. 9/26/18 Heiko Paulheim 70 Work in Progress: WebIsALOD • Main challenge – Original dataset is quite noisy (<10% correct statements) – Recap: coverage vs. accuracy – Simple thresholding removes too much knowledge • Approach – Train RandomForest model for predicting correct vs. wrong statements – Using all the provenance information we have – Use model to compute confidence scores Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
  • 71. 9/26/18 Heiko Paulheim 71 Work in Progress: WebIsALOD • Current challenges and works in progress – Distinguishing instances and classes ● i.e.: subclass vs. instance of relations – Splitting instances ● Bauhaus is a goth band ● Bauhaus is a German school – Knowledge extraction from pre and post modifiers ● Bauhaus is a goth band → genre(Bauhaus, Goth) ● Bauhaus is a German school → location(Bauhaus, Germany) Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
  • 72. 9/26/18 Heiko Paulheim 72 Brief Interlude: Machine Learning Basics • An essential ingredient to intelligent applications – Dealing with new pieces of knowledge – Handling unknown situations – Adapting to users' needs – Making predictions for the future • Inductive vs. Deductive Reasoning: – Deductive: rules + facts → facts – Inductive: facts → rules
  • 73. 9/26/18 Heiko Paulheim 73 Brief Interlude: Machine Learning Basics • Example: learning a new concept, e.g., "Tree" "tree" "tree" "tree" "not a tree" "not a tree" "not a tree"
  • 74. 9/26/18 Heiko Paulheim 74 Brief Interlude: Machine Learning Basics • Example: learning a new concept, e.g., "Tree" – we look at (positive and negative) examples – ...and derive a model • e.g., "Trees are big, green plants" • Goal: Classification of new instances "tree?" Warning: Models are only approximating examples! Not guaranteed to be correct or complete!
  • 75. 9/26/18 Heiko Paulheim 75 Brief Interlude: Machine Learning Basics • Typical tasks: – Classification (binary or multi-label) – Regression (i.e., predicting numerical values) – Clustering (finding groups of objects) – Anomaly Detection (identifying special data points) – Frequent Pattern Mining – ... • Methods: – Statistical approaches (Naive Bayes, Neural Nets, SVMs, …) – Symbolic approaches (Rules, Decision Trees, …) – ...
  • 76. 9/26/18 Heiko Paulheim 76 Background Knowledge for Machine Learning • Example machine learning task: predicting book sales ISBN City Sold 3-2347-3427-1 Darmstadt 124 3-43784-324-2 Mannheim 493 3-145-34587-0 Roßdorf 14 ... ISBN City Population ... Genre Publisher ... Sold 3-2347-3427-1 Darm- stadt 144402 ... Crime Bloody Books ... 124 3-43784-324-2 Mann- heim 291458 … Crime Guns Ltd. … 493 3-145-34587-0 Roß- dorf 12019 ... Travel Up&Away ... 14 ... → Crime novels sell better in larger cities We can use Semantic Web Knowledge Graphs for augmenting ML problems with background knowledge!
  • 77. 9/26/18 Heiko Paulheim 77 The FeGeLOD Framework IS B N 3 -2 3 4 7 -3 4 2 7 -1 C ity D a r m s ta d t # s o ld 1 2 4 N a m e d E n t it y R e c o g n it io n IS B N 3 -2 3 4 7 -3 4 2 7 - 1 C ity D a r m s ta d t # s o ld 1 2 4 C ity _ U R I h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t F e a t u r e G e n e r a t io n IS B N 3 -2 3 4 7 -3 4 2 7 -1 C ity D a r m s ta d t # s o ld 1 2 4 C ity _ U R I h ttp : / / d b p e d ia .o r g / r e s o u r c e / D a r m s ta d t C ity _ U R I_ d b p e d ia -o w l: p o p u la tio n T o ta l 1 4 1 4 7 1 C ity _ U R I_ ... ... F e a t u r e S e le c t io n IS B N 3 -2 3 4 7 -3 4 2 7 - 1 C ity D a r m s ta d t # s o ld 1 2 4 C ity _ U R I h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t C ity _ U R I_ d b p e d ia -o w l:p o p u la tio n T o ta l 1 4 1 4 7 1
  • 78. 9/26/18 Heiko Paulheim 78 The FeGeLOD Framework • Entity Recognition – Simple approach: guess DBpedia URIs – Hit rate >95% for cities and countries (by English name) • Feature Generation – augmenting the dataset with additional attributes from KG • Feature Selection – Filter noise: >95% unknown, identical, or different nominals
  • 79. 9/26/18 Heiko Paulheim 79 Propositionalization • Problem: Knowledge Graphs vs. ML algorithms expecting Feature Vectors → wanted: a transformation from nodes to sets of features • Basic strategies: – literal values (e.g., population) are used directly – instance types become binary vectors – relations are counted (absolute, relative, TF-IDF) – combinations of relations and object types are counted (absolute, relative, TF-IDF) – ...
  • 80. 9/26/18 Heiko Paulheim 80 Propositionalization • Advanced strategies: vector space embeddings – word2vec: takes sentences as input, computes vector for each word • RDF2vec: create “sentences” by random graph walks
  • 81. 9/26/18 Heiko Paulheim 81 Propositionalization • RDF2vec example – similar instances form classes – direction of relations is stable
  • 82. 9/26/18 Heiko Paulheim 82 The FeGeLOD Prototype (now: RapidMiner Linked Open Data Extension) caution: works only until RM6! :-(
  • 83. 9/26/18 Heiko Paulheim 83 The FeGeLOD Prototype (now: RapidMiner Linked Open Data Extension)
  • 84. 9/26/18 Heiko Paulheim 84 Yet Another Knowledge Graph... Presenter Coffee Break Audience served during Tea may need served during needs
  • 85. 9/26/18 Heiko Paulheim 85 Recommender Systems • Content-based recommender systems: – recommend similar items to user – similarity can be computed using KGs
  • 86. 9/26/18 Heiko Paulheim 86 Recommender Systems • Using proximity in latent RDF2vec feature space
  • 87. 9/26/18 Heiko Paulheim 87 Recommender Systems • Using proximity in latent RDF2vec feature space • RDF2vec based features constantly outperform other propositionalization techniques • DBpedia vectors outperform Wikidata vectors
  • 88. 9/26/18 Heiko Paulheim 88 Incident Detection from Twitter • Tweets about random things come in via Twitter – How can we detect incidents from those automatically? • Obviously, keywords are not enough – and hand-crafted rules are not practical fire at #mannheim #universityomg two cars on fire #A5 #accident fire at train station still burning my heart is on fire!!!come on baby light my fire boss should fire that stupid moron
  • 89. 9/26/18 Heiko Paulheim 89 Detecting Incidents from Social Media • Social media contains data on many incidents – But keyword search is not enough – Detecting small incidents is hard – Manual inspection is too expensive (and slow) • Machine learning could help – Train a model to classify incident/non incident tweets – Apply model for detecting incident related tweets • Training data: – Traffic accidents – ~2,000 tweets containing relevant keywords (“car”, “crash”, etc.), hand labeled (50% related to traffic incidents)
  • 90. 9/26/18 Heiko Paulheim 90 Detecting Incidents from Social Media • Learning to classify tweets: – Positive and negative examples – Features: • Stemming • POS tagging • Word n-grams • … • Accuracy ~90% • But – Accuracy drops to ~85% when applying the model to a different city
  • 91. 9/26/18 Heiko Paulheim 91 Brief Interlude: Model Overfitting • What happens here? – Model is trained on a sample of labeled data – Tries to identify the characteristics of that data • Possible effect: – Model is too close to training data
  • 92. 9/26/18 Heiko Paulheim 92 Brief Interlude: Model Overfitting • Extreme example – Predict credit rating • Possible (useful) model: – (job status = employed) && (debts<5000) → rating=positive Name Net Income Job status Debts Rating John 40000 employed 0 + Mary 38000 employed 10000 - Stephen 21000 self-employed 20000 - Eric 2000 student 10000 - Alice 35000 employed 4000 +
  • 93. 9/26/18 Heiko Paulheim 93 Brief Interlude: Model Overfitting • Extreme example – Predict credit rating • Possible overfit models: – (34000<income<36000) || (income>39000) → rating=positive – (name=John) || (name=Alice) → rating=positive Name Net Income Job status Debts Rating John 40000 employed 0 + Mary 38000 employed 10000 - Stephen 21000 self-employed 20000 - Eric 2000 student 10000 - Alice 35000 employed 4000 +
  • 94. 9/26/18 Heiko Paulheim 94 Brief Interlude: Model Overfitting • All three models perfectly describe the training data – But only one is a useful generalization • Two goals of machine learning (sometimes contradicting): – Explain training data as good as possible – Find a model that is as general as possible • Strategies for preventing overfitting: – Cross validation – Model pruning – Stopping criteria in model building – Occam's Razor – ...
  • 95. 9/26/18 Heiko Paulheim 95 Detecting Incidents from Social Media • Accuracy ~90% • But – Accuracy drops to ~85% when applying the model to a different city – Model overfitting? • Example set: – “Again crash on I90” – “Accident on I90” • Model: – “I90” → indicates traffic accident • Applying the model: – “Two cars crashed on I51” → not related to traffic accident
  • 96. 9/26/18 Heiko Paulheim 96 Using KGs for Preventing Overfitting • Example set: – “Again crash on I90” – “Accident on I90” dbpedia:Interstate_90 dbpedia-owl:Road rdf:type dbpedia:Interstate_51 rdf:type • Model: – dbpedia-owl:Road → indicates traffic accident • Applying the model: – “Two cars crashed on I51” → indicates traffic accident • Using DBpedia Spotlight + FeGeLOD – Accuracy keeps up at 90% – Overfitting is avoided
  • 97. 9/26/18 Heiko Paulheim 97 What's that Noise? • A slightly different scenario: – Noise measurements from cities • Create predictions – For the rest of the city – For new infrastructure • Note: – No handcrafted rules – But a fully automatic prediction ● Based on example measurements
  • 98. 9/26/18 Heiko Paulheim 98 What's that Noise? • Additional data comes from – Linked Geo Data – Open Street Maps (Streets: types, lanes and speed limits) – German weather observations (DWD)
  • 99. 9/26/18 Heiko Paulheim 99 What's that Noise? • Results: – Machine trained model for noise – ~80% accuracy in predicting the noise level (six classes) ● Mean absolute error only 0.077 • This allows to... – generate a whole noise map from a small set of observations – play “what if” with hypothetical changes ● e.g., how do speed limits affect noise levels?
  • 100. 9/26/18 Heiko Paulheim 100 And now for Something Completely Different • Who are these people?
  • 101. 9/26/18 Heiko Paulheim 101 Statistical Data • Statistics are very wide spread – Quality of living in cities – Corruption by country – Fertility rate by country – Suicide rate by country – Box office revenue of films – ...
  • 102. 9/26/18 Heiko Paulheim 102 Statistical Data • Questions we are often interested in – Why does city X have a high/low quality of living? – Why is the corruption higher in country A than in country B? – Will a new film create a high/low box office revenue? • i.e., we are looking for – explanations – forecasts (e.g., extrapolations)
  • 103. 9/26/18 Heiko Paulheim 103 Statistical Data http://xkcd.com/605/
  • 104. 9/26/18 Heiko Paulheim 104 Statistical Data • What statistics typically look like
  • 105. 9/26/18 Heiko Paulheim 105 Statistical Data • There are powerful tools for finding correlations etc. – but many statistics cannot be interpreted directly – background knowledge is missing • So where do we get background knowledge from? – with as little efforts as possible
  • 106. 9/26/18 Heiko Paulheim 106 Statistical Data • What we have
  • 107. 9/26/18 Heiko Paulheim 107 Statistical Data • What we need
  • 108. 9/26/18 Heiko Paulheim 108 Possible Sources for Background Knowledge
  • 109. 9/26/18 Heiko Paulheim 109 ...and we've already seen FeGeLOD IS B N 3 -2 3 4 7 -3 4 2 7 -1 C ity D a r m s ta d t # s o ld 1 2 4 N a m e d E n t it y R e c o g n it io n IS B N 3 -2 3 4 7 -3 4 2 7 - 1 C ity D a r m s ta d t # s o ld 1 2 4 C ity _ U R I h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t F e a t u r e G e n e r a t io n IS B N 3 -2 3 4 7 -3 4 2 7 -1 C ity D a r m s ta d t # s o ld 1 2 4 C ity _ U R I h ttp : / / d b p e d ia .o r g / r e s o u r c e / D a r m s ta d t C ity _ U R I_ d b p e d ia -o w l: p o p u la tio n T o ta l 1 4 1 4 7 1 C ity _ U R I_ ... ... F e a t u r e S e le c t io n IS B N 3 -2 3 4 7 -3 4 2 7 - 1 C ity D a r m s ta d t # s o ld 1 2 4 C ity _ U R I h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t C ity _ U R I_ d b p e d ia -o w l:p o p u la tio n T o ta l 1 4 1 4 7 1
  • 110. 9/26/18 Heiko Paulheim 110 Statistical Data • Adding background knowledge – FeGeLOD framework • Correlation analysis – e.g., Pearson Correlation Coefficient • Rule learning – e.g., Association Rule Mining – e.g., Subgroup Discovery • Further data preprocessing – depending on approach – e.g., discretization
  • 111. 9/26/18 Heiko Paulheim 111 Prototype Tool: Explain-a-LOD • Loads a statistics file (e.g., CSV) • Adds background knowledge • Presents explanations
  • 112. 9/26/18 Heiko Paulheim 112 Presenting Explanations • Verbalization with simple patterns – e.g., negative correlation between population and quality of living – "A city which has a low population has a high quality of living" • Color coding – By correlation coefficient, confidence/support of rules, etc.
  • 113. 9/26/18 Heiko Paulheim 113 Statistical Data: Examples • Data Set: Mercer Quality of Living – Quality of living in more than cities word wide – norm: NYC=100 (value range 23-109) – As of 1999 – http://across.co.nz/qualityofliving.htm • Knowledge graphs used in the examples: – DBpedia – CIA World Factbook for statistics by country
  • 114. 9/26/18 Heiko Paulheim 114 Statistical Data: Examples • Examples for low quality cities – big hot cities (junHighC >= 27 and areaTotalKm >= 334) – cold cities where no music has ever been recorded (recordedIn_in = false and janHighC <= 16) – latitude <= 24 and longitude <= 47 ● a very accurate rule ● but what's the interpretation?
  • 115. 9/26/18 Heiko Paulheim 115 Statistical Data: Examples
  • 116. 9/26/18 Heiko Paulheim 116 Statistical Data: Examples • Data Set: Transparency International – 177 Countries and a corruption perception indicator (between 1 and 10) – As of 2010 – http://www.transparency.org/cpi2010/results
  • 117. 9/26/18 Heiko Paulheim 117 Statistical Data: Examples • Example rules for countries with low corruption – HDI > 78% ● Human Development Index, calculated from live expectancy, education level, economic performance – OECD member states – Foundation place of more than nine organizations – More than ten mountains – More than ten companies with their headquarter in that state, but less than two cargo airlines
  • 118. 9/26/18 Heiko Paulheim 118 Statistical Data: Examples • Data Set: Burnout rates – 16 German DAX companies – Absolute and relative numbers – As of 2011 – http://de.statista.com/statistik/daten/studie/226959/umfrage/burn-out- erkrankungen-unter-mitarbeitern-ausgewaehlter-dax-unternehmen/
  • 119. 9/26/18 Heiko Paulheim 119 Statistical Data: Examples • Findings for burnout rates – Positive correlation between turnover and burnout rates – Car manufacturers are less prone to burnout – German companies are less prone to burnout than international ones ● Exception: Frankfurt
  • 120. 9/26/18 Heiko Paulheim 120 Statistical Data: Examples • Data Set: Antidepressives consumption – In European countries – Source: OECD – http://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a- glance-2011/pharmaceutical-consumption_health_glance-2011-39-en
  • 121. 9/26/18 Heiko Paulheim 121 Statistical Data: Examples • Findings for antidepressives consumption – Larger countries have higher consumption – Low HDI → high consumption – By geography: ● Nordic countries, countries at the Atlantic: high ● Mediterranean: medium ● Alpine countries: low – High average age → high consupmtion – High birth rates → high consumption
  • 122. 9/26/18 Heiko Paulheim 122 Statistical Data: Examples • Data Set: Suicide rates – By country – OECD states – As of 2005 – http://www.washingtonpost.com/wp-srv/world/suiciderate.html
  • 123. 9/26/18 Heiko Paulheim 123 Statistical Data: Examples • Findings for suicide rates – Democraties have lower suicide rates than other forms of government – High HDI → low suicide rate – High population density → high suicide rate – By geography: ● At the sea → low ● In the mountains → high – High Gini index → low suicide rate ● High Gini index ↔ unequal distribution of wealth – High usage of nuclear power → high suicide rates
  • 124. 9/26/18 Heiko Paulheim 124 Statistical Data: Examples • Data set: sexual activity – Percentage of people having sex weekly – By country – Survey by Durex 2005-2009 – http://chartsbin.com/view/uya
  • 125. 9/26/18 Heiko Paulheim 125 Statistical Data: Examples • Findings on sexual activity – By geography: ● High in Europe, low in Asia ● Low in Island states – By language: ● English speaking: low ● French speaking: high – Low average age → high activity – High GDP per capita → low activity – High unemployment rate → high activity – High number of ISP providers → low activity
  • 126. 9/26/18 Heiko Paulheim 126 Try it... but be careful! • Download from http://www.ke.tu-darmstadt.de/resources/explain-a-lod • including a demo video, papers, etc. • Pitfalls – Open world assumption – LOD may be noisy – Biases – ...
  • 127. 9/26/18 Heiko Paulheim 127 Try it... but be careful! • Download from http://www.ke.tu-darmstadt.de/resources/explain-a-lod • including a demo video, papers, etc. http://xkcd.com/552/
  • 128. 9/26/18 Heiko Paulheim 128 Intermediate Conclusions • Many tasks require massive background knowledge – Can be acquired from knowledge graphs – e.g., FeGeLOD framework, RapidMiner LOD extension • Machine learning is often useful to... – make sense using knowledge graphs – answer non-trivial questions – Add additional knowledge dimensions (e.g., similarity)
  • 129. 9/26/18 Heiko Paulheim 129 But... • ...what if the Knowledge Graphs we have are just not good enough?
  • 130. 9/26/18 Heiko Paulheim 130 Problem: Incompleteness • Example: get a list of science fiction writers in DBpedia select ?x where {?x a dbo:Writer . ?x dbo:genre dbr:Science_Fiction} order by ?x
  • 131. 9/26/18 Heiko Paulheim 131 Problem: Incompleteness • Results from DBpedia Arthur C. Clarke? H.G. Wells? Isaac Asimov?
  • 132. 9/26/18 Heiko Paulheim 132 Problem: Incompleteness • What reasons can cause incomplete results? • Two possible problems: – The resource at hand is not of type dbo:Writer – The genre relation to dbr:Science_Fiction is missing select ?x where {?x a dbo:Writer . ?x dbo:genre dbr:Science_Fiction} order by ?x
  • 133. 9/26/18 Heiko Paulheim 133 Common Errors in Knowledge Graphs • Various works on Knowledge Graph Refinement – Knowledge Graph completion – Error detection • See, e.g., 2017 survey in Semantic Web Journal Paulheim: Knowledge Graph Refinement – A Survey of Approaches and Evaluation Methods. SWJ 8(3), 2017
  • 134. 9/26/18 Heiko Paulheim 134 Common Errors in Knowledge Graphs • Missing types – Estimate (2013) for DBpedia: at least 2.6M type statements are missing – Using YAGO as “ground truth” • “Well, we’re semantics folks, we have ontologies!” – CONSTRUCT {?x a ?t} WHERE { {?x ?r ?y . ?r rdfs:domain ?t} UNION {?y ?r ?x . ?r rdfs:range ?t} } Bizer & Paulheim: Type Inference on Noisy RDF Data. In: ISWC 2013
  • 135. 9/26/18 Heiko Paulheim 135 Common Errors in Knowledge Graphs • Experiment of RDFS reasoning for typing Germany • Results: – Place, PopulatedPlace, Award, MilitaryConflict, City, Country, EthnicGroup, Genre, Stadium, Settlement, Language, MontainRange, PersonFunction, Race, RouteOfTransportation, Building, Mountain, Airport, WineRegion • Bottom line: RDFS reasoning accumulates errors – Germany is the object of 44,433 statements – 15 single wrong statements can cause those 15 errors – i.e., an error rate of only 0.03% (that is unlikely to achieve) Bizer & Paulheim: Type Inference on Noisy RDF Data. In: ISWC 2013
  • 136. 9/26/18 Heiko Paulheim 136 Common Errors in Knowledge Graphs • Required: a noise-tolerant approach • SDType (meanwhile included in DBpedia) – Use statistical distributions of properties and object types ● P(C|p) → probability of object being of type C when observing property p in a statement – Averaging scores for all statements of a resource – Weighting properties by discriminative power • Since DBpedia 3.9: typing ~1M untyped resources at precision >0.95 • Refinement: – Filtering resources of non-instance pages and list pages Bizer & Paulheim: Type Inference on Noisy RDF Data. In: ISWC 2013
  • 137. 9/26/18 Heiko Paulheim 137 Common Errors in Knowledge Graphs • Depiction of characteristic distributions – example: director
  • 138. 9/26/18 Heiko Paulheim 138 Common Errors in Knowledge Graphs • Recap – Trade-off coverage vs. accuracy 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 2 4 6 8 10 12 # found types precision Lower bound for threshold Precision #foundtypes Bizer & Paulheim: Type Inference on Noisy RDF Data. In: ISWC 2013
  • 139. 9/26/18 Heiko Paulheim 139 Hierarchical Type Prediction • SLCN: Decomposing the type prediction problem: – we learn a classifier for each class – much simpler learning problem, less instances and features required – different variants for sampling negative instances Melo, Völker & Paulheim: Type prediction in noisy RDF knowledge bases using hierarchical multilabel classification with graph and latent features. IJAIT, 2017
  • 140. 9/26/18 Heiko Paulheim 140 Common Errors in Knowledge Graphs • The same idea applied to identification of noisy statements – i.e., a statement is implausible if the distribution of its object’s types deviates from the overall distribution for the predicate • Removing ~20,000 erroneous statements from DBpedia • Error analysis – Errors in Wikipedia account for ~30% – Other typical problems: see following slides Bizer & Paulheim: Improving the quality of linked data using statistical distributions. In: IJSWIS 10(2), 2014
  • 141. 9/26/18 Heiko Paulheim 141 Common Errors in Knowledge Graphs • Typical errors – links in longer texts are not interpreted correctly – dbr:Carole_Goble dbo:award dbr:Jim_Gray Paulheim & Gangemi: Serving DBpedia with DOLCE – More than Just Adding a Cherry on Top. ISWC 2015
  • 142. 9/26/18 Heiko Paulheim 142 Common Errors in Knowledge Graphs • Typical errors – Misinterpretation of redirects – dbr:Ben_Casey dbo:company dbr:Bing_Cosby Paulheim & Gangemi: Serving DBpedia with DOLCE – More than Just Adding a Cherry on Top. ISWC 2015
  • 143. 9/26/18 Heiko Paulheim 143 Common Errors in Knowledge Graphs • Typical errors – Metonymy – dbr:Human_Nature_(band) dbo:genre dbr:Motown, – Links with anchors pointing to subsections in a page – First_Army_(France)#1944-1945 Paulheim & Gangemi: Serving DBpedia with DOLCE – More than Just Adding a Cherry on Top. ISWC 2015
  • 144. 9/26/18 Heiko Paulheim 144 Common Errors in Knowledge Graphs • Identifying individual errors is possible with many techniques – e.g., statistics, reasoning, exploiting upper ontologies, … • ...but what do we do with those efforts? – they typically end up in drawers and abandoned GitHub repositories Paulheim & Gangemi: Serving DBpedia with DOLCE – More than Just Adding a Cherry on Top. ISWC 2015 Paulheim: Data-driven Joint Debugging of the DBpedia Mappings and Ontology. ESWC 2017
  • 145. 9/26/18 Heiko Paulheim 145 Motivation • Possible option 1: Remove erroneous triples from DBpedia • Challenges – May remove correct axioms, may need thresholding – Needs to be repeated for each release – Needs to be materialized on all of DBpedia DBpedia Extraction FrameworkWikipedia DBpedia Mappings Wiki Post Filter
  • 146. 9/26/18 Heiko Paulheim 146 Motivation • Possible option 2: Integrate into DBpedia Extraction Framework • Challenges – Development workload – Some approaches are not fully automated (technically or conceptually) – Scalability DBpedia Extraction Framework plus filter module Wikipedia DBpedia Mappings Wiki DBpedia Extraction Framework
  • 147. 9/26/18 Heiko Paulheim 147 Common Errors in Knowledge Graphs • Goal: a third option – Find the root of the error and fix it! Wikipedia DBpedia Mappings Wiki DBpedia Extraction Framework Inconsistency DetectionIdentification of suspicious mappings and ontology constructs Paulheim: Data-driven Joint Debugging of the DBpedia Mappings and Ontology. ESWC 2017
  • 148. 9/26/18 Heiko Paulheim 148 Common Errors in Knowledge Graphs • All the work so far has been focused on individual errors • Hypothesis: few error patterns with a common root cause a large fraction of errors – i.e., finding those patterns boosts the quality of a knowledge graph • Observation: – Top 40 patterns constitute 96% of inconsistent statements in DBpedia Paulheim, Gangemi: Serving DBpedia with DOLCE – More than Just Adding a Cherry on Top. ISWC 2015
  • 149. 9/26/18 Heiko Paulheim 149 Common Errors in Knowledge Graphs • Case 1: Wrong mapping • Example: – branch in infobox military unit is mapped to dbo:militaryBranch ● but dbo:militaryBranch has dbo:Person as its domain – correction: dbo:commandStructure – Affects 12,172 statements (31% of all dbo:militaryBranch)
  • 150. 9/26/18 Heiko Paulheim 150 Common Errors in Knowledge Graphs • Case 2: Mappings that should be removed • Example: – dbo:picture – Most of the are inconsistent (64.5% places, 23.0% persons) – Reason: statements are extracted from picture caption dbr:Brixton_Academy dbo:picture dbr:Brixton . dbr:Justify_My_Love dbo:picture dbr:Madonna_(entertainer) .
  • 151. 9/26/18 Heiko Paulheim 151 Common Errors in Knowledge Graphs • Case 3: Ontology problems (domain/range) • Example 1: – Populated places (e.g., cities) are used both as place and organization – For some properties, the range is either one of the two ● e.g., dbo:operator – Polysemy should be reflected in the ontology • Example 2: – dbo:architect, dbo:designer, dbo:engineer etc. have dbo:Person as their range – Significant fractions (8.6%, 7.6%, 58.4%, resp.) have a dbo:Organization as object – Range should be broadened
  • 152. 9/26/18 Heiko Paulheim 152 Common Errors in Knowledge Graphs • Case 4: Missing properties • Example 1: – dbo:president links an organization to its president – Majority use (8,354, or 76.2%): link a person to the president s/he served for • Example 2: – dbo:instrument links an artist to the instrument s/he plays – Prominent alternative use (3,828, or 7.2%): links a genre to its characteristic instrument
  • 153. 9/26/18 Heiko Paulheim 153 Common Errors in Knowledge Graphs • Introductory example: Arthur C. Clarke? H.G. Wells? Isaac Asimov? select ?x where {?x a dbo:Writer . ?x dbo:genre dbr:Science_Fiction} order by ?x
  • 154. 9/26/18 Heiko Paulheim 154 Common Errors in Knowledge Graphs • Incompleteness in relation assertions • Example: Arthur C. Clarke, Isaac Asimov, ... – There is no explicit link to Science Fiction in the infobox – i.e., the statement for ... dbo:genre dbr:Science_Fiction is not generated
  • 155. 9/26/18 Heiko Paulheim 155 Common Errors in Knowledge Graphs • Example for recent work (ISWC 2017): heuristic relation extraction from Wikipedia abstracts • Idea: – There are probably certain patterns: ● e.g., all genres linked in an abstract about a writer are that writer’s genres ● e.g., the first place linked in an abstract about a person is that person’s birthplace – The types are already in DBpedia – We can use existing relations as training data ● Using a local closed world assumption for negative examples – Learned models can be evaluated and only used at a certain precision Heist & Paulheim: Language-agnostic relation extraction from Wikipedia Abstracts. ISWC 2017
  • 156. 9/26/18 Heiko Paulheim 156 Common Errors in Knowledge Graphs • Results: – 1M additional assertions can be learned for 100 relations at 95% precision • Additional consideration: – We use only links, types from DBpedia, and positional features – No language-specific information (e.g., POS tags) – Thus, we are not restricted to English! Heist & Paulheim: Language-agnostic relation extraction from Wikipedia Abstracts. ISWC 2017
  • 157. 9/26/18 Heiko Paulheim 157 Common Errors in Knowledge Graphs • Cross-lingual experiment: – Using the 12 largest language editions of Wikipedia – Exploiting inter-language links Heist & Paulheim: Language-agnostic relation extraction from Wikipedia Abstracts. ISWC 2017
  • 158. 9/26/18 Heiko Paulheim 158 Common Errors in Knowledge Graphs • Analysis – Is there a relation between the language and the the country (dbo:country) of the entities for which information is extracted? Heist & Paulheim: Language-agnostic relation extraction from Wikipedia Abstracts. ISWC 2017
  • 159. 9/26/18 Heiko Paulheim 159 Common Errors in Knowledge Graphs • So far, we have looked at relation assertions • Numerical values can also be problematic… – Recap: Wikipedia is made for human consumption • The following are all valid representations of the same height value (and perfectly understandable by humans) – 6 ft 6 in, 6ft 6in, 6'6'', 6'6”, 6´6´´, … – 1.98m, 1,98m, 1m 98, 1m 98cm, 198cm, 198 cm, … – 6 ft 6 in (198 cm), 6ft 6in (1.98m), 6'6'' (1.98 m), … – 6 ft 6 in[1] , 6 ft 6 in [citation needed] , … – ... Wienand & Paulheim: Detecting Incorrect Numerical Data in DBpedia. ESWC 2014 Fleischhacker et al.: Detecting Errors in Numerical Linked Data Using Cross-Checked Outlier Detection. ISWC 2014
  • 160. 9/26/18 Heiko Paulheim 160 Common Errors in Knowledge Graphs • Approach: outlier detection – With preprocessing: finding meaningful subpopulations – With cross-checking: discarding natural outliers • Findings: 85%-95% precision possible – depending on predicate – Identification of typical parsing problems Wienand & Paulheim: Detecting Incorrect Numerical Data in DBpedia. ESWC 2014 Fleischhacker et al.: Detecting Errors in Numerical Linked Data Using Cross-Checked Outlier Detection. ISWC 2014
  • 161. 9/26/18 Heiko Paulheim 161 Common Errors in Knowledge Graphs • Footprint of a systematic error Wienand & Paulheim: Detecting Incorrect Numerical Data in DBpedia. ESWC 2014 Fleischhacker et al.: Detecting Errors in Numerical Linked Data Using Cross-Checked Outlier Detection. ISWC 2014
  • 162. 9/26/18 Heiko Paulheim 162 Common Errors in Knowledge Graphs • Errors include – Interpretation of imperial units – Unusual decimal/thousands separators – Concatenation (population 28,322,006) Wienand & Paulheim: Detecting Incorrect Numerical Data in DBpedia. ESWC 2014 Fleischhacker et al.: Detecting Errors in Numerical Linked Data Using Cross-Checked Outlier Detection. ISWC 2014
  • 163. 9/26/18 Heiko Paulheim 163 Take Aways • A lot of knowledge graphs are out there – built with various methods (manual, automatic, ...) – can be used in many different tasks – data interpretation, recommender systems, … • Knowledge graphs are not perfect – machine learning helps improving them
  • 164. 9/26/18 Heiko Paulheim 164 Hands On • You have seen many knowledge graphs • Now contribute to work in progress – Go to WebIsALOD – http://webisa.webdatacommons.org/ • Look at, e.g., http://webisa.webdatacommons.org/concept/_apple_ • You’ll observe a few issues here: – wrong statements – class/instance distinction – disambiguation (fruit vs. company) – knowledge implicit in classes (e.g., silicon valley company)
  • 165. 9/26/18 Heiko Paulheim 165 Hands On • Pick one (or more) of those issues • Sketch a possible approach to solve this – e.g., using machine learning • Think: – what does the learning problem look like? – how could we acquire training data? – what results do you expect? – where could your approach fail? • You’re developing slideware – i.e., no implemented solutions are expected – however, please look at actual data to demonstrate your approach
  • 166. 9/26/18 Heiko Paulheim 166 Machine Learning with and from Semantic Web Knowledge Graphs Heiko Paulheim