Towards research data knowledge graphs
1. Backup
Beyond research data infrastructures - exploiting artificial & crowd
intelligence for building research knowledge graphs
Stefan Dietze
GESIS – Leibniz Institute for the Social Sciences &
Heinrich-Heine-Universität Düsseldorf
AIFB / KIT, Karlsruhe 21.02.2020
2. Backup
Beyond research data infrastructures - exploiting artificial & crowd
intelligence for building research knowledge graphs
Stefan Dietze
GESIS – Leibniz Institute for the Social Sciences &
Heinrich-Heine-Universität Düsseldorf
LWDA2019, 02 October 2019
[Word cloud: research data infrastructure, data fusion, distant supervision, Web mining, distributional semantics, knowledge graph, neural entity linking, research data, machine learning, social web, artificial intelligence, semantics, claim extraction, stance detection, fact verification, crowd]
(Buzzword) Bingo!?
7. Part I
Retrieving, extracting and linking research data (in particular: metadata) on the Web
Part II
Mining novel forms of research data (KGs) from the Web
17/03/20 Stefan Dietze
Overview
[Diagram: Datasets, Metadata, Publications, Web pages, Opinions, Claims, Stances]
8. Web mining of dataset metadata (or: dataset KGs)
Harvesting from open data portals (e.g. DCAT/VoID
metadata on DataHub.io, DataCite etc.)
Information extraction on long tail of Web documents?
=> dynamics & scale: approx. 50 trn (50.000.000.000.000)
Web pages indexed by Google (plus gazillions of temporal
snapshots)
Embedded markup (RDFa, Microdata, Microformats) for
annotation of Web pages
Supports Web search & interpretation
Pushed by Google, Yahoo, Bing et al
(schema.org vocabulary)
Adoption by 38% of all Web pages
(sample: Common Crawl 2016, 3.2 bn Web pages)
Easily accessible, large-scale source of factual knowledge
(about research data & research information)
Large-scale source of training data, e.g. manually
annotated Web pages citing datasets
Facts (“quads”)
node1 name WB Commodity URI-1
node1 distribution node_xy URI-1
node1 creator Worldbank URI-1
node1 dateCreated 26 April 2017 URI-1
node2 creator World Bank URI-2
node2 encodingFormat text/CSV URI-2
node3 dateCreated 26 April 2007 URI-3
node3 keywords crude URI-3
<div itemscope itemtype="http://schema.org/Dataset">
<h1 itemprop="name">World Bank-Commodity Prices</h1>
<span itemprop="distribution">URL-X</span>
<span itemprop="license">CC-BY</span>
...
</div>
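The markup above can be flattened into "quad"-style facts with very little code. A minimal sketch (not the actual extraction pipeline) using Python's standard `html.parser`; the node identifiers and the snippet are illustrative:

```python
from html.parser import HTMLParser

# Collect (node, property, value) triples from schema.org microdata,
# roughly mirroring the facts table above (minus the source URI).
class MicrodataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.triples = []   # collected (node, property, value)
        self.node = None    # id of the current itemscope node
        self.prop = None    # itemprop whose text we are reading
        self.count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.count += 1
            self.node = f"node{self.count}"
            if "itemtype" in attrs:
                self.triples.append((self.node, "type", attrs["itemtype"]))
        if "itemprop" in attrs:
            self.prop = attrs["itemprop"]

    def handle_data(self, data):
        if self.node and self.prop and data.strip():
            self.triples.append((self.node, self.prop, data.strip()))
            self.prop = None

html = """
<div itemscope itemtype="http://schema.org/Dataset">
  <h1 itemprop="name">World Bank-Commodity Prices</h1>
  <span itemprop="distribution">URL-X</span>
  <span itemprop="license">CC-BY</span>
</div>
"""
p = MicrodataParser()
p.feed(html)
for t in p.triples:
    print(t)
```

In a real crawl the source page URI would be appended to each triple, turning it into the quads shown above.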
9. Research dataset markup on the Web
In Common Crawl 2017 (3.2 bn pages):
o 14.1 M statements & 3.4 M instances
related to "s:Dataset"
o Spread across 200 K pages from 2878 PLDs
(top 10% of PLDs provide 95% of data)
Studies of scholarly articles and other types
[SAVESD16, WWW2017]: majority of major
publishers, data hosting sites, data registries,
libraries, research organisations represented
power law distribution of dataset metadata across PLDs
Challenges
o Errors. Factual errors, annotation errors (see
also [Meusel et al, ESWC2015])
o Ambiguity & coreferences: e.g. 18.000 entity
descriptions of “iPhone 6” in Common Crawl
2016 & ambiguous literals (e.g. “Apple”)
o Redundancies & conflicts: vast amounts of
equivalent or conflicting statements
10. 0. Noise: data cleansing (node URIs, deduplication etc)
1.a) Scale: Blocking through BM25 entity retrieval on markup index
1.b) Relevance: supervised coreference resolution
2.) Quality & redundancy: data fusion through supervised fact classification (SVM, kNN, RF, LR, NB), diverse
feature set (authority, relevance etc.), considering source- (e.g. PageRank), entity-, & fact-level features
KnowMore: data fusion on markup
1. Blocking &
coreference
resolution
2. Fusion / Fact selection
New Queries
WorldBank, type:(Organization)
Washington, type:(City)
David Malpass, type:(Person)
(supervised)
Entity Description
name
“WorldBank Commodity
Prices 2019”
distribution Worldbank (node)
releaseDate 26.04.2019
keywords “crude”, “prizes”, “market”
encodingFormat text/CSV
Query
WorldBank Commodity,
Prices 2019, type:(Dataset)
Candidate Facts
node1 name WB Commodity
node1 distribution node_xy
node1 creator Worldbank
node1 dateReleased 26 April 2019
node2 creator World Bank
node2 encodingFormat text/CSV
node3 dateCreated 26 April 2007
node4 keywords “crude”
Web page
markup
Web crawl
(Common Crawl,
44 bn facts)
approx. 125.000 facts for query [ s:Product, “iPhone6” ]
Yu, R., [..], Dietze, S., KnowMore-Knowledge Base
Augmentation with Structured Web Markup, Semantic
Web Journal 2019 (SWJ2019)
Tempelmeier, N., Demidova, E., Dietze, S., Inferring
Missing Categorical Information in Noisy and Sparse
Web Markup, The Web Conf. 2018 (WWW2018)
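Step 1a of the pipeline (blocking via BM25 entity retrieval over a markup index) can be sketched in a few lines. The scoring constants and the toy "index" below are illustrative assumptions, not the paper's actual configuration:

```python
import math
from collections import Counter

# Rank candidate entity descriptions against a query with Okapi BM25;
# in KnowMore this blocking step narrows 44 bn facts down to a small
# candidate set before supervised coreference resolution.
def bm25_scores(query, docs, k1=1.5, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                  # term frequency in this doc
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "worldbank commodity prices 2019 dataset".split(),
    "worldbank organization washington".split(),
    "iphone 6 product apple".split(),
]
query = "worldbank commodity prices".split()
scores = bm25_scores(query, docs)
print(scores)  # the first (dataset) description ranks highest
```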
11. KnowMore: data fusion on markup (results)
Fusion performance
Experiments on books, movies, products (ongoing: datasets)
Baselines: BM25, CBFS [ESWC2015], PreRecCorr [Pochampally
et al., ACM SIGMOD 2014], strong variance across types
Knowledge Graph Augmentation
On average 60% - 70% of all facts new (across DBpedia,
Wikidata, Freebase)
Additional experiments on learning new categorical features
(e.g. product categories or movie genres) [WWW2018]
13. Rich Context & Coleridge Initiative
building (yet another) KG of scholarly resources & datasets
Context/corpus: publications
(currently: social sciences, SAGE Publishing)
Tasks:
I. Extraction/disambiguation of dataset mentions
II. Extraction/detection of research methods
III. Classification of research fields
https://coleridgeinitiative.org/richcontextcompetition
14. Disambiguation of dataset citations
Otto, W. et al., Knowledge Extraction from scholarly
publications – the GESIS contribution to the Rich Context
Competition, to appear, Sage Publishing, 2020
All these issues are addressed in the current report,
which is based on analysis of data obtained in the
National Comorbidity Survey (NCS) (15). The NCS is
a nationally representative survey of the US household
population that includes retrospective reports about the
ages at onset and lifetime occurrences of suicidal
ideation, plans, and attempts along with information
about the occurrences of mental disorders, substance
use, substance abuse, and substance dependence.
[Highlighted mentions: “National Comorbidity Survey (NCS)”, “NCS”]
Challenges
Ambiguous (incomplete) citations
Lack of high-quality and representative training
data (usually: weak labels, domain bias)
Approaches & results
Prior work: supervised pattern induction
[Boland et al, TPDL2012]
Current approach:
o neural NER based on spaCy (CRF-based
approach for research method detection)
o Training (testing) on 12.000 (3.000) paragraphs
(distribution of negative/positive differs,
training batch size=25, dropout=0.4)
o Results: approx. P = .50, R = .90 (weakly labelled
test data)
o On small set of manually labelled test data:
P = .52, R = .21
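The actual approach is a neural spaCy NER model; as a toy illustration of the extraction target (closer in spirit to the earlier pattern-induction work of [Boland et al, TPDL2012]), a single regular expression already catches "Long Name (ACRONYM)" mentions like the NCS example above. The pattern and helper name are invented for illustration and only cover one naming convention:

```python
import re

# Capture "... Survey (ACRONYM)" dataset-style mentions, plus later
# stand-alone uses of the acronym in the same text.
def find_dataset_mentions(text):
    mentions = []
    for m in re.finditer(r"((?:[A-Z][a-z]+ )+Survey) \(([A-Z]{2,})\)", text):
        long_name, acronym = m.group(1), m.group(2)
        mentions.append((long_name, acronym))
        # later bare uses of the acronym after the defining mention
        mentions += [(acronym, acronym)
                     for _ in re.findall(rf"\b{re.escape(acronym)}\b",
                                         text[m.end():])]
    return mentions

text = ("...analysis of data obtained in the National Comorbidity "
        "Survey (NCS). The NCS is a nationally representative survey...")
mentions = find_dataset_mentions(text)
print(mentions)
```

Such patterns are exactly what breaks down on ambiguous, incomplete citations, which is why the slide's supervised/neural approaches are needed.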
15. Profiling (Graph) Datasets
Zloch, M., Acosta, M., Hienert, D., Dietze, S., Conrad, S., A
Software Framework and Datasets for the Analysis of Graph
Measures on RDF Graphs, ESWC19, Best Student Paper
Motivation
Profiling datasets: extracting
representative dataset metadata, e.g.
to distinguish datasets of different
kinds, find/discover datasets, generate
synthetic datasets
Research question: what are effective
graph metrics to profile graph-based
research data (social graphs,
knowledge graphs)?
Methods & results
Framework for profiling datasets based
on 60 different graph metrics
Feature engineering (correlation
analysis etc) and feature impact
analysis
Certain dataset categories hard to
describe/distinguish due to inherent
diversity/variance of datasets
Set of descriptive, non-redundant
dataset profile features varies for
different dataset categories
[Figure: feature homogeneity (lighter colour = more homogeneous metric within domain)]
[Figure: feature impact in binary classification task (RF)]
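The framework in the paper computes some 60 graph metrics; a stdlib sketch of the kind of profile features involved, on an invented toy edge list, might look like this:

```python
from collections import Counter

# Toy directed graph as an edge list (stand-in for an RDF graph's
# subject->object structure).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

def profile(edges):
    """Compute a few simple graph metrics usable as profile features."""
    nodes = {n for e in edges for n in e}
    out_deg = Counter(s for s, _ in edges)
    n, m = len(nodes), len(edges)
    return {
        "nodes": n,
        "edges": m,
        "density": m / (n * (n - 1)),     # directed, no self-loops
        "max_out_degree": max(out_deg.values()),
    }

stats = profile(edges)
print(stats)
```

Feature vectors like this, one per dataset, are what the correlation and feature-impact analyses on the slide operate on.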
16. Beyond datasets: linking social sciences survey items
F. Bensmann, A. Papenmeier, D. Kern, B. Zapilko, S.
Dietze, Semantic Annotation, Representation and
Linking of Survey Data, in progress
Motivation
Surveys are costly: finding and reusing survey
questions/items
Linking semantically related questions/responses
across survey programmes: e.g. all
questions/responses which evaluate the
economic situation in Germany at present
Approach & results
Taxonomy of question features & vocabulary for
representing survey items & features
Initial ML models for predicting item features
o Multiclass classification models for predicting
information types (1st level: 3 classes, 2nd
level: 9 classes)
o LSTM, logistic regression, SVM, Naive Bayes,
Random Forest
o Reasonable performance, LSTM most robust
Example from ALLBUS 2018
Q: “How would you rate the current
economic conditions in Germany?”
[Diagram: taxonomy of question features]
o Information Type: Fact, Cognition, Evaluation, ...
o Focus: Self-focus, Object-focus, Family-Member, ...
o Time Reference: Past, Present, Future
o Periodicity: Point in time, Time span, Periodic, ...
o Relative Location: Apartment, Neighborhood, Country, ...
o Geo. Location: <Continent>, <Country>, <City>
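The slide names LSTM, SVM and other models for the multiclass task; as a self-contained illustration of the 3-class first level (information types), here is a tiny bag-of-words Naive Bayes. All training questions and labels are invented for this sketch:

```python
import math
from collections import Counter, defaultdict

# Invented toy training data: survey questions labelled with a
# first-level information type (Fact / Cognition / Evaluation).
TRAIN = [
    ("how would you rate the economic conditions", "Evaluation"),
    ("how satisfied are you with your apartment", "Evaluation"),
    ("in which year were you born", "Fact"),
    ("how many people live in your household", "Fact"),
    ("do you think crime will increase", "Cognition"),
    ("what do you expect for the future economy", "Cognition"),
]

def train(data):
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in data:
        class_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def predict(text, model):
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label in class_counts:
        lp = math.log(class_counts[label] / total)       # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():                           # Laplace smoothing
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train(TRAIN)
print(predict("how would you rate the current economic conditions in germany", model))
```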
17. Overview
Part I
Retrieving, extracting and linking research data (in particular: metadata) on the Web
Part II
Mining novel forms of research data (KGs) from the Web
[Diagram: Datasets, Metadata, Publications, Web pages, Opinions, Claims, Stances]
18. Traditional & novel forms of research data: the case of social sciences
Traditional social science research data: survey & census
data, microdata, lab studies etc (lack of scale, dynamics)
Social science vision: substituting & complementing
traditional research data through data mined from the Web
Example: investigations into misinformation and opinion
forming on Twitter (e.g. [Vosoughi et al. 2018])
Usually aims at gaining insights while also addressing
methodological/computational challenges
Insights, mostly in (computational) social sciences, e.g.
o Spreading of claims and misinformation
o Effect of biased and fake news on public opinions
Methods, mostly in computer science, e.g. for
o Crawling, harvesting, scraping of data
o Extraction of structured knowledge
(sentiments, stances, claims, etc)
o Claim/fact detection and verification (“fake news
detection”), e.g. CLEF 2018 Fact Checking Lab
o Stance detection, e.g. Fake News Challenge (FNC)
19. Mining opinions & interactions (the case of Twitter)
[Figure: example TweetsKB annotations, e.g. dbpedia:Tim_Berners-Lee with wna:positive-emotion, onyx:hasEmotionIntensity "0.75"]
Heterogeneity: multimodal, multilingual, informal,
“noisy” language
Context dependence: interpretation of
tweets/posts (entities, sentiments) requires
consideration of context (e.g. time, linked
content), “Dusseldorf” => City or Football team
Dynamics & scale: e.g. 6000 tweets per second,
plus interactions (retweets etc) and context (e.g.
25% of tweets contain URLs)
Evolution and temporal aspects: evolution of
interactions over time crucial for many social
sciences questions
Representativity and bias: demographic
distributions not known a priori in archived data
collections
P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public
and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
20. TweetsKB: a knowledge graph of Web mined “opinions”
https://data.gesis.org/tweetskb/
Harvesting & archiving of 9 Bn tweets over 6 years
(permanent collection from Twitter 1% sample since
2013)
Information extraction pipeline to build a KG of entities,
interactions & sentiments
(distributed batch processing via Hadoop Map/Reduce)
o Entity linking with knowledge graph/DBpedia
(Yahoo‘s FEL [Blanco et al. 2015])
(“president”/“potus”/”trump” =>
dbp:DonaldTrump), to disambiguate text and use
background knowledge (e.g. US politicians?
Republicans?), high precision (.85), low recall (.39)
o Sentiment analysis/annotation using SentiStrength
[Thelwall et al., 2017], F1 approx. .80
o Extraction of metadata and lifting into established
schemas (SIOC, schema.org), publication using W3C
standards (RDF/SPARQL)
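A typical use of such a corpus is aggregating sentiment per entity over time (see the CDU/SPD and "austerity"/"Greece" use cases below). A minimal sketch on invented records with a simplified schema (entity, year, positive score, negative score); the values and the single aggregate score are illustrative, not TweetsKB data:

```python
from statistics import mean

# Hypothetical TweetsKB-style annotations: one record per tweet mention.
records = [
    ("dbr:Greece", 2012, 0.2, 0.7),
    ("dbr:Greece", 2012, 0.1, 0.6),
    ("dbr:Greece", 2015, 0.4, 0.3),
]

def aggregate(records):
    """Mean (positive - negative) sentiment per (entity, year) group."""
    groups = {}
    for entity, year, pos, neg in records:
        groups.setdefault((entity, year), []).append(pos - neg)
    return {k: round(mean(v), 2) for k, v in groups.items()}

result = aggregate(records)
print(result)
```

In practice this aggregation would be expressed as a SPARQL query against the published RDF corpus rather than computed in application code.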
21. TweetsKB: a knowledge graph of Web mined “opinions” (continued)
Use cases
Aggregating sentiments towards topics/entities, e.g. about
CDU vs SPD politicians in particular time period
Twitter archives as general corpus for understanding temporal
entity relatedness (e.g. “austerity” & “Greece” 2010-2015)
Investigating spreading & impact of fake news
(e.g. TweetsKB, ClaimsKG, stance detection)
Limitations
Bias & representativity: demographic distributions of users
(not known a priori and not representative)
[Chart: aggregated sentiment over time, Cologne vs. Düsseldorf]
22. Mining knowledge about claims and stances
[Diagram: Web sources expressing stances towards claims; claim trustworthiness?]
23. Detecting stances towards claims/opinions
Motivation
Problem: detecting the stance of documents (e.g. Web
pages, scientific publications) towards a given claim
(unbalanced class distribution)
Motivation: document stance (in particular
disagreement) is useful (a) as a signal for truthfulness (fake
news detection) and (b) for document or source
classification (PLDs, publishers)
Approach
Cascading binary classifiers: addressing individual
issues (e.g. misclassification costs) per step
Features, e.g. textual similarity (Word2Vec etc),
sentiments, LIWC, etc.
Best-performing models: 1) SVM with class-wise
penalty, 2) CNN, 3) SVM with class-wise penalty
Experiments on FNC-1 dataset (and FNC baselines)
Results
Minor overall performance improvement
Improvement on disagree class by 27%
(but still far from robust)
A. Roy, A. Ekbal, S. Dietze, P. Fafalios, Exploiting
stance hierarchies for cost-sensitive stance detection
of Web documents, JCDL2020 under review.
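The cascade's control flow can be sketched as follows. The stub heuristics stand in for the trained SVM/CNN models named above, and all function names and rules are invented for illustration:

```python
# Cascading binary classifiers for stance detection: each step handles
# one decision, so misclassification costs can be tuned per step.

def is_related(doc, claim):
    # step 1: related vs. unrelated (in the paper: SVM with class-wise penalty)
    return bool(set(doc.lower().split()) & set(claim.lower().split()))

def takes_position(doc):
    # step 2: discuss vs. take-a-stance (in the paper: CNN)
    return not doc.lower().startswith("reportedly")

def agrees(doc):
    # step 3: agree vs. disagree (in the paper: SVM with class-wise penalty)
    return "not" not in doc.lower().split()

def stance(doc, claim):
    if not is_related(doc, claim):
        return "unrelated"
    if not takes_position(doc):
        return "discuss"
    return "agree" if agrees(doc) else "disagree"

print(stance("The earth is not flat", "earth is flat"))      # disagree
print(stance("Reportedly, the earth is flat", "earth is flat"))  # discuss
```

Splitting the task this way is what allows the minority "disagree" class to get its own cost-sensitive treatment in the final step.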
24. ClaimsKG: a knowledge graph of claims and claim-related metadata
Motivation
Claims spread across various
(unstructured) fact-checking sites
Example: finding claims about / made by
US republican politicians across the Web?
Approach
Harvesting claims & metadata from fact-
checking sites (e.g. snopes.com,
politifact.com etc.); currently approx.
30.000 claims, plus mining of
schema.org/ClaimReview markup
(> 500.000 statements in Common Crawl 2017)
Information extraction & linking
o Linking mentioned entities to DBpedia
o Normalisation of ratings (true, false,
mixture, other); coreference resolution
of claims
o Exposing data through established
vocabulary and W3C standards
(e.g. SPARQL endpoint)
https://data.gesis.org/claimskg/
A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K.
Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims,
ISWC2019
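The rating-normalisation step above boils down to mapping each site's raw verdict vocabulary onto the four ClaimsKG classes. A sketch with an illustrative mapping table (the project's actual table is larger and per-source):

```python
# Map site-specific verdict strings onto normalised rating classes
# (true, false, mixture, other); unknown verdicts fall back to "other".
RATING_MAP = {
    "true": "true", "mostly true": "true",
    "false": "false", "pants on fire": "false",
    "half true": "mixture", "mixture": "mixture",
    "unproven": "other",
}

def normalise(raw_rating):
    return RATING_MAP.get(raw_rating.strip().lower(), "other")

print(normalise("Pants on Fire"))   # false
print(normalise("Half True"))       # mixture
```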
25. Conclusions & open challenges
Retrieving, extracting, linking of research dataset metadata (KGs)
Mining of unstructured Web pages and scholarly articles for research datasets &
metadata
Profiling of research datasets for discovery, sampling, generation of synthetic data
Plenty of related initiatives and efforts
(e.g. Rich Context, Research Graph, OpenAIRE, ORKG)
Challenges: transparent/reproducible/reusable methods for extraction & mining
across domains and corpora
Mining and sharing novel forms of research data (KGs)
Mining the Web for novel forms of research data
Examples from social sciences: opinions (sentiments on entities) and interactions
on Twitter & structured knowledge about resource relations (for instance: stances)
and claims
Challenges: language understanding/interpretation, representativity and bias
26. Acknowledgements
• Maribel Acosta (KIT, Karlsruhe)
• Felix Bensmann (GESIS)
• Katarina Boland (GESIS, Germany)
• Stefan Conrad (HHU, Germany)
• Elena Demidova (L3S, Germany)
• Dimitar Dimitrov (GESIS, Germany)
• Asif Ekbal (IIT Patna, India)
• Pavlos Fafalios (FORTH ICS, Greece)
• Daniel Hienert (GESIS, Germany)
• Vasileios Iosifidis (L3S, Germany)
• Dagmar Kern (GESIS, Germany)
• Eirini Ntoutsi (LUH, Germany)
• Wolfgang Otto (GESIS, Germany)
• Andrea Papenmeier (GESIS, Germany)
• Markus Rokicki (L3S, Germany)
• Arjun Roy (IIT Patna, India)
• Renato Stoffalette Joao (L3S, Germany)
• Nicolas Tempelmeier (L3S, Germany)
• Konstantin Todorov (LIRMM, France)
• Ran Yu (GESIS, Germany)
• Benjamin Zapilko (GESIS, Germany)
• Matthäus Zloch (GESIS, Germany)
27. Knowledge Technologies for the Social Sciences (WTS)
https://www.gesis.org/en/institute/departments/knowledge-technologies-for-the-social-sciences/
WTS Labs
https://www.gesis.org/en/research/applied-computer-science/labs/wts-research-labs
Data & Knowledge Engineering @ HHU
https://www.cs.hhu.de/en/research-groups/data-knowledge-engineering.html
L3S
http://www.l3s.de
Personal
http://stefandietze.net