Backup
Beyond research data infrastructures - exploiting artificial & crowd
intelligence for building research knowledge graphs
Stefan Dietze
GESIS – Leibniz Institute for the Social Sciences &
Heinrich-Heine-Universität Düsseldorf
AIFB / KIT, Karlsruhe 21.02.2020
Backup
Beyond research data infrastructures - exploiting artificial & crowd
intelligence for building research knowledge graphs
Stefan Dietze
GESIS – Leibniz Institute for the Social Sciences &
Heinrich-Heine-Universität Düsseldorf
LWDA2019, 02 October 2019
(Buzzword) Bingo!?
research data infrastructure, data fusion, distant supervision, Web mining, distributional semantics, knowledge graph, neural entity linking, research data, machine learning, social web, artificial intelligence, semantics, claim extraction, stance detection, fact verification, crowd
Finding research data on the Web?
17/03/20 3Stefan Dietze
Finding (social sciences) research data on the Web
Overview
Part I
Retrieving, extracting and linking research data (in particular: metadata) on the Web
Part II
Mining novel forms of research data (KGs) from the Web
Datasets, Metadata, Publications, Web pages, Opinions, Claims, Stances
Web mining of dataset metadata (or: dataset KGs)
• Harvesting from open data portals (e.g. DCAT/VoID metadata on DataHub.io, DataCite etc.)
• Information extraction on the long tail of Web documents?
=> dynamics & scale: approx. 50 trillion (50,000,000,000,000) Web pages indexed by Google (plus countless temporal snapshots)
• Embedded markup (RDFa, Microdata, Microformats) for annotation of Web pages
• Supports Web search & interpretation
• Pushed by Google, Yahoo, Bing et al. (schema.org vocabulary)
• Adopted by 38% of all Web pages (sample: Common Crawl 2016, 3.2 bn Web pages)
• Easily accessible, large-scale source of factual knowledge (about research data & research information)
• Large-scale source of training data, e.g. manually annotated Web pages citing datasets
Facts (“quads”)
Subject  Predicate       Object          Source
node1    name            WB Commodity    URI-1
node1    distribution    node_xy         URI-1
node1    creator         Worldbank       URI-1
node1    dateCreated     26 April 2017   URI-1
node2    creator         World Bank      URI-2
node2    encodingFormat  text/CSV        URI-2
node3    dateCreated     26 April 2007   URI-3
node3    keywords        crude           URI-3
<div itemscope itemtype="http://schema.org/Dataset">
<h1 itemprop="name">World Bank-Commodity Prices</h1>
<span itemprop="distribution">URL-X</span>
<span itemprop="license">CC-BY</span>
...
</div>
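The step from embedded markup to quads can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the extraction pipeline used for Common Crawl: the class name `MicrodataQuadExtractor`, the function `extract_quads` and the `node1`-style identifiers are hypothetical, and the sketch assumes flat Microdata (no nested items, no markup inside an `itemprop` element).

```python
from html.parser import HTMLParser


class MicrodataQuadExtractor(HTMLParser):
    """Collects (node, property, value, source) quads from flat Microdata.

    Assumption: itemprop elements contain plain text only and items
    are not nested. Real-world markup is messier than this.
    """

    def __init__(self, source_uri):
        super().__init__()
        self.source_uri = source_uri
        self.quads = []
        self._node_count = 0
        self._node = None   # identifier of the current itemscope
        self._prop = None   # itemprop whose text is being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # Every itemscope opens a new (hypothetically named) node.
            self._node_count += 1
            self._node = f"node{self._node_count}"
            if attrs.get("itemtype"):
                self.quads.append(
                    (self._node, "type", attrs["itemtype"], self.source_uri)
                )
        elif "itemprop" in attrs and self._node is not None:
            self._prop = attrs["itemprop"]
            self._text = []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # With flat markup, the next end tag closes the itemprop element.
        if self._prop is not None:
            value = "".join(self._text).strip()
            self.quads.append((self._node, self._prop, value, self.source_uri))
            self._prop = None


def extract_quads(html, source_uri):
    parser = MicrodataQuadExtractor(source_uri)
    parser.feed(html)
    parser.close()
    return parser.quads


PAGE = """<div itemscope itemtype="http://schema.org/Dataset">
<h1 itemprop="name">World Bank-Commodity Prices</h1>
<span itemprop="distribution">URL-X</span>
<span itemprop="license">CC-BY</span>
</div>"""

for quad in extract_quads(PAGE, "URI-1"):
    print(quad)
```

Keeping the source URI as the fourth element is what later allows source-level signals to be attached to each fact.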
Research dataset markup on the Web
• In Common Crawl 2017 (3.2 bn pages):
o 14.1 M statements & 3.4 M instances related to “s:Dataset”
o Spread across 200 K pages from 2878 PLDs (pay-level domains); the top 10% of PLDs provide 95% of the data
• Studies of scholarly articles and other types [SAVESD16, WWW2017]: the majority of major publishers, data hosting sites, data registries, libraries and research organisations are represented
[Figure: power law distribution of dataset metadata across PLDs]
• Challenges
o Errors: factual errors, annotation errors (see also [Meusel et al., ESWC2015])
o Ambiguity & coreferences: e.g. 18,000 entity descriptions of “iPhone 6” in Common Crawl 2016 & ambiguous literals (e.g. “Apple”)
o Redundancies & conflicts: vast amounts of equivalent or conflicting statements
KnowMore: data fusion on markup
0. Noise: data cleansing (node URIs, deduplication etc.)
1.a) Scale: blocking through BM25 entity retrieval on markup index
1.b) Relevance: supervised coreference resolution
2.) Quality & redundancy: data fusion through supervised fact classification (SVM, kNN, RF, LR, NB) with a diverse feature set (authority, relevance etc.), considering source-level (e.g. PageRank), entity-level and fact-level features
[Figure: KnowMore pipeline on Web page markup from a Web crawl (Common Crawl, 44 bn facts): 1. Blocking & coreference resolution; 2. Fusion / fact selection (supervised)]
Query
WorldBank Commodity Prices 2019, type:(Dataset)
Candidate Facts
node1  name            WB Commodity
node1  distribution    node_xy
node1  creator         Worldbank
node1  dateReleased    26 April 2019
node2  creator         World Bank
node2  encodingFormat  text/CSV
node3  dateCreated     26 April 2007
node4  keywords        “crude”
Entity Description
name            “WorldBank Commodity Prices 2019”
distribution    Worldbank (node)
releaseDate     26.04.2019
keywords        “crude”, “prizes”, “market”
encodingFormat  text/CSV
New Queries
WorldBank, type:(Organization)
Washington, type:(City)
David Malpass, type:(Person)
approx. 125,000 facts for query [ s:Product, “iPhone6” ]
Yu, R., [..], Dietze, S., KnowMore – Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2019 (SWJ2019)
Tempelmeier, N., Demidova, E., Dietze, S., Inferring Missing Categorical Information in Noisy and Sparse Web Markup, The Web Conf. 2018 (WWW2018)
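The blocking step (1.a) can be illustrated with a small BM25 ranker over entity descriptions. This is only a sketch under stated assumptions, not the KnowMore implementation: the function names `bm25_scores` and `block`, the whitespace tokenisation, the parameters `k1=1.2` and `b=0.75` (common defaults) and the smoothed IDF variant are all choices made here for illustration. The toy descriptions simply concatenate the candidate facts shown on the slide.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of each document for the query (whitespace tokens)."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many descriptions does a term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative (an assumption of this sketch).
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def block(query, descriptions, top_k=10):
    """Return ids of the top_k descriptions with a non-zero BM25 score."""
    ids = list(descriptions)
    scores = bm25_scores(query, [descriptions[i] for i in ids])
    ranked = sorted(zip(ids, scores), key=lambda pair: pair[1], reverse=True)
    return [i for i, s in ranked[:top_k] if s > 0]


# Toy entity descriptions built from the candidate facts above.
DESCRIPTIONS = {
    "node1": "WB Commodity Worldbank 26 April 2019",
    "node2": "World Bank text/CSV",
    "node3": "26 April 2007",
    "node4": "crude",
}
QUERY = "WorldBank Commodity Prices 2019"

print(block(QUERY, DESCRIPTIONS))
```

Purely lexical blocking misses node2 here ("World Bank" vs. "WorldBank"), which is one reason a supervised coreference step follows the retrieval step.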
Fusion performance
• Experiments on books, movies, products (ongoing: datasets)
• Baselines: BM25, CBFS [ESWC2015], PreRecCorr [Pochampally et al., ACM SIGMOD 2014]; strong variance across types
Knowledge Graph Augmentation
• On average 60% - 70% of all facts are new (across DBpedia, Wikidata, Freebase)
• Additional experiments on learning new categorical features (e.g. product categories or movie genres) [WWW2018]
Applications: search for (social sciences) datasets, resources & relations
https://search.gesis.org/
Dataset
Rel. Publications
Disambiguation of datasets,
methods, software, authors, topics.
Rich Context & Coleridge Initiative
building (yet another) KG of scholarly resources & datasets
• Context/corpus: publications (currently: social sciences, SAGE Publishing)
• Tasks:
I. Extraction/disambiguation of dataset mentions
II. Extraction/detection of research methods
III. Classification of research fields
https://coleridgeinitiative.org/richcontextcompetition
Disambiguation of dataset citations
Otto, W. et al., Knowledge Extraction from scholarly publications – the GESIS contribution to the Rich Context Competition, to appear, Sage Publishing, 2020
All these issues are addressed in the current report,
which is based on analysis of data obtained in the
National Comorbidity Survey (NCS) (15). The NCS is
a nationally representative survey of the US household
population that includes retrospective reports about the
ages at onset and lifetime occurrences of suicidal
ideation, plans, and attempts along with information
about the occurrences of mental disorders, substance
use, substance abuse, and substance dependence.
National Comorbidity Survey (NCS) NCS
Challenges
• Ambiguous (incomplete) citations
• Lack of high-quality and representative training data (usually: weak labels, domain bias)
Approaches & results
• Prior work: supervised pattern induction [Boland et al., TPDL2012]
• Current approach:
o Neural NER based on spaCy (CRF-based approach for research method detection)
o Training (testing) on 12,000 (3,000) paragraphs (distribution of negative/positive examples differs; training batch size = 25, dropout = 0.4)
o Results on weakly labelled test data: approx. P = .50, R = .90
o On a small set of manually labelled test data: P = .52, R = .21
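To compare the two evaluation settings with a single number, precision and recall can be combined into F1. The slide reports only P and R, so the F1 values below are derived here and are not reported results; the function name `f1_score` is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# P/R pairs as given on the slide; the F1 values are derived, not reported.
weakly_labelled = f1_score(0.50, 0.90)
manually_labelled = f1_score(0.52, 0.21)

print(f"weakly labelled test data:   F1 = {weakly_labelled:.2f}")
print(f"manually labelled test data: F1 = {manually_labelled:.2f}")
```

Precision is almost unchanged between the two settings, so the difference in F1 comes almost entirely from the drop in recall on the manually labelled data.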
Profiling (Graph) Datasets
Zloch, M., Acosta, M., Hienert, D., Dietze, S., Conrad, S., A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs, ESWC19, Best Student Paper
Motivation
• Profiling datasets: extracting representative dataset metadata, e.g. to distinguish datasets of different kinds, find/discover datasets, generate synthetic datasets
• Research question: what are effective graph metrics to profile graph-based research data (social graphs, knowledge graphs)?
Methods & results
• Framework for profiling datasets based on 60 different graph metrics
• Feature engineering (correlation analysis etc.) and feature impact analysis
• Certain dataset categories are hard to describe/distinguish due to the inherent diversity/variance of datasets
• The set of descriptive, non-redundant dataset profile features varies across dataset categories
[Figure: feature homogeneity (lighter colour = more homogeneous metric within domain)]
[Figure: feature impact in binary classification task (RF)]
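What a graph-metric profile looks like can be sketched with a handful of basic measures computed from an edge list. This is an illustration only: the framework described above uses 60 metrics, whereas the five measures and the function name `profile_graph` below are choices made here, and the edge list is a made-up toy graph.

```python
from collections import Counter
from statistics import mean


def profile_graph(edges):
    """Basic profile of a directed graph given as (source, target) pairs."""
    edges = set(edges)  # ignore parallel edges
    nodes = {node for edge in edges for node in edge}
    n, m = len(nodes), len(edges)
    in_deg = Counter(target for _, target in edges)
    out_deg = Counter(source for source, _ in edges)
    in_degrees = [in_deg.get(v, 0) for v in nodes]
    out_degrees = [out_deg.get(v, 0) for v in nodes]
    return {
        "nodes": n,
        "edges": m,
        # Share of possible directed edges (without self-loops) that exist.
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
        "mean_in_degree": mean(in_degrees) if in_degrees else 0.0,
        "max_in_degree": max(in_degrees, default=0),
        "max_out_degree": max(out_degrees, default=0),
    }


# Hypothetical toy graph, e.g. subject -> object links of four RDF triples.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]

for metric, value in profile_graph(EDGES).items():
    print(metric, value)
```

Profiles of this kind, computed per dataset, form the feature vectors on which correlation and feature impact analysis can then be run.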
Beyond datasets: linking social sciences survey items
F. Bensmann, A. Papenmeier, D. Kern, B. Zapilko, S. Dietze, Semantic Annotation, Representation and Linking of Survey Data, in progress
Motivation
• Surveys are costly: finding and reusing survey questions/items
• Linking semantically related questions/responses across survey programmes, e.g. all questions/responses which evaluate the economic situation in Germany at present
Approach & results
• Taxonomy of question features & vocabulary for representing survey items & features
• Initial ML models for predicting item features
o Multiclass classification models for predicting information types (1st level: 3 classes, 2nd level: 9 classes)
o LSTM, logistic regression, SVM, Naive Bayes, Random Forest
o Reasonable performance, LSTM most robust
Example from ALLBUS 2018
Q: “How would you rate the current economic conditions in Germany?”
[Figure: taxonomy of question features with example values]
Information Type: Fact, Cognition, Evaluation
Focus: Self-focus, Family-Member, Object-focus
Time Reference: Past, Present, Future
Periodicity: Point in time, Time span, Periodic Point, ...
Relative Location: Apartment, Neighborhood, Country
Geo. Location: <Continent>, <Country>, <City>
Overview
Part I
Retrieving, extracting and linking research data (in particular: metadata) on the Web
Part II
Mining novel forms of research data (KGs) from the Web
Datasets
Metadata
Publications
Web pages
Opinions
Claims
Stances
Traditional & novel forms of research data: the case of social sciences
• Traditional social science research data: survey & census data, microdata, lab studies etc. (lack of scale, dynamics)
• Social science vision: substituting & complementing traditional research data through data mined from the Web
• Example: investigations into misinformation and opinion forming on Twitter (e.g. [Vosoughi et al. 2018])
• Usually aims at gaining insights while also dealing with methodological/computational challenges
• Insights, mostly (computational) social sciences, e.g.
o Spreading of claims and misinformation
o Effect of biased and fake news on public opinions
• Methods, mostly in computer science, e.g. for
o Crawling, harvesting, scraping of data
o Extraction of structured knowledge (sentiments, stances, claims, etc.)
o Claim/fact detection and verification (“fake news detection”), e.g. CLEF 2018 Fact Checking Lab
o Stance detection, e.g. Fake News Challenge (FNC)
Mining opinions & interactions (the case of Twitter)
• Heterogeneity: multimodal, multilingual, informal, “noisy” language
• Context dependence: interpretation of tweets/posts (entities, sentiments) requires consideration of context (e.g. time, linked content), e.g. “Dusseldorf” => city or football team?
• Dynamics & scale: e.g. 6,000 tweets per second, plus interactions (retweets etc.) and context (e.g. 25% of tweets contain URLs)
• Evolution and temporal aspects: evolution of interactions over time is crucial for many social sciences questions
• Representativity and bias: demographic distributions not known a priori in archived data collections
[Figure: example annotations with the entities http://dbpedia.org/resource/Tim_Berners-Lee and http://dbpedia.org/resource/Solid, the emotions wna:positive-emotion and wna:negative-emotion, and onyx:hasEmotionIntensity values "0.75" and "0.0"]
P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
TweetsKB: a knowledge graph of Web mined “opinions”
https://data.gesis.org/tweetskb/
• Harvesting & archiving of 9 bn tweets over 6 years (permanent collection from Twitter 1% sample since 2013)
• Information extraction pipeline to build a KG of entities, interactions & sentiments (distributed batch processing via Hadoop Map/Reduce)
o Entity linking with knowledge graph/DBpedia (Yahoo's FEL [Blanco et al. 2015]), e.g. “president”/“potus”/“trump” => dbp:DonaldTrump, to disambiguate text and use background knowledge (e.g. US politicians? Republicans?); high precision (.85), low recall (.39)
o Sentiment analysis/annotation using SentiStrength [Thelwall et al., 2017], F1 approx. .80
o Extraction of metadata and lifting into established schemas (SIOC, schema.org), publication using W3C standards (RDF/SPARQL)
Use cases
• Aggregating sentiments towards topics/entities, e.g. about CDU vs SPD politicians in a particular time period
• Twitter archives as general corpus for understanding temporal entity relatedness (e.g. “austerity” & “Greece” 2010-2015)
• Investigating spreading & impact of fake news (e.g. TweetsKB, ClaimsKG, stance detection)
Limitations
• Bias & representativity: demographic distributions of users (not known a priori and not representative)
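The first use case, aggregating sentiment towards entities within a time period, can be sketched as follows. The record layout (`entity`, `date`, `positive`, `negative`), the function name `aggregate_sentiment` and all values are hypothetical and chosen for illustration; they do not reproduce the TweetsKB schema or any real measurements.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def aggregate_sentiment(annotations, start, end):
    """Mean (positive - negative) intensity per entity for start <= date <= end."""
    per_entity = defaultdict(list)
    for a in annotations:
        if start <= a["date"] <= end:
            per_entity[a["entity"]].append(a["positive"] - a["negative"])
    return {entity: mean(values) for entity, values in per_entity.items()}


# Made-up annotations: one record per (tweet, linked entity) pair.
ANNOTATIONS = [
    {"entity": "dbr:Cologne", "date": date(2018, 3, 1), "positive": 0.75, "negative": 0.0},
    {"entity": "dbr:Cologne", "date": date(2018, 3, 9), "positive": 0.25, "negative": 0.5},
    {"entity": "dbr:Düsseldorf", "date": date(2018, 3, 5), "positive": 0.0, "negative": 0.5},
    {"entity": "dbr:Düsseldorf", "date": date(2019, 1, 1), "positive": 0.75, "negative": 0.0},
]

result = aggregate_sentiment(ANNOTATIONS, date(2018, 3, 1), date(2018, 3, 31))
for entity, score in sorted(result.items()):
    print(entity, score)
```

Such per-entity averages inherit the bias noted under Limitations: they describe the users who tweet about an entity, not the population at large.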
[Figure: bar chart comparing Cologne and Düsseldorf on a value axis from -0.4 to 0.4]
Mining knowledge about claims and stances
stance,
claim trustworthiness?
Detecting stances towards claims/opinions
Motivation
• Problem: detecting the stance of documents (e.g. Web pages, scientific publications) towards a given claim (unbalanced class distribution)
• Motivation: stance of documents (in particular disagreement) is useful (a) as a signal for truthfulness (fake news detection) and (b) for document or source classification (PLDs, publishers)
Approach
• Cascading binary classifiers: addressing individual issues (e.g. misclassification costs) per step
• Features, e.g. textual similarity (Word2Vec etc.), sentiments, LIWC, etc.
• Best-performing models per step: 1) SVM with class-wise penalty, 2) CNN, 3) SVM with class-wise penalty
• Experiments on FNC-1 dataset (and FNC baselines)
Results
• Minor overall performance improvement
• Improvement on disagree class by 27% (but still far from robust)
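The control flow of a cascade of binary classifiers can be sketched as below. The four output labels are those of FNC-1; the order of the three decisions (related?, takes a stance?, agrees?) is an assumption of this sketch, and the keyword rules are toy stand-ins for the trained SVM/CNN models, not the approach evaluated above. All function names are hypothetical.

```python
import string


def _words(text):
    """Lower-cased tokens with surrounding punctuation removed."""
    return {w.strip(string.punctuation) for w in text.lower().split()}


# Toy stand-ins for trained binary classifiers (assumptions, keyword-based).
DENY_CUES = {"hoax", "false", "denies", "debunked"}
SUPPORT_CUES = {"confirm", "confirms", "confirmed"}


def is_related(claim, document):
    content = {w for w in _words(claim) if len(w) > 3}
    return bool(content & _words(document))


def takes_stance(claim, document):
    return bool(_words(document) & (DENY_CUES | SUPPORT_CUES))


def agrees(claim, document):
    return not (_words(document) & DENY_CUES)


def cascade_stance(claim, document):
    """Three binary decisions in sequence yield one of four FNC-1 labels."""
    if not is_related(claim, document):
        return "unrelated"
    if not takes_stance(claim, document):
        return "discuss"
    return "agree" if agrees(claim, document) else "disagree"


CLAIM = "The city opened a new bridge"
DOCS = [
    "Stock markets rallied today",
    "Residents talk about the planned bridge",
    "Officials confirm the new bridge opened",
    "Report that the bridge opened is a hoax",
]
for doc in DOCS:
    print(cascade_stance(CLAIM, doc), "<-", doc)
```

Because each step is a separate binary problem, class weights or misclassification costs can be tuned per step, which is where a rare class such as disagree can be given extra weight.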
A. Roy, A. Ekbal, S. Dietze, P. Fafalios, Exploiting
stance hierarchies for cost-sensitive stance detection
of Web documents, JCDL2020 under review.
ClaimsKG: a knowledge graph of claims and claim-related metadata
Motivation
• Claims spread across various (unstructured) fact-checking sites
• Example: finding claims about / made by US Republican politicians across the Web?
Approach
• Harvesting claims & metadata from fact-checking sites (e.g. snopes.com, Politifact.com etc.); currently approx. 30,000 claims, plus mining of schema.org/ClaimReview markup (> 500,000 statements in Common Crawl 2017)
• Information extraction & linking
o Linking mentioned entities to DBpedia
o Normalisation of ratings (true, false, mixture, other); coreference resolution of claims
o Exposing data through established vocabulary and W3C standards (e.g. SPARQL endpoint)
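Rating normalisation can be pictured as a lookup from site-specific labels to the four target categories named above. The label table below is an illustrative assumption and not the mapping used by ClaimsKG; the names `RATING_MAP` and `normalise_rating` are hypothetical.

```python
# Illustrative mapping from site-specific labels to the four normalised
# categories; the concrete label assignments are assumptions of this sketch.
RATING_MAP = {
    "true": "true",
    "correct": "true",
    "false": "false",
    "pants on fire": "false",
    "mostly true": "mixture",
    "half true": "mixture",
    "mostly false": "mixture",
    "mixture": "mixture",
}


def normalise_rating(label):
    """Map a raw rating label to true / false / mixture / other."""
    key = label.lower().replace("-", " ").strip(" !.?")
    key = " ".join(key.split())  # collapse repeated whitespace
    return RATING_MAP.get(key, "other")


for raw in ["True", "Pants on Fire!", "Half-True", "Unproven"]:
    print(raw, "->", normalise_rating(raw))
```

Anything not covered by the table falls into "other", which keeps the normalised vocabulary closed even when a site introduces a new label.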
https://data.gesis.org/claimskg/
A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K.
Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims,
ISWC2019
Conclusions & open challenges
Retrieving, extracting, linking of research dataset metadata (KGs)
• Mining of unstructured Web pages and scholarly articles for research datasets & metadata
• Profiling of research datasets for discovery, sampling, generation of synthetic data
• Plenty of related initiatives and efforts (e.g. Rich Context, Research Graph, OpenAIRE, ORKG)
• Challenges: transparent/reproducible/reusable methods for extraction & mining across domains and corpora
Mining and sharing novel forms of research data (KGs)
• Mining the Web for novel forms of research data
• Examples from social sciences: opinions (sentiments on entities) and interactions on Twitter & structured knowledge about resource relations (for instance: stances) and claims
• Challenges: language understanding/interpretation, representativity and bias
Acknowledgements
• Maribel Acosta (KIT, Karlsruhe)
• Felix Bensmann (GESIS)
• Katarina Boland (GESIS, Germany)
• Stefan Conrad (HHU, Germany)
• Elena Demidova (L3S, Germany)
• Dimitar Dimitrov (GESIS, Germany)
• Asif Ekbal (IIT Patna, India)
• Pavlos Fafalios (FORTH ICS, Greece)
• Daniel Hienert (GESIS, Germany)
• Vasileios Iosifidis (L3S, Germany)
• Dagmar Kern (GESIS, Germany)
• Eirini Ntoutsi (LUH, Germany)
• Wolfgang Otto (GESIS, Germany)
• Andrea Papenmeier (GESIS, Germany)
• Markus Rokicki (L3S, Germany)
• Arjun Roy (IIT Patna, India)
• Renato Stoffalette Joao (L3S, Germany)
• Nicolas Tempelmeier (L3S, Germany)
• Konstantin Todorov (LIRMM, France)
• Ran Yu (GESIS, Germany)
• Benjamin Zapilko (GESIS, Germany)
• Matthäus Zloch (GESIS, Germany)
Knowledge Technologies for the Social Sciences (WTS)
https://www.gesis.org/en/institute/departments/knowledge-technologies-for-the-social-sciences/
WTS Labs
https://www.gesis.org/en/research/applied-computer-science/labs/wts-research-labs
Data & Knowledge Engineering @ HHU
https://www.cs.hhu.de/en/research-groups/data-knowledge-engineering.html
L3S
http://www.l3s.de
Personal
http://stefandietze.net

Weitere ähnliche Inhalte

Was ist angesagt?

Research Knowledge Graphs at GESIS & NFDI4DataScience
Research Knowledge Graphs at GESIS & NFDI4DataScienceResearch Knowledge Graphs at GESIS & NFDI4DataScience
Research Knowledge Graphs at GESIS & NFDI4DataScienceStefan Dietze
 
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...Stefan Dietze
 
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the WebRetrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the WebStefan Dietze
 
Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Stefan Dietze
 
Omitola birmingham cityuniv
Omitola birmingham cityunivOmitola birmingham cityuniv
Omitola birmingham cityunivTope Omitola
 
International Collaboration Networks in the Emerging (Big) Data Science
International Collaboration Networks in the Emerging (Big) Data ScienceInternational Collaboration Networks in the Emerging (Big) Data Science
International Collaboration Networks in the Emerging (Big) Data Sciencedatasciencekorea
 
Tuning Personalized PageRank for Semantics-aware Recommendations based on Lin...
Tuning Personalized PageRank for Semantics-aware Recommendations based on Lin...Tuning Personalized PageRank for Semantics-aware Recommendations based on Lin...
Tuning Personalized PageRank for Semantics-aware Recommendations based on Lin...Cataldo Musto
 
Gain Super Powers in Data Science: Relationship Discovery Across Public Data
Gain Super Powers in Data Science: Relationship Discovery Across Public DataGain Super Powers in Data Science: Relationship Discovery Across Public Data
Gain Super Powers in Data Science: Relationship Discovery Across Public DataOntotext
 
Prov-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationProv-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationRinke Hoekstra
 
WWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & EducationWWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & EducationStefan Dietze
 
Demo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open DataDemo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open DataStefan Dietze
 
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)Rinke Hoekstra
 
Recommender Systems based on Linked Open Data
Recommender Systems based on Linked Open DataRecommender Systems based on Linked Open Data
Recommender Systems based on Linked Open DataCataldo Musto
 
Oop principles a good book
Oop principles a good bookOop principles a good book
Oop principles a good booklahorisher
 
Boost your data analytics with open data and public news content
Boost your data analytics with open data and public news contentBoost your data analytics with open data and public news content
Boost your data analytics with open data and public news contentOntotext
 
Diving in Panama Papers and Open Data to Discover Emerging News
Diving in Panama Papers and Open Data to Discover Emerging NewsDiving in Panama Papers and Open Data to Discover Emerging News
Diving in Panama Papers and Open Data to Discover Emerging NewsOntotext
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
 
Linked data introduction w exempel
Linked data introduction w exempelLinked data introduction w exempel
Linked data introduction w exempelKerstin Forsberg
 
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge GraphsExtracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge GraphsCraig Knoblock
 

Was ist angesagt? (20)

Research Knowledge Graphs at GESIS & NFDI4DataScience
Research Knowledge Graphs at GESIS & NFDI4DataScienceResearch Knowledge Graphs at GESIS & NFDI4DataScience
Research Knowledge Graphs at GESIS & NFDI4DataScience
 
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
 
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the WebRetrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
 
Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)
 
Omitola birmingham cityuniv
Omitola birmingham cityunivOmitola birmingham cityuniv
Omitola birmingham cityuniv
 
International Collaboration Networks in the Emerging (Big) Data Science
International Collaboration Networks in the Emerging (Big) Data ScienceInternational Collaboration Networks in the Emerging (Big) Data Science
International Collaboration Networks in the Emerging (Big) Data Science
 
Tuning Personalized PageRank for Semantics-aware Recommendations based on Lin...
Tuning Personalized PageRank for Semantics-aware Recommendations based on Lin...Tuning Personalized PageRank for Semantics-aware Recommendations based on Lin...
Tuning Personalized PageRank for Semantics-aware Recommendations based on Lin...
 
Gain Super Powers in Data Science: Relationship Discovery Across Public Data
Gain Super Powers in Data Science: Relationship Discovery Across Public DataGain Super Powers in Data Science: Relationship Discovery Across Public Data
Gain Super Powers in Data Science: Relationship Discovery Across Public Data
 
Prov-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationProv-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance Visualization
 
WWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & EducationWWW2013 Tutorial: Linked Data & Education
WWW2013 Tutorial: Linked Data & Education
 
Demo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open DataDemo: Profiling & Exploration of Linked Open Data
Demo: Profiling & Exploration of Linked Open Data
 
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
 
Recommender Systems based on Linked Open Data
Recommender Systems based on Linked Open DataRecommender Systems based on Linked Open Data
Recommender Systems based on Linked Open Data
 
Oop principles a good book
Oop principles a good bookOop principles a good book
Oop principles a good book
 
Boost your data analytics with open data and public news content
Boost your data analytics with open data and public news contentBoost your data analytics with open data and public news content
Boost your data analytics with open data and public news content
 
Diving in Panama Papers and Open Data to Discover Emerging News
Diving in Panama Papers and Open Data to Discover Emerging NewsDiving in Panama Papers and Open Data to Discover Emerging News
Diving in Panama Papers and Open Data to Discover Emerging News
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 
Linked data introduction w exempel
Linked data introduction w exempelLinked data introduction w exempel
Linked data introduction w exempel
 
CODATA: Open Data, FAIR Data and Open Science/Simon Hodson
CODATA: Open Data, FAIR Data and Open Science/Simon HodsonCODATA: Open Data, FAIR Data and Open Science/Simon Hodson
CODATA: Open Data, FAIR Data and Open Science/Simon Hodson
 
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge GraphsExtracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
 

Ähnlich wie Towards research data knowledge graphs

Mining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebMining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebStefan Dietze
 
What's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsWhat's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsStefan Dietze
 
Semantic Linking & Retrieval for Digital Libraries
Semantic Linking & Retrieval for Digital LibrariesSemantic Linking & Retrieval for Digital Libraries
Semantic Linking & Retrieval for Digital LibrariesStefan Dietze
 
International Journal of Data Mining & Knowledge Management Process(IJDKP)
International Journal of Data Mining & Knowledge Management Process(IJDKP)International Journal of Data Mining & Knowledge Management Process(IJDKP)
International Journal of Data Mining & Knowledge Management Process(IJDKP)albert ca
 
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...Stefan Dietze
 
Towards research data knowledge graphs

  • 1. Backup Beyond research data infrastructures - exploiting artificial & crowd intelligence for building research knowledge graphs Stefan Dietze GESIS – Leibniz Institute for the Social Sciences & Heinrich-Heine-Universität Düsseldorf AIFB / KIT, Karlsruhe 21.02.2020
  • 2. Backup. Beyond research data infrastructures - exploiting artificial & crowd intelligence for building research knowledge graphs. Stefan Dietze, GESIS – Leibniz Institute for the Social Sciences & Heinrich-Heine-Universität Düsseldorf. LWDA2019, 02 October 2019.
       (Buzzword) Bingo!? research data infrastructure, data fusion, distant supervision, Web mining, distributional semantics, knowledge graph, neural entity linking, research data, machine learning, social web, artificial intelligence, semantics, claim extraction, stance detection, fact verification, crowd
  • 3. Finding research data on the Web? (17/03/20, Stefan Dietze)
  • 4. Finding research data on the Web?
  • 5. Finding research data on the Web?
  • 6. Finding (social sciences) research data on the Web
  • 7. Overview
       Part I: Retrieving, extracting and linking research data (in particular: metadata) on the Web
       Part II: Mining novel forms of research data (KGs) from the Web
       Diagram labels: Datasets, Metadata, Publications, Web pages, Opinions, Claims, Stances
  • 8. Web mining of dataset metadata (or: dataset KGs)
       - Harvesting from open data portals (e.g. DCAT/VoID metadata on DataHub.io, DataCite etc.)
       - Information extraction on the long tail of Web documents? => dynamics & scale: approx. 50 trillion (50,000,000,000,000) Web pages indexed by Google (plus countless temporal snapshots)
       - Embedded markup (RDFa, Microdata, Microformats) for annotation of Web pages
       - Supports Web search & interpretation
       - Pushed by Google, Yahoo, Bing et al. (schema.org vocabulary)
       - Adopted by 38% of all Web pages (sample: Common Crawl 2016, 3.2 bn Web pages)
       - Easily accessible, large-scale source of factual knowledge (about research data & research information)
       - Large-scale source of training data, e.g. manually annotated Web pages citing datasets
       Example markup:
         <div itemscope itemtype="http://schema.org/Dataset">
           <h1 itemprop="name">World Bank-Commodity Prices</h1>
           <span itemprop="distribution">URL-X</span>
           <span itemprop="license">CC-BY</span>
           ...
         </div>
       Facts ("quads"):
         subject  predicate       object          source
         node1    name            WB Commodity    URI-1
         node1    distribution    node_xy         URI-1
         node1    creator         Worldbank       URI-1
         node1    dateCreated     26 April 2017   URI-1
         node2    creator         World Bank      URI-2
         node2    encodingFormat  text/CSV        URI-2
         node3    dateCreated     26 April 2007   URI-3
         node3    keywords        crude           URI-3
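The step from embedded Microdata to quads can be illustrated with a minimal sketch. It assumes flat, non-nested `itemscope` blocks and uses only Python's standard `html.parser`; the class name `MicrodataQuadParser`, the helper `extract_quads`, and the `type` predicate are hypothetical names chosen for this illustration, not part of the talk or of any markup extraction library.

```python
from html.parser import HTMLParser

EXAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Dataset">
  <h1 itemprop="name">World Bank-Commodity Prices</h1>
  <span itemprop="distribution">URL-X</span>
  <span itemprop="license">CC-BY</span>
</div>
"""


class MicrodataQuadParser(HTMLParser):
    """Collects (subject, predicate, object, source) tuples from Microdata.

    Simplification (assumption): nested itemscope blocks are not tracked,
    so every property is attached to the most recently opened item.
    """

    def __init__(self, source_uri):
        super().__init__()
        self.source_uri = source_uri
        self.quads = []
        self._node_count = 0
        self._current_node = None
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # Each itemscope opens a new blank node: node1, node2, ...
            self._node_count += 1
            self._current_node = f"node{self._node_count}"
            if attrs.get("itemtype"):
                self.quads.append(
                    (self._current_node, "type", attrs["itemtype"], self.source_uri)
                )
        if "itemprop" in attrs and self._current_node:
            self._current_prop = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self._current_prop and text:
            self.quads.append(
                (self._current_node, self._current_prop, text, self.source_uri)
            )
            self._current_prop = None

    def handle_endtag(self, tag):
        # A property without text content yields no fact.
        self._current_prop = None


def extract_quads(html, source_uri):
    parser = MicrodataQuadParser(source_uri)
    parser.feed(html)
    return parser.quads


if __name__ == "__main__":
    for quad in extract_quads(EXAMPLE_HTML, "URI-1"):
        print(quad)
```

Keeping the source URI as the fourth element is what later allows source-level features (such as the authority of the page) to be attached to each fact.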
  • 9. Research dataset markup on the Web
       - In Common Crawl 2017 (3.2 bn pages):
         o 14.1 M statements & 3.4 M instances related to "s:Dataset"
         o Spread across 200 K pages from 2878 pay-level domains (PLDs); the top 10% of PLDs provide 95% of the data
       - Studies of scholarly articles and other types [SAVESD16, WWW2017]: the majority of major publishers, data hosting sites, data registries, libraries and research organisations are represented; dataset metadata follows a power-law distribution across PLDs
       - Challenges
         o Errors: factual errors, annotation errors (see also [Meusel et al., ESWC2015])
         o Ambiguity & coreferences: e.g. 18,000 entity descriptions of "iPhone 6" in Common Crawl 2016 & ambiguous literals (e.g. "Apple")
         o Redundancies & conflicts: vast amounts of equivalent or conflicting statements
  • 10. KnowMore: data fusion on markup
       Pipeline:
       - 0. Noise: data cleansing (node URIs, deduplication etc.)
       - 1.a) Scale: blocking through BM25 entity retrieval on markup index
       - 1.b) Relevance: supervised coreference resolution
       - 2.) Quality & redundancy: data fusion through supervised fact classification (SVM, kNN, RF, LR, NB) with a diverse feature set (authority, relevance etc.), considering source-level (e.g. PageRank), entity-level & fact-level features
       Example:
       - Web page markup: Web crawl (Common Crawl, 44 bn facts); approx. 125,000 facts for query [ s:Product, "iPhone6" ]
       - Query: WorldBank Commodity Prices 2019, type:(Dataset)
       - Candidate facts (1. Blocking & coreference resolution):
           node1  name            WB Commodity
           node1  distribution    node_xy
           node1  creator         Worldbank
           node1  dateReleased    26 April 2019
           node2  creator         World Bank
           node2  encodingFormat  text/CSV
           node3  dateCreated     26 April 2007
           node4  keywords        "crude"
       - Entity description (2. Fusion / fact selection, supervised):
           name            "WorldBank Commodity Prices 2019"
           distribution    Worldbank (node)
           releaseDate     26.04.2019
           keywords        "crude", "prizes", "market"
           encodingFormat  text/CSV
       - New queries: WorldBank, type:(Organization); Washington, type:(City); David Malpass, type:(Person)
       References:
       - Yu, R., [..], Dietze, S., KnowMore - Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2019 (SWJ2019)
       - Tempelmeier, N., Demidova, S., Dietze, S., Inferring Missing Categorical Information in Noisy and Sparse Web Markup, The Web Conf. 2018 (WWW2018)
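Step 1.a (blocking through BM25 retrieval) can be sketched as follows. This is a minimal, self-contained illustration of Okapi BM25 scoring over flattened entity descriptions, not the KnowMore implementation; the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, the function names and the three sample descriptions are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercased whitespace tokens are good enough for a sketch.
    return text.lower().split()


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return (score, doc_index) pairs sorted by descending BM25 score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many descriptions does each term occur?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append((score, idx))
    return sorted(scores, key=lambda pair: -pair[0])


def block_candidates(query, docs, top_k=10):
    """Keep only the top_k descriptions that share at least one query term."""
    ranked = bm25_rank(query, docs)
    return [idx for score, idx in ranked[:top_k] if score > 0.0]


# Hypothetical entity descriptions, each flattened into one string.
DESCRIPTIONS = [
    "WB Commodity Worldbank 26 April 2019",
    "World Bank text/CSV",
    "iPhone 6 Apple smartphone",
]

if __name__ == "__main__":
    print(block_candidates("WorldBank Commodity Prices 2019", DESCRIPTIONS))
```

Note that the second description ("World Bank") is not retrieved by exact term matching against "WorldBank"; cases like this are why blocking is followed by a separate coreference resolution step.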
  • 11. KnowMore: data fusion on markup
       Fusion performance
       - Experiments on books, movies, products (ongoing: datasets)
       - Baselines: BM25, CBFS [ESWC2015], PreRecCorr [Pochampally et al., ACM SIGMOD 2014]
       - Strong variance across types
       Knowledge graph augmentation
       - On average 60% - 70% of all facts are new (across DBpedia, Wikidata, Freebase)
       - Additional experiments on learning new categorical features (e.g. product categories or movie genres) [WWW2018]
  • 12. Applications: search for social science (SS) datasets, resources & relations
      - https://search.gesis.org/
      - Figure labels: dataset, related publications
      - Disambiguation of datasets, methods, software, authors and topics
  • 13. Rich Context & Coleridge Initiative: building (yet another) KG of scholarly resources & datasets
      - Context/corpus: publications (currently: social sciences, SAGE Publishing)
      - Tasks:
          I. Extraction/disambiguation of dataset mentions
          II. Extraction/detection of research methods
          III. Classification of research fields
      - https://coleridgeinitiative.org/richcontextcompetition
  • 14. Disambiguation of dataset citations
      Example (publication excerpt; dataset mentions to be linked: "National Comorbidity Survey (NCS)", "NCS")
      "All these issues are addressed in the current report, which is based on analysis of data obtained in the National Comorbidity Survey (NCS) (15). The NCS is a nationally representative survey of the US household population that includes retrospective reports about the ages at onset and lifetime occurrences of suicidal ideation, plans, and attempts along with information about the occurrences of mental disorders, substance use, substance abuse, and substance dependence."
      Challenges
      - Ambiguous (incomplete) citations
      - Lack of high-quality, representative training data (usually weak labels, domain bias)
      Approaches & results
      - Prior work: supervised pattern induction [Boland et al., TPDL2012]
      - Current approach:
          o Neural NER based on spaCy (CRF-based approach for research method detection)
          o Training (testing) on 12,000 (3,000) paragraphs (distribution of negatives/positives differs; training batch size = 25, dropout = 0.4)
          o Results on weakly labelled test data: approx. P = .50, R = .90
          o Results on a small set of manually labelled test data: P = .52, R = .21
      Reference
      - Otto, W. et al., Knowledge Extraction from scholarly publications – the GESIS contribution to the Rich Context Competition, to appear, Sage Publishing, 2020
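To make the two evaluations easier to compare, the reported precision and recall can be combined into F1, the harmonic mean of the two. The F1 values below are derived here for illustration and are not reported on the slide.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision/recall values as reported on the slide.
f1_weak = f1_score(0.50, 0.90)    # weakly labelled test data
f1_manual = f1_score(0.52, 0.21)  # small manually labelled test set

print(f"weak labels:   F1 = {f1_weak:.2f}")
print(f"manual labels: F1 = {f1_manual:.2f}")
```

Under these figures F1 falls from roughly 0.64 to roughly 0.30, and the drop is driven almost entirely by recall, since precision is nearly unchanged.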
  • 15. Profiling (Graph) Datasets
      Motivation
      - Profiling datasets: extracting representative dataset metadata, e.g. to distinguish datasets of different kinds, find/discover datasets, or generate synthetic datasets
      - Research question: which graph metrics are effective for profiling graph-based research data (social graphs, knowledge graphs)?
      Methods & results
      - Framework for profiling datasets based on 60 different graph metrics
      - Feature engineering (correlation analysis, etc.) and feature impact analysis
      - Certain dataset categories are hard to describe/distinguish due to the inherent diversity/variance of their datasets
      - The set of descriptive, non-redundant dataset profile features varies across dataset categories
      Figures
      - Feature homogeneity (lighter colour = more homogeneous metric within domain)
      - Feature impact in binary classification task (RF)
      Reference
      - Zloch, M., Acosta, M., Hienert, D., Dietze, S., Conrad, S., A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs, ESWC19, Best Student Paper
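A minimal sketch of what a graph-based dataset profile can look like: a handful of simple measures computed from an edge list. The measures chosen here (density, mean degree, maximum in/out degree) merely stand in for the 60 metrics mentioned above, and the function name and toy graph are hypothetical; this is not the API of the cited framework.

```python
from collections import defaultdict

def profile_graph(edges):
    """Compute a tiny profile for a directed graph given as (source, target) pairs."""
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    n, m = len(nodes), len(edges)
    return {
        "nodes": n,
        "edges": m,
        # Directed density: edges divided by the n * (n - 1) possible edges.
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
        # Each edge contributes one out-degree and one in-degree.
        "mean_total_degree": 2 * m / n if n else 0.0,
        "max_in_degree": max(in_deg.values(), default=0),
        "max_out_degree": max(out_deg.values(), default=0),
    }

# Hypothetical toy graph: subject -> object links from RDF-like triples.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
profile = profile_graph(toy_edges)
```

Profiles like this can be computed per dataset and then used as feature vectors, for example as input to the correlation analysis or the classification task named on the slide.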
  • 16. Beyond datasets: linking social sciences survey items
      Motivation
      - Surveys are costly: finding and reusing survey questions/items
      - Linking semantically related questions/responses across survey programmes, e.g. all questions/responses that evaluate the current economic situation in Germany
      Approach & results
      - Taxonomy of question features & vocabulary for representing survey items & features
      - Initial ML models for predicting item features
          o Multiclass classification models for predicting information types (1st level: 3 classes, 2nd level: 9 classes)
          o LSTM, logistic regression, SVM, Naive Bayes, Random Forest
          o Reasonable performance, LSTM most robust
      Example from ALLBUS 2018
      - Q: "How would you rate the current economic conditions in Germany?"
      Figure: taxonomy of question features (values as far as recoverable from the extracted text)
      - Information Type / Focus: Family-Member, Fact, Cognition, Self-focus, Evaluation, Object-focus
      - Time Reference: Past, Present, Future, ...
      - Periodicity: Point in time, Time span, Periodic Point...
      - Relative Location: Apartment, Neighborhood, Country
      - Geo. Location: <Continent>, <Country>, <City>
      Reference
      - F. Bensmann, A. Papenmeier, D. Kern, B. Zapilko, S. Dietze, Semantic Annotation, Representation and Linking of Survey Data, in progress
  • 17. Overview
      - Part I: Retrieving, extracting and linking research data (in particular: metadata) on the Web
      - Part II: Mining novel forms of research data (KGs) from the Web
      - Figure labels: Datasets, Metadata, Publications, Web pages, Opinions, Claims, Stances
  • 18. Traditional & novel forms of research data: the case of social sciences
      - Traditional social science research data: survey & census data, microdata, lab studies, etc. (lacking scale and dynamics)
      - Social science vision: substituting & complementing traditional research data with data mined from the Web
      - Example: investigations into misinformation and opinion forming on Twitter (e.g. [Vosoughi et al. 2018])
      - Such work usually aims at insights while also dealing with methodological/computational challenges
      - Insights, mostly in (computational) social sciences, e.g.
          o Spreading of claims and misinformation
          o Effect of biased and fake news on public opinion
      - Methods, mostly in computer science, e.g. for
          o Crawling, harvesting, scraping of data
          o Extraction of structured knowledge (sentiments, stances, claims, etc.)
          o Claim/fact detection and verification ("fake news detection"), e.g. CLEF 2018 Fact Checking Lab
          o Stance detection, e.g. Fake News Challenge (FNC)
  • 19. Mining opinions & interactions (the case of Twitter)
      - Heterogeneity: multimodal, multilingual, informal, "noisy" language
      - Context dependence: interpreting tweets/posts (entities, sentiments) requires consideration of context (e.g. time, linked content), e.g. "Dusseldorf" => city or football team
      - Dynamics & scale: e.g. 6,000 tweets per second, plus interactions (retweets, etc.) and context (e.g. 25% of tweets contain URLs)
      - Evolution and temporal aspects: the evolution of interactions over time is crucial for many social science questions
      - Representativity and bias: demographic distributions are not known a priori in archived data collections
      Figure (example tweet annotation): entities http://dbpedia.org/resource/Tim_Berners-Lee and http://dbpedia.org/resource/Solid; emotions wna:positive-emotion and wna:negative-emotion with onyx:hasEmotionIntensity values "0.75" and "0.0"
      Reference
      - P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
  • 20. TweetsKB: a knowledge graph of Web mined "opinions"
      - https://data.gesis.org/tweetskb/
      - Harvesting & archiving of 9 bn tweets over 6 years (permanent collection from the Twitter 1% sample since 2013)
      - Information extraction pipeline to build a KG of entities, interactions & sentiments (distributed batch processing via Hadoop Map/Reduce)
          o Entity linking with knowledge graph/DBpedia (Yahoo's FEL [Blanco et al. 2015]), e.g. "president"/"potus"/"trump" => dbp:DonaldTrump, to disambiguate text and use background knowledge (e.g. US politicians? Republicans?); high precision (.85), low recall (.39)
          o Sentiment analysis/annotation using SentiStrength [Thelwall et al., 2017], F1 approx. .80
          o Extraction of metadata and lifting into established schemas (SIOC, schema.org), publication using W3C standards (RDF/SPARQL)
      Reference
      - P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
  • 21. TweetsKB: a knowledge graph of Web mined "opinions"
      Use cases
      - Aggregating sentiments towards topics/entities, e.g. about CDU vs SPD politicians in a particular time period
      - Twitter archives as a general corpus for understanding temporal entity relatedness (e.g. "austerity" & "Greece" 2010-2015)
      - Investigating the spreading & impact of fake news (e.g. TweetsKB, ClaimsKG, stance detection)
      Limitations
      - Bias & representativity: demographic distributions of users (not known a priori and not representative)
      Figure: chart comparing Cologne and Düsseldorf (value axis from -0.4 to 0.4)
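The first use case, aggregating sentiment towards entities within a time period, can be sketched as follows. The records, the tuple layout, the entity labels and the score values are all hypothetical; TweetsKB itself is published as RDF and queried via SPARQL, which this sketch does not reproduce.

```python
from collections import defaultdict
from datetime import date

# Hypothetical annotated tweets: (date, linked entity, positive score, negative score).
tweets = [
    (date(2017, 9, 1), "CDU", 0.75, 0.00),
    (date(2017, 9, 2), "CDU", 0.25, 0.75),
    (date(2017, 9, 2), "SPD", 0.50, 0.25),
    (date(2018, 1, 5), "SPD", 0.00, 0.75),  # outside the period queried below
]

def mean_sentiment(records, start, end):
    """Average (positive - negative) score per entity for start <= date <= end."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, entity, pos, neg in records:
        if start <= day <= end:
            totals[entity] += pos - neg
            counts[entity] += 1
    return {entity: totals[entity] / counts[entity] for entity in totals}

result = mean_sentiment(tweets, date(2017, 9, 1), date(2017, 9, 30))
```

Any such aggregate inherits the limitation named on the slide: it describes the users in the sample, whose demographics are unknown, rather than the general population.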
  • 22. Mining knowledge about claims and stances
      - Figure labels: stance, claim trustworthiness?
  • 23. Detecting stances towards claims/opinions
      Motivation
      - Problem: detecting the stance of documents (e.g. Web pages, scientific publications) towards a given claim (unbalanced class distribution)
      - Motivation: the stance of documents (in particular disagreement) is useful (a) as a signal for truthfulness (fake news detection) and (b) for document or source classification (PLDs, publishers)
      Approach
      - Cascading binary classifiers: addressing individual issues (e.g. misclassification costs) per step
      - Features, e.g. textual similarity (Word2Vec, etc.), sentiments, LIWC, etc.
      - Best-performing models per step: 1) SVM with class-wise penalty, 2) CNN, 3) SVM with class-wise penalty
      - Experiments on the FNC-1 dataset (and FNC baselines)
      Results
      - Minor overall performance improvement
      - Improvement on the disagree class by 27% (but still far from robust)
      Reference
      - A. Roy, A. Ekbal, S. Dietze, P. Fafalios, Exploiting stance hierarchies for cost-sensitive stance detection of Web documents, JCDL2020, under review.
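The cascade idea can be sketched as three binary decisions chained into one four-way label. The order of the steps (related vs unrelated, then neutral discussion vs taking a position, then agree vs disagree) is an assumption made for this sketch, using the FNC-1 label set; the keyword rules are purely hypothetical stand-ins for the trained SVM and CNN models.

```python
from typing import Callable

BinaryClassifier = Callable[[str, str], bool]

def make_cascade(is_related: BinaryClassifier,
                 takes_position: BinaryClassifier,
                 agrees: BinaryClassifier) -> Callable[[str, str], str]:
    """Chain three binary decisions into one four-way stance label."""
    def classify(claim: str, document: str) -> str:
        if not is_related(claim, document):
            return "unrelated"
        if not takes_position(claim, document):
            return "discuss"
        return "agree" if agrees(claim, document) else "disagree"
    return classify

# Toy stand-ins for trained models (hypothetical keyword rules).
def is_related(claim, doc):
    return bool(set(claim.lower().split()) & set(doc.lower().split()))

def takes_position(claim, doc):
    return any(w in doc.lower() for w in ("confirmed", "true", "false", "hoax"))

def agrees(claim, doc):
    return not any(w in doc.lower() for w in ("false", "hoax"))

stance = make_cascade(is_related, takes_position, agrees)
labels = [
    stance("moon landing staged", "Weather forecast for Tuesday"),
    stance("moon landing staged", "Experts discuss the moon landing footage"),
    stance("moon landing staged", "The staged moon landing story is a hoax"),
    stance("moon landing staged", "Officials confirmed the landing was staged"),
]
```

Because each step is a separate model, each can be given its own class weights or misclassification costs, which is what makes the rare disagree class addressable in isolation.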
  • 24. ClaimsKG: a knowledge graph of claims and claim-related metadata
      - https://data.gesis.org/claimskg/
      Motivation
      - Claims are spread across various (unstructured) fact-checking sites
      - Example: finding claims about / made by US Republican politicians across the Web?
      Approach
      - Harvesting claims & metadata from fact-checking sites (e.g. snopes.com, Politifact.com, etc.); currently approx. 30,000 claims, plus mining of schema.org/ClaimReview markup (> 500,000 statements in Common Crawl 2017)
      - Information extraction & linking
          o Linking mentioned entities to DBpedia
          o Normalisation of ratings (true, false, mixture, other); coreference resolution of claims
          o Exposing data through established vocabularies and W3C standards (e.g. SPARQL endpoint)
      Reference
      - A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K. Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims, ISWC2019
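Rating normalisation can be pictured as a lookup from site-specific labels to the four normalised classes named above. The raw labels and their assignments in this sketch are hypothetical examples and not the mapping actually used by ClaimsKG.

```python
NORMALISED = ("true", "false", "mixture", "other")

# Hypothetical site-specific labels mapped to the four normalised ratings.
RATING_MAP = {
    "true": "true",
    "mostly true": "true",
    "false": "false",
    "pants on fire": "false",
    "half true": "mixture",
    "mixture": "mixture",
}

def normalise_rating(raw_label: str) -> str:
    """Map a raw fact-checker label to one of the four normalised ratings."""
    # Lowercase, treat hyphens as spaces, and collapse repeated whitespace.
    key = " ".join(raw_label.lower().replace("-", " ").split())
    return RATING_MAP.get(key, "other")

examples = ["True", "Pants on Fire", "Half-True", "Unproven"]
normalised = [normalise_rating(label) for label in examples]
```

Unknown labels fall through to "other", so new rating vocabularies from additional sites degrade gracefully instead of raising errors.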
  • 25. Conclusions & open challenges
      Retrieving, extracting, linking of research dataset metadata (KGs)
      - Mining of unstructured Web pages and scholarly articles for research datasets & metadata
      - Profiling of research datasets for discovery, sampling, generation of synthetic data
      - Plenty of related initiatives and efforts (e.g. Rich Context, Research Graph, OpenAIRE, ORKG)
      - Challenges: transparent/reproducible/reusable methods for extraction & mining across domains and corpora
      Mining and sharing novel forms of research data (KGs)
      - Mining the Web for novel forms of research data
      - Examples from social sciences: opinions (sentiments on entities) and interactions on Twitter & structured knowledge about resource relations (for instance: stances) and claims
      - Challenges: language understanding/interpretation, representativity and bias
  • 26. Acknowledgements
      - Maribel Acosta (KIT, Karlsruhe)
      - Felix Bensmann (GESIS)
      - Katarina Boland (GESIS, Germany)
      - Stefan Conrad (HHU, Germany)
      - Elena Demidova (L3S, Germany)
      - Dimitar Dimitrov (GESIS, Germany)
      - Asif Ekbal (IIT Patna, India)
      - Pavlos Fafalios (FORTH ICS, Greece)
      - Daniel Hienert (GESIS, Germany)
      - Vasileios Iosifidis (L3S, Germany)
      - Dagmar Kern (GESIS, Germany)
      - Eirini Ntoutsi (LUH, Germany)
      - Wolfgang Otto (GESIS, Germany)
      - Andrea Papenmeier (GESIS, Germany)
      - Markus Rokicki (L3S, Germany)
      - Arjun Roy (IIT Patna, India)
      - Renato Stoffalette Joao (L3S, Germany)
      - Nicolas Tempelmeier (L3S, Germany)
      - Konstantin Todorov (LIRMM, France)
      - Ran Yu (GESIS, Germany)
      - Benjamin Zapilko (GESIS, Germany)
      - Matthäus Zloch (GESIS, Germany)
  • 27.
      - Knowledge Technologies for the Social Sciences (WTS): https://www.gesis.org/en/institute/departments/knowledge-technologies-for-the-social-sciences/
      - WTS Labs: https://www.gesis.org/en/research/applied-computer-science/labs/wts-research-labs
      - Data & Knowledge Engineering @ HHU: https://www.cs.hhu.de/en/research-groups/data-knowledge-engineering.html
      - L3S: http://www.l3s.de
      - Personal: http://stefandietze.net