What's all the data about? - Linking and Profiling of Linked Datasets
1. What's all the data about: profiling and interlinking Web datasets
Stefan Dietze
L3S Research Center
27/03/14 Stefan Dietze
2. Recent work on Linked Data exploration/discovery/search
Entity interlinking & dataset interlinking recommendation
Dataset profiling
Data consistency & conflicts
Research areas
Web science, Information Retrieval, Semantic Web & Linked
Data, data & knowledge integration (mapping, classification,
interlinking)
Application domains: education/TEL, Web archiving, …
Some projects
Introduction
http://www.l3s.de/
See also: http://purl.org/dietze
3. …why are there so few datasets actually used?
Data reuse and in-links focused on trusted „reference
graphs“ such as DBpedia, Freebase etc
Long tail of LD datasets which are neither reused nor linked
to (LOD Cloud alone 300+ datasets, 50 bn triples)
Explanations?
Linked Data is awesome, but...
„HTTP-accessibility“
(SPARQL, URI-dereferencing)
„Structure“ & „Semantics“
(=> shared/linked vocabularies)
„Interlinked“
„Persistent“
Hm,
really?
4. Linked data is more diverse than we think
SPARQL Web-Querying Infrastructure: Ready for Action?,
Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves
Vandenbussche, International Semantic Web Conference 2013,
(ISWC2013).
SPARQL endpoint availability over time [Buil-Aranda et al 2013]
Accessibility of datasets?
Less than 50% of all SPARQL endpoints actually responsive
at given point of time
“THE” SPARQL protocol? No, but many variants & subsets
…
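The availability figure above can be illustrated with a small probe. This is only a sketch under stated assumptions, not the method of the cited study: it sends a trivial ASK query over HTTP GET and counts an endpoint as responsive when it answers with status 200. The function names and the injectable `fetch` parameter are hypothetical and exist so the logic can be exercised without network access.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Trivial query used as a liveness probe (an assumption, not taken from the study).
ASK_QUERY = "ASK { ?s ?p ?o }"


def probe_url(endpoint):
    """Build a GET URL that carries the probe query as the 'query' parameter."""
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode({"query": ASK_QUERY})


def is_responsive(endpoint, fetch=None, timeout=10.0):
    """Return True if the endpoint answers the probe with HTTP status 200."""

    def default_fetch(url):
        request = Request(url, headers={"Accept": "application/sparql-results+json"})
        with urlopen(request, timeout=timeout) as response:
            return response.status

    fetch = fetch or default_fetch
    try:
        return fetch(probe_url(endpoint)) == 200
    except Exception:
        # Timeouts, DNS failures and HTTP errors all count as "not responsive".
        return False


def availability(endpoints, fetch=None):
    """Fraction of endpoints that are responsive at the time of the call."""
    if not endpoints:
        return 0.0
    responsive = sum(is_responsive(endpoint, fetch) for endpoint in endpoints)
    return responsive / len(endpoints)
```

Running `availability` repeatedly over time against a fixed endpoint list would give the kind of availability curve the slide refers to.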
Shared vocabularies & schemas, but:
…still very heterogeneous [d’Aquin, WebSci13]
…data partially messy and not conformant
(RDFS, schemas) [HoganJWS2012]
…even widely used reference datasets such as
DBpedia noisy [Paulheim2013]
Co-occurrence graph of data
types in 146 datasets: 144
Vocabularies, 588 highly
overlapping types, 719
Properties
Assessing the Educational Linked Data Landscape, D’Aquin, M.,
Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris,
France, May 2013.
Type Inference on Noisy RDF Data, Paulheim H., Bizer, C. Semantic
Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218,
2013, pp 510-525
An empirical survey of Linked Data conformance. Hogan, A., Umbrich,
J., Harth, A., Cyganiak, R., Polleres, A., Decker., S., In the Journal of Web
Semantics 14: pp. 14–44, 2012
5. What about data consistency?
Inconsistency and Incompleteness of Linked Datasets – a
Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., Web
Science 2014, WebSci14, under review.
6. Too many/diverse datasets, too little information
Which datasets are useful & trustworthy for case
XY (eg „learning about the solar system“) ? Which
topics are covered?
Types: which datasets describe statistics, videos,
slides, publications etc?
Currentness, dynamics, accessibility/reliability,
data quantity & quality?
7. Data curation and dataset profiling
Dataset
Catalog/Registry
Catalog of data: classification of
datasets according to resource
types, disciplines/topics, data
quality, accessibility, etc
Infrastructure for
distributed/federated querying
describes
Which datasets are useful & trustworthy for case
XY (eg „learning about the solar system“) ? Which
topics are covered?
Types: which datasets describe statistics, videos,
slides, publications etc?
Currentness, dynamics, accessibility/reliability,
data quantity & quality?
8. Dataset profiling: what’s all the data about
Dataset
Metadata
BIBO
AIISO
FOAF
contains
Entity disambiguation &
linking [ESWC13]
Topic profile extraction
[WWW13, ESWC14]
db:Astronomy
db:Astro. Objects
Dataset
Catalog/Registry
yov:Video
po:Programme
BBC Programme
<po:Programme …>
<po:Series>Wonders of the Solar System</.>
<po:Actor>Brian Cox</…>
</po:Programme…>
<yo:Video …>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video…>
Yovisto Video
bibo:Film
Schema mappings
[WebSci13]
10. Schema assessment and mapping
Co-occurrence of
data types
(in 146 datasets:
144 Vocabularies,
588 highly
overlapping types,
719 Properties)
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science
2013 (WebSci2013), Paris, France, May 2013.
<po:Programme …>
<po:title>Secret Universe –
The Life of the Cell</po:title>
…
</po:Programme…>
BBC Programme
<sioc:Item …>
<label>Viral diseases &
bacteria</label>
…
</sioc:Item ….>
SlideShare Set
po:Programme
sioc:Item
?
http://datahub.io/group/linked-education
11. Schema assessment and mapping
Co-occurrence of
data types
(in 146 datasets:
144 Vocabularies,
588 highly
overlapping types,
719 Properties)
Co-occurrence after
mapping into most
frequent schemas
(201 frequent types
mapped into 79
classes)
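The effect of mapping types into frequent schemas can be sketched as follows. The mapping table and the two toy datasets are illustrative assumptions, not the actual mappings from the cited study; the sketch only shows how type co-occurrence across datasets becomes visible once source types are rewritten to shared classes.

```python
from collections import Counter
from itertools import combinations

# Hypothetical type mapping, loosely modelled on the types shown on the slides.
TYPE_MAPPING = {
    "po:Programme": "bibo:Film",
    "yov:Video": "bibo:Film",
    "sioc:Item": "bibo:Slideshow",
}


def map_types(types, mapping):
    """Rewrite each type to its mapped class; unmapped types are kept as-is."""
    return {mapping.get(type_name, type_name) for type_name in types}


def cooccurrence(datasets):
    """Count, for every pair of types, the number of datasets using both."""
    counts = Counter()
    for types in datasets.values():
        for pair in combinations(sorted(types), 2):
            counts[pair] += 1
    return counts


# Toy input: dataset name -> set of types used in that dataset.
datasets = {
    "bbc": {"po:Programme", "foaf:Person"},
    "yovisto": {"yov:Video", "foaf:Person"},
}
before = cooccurrence(datasets)
after = cooccurrence(
    {name: map_types(types, TYPE_MAPPING) for name, types in datasets.items()}
)
```

Before mapping, the two toy datasets share no type pair; after mapping, both contribute to the same (`bibo:Film`, `foaf:Person`) pair.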
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science
2013 (WebSci2013), Paris, France, May 2013.
bibo:Slideshow
bibo:Film
bibo:Document
<po:Programme …>
<po:title>Secret Universe –
The Life of the Cell</po:title>
…
</po:Programme…>
BBC Programme
<sioc:Item …>
<label>Viral diseases &
bacteria</label>
…
</sioc:Item ….>
SlideShare Set
po:Programme
sioc:Item
12. LinkedUp Data Catalog
in a nutshell http://datahub.io/group/linked-education
http://data.linkededucation.org/linkedup/catalog/
RDF (VoID) dataset catalog: browse &
query distributed datasets
Live information about endpoint
accessibility
Federated queries using type mappings
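One way federated queries could use type mappings is to expand a shared class into all source types mapped to it before querying each endpoint. The sketch below is an assumption about how this might look, not the catalog's implementation: the mapping entries and function names are hypothetical, and only the query text is built, nothing is sent.

```python
# Hypothetical mapping from dataset-specific types to shared classes (full IRIs).
TYPE_MAPPING_IRIS = {
    "http://purl.org/ontology/po/Programme": "http://purl.org/ontology/bibo/Film",
    "http://rdfs.org/sioc/ns#Item": "http://purl.org/ontology/bibo/Slideshow",
}


def expand_type(target, mapping):
    """Return the target class plus every source type mapped to it, sorted."""
    sources = {source for source, mapped in mapping.items() if mapped == target}
    return sorted({target} | sources)


def build_query(target, mapping, limit=10):
    """Build a SPARQL 1.1 SELECT that matches instances of any expanded type."""
    values = " ".join(f"<{iri}>" for iri in expand_type(target, mapping))
    return (
        f"SELECT ?s WHERE {{ VALUES ?type {{ {values} }} ?s a ?type }} "
        f"LIMIT {limit}"
    )


film_query = build_query("http://purl.org/ontology/bibo/Film", TYPE_MAPPING_IRIS)
```

The same query string could then be sent to every endpoint listed in the catalog, so that a request for films also returns resources typed with a dataset-specific class.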
13. <yo:Video 8748720>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video 8748720>
Video
<sioc:Item 2139393292>
<title>Planetary motion
& gravity</title>
…
</sioc:Item 2139393292>
Slideset
Topics/categories addressed?
Relatedness of resources/entities?
(types, semantics)
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
Combining a co-occurrence-based and a semantic measure
for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R.
Kawase, B. Fetahu, and W. Nejdl., ESWC 2013 - 10th Extended
Semantic Web Conference, (May 2013).
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B., Dietze, S.,
Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended
Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
Challenge: semantics of resources/datasets?
14. <yo:Video 8748720>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video 8748720>
Video
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
Data disambiguation (for linking & profiling)
Brian Cox?
Sun?
Pluto?
15. db:Pluto (Dwarf Planet)
db:Astronomical Objects
db:Sun
Data disambiguation using background knowledge
„Semantic relatedness“ of resources?
db:Astronomy
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
<sioc:Item 2139393292>
<title>Planetary motion
& gravity</title>
…
</sioc:Item 2139393292>
Slideset
<yo:Video 8748720>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video 8748720>
Video
16. db:Pluto (Dwarf Planet)
db:Astronomical Objects
<yov:Lecture8748720>
<title>Pluto & the Dwarf
Planets</title>
…
</yov:Lecture8748720>
Online Lecture
db:Astronomy
Computation of connectivity scores
between resources/entities
Method: combination of a
(i) semantic (graph-based) connectivity
score (SCS) with
(ii) a Web co-occurrence-based measure
(CBM) (similar to NGD)
For (i): adaptation of Katz-Index from SNA
for (linked) data graphs (considering path
number and path lengths of transversal
properties)
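The two ingredients named above can be sketched in a few lines. This is a simplified stand-in, not the SCS/CBM definitions of the cited paper: the first function is a truncated Katz index (walk counts damped by length), the second is the standard Normalized Google Distance that the slide names as a point of comparison. The toy graph, the damping factor `beta` and the cut-off `max_len` are assumptions.

```python
import math


def katz_score(graph, source, target, beta=0.5, max_len=4):
    """Truncated Katz index: sum over l of beta**l * (number of walks of length l)."""
    walks = {source: 1}  # number of walks from source ending at each node
    score = 0.0
    for length in range(1, max_len + 1):
        next_walks = {}
        for node, count in walks.items():
            for neighbour in graph.get(node, ()):
                next_walks[neighbour] = next_walks.get(neighbour, 0) + count
        walks = next_walks
        score += (beta ** length) * walks.get(target, 0)
    return score


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts fx, fy, joint count fxy, index size n."""
    if fxy == 0:
        return math.inf  # the terms never co-occur
    log_x, log_y, log_xy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_x, log_y) - log_xy) / (math.log(n) - min(log_x, log_y))


# Toy undirected graph, written as symmetric adjacency lists.
graph = {
    "db:Pluto": ["db:Astronomical_Objects"],
    "db:Sun": ["db:Astronomical_Objects"],
    "db:Astronomical_Objects": ["db:Pluto", "db:Sun", "db:Astronomy"],
    "db:Astronomy": ["db:Astronomical_Objects"],
}
pluto_sun = katz_score(graph, "db:Pluto", "db:Sun", beta=0.5, max_len=2)
```

Shorter and more numerous connections raise the Katz-style score, while a smaller distance from the co-occurrence measure indicates terms that frequently appear together on the Web.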
db:Sun
SCS = 0.32
CBM = 0.24
http://purl.org/vol/doc/
http://purl.org/vol/ns/
Combining a co-occurrence-based and a semantic
measure for entity linking, B. P. Nunes, S. Dietze, M.A.
Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013
- 10th Extended Semantic Web Conference, (May 2013).
Entity linking: semantic relatedness
<sioc:Item 2139393292>
<title>Planetary motion
& gravity</title>
…
</sioc:Item 2139393292>
Slideset
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
17. Entity linking: evaluation
Evaluation based on USA Today News items (80,000 entity pairs)
Manually created gold standard
(1,000 entity pairs)
Baseline: Explicit Semantic Analysis (ESA)
=> CBM/SCS: „relatedness“; ESA: „similarity“
Precision/Recall/F1 for SCS, CBM, ESA.
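For reference, precision, recall and F1 over entity pairs can be computed as in the sketch below. The pair sets are made-up toy data, not the USA Today gold standard, and treating each pair as an ordered tuple is an assumption.

```python
def precision_recall_f1(predicted, gold):
    """Compare a set of predicted entity pairs against a gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: three predicted links, four links in the gold standard.
predicted_pairs = {("a", "b"), ("a", "c"), ("b", "d")}
gold_pairs = {("a", "b"), ("b", "d"), ("c", "d"), ("e", "f")}
precision, recall, f1 = precision_recall_f1(predicted_pairs, gold_pairs)
```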
Combining a co-occurrence-based and a semantic
measure for entity linking, B. P. Nunes, S. Dietze, M.A.
Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013
- 10th Extended Semantic Web Conference, (May 2013).
18. db:Astronomical Objects
db:Astronomy
db:Sun
Extracting representative metadata („topic profile“) for each dataset
Ranking of most representative (DBpedia) categories (= topics); applied to all responsive LOD datasets
Scalability vs representativeness: sampling & ranking for good scalability/accuracy balance
DBpedia category graph
Dataset profiling: what‘s the data about?
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
19. Dataset profiling: approach
1. Sampling of resource instances
(random sampling, weighted sampling, resource
centrality sampling)
2. Entity and topic extraction (NER via DBpedia
Spotlight, category mapping and expansion)
3. Normalisation and ranking (using graphical
models such as PageRank with Priors, HITS with
Priors and K-Step Markov)
=> Result: weighted dataset-topic profile graph
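Steps 1 and 3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline of the cited paper: sampling is plain random sampling with a fixed seed, entity extraction (step 2) is assumed to have already produced the dataset-to-entity and entity-to-category edges, and ranking uses a basic PageRank with priors in which dangling nodes return their mass to the prior distribution.

```python
import random


def sample_resources(resources, k, seed=0):
    """Step 1 (simplest variant): uniform random sample of at most k resources."""
    rng = random.Random(seed)
    pool = list(resources)
    return rng.sample(pool, min(k, len(pool)))


def pagerank_with_priors(edges, priors, damping=0.85, iterations=50):
    """Step 3: power iteration where the restart distribution is given by priors.

    edges maps a node to its list of successors; priors maps nodes to
    restart weights that are assumed to sum to 1.
    """
    nodes = set(edges) | {v for succ in edges.values() for v in succ} | set(priors)
    rank = {node: priors.get(node, 0.0) for node in nodes}
    for _ in range(iterations):
        next_rank = {node: (1 - damping) * priors.get(node, 0.0) for node in nodes}
        for node in nodes:
            successors = edges.get(node, [])
            if successors:
                share = damping * rank[node] / len(successors)
                for successor in successors:
                    next_rank[successor] += share
            else:
                # Dangling node: hand its mass back to the prior distribution.
                for prior_node, weight in priors.items():
                    next_rank[prior_node] += damping * rank[node] * weight
        rank = next_rank
    return rank


# Toy graph: one dataset, two sampled entities, two DBpedia-style categories.
edges = {
    "dataset": ["entity1", "entity2"],
    "entity1": ["cat:Astronomy"],
    "entity2": ["cat:Astronomy", "cat:Physics"],
}
profile = pagerank_with_priors(edges, priors={"dataset": 1.0})
```

Restricting the resulting scores to category nodes and sorting them gives a weighted topic ranking for the dataset, which is the shape of the dataset-topic profile described above.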
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
20. Dataset profiling: exploring LOD datasets/topics
in a nutshell http://data-observatory.org/lod-profiles/
Automatic extraction of dataset “topics” [ESWC2014]
Visualisation & exploration of dataset-topic graph
(datasets, topics, relationships)
Includes all (responsive) datasets of LOD Cloud
21. Dataset profiling: results evaluation
NDCG (averaged over all datasets)
Datasets & Ground Truth
Yovisto, Oxpoints, LAK Dataset, Semantic Web
Dogfood
Crowd-sourced topic indicators from datasets
(keywords, tags)
Manual mapping to entities & category extraction
(ranking according to frequency)
Baselines
1) LDA, 2) tf/idf (applied to entire datasets)
Topic extraction according to our approach,
weighting/ranking based on term weight
Measure
NDCG @ rank l
Performance (time/NDCG) for different sampling
strategies/sizes etc
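The measure named above, NDCG at rank l, can be computed as in this sketch. It uses the common log2 discount and takes the ideal ordering from the same relevance list, which assumes that all relevant topics appear somewhere in the ranking; the relevance grades are toy values.

```python
import math


def dcg(relevances, l):
    """Discounted cumulative gain over the first l positions."""
    return sum(rel / math.log2(position + 2) for position, rel in enumerate(relevances[:l]))


def ndcg(relevances, l):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), l)
    return dcg(relevances, l) / ideal if ideal else 0.0


# Relevance grades of ranked topics, listed in ranked order.
perfect_ranking = ndcg([3, 2, 1], 3)
reversed_ranking = ndcg([1, 2, 3], 3)
```

Averaging `ndcg` over all evaluated datasets yields a single figure per approach and rank cut-off.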
23. Diversity of category profile for a single paper
Berners-Lee, Tim; Hendler, James; Lassila, Ora (2001). "The Semantic Web".
Scientific American Magazine.
person
document
dbp:Tim_Berners-Lee
dbp:Category:1955_births
dbp:Category:People_from_London
dbp:Category:Buzzwords
dbp:Semantic_Web
dbp:Category:Semantic_Web
dbp:Category:Web_Services
dbp:Category:HTTP
dbp:Category:Unitarian_Universalists
first-level categories (dcterms:subject)
dbp:Category:World_Wide_Web
dbp:Category:Royal_Medal_winners
24. DBpedia category graph not an ideal “topic” vocabulary:
Broad and noisy
“Categories” vs “topics” (for capturing disciplines, thesauri
like UMBEL or UNESCO Thesaurus seem better suited)
Hierarchy ?
Filtering of certain partitions of category graph (too generic
categories etc)
Mixing categories across resource types (document, person)
creates “perceived noise”
But: broadness is useful as general vocabulary for
categorisation of all sorts of resource types
Dataset profiling: some lessons learned
25. http://data-observatory.org/led-explorer/
Type specific views on datasets/
categories
“Document” (foaf:Document)
“Person” (foaf:Person)
“Course” (aiiso:Course)
Currently applied to datasets in
LinkedUp Catalog only (as
schema mappings already
available here)
Type-specific exploration of dataset categories
26. Dataset interlinking recommendation
Candidate datasets for interlinking?
t
Linkset1
Linkset2
Problem
Given a dataset t, rank the datasets d_i in D
by a score(d_i, t) estimating the probability that d_i
contains linking candidates (entities) for t
Features:
Vocabulary overlap
Existing links (SNA)
Datasets more likely to contain linking
candidates if they (a) share common
schema elements, or (b) already link to t
or datasets t links to (friend of a friend)
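The two features above can be combined into a simple ranking, sketched here. This is a deliberately simplified stand-in, not the probabilistic or social-network models of the cited papers: vocabulary overlap is measured with Jaccard similarity, the link feature checks for a direct link to t or for shared link targets, and the equal weighting `weight=0.5` is an arbitrary assumption. All dataset names and vocabularies are toy data.

```python
def jaccard(a, b):
    """Overlap of two vocabulary sets, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_score(links, candidate, target):
    """1.0 for a direct link to target, else the share of target's link targets
    that the candidate also links to (the "friend of a friend" case)."""
    candidate_links = links.get(candidate, set())
    if target in candidate_links:
        return 1.0
    target_links = links.get(target, set())
    if not target_links:
        return 0.0
    return len(candidate_links & target_links) / len(target_links)


def rank_candidates(target, vocabularies, links, weight=0.5):
    """Rank all datasets except target by the weighted sum of both features."""
    scores = {
        dataset: weight * jaccard(vocabularies[dataset], vocabularies[target])
        + (1 - weight) * link_score(links, dataset, target)
        for dataset in vocabularies
        if dataset != target
    }
    return sorted(scores, key=lambda dataset: (-scores[dataset], dataset))


vocabularies = {
    "t": {"foaf", "dc", "bibo"},
    "dblp": {"foaf", "dc", "swrc"},
    "geo": {"wgs84"},
}
links = {"t": {"dbpedia"}, "dblp": {"dbpedia"}, "geo": set()}
ranking = rank_candidates("t", vocabularies, links)
```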
Conclusions
Roughly 60% MAP for both approaches
Future work: quantity of links, more
remote links, extraction of dataset links
rather than data from DataHub
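The MAP figure quoted in the conclusions refers to mean average precision, which can be computed as in this sketch. The rankings and relevance sets are toy values, not results from the cited papers.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranked list, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
map_score = mean_average_precision(runs)
```

In the recommendation setting, each run would be the ranked candidate list for one target dataset together with the datasets it is actually linked to.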
Lopes, G.R., Paes Leme, L.A.P., Nunes, B.P., Casanova, M.A.,
Dietze, S., Recommending Tripleset Interlinking through a
Social Network Approach, The 14th International Conference
on Web Information System Engineering (WISE 2013),
Nanjing, China, 2013.
Paes Leme, L. A. P., Lopes, G. R., Nunes, B. P., Casanova,
M.A., Dietze, S., Identifying candidate datasets for data
interlinking, in Proceedings of the 13th International
Conference on Web Engineering, (2013).
Rank
1 DBLP
2 ACM
3 OAI
4 CiteSeer
5 IBM
6 Roma
7 IEEE
8 Ulm
9 Pisa
27. Success models:
data & applications
LinkedUp Challenge
to identify innovative
tools & applications
Evaluation methods
and approaches
“LinkedUp” – Linking Web Data (for Education)
Data linking & curation
Technology transfer
& community-building
Collecting & exposing open
data
=> LinkedUp Data Catalog
Profiling and linking of Web
Data for education
=> educational data graph
[ESWC2013], [ISWC2013]
Disseminating knowledge &
building communities
(educators, computer
scientists, data engineers)
Gathering stakeholder
feedback: use cases, and
requirements
http://linkedup-challenge.org/#usecases
http://linkedup-project.eu/events
http://www.linkedup-challenge.org/
http://data.linkededucation.org
European support action to
advance take-up of open
data & related technologies
http://www.linkedup-project.eu
29. LinkedUp Challenge: using open data (for learning)
Open Data Competition to promote tools and applications that analyse / integrate (Linked)
Web data
Organised by LinkedUp project over 2 years (“Veni”, “Vidi”, “Vici”) with 40,000 EUR awards
Veni Competition - 22 submissions, 8 shortlisted for presentation at Open Knowledge
Conference (17 September, Geneva, Switzerland)
http://linkedup-challenge.org
30. May – September 2013: Open Track only; final events at OKCon 2013 (September 2013, Geneva)
October 2013 – May 2014: Open & focused track(s); final events at ESWC2014 (May, Crete)
May 2014 – October 2014: Open track & focused tracks; submission details and calls to be released soon; final events at ISWC2014 (October, Riva del Garda, Italy)
33. Learning Analytics & Knowledge Dataset & Challenge
Facilitating Research on Learning Analytics and EDM
in a nutshell
http://lak.linkededucation.org/
LAK Dataset (450 publications in RDF/R)
ACM International Conference on Learning Analytics and
Knowledge (LAK) (2011-13)
International Conference on Educational Data Mining (2008-13)
Journal of Educational Data Mining (2008-12)
LAK Data Challenge
Analyse, explore and correlate the LAK Dataset
At ACM LAK 2014 (April 2014, Indianapolis)
34. KEYSTONE COST ACTION
http://www.keystone-cost.eu/
Research network focused on distributed search and
dataset profiling, spanning Semantic Web, databases, etc.
Running 2013-2017
WG1: Representation of structured data sources
WG2: Keyword search
WG3: User interaction and query interpretation
WG4: Research integration, showcases,
benchmarks, and evaluations
Open to new members (even beyond Europe)
Joint workshops (eg PROFILES2014 @ ESWC2014)
35. Ongoing/future work … and some upcoming events
Linked Data evolution, preservation, consistency
In RDF graphs (eg LOD Cloud), „all“ nodes are connected
LD preservation: which datasets to preserve (direct links
or even more distant neighbours)?
=> semantic relatedness as guidance for scalable
preservation strategies /data enrichment
Link correctness in evolving LD
Investigating impact of changes on link correctness
(weekly LOD crawls over 1 year time span)
Application: informed preservation strategies
Conflict detection and LD quality (link quality, impact of
conflicts in distant nodes)
PROFILES workshop @ ESWC2014
(http://keystone-cost.eu/profiles2014)
26 May 2014, Crete, Greece
Linking User Data 2014 at UMAP2014
(http://liud.linkededucation.org)
Deadline: 1 April
Online Learning & LD Tutorial at WWW2014
(http://www2014.kr/)
07 April, Seoul
36. Thank you!
WWW
See also (general)
http://linkedup-project.eu
http://linkededucation.org
http://data.l3s.de
http://purl.org/dietze
See also (data)
http://data.linkededucation.org
http://data.linkededucation.org/linkedup/catalog/
http://lak.linkededucation.org
Acknowledgements
Besnik Fetahu (L3S)
Bernardo Pereira Nunes (PUC Rio)
Marco Casanova (PUC Rio)
Luiz Andre Paes Leme (PUC Rio)
Giseli Lopes (PUC Rio)
Davide Taibi (CNR, IT)
Mathieu d’Aquin (Open University, UK)
and many more…