1. How links can make your open data even greater
Cristina Sarasua
Institute for Web Science and Technologies (WeST)
University of Koblenz-Landau, DE
Open Data Day 2017
Zurich, CH
2. Goals
● Spread the word about Semantic Web and
Linked Data technologies
● Share tips on how to link your data properly
3. Open Data
● Data implemented in any open format
○ CSV/TSV, XML, JSON, etc.
● Made available for free
● By any organisation or individual person
● “Usable, reusable and distributable”
(Open Definition, by OKFN [1])
● Findable
○ Registered in data repositories
○ Via search engines
❖ Boosts transparency
❖ Enables reinterpretation of the data
❖ Facilitates the development of new applications
[1] http://opendefinition.org/
4. The Web, an ocean of data
Logos:
SBB: http://www.sbb.ch/en/home.html
Deutsche Bahn: https://www.bahn.de/p/view/index.shtml
Wikipedia: https://commons.wikimedia.org/wiki/File:Wikipedia-logo.png
5. The Web, an ocean of data
How many trains go from Zurich to Basel SBB daily?
How frequently do trains arrive at Basel Bad Bf?
What are the most populated cities in the south of Germany?
6. The Web, an ocean of data
Which are on average more punctual in locations with more than 50,000 inhabitants, German or Swiss trains?
7. The Web, an ocean of data
● Collect data
● Put the data together
● Transform data
● Enable joint query
The main effort is on the data consumer side
[Figure: three sources in CSV, JSON and HTML. Labels: Land, Staat, Country; Linien, Haltestelle, Cities. Values: Schweiz, CH, Switzerland.]
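The consumer-side steps above (collect, put together, transform, enable joint query) can be sketched in a few lines of Python. This is only an illustration: the three records, the pairing of each label with a value, and the two alias tables are assumptions, not data taken from the sources named on the slide.

```python
# Hypothetical records, as a consumer might receive them from three sources.
csv_row = {"Land": "Schweiz"}
json_obj = {"Staat": "CH"}
html_cell = {"Country": "Switzerland"}

# Hand-written mappings (assumed): the consumer has to discover and maintain them.
FIELD_ALIASES = {"Land": "country", "Staat": "country", "Country": "country"}
VALUE_ALIASES = {"Schweiz": "CH", "CH": "CH", "Switzerland": "CH"}


def normalise(record):
    """Rename fields to a common schema and map values to a common code."""
    out = {}
    for field, value in record.items():
        key = FIELD_ALIASES.get(field, field)
        out[key] = VALUE_ALIASES.get(value, value)
    return out


# Collect, put together and transform; after this a joint query is possible.
merged = [normalise(r) for r in (csv_row, json_obj, html_cell)]
same_country = len({r["country"] for r in merged}) == 1
```

Every consumer who wants to combine these sources has to repeat this mapping work, which is the point the slide makes.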
12. The Web of Data
● Publish an explicit description of the schema separately
● Follow the RDF data model [2]
● Align schemas, link to other entities in distributed data sources
The data publisher makes a data integration effort
[2] https://www.w3.org/RDF/
[Figure: the concepts Land, Staat and Country are linked as equivalent; the entities Schweiz, CH and Switzerland are linked as the same]
13. Typed relations between concepts and between entities from distributed and
(possibly) heterogeneous data sources.
subject                   predicate      object
nyt:Zürich                owl:sameAs     dbpedia:Zürich
dbpedia:Tim_Berners-Lee   rdf:type       foaf:Person
uzh:hackzhodd             geo:location   geop:Point564
uzh:hackzhodd             rdfs:seeAlso   wdt:Q25112115
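As a rough illustration of the subject, predicate, object structure above, the same links can be held as plain Python tuples and filtered by position. This is a sketch without an RDF library: prefixed names are kept as plain strings, and the `Triple` type and `match` helper are hypothetical names introduced here.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The example links from the slide, with prefixed names kept as strings.
LINKS = [
    Triple("nyt:Zürich", "owl:sameAs", "dbpedia:Zürich"),
    Triple("dbpedia:Tim_Berners-Lee", "rdf:type", "foaf:Person"),
    Triple("uzh:hackzhodd", "geo:location", "geop:Point564"),
    Triple("uzh:hackzhodd", "rdfs:seeAlso", "wdt:Q25112115"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that agree with every position that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


about_hackathon = match(LINKS, subject="uzh:hackzhodd")
identity_links = match(LINKS, predicate="owl:sameAs")
```

Because the predicate is explicit, an application can tell an identity link (owl:sameAs) apart from a looser pointer (rdfs:seeAlso) without knowing the data sets in advance.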
Links
[4] https://www.w3.org/DesignIssues/LinkedData.html
[3] LOD diagram by Abele et al. 2017
http://lod-cloud.net/versions/2017-02-20/lod.svg
15. Some key advantages
✔ The data publisher knows her data
✔ No need to integrate schemas upfront
✔ Data and metadata can be easily extended and modified
✔ Applications may query the schema description
✔ Structured search
Read more about it: Franklin et al. 2005, Heath et al. 2011
16. I have data, what should I do?
17. Standard process
● Start: CSV
● Transform it into RDF: vocabulary, data with metadata
● Link to other entities (data interlinking, link discovery, entity resolution)
● Publish
Open source framework for interlinking (Isele et al. 2009-2017): http://silkframework.org
HowTo: Best Practices for Publishing Linked Data, by Hyland et al. 2014.
Comparison of technology: Nentwig et al. 2015, Survey of Current Link Discovery Frameworks.
To link to other entities, specify:
1. Target data set(s)
2. Type of entities to be connected (e.g. Persons and Humans)
3. Link predicate (e.g. owl:sameAs)
4. Interlinking criteria (e.g. if similar names)
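The four settings above (target data set, entity types, link predicate, interlinking criteria) can be sketched as parameters of a small Python function. This is a toy under stated assumptions, not how the Silk framework works: the entities, the `src:` and `tgt:` prefixes, and the 0.9 threshold are invented for illustration, and name similarity is computed with the standard library's `difflib`.

```python
from difflib import SequenceMatcher

# Hypothetical source and target entities: (identifier, type, name).
SOURCE = [
    ("src:p1", "Person", "Tim Berners-Lee"),
    ("src:p2", "Person", "Ada Lovelace"),
]
TARGET = [
    ("tgt:h1", "Human", "Tim Berners Lee"),
    ("tgt:h2", "Human", "Grace Hopper"),
    ("tgt:c1", "City", "Ada"),
]


def similar(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def interlink(source, target, source_type, target_type, predicate, threshold):
    """Link source entities to target entities of the given types
    whenever their names are at least `threshold` similar."""
    links = []
    for s_id, s_type, s_name in source:
        if s_type != source_type:
            continue
        for t_id, t_type, t_name in target:
            if t_type == target_type and similar(s_name, t_name) >= threshold:
                links.append((s_id, predicate, t_id))
    return links


# Settings 2 to 4: Persons vs. Humans, owl:sameAs, "similar names".
links = interlink(SOURCE, TARGET, "Person", "Human", "owl:sameAs", 0.9)
```

Real link discovery frameworks add blocking, several similarity measures and aggregation on top of this; the sketch only shows where each of the four decisions enters.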
18. Common “mistakes” in data interlinking
● Link only to “popular” data sets
● Think solely of owl:sameAs links
● Focus on target data sets of similar topical domain and provenance
● No documented links
● Minimum number of links to appear in the LOD diagram
● No link maintenance
➢ Gaining visibility is good, but that’s not the only reason for interlinking.
➢ uzh:r_user_group: maybe no one has described it yet!
➢ Specify your outlinks in the data set description to help Web data crawlers! With VoID [5]:
    :UZH a void:Linkset;
        void:target :Wikidata;
        void:linkPredicate rdfs:seeAlso;
        void:triples 100 .
➢ Target data sets die, and new data sets appear all the time.
[5] https://www.w3.org/TR/void/
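A description like the VoID example above could be generated from the outlinks themselves, so that it stays in step with the data. The Python sketch below is an assumption-laden illustration: the data set names in `DATASET_BY_PREFIX` and the `:linksetN` identifiers are hypothetical, and it uses the directional VoID properties `void:subjectsTarget` and `void:objectsTarget` where the slide uses the more general `void:target`.

```python
from collections import Counter

# Hypothetical mapping from name prefix to data set identifier.
DATASET_BY_PREFIX = {"uzh": ":UZH", "wdt": ":Wikidata", "geop": ":GeoPoints"}

# Outlinks as (subject, predicate, object), taken from the earlier examples.
OUTLINKS = [
    ("uzh:hackzhodd", "rdfs:seeAlso", "wdt:Q25112115"),
    ("uzh:hackzhodd", "geo:location", "geop:Point564"),
]


def prefix(term):
    """Return the part of a prefixed name before the first colon."""
    return term.split(":", 1)[0]


def describe_linksets(links):
    """Emit one void:Linkset block per (source, predicate, target) group."""
    counts = Counter((prefix(s), p, prefix(o)) for s, p, o in links)
    blocks = []
    for i, ((s_pref, pred, o_pref), n) in enumerate(sorted(counts.items()), 1):
        blocks.append(
            f":linkset{i} a void:Linkset ;\n"
            f"    void:subjectsTarget {DATASET_BY_PREFIX[s_pref]} ;\n"
            f"    void:objectsTarget {DATASET_BY_PREFIX[o_pref]} ;\n"
            f"    void:linkPredicate {pred} ;\n"
            f"    void:triples {n} ."
        )
    return "\n\n".join(blocks)


description = describe_linksets(OUTLINKS)
print(description)
```

The output still needs the usual prefix declarations before it is valid Turtle; those are left out here to keep the sketch short.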
19. # Tip 1: Answer these questions and design the interlinking accordingly
● Who should benefit from the interlinking?
a. You, as data publisher
b. Applications (and end-users) consuming your data
c. Applications (and end-users) consuming a collection of data sets, yours among others
● Why do you want to interlink your data?
a. To gain visibility (via other data sets)
b. To complement your data
c. To enable “on update cascade”
● What things do you want to connect?
a. Are there alternative ways of naming such things? E.g. Person, Human
b. Are there more general / more specific terms to label such things? E.g. Animals, Mammals.
20. # Tip 2: When implementing the interlinking
● Assess the quality of target data sets, or your own data quality will be damaged. (See Zaveri et al. 2015 for quality issues and quality control methods.)
● Publish outlinks, but also send link requests to others for inlinks.
● Check how others interlinked by querying link repositories [6] or the data sets.
○ Consider declared data set IDs and not raw pay-level domains (PLDs), e.g. http://ns.nature.com/subjects/
[6] http://sameas.org/ , http://www.linklion
[Figure: data sets uzh, nyt and wdt]
21. Wikidata
“Wikidata records what other sources say” (Lydia Pintscher, 2016 [7]).
Introduction to Wikidata, Sarasua 2016: https://goo.gl/gGzMzK
[7] https://goo.gl/On9Qz1
22. # Tip 3: Link, improve, repeat.
● The stop criterion should not be the percentage of entities interlinked.
a. If you have very specific data, being able to connect 1% of source entities might be normal.
● Improve quality in these two dimensions
a. semantic accuracy
❌ cch:koblenz owl:sameAs cde:koblenz
b. links should “enable the discovery of more things” (4th LD Principle)
[4] https://www.w3.org/DesignIssues/LinkedData.html
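To make the contrast concrete, the Python sketch below computes both the share of source entities interlinked and the accuracy of links that a human has reviewed. It is only an illustration: apart from the cch:koblenz link, which the slide marks as wrong, all entities and judgements are hypothetical.

```python
# Hypothetical source entities of the data set being interlinked.
SOURCE_ENTITIES = {"cch:koblenz", "cch:basel", "cch:aarau", "cch:brugg"}

# Human-reviewed links: True means the reviewer judged the link correct.
REVIEWED = {
    ("cch:koblenz", "owl:sameAs", "cde:koblenz"): False,  # marked wrong on the slide
    ("cch:basel", "owl:sameAs", "ex:basel"): True,        # hypothetical link
}


def coverage(source_entities, links):
    """Share of source entities that appear as subject of at least one link."""
    linked = {s for s, _, _ in links if s in source_entities}
    return len(linked) / len(source_entities)


def accuracy(reviewed):
    """Share of reviewed links that were judged correct."""
    return sum(reviewed.values()) / len(reviewed)


share_linked = coverage(SOURCE_ENTITIES, REVIEWED.keys())
share_correct = accuracy(REVIEWED)
```

A high `share_linked` says nothing about `share_correct`, which is why the percentage of entities interlinked makes a poor stop criterion on its own.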
23. Semantic accuracy
● Let humans review the links, with microtask crowdsourcing.
See Demartini et al. 2012, 2013; Sarasua et al. 2012, 2015
Detailed Crowdsourcing Tutorial by Demartini et al.: https://itsgettingcrowded.wordpress.com/
24. Enable the discovery of more things
● Link to entities that make you learn something new and non-redundant about the source entity
○ New value
○ New classification
○ New way of describing metadata
● The more entities you link to, the better.
● The more data sets you connect to, the better.
See also Sarasua et al. 2017.
25. When you publish your open data, consider using
Semantic Web / Linked Data technologies and
linking your data to other people’s data.
26. Thanks! Danke! Grazie! Merci!
Cristina Sarasua
E-mail: csarasua@uni-koblenz.de
Twitter: @csarasuagar
Don’t forget that you can also become a Wikimedia member and donate :)
https://wikimedia.de/wiki/Mitgliedschaft
27. References
Franklin et al. 2005. From databases to dataspaces. ACM SIGMOD Record.
https://homes.cs.washington.edu/~alon/files/dataspacesDec05.pdf
Heath et al. 2011. Linked data: Evolving the web into a global data space. Morgan &
Claypool.
http://www.morganclaypool.com/doi/abs/10.2200/s00334ed1v01y201102wbe001
Hyland et al. 2014. Best Practices for Publishing Linked Data. https://www.w3.org/TR/ld-bp/
Nentwig et al. 2015. Survey of Current Link Discovery Frameworks. Semantic Web Journal.
http://www.semantic-web-journal.net/system/files/swj1029.pdf
Zaveri et al. 2015. Quality Assessment for Linked Data: A Survey. Semantic Web Journal.
http://www.semantic-web-journal.net/system/files/swj773.pdf
28. References
Demartini et al. 2012. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for
Large-Scale Entity Linking, WWW2012.
https://diuf.unifr.ch/main/xi/sites/diuf.unifr.ch.main.xi/files/fp0982-demartini.pdf
Demartini et al. 2013. Large-scale linked data integration using probabilistic reasoning and
crowdsourcing. VLDB Journal. https://link.springer.com/article/10.1007/s00778-013-0324-z
Sarasua et al. 2012. CrowdMap: Crowdsourcing Ontology Alignment with Microtasks. ISWC2012.
http://web.stanford.edu/~natalya/papers/iswc2012_crowdmap.pdf
Sarasua 2015. Programmatic Access to Crowdsourced Human Computation for Designing and Enhancing
Interlinking. SemWebDev, ESWC 2015. http://ceur-ws.org/Vol-1361/paper6.pdf
Sarasua et al. 2017. Methods for Intrinsic Evaluation of Links in the Web of Data. ESWC 2017. Upcoming.