Linked Open Data promises guiding principles for publishing interlinked knowledge graphs on the Web as findable, accessible, interoperable, and reusable datasets. In this talk I argue that, while Linked Data may therefore be viewed as a basis for instantiating the FAIR principles, a number of open issues still cause significant data quality problems even when knowledge graphs are published as Linked Data. I will first define the boundaries of what constitutes a single coherent knowledge graph within Linked Data, i.e., present a principled notion of what a dataset is and what links within and between datasets are. I will also define different link types for data in Linked Data datasets and present the results of our empirical analysis of linkage among the datasets of the Linked Open Data cloud. Recent results from our analysis of Wikidata, which is not part of the Linked Open Data cloud, will also be presented.
What Are Links in Linked Open Data? A Characterization and Evaluation of Links between Knowledge Graphs on the Web
1. What Are Links in Linked Open Data?
A Characterization and Evaluation of Links between
Knowledge Graphs on the Web
[Invited Talk 04/08/21 at spatial@UCSB]
Armin Haller
Associate Professor, ANU
Armin Haller, Javier D. Fernández, Maulik R. Kamdar, Axel Polleres: What Are Links in Linked Open Data?
A Characterization and Evaluation of Links between Knowledge Graphs on the Web. ACM Journal of Data
and Information Quality 12(2): 9:1-9:34 (2020)
2. Knowledge Graphs (KGs)
“A Knowledge Graph is a graph of data intended to accumulate and
convey knowledge of the real world, whose nodes represent entities of
interest and whose edges represent relations between these entities.”
[Hogan et al. 2020]
• Knowledge graphs are created collaboratively by many users
• Information can be added in a relatively arbitrary manner as
structural constraints are few
Closed KGs (~2019)
Microsoft ~2bn entities, ~55bn facts
Google ~1bn entities, ~70bn assertions
Facebook ~50m entities, ~500m assertions
eBay ~1bn triples
IBM ~100m entities, 5bn relationships
Open KGs (April 2021)
DBpedia ~4.58m entities, ~9.25GB
Yago4 ~50m entities, ~18.4GB
Wikidata ~93m entities, ~99GB
N. Noy, Y. Gao, A. Jain, A. Narayanan, A. Patterson, J. Taylor: Industry-scale Knowledge Graphs: Lessons and Challenges. ACM Queue 17(2), 2019.
A. Hogan et al.: Knowledge Graphs. CoRR abs/2003.02320, 2020.
3. RDF Knowledge Graphs
[Figure: ABox (Data) versus TBox (Schema). Labels: Limited (many entities), Comprehensive (fewer entities), Generic (applies to many), Specific (applies to few). Example nodes: Q76 Barack Obama (3,947 axioms), Q58043963 Armin Haller (189 axioms), Q35120 Entity, Chess, Person, Q73145133. Example properties: P361 partOf, P1872 minimum no of players.]
5. Linked (Open) data principles
• LDP1: Use URIs as identifiers for things;
• LDP2: Use HTTP URIs so those identifiers can be
dereferenced;
• LDP3: Return useful information upon dereferencing of those
URIs using a standard format (typically, RDF);
• LDP4: Include links using externally dereferenceable URIs.
T. Berners-Lee. 2006. Linked Data. W3C Design Issues. From http://www.w3.org/DesignIssues/LinkedData.html, 2010.
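LDP2 and LDP3 can be illustrated with HTTP content negotiation. The following is a minimal sketch, assuming the server honours an Accept header asking for Turtle; the helper names build_request and dereference are hypothetical, and text/turtle is only one common RDF media type.

```python
from urllib.request import Request, urlopen


def build_request(uri, media_type="text/turtle"):
    """Prepare an HTTP GET that asks for an RDF serialization of `uri`."""
    return Request(uri, headers={"Accept": media_type})


def dereference(uri, media_type="text/turtle", timeout=10):
    """Send the request and return the response body as text (needs network access)."""
    with urlopen(build_request(uri, media_type), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the request does not touch the network; only dereference() does.
request = build_request("http://dbpedia.org/resource/Wolfgang_Amadeus_Mozart")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Whether the returned document is "useful information" in the sense of LDP3 still has to be checked by parsing it; this sketch only covers the request side.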
6. Challenges with Links in Linked Data
(KGs)
• References to many inaccessible URIs (i.e., broken
links) may render a dataset largely useless
• Changes in linked external datasets are outside the control of
the data publisher
• No definition of what constitutes “internal links” (i.e.,
links between parts of one coherent dataset/KG), and
“external links” (i.e., links between different
datasets/KGs)
7. Related Work
Availability and Discoverability of Linked Open Data sources
• P.-Y. Vandenbussche, J. Umbrich, L. Matteis, A. Hogan, C. B. Aranda. 2017. SPARQLES: Monitoring public SPARQL endpoints. Semantic Web 8, 6
(2017), 1049–1065.
• J. Debattista, C. Lange, S. Auer, and D. Cortis. 2018. Evaluating the Quality of the LOD Cloud: An Empirical Investigation. Semantic Web 9, 6 (2018),
859–901.
• A. Polleres, M. R. Kamdar, J. D. Fernández, T. Tudorache, and Mark A. Musen. 2018. A More Decentralized Vision for Linked Data. In Proc. of the
2nd Workshop on Decentralizing the Semantic Web, co-located with ISWC, Vol. 2165. CEUR-WS.org, Monterey, CA, USA, 2019.
Metadata Representation and Quality
• K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. 2011. Describing Linked Datasets with the VoID Vocabulary. W3C Interest Group Note 03
March 2011. W3C.
• A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. 2016. Quality assessment for Linked Data: A Survey. Semantic Web 7, 1
(2016), 63–93.
• A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. 2012. An empirical survey of Linked Data conformance. Journal of Web
Semantics 14 (2012), 14 – 44.
• L. Rietveld, W. Beek, R. Hoekstra, and S. Schlobach. 2017. Metadata for a lot of LOD. Semantic Web 8, 6 (2017), 1067–1080.
Authoritative Namespaces and Links Between Linked Datasets
• Max Schmachtenberg, Christian Bizer, and Heiko Paulheim. 2014. Adoption of the Linked Data Best Practices in Different Topical Domains. In Proc.
of ISWC. LNCS, Riva del Garda, Italy, 245–260.
• A. Harth, S. Kinsella, and S. Decker. 2009. Using Naming Authority to Rank Data and Ontologies for Web Search. In Proc. of ISWC. LNCS,
Washington, DC., USA, 277–292.
• Aidan Hogan, Andreas Harth, Alexandre Passant, Stefan Decker, and Axel Polleres. 2010. Weaving the Pedantic Web. In Proc. of the WWW2010
Workshop on Linked Data on the Web, LDOW (CEUR Workshop Proceedings), Vol. 628. CEUR-WS.org, Raleigh, USA, 1–10.
• A. S. Butt, A. Haller, and L. Xie. 2014. Ontology Search: An Empirical Evaluation. In Proc. of ISWC. LNCS, Riva del Garda, Italy, 130–147.
Linked Data Profiling and Link Analysis Tools
• C. Böhm, F. Naumann, Z. Abedjan, D. Fenz, T. Grütze, D. Hefenbrock, M. Pohl, and D. Sonnabend. 2010. Profiling linked open data with ProLOD. In
Proc. of 26th ICDEW 2010. IEEE, Long Beach, CA, USA, 17–178.
• N. Mihindukulasooriya, M. P.-Villalón, R. García-Castro, and A. Gómez-Pérez. 2015. Loupe - An Online Tool for Inspecting Datasets in the Linked
Data Cloud. In Proc. of ISWC (Posters & Demos). CEUR-WS.org, Bethlehem, PA, USA.
• C. B. Neto, K. Müller, M. Brümmer, D. Kontokostas, and S. Hellmann. 2016. LODVader: An Interface to LOD Visualization, Analytics and DiscovERy
in Real-time. In Proc. of 25th WWW Conference. ACM, Montreal, Quebec, Canada, 163–166.
• M. Ben Ellefi, Z. Bellahsene, S. Dietze, and K. Todorov. 2016. Dataset Recommendation for Data Linking: An Intensional Approach. In Proc. of
ESWC. LNCS, Crete, Greece, 36–51.
• J. Debattista, S. Auer, and C. Lange. 2016. Luzzu - A Methodology and Framework for Linked Data Quality Assessment. Journal of Data and
Information Quality 8, 1 (Oct. 2016), 4:1–4:32.
8. What is a dataset?
• No agreed notion of which sets of triples form a dataset/KG
– Linked Data datasets/KGs published on the Web are often partitioned
into several files, are made available through Linked Data APIs or are
in separate named graphs behind SPARQL endpoints
• Common practice suggests that single datasets and the URIs
“belonging” to these datasets can be referred to by sharing a
common namespace
• This notion of a namespace is typically not tied to a notion of
authority, as opposed to the original intention of URIs in the
Web architecture, where authority is an integral part of URIs:
URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment]
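To make the role of the authority component concrete, here is a small sketch that splits a URI into the five parts of the generic syntax using Python's standard urllib.parse module. The function name uri_components is hypothetical.

```python
from urllib.parse import urlsplit


def uri_components(uri):
    """Split a URI into the five components of the generic URI syntax."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # urllib calls the authority "netloc"
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


print(uri_components("http://dbpedia.org/resource/Wolfgang_Amadeus_Mozart"))
print(uri_components("http://www.w3.org/2000/01/rdf-schema#Class"))
```

Note that the authority alone (dbpedia.org, www.w3.org) is coarser than a namespace: several namespaces can share one authority.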
9. What is a dataset? (cont’d)
• In RDF, whether a namespace (and thereby an authority) can be identified depends on the RDF serialization
and on whether the prefix of an identifier that determines the namespace is clearly recognizable
as such (as opposed to XML)
• Best practice suggests declaring, within the dataset's metadata, which namespace prefixes are authoritatively
owned by the dataset (using for example
vann:preferredNamespacePrefix, vann:preferredNamespaceUri).
Examples of namespaces and their prefixes:
dbr: http://dbpedia.org/resource/
dbo: http://dbpedia.org/ontology/
foaf: http://xmlns.com/foaf/0.1/
wd: http://www.wikidata.org/entity/
– However, 53.8% of all datasets in our analysis did not explicitly declare their namespace(s)
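As an illustration of how a declared prefix table ties URIs to namespaces, the sketch below compacts a full URI to a prefixed name by longest-prefix match. The prefix table is the one listed above; the function name compact is hypothetical.

```python
PREFIXES = {
    "dbr": "http://dbpedia.org/resource/",
    "dbo": "http://dbpedia.org/ontology/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "wd": "http://www.wikidata.org/entity/",
}


def compact(uri, prefixes=PREFIXES):
    """Return the prefixed form of `uri`, or None if no declared namespace matches."""
    best = None
    for prefix, namespace in prefixes.items():
        # Prefer the longest matching namespace in case namespaces are nested.
        if uri.startswith(namespace) and (best is None or len(namespace) > len(best[1])):
            best = (prefix, namespace)
    if best is None:
        return None
    return f"{best[0]}:{uri[len(best[1]):]}"


print(compact("http://www.wikidata.org/entity/Q254"))
print(compact("http://dbpedia.org/ontology/Person"))
```

A URI for which compact returns None belongs to a namespace nobody declared, which is exactly the situation the undeclared datasets leave a consumer in.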
10. What is a link?
• Links in Linked Data do not have a clear definition
of direction (in contrast to hyperlinks)
t1:[dbr:Wolfgang_Amadeus_Mozart, owl:sameAs, wd:Q254]
t2:[dbr:Wolfgang_Amadeus_Mozart, rdf:type, dbo:Person]
t3:[dbr:Wolfgang_Amadeus_Mozart, foaf:name, “Wolfgang Amadeus Mozart”@en]
• The direction of a link does not depend on the URI in
its subject, but on the dataset (KG) in which the
triple appears, e.g., 𝑡1 can be defined in DBpedia or
in Wikidata, making it a link in either direction.
11. Dataset
Definition 1: A dataset is a collection of
one or more associated RDF graphs,
published by a single controlling entity.
Given a dataset 𝑑𝑠, we denote by 𝐺𝑑𝑠 the
merge of all of its graphs.
12. Namespace
Definition 2: Let us assume that each
dataset uses a finite set of namespaces,
some of which it controls authoritatively.
Given a dataset 𝑑𝑠, we denote by 𝑁𝑆𝑑𝑠 the
set of its authoritative namespaces.
We assume each namespace is
authoritatively controlled by at most a single
dataset. That is, we assume that 𝑑𝑠1 ≠ 𝑑𝑠2
implies that 𝑁𝑆𝑑𝑠1 ∩ 𝑁𝑆𝑑𝑠2 = ∅.
13. Non-standard use
Definition 3 (Non-standard use): Let 𝑅𝐷𝐹, 𝑅𝐷𝐹𝑆, 𝑂𝑊𝐿, and 𝑋𝑆𝐷
denote the reserved namespaces. Let 𝐺𝑅𝐷𝐹, 𝐺𝑅𝐷𝐹𝑆,
and 𝐺𝑂𝑊𝐿 denote the RDF graphs accessible at the respective namespace URIs,
where we write 𝐺𝑟𝑒𝑠 = 𝐺𝑅𝐷𝐹 ∪ 𝐺𝑅𝐷𝐹𝑆 ∪ 𝐺𝑂𝑊𝐿. A non-standard triple
in any RDF graph other than 𝐺𝑟𝑒𝑠 is a triple where:
• a class in 𝐺𝑟𝑒𝑠 appears in a position other than as the value of
rdf:type, or
t:[rdfs:Class, rdfs:subClassOf, rdf:Property]
• a property in 𝐺𝑟𝑒𝑠 appears outside of the predicate position.
t:[rdfs:range, rdfs:subPropertyOf, rdf:Property]
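The two conditions of Definition 3 can be sketched as a simple membership test. This is an illustration only: the sets RESERVED_CLASSES and RESERVED_PROPERTIES are small hand-picked samples standing in for the classes and properties of 𝐺𝑟𝑒𝑠 (an assumption), triples are plain tuples of prefixed names, and the function name is hypothetical.

```python
# Hand-picked samples of terms defined in the reserved graphs (assumption:
# a real check would load the full RDF, RDFS and OWL vocabularies).
RESERVED_CLASSES = {"rdfs:Class", "rdf:Property", "owl:Class", "owl:ObjectProperty"}
RESERVED_PROPERTIES = {
    "rdf:type", "rdfs:subClassOf", "rdfs:subPropertyOf",
    "rdfs:domain", "rdfs:range", "owl:sameAs",
}


def is_non_standard(triple):
    """True if the triple uses a reserved class or property in a non-standard position."""
    s, p, o = triple
    # A reserved class may only appear as the value (object) of rdf:type.
    class_misused = (
        s in RESERVED_CLASSES
        or p in RESERVED_CLASSES
        or (o in RESERVED_CLASSES and p != "rdf:type")
    )
    # A reserved property may only appear in the predicate position.
    property_misused = s in RESERVED_PROPERTIES or o in RESERVED_PROPERTIES
    return class_misused or property_misused


print(is_non_standard(("rdfs:Class", "rdfs:subClassOf", "rdf:Property")))
print(is_non_standard(("foaf:knows", "rdf:type", "rdf:Property")))
```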
14. Class position
Definition 4: A URI 𝑢 outside of one of the reserved
namespaces in an RDF triple 𝑡 = (𝑠, 𝑝, 𝑜) is in a
class position if:
• 𝑠 = 𝑢 ∧ 𝑝 ∈ {𝑝|(𝑝, 𝑟𝑑𝑓𝑠: 𝑑𝑜𝑚𝑎𝑖𝑛, 𝑜𝑤𝑙: 𝐶𝑙𝑎𝑠𝑠) ∈ 𝐺𝑟𝑒𝑠 ∨
(𝑝, 𝑟𝑑𝑓𝑠: 𝑑𝑜𝑚𝑎𝑖𝑛, 𝑟𝑑𝑓𝑠: 𝐶𝑙𝑎𝑠𝑠) ∈ 𝐺𝑟𝑒𝑠}
t:[foaf:Person, rdfs:subClassOf, foaf:Agent]
• 𝑜 = 𝑢 ∧ 𝑝 ∈ {𝑝|(𝑝, 𝑟𝑑𝑓𝑠: 𝑟𝑎𝑛𝑔𝑒, 𝑜𝑤𝑙: 𝐶𝑙𝑎𝑠𝑠) ∈ 𝐺𝑟𝑒𝑠 ∨
(𝑝, 𝑟𝑑𝑓𝑠: 𝑟𝑎𝑛𝑔𝑒, 𝑟𝑑𝑓𝑠: 𝐶𝑙𝑎𝑠𝑠) ∈ 𝐺𝑟𝑒𝑠}
t:[foaf:knows, rdfs:range, foaf:Person]
• 𝑜 = 𝑢 ∧ 𝑝 = 𝑟𝑑𝑓: 𝑡𝑦𝑝𝑒
t:[dbr:Wolfgang_Amadeus_Mozart, rdf:type, dbo:Person]
15. Property position
Definition 5: A URI 𝑢 outside of the reserved namespaces
in an RDF triple 𝑡 = (𝑠, 𝑝, 𝑜) is in a property position if:
• 𝑠 = 𝑢 ∧ 𝑝 ∈ {𝑝|(𝑝, 𝑟𝑑𝑓𝑠: 𝑑𝑜𝑚𝑎𝑖𝑛, 𝑜𝑤𝑙: 𝑂𝑏𝑗𝑒𝑐𝑡𝑃𝑟𝑜𝑝𝑒𝑟𝑡𝑦) ∈ 𝐺𝑟𝑒𝑠} ∪
{𝑝|(𝑝, 𝑟𝑑𝑓𝑠: 𝑑𝑜𝑚𝑎𝑖𝑛, 𝑟𝑑𝑓: 𝑃𝑟𝑜𝑝𝑒𝑟𝑡𝑦) ∈ 𝐺𝑟𝑒𝑠}
t:[foaf:knows, rdfs:domain, foaf:Person]
• 𝑝 = 𝑢
t:[wd:Q58043963, foaf:knows, wd:Q54860587]
• 𝑜 = 𝑢 ∧ 𝑝 ∈ {𝑝|(𝑝, 𝑟𝑑𝑓𝑠: 𝑟𝑎𝑛𝑔𝑒, 𝑜𝑤𝑙: 𝑂𝑏𝑗𝑒𝑐𝑡𝑃𝑟𝑜𝑝𝑒𝑟𝑡𝑦) ∈ 𝐺𝑟𝑒𝑠} ∪
{𝑝|(𝑝, 𝑟𝑑𝑓𝑠: 𝑟𝑎𝑛𝑔𝑒, 𝑟𝑑𝑓: 𝑃𝑟𝑜𝑝𝑒𝑟𝑡𝑦) ∈ 𝐺𝑟𝑒𝑠}
t:[foaf:homepage, rdfs:subPropertyOf, foaf:page]
16. Datatype position
Definition 6: A URI 𝑢 outside of the reserved
namespaces in an RDF triple 𝑡 = (𝑠, 𝑝, 𝑜) is in a
datatype position if:
• 𝑠 = 𝑢 ∧ 𝑝 ∈ {𝑝|(𝑝, 𝑟𝑑𝑓𝑠: 𝑑𝑜𝑚𝑎𝑖𝑛, 𝑟𝑑𝑓𝑠: 𝐷𝑎𝑡𝑎𝑡𝑦𝑝𝑒) ∈ 𝐺𝑟𝑒𝑠}
t:[sosa:resultTime, rdfs:range, xsd:dateTime]
• 𝑢 occurs as the datatype of a typed literal 𝑜 = ”𝑙”^^𝑢
t:[ex:observation/123, sosa:hasSimpleResult, "12.4m"^^cdt:length]
• 𝑜 = 𝑢 ∧ 𝑝 ∈ {𝑝|(𝑝, 𝑟𝑑𝑓𝑠: 𝑟𝑎𝑛𝑔𝑒, 𝑟𝑑𝑓𝑠: 𝐷𝑎𝑡𝑎𝑡𝑦𝑝𝑒) ∈ 𝐺𝑟𝑒𝑠}
t:[wd:Q58043963, foaf:name, “Armin Haller”]
17. Instance position
Definition 7: A URI 𝑢 outside of the
reserved namespaces in an RDF triple
𝑡 = (𝑠, 𝑝, 𝑜) that is neither in a class, nor
property, nor datatype position, is in an
instance position.
t:[ex:observation/108, sosa:observedProperty, ex:tree/124/height]
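Definitions 4 to 7 can be read as a decision procedure over the position of a URI in a triple. The sketch below is one possible reading, not the implementation used in the paper: the predicate sets are a hand-picked subset of 𝐺𝑟𝑒𝑠 (an assumption), typed literals are modelled with a hypothetical Literal tuple, and the class check is applied before the property and datatype checks.

```python
from collections import namedtuple

# A typed literal "value"^^datatype (hypothetical representation).
Literal = namedtuple("Literal", ["value", "datatype"])

# Predicates grouped by the rdfs:domain / rdfs:range they have in the reserved
# vocabularies (hand-picked subset; an assumption of this sketch).
DOMAIN_CLASS = {"rdfs:subClassOf", "owl:equivalentClass", "owl:disjointWith"}
RANGE_CLASS = DOMAIN_CLASS | {"rdfs:domain", "rdfs:range"}
DOMAIN_PROPERTY = {"rdfs:subPropertyOf", "rdfs:domain", "rdfs:range", "owl:inverseOf"}
RANGE_PROPERTY = {"rdfs:subPropertyOf", "owl:inverseOf"}
DOMAIN_DATATYPE = {"owl:datatypeComplementOf"}
RANGE_DATATYPE = {"owl:datatypeComplementOf", "owl:onDatatype"}


def position(u, triple):
    """Classify the position of URI `u` in `triple` as class, property, datatype or instance."""
    s, p, o = triple
    if (o == u and p == "rdf:type") or (s == u and p in DOMAIN_CLASS) or (o == u and p in RANGE_CLASS):
        return "class"
    if p == u or (s == u and p in DOMAIN_PROPERTY) or (o == u and p in RANGE_PROPERTY):
        return "property"
    if (
        (isinstance(o, Literal) and o.datatype == u)
        or (s == u and p in DOMAIN_DATATYPE)
        or (o == u and p in RANGE_DATATYPE)
    ):
        return "datatype"
    return "instance"


print(position("foaf:Person", ("foaf:knows", "rdfs:range", "foaf:Person")))
print(position("cdt:length", ("ex:observation/123", "sosa:hasSimpleResult", Literal("12.4m", "cdt:length"))))
```

The function assumes that `u` actually occurs in the triple and lies outside the reserved namespaces, as the definitions require.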
18. Link Types
Definition 8: Let 𝑑𝑠1, 𝑑𝑠2 be datasets. Then, we call a triple 𝑡 ∈ 𝐺𝑑𝑠1 a link from 𝑑𝑠1 to
𝑑𝑠2, if 𝑡 contains a URI 𝑢 from a namespace in 𝑁𝑆𝑑𝑠2. Depending on the position of 𝑢 we
distinguish:
• 𝑡 is called an instance link, if 𝑢 is in an instance position in 𝑡
t:[dbr:Wolfgang_Amadeus_Mozart, owl:sameAs, wd:Q254]
• otherwise, 𝑡 is called an ontology link, where we further distinguish:
– 𝑡 is called a class link, if 𝑢 is in a class position other than the 𝑜 position of an rdf:type triple
t:[dbo:Person, rdfs:subClassOf, foaf:Person]
– 𝑡 is called an instance typing link, if 𝑢 is in the class position 𝑢 = 𝑜 of an rdf:type triple
t:[dbr:Wolfgang_Amadeus_Mozart, rdf:type, foaf:Person]
– 𝑡 is called a property link, if 𝑢 is in a property position other than 𝑝
t:[dbr:Wolfgang_Amadeus_Mozart, foaf:name, "Wolfgang Amadeus Mozart"@en]
– 𝑡 is called an instance role link, if 𝑢 is in the property position 𝑢 = 𝑝
t:[dbr:Wolfgang_Amadeus_Mozart, foaf:knows, wd:Q51088] (Antonio Salieri)
• If 𝑢 does not appear in 𝐺𝑑𝑠2, we call 𝑡 a broken link.
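Definition 8 can be sketched as a scan over the triples of one dataset. This is a simplified illustration under stated assumptions: triples are tuples of prefixed names, a namespace is represented by its prefix (e.g. "foaf:"), the schema predicate sets are a hand-picked subset of the reserved vocabularies, and the names link_type and find_links are hypothetical.

```python
# Hand-picked schema predicates (an assumption, not the full reserved graphs).
CLASS_SUBJECT = {"rdfs:subClassOf", "owl:equivalentClass"}
CLASS_OBJECT = {"rdfs:subClassOf", "owl:equivalentClass", "rdfs:domain", "rdfs:range"}
PROPERTY_SUBJECT = {"rdfs:subPropertyOf", "rdfs:domain", "rdfs:range"}
PROPERTY_OBJECT = {"rdfs:subPropertyOf"}


def link_type(u, triple):
    """Name the link type of `triple` with respect to the external URI `u`."""
    s, p, o = triple
    if p == u:
        return "instance role link"
    if o == u and p == "rdf:type":
        return "instance typing link"
    if (s == u and p in CLASS_SUBJECT) or (o == u and p in CLASS_OBJECT):
        return "class link"
    if (s == u and p in PROPERTY_SUBJECT) or (o == u and p in PROPERTY_OBJECT):
        return "property link"
    return "instance link"


def find_links(g_ds1, ns_ds2, uris_in_g_ds2):
    """List (u, link type, broken?) for every URI in g_ds1 that falls under a namespace of ds2."""
    prefixes = tuple(ns_ds2)
    found = []
    for triple in g_ds1:
        for u in triple:
            if u.startswith(prefixes):
                # The link is broken if the target dataset does not contain u.
                found.append((u, link_type(u, triple), u not in uris_in_g_ds2))
    return found


g_dbpedia = [
    ("dbr:Wolfgang_Amadeus_Mozart", "owl:sameAs", "wd:Q254"),
    ("dbo:Person", "rdfs:subClassOf", "foaf:Person"),
    ("dbr:Wolfgang_Amadeus_Mozart", "rdf:type", "foaf:Person"),
    ("dbr:Wolfgang_Amadeus_Mozart", "foaf:knows", "wd:Q51088"),
]
for link in find_links(g_dbpedia, {"foaf:"}, {"foaf:Person", "foaf:knows"}):
    print(link)
```

The same triple can yield several links, one per external URI it contains: the last example triple is an instance role link to FOAF and an instance link to Wikidata.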
19. Empirical dataset
• Crawl of the LODcloud + historical datasets from the LODcloud that were
cached in the LODLaundromat
• 430 Linked datasets in resulting corpus, each encoded in HDT for a total
size of 51 GB (3.3bn triples)
                         #        % of total   Available   Available as % of total
Total # of datasets      1,359    100%
SPARQL endpoint          459      33.5%        125         9.1%
Available as download    890      65.4%        226         16.6%

Characteristic                 Median   Mean
Number of triples              4,478    17,860,436
Number of unique subjects      613      1,774,578
Number of unique predicates    31       65.4%
Number of unique objects       2,245    5,296,390
A. Abele, J. P. McCrae, P. Buitelaar, A. Jentzsch, and R. Cyganiak. 2017. Linking open data cloud diagram. URL: http://lod-cloud.net. Insight-Centre.
20. Ontology corpus
• Ontologies typically only consist of terminological axioms 𝑇 (TBox),
they may also include a set of assertional axioms 𝐴 (ABox) (e.g.,
codelists or thesaurological terms)
• Datasets registered in the LODcloud are typically not ontologies (i.e.,
in our analysis only 3/430 are ontologies)
• Ontology corpus created through a crawl of prefix.cc and through the
analysis of all ontology links in our empirical dataset
# of unique Classes 204,616
# of unique Properties 1,821
Ratio 1/112
21. Authoritative namespace
• To identify links (ontology or instance links) we need to identify
the dataset authority for each namespace
1. Use HDT to extract all namespaces in each RDF dataset
2. Compute the ‘relative occurrence’ of each namespace in the dataset
3. Check if a namespace that is extensively used in a dataset is in fact
an external link to a dataset that is not in the corpus. Therefore, we
define only one authoritative namespace for each dataset
4. Only consider the Pay Level Domains (PLD) of the authoritative namespace
# of datasets in our corpus 430
# of datasets with authoritative namespace 395
# of datasets with namespace in LOD Cloud metadata 257
# of datasets matching authoritative namespace and LOD Cloud metadata 162
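Steps 2 and 4 of the procedure above can be sketched as follows. This is not the pipeline used in the paper: it works on a plain list of URIs rather than an HDT file, it cuts the namespace at the last "#" or "/" (a common heuristic, assumed here), and naive_pld simply keeps the last two host labels instead of consulting the Public Suffix List, so it would be wrong for hosts such as example.co.uk.

```python
from collections import Counter
from urllib.parse import urlsplit


def namespace_of(uri):
    """Heuristic namespace: everything up to and including the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1]


def relative_occurrence(uris):
    """Share of URI occurrences per namespace."""
    counts = Counter(namespace_of(u) for u in uris)
    total = sum(counts.values())
    return {ns: n / total for ns, n in counts.items()}


def naive_pld(namespace):
    """Last two labels of the host name (a rough stand-in for the pay-level domain)."""
    host = urlsplit(namespace).hostname or ""
    return ".".join(host.split(".")[-2:])


uris = [
    "http://dbpedia.org/resource/Wolfgang_Amadeus_Mozart",
    "http://dbpedia.org/resource/Antonio_Salieri",
    "http://dbpedia.org/resource/Vienna",
    "http://xmlns.com/foaf/0.1/Person",
]
shares = relative_occurrence(uris)
candidate = max(shares, key=shares.get)
print(candidate, shares[candidate], naive_pld(candidate))
```

The most frequent namespace is only a candidate: step 3 still has to rule out the case where it is an external namespace of a dataset outside the corpus.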
27. Link analysis in Wikidata
• Wikidata is by far the largest openly available KG and the only one truly built bottom-up → cause of
many modelling errors/inconsistencies (e.g., 4,182 separate properties for external identifiers)
• Not part of the LODCloud and therefore not included in our paper; however, we have since
analysed the 9 March 2020 Wikidata dump (HDT file, 49.4GB compressed)
Number of triples 3,381,623,911
Number of unique subjects 1,327,447,995
Number of predicates 32,713
Number of unique objects 2,010,015,636
Number of shared subject-object 1,173,987,281
Unique Individuals 75,261,968
Class Links 375,351,770
Property Links 2,723,834
of which sameAs links 2,723,834
Instance Typing Links 77,479,623
# of Classes 1,045,455
# of Properties 74,746
Ratio 1/14
# of unique Properties 7,259
[Chart: external identifier properties, 4,182 (e.g., P2014 Museum of Modern Art work ID; P6276 Amazon Music artist ID; P6145 Academy Awards Database film ID). Others, 3,077 (e.g., P4330 contains; P150 contains administrative territorial entity; P1383 contains settlement; P2821 by-product; P2822 by-product of ID).]
28. Discussion
• Instance Typing Links are most used
– The only ones that grow linearly with dataset size
– Instance links are less popular than one would assume, and do not grow linearly with size
• Ontologies are reused widely
– Only a few datasets define their own ontology. This is a sign that:
1. Dataset publishers follow best practices and separate the ontology namespace from the authoritative
namespace of the dataset
2. A large number of ontologies already exist that cover many domains and can be readily reused
• Need for ontology publishing best practices
– Authoritative ontology register needed (e.g., the LOV portal)
– A persistence mechanism that assigns a DOI to an ontology and persists the document
• Ubiquity of broken Class and Property links
– Alarming number of broken links, i.e., more than half of all class and property URIs
– Data publishers need to consider replicating linked ontologies
P.-Y. Vandenbussche, G. Atemezing, M. Poveda-Villalón, and B. Vatant. 2017. Linked Open Vocabularies (LOV): A gateway to
reusable semantic vocabularies on the Web. Semantic Web 8, 3 (2017), 437–452.
29. Discussion (cont’d)
• Lack of ABox Links
– Many (28% of all) datasets do not use any Instance Links, and owl:sameAs is not
particularly popular at all (other than in Wikidata)
1. these links are expensive to establish manually
2. expensive to maintain, and
3. even if they exist, there is no incentive to publish them openly.
• Lack of and incorrect namespace declarations
– Only 59% of all datasets in our corpus publish a namespace in their metadata, and of
those 257, only 162 match the namespace we obtained through our analysis
• Plethora of data and metadata formats
– Heterogeneity of publication formats, potentially involving parse errors, accounted for a
major part of the effort spent on our experiments
– Once each dataset node/dump had been converted to HDT, the analysis was easy:
link computations can be done at scale on even large datasets in HDT