Knowledge Graph Engineering

Keynote at Summer School on AI for Industry 4.0
Armin Haller
Associate Professor, ANU

Knowledge Graphs (KGs)
“A Knowledge Graph is a graph of data intended to accumulate and
convey knowledge of the real world, whose nodes represent entities of
interest and whose edges represent relations between these entities.”
[Hogan et al., 2020]
• Knowledge graphs are created collaboratively by many users
• Information can be added in a relatively arbitrary manner as
structural constraints are few
Closed KGs (~2019) [Noy et al., 2019]
Microsoft ~2bn entities, ~55bn facts
Google ~1bn entities, ~70bn assertions
Facebook ~50m entities, ~500m assertions
eBay ~1bn triples
IBM ~100m entities, 5bn relationships
Open KGs (April 2021)
DBpedia ~4.58m entities, ~9.25GB
Yago4 ~50m entities, ~18.4GB
Wikidata ~93m entities, ~99GB

Knowledge Graphs (KGs)
• Graphs: natural way of structuring and presenting knowledge
• Heterogeneous: knowledge from different sources can be integrated and/or interlinked
• Schema-later: schema often not decided until later, and does not impose integrity constraints

Schema in KGs
Ontologies as schemas in KGs
An ontology is an “explicit specification of a conceptualization consisting of a set of
objects, and the describable relationships among them”
[Gruber, 1993]
Components of an Ontology
• Classes: abstract groups (sets) of objects that are defined by properties that all its
members share (e.g., Person, Organisation, Event)
• Attributes: characteristics or parameters that objects (and classes) can have (e.g.,
date of birth, longitude, latitude, timestamp)
• Relationships: ways in which classes and individuals can be related to one another
(e.g., role, attributed to, observed by)
• Individuals: Concrete objects that are inherent to the domain of discourse, such as
specific people, organisations or abstract individuals such as numbers (e.g., g, π)
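
The four components can be illustrated with a minimal sketch. The `Ontology` container, the `memberOf` relationship and the helper `undeclared_classes` are hypothetical names chosen for illustration; they are not part of any ontology language or library.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy container for the four ontology components (illustrative only)."""
    classes: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)     # attribute -> class it describes
    relationships: dict = field(default_factory=dict)  # relationship -> (domain, range)
    individuals: dict = field(default_factory=dict)    # individual -> class

    def undeclared_classes(self):
        """Return class names that are used somewhere but never declared."""
        used = set(self.attributes.values()) | set(self.individuals.values())
        for domain, range_ in self.relationships.values():
            used.update((domain, range_))
        return sorted(used - self.classes)


onto = Ontology(
    classes={"Person", "Organisation"},
    attributes={"date of birth": "Person"},
    relationships={"memberOf": ("Person", "Organisation")},  # hypothetical relationship
    individuals={"Armin Haller": "Person", "ANU": "Organisation"},
)
```

Here `onto.undeclared_classes()` returns an empty list because every class that is referenced has been declared.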

RDF Knowledge Graphs
[Figure: an RDF KG split into ABox (Data) and TBox (Schema).
ABox axis: Limited (many entities) to Comprehensive (fewer entities), e.g., Armin Haller (189 axioms), Barack Obama (Q76; 3,947 axioms).
TBox axis: Generic (applies to many) to Specific (applies to few), e.g., Entity (Q35120), partOf (P361), minimum no of players (P1872).
Other labels: Chess, Person, Q58043963, Q73145133.]
Meta-modelling issues in KGs
Without enforced (upfront designed) schemas, KGs suffer from, e.g.:
• Inconsistent modelling of classes/instances
<Q1412680> <P279> <Q28100368> | <Beef Wellington> <subclass of> <Beef Dish>
<Q6497852> <P31> <Q28100665> | <Wiener Schnitzel> <instance of> <Veal Dish>
• Subclassing of disjoint super-classes
<Q190928> <P279> <Q124282> | <shipyard> <subclass of> <dock>
<Q190928> <P279> <Q4830453> | <shipyard> <subclass of> <business>
<Q124282> <P279> <Q7184903> | <shipyard> <subclass of> <abstract object>
<Q190928> <P279> <Q223557> | <shipyard> <subclass of> <physical object>
• Instance of relations between first-order classes
<Q12156> <P31> <Q12136> | <Malaria> <instance of> <Disease>
<Q12156> <P279> <Q12136> | <Malaria> <subclass of> <Disease>
• Redundant/circular inheritances between first-order classes
<Q18557307> <P279> <Q692536> | <muscle tissue disease> <subclass of> <muscular disease>
<Q692536> <P279> <Q18557307> | <muscular disease> <subclass of> <muscle tissue disease>
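
Two of these issues can be detected mechanically. The following is a minimal sketch, assuming the KG is available as (subject, predicate, object) tuples of Wikidata identifiers; the function names are hypothetical and the sample triples are taken from the examples above.

```python
from collections import defaultdict

INSTANCE_OF = "P31"   # Wikidata "instance of"
SUBCLASS_OF = "P279"  # Wikidata "subclass of"

triples = [
    ("Q12156", "P31", "Q12136"),       # Malaria instance of Disease
    ("Q12156", "P279", "Q12136"),      # Malaria subclass of Disease
    ("Q18557307", "P279", "Q692536"),  # muscle tissue disease -> muscular disease
    ("Q692536", "P279", "Q18557307"),  # muscular disease -> muscle tissue disease
    ("Q190928", "P279", "Q124282"),    # shipyard subclass of dock
]


def instance_and_subclass(triples):
    """Pairs (x, y) where x is both an instance and a subclass of y."""
    inst = {(s, o) for s, p, o in triples if p == INSTANCE_OF}
    sub = {(s, o) for s, p, o in triples if p == SUBCLASS_OF}
    return sorted(inst & sub)


def subclass_cycles(triples):
    """Classes that can reach themselves by following subclass-of edges."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == SUBCLASS_OF:
            parents[s].add(o)

    def reaches_itself(start):
        stack, seen = list(parents[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        return False

    return sorted(c for c in list(parents) if reaches_itself(c))
```

On the sample data, `instance_and_subclass` flags the Malaria/Disease pair and `subclass_cycles` flags the two muscle disease classes.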

Types of Schemas (Ontologies)
Level of abstraction, from most general (highest reusability) to most specific (lowest reusability):
• Upper Ontologies, e.g., Cyc, SUMO, DOLCE, BFO
• Mid-Level Ontologies, e.g., PROV-O, FOAF, ORG, SOSA/SSN, AGRIF
• Domain Ontologies, e.g., GO, ChEBI, DO, BTO
• Use-Case Ontologies
[Haller & Polleres, 2020a]

KG Engineering
• KG Creation: extract data from existing resources
• KG Usage
• KG Linking: add instance assertions
• KG Curation: add schema assertions

KG Creation – Develop Schema
• Top-Down: schema first, data later
• Bottom-Up: data first, schema later
• Middle-Out
[Figure labels: ABox (Data), TBox (Schema)]

KG Creation
Bottom-Up KG Creation
• Schema is not defined, and data is added organically and manually using tools such as:
– OntoWiki [Frischmuth et al., 2015]
– Semantic MediaWiki [Krötzsch et al., 2006]
– Wikibase
– Schímatos [Wright et al., 2020]
Top-Down KG Creation
• Schema is created upfront, existing data mapped to schema using languages/tools such as:
– R2RML
– SPARQL Generate [Lefrançois et al., 2017]
– SHACL Rules
– TARQL
– NLP/NER from unstructured text
Middle-Out KG Creation [Sure et al., 2004]
• Schema is partly defined upfront, with mappings added later when data defines semantics
• Use case data is provided upfront
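
The top-down approach can be sketched as follows: a mapping from source columns to schema properties is fixed upfront, then applied to existing tabular data. This is a minimal illustration in plain Python, not R2RML, TARQL or SPARQL Generate syntax; the `http://example.org/` namespace, the column names and the function `rows_to_ntriples` are assumptions made for the example.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical mapping designed upfront: CSV column -> property IRI
MAPPING = {"name": BASE + "name", "employer": BASE + "worksFor"}


def rows_to_ntriples(csv_text, id_column, class_iri, mapping):
    """Emit one typing triple per row plus one triple per mapped, non-empty column.

    Assumes the values in id_column are safe to append to an IRI.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row[id_column]}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{class_iri}> .")
        for column, prop in mapping.items():
            value = row.get(column)
            if value:
                # Escape backslashes and quotes for an N-Triples string literal
                escaped = value.replace("\\", "\\\\").replace('"', '\\"')
                lines.append(f'{subject} <{prop}> "{escaped}" .')
    return lines


data = "id,name,employer\np1,Ada,ANU\n"
ntriples = rows_to_ntriples(data, "id", BASE + "Person", MAPPING)
```

In a bottom-up setting the same rows would be loaded first and the mapping to classes and properties decided afterwards.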

KG Curation
Correctness
– Evaluation
Accessibility, Accuracy, Consistency, Conciseness, Trustability,
Dynamicity, Representationality [Zaveri et al., 2016]
– Correction
Evaluating data quality (SHACL, ShEx)
• Syntactic errors
• Semantic errors
Completeness
– KG Completion [Paulheim, 2017]
Using structural information observed in triples
• Classification
• Probabilistic and Statistical Methods
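
Shape-based evaluation can be illustrated with a small sketch. It is loosely modelled on the cardinality constraints of SHACL (sh:minCount, sh:maxCount) but is not a SHACL or ShEx implementation; the `SHAPE` dictionary format, the short predicate names and the function `validate` are assumptions made for the example.

```python
# Hypothetical shape: every Person has exactly one name and at most one date of birth
SHAPE = {
    "target_class": "Person",
    "properties": {
        "name": {"min_count": 1, "max_count": 1},
        "dateOfBirth": {"max_count": 1},
    },
}


def validate(triples, shape):
    """Return a list of (node, property, message) violations."""
    targets = sorted(
        s for s, p, o in triples if p == "type" and o == shape["target_class"]
    )
    violations = []
    for node in targets:
        for prop, constraint in shape["properties"].items():
            count = sum(1 for s, p, _ in triples if s == node and p == prop)
            if count < constraint.get("min_count", 0):
                violations.append((node, prop, "too few values"))
            if count > constraint.get("max_count", float("inf")):
                violations.append((node, prop, "too many values"))
    return violations


sample = [
    ("a", "type", "Person"), ("a", "name", "A"),
    ("b", "type", "Person"),
    ("b", "dateOfBirth", "1970-01-01"), ("b", "dateOfBirth", "1971-01-01"),
]
report = validate(sample, SHAPE)
```

Node `a` conforms, while node `b` is reported twice: once for the missing name and once for the duplicate date of birth.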

KG Linking
Linked Data Principles [Berners-Lee, 2006]
• LDP1: Use URIs as identifiers for things;
• LDP2: Use HTTP URIs so those identifiers can be
dereferenced;
• LDP3: Return useful information upon dereferencing
of those URIs using a standard format (typically,
RDF);
• LDP4: Include links using externally
dereferenceable URIs.

KG Linking
Linking Issues [Haller et al., 2020b]
• References to many inaccessible URIs (i.e., broken links) may
render a KG largely useless
• Changes in linked external KGs are out of control of the KG
publisher
• Previously, no definition of what constitutes a “link”, specifically
“internal links” (i.e., links between parts of one coherent KG)
and “external links” (i.e., links between different KGs)
– A triple is a link if it contains a URI in a namespace other than the
authoritative namespace URI of the dataset/KG where the triple is
defined. [Haller et al., 2020b]
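
The definition above can be read as a namespace test. The sketch below is one possible reading, assuming triples are tuples of strings and that any term not starting with `http://` or `https://` is a literal. The optional `ignored_ns` parameter (for example, to skip the W3C vocabulary namespaces) is an assumption added for illustration, not part of the quoted definition.

```python
def is_uri(term):
    """Treat a term as a URI if it looks like an HTTP(S) URI (simplifying assumption)."""
    return term.startswith("http://") or term.startswith("https://")


def is_link(triple, authoritative_ns, ignored_ns=()):
    """True if any URI in the triple lies outside the authoritative namespace."""
    for term in triple:
        if not is_uri(term):
            continue
        if term.startswith(authoritative_ns):
            continue
        if any(term.startswith(ns) for ns in ignored_ns):
            continue
        return True
    return False


DBPEDIA = "http://dbpedia.org/"
mozart = DBPEDIA + "resource/Wolfgang_Amadeus_Mozart"

external = (mozart, "http://www.w3.org/2002/07/owl#sameAs",
            "http://www.wikidata.org/entity/Q254")
internal = (mozart, DBPEDIA + "ontology/birthPlace", DBPEDIA + "resource/Salzburg")
typed = (mozart, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
         DBPEDIA + "ontology/Person")
```

Read literally, the `typed` triple counts as a link because rdf:type itself lives outside the DBpedia namespace; passing `ignored_ns=("http://www.w3.org/",)` excludes such cases.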

KG Linking – Link Types
• Ontology links [Haller et al., 2020b]
– class link
t:[dbo:Person, rdfs:subClassOf, foaf:Person]
– instance typing link
t:[dbr:Wolfgang_Amadeus_Mozart, rdf:type, foaf:Person]
– property link
t:[dbr:Wolfgang_Amadeus_Mozart, foaf:name, "Wolfgang Amadeus Mozart"@en]
– instance role link
t:[dbr:Wolfgang_Amadeus_Mozart, foaf:knows, wd:Q51088] (Antonio Salieri)
• Instance link
t:[dbr:Wolfgang_Amadeus_Mozart, owl:sameAs, wd:Q254]
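
The link types can be told apart with a few rules. The sketch below is a heuristic derived only from the five example triples above; the formal definitions in [Haller et al., 2020b] may differ. The function `classify_link` and the decision to treat W3C vocabulary predicates as neutral are assumptions.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
W3C_NS = "http://www.w3.org/"  # assumption: W3C vocabulary is not an external ontology


def is_uri(term):
    return term.startswith("http://") or term.startswith("https://")


def classify_link(triple, authoritative_ns):
    """Return a link type name, or None if the triple is not recognised as a link."""
    s, p, o = triple

    def external(term):
        return is_uri(term) and not term.startswith(authoritative_ns)

    if p == OWL_SAME_AS and external(o):
        return "instance link"
    if p == RDF_TYPE and external(o):
        return "instance typing link"
    if p == RDFS_SUBCLASS_OF and external(o):
        return "class link"
    if external(p) and not p.startswith(W3C_NS):
        # External property: object is either another resource or a literal
        return "instance role link" if is_uri(o) else "property link"
    return None


DBP = "http://dbpedia.org/"
FOAF = "http://xmlns.com/foaf/0.1/"
WD = "http://www.wikidata.org/entity/"
mozart = DBP + "resource/Wolfgang_Amadeus_Mozart"
```

For example, `classify_link((mozart, FOAF + "knows", WD + "Q51088"), DBP)` yields "instance role link", matching the slide.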

KG Linking in the wild
• Crawl of the LODcloud [Abele et al., 2017] + historical datasets from the
LODcloud that were cached in the LODLaundromat
• 430 Linked datasets in resulting corpus, each encoded in HDT for a total
size of 51 GB (3.3bn triples)
Datasets | # | % of total | Available | Available as % of total
Total # of datasets | 1,359 | 100% | |
SPARQL endpoint | 459 | 33.5% | 125 | 9.1%
Available as download | 890 | 65.4% | 226 | 16.6%

Characteristic | Median | Mean
Number of triples | 4,478 | 17,860,436
Number of unique subjects | 613 | 1,774,578
Number of unique predicates | 31 | 65.4%
Number of unique objects | 2,245 | 5,296,390

KG Linking in the wild (cont’d)
Class Links
http://vivo.iu.edu 119,538
http://vivo.scripps.edu 63,128
http://www.imagesnippets.com 12,874
http://core.kmi.open.ac.uk 9,143
http://commons.wikimedia.org 8,258
http://vivo.psm.edu 8,036
http://datos.bne.es 2,778
http://dbpedia.org 1,614
http://www.productontology.org 1,000
http://vivoweb.org 84
Median 0
Mean 1,299
% above 0 44%

Property Links
http://commons.wikimedia.org 4,995
http://datos.bne.es 1,255
http://vivo.iu.edu 510
http://vivo.psm.edu 481
http://vivoweb.org 386
http://vivo.scripps.edu 187
http://semanticscience.org 168
http://www.iupac.org 102
http://dbpedia.org 101
http://tkm.kiom.re.kr 60
Median 0
Mean 47
% above 0 18%

Instance Typing Links
http://webisa.webdatacommons.org 101,491,507
http://commons.wikimedia.org 100,022,186
http://lod.b3kat.de 40,674,519
http://lod.hebis.de 39,160,423
http://d-nb.info 20,096,228
http://datos.bne.es 7,419,630
http://data.ordnancesurvey.co.uk 5,653,997
http://data.europeana.eu 4,987,332
http://id.loc.gov 1,570,877
http://data.bibsys.no 1,440,011
Median 206
Mean 1,967,570
% above 0 97%

Instance Links
http://ld.zdb-services.de 398,381,851
https://data.gov.cz 3,081,559
http://core.kmi.open.ac.uk 1,696,618
http://id.loc.gov 1,143,545
http://data.europeana.eu 687,735
http://spraakbanken.gu.se 451,081
http://www.imagesnippets.com 214,362
http://data.coi.cz 34,277
Median 206
Mean 4,240,890
% above 0 72%

• Selected predicates used in links
Predicate | owl:sameAs | owl:differentFrom | rdfs:seeAlso | owl:AllDifferent
Median | 0 | 0 | 0 | 0
Mean | 503,859 | 581 | 2,735 | 0
% above 0 | 53% | <1% | 14% | 0
P90% | 1,460 | 0 | 1 | 0
1st | http://commons.wikimedia.org | | | N/A
1st # | 40,636,493 | 103,439 | 324,659 |
2nd | http://ld.zdb-services.de | N/A | http://stitch.cs.vu.nl | N/A
2nd # | 18,049,155 | | |
3rd | http://d-nb.info | N/A | http://data.nobelprize.org | N/A
3rd # | 17,410,586 | | |

Total Links
http://ld.zdb-services.de 421,206,061
http://webisa.webdatacommons.org 101,491,507
http://lod.b3kat.de 40,677,795
http://datos.bne.es 7,428,111
http://data.europeana.eu 5,675,067
https://data.gov.cz 3,958,043
Median 416
Mean 6,209,808
% above 0 96%
Broken Class URIs and Broken Property URIs (by HTTP response code)
HTTP Code | Class URIs, Prefix.cc crawl: # | % | Class URIs, LOD corpus: # | % | Property URIs, Prefix.cc crawl: # | % | Property URIs, LOD corpus: # | %
200 | 7,175 | 12.3% | 2,579 | 12.8% | 814 | 44.7% | 58,108 | 40.9%
301 | 18,598 | 31.8% | 2,610 | 12.9% | 442 | 24.3% | 1,137 | 0.8%
302 | 4,331 | 7.4% | 925 | 4.6% | 194 | 10.7% | 1,391 | 1.0%
303 | 12,805 | 21.9% | 3,903 | 19.3% | 108 | 5.9% | 5,247 | 3.7%
40x | 12,054 | 20.6% | 8,664 | 42.9% | 130 | 7.1% | 73,366 | 51.7%
50x | 66 | <0.1% | 111 | <0.1% | 4 | <0.1% | 362 | 0.3%
No response | 146,145 | 5.9% | 1,425 | 7% | 129 | 7.1% | 2,332 | 1.6%
Total | 204,616 | 100% | 20,217 | 100% | 1,821 | 100% | 141,943 | 100%
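
As a worked example, the share of broken URIs in the LOD corpus can be recomputed from the counts in the table. The sketch assumes that 40x, 50x and "No response" count as broken and that 200 and 30x responses count as resolvable; that grouping is an assumption, as is the function name `broken_share`.

```python
# Counts per HTTP response code, LOD corpus columns of the table above
LOD_CLASS = {"200": 2579, "301": 2610, "302": 925, "303": 3903,
             "40x": 8664, "50x": 111, "No response": 1425}
LOD_PROPERTY = {"200": 58108, "301": 1137, "302": 1391, "303": 5247,
                "40x": 73366, "50x": 362, "No response": 2332}

BROKEN = {"40x", "50x", "No response"}  # assumption: redirects are not broken


def broken_share(counts):
    """Fraction of URIs whose dereferencing failed."""
    total = sum(counts.values())
    broken = sum(n for code, n in counts.items() if code in BROKEN)
    return broken / total


class_pct = round(broken_share(LOD_CLASS) * 100, 1)        # 50.5
property_pct = round(broken_share(LOD_PROPERTY) * 100, 1)  # 53.6
```

Under this grouping, just over half of the class URIs and of the property URIs in the LOD corpus fail to resolve.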

KG Linking in the wild – Wikidata
• Wikidata is by far the largest openly available KG and the only one truly built bottom-up → cause of
many modelling errors/inconsistencies
• Not part of the LODCloud and therefore not included in [Haller et al., 2020b]; however, we have since
analysed the 9 March 2020 Wikidata dump (HDT file, 49.4GB compressed)
Number of triples 3,381,623,911
Number of unique subjects 1,327,447,995
Number of predicates 32,713
Number of unique objects 2,010,015,636
Number of shared subject-object 1,173,987,281
Unique Individuals 75,261,968
Class Links 375,351,770
Property Links 2,723,834
of which sameAs links 2,723,834
Instance Typing Links 77,479,623
# of Classes 1,045,455
# of Properties 74,746
Ratio 1/14
# of unique Properties 7,259

• Ontologies are reused widely
– Only a few KGs define their own ontology → a large number of
ontologies exist that cover already many domains
• Ubiquity of broken Class and Property links
– Alarming number of broken links, i.e., more than half of all class and
property URIs
– Data publishers should consider replicating the ontologies they link to
• Lack of Instance Links
– Many (28% of all) KGs do not use any Instance Links, and
owl:sameAs is not particularly popular (other than in Wikidata). Likely reasons:
1. these links are expensive to establish manually,
2. they are expensive to maintain, and
3. even if they exist, there is no incentive to publish them openly.

KG Usage
• Knowledge Management, Knowledge
Discovery
• Training of ML models with KGs
• Conversational Agents
– Q&A
– Personal Assistants
– Chatbots
• Open Data

Building the AGRIF KG
Australian Government Records Interoperability Framework
• Address discovery and semantic interoperability needs in Australian
Government
• Combine records/archives/information management with contemporary
data science
• Emphasise business benefit to the creators of information
• Make sure it does not require an entirely new skillset for everyone involved
• Build proof-of-concept KG for two use case agencies

[Figure: KG engineering steps for the AGRIF KG]
• Learning graph shapes from KG
• KG Usage
• Adding schema links to external KG
• Develop AGRIF ontology
• Map from source metadata to JSON objects
• Map from JSON objects to RDFS/OWL
• Extract data from unstructured sources using NLP/NER
• KG Curation (e.g., entity reconciliation)

Architecture
[Figure: AGRIF KG architecture]
• Inputs: .pdf, .docx, .msg, .xlsx, .csv, …
• Components: Metadata Extractor, NLP/NER-Toolkit, Document Store (CouchDB), J2RM, Triple Store (Virtuoso), SHACL Learner, Active Knowledge Graph Completion, Schímatos Platform, KG-I, Protégé, APIs
• Data formats: JSON, RDF
• Users: End User, Domain Expert

AGRIF KG tools
• Schema
– AGRIF Ontology
http://reference.data.gov.au/def/ont/agrif
• Open-source software
– Metadata Extractor & Loader (MEL)
– JSON to RDF Mappings (J2RM) [Méndez et al., 2020]
– SHACLearner [Omran et al., 2020]
– Schímatos [Wright et al., 2020]

Conclusions
• Stronger focus on the end user needed
– Tools/methods needed for creating/maintaining KGs
– Tools/methods needed to support querying/analysing KG
Schemas
• Improved NLP/NER-based learning techniques
needed (distant supervision) that build s-p-o
relations from unstructured text [Mintz et al., 2009]
• Permanent distributed querying/replication of
data/schema

References
• Hogan, A., et al.: Knowledge Graphs. ACM Computing Surveys (to appear), 2021.
• Noy, N., Gao, Y., Jain, A., Narayanan, A., Patterson, A., Taylor, J.: Industry-scale Knowledge Graphs: Lessons and Challenges. ACM Queue 17(2),
2019.
• Gruber, T.: A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5(2):199-220, 1993.
• Frischmuth, P., Martin, M., Tramp, S., Riechert, T., Auer, S.: OntoWiki – An Authoring, Publication and Visualization Interface for the Data Web.
Semantic Web, vol. 6, no. 3, pp. 215-240, 2015.
• Krötzsch, M., Vrandečić, D., Völkel, M.: Semantic MediaWiki. The Semantic Web – ISWC 2006.
• Wright, J., Méndez, S. J. R., Haller, A., Taylor, K., Omran, P. G.: Schímatos: a SHACL-based Web-Form Generator for Knowledge Graph Editing. The
Semantic Web – ISWC 2020.
• Lefrançois, M., Zimmermann, A., Bakerally, N.: A SPARQL Extension for Generating RDF from Heterogeneous Formats. ESWC (1), 2017.
• Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., Auer, S.: Quality assessment for linked data: A survey. Semantic Web 7 (1), 63-93,
2016.
• Paulheim, H.: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8(3): 489-508, 2017.
• Berners-Lee, T.: Linked Data. W3C Design Issues. URL: http://www.w3.org/DesignIssues/LinkedData.html, 2006.
• Haller, A., Polleres, A.: Are we better off with just one ontology on the Web? Semantic Web 11(1): 87-99, 2020a.
• Sure, Y., Staab, S., Studer, R.: On-To-Knowledge Methodology (OTKM). Handbook on Ontologies, pp. 117-132, 2004.
• Haller, A., Fernández, J. D., Kamdar, M. R., Polleres, A.: What Are Links in Linked Open Data? A Characterization and Evaluation of Links between
Knowledge Graphs on the Web. ACM J. Data Inf. Qual. 12(2): 9:1-9:34, 2020b.
• Abele, A., McCrae, J. P., Buitelaar, P., Jentzsch, A., Cyganiak, R.: Linking open data cloud diagram. URL: http://lod-cloud.net. Insight-Centre. 2017.
• Méndez, S. J. R., Haller, A., Omran, P.G., Wright, J., Taylor, K.: J2RM: An ontology-based JSON-to-RDF Mapping tool. ISWC (Demos/Industry) 2020.
• Omran, P. G., Taylor, K., Méndez, S. J. R., Haller, A.: Towards SHACL Learning from Knowledge Graphs. ISWC (Demos/Industry) 2020.
• Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. Joint Conference of the 47th Annual
Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, (ACL ‘09), 2009.
