This invited keynote at the Social Computing Track at WI-IAT21 gives an introduction to Knowledge Graphs and how they are built collaboratively by us. It also presents a brief analysis of the links in Wikidata.
3. Machine Learning/AI
ML/AI approaches are performing extremely well in dealing
with such massive amounts of data on tasks such as:
– Image Recognition
– Speech Recognition
– Product recommendations
– Question & Answering
– Spam filtering
… and for none of these applications do we need an
explanation of the learned facts.
4. Machine Learning/AI and its limitations
However, when it comes to:
– Self driving cars
– Medical diagnosis
– Drug design
– Robot interactions
– Military applications
– etc.
Humans need to understand the rationale of a decision.
– Facebook employs nearly 15,000 people to moderate posts
deemed inappropriate by ML/AI
5. eXplainable AI
XAI requires
• Encoding of context (Who, What, How,
When...)
• Encoding the semantics of inputs,
outputs and their properties
• Encoding of common sense knowledge
(e.g., one sits on a chair and eats on a
table)
6. Knowledge Graphs (KGs)
• Performance and explainability of ML
improves when data is given a context
– a Knowledge Graph increases the informative value
of the collected data that is given to the model
Knowledge Graphs [Paulheim 2017]
– describe real-world entities and their interrelations
– define possible classes and relations of entities in a
schema (ontology)
– allow for interrelating arbitrary entities with each
other
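As a minimal sketch of the three properties above, a KG can be held as a set of (subject, predicate, object) triples, with schema triples kept alongside the data. All entity and relation names below are made up for illustration and do not come from any real KG.

```python
from collections import defaultdict

# Illustrative triples (subject, predicate, object); every name is hypothetical.
SCHEMA = {
    ("Composer", "subClassOf", "Person"),
}
DATA = {
    ("Mozart", "type", "Composer"),
    ("Salieri", "type", "Composer"),
    ("Mozart", "knows", "Salieri"),
    ("Mozart", "bornIn", "Salzburg"),
}


def neighbours(triples, entity):
    """Return the outgoing edges of an entity as {predicate: sorted objects}."""
    out = defaultdict(list)
    for s, p, o in triples:
        if s == entity:
            out[p].append(o)
    return {p: sorted(objs) for p, objs in out.items()}


def classes_of(entity):
    """Collect the direct and inherited classes of an entity via the schema."""
    found = {o for s, p, o in DATA if s == entity and p == "type"}
    frontier = set(found)
    while frontier:
        # Follow subClassOf edges one level up, skipping classes already seen.
        frontier = {
            o for s, p, o in SCHEMA if p == "subClassOf" and s in frontier
        } - found
        found |= frontier
    return found
```

Because any two entities may be connected by a new triple, adding a relation needs no change to the existing data, only one more tuple in `DATA`.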
7. Knowledge Graphs (KGs)
• Knowledge graphs are (generally) created collaboratively by many
users
• Information can be added in a relatively arbitrary manner as
structural constraints are few
Closed KGs (~2019) [Noy et al., 2019]
Microsoft ~2bn entities, ~55bn facts
Google ~1bn entities, ~70bn assertions
Facebook ~50m entities, ~500m assertions
eBay ~1bn triples
IBM ~100m entities, 5bn relationships
Open KGs (April 2021)
DBpedia ~4.58m entities, ~9.25GB
Yago4 ~50m entities, ~18.4GB
Wikidata ~93m entities, ~99GB
8. Knowledge Graphs (KGs)
• Graphs: natural way of structuring and presenting knowledge
• Heterogeneous: knowledge from different sources can be integrated and/or interlinked
• Schema-later: schema often not decided until later, and does not impose integrity constraints
9. Schema in KGs
Ontologies as schemas in KGs
An ontology is an “explicit specification of a conceptualization consisting of a set of
objects, and the describable relationships among them”
[Gruber, 1993]
Components of an Ontology
• Classes: abstract groups (sets) of objects that are defined by properties that all its
members share (e.g., Person, Organisation, Event)
• Attributes: characteristics or parameters that objects (and classes) can have (e.g.,
date of birth, longitude, latitude, timestamp)
• Relationships: ways in which classes and individuals can be related to one another
(e.g., role, attributed to, observed by)
• Individuals: Concrete objects that are inherent to the domain of discourse, such as
specific people, organisations or abstract individuals such as numbers (e.g., g, π)
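The four components can be sketched as plain data structures. This is only an illustration under simplifying assumptions: the class, attribute and individual names are hypothetical, and real ontology languages such as OWL are far more expressive.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """A class declares which attributes and relationships its members may use."""
    name: str
    attributes: set = field(default_factory=set)      # e.g. "date of birth"
    relationships: set = field(default_factory=set)   # e.g. "role"


@dataclass
class Individual:
    """A concrete object that belongs to one class and carries values."""
    name: str
    cls: OntologyClass
    values: dict = field(default_factory=dict)

    def undeclared(self):
        """Return the keys this individual uses that its class does not declare."""
        allowed = self.cls.attributes | self.cls.relationships
        return sorted(k for k in self.values if k not in allowed)


# Hypothetical class and individual, for illustration only.
person = OntologyClass("Person", attributes={"date of birth"}, relationships={"role"})
example = Individual(
    "Example Person", person, {"date of birth": "1990-01-01", "shoe size": 38}
)
```

Here `example.undeclared()` reports `shoe size`, a value the schema does not describe; in a schema-later KG such a value would typically be accepted anyway and the schema extended afterwards.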
10. KG modelling detail
[Figure: KG modelling detail for Data and Schema. Scales: Limited (many entities) to Comprehensive (fewer entities); Generic (applies to many) to Specific (applies to few). Labels shown: Q35120 (Entity), P361 (partOf), P1872 (minimum no of players), Chess, Person, Q73145133, Q76 (Barack Obama, 3,947 axioms), Armin Haller (189 axioms), Q58043963.]
11. Types of Schemas (Ontologies)
Ordered by level of abstraction from most general (highest reusability) to most specific (lowest reusability) [Haller & Polleres, 2020a]:
• Upper Ontologies, e.g., Cyc, SUMO, DOLCE, BFO
• Mid-Level Ontologies, e.g., PROV-O, FOAF, ORG, SOSA/SSN, AGRIF
• Domain Ontologies, e.g., GO, ChEBI, DO, BTO
• Use-Case Ontologies
12. KG Engineering
[Figure: KG engineering cycle with the phases KG Creation, KG Linking, KG Curation and KG Usage. Activities shown: extract data from existing resources, add instance assertions, add schema assertions.]
14. KG Creation (cont’d)
Bottom-Up KG Creation
• Schema is not defined, and data is added organically and manually using tools such as:
– OntoWiki [Frischmuth et al., 2015]
– Semantic MediaWiki [Krötzsch et al., 2006]
– Wikibase
– Schímatos [Wright et al., 2020]
Top-Down KG Creation
• Schema is created upfront, existing data mapped to schema using languages/tools such as:
– R2RML
– SPARQL Generate [Lefrançois et al., 2017]
– SHACL Rules
– TARQL
– Metadata Extractor & Loader (MEL) [Méndez et al., 2021]
– JSON to RDF Mappings (J2RM) [Méndez et al., 2020]
Middle-Out KG Creation [Sure et al., 2004]
• Schema is partly defined upfront based on use cases, with mappings added later when data
defines semantics
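The top-down approach can be sketched in a few lines: the schema (here a column-to-property mapping) exists first, and existing tabular data is then mapped onto it. This is a hand-rolled illustration, not R2RML, TARQL or any of the tools listed above; the `example.org` IRIs, column names and rows are all placeholders.

```python
# Hypothetical schema defined upfront: column name -> property IRI.
EX = "http://example.org/"
MAPPING = {"name": EX + "name", "employer": EX + "worksFor"}
CLASS_IRI = EX + "Person"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def map_rows(rows, key="id"):
    """Turn tabular rows into (subject, predicate, object) triples using MAPPING."""
    triples = []
    for row in rows:
        subject = EX + "person/" + str(row[key])
        triples.append((subject, RDF_TYPE, CLASS_IRI))
        for column, prop in MAPPING.items():
            value = row.get(column)
            if value is not None:  # columns without a value yield no triple
                triples.append((subject, prop, value))
    return triples


# Illustrative input records standing in for an existing data source.
rows = [
    {"id": 1, "name": "Alice", "employer": "ACME"},
    {"id": 2, "name": "Bob"},
]
triples = map_rows(rows)
```

In a middle-out setting the same function would still apply, with entries added to `MAPPING` once the data shows which further properties are needed.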
15. Collaboratively building KGs
• Biggest KGs on the Web are built, collaboratively, bottom-up:
– Schema.org Ontology and KG
• Over 10 million sites use Schema.org to mark up their web pages and email messages
– Wikidata Ontology and KG
• Wikipedia for Data, 149GB
Comparison of schema.org and Wikidata
Availability
– schema.org: ontology highly available; data availability depending on publisher
– Wikidata: ontology highly available; data highly available
Discoverability
– schema.org: ontology → easy; instances → very difficult
– Wikidata: ontology → relatively difficult; instances → very easy
Completeness & Adaptability
– schema.org: domain specific (E-Commerce); community extensions available
– Wikidata: (all of) human knowledge
Maintenance & Versioning
– schema.org: continuous curation; versions are not made explicit
– Wikidata: continuous curation; explicit entity versions + version history
Modularization
– schema.org: fully distributed, easily accessible, ontology; fully distributed, difficult to access, data
– Wikidata: fully distributed, relatively difficult to access, ontology; fully distributed, easy to access, data
Quality
– schema.org: high quality ontology; low quality data
– Wikidata: high quality ontology; high quality data
17. KG Curation
Correctness
– Evaluation
Accessibility, Accuracy, Consistency, Conciseness, Trustability,
Dynamicity, Representationality [Zaveri et al., 2016]
– Correction
Evaluating data quality (SHACL, ShEx)
• Syntactic errors
• Semantic errors
Completeness
– KG Completion [Paulheim, 2017]
Using structural information observed in triples
• Classification
• Probabilistic and Statistical Methods
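A correctness check in the spirit of shape validation can be sketched as follows. This is a hand-rolled illustration and not the SHACL or ShEx language itself; the shape, property names and entities are hypothetical.

```python
# Hypothetical "shape": required properties and the Python type expected for each value.
PERSON_SHAPE = {"name": str, "birthYear": int}


def validate(entity, shape):
    """Return a list of human-readable violations for one entity (a dict)."""
    violations = []
    for prop, expected in shape.items():
        if prop not in entity:
            # A required value is absent: a completeness problem.
            violations.append(f"missing {prop}")
        elif not isinstance(entity[prop], expected):
            # The value has the wrong datatype: a syntactic error.
            violations.append(f"{prop} is not {expected.__name__}")
    return violations


# Illustrative entities: one with a wrongly typed value, one with a missing value.
wrong_type = {"name": "Alice", "birthYear": "1990"}
incomplete = {"birthYear": 1990}
```

Semantic errors (for example a birth year in the future) would need additional value constraints, which this sketch leaves out.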
18. KG Linking
Internal vs. External links [Haller et al., 2020b]
– internal links: links between parts of one coherent KG, i.e., edges linking
nodes within the graph
• Link prediction techniques are used to learn those new links
– external links: links between different KGs, i.e., edges between nodes from
different graphs, or reusing edges from a different graph to link nodes in one KG
Linking Issues [Haller et al., 2020b]
• References to many inaccessible URIs (i.e., broken links) may render
a KG largely useless
• Changes in linked external KGs are out of control of the KG publisher
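The internal/external distinction can be approximated by looking at namespaces. The sketch below assumes prefixed names, treats `dbr:` and `dbo:` as the KG's own namespaces, and does not count the W3C vocabularies (`rdf:`, `rdfs:`, `owl:`) as external; these choices are assumptions made for illustration, not the exact definition used in [Haller et al., 2020b].

```python
LOCAL = {"dbr", "dbo"}             # assumption: prefixes owned by the KG under study
NEUTRAL = {"rdf", "rdfs", "owl"}   # assumption: W3C vocabularies are not counted as external


def prefix(term):
    """Return the prefix of a prefixed name, or None for a literal."""
    if term.startswith('"'):
        return None
    return term.split(":", 1)[0]


def is_foreign(term):
    """True if the term comes from a namespace outside the KG."""
    p = prefix(term)
    return p is not None and p not in LOCAL | NEUTRAL


def is_external(triple):
    """External if the edge (predicate) or the target node (object) is foreign."""
    _, predicate, obj = triple
    return is_foreign(predicate) or is_foreign(obj)


# Illustrative triples in prefixed notation.
triples = [
    ("dbr:Wolfgang_Amadeus_Mozart", "dbo:birthPlace", "dbr:Salzburg"),
    ("dbr:Wolfgang_Amadeus_Mozart", "owl:sameAs", "wd:Q254"),
    ("dbr:Wolfgang_Amadeus_Mozart", "foaf:name", '"Wolfgang Amadeus Mozart"@en'),
]
external = [t for t in triples if is_external(t)]
internal = [t for t in triples if not is_external(t)]
```

Every triple in `external` depends on a resource the publisher does not control, which is where broken links and unannounced changes can enter.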
19. KG Linking
• Ontology links [Haller et al., 2020b]
– class link
t:[dbo:Person, rdfs:subClassOf, foaf:Person]
– instance typing link
t:[dbr:Wolfgang_Amadeus_Mozart, rdf:type, foaf:Person]
– property link
t:[dbr:Wolfgang_Amadeus_Mozart, foaf:name, "Wolfgang
Amadeus Mozart"@en]
– instance role link
t:[dbr:Wolfgang_Amadeus_Mozart, foaf:knows, wd:Q51088]
(Antonio Salieri)
• Instance link
t:[dbr:Wolfgang_Amadeus_Mozart, owl:sameAs, wd:Q254]
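The five examples above can be read as a simple decision rule. The sketch below assumes the triple is already known to be an external link and uses only the predicates that appear in the examples; the full characterization in [Haller et al., 2020b] may cover more cases.

```python
def link_type(triple):
    """Classify an external link by its predicate and the kind of object it has."""
    _, predicate, obj = triple
    if predicate == "rdfs:subClassOf":
        return "class link"
    if predicate == "rdf:type":
        return "instance typing link"
    if predicate == "owl:sameAs":
        return "instance link"
    if obj.startswith('"'):
        # An external property whose value is a literal.
        return "property link"
    # An external property that points to another entity.
    return "instance role link"


# The example triples from the slide, in prefixed notation.
examples = [
    ("dbo:Person", "rdfs:subClassOf", "foaf:Person"),
    ("dbr:Wolfgang_Amadeus_Mozart", "rdf:type", "foaf:Person"),
    ("dbr:Wolfgang_Amadeus_Mozart", "foaf:name", '"Wolfgang Amadeus Mozart"@en'),
    ("dbr:Wolfgang_Amadeus_Mozart", "foaf:knows", "wd:Q51088"),
    ("dbr:Wolfgang_Amadeus_Mozart", "owl:sameAs", "wd:Q254"),
]
labels = [link_type(t) for t in examples]
```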
20. KG Linking in Wikidata
• Wikidata is by far the largest openly available KG, built truly bottom-up in both
schema (ontology) and data
• Wikidata dump (in HDT) from 3rd of March 2021, 53GB (149GB
uncompressed).
General Statistics
# Triples (Facts) 1,693,668,039
# Subjects 1,625,057,179
# Predicates (edges) 38,867
# Unique objects 2,538,585,808
# Unique entities 89,120,227
# Unique Classes 2,522,595
# Unique Properties 74,309
Links
# Class Links 3,955 (0.001 per class)
# Property Links 835 (0.01 per property)
# Instance Typing Links 0
# Instance Links 173,177,045 (1.94 per entity)
• Exact Match (P2888) 3,268,021
• Said to be the Same (P460) 2
• Inverse Property (P1696) 0
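The per-class, per-property and per-entity ratios follow from dividing each link count by the corresponding total in the general statistics. The short check below assumes that 173,177,045 is the total number of instance links, which is consistent with the reported 1.94 per entity; the class ratio works out to roughly 0.0016, which the slide reports as 0.001.

```python
# Totals from the general statistics of the March 2021 dump.
totals = {"entities": 89_120_227, "classes": 2_522_595, "properties": 74_309}

# Link counts; the instance link total is an assumption about the slide layout.
links = {"class": 3_955, "property": 835, "instance": 173_177_045}

per_class = links["class"] / totals["classes"]
per_property = links["property"] / totals["properties"]
per_entity = links["instance"] / totals["entities"]

print(f"class links per class:       {per_class:.4f}")
print(f"property links per property: {per_property:.4f}")
print(f"instance links per entity:   {per_entity:.2f}")
```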
21. KG Linking in Wikidata (cont’d)
• Wikidata ontology includes links to other ontologies,
but relatively few class and property links
compared to other open KGs on the Web
• Wikidata defines an extensive ontology (schema)
that is used to define entities within its KG
• Wikidata links to other KGs, but uses relatively
fewer instance links than other KGs on the Web
– Does not (yet) include many similarity relations even
though it should not be the authoritative source for many
of its entities
22. KG Usage
• Knowledge Management, Knowledge
Discovery
• Training of ML models with KGs
• Conversational Agents
– Q&A
– Personal Assistants
– Chatbots
• Open Data
23. Conclusions
• Stronger focus on the KG contributors and end users is needed
– Tools/methods needed for creating/maintaining KGs
– Tools/methods needed to support querying/analysing KG Schemas
• KGs need to be more strongly interlinked, e.g., link prediction
techniques need to be deployed between KGs rather than just
on a single KG
• Improved NLP/NER-based learning techniques needed (distant
supervision) that build s-p-o relations from unstructured text [Mintz et
al., 2009]
• Permanent Distributed querying/replication of data/schema
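The core idea of distant supervision [Mintz et al., 2009] can be sketched as follows: any sentence that mentions both entities of a known KG fact is taken as a (noisy) training example for that relation. The facts and sentences below are made up, and plain substring matching stands in for a real NER step.

```python
# Hypothetical KG facts: (subject, object) -> relation.
KG_FACTS = {
    ("Mozart", "Salzburg"): "bornIn",
    ("Mozart", "Salieri"): "knows",
}


def label_sentences(sentences, facts):
    """Pair each sentence with every known fact whose two entities it mentions."""
    examples = []
    for sentence in sentences:
        for (subj, obj), relation in facts.items():
            # Substring matching is a stand-in for entity recognition.
            if subj in sentence and obj in sentence:
                examples.append((sentence, subj, relation, obj))
    return examples


# Illustrative unstructured text.
sentences = [
    "Mozart was born in Salzburg.",
    "Salieri lived in Vienna.",
]
training = label_sentences(sentences, KG_FACTS)
```

The resulting examples would then train a relation extractor that proposes new s-p-o triples from unseen text; the labels are noisy because a sentence can mention both entities without expressing the relation.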
24. References
• Hogan, A., et al.: Knowledge Graphs. ACM Computing Surveys (to appear), 2021.
• Noy, N., Gao, Y., Jain, A., Narayanan, A., Patterson, A., Taylor, J.: Industry-scale Knowledge Graphs: Lessons and Challenges. ACM Queue 17(2), 2019.
• Gruber, T.: A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5(2):199-220, 1993.
• Frischmuth, P., Martin, M., Tramp, S., Riechert, T., Auer, S.: OntoWiki – An Authoring, Publication and Visualization Interface for the Data Web. Semantic Web 6(3): 215-240, 2015.
• Krötzsch, M., Vrandečić, D., Völkel, M.: Semantic MediaWiki. The Semantic Web – ISWC 2006.
• Wright, J., Méndez, S. J. R., Haller, A., Taylor, K., Omran, P. G.: Schímatos: a SHACL-based Web-Form Generator for Knowledge Graph Editing. The Semantic Web – ISWC 2020.
• Lefrançois, M., Zimmermann, A., Bakerally, N.: A SPARQL Extension for Generating RDF from Heterogeneous Formats. ESWC (1), 2017.
• Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., Auer, S.: Quality assessment for linked data: A survey. Semantic Web 7 (1), 63-93, 2016.
• Paulheim, H.: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8(3): 489-508, 2017.
• Berners-Lee, T.: Linked Data. W3C Design Issues. URL: http://www.w3.org/DesignIssues/LinkedData.html, 2006.
• Haller, A., Polleres, A.: Are we better off with just one ontology on the Web? Semantic Web 11(1): 87-99, 2020a.
• Sure, Y., Staab, S., Studer, R.: On-To-Knowledge Methodology (OTKM). Handbook on Ontologies, pp. 117-132, 2004.
• Haller, A., Fernández, J. D., Kamdar, M. R., Polleres, A.: What Are Links in Linked Open Data? A Characterization and Evaluation of Links between Knowledge Graphs on the Web. ACM J. Data Inf. Qual. 12(2): 9:1-9:34, 2020b.
• Abele, A., McCrae, J. P., Buitelaar, P., Jentzsch, A., Cyganiak, R.: Linking open data cloud diagram. URL: http://lod-cloud.net. Insight-Centre. 2017.
• Méndez, S. J. R., Haller, A., Omran, P.G., Wright, J., Taylor, K.: J2RM: An ontology-based JSON-to-RDF Mapping tool. ISWC (Demos/Industry) 2020.
• Méndez, S. J. R., Haller, A., Omran, P.G., Taylor, K.: MEL: Metadata Extractor & Loader. ISWC (Posters/Demos/Industry) 2021.
• Omran, P. G., Taylor, K., Méndez, S. J. R., Haller, A.: Towards SHACL Learning from Knowledge Graphs. ISWC (Demos/Industry) 2020.
• Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL '09), 2009.