Discover Hidden Connections with Knowledge Graph Linking

KNOWLEDGE GRAPH EXPANSION
AND ENRICHMENT
LRI, UNIVERSITÉ PARIS SUD, CNRS, UNIVERSITÉ PARIS
SACLAY
TUTORIAL@BDA 2017
FATIHA SAÏS
1

OUTLINE
§ Linked Open Data
§ Knowledge Graphs
§ Technical Part
1. Data Linking
2. Key Discovery
3. Link invalidation/Requalification
§ Conclusion and Future Challenges
2

FROM THE WWW TO THE
WEB OF DATA
- applying the principles of the WWW to data
data is links,
not only properties
Linked Open Data
3

LINKED DATA
PRINCIPLES
① Use HTTP URIs as identifiers for resources
à so people can look up the data
② Provide data at the location of URIs
à to provide data for interested parties
③ Include links to other resources
àso people can discover more things
àbridging disciplines and domains
è Unlock the potential of island repositories
Tim Berners Lee, 2006
4

RDF – RESOURCE
DESCRIPTION FRAMEWORK
• Statements of < subject predicate object >
http://dbpedia.org/resource/Airbus
dbo:label
“Airbus”
Subject Predicate Object
… is called a triple
5

LINKED OPEN DATA
6
Linked Data - Datasets
under an open access
- 1,139 datasets
- over 100B triples
- about 500M links
- several domains
Ex. DBPedia : 1.5 B
triples
"Linking Open Data cloud diagram 2017, by Andrejs Abele, John
P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak.
http://lod-cloud.net/"

THE ROLE OF KNOWLEDGE IN AI
8
[Artificial Intelligence 47 (1991)]
The knowledge principle: “if a program is
to perform a complex task well, it must
know a great deal about the world in
which it operates.”

ONTOLOGY, A DEFINITION
9
“An ontology is an explicit, formal specification of a shared
conceptualization.”
[Thomas R. Gruber, 1993]
Conceptualization: abstract model of domain related expressions
Specification: domain related
Explicit: semantics of all expressions is clear
Formal: machine-readable
Shared: consensus (different people have different perceptions)

SEMANTIC WEB:
ONTOLOGIES
10
RDFS – Resource Description
Framework Schema
• Lightweight ontologies
OWL – Web Ontology Language
• Expressive ontologies
Source: https://it.wikipedia.org/wiki/File:W3C-
Semantic_Web_layerCake.png

1111
OWL ONTOLOGY
OWL – Web Ontology Language
• Represents rich and complex
knowledge about things
• Based on First Order Logic (FOL)
• Can be used to verify the consistency
of knowledge
• Can make implicit knowledge explicit
• Classes: concepts or collections of
objects (individuals)
• Properties:
• owl:DataTypeProperty (attribute)
• owl:ObjectProperty (relation)
• Hierarchy
• owl:subClassOf
• owl:subPropertyOf
• Individuals: ground-level of the
ontology (instances)

ONTOLOGY LEVELS
12
:type :type :type
Conceptual level:
- classes, properties
(relations)
Instance level:
- facts (individuals)

1313
• Axioms: knowledge definitions in the ontology that were explicitly defined and have
not been proven true.
• Reasoning over an ontology
→ Implicit knowledge can be made explicit by logical reasoning
• Example:
Pompidou museum is an Art Museum
< Pompidou_museum rdf:type ArtMuseum> .
Pompidou museum contains Hallucination partielle
< Pompidou_museum ao:contains Hallucination_partielle> .
• Infer that:
è Pompidou museum is a CulturalPlace
< Pompidou_museum rdf:type CulturalPlace> .
Because: Museum subsumes ArtMuseum and CulturalPlace subsumes Museum
è Hallucination partielle is a Work
<Hallucination_partielle rdf:type ao:Work> .
Because: the range of the object property contains is the class Work.
OWL ONTOLOGY - AXIOMS

LINKED OPEN
VOCABULARIES (LOV)
Linked Open Vocabularies
• Keeps track of available open
ontologies and provides them
as a graph
• Search for available
ontologies, open for reuse
• Example:
http://lov.okfn.org/dataset/lov/vo
cabs/foaf
14

ONTOLOGY PREDICATES SPREAD
ON THE SEMANTIC WEB
• RDFa (or Resource Description Framework in Attributes)
• Top 50 web sites publishing Semantic Web data, clustered by
predicates used.
FOAF

OUTLINE
§ Linked Open Data
§ Knowledge Graphs
§ Technical Part
§ Data Linking
§ Key Discovery
§ Link invalidation/requalification
16

WHO IS DEVELOPING
KNOWLEDGE GRAPHS?
17
2012
2013
2015 2016
2013
2007
2008
2007
2012
Academic side Commercial side

WEB SEARCH WITHOUT
KNOWLEDGE GRAPHS
18

WEB SEARCH WITH
KNOWLEDGE GRAPHS
19

QUESTION ANSWERING WITH
KNOWLEDGE GRAPHS
20

CONNECTING EVENTS AND PEOPLE
WITH KNOWLEDGE GRAPHS
21

TOWARDS A KNOWLEDGE-
POWERED DIGITAL ASSISTANT
22
Cortana (Microsoft)
• Natural access and storage of knowledge
• Chat bots
• Personalization
• Emotion

KNOWLEDGE GRAPH: A
DEFINITION …
23
Wikipedia (en)
This is not a formal definition!

KNOWLEDGE GRAPH: A
DEFINITION …
24
[L. Ehrlinger and W. Wöß, SEMANTiCS’2016]
[16] H. Paulheim. Knowledge Graph Refinement: A Survey of
Approaches and Evaluation Methods. Semantic Web Journal,
(Preprint):1–20, 2016.
[12] M. Kroetsch and G. Weikum. Journal of Web Semantics: Special
Issue on Knowledge Graphs.
http://www.websemanticsjournal.org/index.php/ps/
announcement/view/19 [August, 2016].
[17] J. Pujara, H. Miao, L. Getoor, and W. Cohen. Knowledge Graph
Identification. In Proceedings of the 12th International Semantic Web
Conference - Part I, ISWC ’13, pages 542–557, New York, USA,
2013. Springer.
[3] A. Blumauer. From Taxonomies over Ontologies to Knowledge
Graphs, July 2014. https://blog.semanticweb.at/2014/07/15/from-
taxonomies-over-ontologiesto-knowledge-graphs [August, 2016].
[7] M. Farber, B. Ell, C. Menne, A. Rettinger, and F.
Bartscherer. Linked Data Quality of DBpedia, Freebase,
OpenCyc, Wikidata, and YAGO. Semantic Web Journal,
2016. http://www.semantic-web-journal. net/content/linked-
data-quality-dbpedia-freebaseopencyc-wikidata-and-yago
[August, 2016] (revised version, under review).
à Populated
Ontology
à Populated
Ontology
à RDF Graph
à RDF Graph
à Extracted
RDF Graph

KNOWLEDGE GRAPH (KG)
dbo:Painting
dbo:Museum
dbo:museum-of
dbo:Artist
dbo:author
dbo:city
dbo:PopulatedPlace
Thing
dbo:Person
dbo:Agent
owl:equivalentClass(dbo:Municipality, dbo:Place)
owl:equivalentClass(dbo:Place, dbo:Wikidata:Q532)
owl:equivalentClass(dbo:Village, dbo:PopulatedPlace)
owl:equivalentClass(dbo:PopulatedPlace,dbo:Municipality)
owl:disjointClass(dbo:PopulatedPlace, dbo:Artist)
owl:disjointClass(dbo:PopulatedPlace, dbo:Painting)
owl:FunctionalProperty(dbo:city)
owl:InverseFunctionalProperty(dbo:museum-of)
dbo:birthPlace(X, Y) => dbo:citizsenOf(X, Y)
dbo:parentOf(X, Y) => dbo:child(Y, X)
is-a is-a
is-a
is-a
is-a
Ontology hierarchy
Ontology axioms and rules
PREFIX dbo: <http://dbpedia.org/ontology#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?m ?p
WHERE { ?m rdf:type dbo:Museum . ?m dbo:musuem-of ?p .}
Querying (SPARQL)
- KG saturation: infer whatever can be inferred from the
KG.
- KG consistency checking: no contradictions
- KG repairing
- …
Reasoners: (Pellet, Fact++, Hermit, etc. )
RDF Graphs
25

KNOWLEDGE GRAPH EXPANSION
AND ENRICHMENT
• Expansion: knowledge graphs are incomplete
• Data linking (entity resolution, duplicate detection, reference
reconciliation)
• Link prediction: add relations
• Ontology matching: connect graphs
• Missing values prediction/inference
• Validation: knowledge graphs may contain errors
• Link invalidation
• Dealing with errors and ambiguities
• Enrichment: can new knowledge emerge from knowledge graphs?
• Knowledge discovery
• Automatic reasoning and planning
26

OUTLINE
§ Linked Open Data
§ Knowledge Graphs
§ Technical Part
1. Data Linking
2. Key Discovery
3. Link invalidation/requalification
27

DATA LINKING
29
• Data linking or Identity link detection consists in detecting whether two descriptions of
resources refer to the same real world entity (e.g. same person, same article, same gene).

30
DATA LINKING

DATA LINKING: DIFFICULTIES
31
Different
Vocabularies
Misspelling errorsIncomplete Information :
- date and place of birth ?
- museum phone number ?
- …. ?

IDENTITY LINK DETECTION
PROBLEM
• Identity link detection consists in detecting whether two descriptions of
resources refer to the same real world entity (e.g. same person, same article,
same gene).
• Definition (Link Discovery)
• Given two sets U1 and U2 of resources
• Find a partition of U1 x U2 such that :
• S = {(u1,u2) ∈ u1 × u2: owl:sameAs(s,t)} and
• D = {(u1,u2) ∈ u1 × u2: owl:differentFrom(s,t)}
• A method is said total when (S È D) = (U1 x U2)
• A method is said partial when (S È D) Ì (U1 x U2 )
• Naïve complexity ∈ O(U1 × U2), i.e. O(n2)
32

Problem which exists since the data exists … and under different
terminologies: record linkage, entity resolution, data cleaning,
object coreference, duplicate detection, ….
Record linkage: used to indicate the
bringing together of two or more separately
recorded pieces of information concerning
a particular individual or family.
[NKAJ, Science 1959]
SOME OF HISTORY …
33

ASIDE: DETECTING IDENTITY
LINKS
34
Lise Getoor, VLDB’12 tutorial
Ironically “Identity link detection” has many duplicates
Entity resolution

DATA LINKING IS MORE COMPLEX
FOR GRAPHS THAN TABLES (WHY?)
Databases Semantic Web
Schema/Ontologies Same schema Possibly different ontologies in the same
dataset
Multiple types Single relation Several classes
Open World
Assumption
NO YES
UNA-Unique Name
Assumption
Yes May be no
Data volume XX Thousands XX Millions/Billions
(e.g., DBpedia has 1.5 billion triples)
Multiple values for a
property
NO YES
P1 hasAuthor “Michel Chein”
P1 hasAuthor “Marie-Christine Rousset”
35
• Can propagate similarity decisions è more expensive but better performance
• Can be generic and use domain knowledge, e.g. ontology axioms

DATA LINKING APPROACHES:
DIFFERENT CONTEXTS
• Datasets conforming to the same ontology
• Datasets conforming to different ontologies
• Datasets without ontologies
36

DATA LINKING APPROACHES
• Instance-based approaches: consider only data type properties
(attributes)
• Graph-based approaches: consider data type properties
(attributes) as well as object properties (relations) to propagate
similarity scores/linking decisions (collective data linking)
• Supervised approaches: need an expert to build samples of
linked data to train models (manual and interactive approaches)
• Rule-based approaches: need knowledge to be declared in the
ontology or in other format given by an expert
37

• Instance-based approaches: consider only data type properties
(attributes)
• String comparison
38
m2
architect
address
“Saadiyat Island, Abu
Dhabi”
Jean_Nouvel
“Nov. 8th 2017”
S2
architect
architect
m1
address
Jacques_Lemercier
category
Art_Antiquity
Art_Museum
category
S1
created
Luis_le_Vau
architect
Ange_Jacues_Gabriel
created
“1793”
“99 rue de Rivoli, Paris”
=?
=?
“Louvre Abou Dabi”
“Musée du Louvre”
=?

• Graph-based approaches:
• consider data type properties (attributes) as well as
• object properties (relations) to propagate similarity scores/linking decisions
(collective data linking)
39
m2
architect
address
“Saadiyat Island, Abu
Dhabi”
Jean_Nouvel
“Nov. 8th 2017”
architect
architect
m1
address
Jacques_Lemercier
category
Art_Antiquity
Art_Museum
category
created
Luis_le_Vau
architect
Ange_Jacues_Gabriel
created
“1793”
“99 rue de Rivoli, Paris”
=?
=?
“Louvre Abou Dabi”
“Musée du Louvre”
=?
=?
=?
=?
=?
S2
S1

• Supervised approaches: need an expert to build samples of identity
links to train models (manual and interactive approaches)
40
Examples
of identity
links
S1S2
Learning of parameters,
similarity functions,
thresholds, …
Identity link detection Identity links

• Rule-based approaches: need knowledge to be declared in
the ontology or in other format given by an expert
• homepage(w1, y) ∧ homepage(w2, y) è sameAs(w1, w2)
• sameAs(Restaurant11, Restaurant21)
41
… homepage
Restaurant11 www.kitchenbar.com
Restaurant12 www.jardin.fr
Restaurant13 www.gladys.fr
Restaurant14 …
homepage …
www.kitchenbar.com Restaurant21
www.jardin.fr Restaurant22
www.gladys.fr Restaurant23
… Restaurant24
SameAS
SameAS
SameAS

DATA LINKING APPROACHES:
EVALUATION
• Effectiveness: evaluation of linking results in terms of recall and
precision
• Recall = (#correct-links-sys) /(#correct-links-groundtruth)
• Precision = (#correct-links-sys) /(#links-sys)
• F-measure (F1) = (2 x Recall x Precision) / (Recall +Precision)
• Efficiency: in terms of time and space (i.e. minimize the linking
search space and the interaction actions with an expert/user).
• Robustness: override errors/mistakes in the data
• Use of benchmarks, like those of OAEI (Ontology Alignment
Evaluation Initiative) or Lance
42

SIMILARITY MEASURES
• Token based (e.g. Jaccard, TF/IDF cosinus) :
The similarity depends on the set of tokens that appear in both S and T.
è Efficient, but sensitive to spelling errors
• Edit based (e.g. Levenstein, Jaro, Jaro-Winkler) :
The similarity depends on the smallest sequence of edit operations which
transform S into T.
è Less efficient, may deal with spelling errors, but sensitive to word order
• Hybrids (e.g. N-Grams, Jaro-Winkler/TF-IDF, Soundex)
43
For more details: William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. 2003. A comparison of string
distance metrics for name-matching tasks. In Proceedings of the 2003 International Conference on Information
Integration on the Web (IIWEB'03), Subbarao Kambhampati and Craig A. Knoblock (Eds.). AAAI Press 73-78.

INSTANCE-BASED
DATA LINKING
APPROACHES
44

FRAMEWORK SILK [Volz et al'09]
• Provides a Link Specification Language(LSL)
• Allows specifying linking conditions between two datasets
• The linking conditions may be expressed in terms of:
• Elementary similarity measures (e.g., Jaccard, Jaro) and
• Aggregation functions (e.g. max, average) of the similarity scores
45

SIMILARITY MEASURES IN
SILK [Volz et al'09]
46

EXAMPLE OF LSL
SPECIFICATION [Volz et al'09]
<Silk>
<Prefixes>
<Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
<Prefix id="dbpedia" namespace="http://dbpedia.org/ontology/" />
<Prefix id="gn" namespace="http://www.geonames.org/ontology#" />
</Prefixes>
<DataSources>
<DataSource id="dbpedia">
<Param name="endpointURI" value="http://demo_sparql_server1/sparql" />
<Param name="graph" value="http://dbpedia.org" />
</DataSource>
<DataSource id="geonames">
<Param name="endpointURI" value="http://demo_sparql_server2/sparql" />
<Param name="graph" value="http://sws.geonames.org/" />
</DataSource>
</DataSources>
47
SPARQL
endpoints
Prefixes

EXAMPLE OF LSL
<Interlinks>
<Interlink id="cities">
<LinkType>owl:sameAs</LinkType>
<SourceDataset dataSource="dbpedia" var="a">
<RestrictTo>
?a rdf:type dbpedia:City
</RestrictTo>
</SourceDataset>
<TargetDataset dataSource="geonames" var="b">
<RestrictTo>
?b rdf:type gn:P
</RestrictTo>
</TargetDataset>
48
Entités à
lier
Link types
Entities to
be linked

EXAMPLE OF LSL
<LinkageRule>
<Aggregate type="average">
<Compare metric="levenshteinDistance" threshold="1">
<Input path="?a/rdfs:label" />
<Input path="?b/gn:name" />
</Compare>
<Compare metric="num" threshold="1000" >
<Input path="?a/dbpedia:populationTotal" />
<Input path="?b/gn:population" />
</Compare>
</Aggregate>
</LinkageRule>
<Filter limit="1" />
49
Mesures de
similarité
Similarity
measures
Aggregation
function

EXAMPLE OF LSL
<Outputs>
<Output type="file" minConfidence="0.95">
<Param name="file" value="accepted_links.nt" />
<Param name="format" value="ntriples" />
</Output>
<Output type="file" maxConfidence="0.95">
<Param name="file" value="verify_links.nt" />
<Param name="format" value="alignment" />
</Output>
</Outputs>
</Interlink>
</Interlinks>
</Silk>
50
Possible links
Linking
threshold

KNOFUSS (INSTANCE-BASED,
UNSUPERVISED) [Nikolov et al’12]
• Learns linking rules using genetic algorithms:
Sim(i1, i2) = fag(w11sim11 (V11,V21), …wmnsimmn (V1m,V2n))
• Fag : aggregation function for the similarity scores
• simij: similarity measure between values V1i and V2j
• wij: weights in [0..1]
• Assumptions:
• Unique name assumption (UNA), i.e., two different URIs refer to two
different entities.
• Good coverage rate between the two datasets
• Normalized similarity scores in [0..1]
51

KNOFUSS (INSTANCE-BASED,
UNSUPERVISED) [Nikolov et al’12]
52
Examples of linking rules learned on the OAEI’10 benchmark
Results in term of F-Measure on OAEI’10

GRAPH-BASED AND
INTERACTIVE APPROACH
53
KANG ET AL.: INTERACTIVE ENTITY RESOLUTION IN RELATIONAL DATA: A VISUAL ANALYTIC TOOL AND ITS EVALUATION 1001
Fig. 1. D-Dupe consists of three coordinated windows: the potential duplicate viewer on the left, the relational context viewer on the upper right
corner, and the data detail viewer on the lower right corner. The potential duplicate viewer shows the list of potential duplicate author pairs that were
D-Dupe
[Kang et al’08]

LN2R: A LOGICAL AND
NUMERICAL METHOD FOR
REFERENCE RECONCILIATON
54
[Saïs et al’07, Saïs et al’09]

LN2R
(GRAPH BASED, UNSUPERVISED AND INFORMED)
• A combination of two methods:
• L2R, a Logical method for reference reconciliation: applies
logical rules to infer sure owl:sameAs and
owl:differentFrom links
• N2R, a Numerical method for reference reconciliation:
computes similarity scores for each pair of references
• Assumptions
• The datasets are conforming to the same ontology
• The ontology contains axioms
55

Ontology axioms
• Disjunction axioms between classes, DISJOINT(C, D)
• Functional properties axioms, PF(P)
• Inverse functional properties axioms, PFI(P)
• A set of properties that is functional or inverse functional axioms
Assumptions on the data
• Unique Name Assumption, UNA(src1)
• Local Unique Name Assumption, LUNA(R)
Example:
56
Authored(p, a1), Authored(p, a2), Authored(p, a3) …., Authored(p, an)
è (a1 ¹a2), (a1 ¹ a3), (a2 ¹ a3) , …
LN2R
(GRAPH BASED, UNSUPERVISED AND INFORMED)

N2R: A NUMERICAL METHOD FOR
REFERENCE RECONCILIATION
57
[Saïs et al’09]

N2R: A NUMERICAL METHOD FOR
REFERENCE RECONCILIATION
• N2R computes a similarity score for pair of references obtained from
their common description.
• Uses known similarity measures, e.g. Jaccard, Jaro-Winkler.
• Exploits ontology knowledge in a way to be coherent with L2R.
• May consider the results of L2R: Reconcile(i, i’), ¬Reconcile(i, i’) ,
SynVals(v, v’) and ¬SynVals(v, v’).
58
[Saïs et al’09]

SIMILARITY DEPENDENCY
MODELLING
59
MuseumName+(m1) = {”Le Louvre”},
MuseumName+(m’1) = {”Louvre”},
Located+(m1) = {c1}, Located+(m’1) = {c’1},
Located-(c1) = {m1} , Located-(c’1) = {m’1}, ….
m1, m’1 c1, c’1
p1, p’1
“Le Louvre”,
“Louvre”
“Paris”,
“La ville de Paris”
“La Joconde”,
“l’Européenne”
RDF facts in source S1:
Located(m1, c1), MuseumName(m1, “le Louvre”)
Contains(m1, p1), CityName(c1, “Paris”)
PaintingName(p1, “la Joconde”)
RDF facts in source S2 :
Located(m’1, c’1), MuseumName(m’1, “Louvre”)
Contains(m’1, p’1), CityName(c’1, “la Ville de Paris”)
PaintingName(p’1, “l’Europèenne”)
(c1, c’1) is functionally dependent on (m1, m’1)
è Equation system
x1 x2
x3
b11 b21
b31
1
1/3
½1
1
1
1
CAttr(m1, m’1) = {MuseumName} ,
CAttr(c1, c’1)= {CityName},CAttr(p1,p’1)={PaintingName}
CRel(m1, m’1)= {Located, Contains}
CRel(c1, c’1) = {Located }, CRel(p1,p’1) = {Contains}
[Saïs et al’09]

N2R: ILLUSTRATION
60
m1, m’1 c1, c’1
p1, p’1
“Le Louvre”,
“Louvre”
“Paris”,
“La ville de Paris”
“La Joconde”,
“l’Européenne”
x1 x2
x3
b11
p1, p’2
“La Joconde”,
“Joconde”
x4
b41
b21
b31
l= 1/(| CAttr | + | CRel |) e = 0.02
b11 = 0.8, b21 = 0.3, b31 = 0.1, b41 = 0.7
x1 x2 x3 x4
Initialization 0.0 0.0 0.0 0.0
Iteration 1 0.8 0.3 0.1 0.7
Iteration 2 0.8 0.8 0.4 0.7
Iteration 3 0.8 0.8 0.4 0.7
Solution: x1 = 0.8
x2 = 0.8
x3 = 0.4
x4 = 0.7
x1 = max(max(b11, x3), x4), l * x2)
x2 = max(b21, x1)
x3 = max(b31, l* x1)
x4 = max(b41 , l * x1)
[Saïs et al’09]

N2R: RESULTS ON CORA
Trec=1, all the reconciliations obtained by L2R are also obtained by N2R.
Trec=1 to Trec=0.85, the recall increases of 33 % while the precision decreases
only of 6 %.
Trec = 0.85, the F-measure is of 88 %:
• Better than the results obtained by the supervised method of [Singla and Domingos’05]
• Worst than those (97 %) obtained by the supervised method of [Dong et al.’05]
62
0
0,1
0,2
0,3
0,4
0,5
0,6
0,7
0,8
0,9
1
1 0,95 0,9 0,85 0,8 0,75 0,7 0,65 0,6 0,55
Trec
Rappel
Precision
F-Mesure
[Saïs et al’09]

OAEI 2010 – Instance Matching track (PR), 2nd
N2R: RESULTS IN OAEI2 2010
63
2 Ontology Alignment Evaluation Initiative
[Saïs et al’09]

INSTANCE BASED
ONTOLOGY
MATCHING
64

ONTOLOGY MATCHING
• Ontology alignment [Shvaiko,Euzenat13]: computes a set
A of mappings between elements (classes, properties) of two
ontologies O1 and O2:
f(O1,O2)=A
• The relations that are used to express a mapping can be:
owl:equivalentClass, owl:equivalentProperty,
rdfs:subClassOf, skos:closeMatch, skos:broader, etc.
• Example: A={(
owl:equivalentClass( http://dbpedia.org/ontology/City,
http://schema.org/City, 0.8)}
65

KINDS OF
INFORMATION
• Terminology: lexical information describing the ontology
elements (i.e. labels, comments, …)
• Example: Voie vs Voie souterraine
• Structure: hierarchy of classes and properties
(relations/attributes)
• Example: the sub-classes of Voie are very similar to the
sub-classes of Route
• Extension: the existence of common instances!!
66

PARIS [Suchanek et al. 12]
• Objective: instance-based ontology alignment and data linking (graph-
based, unsupervised and probabilistic)
• Inputs: two populated RDFS ontologies with UNA (two different URI
refer to two different entities)
• Principle:
• Compute the similarities between literal values (“12 cm”=“12”)
• Iterate (1) and (2) until a fix-point :
① Compute the probability that two instances are linked
① Compute the probabilities of subClassOf/subPropertyOf
67
P(Ci ⊆ Cj ), P(Pi ⊆ Pj )
P(i1 = i2)

• Property functionality degree (computed from data)
• The more a property is functional the more the probability of X=Y
will be.
• Local functionality: Fun(p,x) = 1 / #y:p(x, y)
• Global functionality: Fun(p) = (#x : ∃y:p(x,y)) / (#x,y : p(x,y))
• Example:
city(m1,Londres), city(m1,Orsay), city(m2,Tokyo)
Fun(city,m1)= ½ Fun(city,m2)=1
Fun(city)=2/3
è The same is done for inverse functionality (denoted fun-1) 68

Link probability computation
• Positive evidence (P1): if there exists a property that is highly inverse
functional which has range values that are equal with a high probability
isbn(x,isbn1), isbn(x’,isbn2), P(isbn1=isbn2) = 1, fun-1(isbn)=1 …
P1(x=x’) = 1 - ((1 - (1.1)) . …) = 1-(0. …) = 1
• Negative evidence (P2): if there exists a property that is highly
functional which has range values for the probability to be equal is very low.
• Combination : P(x=x’) = P1(x=x’).P2(x=x’)
69
P1(x = x') =1− (1− Fun−1
r(x,y)
r(x',y')
∏ (p).P(y = y'))

• The probabilities of the existence of a subsumption mapping between
properties and between classes are also computed
• It is based on the proportion of common instances comparing to the
number of instances of the general class
• To compute these probabilities, the probabilities of the existence of a
sameAs link between instances are exploited.
70
P(C ⊆ C') = #(C∩C') /#C
P(p ⊆ p') = #(p∩ p') /# p

PARIS - EXPERIMENTS
Ontology #Instances #Classes #Relations
Yago 2 795 289 292 206 67
Dbpedia 2 365 777 318 1 109
71
Instances Classes Relations
Précision Rappel F-Mesure Yago Í DBp
Précision
DBpÍ Yago
Précision
Yago Í DBp
Précision
DBpÍ Yago
Précision
90% 73% 81% - - 100% 92%
90% 73% 81% 94% 84% 100% 92%
Instances: DBPedia and Yago uses the URIs of Wikipedia (recall and precision
possible)
Classes/properties: sampling + expert
5h00 to compute the linking probabilities for instances in one iteration (2h for the
classes and 20 minutes for the properties)
Linking or mapping if the probability >0.4
[Suchanek et al. 12]

SUMMARY
Data linking: numerous and different approaches …
• Supervised approaches: needs samples of linked data
à It can be avoided by using assumptions like (UNA)
• Graph-based approaches: decision propagation (good recall but highly time
consuming)
• Logical approaches: good precision but partial
è Few approaches generate differentFrom(i1,i2) or use dissimilarity evidence
• Informed approaches: need knowledge to be declared in the ontology
(generality) and/or ad-hoc knowledge given by an expert (a selection of
properties, similarity functions)
àThis kind of knowledge are not always available but can be
learnt/discovered from the data (e.g., key/rule discovery approaches)
[Symeonidou et al. 14, Symeonidou et al. 17, Galarraga et al. 13]
72

OUTLINE
§ Linked Open Data
§ Knowledge Graphs
§ Technical Part
1. Data Linking
2. Key Discovery
3. Link invalidation/requalification
73

KEY DISCOVERY
PHD OF DANAI SYMEONIDOU IN COLLABORATION WITH:
NATHALIE PERNELLE (LRI, UNIV. PARIS SUD),
74
[Qualinca ANR project (2012-2016)]

DATA LINKING USING
RULES
homepage(w1, y) ∧ homepage(w2, y) è sameAs(w1, w2)
75
… homepage
Restaurant14 …
homepage …
… Restaurant24

DATA LINKING USING
RULES
76
… homepage
Restaurant14 …
homepage …
… Restaurant24
SameAS
SameAS
SameAS

DATA LINKING USING
RULES
Rules
• Logical Rules
• Ex. For instances of the class Restaurant
{homepage}: discriminative property
• Complex Rules
• Ex. For instances of the class Restaurant
max(Jaccard(lat(w1, n);lat(w2, m));jaro(long(w1, x);long(w2, y)) > 0.8
èsameAs(w1, w2)
{lat, long}: discriminative property set
77
Rules contain discriminative properties => keys

KEYS
Key: A set of properties that uniquely identifies every instance in the data
78
FirstName LastName Birthdate Profession
Person1 Anne Tompson 15/02/88 Actor
Person2 Marie Tompson 02/09/75 Researcher
Person3 Marie David 15/02/85 Teacher
Person4 Vincent Solgar 06/12/90 Teacher
Is [FirstName] a key? ✖
Is [LastName] a key? ✖

KEYS
Key: A set of properties that uniquely identifies every instance in the data
79
Is [FirstName] a key? ✖
Is [LastName] a key? ✖
Is [FirstName,LastName] a key? ✔

KEYS - KEY
MONOTONICITY
Key monotonicity: When a set of properties is a key, all its supersets
are also keys
Minimal Key: A key that by removing one property stops being a key
80
✔Is [FirstName,LastName, Birthday] a key?
✖Is [FirstName,LastName, Birthday] a minimal key?
Minimal keys: [[FirstName,LastName],[FirstName,Profession] [Birthdate], [LastName,Profession]]

KEYS DECLARED BY EXPERTS
FOR DATA LINKING
Not an easy task:
• Experts are not aware of all the keys
Ex. {SSN}, {ISBN} easy to declare
Ex. {Name, DateOfBirth, BornIn} is it a key for the class Person?
• Erroneous keys can be given by experts
• As many keys as possible
• More keys => More linking rules
Goal: Discover keys automatically
81

OWL2 KEY, KEY IN
THE OPEN WORLD
OWL2 Key for a class: a combination of properties that uniquely identify
each instance of a class
• hasKey( CE ( OPE1 ... OPEm ) ( DPE1 ... DPEn ) )
:hasKey(Book(Author) (Title)) means:
Book(x1)∧Book(x2)∧Author(x1, y)∧Author (x2, y)∧Title(x1,w) ∧Title(x2, w)
è sameAs(x1, x2)
82

KEYS IN THE OPEN
WORLD
Key in the Open World
• Key discovery: For a set of properties that is a key, there is no instance that
shares values with another instances
83
lastName firstName hasFriend
Person1 Tompson Manuel Person2,
Person3
Person2 Tompson Maria,
Hellen
Person1
Person3 David George Person2,
Person4
Person4 Solgar Michel Person3

KEYS IN THE OPEN
WORLD
• Ex. firstName is a key in the Open World since
• There do exist any people that share first names
84
Person3
Hellen
Person1
Person4

KEYS IN THE OPEN
WORLD
• Ex. hasFriend is not a key in the Open World since
• There exist at least two people that share a friend
85
Person3
Hellen
Person1
Person4

KEYS IN THE CLOSED
WORLD
Key in the Closed World
• For each property that participate in a key, every set of values for a property
and an instance should be unique
86
Person3
Person2 Tompson Maria Person1
Person4

KEYS IN THE CLOSED
WORLD
Key in the Closed World
• For each property that participate in a key, every set of values for a property
and an instance should be unique
• Ex. hasFriend is a key in the Closed World since
{Person2,Person3} <> {Person1} <> {Person2,Person4} <> {Person3}
87
Person3
Person2 Tompson Maria Person1
Person4

CLOSED WORLD APPROACHES
VS
OPEN WORLD APPROACHES
88

KEY DISCOVERY
APPROACHES
CWA approaches
• Keys and Pseudo-Keys Detection for Web Datasets Cleaning
and Interlinking
• ROCKER: A Refinement Operator for Key Discovery
OWA approaches
• KD2R: a Key Discovery method for reference reconciliation
• SAKey: Scalable almost key discovery in RDF data
• Linkkey: Data interlinking through robust Linkkey extraction
• VICKEY: Conditional key discovery
89

KEY DISCOVERY
APPROACHES
CWA approaches
• Keys and Pseudo-Keys Detection for Web Datasets
Cleaning and Interlinking
OWA approaches
90

KEYS AND PSEUDO-KEYS DETECTION FOR
WEB DATASETS CLEANING AND
INTERLINKING [Atencia et al. 12]
Bottom-up approach that discovers
• Keys in the Closed World
• Pseudo keys - keys that tolerate exceptions
91
x
P2P1 P3 P4
P1P2 P1P3 P1P4 P2P3 P2P4 P3P4
P1P2P3 P1P2P4 P2P3P4
P1P2P3P4

92
x
P1

93
x
P2P1

94
x
P2P1 P3

Top-down approach that discovers
95
x
P2P1 P3 P4

96
x
P2P1 P3 P4
If P4 is a key => P1P4, P2P4,.., P1P2P3P4 are also keys

97
x
P2P1 P3 P4

98
x
P2P1 P3 P4

If P4 is a key => P1P4, P2P4,.., P1P2P3P4 are also keys 99
x
P2P1 P3 P4
P1P2P3P4

To verify if a set of properties is a key
• Partition instances according to their sharing values
• If each partition contains only one instances => Key
Key quality measures
• Support of a set of properties P:
• Discriminability of a set of properties P (pseudo-keys):
100
support ! =
# !"#$%"&'# !"#$%&'"! !" !
# !"" !"#$%"&'#

[ATENCIA ET AL. 12]- EXAMPLE
101
Name Actor Director ReleaseDate Website Language
film
1
“Ocean’s 11” “B. Pitt”
“J.
Roberts”
“S.
Soderbergh”
“3/4/01” www.oceans11.com ---
film
2
“J.
Roberts”
“S.
Soderbergh”
“R. Howard”
“2/5/04” www.oceans12.com “english”
film
3
“G.
Clooney”
“S.
Soderbergh”
“R. Howard”
film
4
“The
descendants”
“N. Krause”
“G.
Clooney”
“A. Payne” “15/9/11” --- “english”
film
5
“Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.co
m
“english”

102
Film3Film2Film1
Film5Film4
[Name]: key
support 5/5 =1
Discriminability = 5/5 =1
Partitions using [Name]
è
film
1
“J.
Roberts”
“S.
Soderbergh”
film
2
“J.
Roberts”
“S.
Soderbergh”
“R. Howard”
film
3
“G.
Clooney”
“S.
Soderbergh”
“R. Howard”
film
4
“The
descendants”
“N. Krause”
“G.
Clooney”
“A. Payne” “15/9/11” --- “english”
film
5
“Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.com “english”

103
Partitions using [Actor]
è
film
1
“J.
Roberts”
“S.
Soderbergh”
film
2
“J.
Roberts”
“S.
Soderbergh”
“R. Howard”
film
3
“G.
Clooney”
“S.
Soderbergh”
“R. Howard”
film
4
“The
descendants”
“N. Krause”
“G.
Clooney”
“A. Payne” “15/9/11” --- “english”
film
5
m
“english”
Film4
Film3Film1
Film2
Film5
[Actor]: ???

104
Partitions using [Actor]
è
film
1
“J.
Roberts”
“S.
Soderbergh”
film
2
“J.
Roberts”
“S.
Soderbergh”
“R. Howard”
film
3
“G.
Clooney”
“S.
Soderbergh”
“R. Howard”
film
4
“The
descendants”
“N. Krause”
“G.
Clooney”
“A. Payne” “15/9/11” --- “english”
film
5
m
“english”
Film4
Film3Film1
Film2
Film5
[Actor]: pseudo key
Support = 5/5 =1
Discriminability = 3/4=0.75

105
Partitions using
[Actor,Director]
è
film
1
“J.
Roberts”
“S.
Soderbergh”
film
2
“J.
Roberts”
“S.
Soderbergh”
“R. Howard”
film
3
“G.
Clooney”
“S.
Soderbergh”
“R. Howard”
film
4
“The
descendants”
“N. Krause”
“G.
Clooney”
“A. Payne” “15/9/11” --- “english”
film
5
m
“english”
Film3Film2Film1
Film5Film4
[Actor, Director]: key
Support = 4/5 =0.8
Discriminability = 5/5=1

KEY DISCOVERY
APPROACHES
CWA approaches
and Interlinking
OWA approaches
106

SAKEY [Symeonidou et al. 14]
SAKey: Scalable Almost Key discovery approach for:
• Incomplete data (optimistic heuristic)
• Errors
• Duplicates
• Large datasets
Discovers almost keys
• Sets of properties that are not keys due to few exceptions
107

ALMOST KEY DISCOVERY
STRATEGY
• Naive automatic way to discover keys
• Examine all the possible combinations of properties
• Scan all instances for each candidate key
• Example: Class described by 15 properties è215 = 32767
candidate keys
• Discover keys efficiently by:
• Reducing the combinations
• Partially scanning the data
108

KEY DISCOVERY
SAKey: Key discovery approach for very large RDF datasets that may
contain erroneous data or duplicates
Region Producer Colour
Wine1 Bordeaux Dupont White
Wine2 Bordeaux Baudin Rose
Wine3 Languedoc Dupont Red
Wine4 Languedoc Faure Red
109
Producer is a key?

KEY DISCOVERY
110
Region is a non key?
• Interested only in maximal non keys:
• All the sets of properties that are not maximal non keys are keys
• Example: class described by the properties p1, p2, p3, p4
Maximal non key = {{p1, p2}} keys = {{p3}, {p4}}è

SAKEY–GENERAL ARCHITECTURE
111
[Symeonidou et al. 14]

KEY DISCOVERY
• n-almost key: A set of properties for which at most n instances are
identical values
• Examples of keys
{Region, Producer}:
0-almost key
112

KEY DISCOVERY
SAKey: Key discovery approach for very large RDF datasets that
may contain erroneous data or duplicates
• n-almost key: A set of properties for which at most n instances are
identical values
• Examples of keys
{Region, Producer}:
0-almost key
{Producer}:
2-almost key
Tool available in https://www.lri.fr/sakey
113

N-ALMOST KEYS
Exception of a key: an instance that shares values with another instance for
a given set of properties P
114
Films HasName HasActor HasDirector ReleaseDate HasWebsite HasLanguage
f1 “Ocean’s 11” “B. Pitt”
“J. Roberts”
“S.
Soderbergh”
“G. Clooney”
“J. Roberts”
“S.
Soderbergh”
“R. Howard”
“G. Clooney”
“S.
Soderbergh”
“R. Howard”
f4 “The
descendants”
“N. Krause”
“G. Clooney”
“A. Payne” “15/9/11” www.descendants.com “english”
f5 “Bourne
Identity“
“D. Liman” --- “12/6/12” www.bourneIdentity.com “english”
f6 “Ocean’s 12“ --- “R. Howard” “2/5/04” --- ---

N-ALMOST KEYS
Exception of a key: an instance that shares values with another instance for a
given set of properties P
• f1, f2 and f3 are three exceptions for the property set {HasActor}
115
“J. Roberts”
“S.
Soderbergh”
“G. Clooney”
“J. Roberts”
“S.
Soderbergh”
“R. Howard”
“G. Clooney”
“S.
Soderbergh”
“R. Howard”
f4 “The
descendants”
“N. Krause”
“G. Clooney”
f5 “Bourne
Identity“
f6 “Ocean’s 12“ --- “R. Howard” “2/5/04” --- ---

N-ALMOST KEYS
Exception of a key: an instance that shares values with another instance for a
given set of properties P
• f1, f2 and f3 are three exceptions for the property set {HasActor}
Exception Set EP: set of exceptions for P
• EP = {f1, f2, f3} È {f2, f3, f4} = {f1, f2, f3, f4} for {HasActor}
116
“J. Roberts”
“S.
Soderbergh”
“G. Clooney”
“J. Roberts”
“S.
Soderbergh”
“R. Howard”
“G. Clooney”
“S.
Soderbergh”
“R. Howard”
f4 “The
descendants”
“N. Krause”
“G. Clooney”
f5 “Bourne
Identity“
f6 “Ocean’s 12“ --- “R. Howard” “2/5/04” --- ---

N-ALMOST KEYS
n-almost key: a set of properties where |EP|≤ n
• {HasActor} is a 4-almost key
117

N-ALMOST KEYS
n-non key: a set of properties where |EP|≥ n
• Using all the maximal n-non keys we can derive all the
minimal (n-1)-almost keys
118

N-ALMOST KEYS
119
All combinations
of properties

N-ALMOST KEYS
120
All sets of properties
that contain at least 5
exceptions
All combinations
of properties
5-non keys

N-ALMOST KEYS
121
that contain at least 5
exceptions
All combinations
of properties
that contain less than 5
exceptions
5-non keys 4-almost keys

N-NON KEY DISCOVERY:
INITIAL MAP
HasActor {{f1, f2}, {f1, f2, f3}, {f2, f3, f4}, {f4}, {f5}}
HasDirector {{f1,f2,f3}, {f2, f3, f6}, {f4}}
ReleaseDate {{f1}, {f2, f6}, {f3}, {f4}, {f5}}
HasName {{f1}, {f2, f6}, {f3}, {f4}, {f5}}
HasLanguage {{f4, f5}}
HasWebsite {{f1}, {f2}, {f3}, {f4}, {f5}, {f6}}
“B. Pitt” “G. Clooney” “D. Liman”“N. Krause”“J. Roberts”“S. Soderbergh”
122

DATA FILTERING
Singleton filtering
123
HasName {{f1}, {f2, f6}, {f3}, {f4}, {f5}}

DATA FILTERING
Singleton filtering
124
HasName {{f1}, {f2, f6}, {f3}, {f4}, {f5}}

DATA FILTERING
Singleton filtering
125
HasName {{f1}, {f2, f6}, {f3}, {f4}, {f5}}
Single key

DATA FILTERING
Singleton filtering
126
HasActor {{f1, f2, f3}, {f2, f3, f4}}
HasDirector {{f1,f2,f3}, {f2, f3, f6}}
ReleaseDate {{f2, f6}}
HasName {{f2, f6}}

POTENTIAL N-NON KEYS
Combinations of properties not needed be explored
• Incomplete data
• Properties referring to different classes
Lake Mountain
NaturalPlace
depth mountainRange
127
rdfs:subclassOfrdfs:subclassOf

POTENTIAL N-NON KEYS
Combinations of properties not needed be explored
• Incomplete data
• Properties referring to different classes
• Potential n-non keys: Sets of properties that possibly refer
to n-non keys
128
Lake Mountain
NaturalPlace
depth mountainRange
rdfs:subclassOfrdfs:subclassOf

N-NON KEY
DISCOVERY
HasActor
• {f1, f2, f3} È {f2, f3, f4} = {f1, f2, f3, f4} => 4-non key
Composite n-non keys
• Intersections between sets of different properties
129
HasActor {{f1, f2, f3}, {f2, f3, f4}}
HasDirector {{f1,f2,f3}, {f2, f3, f6}}
ReleaseDate {{f2, f6}}
HasName {{f2, f6}}

130
N-NON KEY
DISCOVERY [Symeonidou et al. 14]

131
N-NON KEY
DISCOVERY
{hasActor, director} è 3-non key

132
N-NON KEY
DISCOVERY
{hasActor, director} è 3-non key
…

KEY MERGE: SEVERAL DATASETS
Goal: Keys valid in both datasets
• More sure keys
Intuition: Computation of Cartesian product of sets of keys
• Keep only minimal keys
133
[firstName, LastName]
[hasFriend]
[firstName]
KeysA KeysB

KEY MERGE: SEVERAL DATASETS
Goal: Keys valid in both datasets
• More sure keys
Intuition: Computation of Cartesian product of sets of keys
• Keep only minimal keys
134
[hasFriend]
[firstName]
KeysA KeysB
[hasFriend, firstName]
KeysAB

DATA LINKING USING
ALMOST KEYS
Goal: Compare linking results using almost keys with different n
Evaluation of linking using
• Recall
• Precision
• F-Measure
Datasets
• OAEI 2010
• OAEI 2013
Conclusion
• Linking results using n-almost keys are the better than using keys
135

EXAMPLE: DATA LINKING
USING ALMOST KEYS
OAEI 2013 - Person
• BirthName, BirthDate, award, comment, label, BirthPlace, almaMater,
doctoralAdvisor
Almost keys Recall Precision F-Measure
0-almost key {BirthDate, award} 9.3% 100% 17%
2-almost key {BirthDate} 32.5% 98.6% 49%
# exceptions Recall Precision F-measure
0, 1 25.6% 100% 41%
2, 3 47.6% 98.1% 64.2%
4, 5 47.9% 96.3% 63.9%
6, ..., 16 48.1% 96.3% 64.1%
17 49.3% 82.8% 61.8%
136

KD2R VS. SAKEY -
NON KEY DISCOVERY
Class # triples
#
Instances
#Properties
KD2R
Runtime
SAKey
Runtime
(n=0)
DB:Website 8506 2870 66 13min 1s
YA:Building 114783 54384 17 26s 9s
DB:BodyOfWater 1068428 34000 200 outOfMem. 37s
DB:NaturalPlace 1604348 49913 243 outOfMem. 1min10s
Dbpedia class=
DB:NaturalPlace
137

KD2R VS. SAKEY -
KEY DERIVATION
Class
# non
keys
# keys KD2R SAKey
(n=0)
DB:Lake 50 480 1min10s 1s
DB:Mountain 49 821 8min 1s
DB:BodyOfWater 220 3846 > 1 day 66s
DB:NaturalPlace 302 7011 > 2 days 5min
Dbpedia class=
DB:BodyOfWater
138

KEY DISCOVERY
APPROACHES
CWA approaches
and Interlinking
OWA approaches
139

LINKKEY
140
Ontology1 Ontology2
Data of classA Data of classB
Linkkey
Discovery
Given a pair of classes in two datasets conforming to two different ontologies:
1) Discover Linkkeys – maximal sets of property pairs that can link instances of two
different classes
[Atencia et al. 2014]

LINKKEY
141
Person1
Literal
LastName
Literal Literal
Person2
Literal
LN
Literal Literal
different classes
[Atencia et al. 2014]

LINKKEY [ATENCIA ET AL. 2014]
142
Person1
Literal
LastName
Literal Literal
Person2
Literal
LN
Literal Literal
Owl:EquivalentClass
different classes

143
Person1
Literal
LastName
Literal Literal
Person2
Literal
LN
Literal Literal
Owl:EquivalentClass
LastName Nationality Profession
P11 Tompson Greek
P12 Dupont French Researcher
LN NL PR
P21 Tompson Greek
P22 Tompson Greek,
French
Reseacher
different classes

144
Person1
Literal
LastName
Literal Literal
Person2
Literal
LN
Literal Literal
Owl:EquivalentClass
Ex. {<LastName,LN>,<Nationality,NL>}, {<Profession,PR>,<Nationality,NL>}
P11 Tompson Greek
LN NL PR
P21 Tompson Greek
P22 Tompson Greek,
French
Reseacher
different classes

145
different classes
2) Linkkey quality evaluation:
1) use measures of discriminability and coverage for non supervised linkkey
extraction
2) an approximation of precision and recall for the supervised case.
Ex. {<LastName,LN>,<Nationality,NL>}, {<Profession,PR>,<Nationality,NL>}
P11 Tompson Greek
LN NL PR
P21 Tompson Greek
P22 Tompson Greek,
French
Reseacher

CONDITIONAL KEYS
COLLABORATION WITH:
LUIS GALARAGA (INRIA RENNES)
FABIAN SUCHANEK (TELECOM PARISTECH)
DANAI SYMEONIDOU (INRA, MONTPELLIER)
146
Symeonidou et al. 2017

KEYS IN KNOWLEDGE
BASES
To obtain keys found in Knowledge Bases (KBs)
• Different automatic key discovery approaches (eg. SAKey, KD2R)
Key limitations
• Cases of datasets having no keys
• Keys are generic, i.e., they are true for every instance of a class in a
dataset
FirstName LastName Gender Lab Nationality
instance1 Claude Dupont Female Paris-Sud France
instance2 Claude Dupont Male Paris-Sud Belgium
instance3 Juan Rodríguez Male INRA Spain, Italy
instance4 Juan Salvez Male INRA Spain
instance5 Anna Georgiou Female INRA Greece, France
instance6 Pavlos Markou Male Paris-Sud Greece
instance7 Marie Legendre Female INRA France
Instances of the
class Person
Key: {LastName,
Gender}
147

PROBLEM STATEMENT
Motivating example
• In many countries of the world, a student can be supervised
by multiple supervisors
• In German Universities, a student can be supervised by only
one professor
• {supervises} key for the instances of the class Professor with
“condition” that they are in a German university
Approximate keys [Sakey] express keys that do not uniquely
identify every instance of a class in a dataset
Approximate keys do not express the conditions under which they are true
148

CONDITIONAL KEYS
BY VICKEY
Conditional key: a key, valid for instances of a class satisfying a
specific condition
• Condition part: pairs of property and value
Eg. {Lab=INRA}, {Gender=Male}, {Gender=Female, Lab=INRA} etc.
• Key part: a set of properties
{LastName} is a key under the condition {Lab=INRA}
Instances of the
class Person
149

CONDITIONAL KEY
QUALITY MEASURES
Support: #instances both satisfying condition part and instantiating key part
Coverage: Support/#all_Instances
Support: 4
Coverage: 4/7 = 0.57
{LastName} is a key under the condition {Lab=INRA}
Instances of the
class Person
150

VICKEY: MINING EFFICIENTLY
CONDITIONAL KEYS
Goal of VICKEY
• Given all instances of a class in a dataset, a min_support, a min_coverage,
mine all minimal conditional keys with
• support ≥ min_support
• coverage ≥ min_coverage
Exponential search space (O(|V||P|)with
• V, the set of objects of a given class in the dataset
• P, the set of properties of a given class in the dataset
151
Conditional keys can be efficiently discovered from the non-keys

VICKEY: MINING EFFICIENTLY
CONDITIONAL KEYS
Non-key: a set of properties where two instances share at least one value for
each property in the non-key (SAKey)
Given a maximal non-key
• Condition part: subset of properties of the non-key
• Key part: subset of the remaining properties of the non-key
Non-key
{FirstName, Gender, Lab,
Nationality}
Instances of the
class Person
152

CONDITIONAL KEY GRAPH
EXPLORATION
• Given a non-key
• Step 1: Discover all minimal conditional keys with condition of size 1
• Step 2: Discover all minimal conditional keys with condition of size 2
• …
• Step n: Discover all minimal conditional keys with condition of size n
153

• Given a non-key
• Step 1: Discover all minimal conditional keys with condition of size 1 {p=a}
• Step 2: Discover all minimal conditional keys with condition of size 2 {p1=a1^p2=a2}
• …
• Step n: Discover all minimal conditional keys with condition of size n {p1=a1^…^pn=an}
FirstName Lab
FirstName
Lab
FirstName
Nationality
Lab
Nationality
FirstName
Nationalit
y
Lab
Gender=Female Nationality
Condition Conditional key graph
Non-key
{FirstName, Gender, Lab, Nationality}
154
EXPLORATION Symeonidou et al. 2017

• Given a non-key
• …
FirstName Lab
FirstName
Lab
FirstName
Nationalit
y
Lab
Nationality
FirstName
Nationalit
y
Lab
Gender=Female Nationality
Key {FirstName} for cond {Gender=Female}
Non-key
155

• Given a non-key
• …
Gender=Female
firstName
firstName
lab
firstName
nationalit
y
lab
nationalit
y
firstName
nationalit
y
lab
FirstNam
e
FirstName
Lab
FirstName
Nationalit
y
FirstName
Nationalit
y
Lab
Lab
Nationalit
y
NationalityLab
Non-key
156

• Given a non-key
• …
Lab=INRA
FirstNam
e
FirstName
Gender
FirstName
Nationalit
y
Gender
Nationalit
y
FirstName
Nationalit
y
Gender
NationalityGender
Non-key
157

• Given a non-key
• …
Gender=Female
^
Lab=INRA
Nationality
158

EXPERIMENTAL
EVALUATION
Evaluation of VICKEY’s scalability in large datasets
Evaluation of conditional keys in data linking
159

VICKEY’S
SCALABILITY
Run VICKEY in large datasets
Compare the runtime results with AMIE - a generic rule
mining approach adapted to mine conditional keys
• Both approaches take as input non-keys mined by SAKey to
reduce the search space of conditional key mining
Coverage 1%
160

RUNTIME RESULTS -
VICKEY VS. AMIE[2]
Class* #Triples #Instances #Properties #NonKeys VICKEY AMIE #ConditionalKeys
Actor 57.2k 5.8k 71 137 4.52m 12.58h 311
Album 786.1k 85.3k 39 68 1.53h 3.90h 304
Book 258.4k 30.0k 51 95 11.84h >1d 419
Film 832.1k 82.6k 74 132 1.37h 3.64h 185
Mountain 127.8k 16.4k 58 47 2.86m 23.57m 257
Museum 12.9k 1.9k 65 17 1.46s 6.45s 58
Organization 1.82M 178.7k 553 3221 26.32h > 36h 28
Scientist 258.5k 19.7k 73 309 27.67m > 1d 582
University 85.5k 8.7k 89 140 14.45h >1d 941
*All used classes are obtained from DBpedia
161

DATA LINKING -
VICKEY VS. SAKEY[1]
Goal: link two datasets using
• Classical keys discovered by SAKey[1]
• Conditional keys discovered by VICKEY
• Supervises(Prof1, stud1)^Supervises(Prof2, stud1)^teachesIn(Prof1,
“Gemany”)^teachesIn(Prof2, “Gemany”)=> sameAs(Prof1, Prof2)
• Both classical keys and conditional keys
Evaluate obtained links using the existing goal-standard with
• Recall
• Precision
• F-Measure
162

DATA LINKING -
VICKEY VS. SAKEY[1]
Link two knowledge bases containing information of Wikipedia
• Yago[3]
• DBpedia[4]
Used classes of DBpedia and Yago
• Actor
• Album
• Book
• Film
• Mountain
• Museum
• Organization
• Scientist
• University
163

DATA LINKING -
VICKEY VS. SAKEY[1]
Class Recall Precision F-Measure
Actor
Keys[1]* 0.27 0.99 0.43
Conditional keys** 0.57 0.99 0.73
Keys[1]+Conditional
keys 0.6 0.99 0.75
Album
Keys[1] 0 1 0.00
Conditional keys 0.15 0.99 0.26
Keys[1]+Conditional
keys 0.15 0.99 0.26
Film
Keys[1] 0.04 0.99 0.08
Conditional keys 0.38 0.96 0.54
Keys[1]+Conditional
keys 0.39 0.98 0.55
Museum
Keys[1]
0.12 1 0.21
Conditional keys
0.25 1 0.40
Keys[1]+Conditional
keys
0.31 1 0.47
x 869
x 1.75
x 7.1
x 2.19
*Keys[1] from SAKey **Conditional keys from VICKEY
164

SUMMARY
Keys and almost keys:
• Better linking results in terms of recall with n-almost keys than with sure keys
• Optimistic heuristic gets better results than pessimistic one
• Can handle size classes much larger than KD2R (DB: Natural Place more than
16 million triplets)
Conditional keys:
• Better linking results in terms of recall when conditional keys are used.
• Aside data linking, may be used to refine a class hierarchy
Improvements:
• Numerical values
• Dataset heterogeneity
When no (almost-)keys neither conditional keys are discoverable from the dataset
è Possible solution: Referring expressions
165

LINK
INVALIDATION /
REQUALIFICATION
WORK IN COLLABORATION WITH:
LAURA PAPALEO (GENOVA PROVINCE, ITALY)
JOE RAAD (PHD STUDENT, AGROPARISTECH & UNIVERSITÉ PARIS SUD. )
166

• owl:sameAs, indicates that two different
descriptions refer to the same entity
• owl:sameAs semantics is too strict
• Reflexive, symmetric, transitive and
• Property sharing:
" X " Y owl:sameAs(X, Y)Ù p(X, Z) Þ p(Y, Z)
• Automatic data linking tools do not guarantee 100%
precision, because of:
• Errors, missing information, data freshness, …
• [Halpin et al. 2010] showed that 37% of owl:sameAs
links randomly selected among 250 identity links
between books were incorrect.
• Problem: how to (semi)-automatically invalidate/requalify
owl:sameAs links?
IDENTITY CRISIS
b1 b2
owl:sameAs
b4 b3
owl:sameAs
b1 b5
owl:sameAs
b5 b6
owl:sameAs
167

IDENTITY CRISIS: SOME SOLUTIONS
b1 b2
owl:sameAs
b4 b3
owl:sameAs
b1 b5
owl:sameAs
b5 b6
owl:sameAs
168
• [Halpin et al. 2010]: propose ontology of identity and
invalidation of identity links by crowdsourcing.
• [de Melo 2013]: uses the Unique Name Assumption
and the transitivity of links to detect inconsistencies in
the data.
• [Papaleo et al. 2014, Papageorgiou et al. 2017]: exploit
some ontology axioms to logically/numerically detect
invalid identity links.
• [Raad et al. 2017] computes contextual identity links
for each pair of instances

LINK
INVALIDATION
[QUALINCA ANR PROJECT
(2012-2016)]
- NATHALIE PERNELLE (LRI, UNIV. PARIS SUD)
- LAURA PAPLEO(GENOVA PROVINCE, ITALY)
169

LOGICAL AND NUMERICAL
APPROACH FOR LINK INVALIDATION
• Two ontology-based methods to detect invalid sameAs statements: a
logical method and a numerical method
• We build a contextual graph «around» each one of the two resources
involved in the sameAs by exploiting ontology axioms on:
• functionality and inverse functionality of properties and
• local completeness of some properties (e.g., the author list of a
book).
• We analyse the descriptions provided in these contextual graphs to
eventually detect inconsistencies or high dissimilarities.
170

LOGICAL METHOD
F is the set of RDF facts
enriched by a set of ¬synVals
facts in the form
¬synVals(w1, w2)
w1 and w2, being literals and
different.
171of 20
EXAMPLES:
- notSynVals(‘231’,’100’)
for a functional property numOfPages
-notSynVals(‘New York’, ‘Paris’)
for a functional property cityName
… knowledge from expert or extracted.
171
[Papaleo et al. 2014]

LOGICAL METHOD
R the set of rules
172of 20
(inverse) functional properties
local complete properties
sameAs(x,y) ÙnumOfPages(x,w1) Ù numOfPages(y,w2) à SynVals(w1,w2)
sameAs(x,y) Ù hasAuthor(x,w1)
à hasAuthor(y,w1)
172

LOGICAL INVALIDAITON
• If the property nb-pages is declared as functional then:
nbPages(b1, n1) Ù nbPages(b2, n2) Ù (b1=b2) Þ n1=n2.
è owl:sameAs(b1, b2) est faux.
b1 b2
a11
a21
owl:sameAs(b1, b2) ?
author
author
A Semantic Web
Primer
titre
A Semantic Web
Primer
titre
2007
pubYear
2007
pubYear
G. Antoniou
aName
Grigoris
Antoniou
aName
author
a12
author
a22
208
nbPages 288nbPages
Paul Grauth
aName
P. Grauth
aName
173

b1
e1
b2
a11 e2 a21
owl:sameAs(b1, b2) ?
author
author
publisherpublisher
A Semantic Web
Primer
title
A Semantic Web
Primer
title
2008
pubYear
2007
pubYear
G. Antoniou
aName
Grigoris
Antoniou
aName
author
…
a12
…
author
…
a22
…
r2n
r21ref
ref
…r1m r11
ref ref
…
NUMERICAL METHOD: EXAMPLE
• P= {title, pubYear, publisher, author, aName, pName}
• Sim(“A Semantic Web …``, “A Semantic Web …``) = 1, Sim(“ 2007“, “2008“) = 0
• Sim(“MIT Press``, “MIT Press``) = 1, è CSim(e1, e2) = 1
• Sim(“Grigoris Antoniou``, “G. Antoniou``) = 0.5, ... è CSim(a11, a21) = 0.5, ….
è CSim(a12, a22) = 0.5
MIT Press
MIT Press
pName
pNam
e
• CSim(b1, b2) = 0.62 if agregation function is average and
• CSim(b1, b2) = 0 if agregation function is minimum 174
[Papageorgio et al. 2017]

COMPARAISON
LOGICAL/NUMERICAL
Logical method
[Papaleo, Pernelle and Saïs (2014)]
Numerical method
(Agregation = average)
[Papageorgiou, Pernelle and Saïs 2017]
Datasets Precision Recall F-measure Precision Recall F-measure
thresh
old
Person1 0.69 0.98 0.81 1.0 0.98 0.99 0.3
Person2 0.5 1.0 0.67 0.994 0.989 0.99 0.2
Restaurant 0.63 0.97 0.77 0.97 1.0 0.98 0.4
• Average gain of 23% F-measure (significant increase in precision, comparable recall)
175

CONTEXTUAL
IDENTITY
[JOE RAAD PHD
LIONES (2015-2018) PROJECT,
FUNDED BY CDS PARIS SACLAY]
- NATHALIE PERNELLE (LRI, UNIV. PARIS SUD),
- JULIETTE DIBIE (AGROPARISTECH, INRA),
- LILIANA IBANESCO (AGROPARISTECH, INRA),
- CAROLINE PENICAUD (INRA PARIS)
176

• owl:sameAs, indicates that two different descriptions refer to the same
entity
• owl:sameAs semantics is too strict
• Reflexive, symmetric, transitive and
• Property sharing:
" X " Y owl:sameAs(X, Y)Ù p(X, Z) Þ p(Y, Z)
• In some specific domains, the identity depends on the context.
Examples:
• same-components and same-quantity è same-juice. [Hypocaloric diet context]
• proportionally-same-components è same-juice. [Shopping list context]
• Problem: how to automatically compute the contextual identity links?
CONTEXTUAL IDENTITY
177
[Raad et al. 2014]

178
CONTEXTUAL IDENTITY
Proposition of a new contextual identity link
• A new predicate to represent identity links valid in explicit contexts
• Hierarchy of the contexts and the identity links
• Defining identity contexts for each class (local contexts)
• Defining identity contexts for each class and its connected graph
(global context)
[Raad et al. 2014]

CONTEXTUAL IDENTITY
Contextual Identity Link Example
Πa(Juice) = { (Juice, {rdf:Type, expiryDate}, {isComposedOf}),
(Banana, {rdf:Type}, {isComposedOf -1}),
(Strawberry, {rdf:Type}, {hasAttribute, isComposedOf -1}),
(Weight, {rdf:Type, hasValue, hasUnit}, {hasAttribute-1}) }
identiConTo<Πa(Juice)>(juice1, juice2)
179
[Raad et al. 2014]

CONTEXTUAL IDENTITY
Global Context
• π(ck): local context of the class ck =(ck, DPck, Opck)
• src: a given source class of the ontology
• G: connected sub-graph containing src
• CG: set of classes belonging to G
• Ck: class belonging to CG
Π(src) =Uck ∈ CG π(ck)
180
[Raad et al. 2014]

CONTEXTUAL IDENTITY
Πa(Juice) ≤ Πc(Juice) Πb(Juice) ≤ Πc(Juice)
Πa(Juice) Πb(Juice)
Πc(Juice)
…
…
each local context in Πc(Juice) is less specific or equal to its corresponding local
context in Πa(Juice) 181
[Raad et al. 2014]

DECIDE METHOD – DETECTION OF
CONTEXTUAL IDENTITY
It automatically detects and adds these contextual
identity links in the knowledge graph
DECIDE
DEtection of Contextual IDEntity
Knowledge
Graph
Source
Class
Unwanted
Properties
Paired
Properties
Necessary
Properties
For each pair of instances (i1, i2) of the source class
set of the most specific global contexts in which (i1, i2)
are identical
182
[Raad et al. 2014]

EXPERIMENTS - LIONES PROJECT
Transformation of Micro-organisms Digestion Process
≈ 1000 files ≈ 2000 files
Guidelines
CellExtraDry
- Classes: ≈ 4 700
- Individuals: ≈ 415 000
- Statements: ≈ 1 700 000
Carredas
- Classes : ≈ 5 000
- Individuals: ≈ 42 000
- Statements : ≈ 237 000
RDFRDF
183
[Raad et al. 2014]

Mixture
DECIDE
DEtection of Contextual IDEntity
CellExtraDry + Carredas
184
[Raad et al. 2014]

identiConTo<Π(Mixture)25>(i, j): two mixtures have the same nature (e.g. humid) and
the same components (water, yeast, mono-potassium, phosphate, and glucose),
with each of these components having identical weights (value + unit of measure)
Generation of rules for missing observations prediction
185
[Raad et al. 2014]

INVALIDATION /
REQUALIFICATION: SUMMARY
• Contextual identity useful for profiling identity links
• Useful for giving explanations for possible experts validation
• Allow to generate missing values prediction rules.
• Future directions:
• Is Contextual difference relevant/useful?
• How to make explicit incomplete information for a pair of resources?
• Use when the ontologies are different
• SameAs link invalidation
• Improves the precision without decreasing the recall
• May be used for error detection and/or anomalies in ontology axioms
• Future directions:
• Provide comprehensible explanations for experts
• Use when the ontologies are different
186

SOME FUTURE CHALLENGES
q Data evolution
q ontology axiom evolution
q link evolution
q temporal data linking
q Data veracity è what is the truth?
q Knowledge discovery in complex data
q key graphs
q causality rules
q Explanation problem
q provenance of the data and the data/knowledge processes
187

REFERENCES (1)
[Atencia et al. 2014] Data interlinking through robust Linkkey extraction.
Atencia, Manuel, Jérôme David, and Jérôme Euzenat. ECAI, 2014.
[Atencia et al.’12] Keys and Pseudo-Keys Detection for Web Datasets Cleansing and Interlinking.
Manuel Atencia, Jérôme David, François Scharffe. In EKAW 2012
[Cohen et al. 2003] A comparison of string distance metrics for name-matching tasks.
William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg.
In IIWEB@AAAI 2003.
[Ferrara13] Evaluation of instance matching tools: The experience of OAEI.
Alfio Ferrara, Andriy Nikolov, Jan Noessner, François Scharffe. OM@ISWC 2013
[Hu et al. 2011] A Self-Training Approach for Resolving Object Coreference on the Semantic Web.
Wei Hu, Jianfeng Chen, Yuzhong Qu. In WWW 2011
[Kang et al. 2008] Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its
Evaluation. Kang, Getoor, Shneiderman, Bilgic, Licamele,
In IEEE Trans. Vis. Comput. Graph2008
189

REFERENCES (2)
[Lenat and Feigenbaum 1991] On the threshold of knowledge.
Douglas B. Lenat and Edward A. Feigenbaum
In Artificial Intelligence 47 (1991)
[Nikolov et al’12] Unsupervised Learning of Link Discovery Configuration
Andriy Nikolov, Mathieu d'Aquin, Enrico Motta. In ESWC 2012.
[Pernelle et al.’13] An Automatic Key Discovery Approach for Data Linking.
Nathalie Pernelle, Fatiha Saïs. and Danai Symeounidou.
In Journal of Web Semantics
[Saïs et al.07] L2R: a Logical method for Reference Reconciliation.
Fatiha Saïs, Nathalie Pernelle and Marie-Christine Rousset. In AAAI 2007.
[Saïs et al.09] Combining a Logical and a Numerical Method for Data Reconciliation.
Fatiha Saïs., Nathalie Pernelle and Marie-Christine Rousset.
In Journal of Data Semantics.
[Shvaiko,Euzenat13] Ontology Matching: State of the Art and Future Challenges,
Pavel Shvaiko, Jérôme Euzenat. In TKDE 2013
190

REFERENCES (3)
[Suchanek11] PARIS: Probabilistic Alignment of Relations, Instances, and Schema
Fabian Suchanek, Serge Abiteboul, Pierre Senellart. In VLDB 2011.
[Soru et al. 2015] ROCKER: a refinement operator for key discovery.
Soru, Tommaso, Edgard Marx, and Axel-Cyrille Ngonga Ngomo.
In WWW, 2015.
[Symeonidou et al. 2014] SAKey: Scalable almost key discovery in RDF data.
Symeonidou, Danai, Vincent Armant, Nathalie Pernelle, and Fatiha Saïs.
In ISWC 2014.
[Symeonidou et al. 2017] VICKEY: Mining Conditional Keys on RDF datasets .
Danai Symeonidou, Luis Galarraga, Nathalie Pernelle, Fatiha Saïs and Fabian Suchanek.
In ISWC 2017.
[Papageorgiou et al. 2017] Approche numérique pour l'invalidation de liens d'identité (owl:SameAs).
Dimitrios Christaras Papageorgiou, Nathalie Pernelle and Fatiha Saïs. In IC 2017.
191

REFERENCES (4)
[Papaleo et al. 2014] Logical Detection of Invalid SameAs Statements in RDF Data,
Laura Papaleo, Nathalie Pernelle, Fatiha Saïs and Cyril Dumont. In EKAW 2014
[Volz et al'09] Silk – A Link Discovery Framework for the Web of Data.
Julius Volz, Christian Bizer et al. In WWW 2009.
[Zheng et al. 2013] Results for OAEI 2013
Qian Zheng, Chao Shao, Juanzi Li, Zhichun Wang and Linmei Hu. OM@ISWC 2010
192

THANKS
193
N. Pernelle M.C. Rousset D. Symeonidou
J. Raad L. Papaleo N. Niraula

Discover Hidden Connections with Knowledge Graph Linking

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Discover Hidden Connections with Knowledge Graph Linking

Ähnlich wie Discover Hidden Connections with Knowledge Graph Linking (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Discover Hidden Connections with Knowledge Graph Linking