SlideShare ist ein Scribd-Unternehmen logo
1 von 193
Downloaden Sie, um offline zu lesen
KNOWLEDGE GRAPH EXPANSION
AND ENRICHMENT
LRI, UNIVERSITÉ PARIS SUD, CNRS, UNIVERSITÉ PARIS
SACLAY
TUTORIAL@BDA 2017
FATIHA SAÏS
1
OUTLINE
§ Linked Open Data
§ Knowledge Graphs
§ Technical Part
1. Data Linking
2. Key Discovery
3. Link invalidation/Requalification
§ Conclusion and Future Challenges
2
FROM THE WWW TO THE
WEB OF DATA
- applying the principles of the WWW to data
data is links,
not only properties
Linked Open Data
3
LINKED DATA
PRINCIPLES
① Use HTTP URIs as identifiers for resources
à so people can look up the data
② Provide data at the location of URIs
à to provide data for interested parties
③ Include links to other resources
àso people can discover more things
àbridging disciplines and domains
è Unlock the potential of island repositories
Tim Berners Lee, 2006
4
RDF – RESOURCE
DESCRIPTION FRAMEWORK
• Statements of < subject predicate object >
http://dbpedia.org/resource/Airbus
dbo:label
“Airbus”
Subject Predicate Object
… is called a triple
5
LINKED OPEN DATA
6
Linked Data - Datasets
under an open access
- 1,139 datasets
- over 100B triples
- about 500M links
- several domains
Ex. DBPedia : 1.5 B
triples
"Linking Open Data cloud diagram 2017, by Andrejs Abele, John
P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak.
http://lod-cloud.net/"
NEED OF
KNOWLEDGE
7
THE ROLE OF KNOWLEDGE IN AI
8
[Artificial Intelligence 47 (1991)]
The knowledge principle: “if a program is
to perform a complex task well, it must
know a great deal about the world in
which it operates.”
ONTOLOGY, A DEFINITION
9
“An ontology is an explicit, formal specification of a shared
conceptualization.”
[Thomas R. Gruber, 1993]
Conceptualization: abstract model of domain related expressions
Specification: domain related
Explicit: semantics of all expressions is clear
Formal: machine-readable
Shared: consensus (different people have different perceptions)
SEMANTIC WEB:
ONTOLOGIES
10
RDFS – Resource Description
Framework Schema
• Lightweight ontologies
OWL – Web Ontology Language
• Expressive ontologies
Source: https://it.wikipedia.org/wiki/File:W3C-
Semantic_Web_layerCake.png
1111
OWL ONTOLOGY
OWL – Web Ontology Language
• Represents rich and complex
knowledge about things
• Based on First Order Logic (FOL)
• Can be used to verify the consistency
of knowledge
• Can make implicit knowledge explicit
• Classes: concepts or collections of
objects (individuals)
• Properties:
• owl:DataTypeProperty (attribute)
• owl:ObjectProperty (relation)
• Hierarchy
• owl:subClassOf
• owl:subPropertyOf
• Individuals: ground-level of the
ontology (instances)
ONTOLOGY LEVELS
12
:type :type :type
Conceptual level:
- classes, properties
(relations)
Instance level:
- facts (individuals)
1313
• Axioms: knowledge definitions in the ontology that were explicitly defined and have
not been proven true.
• Reasoning over an ontology
→ Implicit knowledge can be made explicit by logical reasoning
• Example:
Pompidou museum is an Art Museum
< Pompidou_museum rdf:type ArtMuseum> .
Pompidou museum contains Hallucination partielle
< Pompidou_museum ao:contains Hallucination_partielle> .
• Infer that:
è Pompidou museum is a CulturalPlace
< Pompidou_museum rdf:type CulturalPlace> .
Because: Museum subsumes ArtMuseum and CulturalPlace subsumes Museum
è Hallucination partielle is a Work
<Hallucination_partielle rdf:type ao:Work> .
Because: the range of the object property contains is the class Work.
OWL ONTOLOGY - AXIOMS
LINKED OPEN
VOCABULARIES (LOV)
Linked Open Vocabularies
• Keeps track of available open
ontologies and provides them
as a graph
• Search for available
ontologies, open for reuse
• Example:
http://lov.okfn.org/dataset/lov/vo
cabs/foaf
14
ONTOLOGY PREDICATES SPREAD
ON THE SEMANTIC WEB
• RDFa (or Resource Description Framework in Attributes)
• Top 50 web sites publishing Semantic Web data, clustered by
predicates used.
FOAF
OUTLINE
§ Linked Open Data
§ Knowledge Graphs
§ Technical Part
§ Data Linking
§ Key Discovery
§ Link invalidation/requalification
§ Conclusion and Future Challenges
16
WHO IS DEVELOPING
KNOWLEDGE GRAPHS?
17
2012
2013
2015 2016
2013
2007
2008
2007
2012
Academic side Commercial side
WEB SEARCH WITHOUT
KNOWLEDGE GRAPHS
18
WEB SEARCH WITH
KNOWLEDGE GRAPHS
19
QUESTION ANSWERING WITH
KNOWLEDGE GRAPHS
20
CONNECTING EVENTS AND PEOPLE
WITH KNOWLEDGE GRAPHS
21
TOWARDS A KNOWLEDGE-
POWERED DIGITAL ASSISTANT
22
Cortana (Microsoft)
• Natural access and storage of knowledge
• Chat bots
• Personalization
• Emotion
KNOWLEDGE GRAPH: A
DEFINITION …
23
Wikipedia (en)
This is not a formal definition!
KNOWLEDGE GRAPH: A
DEFINITION …
24
[L. Ehrlinger and W. Wöß, SEMANTiCS’2016]
[16] H. Paulheim. Knowledge Graph Refinement: A Survey of
Approaches and Evaluation Methods. Semantic Web Journal,
(Preprint):1–20, 2016.
[12] M. Kroetsch and G. Weikum. Journal of Web Semantics: Special
Issue on Knowledge Graphs.
http://www.websemanticsjournal.org/index.php/ps/
announcement/view/19 [August, 2016].
[17] J. Pujara, H. Miao, L. Getoor, and W. Cohen. Knowledge Graph
Identification. In Proceedings of the 12th International Semantic Web
Conference - Part I, ISWC ’13, pages 542–557, New York, USA,
2013. Springer.
[3] A. Blumauer. From Taxonomies over Ontologies to Knowledge
Graphs, July 2014. https://blog.semanticweb.at/2014/07/15/from-
taxonomies-over-ontologiesto-knowledge-graphs [August, 2016].
[7] M. Farber, B. Ell, C. Menne, A. Rettinger, and F.
Bartscherer. Linked Data Quality of DBpedia, Freebase,
OpenCyc, Wikidata, and YAGO. Semantic Web Journal,
2016. http://www.semantic-web-journal. net/content/linked-
data-quality-dbpedia-freebaseopencyc-wikidata-and-yago
[August, 2016] (revised version, under review).
à Populated
Ontology
à Populated
Ontology
à RDF Graph
à RDF Graph
à Extracted
RDF Graph
KNOWLEDGE GRAPH (KG)
dbo:Painting
dbo:Museum
dbo:museum-of
dbo:Artist
dbo:author
dbo:city
dbo:PopulatedPlace
Thing
dbo:Person
dbo:Agent
owl:equivalentClass(dbo:Municipality, dbo:Place)
owl:equivalentClass(dbo:Place, dbo:Wikidata:Q532)
owl:equivalentClass(dbo:Village, dbo:PopulatedPlace)
owl:equivalentClass(dbo:PopulatedPlace,dbo:Municipality)
owl:disjointClass(dbo:PopulatedPlace, dbo:Artist)
owl:disjointClass(dbo:PopulatedPlace, dbo:Painting)
owl:FunctionalProperty(dbo:city)
owl:InverseFunctionalProperty(dbo:museum-of)
dbo:birthPlace(X, Y) => dbo:citizsenOf(X, Y)
dbo:parentOf(X, Y) => dbo:child(Y, X)
is-a is-a
is-a
is-a
is-a
Ontology hierarchy
Ontology axioms and rules
PREFIX dbo: <http://dbpedia.org/ontology#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?m ?p
WHERE { ?m rdf:type dbo:Museum . ?m dbo:musuem-of ?p .}
Querying (SPARQL)
- KG saturation: infer whatever can be inferred from the
KG.
- KG consistency checking: no contradictions
- KG repairing
- …
Reasoners: (Pellet, Fact++, Hermit, etc. )
RDF Graphs
25
KNOWLEDGE GRAPH EXPANSION
AND ENRICHMENT
• Expansion: knowledge graphs are incomplete
• Data linking (entity resolution, duplicate detection, reference
reconciliation)
• Link prediction: add relations
• Ontology matching: connect graphs
• Missing values prediction/inference
• Validation: knowledge graphs may contain errors
• Link invalidation
• Dealing with errors and ambiguities
• Enrichment: can new knowledge emerge from knowledge graphs?
• Knowledge discovery
• Automatic reasoning and planning
26
OUTLINE
§ Linked Open Data
§ Knowledge Graphs
§ Technical Part
1. Data Linking
2. Key Discovery
3. Link invalidation/requalification
§ Conclusion and Future Challenges
27
1. DATA LINKING
28
DATA LINKING
29
• Data linking or Identity link detection consists in detecting whether two descriptions of
resources refer to the same real world entity (e.g. same person, same article, same gene).
30
• Data linking or Identity link detection consists in detecting whether two descriptions of
resources refer to the same real world entity (e.g. same person, same article, same gene).
DATA LINKING
DATA LINKING: DIFFICULTIES
31
Different
Vocabularies
Misspelling errorsIncomplete Information :
- date and place of birth ?
- museum phone number ?
- …. ?
• Data linking or Identity link detection consists in detecting whether two descriptions of
resources refer to the same real world entity (e.g. same person, same article, same gene).
IDENTITY LINK DETECTION
PROBLEM
• Identity link detection consists in detecting whether two descriptions of
resources refer to the same real world entity (e.g. same person, same article,
same gene).
• Definition (Link Discovery)
• Given two sets U1 and U2 of resources
• Find a partition of U1 x U2 such that :
• S = {(u1,u2) ∈ u1 × u2: owl:sameAs(s,t)} and
• D = {(u1,u2) ∈ u1 × u2: owl:differentFrom(s,t)}
• A method is said total when (S È D) = (U1 x U2)
• A method is said partial when (S È D) Ì (U1 x U2 )
• Naïve complexity ∈ O(U1 × U2), i.e. O(n2)
32
Problem which exists since the data exists … and under different
terminologies: record linkage, entity resolution, data cleaning,
object coreference, duplicate detection, ….
Record linkage: used to indicate the
bringing together of two or more separately
recorded pieces of information concerning
a particular individual or family.
[NKAJ, Science 1959]
SOME OF HISTORY …
33
ASIDE: DETECTING IDENTITY
LINKS
34
Lise Getoor, VLDB’12 tutorial
Ironically “Identity link detection” has many duplicates
Entity resolution
DATA LINKING IS MORE COMPLEX
FOR GRAPHS THAN TABLES (WHY?)
Databases Semantic Web
Schema/Ontologies Same schema Possibly different ontologies in the same
dataset
Multiple types Single relation Several classes
Open World
Assumption
NO YES
UNA-Unique Name
Assumption
Yes May be no
Data volume XX Thousands XX Millions/Billions
(e.g., DBpedia has 1.5 billion triples)
Multiple values for a
property
NO YES
P1 hasAuthor “Michel Chein”
P1 hasAuthor “Marie-Christine Rousset”
35
• Can propagate similarity decisions è more expensive but better performance
• Can be generic and use domain knowledge, e.g. ontology axioms
DATA LINKING APPROACHES:
DIFFERENT CONTEXTS
• Datasets conforming to the same ontology
• Datasets conforming to different ontologies
• Datasets without ontologies
36
DATA LINKING APPROACHES
• Instance-based approaches: consider only data type properties
(attributes)
• Graph-based approaches: consider data type properties
(attributes) as well as object properties (relations) to propagate
similarity scores/linking decisions (collective data linking)
• Supervised approaches: need an expert to build samples of
linked data to train models (manual and interactive approaches)
• Rule-based approaches: need knowledge to be declared in the
ontology or in other format given by an expert
37
DATA LINKING APPROACHES
• Instance-based approaches: consider only data type properties
(attributes)
• String comparison
38
m2
architect
address
“Saadiyat Island, Abu
Dhabi”
Jean_Nouvel
“Nov. 8th 2017”
S2
architect
architect
m1
address
Jacques_Lemercier
category
Art_Antiquity
Art_Museum
category
S1
created
Luis_le_Vau
architect
Ange_Jacues_Gabriel
created
“1793”
“99 rue de Rivoli, Paris”
=?
=?
“Louvre Abou Dabi”
“Musée du Louvre”
=?
DATA LINKING APPROACHES
• Graph-based approaches:
• consider data type properties (attributes) as well as
• object properties (relations) to propagate similarity scores/linking decisions
(collective data linking)
39
m2
architect
address
“Saadiyat Island, Abu
Dhabi”
Jean_Nouvel
“Nov. 8th 2017”
architect
architect
m1
address
Jacques_Lemercier
category
Art_Antiquity
Art_Museum
category
created
Luis_le_Vau
architect
Ange_Jacues_Gabriel
created
“1793”
“99 rue de Rivoli, Paris”
=?
=?
“Louvre Abou Dabi”
“Musée du Louvre”
=?
=?
=?
=?
=?
S2
S1
DATA LINKING APPROACHES
• Supervised approaches: need an expert to build samples of identity
links to train models (manual and interactive approaches)
40
Examples
of identity
links
S1S2
Learning of parameters,
similarity functions,
thresholds, …
Identity link detection Identity links
DATA LINKING APPROACHES
• Rule-based approaches: need knowledge to be declared in
the ontology or in other format given by an expert
• homepage(w1, y) ∧ homepage(w2, y) è sameAs(w1, w2)
• sameAs(Restaurant11, Restaurant21)
• sameAs(Restaurant12, Restaurant22)
• sameAs(Restaurant13, Restaurant23)
41
… homepage
Restaurant11 www.kitchenbar.com
Restaurant12 www.jardin.fr
Restaurant13 www.gladys.fr
Restaurant14 …
homepage …
www.kitchenbar.com Restaurant21
www.jardin.fr Restaurant22
www.gladys.fr Restaurant23
… Restaurant24
SameAS
SameAS
SameAS
DATA LINKING APPROACHES:
EVALUATION
• Effectiveness: evaluation of linking results in terms of recall and
precision
• Recall = (#correct-links-sys) /(#correct-links-groundtruth)
• Precision = (#correct-links-sys) /(#links-sys)
• F-measure (F1) = (2 x Recall x Precision) / (Recall +Precision)
• Efficiency: in terms of time and space (i.e. minimize the linking
search space and the interaction actions with an expert/user).
• Robustness: override errors/mistakes in the data
• Use of benchmarks, like those of OAEI (Ontology Alignment
Evaluation Initiative) or Lance
42
SIMILARITY MEASURES
• Token based (e.g. Jaccard, TF/IDF cosinus) :
The similarity depends on the set of tokens that appear in both S and T.
è Efficient, but sensitive to spelling errors
• Edit based (e.g. Levenstein, Jaro, Jaro-Winkler) :
The similarity depends on the smallest sequence of edit operations which
transform S into T.
è Less efficient, may deal with spelling errors, but sensitive to word order
• Hybrids (e.g. N-Grams, Jaro-Winkler/TF-IDF, Soundex)
43
For more details: William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. 2003. A comparison of string
distance metrics for name-matching tasks. In Proceedings of the 2003 International Conference on Information
Integration on the Web (IIWEB'03), Subbarao Kambhampati and Craig A. Knoblock (Eds.). AAAI Press 73-78.
INSTANCE-BASED
DATA LINKING
APPROACHES
44
FRAMEWORK SILK [Volz et al'09]
• Provides a Link Specification Language(LSL)
• Allows specifying linking conditions between two datasets
• The linking conditions may be expressed in terms of:
• Elementary similarity measures (e.g., Jaccard, Jaro) and
• Aggregation functions (e.g. max, average) of the similarity scores
45
SIMILARITY MEASURES IN
SILK [Volz et al'09]
46
EXAMPLE OF LSL
SPECIFICATION [Volz et al'09]
<Silk>
<Prefixes>
<Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
<Prefix id="dbpedia" namespace="http://dbpedia.org/ontology/" />
<Prefix id="gn" namespace="http://www.geonames.org/ontology#" />
</Prefixes>
<DataSources>
<DataSource id="dbpedia">
<Param name="endpointURI" value="http://demo_sparql_server1/sparql" />
<Param name="graph" value="http://dbpedia.org" />
</DataSource>
<DataSource id="geonames">
<Param name="endpointURI" value="http://demo_sparql_server2/sparql" />
<Param name="graph" value="http://sws.geonames.org/" />
</DataSource>
</DataSources>
47
SPARQL
endpoints
Prefixes
EXAMPLE OF LSL
SPECIFICATION [Volz et al'09]
<Interlinks>
<Interlink id="cities">
<LinkType>owl:sameAs</LinkType>
<SourceDataset dataSource="dbpedia" var="a">
<RestrictTo>
?a rdf:type dbpedia:City
</RestrictTo>
</SourceDataset>
<TargetDataset dataSource="geonames" var="b">
<RestrictTo>
?b rdf:type gn:P
</RestrictTo>
</TargetDataset>
48
Entités à
lier
Link types
Entities to
be linked
EXAMPLE OF LSL
SPECIFICATION [Volz et al'09]
<LinkageRule>
<Aggregate type="average">
<Compare metric="levenshteinDistance" threshold="1">
<Input path="?a/rdfs:label" />
<Input path="?b/gn:name" />
</Compare>
<Compare metric="num" threshold="1000" >
<Input path="?a/dbpedia:populationTotal" />
<Input path="?b/gn:population" />
</Compare>
</Aggregate>
</LinkageRule>
<Filter limit="1" />
49
Mesures de
similarité
Similarity
measures
Aggregation
function
EXAMPLE OF LSL
SPECIFICATION [Volz et al'09]
<Outputs>
<Output type="file" minConfidence="0.95">
<Param name="file" value="accepted_links.nt" />
<Param name="format" value="ntriples" />
</Output>
<Output type="file" maxConfidence="0.95">
<Param name="file" value="verify_links.nt" />
<Param name="format" value="alignment" />
</Output>
</Outputs>
</Interlink>
</Interlinks>
</Silk>
50
Possible links
Linking
threshold
KNOFUSS (INSTANCE-BASED,
UNSUPERVISED) [Nikolov et al’12]
• Learns linking rules using genetic algorithms:
Sim(i1, i2) = fag(w11sim11 (V11,V21), …wmnsimmn (V1m,V2n))
• Fag : aggregation function for the similarity scores
• simij: similarity measure between values V1i and V2j
• wij: weights in [0..1]
• Assumptions:
• Unique name assumption (UNA), i.e., two different URIs refer to two
different entities.
• Good coverage rate between the two datasets
• Normalized similarity scores in [0..1]
51
KNOFUSS (INSTANCE-BASED,
UNSUPERVISED) [Nikolov et al’12]
52
Examples of linking rules learned on the OAEI’10 benchmark
Results in term of F-Measure on OAEI’10
GRAPH-BASED AND
INTERACTIVE APPROACH
53
KANG ET AL.: INTERACTIVE ENTITY RESOLUTION IN RELATIONAL DATA: A VISUAL ANALYTIC TOOL AND ITS EVALUATION 1001
Fig. 1. D-Dupe consists of three coordinated windows: the potential duplicate viewer on the left, the relational context viewer on the upper right
corner, and the data detail viewer on the lower right corner. The potential duplicate viewer shows the list of potential duplicate author pairs that were
D-Dupe
[Kang et al’08]
LN2R: A LOGICAL AND
NUMERICAL METHOD FOR
REFERENCE RECONCILIATON
54
[Saïs et al’07, Saïs et al’09]
LN2R
(GRAPH BASED, UNSUPERVISED AND INFORMED)
[Saïs et al’07, Saïs et al’09]
• A combination of two methods:
• L2R, a Logical method for reference reconciliation: applies
logical rules to infer sure owl:sameAs and
owl:differentFrom links
• N2R, a Numerical method for reference reconciliation:
computes similarity scores for each pair of references
• Assumptions
• The datasets are conforming to the same ontology
• The ontology contains axioms
55
Ontology axioms
• Disjunction axioms between classes, DISJOINT(C, D)
• Functional properties axioms, PF(P)
• Inverse functional properties axioms, PFI(P)
• A set of properties that is functional or inverse functional axioms
Assumptions on the data
• Unique Name Assumption, UNA(src1)
• Local Unique Name Assumption, LUNA(R)
Example:
56
Authored(p, a1), Authored(p, a2), Authored(p, a3) …., Authored(p, an)
è (a1 ¹a2), (a1 ¹ a3), (a2 ¹ a3) , …
LN2R
(GRAPH BASED, UNSUPERVISED AND INFORMED)
[Saïs et al’07, Saïs et al’09]
N2R: A NUMERICAL METHOD FOR
REFERENCE RECONCILIATION
57
[Saïs et al’09]
N2R: A NUMERICAL METHOD FOR
REFERENCE RECONCILIATION
• N2R computes a similarity score for pair of references obtained from
their common description.
• Uses known similarity measures, e.g. Jaccard, Jaro-Winkler.
• Exploits ontology knowledge in a way to be coherent with L2R.
• May consider the results of L2R: Reconcile(i, i’), ¬Reconcile(i, i’) ,
SynVals(v, v’) and ¬SynVals(v, v’).
58
[Saïs et al’09]
SIMILARITY DEPENDENCY
MODELLING
59
MuseumName+(m1) = {”Le Louvre”},
MuseumName+(m’1) = {”Louvre”},
Located+(m1) = {c1}, Located+(m’1) = {c’1},
Located-(c1) = {m1} , Located-(c’1) = {m’1}, ….
m1, m’1 c1, c’1
p1, p’1
“Le Louvre”,
“Louvre”
“Paris”,
“La ville de Paris”
“La Joconde”,
“l’Européenne”
RDF facts in source S1:
Located(m1, c1), MuseumName(m1, “le Louvre”)
Contains(m1, p1), CityName(c1, “Paris”)
PaintingName(p1, “la Joconde”)
RDF facts in source S2 :
Located(m’1, c’1), MuseumName(m’1, “Louvre”)
Contains(m’1, p’1), CityName(c’1, “la Ville de Paris”)
PaintingName(p’1, “l’Europèenne”)
(c1, c’1) is functionally dependent on (m1, m’1)
è Equation system
x1 x2
x3
b11 b21
b31
1
1/3
½1
1
1
1
CAttr(m1, m’1) = {MuseumName} ,
CAttr(c1, c’1)= {CityName},CAttr(p1,p’1)={PaintingName}
CRel(m1, m’1)= {Located, Contains}
CRel(c1, c’1) = {Located }, CRel(p1,p’1) = {Contains}
[Saïs et al’09]
N2R: ILLUSTRATION
60
m1, m’1 c1, c’1
p1, p’1
“Le Louvre”,
“Louvre”
“Paris”,
“La ville de Paris”
“La Joconde”,
“l’Européenne”
x1 x2
x3
b11
p1, p’2
“La Joconde”,
“Joconde”
x4
b41
b21
b31
l= 1/(| CAttr | + | CRel |) e = 0.02
b11 = 0.8, b21 = 0.3, b31 = 0.1, b41 = 0.7
x1 x2 x3 x4
Initialization 0.0 0.0 0.0 0.0
Iteration 1 0.8 0.3 0.1 0.7
Iteration 2 0.8 0.8 0.4 0.7
Iteration 3 0.8 0.8 0.4 0.7
Solution: x1 = 0.8
x2 = 0.8
x3 = 0.4
x4 = 0.7
x1 = max(max(b11, x3), x4), l * x2)
x2 = max(b21, x1)
x3 = max(b31, l* x1)
x4 = max(b41 , l * x1)
[Saïs et al’09]
N2R
EXPERIMENTS
61
N2R: RESULTS ON CORA
Trec=1, all the reconciliations obtained by L2R are also obtained by N2R.
Trec=1 to Trec=0.85, the recall increases of 33 % while the precision decreases
only of 6 %.
Trec = 0.85, the F-measure is of 88 %:
• Better than the results obtained by the supervised method of [Singla and Domingos’05]
• Worst than those (97 %) obtained by the supervised method of [Dong et al.’05]
62
0
0,1
0,2
0,3
0,4
0,5
0,6
0,7
0,8
0,9
1
1 0,95 0,9 0,85 0,8 0,75 0,7 0,65 0,6 0,55
Trec
Rappel
Precision
F-Mesure
[Saïs et al’09]
OAEI 2010 – Instance Matching track (PR), 2nd
N2R: RESULTS IN OAEI2 2010
63
2 Ontology Alignment Evaluation Initiative
[Saïs et al’09]
INSTANCE BASED
ONTOLOGY
MATCHING
64
ONTOLOGY MATCHING
• Ontology alignment [Shvaiko,Euzenat13]: computes a set
A of mappings between elements (classes, properties) of two
ontologies O1 and O2:
f(O1,O2)=A
• The relations that are used to express a mapping can be:
owl:equivalentClass, owl:equivalentProperty,
rdfs:subClassOf, skos:closeMatch, skos:broader, etc.
• Example: A={(
owl:equivalentClass( http://dbpedia.org/ontology/City,
http://schema.org/City, 0.8)}
65
KINDS OF
INFORMATION
• Terminology: lexical information describing the ontology
elements (i.e. labels, comments, …)
• Example: Voie vs Voie souterraine
• Structure: hierarchy of classes and properties
(relations/attributes)
• Example: the sub-classes of Voie are very similar to the
sub-classes of Route
• Extension: the existence of common instances!!
66
PARIS [Suchanek et al. 12]
• Objective: instance-based ontology alignment and data linking (graph-
based, unsupervised and probabilistic)
• Inputs: two populated RDFS ontologies with UNA (two different URI
refer to two different entities)
• Principle:
• Compute the similarities between literal values (“12 cm”=“12”)
• Iterate (1) and (2) until a fix-point :
① Compute the probability that two instances are linked
① Compute the probabilities of subClassOf/subPropertyOf
67
P(Ci ⊆ Cj ), P(Pi ⊆ Pj )
P(i1 = i2)
• Property functionality degree (computed from data)
• The more a property is functional the more the probability of X=Y
will be.
• Local functionality: Fun(p,x) = 1 / #y:p(x, y)
• Global functionality: Fun(p) = (#x : ∃y:p(x,y)) / (#x,y : p(x,y))
• Example:
city(m1,Londres), city(m1,Orsay), city(m2,Tokyo)
Fun(city,m1)= ½ Fun(city,m2)=1
Fun(city)=2/3
è The same is done for inverse functionality (denoted fun-1) 68
PARIS [Suchanek et al. 12]
Link probability computation
• Positive evidence (P1): if there exists a property that is highly inverse
functional which has range values that are equal with a high probability
isbn(x,isbn1), isbn(x’,isbn2), P(isbn1=isbn2) = 1, fun-1(isbn)=1 …
P1(x=x’) = 1 - ((1 - (1.1)) . …) = 1-(0. …) = 1
• Negative evidence (P2): if there exists a property that is highly
functional which has range values for the probability to be equal is very low.
• Combination : P(x=x’) = P1(x=x’).P2(x=x’)
69
P1(x = x') =1− (1− Fun−1
r(x,y)
r(x',y')
∏ (p).P(y = y'))
PARIS [Suchanek et al. 12]
• The probabilities of the existence of a subsumption mapping between
properties and between classes are also computed
• It is based on the proportion of common instances comparing to the
number of instances of the general class
• To compute these probabilities, the probabilities of the existence of a
sameAs link between instances are exploited.
70
P(C ⊆ C') = #(C∩C') /#C
P(p ⊆ p') = #(p∩ p') /# p
PARIS [Suchanek et al. 12]
PARIS - EXPERIMENTS
Ontology #Instances #Classes #Relations
Yago 2 795 289 292 206 67
Dbpedia 2 365 777 318 1 109
71
Instances Classes Relations
Précision Rappel F-Mesure Yago Í DBp
Précision
DBpÍ Yago
Précision
Yago Í DBp
Précision
DBpÍ Yago
Précision
90% 73% 81% - - 100% 92%
90% 73% 81% 94% 84% 100% 92%
Instances: DBPedia and Yago uses the URIs of Wikipedia (recall and precision
possible)
Classes/properties: sampling + expert
5h00 to compute the linking probabilities for instances in one iteration (2h for the
classes and 20 minutes for the properties)
Linking or mapping if the probability >0.4
[Suchanek et al. 12]
SUMMARY
Data linking: numerous and different approaches …
• Supervised approaches: needs samples of linked data
à It can be avoided by using assumptions like (UNA)
• Graph-based approaches: decision propagation (good recall but highly time
consuming)
• Logical approaches: good precision but partial
è Few approaches generate differentFrom(i1,i2) or use dissimilarity evidence
• Informed approaches: need knowledge to be declared in the ontology
(generality) and/or ad-hoc knowledge given by an expert (a selection of
properties, similarity functions)
àThis kind of knowledge are not always available but can be
learnt/discovered from the data (e.g., key/rule discovery approaches)
[Symeonidou et al. 14, Symeonidou et al. 17, Galarraga et al. 13]
72
OUTLINE
§ Linked Open Data
§ Knowledge Graphs
§ Technical Part
1. Data Linking
2. Key Discovery
3. Link invalidation/requalification
§ Conclusion and Future Challenges
73
KEY DISCOVERY
PHD OF DANAI SYMEONIDOU IN COLLABORATION WITH:
NATHALIE PERNELLE (LRI, UNIV. PARIS SUD),
74
[Qualinca ANR project (2012-2016)]
DATA LINKING USING
RULES
homepage(w1, y) ∧ homepage(w2, y) è sameAs(w1, w2)
75
… homepage
Restaurant11 www.kitchenbar.com
Restaurant12 www.jardin.fr
Restaurant13 www.gladys.fr
Restaurant14 …
homepage …
www.kitchenbar.com Restaurant21
www.jardin.fr Restaurant22
www.gladys.fr Restaurant23
… Restaurant24
DATA LINKING USING
RULES
homepage(w1, y) ∧ homepage(w2, y) è sameAs(w1, w2)
• sameAs(Restaurant11, Restaurant21)
• sameAs(Restaurant12, Restaurant22)
• sameAs(Restaurant13, Restaurant23)
76
… homepage
Restaurant11 www.kitchenbar.com
Restaurant12 www.jardin.fr
Restaurant13 www.gladys.fr
Restaurant14 …
homepage …
www.kitchenbar.com Restaurant21
www.jardin.fr Restaurant22
www.gladys.fr Restaurant23
… Restaurant24
SameAS
SameAS
SameAS
DATA LINKING USING
RULES
Rules
• Logical Rules
• Ex. For instances of the class Restaurant
homepage(w1, y) ∧ homepage(w2, y) è sameAs(w1, w2)
{homepage}: discriminative property
• Complex Rules
• Ex. For instances of the class Restaurant
max(Jaccard(lat(w1, n);lat(w2, m));jaro(long(w1, x);long(w2, y)) > 0.8
èsameAs(w1, w2)
{lat, long}: discriminative property set
77
Rules contain discriminative properties => keys
KEYS
Key: A set of properties that uniquely identifies every instance in the data
78
FirstName LastName Birthdate Profession
Person1 Anne Tompson 15/02/88 Actor
Person2 Marie Tompson 02/09/75 Researcher
Person3 Marie David 15/02/85 Teacher
Person4 Vincent Solgar 06/12/90 Teacher
Is [FirstName] a key? ✖
Is [LastName] a key? ✖
KEYS
Key: A set of properties that uniquely identifies every instance in the data
79
FirstName LastName Birthdate Profession
Person1 Anne Tompson 15/02/88 Actor
Person2 Marie Tompson 02/09/75 Researcher
Person3 Marie David 15/02/85 Teacher
Person4 Vincent Solgar 06/12/90 Teacher
Is [FirstName] a key? ✖
Is [LastName] a key? ✖
Is [FirstName,LastName] a key? ✔
KEYS - KEY
MONOTONICITY
Key monotonicity: When a set of properties is a key, all its supersets
are also keys
Minimal Key: A key that by removing one property stops being a key
80
FirstName LastName Birthdate Profession
Person1 Anne Tompson 15/02/88 Actor
Person2 Marie Tompson 02/09/75 Researcher
Person3 Marie David 15/02/85 Teacher
Person4 Vincent Solgar 06/12/90 Teacher
✔Is [FirstName,LastName, Birthday] a key?
✖Is [FirstName,LastName, Birthday] a minimal key?
Minimal keys: [[FirstName,LastName],[FirstName,Profession] [Birthdate], [LastName,Profession]]
KEYS DECLARED BY EXPERTS
FOR DATA LINKING
Not an easy task:
• Experts are not aware of all the keys
Ex. {SSN}, {ISBN} easy to declare
Ex. {Name, DateOfBirth, BornIn} is it a key for the class Person?
• Erroneous keys can be given by experts
• As many keys as possible
• More keys => More linking rules
Goal: Discover keys automatically
81
OWL2 KEY, KEY IN
THE OPEN WORLD
OWL2 Key for a class: a combination of properties that uniquely identify
each instance of a class
• hasKey( CE ( OPE1 ... OPEm ) ( DPE1 ... DPEn ) )
:hasKey(Book(Author) (Title)) means:
Book(x1)∧Book(x2)∧Author(x1, y)∧Author (x2, y)∧Title(x1,w) ∧Title(x2, w)
è sameAs(x1, x2)
82
KEYS IN THE OPEN
WORLD
Key in the Open World
• Key discovery: For a set of properties that is a key, there is no instance that
shares values with another instances
83
lastName firstName hasFriend
Person1 Tompson Manuel Person2,
Person3
Person2 Tompson Maria,
Hellen
Person1
Person3 David George Person2,
Person4
Person4 Solgar Michel Person3
KEYS IN THE OPEN
WORLD
Key in the Open World
• Key discovery: For a set of properties that is a key, there is no instance that
shares values with another instances
• Ex. firstName is a key in the Open World since
• There do exist any people that share first names
84
lastName firstName hasFriend
Person1 Tompson Manuel Person2,
Person3
Person2 Tompson Maria,
Hellen
Person1
Person3 David George Person2,
Person4
Person4 Solgar Michel Person3
KEYS IN THE OPEN
WORLD
Key in the Open World
• Key discovery: For a set of properties that is a key, there is no instance that
shares values with another instances
• Ex. hasFriend is not a key in the Open World since
• There exist at least two people that share a friend
85
lastName firstName hasFriend
Person1 Tompson Manuel Person2,
Person3
Person2 Tompson Maria,
Hellen
Person1
Person3 David George Person2,
Person4
Person4 Solgar Michel Person3
KEYS IN THE CLOSED
WORLD
Key in the Closed World
• For each property that participate in a key, every set of values for a property
and an instance should be unique
86
lastName firstName hasFriend
Person1 Tompson Manuel Person2,
Person3
Person2 Tompson Maria Person1
Person3 David George Person2,
Person4
Person4 Solgar Michel Person3
KEYS IN THE CLOSED
WORLD
Key in the Closed World
• For each property that participate in a key, every set of values for a property
and an instance should be unique
• Ex. hasFriend is a key in the Closed World since
{Person2,Person3} <> {Person1} <> {Person2,Person4} <> {Person3}
87
lastName firstName hasFriend
Person1 Tompson Manuel Person2,
Person3
Person2 Tompson Maria Person1
Person3 David George Person2,
Person4
Person4 Solgar Michel Person3
CLOSED WORLD APPROACHES
VS
OPEN WORLD APPROACHES
88
KEY DISCOVERY
APPROACHES
CWA approaches
• Keys and Pseudo-Keys Detection for Web Datasets Cleaning
and Interlinking
• ROCKER: A Refinement Operator for Key Discovery
OWA approaches
• KD2R: a Key Discovery method for reference reconciliation
• SAKey: Scalable almost key discovery in RDF data
• Linkkey: Data interlinking through robust Linkkey extraction
• VICKEY: Conditional key discovery
89
KEY DISCOVERY
APPROACHES
CWA approaches
• Keys and Pseudo-Keys Detection for Web Datasets
Cleaning and Interlinking
• ROCKER: A Refinement Operator for Key Discovery
OWA approaches
• KD2R: a Key Discovery method for reference reconciliation
• SAKey: Scalable almost key discovery in RDF data
• Linkkey: Data interlinking through robust Linkkey extraction
• VICKEY: Conditional key discovery
90
KEYS AND PSEUDO-KEYS DETECTION FOR
WEB DATASETS CLEANING AND
INTERLINKING [Atencia et al. 12]
Bottom-up approach that discovers
• Keys in the Closed World
• Pseudo keys - keys that tolerate exceptions
91
x
P2P1 P3 P4
P1P2 P1P3 P1P4 P2P3 P2P4 P3P4
P1P2P3 P1P2P4 P2P3P4
P1P2P3P4
KEYS AND PSEUDO-KEYS DETECTION FOR
WEB DATASETS CLEANING AND
INTERLINKING [Atencia et al. 12]
Bottom-up approach that discovers
• Keys in the Closed World
• Pseudo keys - keys that tolerate exceptions
92
x
P1
KEYS AND PSEUDO-KEYS DETECTION FOR
WEB DATASETS CLEANING AND
INTERLINKING [Atencia et al. 12]
Bottom-up approach that discovers
• Keys in the Closed World
• Pseudo keys - keys that tolerate exceptions
93
x
P2P1
KEYS AND PSEUDO-KEYS DETECTION FOR
WEB DATASETS CLEANING AND
INTERLINKING [Atencia et al. 12]
Bottom-up approach that discovers
• Keys in the Closed World
• Pseudo keys - keys that tolerate exceptions
94
x
P2P1 P3
KEYS AND PSEUDO-KEYS DETECTION FOR
WEB DATASETS CLEANING AND
INTERLINKING [Atencia et al. 12]
Top-down approach that discovers
• Keys in the Closed World
• Pseudo keys - keys that tolerate exceptions
95
x
P2P1 P3 P4
KEYS AND PSEUDO-KEYS DETECTION FOR
WEB DATASETS CLEANING AND
INTERLINKING [Atencia et al. 12]
96
x
P2P1 P3 P4
Bottom-up approach that discovers
• Keys in the Closed World
• Pseudo keys - keys that tolerate exceptions
If P4 is a key => P1P4, P2P4,.., P1P2P3P4 are also keys
KEYS AND PSEUDO-KEYS DETECTION FOR
WEB DATASETS CLEANING AND
INTERLINKING [Atencia et al. 12]
Bottom-up approach that discovers
• Keys in the Closed World
• Pseudo keys - keys that tolerate exceptions
If P4 is a key => P1P4, P2P4,.., P1P2P3P4 are also keys
97
x
P2P1 P3 P4
P1P2 P1P3 P1P4 P2P3 P2P4 P3P4
KEYS AND PSEUDO-KEYS DETECTION FOR
WEB DATASETS CLEANING AND
INTERLINKING [Atencia et al. 12]
Bottom-up approach that discovers
• Keys in the Closed World
• Pseudo keys - keys that tolerate exceptions
If P4 is a key => P1P4, P2P4,.., P1P2P3P4 are also keys
98
x
P2P1 P3 P4
P1P2 P1P3 P1P4 P2P3 P2P4 P3P4
P1P2P3 P1P2P4 P2P3P4
KEYS AND PSEUDO-KEYS DETECTION FOR
WEB DATASETS CLEANING AND
INTERLINKING [Atencia et al. 12]
Bottom-up approach that discovers
• Keys in the Closed World
• Pseudo keys - keys that tolerate exceptions
If P4 is a key => P1P4, P2P4,.., P1P2P3P4 are also keys 99
x
P2P1 P3 P4
P1P2 P1P3 P1P4 P2P3 P2P4 P3P4
P1P2P3 P1P2P4 P2P3P4
P1P2P3P4
KEYS AND PSEUDO-KEYS DETECTION FOR
WEB DATASETS CLEANING AND
INTERLINKING [Atencia et al. 12]
To verify if a set of properties is a key
• Partition instances according to their sharing values
• If each partition contains only one instances => Key
Key quality measures
• Support of a set of properties P:
• Discriminability of a set of properties P (pseudo-keys):
100
support ! =
# !"#$%"&'# !"#$%&'"! !" !
# !"" !"#$%"&'#
[ATENCIA ET AL. 12]- EXAMPLE
101
Name Actor Director ReleaseDate Website Language
film
1
“Ocean’s 11” “B. Pitt”
“J.
Roberts”
“S.
Soderbergh”
“3/4/01” www.oceans11.com ---
film
2
“Ocean’s 12” “B. Pitt”
“J.
Roberts”
“S.
Soderbergh”
“R. Howard”
“2/5/04” www.oceans12.com “english”
film
3
“Ocean’s 13” “B. Pitt”
“G.
Clooney”
“S.
Soderbergh”
“R. Howard”
“30/6/07” www.oceans13.com “english”
film
4
“The
descendants”
“N. Krause”
“G.
Clooney”
“A. Payne” “15/9/11” --- “english”
film
5
“Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.co
m
“english”
[ATENCIA ET AL. 12]- EXAMPLE
102
Film3Film2Film1
Film5Film4
[Name]: key
support 5/5 =1
Discriminability = 5/5 =1
Partitions using [Name]
è
Name Actor Director ReleaseDate Website Language
film
1
“Ocean’s 11” “B. Pitt”
“J.
Roberts”
“S.
Soderbergh”
“3/4/01” www.oceans11.com ---
film
2
“Ocean’s 12” “B. Pitt”
“J.
Roberts”
“S.
Soderbergh”
“R. Howard”
“2/5/04” www.oceans12.com “english”
film
3
“Ocean’s 13” “B. Pitt”
“G.
Clooney”
“S.
Soderbergh”
“R. Howard”
“30/6/07” www.oceans13.com “english”
film
4
“The
descendants”
“N. Krause”
“G.
Clooney”
“A. Payne” “15/9/11” --- “english”
film
5
“Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.com “english”
103
Partitions using [Actor]
è
Name Actor Director ReleaseDate Website Language
film
1
“Ocean’s 11” “B. Pitt”
“J.
Roberts”
“S.
Soderbergh”
“3/4/01” www.oceans11.com ---
film
2
“Ocean’s 12” “B. Pitt”
“J.
Roberts”
“S.
Soderbergh”
“R. Howard”
“2/5/04” www.oceans12.com “english”
film
3
“Ocean’s 13” “B. Pitt”
“G.
Clooney”
“S.
Soderbergh”
“R. Howard”
“30/6/07” www.oceans13.com “english”
film
4
“The
descendants”
“N. Krause”
“G.
Clooney”
“A. Payne” “15/9/11” --- “english”
film
5
“Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.co
m
“english”
Film4
Film3Film1
Film2
Film5
[Actor]: ???
[ATENCIA ET AL. 12]- EXAMPLE
104
Partitions using [Actor]
è
Name Actor Director ReleaseDate Website Language
film
1
“Ocean’s 11” “B. Pitt”
“J.
Roberts”
“S.
Soderbergh”
“3/4/01” www.oceans11.com ---
film
2
“Ocean’s 12” “B. Pitt”
“J.
Roberts”
“S.
Soderbergh”
“R. Howard”
“2/5/04” www.oceans12.com “english”
film
3
“Ocean’s 13” “B. Pitt”
“G.
Clooney”
“S.
Soderbergh”
“R. Howard”
“30/6/07” www.oceans13.com “english”
film
4
“The
descendants”
“N. Krause”
“G.
Clooney”
“A. Payne” “15/9/11” --- “english”
film
5
“Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.co
m
“english”
Film4
Film3Film1
Film2
Film5
[Actor]: pseudo key
Support = 5/5 =1
Discriminability = 3/4=0.75
[ATENCIA ET AL. 12]- EXAMPLE
105
Partitions using
[Actor,Director]
è
Name Actor Director ReleaseDate Website Language
film
1
“Ocean’s 11” “B. Pitt”
“J.
Roberts”
“S.
Soderbergh”
“3/4/01” www.oceans11.com ---
film
2
“Ocean’s 12” “B. Pitt”
“J.
Roberts”
“S.
Soderbergh”
“R. Howard”
“2/5/04” www.oceans12.com “english”
film
3
“Ocean’s 13” “B. Pitt”
“G.
Clooney”
“S.
Soderbergh”
“R. Howard”
“30/6/07” www.oceans13.com “english”
film
4
“The
descendants”
“N. Krause”
“G.
Clooney”
“A. Payne” “15/9/11” --- “english”
film
5
“Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.co
m
“english”
Film3Film2Film1
Film5Film4
[Actor, Director]: key
Support = 4/5 =0.8
Discriminability = 5/5=1
[ATENCIA ET AL. 12]- EXAMPLE
KEY DISCOVERY
APPROACHES
CWA approaches
• Keys and Pseudo-Keys Detection for Web Datasets Cleaning
and Interlinking
• ROCKER: A Refinement Operator for Key Discovery
OWA approaches
• KD2R: a Key Discovery method for reference reconciliation
• SAKey: Scalable almost key discovery in RDF data
• Linkkey: Data interlinking through robust Linkkey extraction
• VICKEY: Conditional key discovery
106
SAKEY [Symeonidou et al. 14]
SAKey: Scalable Almost Key discovery approach for:
• Incomplete data (optimistic heuristic)
• Errors
• Duplicates
• Large datasets
Discovers almost keys
• Sets of properties that are not keys due to few exceptions
107
ALMOST KEY DISCOVERY
STRATEGY
• Naive automatic way to discover keys
• Examine all the possible combinations of properties
• Scan all instances for each candidate key
• Example: Class described by 15 properties è215 = 32767
candidate keys
• Discover keys efficiently by:
• Reducing the combinations
• Partially scanning the data
108
KEY DISCOVERY
SAKey: Key discovery approach for very large RDF datasets that may
contain erroneous data or duplicates
Region Producer Colour
Wine1 Bordeaux Dupont White
Wine2 Bordeaux Baudin Rose
Wine3 Languedoc Dupont Red
Wine4 Languedoc Faure Red
109
Producer is a key?
KEY DISCOVERY
SAKey: Key discovery approach for very large RDF datasets that may
contain erroneous data or duplicates
Region Producer Colour
Wine1 Bordeaux Dupont White
Wine2 Bordeaux Baudin Rose
Wine3 Languedoc Dupont Red
Wine4 Languedoc Faure Red
110
Region is a non key?
• Interested only in maximal non keys:
• All the sets of properties that are not maximal non keys are keys
• Example: class described by the properties p1, p2, p3, p4
Maximal non key = {{p1, p2}} keys = {{p3}, {p4}}è
SAKEY–GENERAL ARCHITECTURE
111
[Symeonidou et al. 14]
KEY DISCOVERY
SAKey: Key discovery approach for very large RDF datasets that may
contain erroneous data or duplicates
• n-almost key: A set of properties for which at most n instances are
identical values
Region Producer Colour
Wine1 Bordeaux Dupont White
Wine2 Bordeaux Baudin Rose
Wine3 Languedoc Dupont Red
Wine4 Languedoc Faure Red
• Examples of keys
{Region, Producer}:
0-almost key
112
[Symeonidou et al. 14]
KEY DISCOVERY
SAKey: Key discovery approach for very large RDF datasets that
may contain erroneous data or duplicates
• n-almost key: A set of properties for which at most n instances are
identical values
Region Producer Colour
Wine1 Bordeaux Dupont White
Wine2 Bordeaux Baudin Rose
Wine3 Languedoc Dupont Red
Wine4 Languedoc Faure Red
• Examples of keys
{Region, Producer}:
0-almost key
{Producer}:
2-almost key
Tool available in https://www.lri.fr/sakey
113
[Symeonidou et al. 14]
N-ALMOST KEYS
Exception of a key: an instance that shares values with another instance for
a given set of properties P
114
Films HasName HasActor HasDirector ReleaseDate HasWebsite HasLanguage
f1 “Ocean’s 11” “B. Pitt”
“J. Roberts”
“S.
Soderbergh”
“3/4/01” www.oceans11.com ---
f2 “Ocean’s 12” “B. Pitt”
“G. Clooney”
“J. Roberts”
“S.
Soderbergh”
“R. Howard”
“2/5/04” www.oceans12.com ---
f3 “Ocean’s 13” “B. Pitt”
“G. Clooney”
“S.
Soderbergh”
“R. Howard”
“30/6/07” www.oceans13.com ---
f4 “The
descendants”
“N. Krause”
“G. Clooney”
“A. Payne” “15/9/11” www.descendants.com “english”
f5 “Bourne
Identity“
“D. Liman” --- “12/6/12” www.bourneIdentity.com “english”
f6 “Ocean’s 12“ --- “R. Howard” “2/5/04” --- ---
[Symeonidou et al. 14]
N-ALMOST KEYS
Exception of a key: an instance that shares values with another instance for a
given set of properties P
• f1, f2 and f3 are three exceptions for the property set {HasActor}
115
Films HasName HasActor HasDirector ReleaseDate HasWebsite HasLanguage
f1 “Ocean’s 11” “B. Pitt”
“J. Roberts”
“S.
Soderbergh”
“3/4/01” www.oceans11.com ---
f2 “Ocean’s 12” “B. Pitt”
“G. Clooney”
“J. Roberts”
“S.
Soderbergh”
“R. Howard”
“2/5/04” www.oceans12.com ---
f3 “Ocean’s 13” “B. Pitt”
“G. Clooney”
“S.
Soderbergh”
“R. Howard”
“30/6/07” www.oceans13.com ---
f4 “The
descendants”
“N. Krause”
“G. Clooney”
“A. Payne” “15/9/11” www.descendants.com “english”
f5 “Bourne
Identity“
“D. Liman” --- “12/6/12” www.bourneIdentity.com “english”
f6 “Ocean’s 12“ --- “R. Howard” “2/5/04” --- ---
[Symeonidou et al. 14]
N-ALMOST KEYS
Exception of a key: an instance that shares values with another instance for a
given set of properties P
• f1, f2 and f3 are three exceptions for the property set {HasActor}
Exception Set EP: set of exceptions for P
• EP = {f1, f2, f3} È {f2, f3, f4} = {f1, f2, f3, f4} for {HasActor}
116
Films HasName HasActor HasDirector ReleaseDate HasWebsite HasLanguage
f1 “Ocean’s 11” “B. Pitt”
“J. Roberts”
“S.
Soderbergh”
“3/4/01” www.oceans11.com ---
f2 “Ocean’s 12” “B. Pitt”
“G. Clooney”
“J. Roberts”
“S.
Soderbergh”
“R. Howard”
“2/5/04” www.oceans12.com ---
f3 “Ocean’s 13” “B. Pitt”
“G. Clooney”
“S.
Soderbergh”
“R. Howard”
“30/6/07” www.oceans13.com ---
f4 “The
descendants”
“N. Krause”
“G. Clooney”
“A. Payne” “15/9/11” www.descendants.com “english”
f5 “Bourne
Identity“
“D. Liman” --- “12/6/12” www.bourneIdentity.com “english”
f6 “Ocean’s 12“ --- “R. Howard” “2/5/04” --- ---
[Symeonidou et al. 14]
N-ALMOST KEYS
n-almost key: a set of properties where |EP|≤ n
• {HasActor} is a 4-almost key
117
[Symeonidou et al. 14]
N-ALMOST KEYS
n-almost key: a set of properties where |EP|≤ n
• {HasActor} is a 4-almost key
n-non key: a set of properties where |EP|≥ n
• Using all the maximal n-non keys we can derive all the
minimal (n-1)-almost keys
118
[Symeonidou et al. 14]
N-ALMOST KEYS
n-almost key: a set of properties where |EP|≤ n
• {HasActor} is a 4-almost key
n-non key: a set of properties where |EP|≥ n
• Using all the maximal n-non keys we can derive all the
minimal (n-1)-almost keys
119
All combinations
of properties
[Symeonidou et al. 14]
N-ALMOST KEYS
n-almost key: a set of properties where |EP|≤ n
• {HasActor} is a 4-almost key
n-non key: a set of properties where |EP|≥ n
• Using all the maximal n-non keys we can derive all the
minimal (n-1)-almost keys
120
All sets of properties
that contain at least 5
exceptions
All combinations
of properties
5-non keys
[Symeonidou et al. 14]
N-ALMOST KEYS
n-almost key: a set of properties where |EP|≤ n
• {HasActor} is a 4-almost key
n-non key: a set of properties where |EP|≥ n
• Using all the maximal n-non keys we can derive all the
minimal (n-1)-almost keys
121
All sets of properties
that contain at least 5
exceptions
All combinations
of properties
All sets of properties
that contain less than 5
exceptions
5-non keys 4-almost keys
[Symeonidou et al. 14]
N-NON KEY DISCOVERY:
INITIAL MAP
HasActor {{f1, f2}, {f1, f2, f3}, {f2, f3, f4}, {f4}, {f5}}
HasDirector {{f1,f2,f3}, {f2, f3, f6}, {f4}}
ReleaseDate {{f1}, {f2, f6}, {f3}, {f4}, {f5}}
HasName {{f1}, {f2, f6}, {f3}, {f4}, {f5}}
HasLanguage {{f4, f5}}
HasWebsite {{f1}, {f2}, {f3}, {f4}, {f5}, {f6}}
“B.	Pitt” “G.	Clooney” “D.	Liman”“N.	Krause”“J.	Roberts”“S.	Soderbergh”
122
[Symeonidou et al. 14]
N-NON KEY DISCOVERY:
DATA FILTERING
Singleton filtering
123
HasActor {{f1, f2}, {f1, f2, f3}, {f2, f3, f4}, {f4}, {f5}}
HasDirector {{f1,f2,f3}, {f2, f3, f6}, {f4}}
ReleaseDate {{f1}, {f2, f6}, {f3}, {f4}, {f5}}
HasName {{f1}, {f2, f6}, {f3}, {f4}, {f5}}
HasLanguage {{f4, f5}}
HasWebsite {{f1}, {f2}, {f3}, {f4}, {f5}, {f6}}
“B.	Pitt” “G.	Clooney” “D.	Liman”“N.	Krause”“J.	Roberts”“S.	Soderbergh”
[Symeonidou et al. 14]
N-NON KEY DISCOVERY:
DATA FILTERING
Singleton filtering
124
HasActor {{f1, f2}, {f1, f2, f3}, {f2, f3, f4}, {f4}, {f5}}
HasDirector {{f1,f2,f3}, {f2, f3, f6}, {f4}}
ReleaseDate {{f1}, {f2, f6}, {f3}, {f4}, {f5}}
HasName {{f1}, {f2, f6}, {f3}, {f4}, {f5}}
HasLanguage {{f4, f5}}
HasWebsite {{f1}, {f2}, {f3}, {f4}, {f5}, {f6}}
“B.	Pitt” “G.	Clooney” “D.	Liman”“N.	Krause”“J.	Roberts”“S.	Soderbergh”
[Symeonidou et al. 14]
N-NON KEY DISCOVERY:
DATA FILTERING
Singleton filtering
125
HasActor {{f1, f2}, {f1, f2, f3}, {f2, f3, f4}, {f4}, {f5}}
HasDirector {{f1,f2,f3}, {f2, f3, f6}, {f4}}
ReleaseDate {{f1}, {f2, f6}, {f3}, {f4}, {f5}}
HasName {{f1}, {f2, f6}, {f3}, {f4}, {f5}}
HasLanguage {{f4, f5}}
HasWebsite {{f1}, {f2}, {f3}, {f4}, {f5}, {f6}}
“B.	Pitt” “G.	Clooney” “D.	Liman”“N.	Krause”“J.	Roberts”“S.	Soderbergh”
Single key
[Symeonidou et al. 14]
N-NON KEY DISCOVERY:
DATA FILTERING
Singleton filtering
126
HasActor {{f1, f2, f3}, {f2, f3, f4}}
HasDirector {{f1,f2,f3}, {f2, f3, f6}}
ReleaseDate {{f2, f6}}
HasName {{f2, f6}}
HasLanguage {{f4, f5}}
[Symeonidou et al. 14]
N-NON KEY DISCOVERY:
POTENTIAL N-NON KEYS
Combinations of properties not needed be explored
• Incomplete data
• Properties referring to different classes
Lake Mountain
NaturalPlace
depth mountainRange
127
rdfs:subclassOfrdfs:subclassOf
[Symeonidou et al. 14]
N-NON KEY DISCOVERY:
POTENTIAL N-NON KEYS
Combinations of properties not needed be explored
• Incomplete data
• Properties referring to different classes
• Potential n-non keys: Sets of properties that possibly refer
to n-non keys
128
Lake Mountain
NaturalPlace
depth mountainRange
rdfs:subclassOfrdfs:subclassOf
[Symeonidou et al. 14]
N-NON KEY
DISCOVERY
HasActor
• {f1, f2, f3} È {f2, f3, f4} = {f1, f2, f3, f4} => 4-non key
Composite n-non keys
• Intersections between sets of different properties
129
HasActor {{f1, f2, f3}, {f2, f3, f4}}
HasDirector {{f1,f2,f3}, {f2, f3, f6}}
ReleaseDate {{f2, f6}}
HasName {{f2, f6}}
HasLanguage {{f4, f5}}
[Symeonidou et al. 14]
130
N-NON KEY
DISCOVERY [Symeonidou et al. 14]
131
N-NON KEY
DISCOVERY
{hasActor, director} è 3-non key
[Symeonidou et al. 14]
132
N-NON KEY
DISCOVERY
{hasActor, director} è 3-non key
…
[Symeonidou et al. 14]
KEY MERGE: SEVERAL DATASETS
Goal: Keys valid in both datasets
• More sure keys
Intuition: Computation of Cartesian product of sets of keys
• Keep only minimal keys
133
[firstName, LastName]
[hasFriend]
[firstName]
KeysA KeysB
[Symeonidou et al. 14]
KEY MERGE: SEVERAL DATASETS
Goal: Keys valid in both datasets
• More sure keys
Intuition: Computation of Cartesian product of sets of keys
• Keep only minimal keys
134
[firstName, LastName]
[hasFriend]
[firstName]
KeysA KeysB
[firstName, LastName]
[hasFriend, firstName]
KeysAB
[Symeonidou et al. 14]
DATA LINKING USING
ALMOST KEYS
Goal: Compare linking results using almost keys with different n
Evaluation of linking using
• Recall
• Precision
• F-Measure
Datasets
• OAEI 2010
• OAEI 2013
Conclusion
• Linking results using n-almost keys are the better than using keys
135
[Symeonidou et al. 14]
EXAMPLE: DATA LINKING
USING ALMOST KEYS
OAEI 2013 - Person
• BirthName, BirthDate, award, comment, label, BirthPlace, almaMater,
doctoralAdvisor
Almost keys Recall Precision F-Measure
0-almost key {BirthDate, award} 9.3% 100% 17%
2-almost key {BirthDate} 32.5% 98.6% 49%
# exceptions Recall Precision F-measure
0, 1 25.6% 100% 41%
2, 3 47.6% 98.1% 64.2%
4, 5 47.9% 96.3% 63.9%
6, ..., 16 48.1% 96.3% 64.1%
17 49.3% 82.8% 61.8%
136
[Symeonidou et al. 14]
KD2R VS. SAKEY -
NON KEY DISCOVERY
Class # triples
#
Instances
#Properties
KD2R
Runtime
SAKey
Runtime
(n=0)
DB:Website 8506 2870 66 13min 1s
YA:Building 114783 54384 17 26s 9s
DB:BodyOfWater 1068428 34000 200 outOfMem. 37s
DB:NaturalPlace 1604348 49913 243 outOfMem. 1min10s
Dbpedia class=
DB:NaturalPlace
137
[Symeonidou et al. 14]
KD2R VS. SAKEY -
KEY DERIVATION
Class
# non
keys
# keys KD2R SAKey
(n=0)
DB:Lake 50 480 1min10s 1s
DB:Mountain 49 821 8min 1s
DB:BodyOfWater 220 3846 > 1 day 66s
DB:NaturalPlace 302 7011 > 2 days 5min
Dbpedia class=
DB:BodyOfWater
138
[Symeonidou et al. 14]
KEY DISCOVERY
APPROACHES
CWA approaches
• Keys and Pseudo-Keys Detection for Web Datasets Cleaning
and Interlinking
• ROCKER: A Refinement Operator for Key Discovery
OWA approaches
• KD2R: a Key Discovery method for reference reconciliation
• SAKey: Scalable almost key discovery in RDF data
• Linkkey: Data interlinking through robust Linkkey extraction
• VICKEY: Conditional key discovery
139
LINKKEY
140
Ontology1 Ontology2
Data of classA Data of classB
Linkkey
Discovery
Given a pair of classes in two datasets conforming to two different ontologies:
1) Discover Linkkeys – maximal sets of property pairs that can link instances of two
different classes
[Atencia et al. 2014]
LINKKEY
141
Person1
Literal
LastName
Literal Literal
Person2
Literal
LN
Literal Literal
Given a pair of classes in two datasets conforming to two different ontologies:
1) Discover Linkkeys – maximal sets of property pairs that can link instances of two
different classes
[Atencia et al. 2014]
LINKKEY [ATENCIA ET AL. 2014]
142
Person1
Literal
LastName
Literal Literal
Person2
Literal
LN
Literal Literal
Owl:EquivalentClass
Given a pair of classes in two datasets conforming to two different ontologies:
1) Discover Linkkeys – maximal sets of property pairs that can link instances of two
different classes
LINKKEY [ATENCIA ET AL. 2014]
143
Person1
Literal
LastName
Literal Literal
Person2
Literal
LN
Literal Literal
Owl:EquivalentClass
LastName Nationality Profession
P11 Tompson Greek
P12 Dupont French Researcher
LN NL PR
P21 Tompson Greek
P22 Tompson Greek,
French
Reseacher
Given a pair of classes in two datasets conforming to two different ontologies:
1) Discover Linkkeys – maximal sets of property pairs that can link instances of two
different classes
LINKKEY [ATENCIA ET AL. 2014]
144
Person1
Literal
LastName
Literal Literal
Person2
Literal
LN
Literal Literal
Owl:EquivalentClass
Ex. {<LastName,LN>,<Nationality,NL>}, {<Profession,PR>,<Nationality,NL>}
LastName Nationality Profession
P11 Tompson Greek
P12 Dupont French Researcher
LN NL PR
P21 Tompson Greek
P22 Tompson Greek,
French
Reseacher
Given a pair of classes in two datasets conforming to two different ontologies:
1) Discover Linkkeys – maximal sets of property pairs that can link instances of two
different classes
LINKKEY [ATENCIA ET AL. 2014]
145
Given a pair of classes in two datasets conforming to two different ontologies:
1) Discover Linkkeys – maximal sets of property pairs that can link instances of two
different classes
2) Linkkey quality evaluation:
1) use measures of discriminability and coverage for non supervised linkkey
extraction
2) an approximation of precision and recall for the supervised case.
Ex. {<LastName,LN>,<Nationality,NL>}, {<Profession,PR>,<Nationality,NL>}
LastName Nationality Profession
P11 Tompson Greek
P12 Dupont French Researcher
LN NL PR
P21 Tompson Greek
P22 Tompson Greek,
French
Reseacher
CONDITIONAL KEYS
COLLABORATION WITH:
LUIS GALARAGA (INRIA RENNES)
NATHALIE PERNELLE (LRI, UNIV. PARIS SUD),
FABIAN SUCHANEK (TELECOM PARISTECH)
DANAI SYMEONIDOU (INRA, MONTPELLIER)
146
Symeonidou et al. 2017
KEYS IN KNOWLEDGE
BASES
To obtain keys found in Knowledge Bases (KBs)
• Different automatic key discovery approaches (eg. SAKey, KD2R)
Key limitations
• Cases of datasets having no keys
• Keys are generic, i.e., they are true for every instance of a class in a
dataset
FirstName LastName Gender	 Lab	 Nationality	
instance1	 Claude	 Dupont Female	 Paris-Sud France	
instance2	 Claude	 Dupont Male	 Paris-Sud Belgium	
instance3	 Juan	 Rodríguez	 Male	 INRA	 Spain,	Italy	
instance4	 Juan	 Salvez	 Male	 INRA	 Spain	
instance5	 Anna	 Georgiou	 Female	 INRA	 Greece,	France	
instance6	 Pavlos	 Markou Male	 Paris-Sud Greece	
instance7	 Marie	 Legendre	 Female	 INRA	 France		
Instances of the
class Person
Key: {LastName,
Gender}
147
Symeonidou et al. 2017
PROBLEM STATEMENT
Motivating example
• In many countries of the world, a student can be supervised
by multiple supervisors
• In German Universities, a student can be supervised by only
one professor
• {supervises} key for the instances of the class Professor with
“condition” that they are in a German university
Approximate keys [Sakey] express keys that do not uniquely
identify every instance of a class in a dataset
Approximate keys do not express the conditions under which they are true
148
Symeonidou et al. 2017
CONDITIONAL KEYS
BY VICKEY
Conditional key: a key, valid for instances of a class satisfying a
specific condition
• Condition part: pairs of property and value
Eg. {Lab=INRA}, {Gender=Male}, {Gender=Female, Lab=INRA} etc.
• Key part: a set of properties
FirstName LastName Gender	 Lab	 Nationality	
instance1	 Claude	 Dupont Female	 Paris-Sud France	
instance2 Claude	 Dupont Male	 Paris-Sud Belgium	
instance3	 Juan	 Rodríguez	 Male	 INRA	 Spain,	Italy	
instance4	 Juan	 Salvez Male	 INRA	 Spain	
instance5	 Anna	 Georgiou	 Female	 INRA	 Greece,	France	
instance6	 Pavlos	 Markou Male	 Paris-Sud Greece	
instance7	 Marie	 Legendre	 Female	 INRA	 France		
{LastName} is a key under the condition {Lab=INRA}
Instances of the
class Person
149
Symeonidou et al. 2017
CONDITIONAL KEY
QUALITY MEASURES
Support: #instances both satisfying condition part and instantiating key part
Coverage: Support/#all_Instances
Support: 4
Coverage: 4/7 = 0.57
FirstName LastName Gender	 Lab	 Nationality	
instance1	 Claude	 Dupont Female	 Paris-Sud France	
instance2 Claude	 Dupont Male	 Paris-Sud Belgium	
instance3	 Juan	 Rodríguez	 Male	 INRA	 Spain,	Italy	
instance4	 Juan	 Salvez Male	 INRA	 Spain	
instance5	 Anna	 Georgiou	 Female	 INRA	 Greece,	France	
instance6	 Pavlos	 Markou Male	 Paris-Sud Greece	
instance7	 Marie	 Legendre	 Female	 INRA	 France		
{LastName} is a key under the condition {Lab=INRA}
Instances of the
class Person
150
Symeonidou et al. 2017
VICKEY: MINING EFFICIENTLY
CONDITIONAL KEYS
Goal of VICKEY
• Given all instances of a class in a dataset, a min_support, a min_coverage,
mine all minimal conditional keys with
• support ≥ min_support
• coverage ≥ min_coverage
Exponential search space (O(|V||P|)with
• V, the set of objects of a given class in the dataset
• P, the set of properties of a given class in the dataset
151
Symeonidou et al. 2017
Conditional keys can be efficiently discovered from the non-keys
VICKEY: MINING EFFICIENTLY
CONDITIONAL KEYS
Non-key: a set of properties where two instances share at least one value for
each property in the non-key (SAKey)
Given a maximal non-key
• Condition part: subset of properties of the non-key
• Key part: subset of the remaining properties of the non-key
FirstName LastName Gender	 Lab	 Nationality	
instance1	 Claude	 Dupont Female	 Paris-Sud France	
instance2 Claude	 Dupont Male	 Paris-Sud Belgium	
instance3	 Juan Rodríguez	 Male INRA	 Spain,	Italy	
instance4	 Juan Salvez Male INRA	 Spain	
instance5	 Anna	 Georgiou	 Female	 INRA	 Greece,	France	
instance6	 Pavlos	 Markou Male	 Paris-Sud Greece	
instance7	 Marie	 Legendre	 Female	 INRA	 France		
Non-key
{FirstName, Gender, Lab,
Nationality}
Instances of the
class Person
152
Symeonidou et al. 2017
CONDITIONAL KEY GRAPH
EXPLORATION
• Given a non-key
• Step 1: Discover all minimal conditional keys with condition of size 1
• Step 2: Discover all minimal conditional keys with condition of size 2
• …
• Step n: Discover all minimal conditional keys with condition of size n
153
Symeonidou et al. 2017
• Given	a	non-key
• Step	1:	Discover	all	minimal	conditional	keys	with	condition	of	size	1	{p=a}
• Step	2:	Discover	all	minimal conditional	keys	with	condition	of	size	2	{p1=a1^p2=a2}
• …
• Step	n:	Discover	all	minimal conditional	keys	with	condition	of	size	n	{p1=a1^…^pn=an}
FirstName Lab
FirstName
Lab
FirstName
Nationality
Lab
Nationality
FirstName
Nationalit
y
Lab
Gender=Female Nationality
Condition Conditional	key	graph
Non-key
{FirstName,	Gender,	Lab,	Nationality}
154
CONDITIONAL KEY GRAPH
EXPLORATION Symeonidou et	al.	2017
• Given	a	non-key
• Step	1:	Discover	all	minimal	conditional	keys	with	condition	of	size	1	{p=a}
• Step	2:	Discover	all	minimal conditional	keys	with	condition	of	size	2	{p1=a1^p2=a2}
• …
• Step	n:	Discover	all	minimal conditional	keys	with	condition	of	size	n	{p1=a1^…^pn=an}
FirstName Lab
FirstName
Lab
FirstName
Nationalit
y
Lab
Nationality
FirstName
Nationalit
y
Lab
Gender=Female Nationality
Condition Conditional	key	graph
Key	{FirstName}	for	cond {Gender=Female}
Non-key
{FirstName,	Gender,	Lab,	Nationality}
155
CONDITIONAL KEY GRAPH
EXPLORATION Symeonidou et	al.	2017
• Given	a	non-key
• Step	1:	Discover	all	minimal	conditional	keys	with	condition	of	size	1	{p=a}
• Step	2:	Discover	all	minimal conditional	keys	with	condition	of	size	2	{p1=a1^p2=a2}
• …
• Step	n:	Discover	all	minimal conditional	keys	with	condition	of	size	n	{p1=a1^…^pn=an}
Gender=Female
Condition Conditional	key	graph
firstName
firstName
lab
firstName
nationalit
y
lab
nationalit
y
firstName
nationalit
y
lab
FirstNam
e
FirstName
Lab
FirstName
Nationalit
y
FirstName
Nationalit
y
Lab
Lab
Nationalit
y
NationalityLab
Key	{FirstName}	for	cond {Gender=Female}
Non-key
{FirstName,	Gender,	Lab,	Nationality}
156
CONDITIONAL KEY GRAPH
EXPLORATION Symeonidou et	al.	2017
• Given	a	non-key
• Step	1:	Discover	all	minimal	conditional	keys	with	condition	of	size	1	{p=a}
• Step	2:	Discover	all	minimal conditional	keys	with	condition	of	size	2	{p1=a1^p2=a2}
• …
• Step	n:	Discover	all	minimal conditional	keys	with	condition	of	size	n	{p1=a1^…^pn=an}
Lab=INRA
Condition Conditional	key	graph
Key	{FirstName}	for	cond {Gender=Female}
FirstNam
e
FirstName
Gender
FirstName
Nationalit
y
Gender
Nationalit
y
FirstName
Nationalit
y
Gender
NationalityGender
Non-key
{FirstName,	Gender,	Lab,	Nationality}
157
CONDITIONAL KEY GRAPH
EXPLORATION Symeonidou et	al.	2017
• Given	a	non-key
• Step	1:	Discover	all	minimal	conditional	keys	with	condition	of	size	1	{p=a}
• Step	2:	Discover	all	minimal	conditional	keys	with	condition	of	size	2	{p1=a1^p2=a2}
• …
• Step	n:	Discover	all	minimal conditional	keys	with	condition	of	size	n	{p1=a1^…^pn=an}
Condition Conditional	key	graph
Gender=Female
^
Lab=INRA
Nationality
158
CONDITIONAL KEY GRAPH
EXPLORATION Symeonidou et	al.	2017
EXPERIMENTAL
EVALUATION
Evaluation of VICKEY’s scalability in large datasets
Evaluation of conditional keys in data linking
159
Symeonidou et al. 2017
VICKEY’S
SCALABILITY
Run VICKEY in large datasets
Compare the runtime results with AMIE - a generic rule
mining approach adapted to mine conditional keys
• Both approaches take as input non-keys mined by SAKey to
reduce the search space of conditional key mining
Coverage 1%
160
Symeonidou et al. 2017
RUNTIME RESULTS -
VICKEY VS. AMIE[2]
Class* #Triples #Instances #Properties #NonKeys VICKEY AMIE #ConditionalKeys
Actor 57.2k 5.8k 71 137 4.52m 12.58h 311
Album 786.1k 85.3k 39 68 1.53h 3.90h 304
Book 258.4k 30.0k 51 95 11.84h >1d 419
Film 832.1k 82.6k 74 132 1.37h 3.64h 185
Mountain 127.8k 16.4k 58 47 2.86m 23.57m 257
Museum 12.9k 1.9k 65 17 1.46s 6.45s 58
Organization 1.82M 178.7k 553 3221 26.32h > 36h 28
Scientist 258.5k 19.7k 73 309 27.67m > 1d 582
University 85.5k 8.7k 89 140 14.45h >1d 941
*All used classes are obtained from DBpedia
161
Symeonidou et al. 2017
DATA LINKING -
VICKEY VS. SAKEY[1]
Goal: link two datasets using
• Classical keys discovered by SAKey[1]
• Conditional keys discovered by VICKEY
• Supervises(Prof1, stud1)^Supervises(Prof2, stud1)^teachesIn(Prof1,
“Gemany”)^teachesIn(Prof2, “Gemany”)=> sameAs(Prof1, Prof2)
• Both classical keys and conditional keys
Evaluate obtained links using the existing goal-standard with
• Recall
• Precision
• F-Measure
162
Symeonidou et al. 2017
DATA LINKING -
VICKEY VS. SAKEY[1]
Link two knowledge bases containing information of Wikipedia
• Yago[3]
• DBpedia[4]
Used classes of DBpedia and Yago
• Actor
• Album
• Book
• Film
• Mountain
• Museum
• Organization
• Scientist
• University
163
Symeonidou et al. 2017
DATA LINKING -
VICKEY VS. SAKEY[1]
Class Recall Precision F-Measure
Actor
Keys[1]* 0.27 0.99 0.43
Conditional keys** 0.57 0.99 0.73
Keys[1]+Conditional
keys 0.6 0.99 0.75
Album
Keys[1] 0 1 0.00
Conditional keys 0.15 0.99 0.26
Keys[1]+Conditional
keys 0.15 0.99 0.26
Film
Keys[1] 0.04 0.99 0.08
Conditional keys 0.38 0.96 0.54
Keys[1]+Conditional
keys 0.39 0.98 0.55
Museum
Keys[1]
0.12 1 0.21
Conditional keys
0.25 1 0.40
Keys[1]+Conditional
keys
0.31 1 0.47
x 869
x 1.75
x 7.1
x 2.19
*Keys[1] from SAKey **Conditional keys from VICKEY
164
Symeonidou et al. 2017
SUMMARY
Keys and almost keys:
• Better linking results in terms of recall with n-almost keys than with sure keys
• Optimistic heuristic gets better results than pessimistic one
• Can handle size classes much larger than KD2R (DB: Natural Place more than
16 million triplets)
Conditional keys:
• Better linking results in terms of recall when conditional keys are used.
• Aside data linking, may be used to refine a class hierarchy
Improvements:
• Numerical values
• Dataset heterogeneity
When no (almost-)keys neither conditional keys are discoverable from the dataset
è Possible solution: Referring expressions
165
LINK
INVALIDATION /
REQUALIFICATION
WORK IN COLLABORATION WITH:
NATHALIE PERNELLE (LRI, UNIV. PARIS SUD),
LAURA PAPALEO (GENOVA PROVINCE, ITALY)
JOE RAAD (PHD STUDENT, AGROPARISTECH & UNIVERSITÉ PARIS SUD. )
166
• owl:sameAs, indicates that two different
descriptions refer to the same entity
• owl:sameAs semantics is too strict
• Reflexive, symmetric, transitive and
• Property sharing:
" X " Y owl:sameAs(X, Y)Ù p(X, Z) Þ p(Y, Z)
• Automatic data linking tools do not guarantee 100%
precision, because of:
• Errors, missing information, data freshness, …
• [Halpin et al. 2010] showed that 37% of owl:sameAs
links randomly selected among 250 identity links
between books were incorrect.
• Problem: how to (semi)-automatically invalidate/requalify
owl:sameAs links?
IDENTITY CRISIS
b1 b2
owl:sameAs
b4 b3
owl:sameAs
b1 b5
owl:sameAs
b5 b6
owl:sameAs
167
IDENTITY CRISIS: SOME SOLUTIONS
b1 b2
owl:sameAs
b4 b3
owl:sameAs
b1 b5
owl:sameAs
b5 b6
owl:sameAs
168
• [Halpin et al. 2010]: propose ontology of identity and
invalidation of identity links by crowdsourcing.
• [de Melo 2013]: uses the Unique Name Assumption
and the transitivity of links to detect inconsistencies in
the data.
• [Papaleo et al. 2014, Papageorgiou et al. 2017]: exploit
some ontology axioms to logically/numerically detect
invalid identity links.
• [Raad et al. 2017] computes contextual identity links
for each pair of instances
LINK
INVALIDATION
[QUALINCA ANR PROJECT
(2012-2016)]
WORK IN COLLABORATION WITH:
- NATHALIE PERNELLE (LRI, UNIV. PARIS SUD)
- LAURA PAPLEO(GENOVA PROVINCE, ITALY)
169
LOGICAL AND NUMERICAL
APPROACH FOR LINK INVALIDATION
• Two ontology-based methods to detect invalid sameAs statements: a
logical method and a numerical method
• We build a contextual graph «around» each one of the two resources
involved in the sameAs by exploiting ontology axioms on:
• functionality and inverse functionality of properties and
• local completeness of some properties (e.g., the author list of a
book).
• We analyse the descriptions provided in these contextual graphs to
eventually detect inconsistencies or high dissimilarities.
170
LOGICAL METHOD
F is the set of RDF facts
enriched by a set of ¬synVals
facts in the form
¬synVals(w1, w2)
w1 and w2, being literals and
different.
171of 20
EXAMPLES:
- notSynVals(‘231’,’100’)
for a functional property numOfPages
-notSynVals(‘New York’, ‘Paris’)
for a functional property cityName
… knowledge from expert or extracted.
171
[Papaleo et al. 2014]
LOGICAL METHOD
R the set of rules
172of 20
(inverse) functional properties
local complete properties
sameAs(x,y) ÙnumOfPages(x,w1) Ù numOfPages(y,w2) à SynVals(w1,w2)
sameAs(x,y) Ù hasAuthor(x,w1)
à hasAuthor(y,w1)
172
[Papaleo et al. 2014]
LOGICAL INVALIDAITON
• If the property nb-pages is declared as functional then:
nbPages(b1, n1) Ù nbPages(b2, n2) Ù (b1=b2) Þ n1=n2.
è owl:sameAs(b1, b2) est faux.
b1 b2
a11
a21
owl:sameAs(b1, b2) ?
author
author
A Semantic Web
Primer
titre
A Semantic Web
Primer
titre
2007
pubYear
2007
pubYear
G. Antoniou
aName
Grigoris
Antoniou
aName
author
a12
author
a22
208
nbPages 288nbPages
Paul Grauth
aName
P. Grauth
aName
173
[Papaleo et al. 2014]
b1
e1
b2
a11 e2 a21
owl:sameAs(b1, b2) ?
author
author
publisherpublisher
A Semantic Web
Primer
title
A Semantic Web
Primer
title
2008
pubYear
2007
pubYear
G. Antoniou
aName
Grigoris
Antoniou
aName
author
…
a12
…
author
…
a22
…
r2n
r21ref
ref
…r1m r11
ref ref
…
NUMERICAL METHOD: EXAMPLE
• P= {title, pubYear, publisher, author, aName, pName}
• Sim(“A Semantic Web …``, “A Semantic Web …``) = 1, Sim(“ 2007“, “2008“) = 0
• Sim(“MIT Press``, “MIT Press``) = 1, è CSim(e1, e2) = 1
• Sim(“Grigoris Antoniou``, “G. Antoniou``) = 0.5, ... è CSim(a11, a21) = 0.5, ….
è CSim(a12, a22) = 0.5
MIT Press
MIT Press
pName
pNam
e
• CSim(b1, b2) = 0.62 if agregation function is average and
• CSim(b1, b2) = 0 if agregation function is minimum 174
[Papageorgio et al. 2017]
COMPARAISON
LOGICAL/NUMERICAL
Logical method
[Papaleo, Pernelle and Saïs (2014)]
Numerical method
(Agregation = average)
[Papageorgiou, Pernelle and Saïs 2017]
Datasets Precision Recall F-measure Precision Recall F-measure
thresh
old
Person1 0.69 0.98 0.81 1.0 0.98 0.99 0.3
Person2 0.5 1.0 0.67 0.994 0.989 0.99 0.2
Restaurant 0.63 0.97 0.77 0.97 1.0 0.98 0.4
• Average gain of 23% F-measure (significant increase in precision, comparable recall)
175
CONTEXTUAL
IDENTITY
[JOE RAAD PHD
LIONES (2015-2018) PROJECT,
FUNDED BY CDS PARIS SACLAY]
WORK IN COLLABORATION WITH:
- NATHALIE PERNELLE (LRI, UNIV. PARIS SUD),
- JULIETTE DIBIE (AGROPARISTECH, INRA),
- LILIANA IBANESCO (AGROPARISTECH, INRA),
- CAROLINE PENICAUD (INRA PARIS)
176
• owl:sameAs, indicates that two different descriptions refer to the same
entity
• owl:sameAs semantics is too strict
• Reflexive, symmetric, transitive and
• Property sharing:
" X " Y owl:sameAs(X, Y)Ù p(X, Z) Þ p(Y, Z)
• In some specific domains, the identity depends on the context.
Examples:
• same-components and same-quantity è same-juice. [Hypocaloric diet context]
• proportionally-same-components è same-juice. [Shopping list context]
• Problem: how to automatically compute the contextual identity links?
CONTEXTUAL IDENTITY
177
[Raad et al. 2014]
178
CONTEXTUAL IDENTITY
Proposition of a new contextual identity link
• A new predicate to represent identity links valid in explicit contexts
• Hierarchy of the contexts and the identity links
• Defining identity contexts for each class (local contexts)
• Defining identity contexts for each class and its connected graph
(global context)
[Raad et al. 2014]
CONTEXTUAL IDENTITY
Contextual Identity Link Example
Πa(Juice) = { (Juice, {rdf:Type, expiryDate}, {isComposedOf}),
(Banana, {rdf:Type}, {isComposedOf -1}),
(Strawberry, {rdf:Type}, {hasAttribute, isComposedOf -1}),
(Weight, {rdf:Type, hasValue, hasUnit}, {hasAttribute-1}) }
identiConTo<Πa(Juice)>(juice1, juice2)
179
[Raad et al. 2014]
CONTEXTUAL IDENTITY
Global Context
• π(ck): local context of the class ck =(ck, DPck, Opck)
• src: a given source class of the ontology
• G: connected sub-graph containing src
• CG: set of classes belonging to G
• Ck: class belonging to CG
Π(src) =Uck ∈ CG π(ck)
180
[Raad et al. 2014]
CONTEXTUAL IDENTITY
Πa(Juice) ≤ Πc(Juice) Πb(Juice) ≤ Πc(Juice)
Πa(Juice) Πb(Juice)
Πc(Juice)
…
…
each local context in Πc(Juice) is less specific or equal to its corresponding local
context in Πa(Juice) 181
[Raad et al. 2014]
DECIDE METHOD – DETECTION OF
CONTEXTUAL IDENTITY
It automatically detects and adds these contextual
identity links in the knowledge graph
DECIDE
DEtection of Contextual IDEntity
Knowledge
Graph
Source
Class
Unwanted
Properties
Paired
Properties
Necessary
Properties
For each pair of instances (i1, i2) of the source class
set of the most specific global contexts in which (i1, i2)
are identical
182
[Raad et al. 2014]
EXPERIMENTS - LIONES PROJECT
Transformation of Micro-organisms Digestion Process
≈ 1000 files ≈ 2000 files
Guidelines
CellExtraDry
- Classes: ≈ 4 700
- Individuals: ≈ 415 000
- Statements: ≈ 1 700 000
Carredas
- Classes : ≈ 5 000
- Individuals: ≈ 42 000
- Statements : ≈ 237 000
RDFRDF
183
[Raad et al. 2014]
EXPERIMENTS - LIONES PROJECT
Mixture
DECIDE
DEtection of Contextual IDEntity
CellExtraDry + Carredas
184
[Raad et al. 2014]
EXPERIMENTS - LIONES PROJECT
identiConTo<Π(Mixture)25>(i, j): two mixtures have the same nature (e.g. humid) and
the same components (water, yeast, mono-potassium, phosphate, and glucose),
with each of these components having identical weights (value + unit of measure)
Generation of rules for missing observations prediction
185
[Raad et al. 2014]
INVALIDATION /
REQUALIFICATION: SUMMARY
• Contextual identity useful for profiling identity links
• Useful for giving explanations for possible experts validation
• Allow to generate missing values prediction rules.
• Future directions:
• Is Contextual difference relevant/useful?
• How to make explicit incomplete information for a pair of resources?
• Use when the ontologies are different
• SameAs link invalidation
• Improves the precision without decreasing the recall
• May be used for error detection and/or anomalies in ontology axioms
• Future directions:
• Provide comprehensible explanations for experts
• Use when the ontologies are different
186
SOME FUTURE CHALLENGES
q Data evolution
q ontology axiom evolution
q link evolution
q temporal data linking
q Data veracity è what is the truth?
q Knowledge discovery in complex data
q key graphs
q causality rules
q Explanation problem
q provenance of the data and the data/knowledge processes
187
THANKS!
188
REFERENCES (1)
[Atencia et al. 2014] Data interlinking through robust Linkkey extraction.
Atencia, Manuel, Jérôme David, and Jérôme Euzenat. ECAI, 2014.
[Atencia et al.’12] Keys and Pseudo-Keys Detection for Web Datasets Cleansing and Interlinking.
Manuel Atencia, Jérôme David, François Scharffe. In EKAW 2012
[Cohen et al. 2003] A comparison of string distance metrics for name-matching tasks.
William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg.
In IIWEB@AAAI 2003.
[Ferrara13] Evaluation of instance matching tools: The experience of OAEI.
Alfio Ferrara, Andriy Nikolov, Jan Noessner, François Scharffe. OM@ISWC 2013
[Hu et al. 2011] A Self-Training Approach for Resolving Object Coreference on the Semantic Web.
Wei Hu, Jianfeng Chen, Yuzhong Qu. In WWW 2011
[Kang et al. 2008] Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its
Evaluation. Kang, Getoor, Shneiderman, Bilgic, Licamele,
In IEEE Trans. Vis. Comput. Graph2008
189
REFERENCES (2)
[Lenat and Feigenbaum 1991] On the threshold of knowledge.
Douglas B. Lenat and Edward A. Feigenbaum
In Artificial Intelligence 47 (1991)
[Nikolov et al’12] Unsupervised Learning of Link Discovery Configuration
Andriy Nikolov, Mathieu d'Aquin, Enrico Motta. In ESWC 2012.
[Pernelle et al.’13] An Automatic Key Discovery Approach for Data Linking.
Nathalie Pernelle, Fatiha Saïs. and Danai Symeounidou.
In Journal of Web Semantics
[Saïs et al.07] L2R: a Logical method for Reference Reconciliation.
Fatiha Saïs, Nathalie Pernelle and Marie-Christine Rousset. In AAAI 2007.
[Saïs et al.09] Combining a Logical and a Numerical Method for Data Reconciliation.
Fatiha Saïs., Nathalie Pernelle and Marie-Christine Rousset.
In Journal of Data Semantics.
[Shvaiko,Euzenat13] Ontology Matching: State of the Art and Future Challenges,
Pavel Shvaiko, Jérôme Euzenat. In TKDE 2013
190
REFERENCES (3)
[Suchanek11] PARIS: Probabilistic Alignment of Relations, Instances, and Schema
Fabian Suchanek, Serge Abiteboul, Pierre Senellart. In VLDB 2011.
[Soru et al. 2015] ROCKER: a refinement operator for key discovery.
Soru, Tommaso, Edgard Marx, and Axel-Cyrille Ngonga Ngomo.
In WWW, 2015.
[Symeonidou et al. 2014] SAKey: Scalable almost key discovery in RDF data.
Symeonidou, Danai, Vincent Armant, Nathalie Pernelle, and Fatiha Saïs.
In ISWC 2014.
[Symeonidou et al. 2017] VICKEY: Mining Conditional Keys on RDF datasets .
Danai Symeonidou, Luis Galarraga, Nathalie Pernelle, Fatiha Saïs and Fabian Suchanek.
In ISWC 2017.
[Papageorgiou et al. 2017] Approche numérique pour l'invalidation de liens d'identité (owl:SameAs).
Dimitrios Christaras Papageorgiou, Nathalie Pernelle and Fatiha Saïs. In IC 2017.
191
REFERENCES (4)
[Papaleo et al. 2014] Logical Detection of Invalid SameAs Statements in RDF Data,
Laura Papaleo, Nathalie Pernelle, Fatiha Saïs and Cyril Dumont. In EKAW 2014
[Volz et al'09] Silk – A Link Discovery Framework for the Web of Data.
Julius Volz, Christian Bizer et al. In WWW 2009.
[Zheng et al. 2013] Results for OAEI 2013
Qian Zheng, Chao Shao, Juanzi Li, Zhichun Wang and Linmei Hu. OM@ISWC 2010
192
THANKS
193
N. Pernelle M.C. Rousset D. Symeonidou
J. Raad L. Papaleo N. Niraula

Weitere ähnliche Inhalte

Was ist angesagt?

From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2Connected Data World
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital.AI
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Trey Grainger
 
Vital AI: Big Data Modeling
Vital AI: Big Data ModelingVital AI: Big Data Modeling
Vital AI: Big Data ModelingVital.AI
 
The Future of Search and AI
The Future of Search and AIThe Future of Search and AI
The Future of Search and AITrey Grainger
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Trey Grainger
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemTrey Grainger
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Trey Grainger
 
Thought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchThought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchTrey Grainger
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systemsTrey Grainger
 
AI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge ManagementAI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge ManagementTrey Grainger
 
The Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesThe Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesTrey Grainger
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic SearchPaul Wlodarczyk
 
Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Amit Sheth
 
Semantic Web: introduction & overview
Semantic Web: introduction & overviewSemantic Web: introduction & overview
Semantic Web: introduction & overviewAmit Sheth
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrTrey Grainger
 
Semantic Search at Yahoo
Semantic Search at YahooSemantic Search at Yahoo
Semantic Search at YahooPeter Mika
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Peter Mika
 

Was ist angesagt? (20)

From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent Apps
 
Semantic search
Semantic searchSemantic search
Semantic search
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)
 
Vital AI: Big Data Modeling
Vital AI: Big Data ModelingVital AI: Big Data Modeling
Vital AI: Big Data Modeling
 
The Future of Search and AI
The Future of Search and AIThe Future of Search and AI
The Future of Search and AI
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
 
Thought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchThought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered Search
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systems
 
AI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge ManagementAI, Search, and the Disruption of Knowledge Management
AI, Search, and the Disruption of Knowledge Management
 
The Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesThe Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation Engines
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic Search
 
Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...
 
Semantic Web: introduction & overview
Semantic Web: introduction & overviewSemantic Web: introduction & overview
Semantic Web: introduction & overview
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache Solr
 
Semantic Search at Yahoo
Semantic Search at YahooSemantic Search at Yahoo
Semantic Search at Yahoo
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
 

Ähnlich wie Discover Hidden Connections with Knowledge Graph Linking

Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph IntroductionSören Auer
 
Hala skafkeynote@conferencedata2021
Hala skafkeynote@conferencedata2021Hala skafkeynote@conferencedata2021
Hala skafkeynote@conferencedata2021hala Skaf
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked dataLaura Po
 
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...Armin Haller
 
Linked Open Data Visualization
Linked Open Data VisualizationLinked Open Data Visualization
Linked Open Data VisualizationLaura Po
 
Towards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphTowards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphSören Auer
 
FAIR data: LOUD for all audiences
FAIR data: LOUD for all audiencesFAIR data: LOUD for all audiences
FAIR data: LOUD for all audiencesAlessandro Adamou
 
The Unreasonable Effectiveness of Metadata
The Unreasonable Effectiveness of MetadataThe Unreasonable Effectiveness of Metadata
The Unreasonable Effectiveness of MetadataJames Hendler
 
20130622 okfn hackathon t2
20130622 okfn hackathon t220130622 okfn hackathon t2
20130622 okfn hackathon t2Seonho Kim
 
Linked Open Data Utrecht University Library
Linked Open Data Utrecht University LibraryLinked Open Data Utrecht University Library
Linked Open Data Utrecht University LibraryRuben Schalk
 
Open Context and Publishing to the Web of Data: Eric Kansa's LAWDI Presentation
Open Context and Publishing to the Web of Data: Eric Kansa's LAWDI PresentationOpen Context and Publishing to the Web of Data: Eric Kansa's LAWDI Presentation
Open Context and Publishing to the Web of Data: Eric Kansa's LAWDI Presentationekansa
 
Reuse of Structured Data: Semantics, Linkage, and Realization
Reuse of Structured Data: Semantics, Linkage, and RealizationReuse of Structured Data: Semantics, Linkage, and Realization
Reuse of Structured Data: Semantics, Linkage, and Realizationandrea huang
 
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...Bertram Ludäscher
 
Interlinking educational data to Web of Data (Thesis presentation)
Interlinking educational data to Web of Data (Thesis presentation)Interlinking educational data to Web of Data (Thesis presentation)
Interlinking educational data to Web of Data (Thesis presentation)Enayat Rajabi
 
On the many graphs of the Web and the interest of adding their missing links.
On the many graphs of the Web and the interest of adding their missing links. On the many graphs of the Web and the interest of adding their missing links.
On the many graphs of the Web and the interest of adding their missing links. Fabien Gandon
 
Open Data Dialog 2013 - Linked Data in Education
Open Data Dialog 2013 - Linked Data in EducationOpen Data Dialog 2013 - Linked Data in Education
Open Data Dialog 2013 - Linked Data in EducationStefan Dietze
 
Tue acosta tut_providing_linkeddata
Tue acosta tut_providing_linkeddataTue acosta tut_providing_linkeddata
Tue acosta tut_providing_linkeddataeswcsummerschool
 
Introduction_to_knowledge_graph.pdf
Introduction_to_knowledge_graph.pdfIntroduction_to_knowledge_graph.pdf
Introduction_to_knowledge_graph.pdfJaberRad1
 

Ähnlich wie Discover Hidden Connections with Knowledge Graph Linking (20)

Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
Hala skafkeynote@conferencedata2021
Hala skafkeynote@conferencedata2021Hala skafkeynote@conferencedata2021
Hala skafkeynote@conferencedata2021
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked data
 
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
 
Linked Open Data Visualization
Linked Open Data VisualizationLinked Open Data Visualization
Linked Open Data Visualization
 
Towards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphTowards an Open Research Knowledge Graph
Towards an Open Research Knowledge Graph
 
FAIR data: LOUD for all audiences
FAIR data: LOUD for all audiencesFAIR data: LOUD for all audiences
FAIR data: LOUD for all audiences
 
The Unreasonable Effectiveness of Metadata
The Unreasonable Effectiveness of MetadataThe Unreasonable Effectiveness of Metadata
The Unreasonable Effectiveness of Metadata
 
20130622 okfn hackathon t2
20130622 okfn hackathon t220130622 okfn hackathon t2
20130622 okfn hackathon t2
 
Linked Open Data Utrecht University Library
Linked Open Data Utrecht University LibraryLinked Open Data Utrecht University Library
Linked Open Data Utrecht University Library
 
Open Context and Publishing to the Web of Data: Eric Kansa's LAWDI Presentation
Open Context and Publishing to the Web of Data: Eric Kansa's LAWDI PresentationOpen Context and Publishing to the Web of Data: Eric Kansa's LAWDI Presentation
Open Context and Publishing to the Web of Data: Eric Kansa's LAWDI Presentation
 
Providing Linked Data
Providing Linked DataProviding Linked Data
Providing Linked Data
 
Reuse of Structured Data: Semantics, Linkage, and Realization
Reuse of Structured Data: Semantics, Linkage, and RealizationReuse of Structured Data: Semantics, Linkage, and Realization
Reuse of Structured Data: Semantics, Linkage, and Realization
 
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
 
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and QueryingPrateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
 
Interlinking educational data to Web of Data (Thesis presentation)
Interlinking educational data to Web of Data (Thesis presentation)Interlinking educational data to Web of Data (Thesis presentation)
Interlinking educational data to Web of Data (Thesis presentation)
 
On the many graphs of the Web and the interest of adding their missing links.
On the many graphs of the Web and the interest of adding their missing links. On the many graphs of the Web and the interest of adding their missing links.
On the many graphs of the Web and the interest of adding their missing links.
 
Open Data Dialog 2013 - Linked Data in Education
Open Data Dialog 2013 - Linked Data in EducationOpen Data Dialog 2013 - Linked Data in Education
Open Data Dialog 2013 - Linked Data in Education
 
Tue acosta tut_providing_linkeddata
Tue acosta tut_providing_linkeddataTue acosta tut_providing_linkeddata
Tue acosta tut_providing_linkeddata
 
Introduction_to_knowledge_graph.pdf
Introduction_to_knowledge_graph.pdfIntroduction_to_knowledge_graph.pdf
Introduction_to_knowledge_graph.pdf
 

Kürzlich hochgeladen

Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 

Kürzlich hochgeladen (20)

Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 

Discover Hidden Connections with Knowledge Graph Linking

  • 1. KNOWLEDGE GRAPH EXPANSION AND ENRICHMENT LRI, UNIVERSITÉ PARIS SUD, CNRS, UNIVERSITÉ PARIS SACLAY TUTORIAL@BDA 2017 FATIHA SAÏS 1
  • 2. OUTLINE § Linked Open Data § Knowledge Graphs § Technical Part 1. Data Linking 2. Key Discovery 3. Link invalidation/Requalification § Conclusion and Future Challenges 2
  • 3. FROM THE WWW TO THE WEB OF DATA - applying the principles of the WWW to data data is links, not only properties Linked Open Data 3
  • 4. LINKED DATA PRINCIPLES ① Use HTTP URIs as identifiers for resources à so people can look up the data ② Provide data at the location of URIs à to provide data for interested parties ③ Include links to other resources àso people can discover more things àbridging disciplines and domains è Unlock the potential of island repositories Tim Berners Lee, 2006 4
  • 5. RDF – RESOURCE DESCRIPTION FRAMEWORK • Statements of < subject predicate object > http://dbpedia.org/resource/Airbus dbo:label “Airbus” Subject Predicate Object … is called a triple 5
  • 6. LINKED OPEN DATA 6 Linked Data - Datasets under an open access - 1,139 datasets - over 100B triples - about 500M links - several domains Ex. DBPedia : 1.5 B triples "Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/"
  • 8. THE ROLE OF KNOWLEDGE IN AI 8 [Artificial Intelligence 47 (1991)] The knowledge principle: “if a program is to perform a complex task well, it must know a great deal about the world in which it operates.”
  • 9. ONTOLOGY, A DEFINITION 9 “An ontology is an explicit, formal specification of a shared conceptualization.” [Thomas R. Gruber, 1993] Conceptualization: abstract model of domain related expressions Specification: domain related Explicit: semantics of all expressions is clear Formal: machine-readable Shared: consensus (different people have different perceptions)
  • 10. SEMANTIC WEB: ONTOLOGIES 10 RDFS – Resource Description Framework Schema • Lightweight ontologies OWL – Web Ontology Language • Expressive ontologies Source: https://it.wikipedia.org/wiki/File:W3C- Semantic_Web_layerCake.png
  • 11. 1111 OWL ONTOLOGY OWL – Web Ontology Language • Represents rich and complex knowledge about things • Based on First Order Logic (FOL) • Can be used to verify the consistency of knowledge • Can make implicit knowledge explicit • Classes: concepts or collections of objects (individuals) • Properties: • owl:DataTypeProperty (attribute) • owl:ObjectProperty (relation) • Hierarchy • owl:subClassOf • owl:subPropertyOf • Individuals: ground-level of the ontology (instances)
  • 12. ONTOLOGY LEVELS 12 :type :type :type Conceptual level: - classes, properties (relations) Instance level: - facts (individuals)
  • 13. 1313 • Axioms: knowledge definitions in the ontology that were explicitly defined and have not been proven true. • Reasoning over an ontology → Implicit knowledge can be made explicit by logical reasoning • Example: Pompidou museum is an Art Museum < Pompidou_museum rdf:type ArtMuseum> . Pompidou museum contains Hallucination partielle < Pompidou_museum ao:contains Hallucination_partielle> . • Infer that: è Pompidou museum is a CulturalPlace < Pompidou_museum rdf:type CulturalPlace> . Because: Museum subsumes ArtMuseum and CulturalPlace subsumes Museum è Hallucination partielle is a Work <Hallucination_partielle rdf:type ao:Work> . Because: the range of the object property contains is the class Work. OWL ONTOLOGY - AXIOMS
  • 14. LINKED OPEN VOCABULARIES (LOV) Linked Open Vocabularies • Keeps track of available open ontologies and provides them as a graph • Search for available ontologies, open for reuse • Example: http://lov.okfn.org/dataset/lov/vo cabs/foaf 14
  • 15. ONTOLOGY PREDICATES SPREAD ON THE SEMANTIC WEB • RDFa (or Resource Description Framework in Attributes) • Top 50 web sites publishing Semantic Web data, clustered by predicates used. FOAF
  • 16. OUTLINE § Linked Open Data § Knowledge Graphs § Technical Part § Data Linking § Key Discovery § Link invalidation/requalification § Conclusion and Future Challenges 16
  • 17. WHO IS DEVELOPING KNOWLEDGE GRAPHS? 17 2012 2013 2015 2016 2013 2007 2008 2007 2012 Academic side Commercial side
  • 21. CONNECTING EVENTS AND PEOPLE WITH KNOWLEDGE GRAPHS 21
  • 22. TOWARDS A KNOWLEDGE- POWERED DIGITAL ASSISTANT 22 Cortana (Microsoft) • Natural access and storage of knowledge • Chat bots • Personalization • Emotion
  • 23. KNOWLEDGE GRAPH: A DEFINITION … 23 Wikipedia (en) This is not a formal definition!
  • 24. KNOWLEDGE GRAPH: A DEFINITION … 24 [L. Ehrlinger and W. Wöß, SEMANTiCS’2016] [16] H. Paulheim. Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods. Semantic Web Journal, (Preprint):1–20, 2016. [12] M. Kroetsch and G. Weikum. Journal of Web Semantics: Special Issue on Knowledge Graphs. http://www.websemanticsjournal.org/index.php/ps/ announcement/view/19 [August, 2016]. [17] J. Pujara, H. Miao, L. Getoor, and W. Cohen. Knowledge Graph Identification. In Proceedings of the 12th International Semantic Web Conference - Part I, ISWC ’13, pages 542–557, New York, USA, 2013. Springer. [3] A. Blumauer. From Taxonomies over Ontologies to Knowledge Graphs, July 2014. https://blog.semanticweb.at/2014/07/15/from- taxonomies-over-ontologiesto-knowledge-graphs [August, 2016]. [7] M. Farber, B. Ell, C. Menne, A. Rettinger, and F. Bartscherer. Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal, 2016. http://www.semantic-web-journal. net/content/linked- data-quality-dbpedia-freebaseopencyc-wikidata-and-yago [August, 2016] (revised version, under review). à Populated Ontology à Populated Ontology à RDF Graph à RDF Graph à Extracted RDF Graph
  • 25. KNOWLEDGE GRAPH (KG) dbo:Painting dbo:Museum dbo:museum-of dbo:Artist dbo:author dbo:city dbo:PopulatedPlace Thing dbo:Person dbo:Agent owl:equivalentClass(dbo:Municipality, dbo:Place) owl:equivalentClass(dbo:Place, dbo:Wikidata:Q532) owl:equivalentClass(dbo:Village, dbo:PopulatedPlace) owl:equivalentClass(dbo:PopulatedPlace,dbo:Municipality) owl:disjointClass(dbo:PopulatedPlace, dbo:Artist) owl:disjointClass(dbo:PopulatedPlace, dbo:Painting) owl:FunctionalProperty(dbo:city) owl:InverseFunctionalProperty(dbo:museum-of) dbo:birthPlace(X, Y) => dbo:citizsenOf(X, Y) dbo:parentOf(X, Y) => dbo:child(Y, X) is-a is-a is-a is-a is-a Ontology hierarchy Ontology axioms and rules PREFIX dbo: <http://dbpedia.org/ontology#> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> SELECT ?m ?p WHERE { ?m rdf:type dbo:Museum . ?m dbo:musuem-of ?p .} Querying (SPARQL) - KG saturation: infer whatever can be inferred from the KG. - KG consistency checking: no contradictions - KG repairing - … Reasoners: (Pellet, Fact++, Hermit, etc. ) RDF Graphs 25
  • 26. KNOWLEDGE GRAPH EXPANSION AND ENRICHMENT • Expansion: knowledge graphs are incomplete • Data linking (entity resolution, duplicate detection, reference reconciliation) • Link prediction: add relations • Ontology matching: connect graphs • Missing values prediction/inference • Validation: knowledge graphs may contain errors • Link invalidation • Dealing with errors and ambiguities • Enrichment: can new knowledge emerge from knowledge graphs? • Knowledge discovery • Automatic reasoning and planning 26
  • 27. OUTLINE § Linked Open Data § Knowledge Graphs § Technical Part 1. Data Linking 2. Key Discovery 3. Link invalidation/requalification § Conclusion and Future Challenges 27
  • 29. DATA LINKING 29 • Data linking or Identity link detection consists in detecting whether two descriptions of resources refer to the same real world entity (e.g. same person, same article, same gene).
  • 30. 30 • Data linking or Identity link detection consists in detecting whether two descriptions of resources refer to the same real world entity (e.g. same person, same article, same gene). DATA LINKING
  • 31. DATA LINKING: DIFFICULTIES 31 Different Vocabularies Misspelling errorsIncomplete Information : - date and place of birth ? - museum phone number ? - …. ? • Data linking or Identity link detection consists in detecting whether two descriptions of resources refer to the same real world entity (e.g. same person, same article, same gene).
  • 32. IDENTITY LINK DETECTION PROBLEM • Identity link detection consists in detecting whether two descriptions of resources refer to the same real world entity (e.g. same person, same article, same gene). • Definition (Link Discovery) • Given two sets U1 and U2 of resources • Find a partition of U1 x U2 such that : • S = {(u1,u2) ∈ u1 × u2: owl:sameAs(s,t)} and • D = {(u1,u2) ∈ u1 × u2: owl:differentFrom(s,t)} • A method is said total when (S È D) = (U1 x U2) • A method is said partial when (S È D) Ì (U1 x U2 ) • Naïve complexity ∈ O(U1 × U2), i.e. O(n2) 32
  • 33. Problem which exists since the data exists … and under different terminologies: record linkage, entity resolution, data cleaning, object coreference, duplicate detection, …. Record linkage: used to indicate the bringing together of two or more separately recorded pieces of information concerning a particular individual or family. [NKAJ, Science 1959] SOME OF HISTORY … 33
  • 34. ASIDE: DETECTING IDENTITY LINKS 34 Lise Getoor, VLDB’12 tutorial Ironically “Identity link detection” has many duplicates Entity resolution
  • 35. DATA LINKING IS MORE COMPLEX FOR GRAPHS THAN TABLES (WHY?) Databases Semantic Web Schema/Ontologies Same schema Possibly different ontologies in the same dataset Multiple types Single relation Several classes Open World Assumption NO YES UNA-Unique Name Assumption Yes May be no Data volume XX Thousands XX Millions/Billions (e.g., DBpedia has 1.5 billion triples) Multiple values for a property NO YES P1 hasAuthor “Michel Chein” P1 hasAuthor “Marie-Christine Rousset” 35 • Can propagate similarity decisions è more expensive but better performance • Can be generic and use domain knowledge, e.g. ontology axioms
  • 36. DATA LINKING APPROACHES: DIFFERENT CONTEXTS • Datasets conforming to the same ontology • Datasets conforming to different ontologies • Datasets without ontologies 36
  • 37. DATA LINKING APPROACHES • Instance-based approaches: consider only data type properties (attributes) • Graph-based approaches: consider data type properties (attributes) as well as object properties (relations) to propagate similarity scores/linking decisions (collective data linking) • Supervised approaches: need an expert to build samples of linked data to train models (manual and interactive approaches) • Rule-based approaches: need knowledge to be declared in the ontology or in other format given by an expert 37
  • 38. DATA LINKING APPROACHES • Instance-based approaches: consider only data type properties (attributes) • String comparison 38 m2 architect address “Saadiyat Island, Abu Dhabi” Jean_Nouvel “Nov. 8th 2017” S2 architect architect m1 address Jacques_Lemercier category Art_Antiquity Art_Museum category S1 created Luis_le_Vau architect Ange_Jacues_Gabriel created “1793” “99 rue de Rivoli, Paris” =? =? “Louvre Abou Dabi” “Musée du Louvre” =?
  • 39. DATA LINKING APPROACHES • Graph-based approaches: • consider data type properties (attributes) as well as • object properties (relations) to propagate similarity scores/linking decisions (collective data linking) 39 m2 architect address “Saadiyat Island, Abu Dhabi” Jean_Nouvel “Nov. 8th 2017” architect architect m1 address Jacques_Lemercier category Art_Antiquity Art_Museum category created Luis_le_Vau architect Ange_Jacues_Gabriel created “1793” “99 rue de Rivoli, Paris” =? =? “Louvre Abou Dabi” “Musée du Louvre” =? =? =? =? =? S2 S1
  • 40. DATA LINKING APPROACHES • Supervised approaches: need an expert to build samples of identity links to train models (manual and interactive approaches) 40 Examples of identity links S1S2 Learning of parameters, similarity functions, thresholds, … Identity link detection Identity links
  • 41. DATA LINKING APPROACHES • Rule-based approaches: need knowledge to be declared in the ontology or in other format given by an expert • homepage(w1, y) ∧ homepage(w2, y) è sameAs(w1, w2) • sameAs(Restaurant11, Restaurant21) • sameAs(Restaurant12, Restaurant22) • sameAs(Restaurant13, Restaurant23) 41 … homepage Restaurant11 www.kitchenbar.com Restaurant12 www.jardin.fr Restaurant13 www.gladys.fr Restaurant14 … homepage … www.kitchenbar.com Restaurant21 www.jardin.fr Restaurant22 www.gladys.fr Restaurant23 … Restaurant24 SameAS SameAS SameAS
  • 42. DATA LINKING APPROACHES: EVALUATION • Effectiveness: evaluation of linking results in terms of recall and precision • Recall = (#correct-links-sys) /(#correct-links-groundtruth) • Precision = (#correct-links-sys) /(#links-sys) • F-measure (F1) = (2 x Recall x Precision) / (Recall +Precision) • Efficiency: in terms of time and space (i.e. minimize the linking search space and the interaction actions with an expert/user). • Robustness: override errors/mistakes in the data • Use of benchmarks, like those of OAEI (Ontology Alignment Evaluation Initiative) or Lance 42
  • 43. SIMILARITY MEASURES • Token based (e.g. Jaccard, TF/IDF cosinus) : The similarity depends on the set of tokens that appear in both S and T. è Efficient, but sensitive to spelling errors • Edit based (e.g. Levenstein, Jaro, Jaro-Winkler) : The similarity depends on the smallest sequence of edit operations which transform S into T. è Less efficient, may deal with spelling errors, but sensitive to word order • Hybrids (e.g. N-Grams, Jaro-Winkler/TF-IDF, Soundex) 43 For more details: William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. 2003. A comparison of string distance metrics for name-matching tasks. In Proceedings of the 2003 International Conference on Information Integration on the Web (IIWEB'03), Subbarao Kambhampati and Craig A. Knoblock (Eds.). AAAI Press 73-78.
  • 45. FRAMEWORK SILK [Volz et al'09] • Provides a Link Specification Language(LSL) • Allows specifying linking conditions between two datasets • The linking conditions may be expressed in terms of: • Elementary similarity measures (e.g., Jaccard, Jaro) and • Aggregation functions (e.g. max, average) of the similarity scores 45
  • 46. SIMILARITY MEASURES IN SILK [Volz et al'09] 46
  • 47. EXAMPLE OF LSL SPECIFICATION [Volz et al'09] <Silk> <Prefixes> <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" /> <Prefix id="dbpedia" namespace="http://dbpedia.org/ontology/" /> <Prefix id="gn" namespace="http://www.geonames.org/ontology#" /> </Prefixes> <DataSources> <DataSource id="dbpedia"> <Param name="endpointURI" value="http://demo_sparql_server1/sparql" /> <Param name="graph" value="http://dbpedia.org" /> </DataSource> <DataSource id="geonames"> <Param name="endpointURI" value="http://demo_sparql_server2/sparql" /> <Param name="graph" value="http://sws.geonames.org/" /> </DataSource> </DataSources> 47 SPARQL endpoints Prefixes
  • 48. EXAMPLE OF LSL SPECIFICATION [Volz et al'09] <Interlinks> <Interlink id="cities"> <LinkType>owl:sameAs</LinkType> <SourceDataset dataSource="dbpedia" var="a"> <RestrictTo> ?a rdf:type dbpedia:City </RestrictTo> </SourceDataset> <TargetDataset dataSource="geonames" var="b"> <RestrictTo> ?b rdf:type gn:P </RestrictTo> </TargetDataset> 48 Entités à lier Link types Entities to be linked
  • 49. EXAMPLE OF LSL SPECIFICATION [Volz et al'09] <LinkageRule> <Aggregate type="average"> <Compare metric="levenshteinDistance" threshold="1"> <Input path="?a/rdfs:label" /> <Input path="?b/gn:name" /> </Compare> <Compare metric="num" threshold="1000" > <Input path="?a/dbpedia:populationTotal" /> <Input path="?b/gn:population" /> </Compare> </Aggregate> </LinkageRule> <Filter limit="1" /> 49 Mesures de similarité Similarity measures Aggregation function
  • 50. EXAMPLE OF LSL SPECIFICATION [Volz et al'09] <Outputs> <Output type="file" minConfidence="0.95"> <Param name="file" value="accepted_links.nt" /> <Param name="format" value="ntriples" /> </Output> <Output type="file" maxConfidence="0.95"> <Param name="file" value="verify_links.nt" /> <Param name="format" value="alignment" /> </Output> </Outputs> </Interlink> </Interlinks> </Silk> 50 Possible links Linking threshold
  • 51. KNOFUSS (INSTANCE-BASED, UNSUPERVISED) [Nikolov et al’12] • Learns linking rules using genetic algorithms: Sim(i1, i2) = fag(w11sim11 (V11,V21), …wmnsimmn (V1m,V2n)) • Fag : aggregation function for the similarity scores • simij: similarity measure between values V1i and V2j • wij: weights in [0..1] • Assumptions: • Unique name assumption (UNA), i.e., two different URIs refer to two different entities. • Good coverage rate between the two datasets • Normalized similarity scores in [0..1] 51
  • 52. KNOFUSS (INSTANCE-BASED, UNSUPERVISED) [Nikolov et al’12] 52 Examples of linking rules learned on the OAEI’10 benchmark Results in term of F-Measure on OAEI’10
  • 53. GRAPH-BASED AND INTERACTIVE APPROACH 53 KANG ET AL.: INTERACTIVE ENTITY RESOLUTION IN RELATIONAL DATA: A VISUAL ANALYTIC TOOL AND ITS EVALUATION 1001 Fig. 1. D-Dupe consists of three coordinated windows: the potential duplicate viewer on the left, the relational context viewer on the upper right corner, and the data detail viewer on the lower right corner. The potential duplicate viewer shows the list of potential duplicate author pairs that were D-Dupe [Kang et al’08]
  • 54. LN2R: A LOGICAL AND NUMERICAL METHOD FOR REFERENCE RECONCILIATON 54 [Saïs et al’07, Saïs et al’09]
  • 55. LN2R (GRAPH BASED, UNSUPERVISED AND INFORMED) [Saïs et al’07, Saïs et al’09] • A combination of two methods: • L2R, a Logical method for reference reconciliation: applies logical rules to infer sure owl:sameAs and owl:differentFrom links • N2R, a Numerical method for reference reconciliation: computes similarity scores for each pair of references • Assumptions • The datasets are conforming to the same ontology • The ontology contains axioms 55
  • 56. Ontology axioms • Disjunction axioms between classes, DISJOINT(C, D) • Functional properties axioms, PF(P) • Inverse functional properties axioms, PFI(P) • A set of properties that is functional or inverse functional axioms Assumptions on the data • Unique Name Assumption, UNA(src1) • Local Unique Name Assumption, LUNA(R) Example: 56 Authored(p, a1), Authored(p, a2), Authored(p, a3) …., Authored(p, an) è (a1 ¹a2), (a1 ¹ a3), (a2 ¹ a3) , … LN2R (GRAPH BASED, UNSUPERVISED AND INFORMED) [Saïs et al’07, Saïs et al’09]
  • 57. N2R: A NUMERICAL METHOD FOR REFERENCE RECONCILIATION 57 [Saïs et al’09]
  • 58. N2R: A NUMERICAL METHOD FOR REFERENCE RECONCILIATION • N2R computes a similarity score for pair of references obtained from their common description. • Uses known similarity measures, e.g. Jaccard, Jaro-Winkler. • Exploits ontology knowledge in a way to be coherent with L2R. • May consider the results of L2R: Reconcile(i, i’), ¬Reconcile(i, i’) , SynVals(v, v’) and ¬SynVals(v, v’). 58 [Saïs et al’09]
  • 59. SIMILARITY DEPENDENCY MODELLING 59 MuseumName+(m1) = {”Le Louvre”}, MuseumName+(m’1) = {”Louvre”}, Located+(m1) = {c1}, Located+(m’1) = {c’1}, Located-(c1) = {m1} , Located-(c’1) = {m’1}, …. m1, m’1 c1, c’1 p1, p’1 “Le Louvre”, “Louvre” “Paris”, “La ville de Paris” “La Joconde”, “l’Européenne” RDF facts in source S1: Located(m1, c1), MuseumName(m1, “le Louvre”) Contains(m1, p1), CityName(c1, “Paris”) PaintingName(p1, “la Joconde”) RDF facts in source S2 : Located(m’1, c’1), MuseumName(m’1, “Louvre”) Contains(m’1, p’1), CityName(c’1, “la Ville de Paris”) PaintingName(p’1, “l’Europèenne”) (c1, c’1) is functionally dependent on (m1, m’1) è Equation system x1 x2 x3 b11 b21 b31 1 1/3 ½1 1 1 1 CAttr(m1, m’1) = {MuseumName} , CAttr(c1, c’1)= {CityName},CAttr(p1,p’1)={PaintingName} CRel(m1, m’1)= {Located, Contains} CRel(c1, c’1) = {Located }, CRel(p1,p’1) = {Contains} [Saïs et al’09]
  • 60. N2R: ILLUSTRATION 60 m1, m’1 c1, c’1 p1, p’1 “Le Louvre”, “Louvre” “Paris”, “La ville de Paris” “La Joconde”, “l’Européenne” x1 x2 x3 b11 p1, p’2 “La Joconde”, “Joconde” x4 b41 b21 b31 l= 1/(| CAttr | + | CRel |) e = 0.02 b11 = 0.8, b21 = 0.3, b31 = 0.1, b41 = 0.7 x1 x2 x3 x4 Initialization 0.0 0.0 0.0 0.0 Iteration 1 0.8 0.3 0.1 0.7 Iteration 2 0.8 0.8 0.4 0.7 Iteration 3 0.8 0.8 0.4 0.7 Solution: x1 = 0.8 x2 = 0.8 x3 = 0.4 x4 = 0.7 x1 = max(max(b11, x3), x4), l * x2) x2 = max(b21, x1) x3 = max(b31, l* x1) x4 = max(b41 , l * x1) [Saïs et al’09]
  • 62. N2R: RESULTS ON CORA Trec=1, all the reconciliations obtained by L2R are also obtained by N2R. Trec=1 to Trec=0.85, the recall increases of 33 % while the precision decreases only of 6 %. Trec = 0.85, the F-measure is of 88 %: • Better than the results obtained by the supervised method of [Singla and Domingos’05] • Worst than those (97 %) obtained by the supervised method of [Dong et al.’05] 62 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 1 0,95 0,9 0,85 0,8 0,75 0,7 0,65 0,6 0,55 Trec Rappel Precision F-Mesure [Saïs et al’09]
  • 63. OAEI 2010 – Instance Matching track (PR), 2nd N2R: RESULTS IN OAEI2 2010 63 2 Ontology Alignment Evaluation Initiative [Saïs et al’09]
  • 65. ONTOLOGY MATCHING • Ontology alignment [Shvaiko,Euzenat13]: computes a set A of mappings between elements (classes, properties) of two ontologies O1 and O2: f(O1,O2)=A • The relations that are used to express a mapping can be: owl:equivalentClass, owl:equivalentProperty, rdfs:subClassOf, skos:closeMatch, skos:broader, etc. • Example: A={( owl:equivalentClass( http://dbpedia.org/ontology/City, http://schema.org/City, 0.8)} 65
  • 66. KINDS OF INFORMATION • Terminology: lexical information describing the ontology elements (i.e. labels, comments, …) • Example: Voie vs Voie souterraine • Structure: hierarchy of classes and properties (relations/attributes) • Example: the sub-classes of Voie are very similar to the sub-classes of Route • Extension: the existence of common instances!! 66
  • 67. PARIS [Suchanek et al. 12] • Objective: instance-based ontology alignment and data linking (graph- based, unsupervised and probabilistic) • Inputs: two populated RDFS ontologies with UNA (two different URI refer to two different entities) • Principle: • Compute the similarities between literal values (“12 cm”=“12”) • Iterate (1) and (2) until a fix-point : ① Compute the probability that two instances are linked ① Compute the probabilities of subClassOf/subPropertyOf 67 P(Ci ⊆ Cj ), P(Pi ⊆ Pj ) P(i1 = i2)
  • 68. • Property functionality degree (computed from data) • The more a property is functional the more the probability of X=Y will be. • Local functionality: Fun(p,x) = 1 / #y:p(x, y) • Global functionality: Fun(p) = (#x : ∃y:p(x,y)) / (#x,y : p(x,y)) • Example: city(m1,Londres), city(m1,Orsay), city(m2,Tokyo) Fun(city,m1)= ½ Fun(city,m2)=1 Fun(city)=2/3 è The same is done for inverse functionality (denoted fun-1) 68 PARIS [Suchanek et al. 12]
  • 69. Link probability computation • Positive evidence (P1): if there exists a property that is highly inverse functional which has range values that are equal with a high probability isbn(x,isbn1), isbn(x’,isbn2), P(isbn1=isbn2) = 1, fun-1(isbn)=1 … P1(x=x’) = 1 - ((1 - (1.1)) . …) = 1-(0. …) = 1 • Negative evidence (P2): if there exists a property that is highly functional which has range values for the probability to be equal is very low. • Combination : P(x=x’) = P1(x=x’).P2(x=x’) 69 P1(x = x') =1− (1− Fun−1 r(x,y) r(x',y') ∏ (p).P(y = y')) PARIS [Suchanek et al. 12]
  • 70. • The probabilities of the existence of a subsumption mapping between properties and between classes are also computed • It is based on the proportion of common instances comparing to the number of instances of the general class • To compute these probabilities, the probabilities of the existence of a sameAs link between instances are exploited. 70 P(C ⊆ C') = #(C∩C') /#C P(p ⊆ p') = #(p∩ p') /# p PARIS [Suchanek et al. 12]
  • 71. PARIS - EXPERIMENTS Ontology #Instances #Classes #Relations Yago 2 795 289 292 206 67 Dbpedia 2 365 777 318 1 109 71 Instances Classes Relations Précision Rappel F-Mesure Yago Í DBp Précision DBpÍ Yago Précision Yago Í DBp Précision DBpÍ Yago Précision 90% 73% 81% - - 100% 92% 90% 73% 81% 94% 84% 100% 92% Instances: DBPedia and Yago uses the URIs of Wikipedia (recall and precision possible) Classes/properties: sampling + expert 5h00 to compute the linking probabilities for instances in one iteration (2h for the classes and 20 minutes for the properties) Linking or mapping if the probability >0.4 [Suchanek et al. 12]
  • 72. SUMMARY Data linking: numerous and different approaches … • Supervised approaches: needs samples of linked data à It can be avoided by using assumptions like (UNA) • Graph-based approaches: decision propagation (good recall but highly time consuming) • Logical approaches: good precision but partial è Few approaches generate differentFrom(i1,i2) or use dissimilarity evidence • Informed approaches: need knowledge to be declared in the ontology (generality) and/or ad-hoc knowledge given by an expert (a selection of properties, similarity functions) àThis kind of knowledge are not always available but can be learnt/discovered from the data (e.g., key/rule discovery approaches) [Symeonidou et al. 14, Symeonidou et al. 17, Galarraga et al. 13] 72
  • 73. OUTLINE § Linked Open Data § Knowledge Graphs § Technical Part 1. Data Linking 2. Key Discovery 3. Link invalidation/requalification § Conclusion and Future Challenges 73
  • 74. KEY DISCOVERY PHD OF DANAI SYMEONIDOU IN COLLABORATION WITH: NATHALIE PERNELLE (LRI, UNIV. PARIS SUD), 74 [Qualinca ANR project (2012-2016)]
  • 75. DATA LINKING USING RULES homepage(w1, y) ∧ homepage(w2, y) è sameAs(w1, w2) 75 … homepage Restaurant11 www.kitchenbar.com Restaurant12 www.jardin.fr Restaurant13 www.gladys.fr Restaurant14 … homepage … www.kitchenbar.com Restaurant21 www.jardin.fr Restaurant22 www.gladys.fr Restaurant23 … Restaurant24
  • 76. DATA LINKING USING RULES homepage(w1, y) ∧ homepage(w2, y) è sameAs(w1, w2) • sameAs(Restaurant11, Restaurant21) • sameAs(Restaurant12, Restaurant22) • sameAs(Restaurant13, Restaurant23) 76 … homepage Restaurant11 www.kitchenbar.com Restaurant12 www.jardin.fr Restaurant13 www.gladys.fr Restaurant14 … homepage … www.kitchenbar.com Restaurant21 www.jardin.fr Restaurant22 www.gladys.fr Restaurant23 … Restaurant24 SameAS SameAS SameAS
  • 77. DATA LINKING USING RULES Rules • Logical Rules • Ex. For instances of the class Restaurant homepage(w1, y) ∧ homepage(w2, y) è sameAs(w1, w2) {homepage}: discriminative property • Complex Rules • Ex. For instances of the class Restaurant max(Jaccard(lat(w1, n);lat(w2, m));jaro(long(w1, x);long(w2, y)) > 0.8 èsameAs(w1, w2) {lat, long}: discriminative property set 77 Rules contain discriminative properties => keys
  • 78. KEYS Key: A set of properties that uniquely identifies every instance in the data 78 FirstName LastName Birthdate Profession Person1 Anne Tompson 15/02/88 Actor Person2 Marie Tompson 02/09/75 Researcher Person3 Marie David 15/02/85 Teacher Person4 Vincent Solgar 06/12/90 Teacher Is [FirstName] a key? ✖ Is [LastName] a key? ✖
  • 79. KEYS Key: A set of properties that uniquely identifies every instance in the data 79 FirstName LastName Birthdate Profession Person1 Anne Tompson 15/02/88 Actor Person2 Marie Tompson 02/09/75 Researcher Person3 Marie David 15/02/85 Teacher Person4 Vincent Solgar 06/12/90 Teacher Is [FirstName] a key? ✖ Is [LastName] a key? ✖ Is [FirstName,LastName] a key? ✔
  • 80. KEYS - KEY MONOTONICITY Key monotonicity: When a set of properties is a key, all its supersets are also keys Minimal Key: A key that by removing one property stops being a key 80 FirstName LastName Birthdate Profession Person1 Anne Tompson 15/02/88 Actor Person2 Marie Tompson 02/09/75 Researcher Person3 Marie David 15/02/85 Teacher Person4 Vincent Solgar 06/12/90 Teacher ✔Is [FirstName,LastName, Birthday] a key? ✖Is [FirstName,LastName, Birthday] a minimal key? Minimal keys: [[FirstName,LastName],[FirstName,Profession] [Birthdate], [LastName,Profession]]
  • 81. KEYS DECLARED BY EXPERTS FOR DATA LINKING Not an easy task: • Experts are not aware of all the keys Ex. {SSN}, {ISBN} easy to declare Ex. {Name, DateOfBirth, BornIn} is it a key for the class Person? • Erroneous keys can be given by experts • As many keys as possible • More keys => More linking rules Goal: Discover keys automatically 81
  • 82. OWL2 KEY, KEY IN THE OPEN WORLD OWL2 Key for a class: a combination of properties that uniquely identify each instance of a class • hasKey( CE ( OPE1 ... OPEm ) ( DPE1 ... DPEn ) ) :hasKey(Book(Author) (Title)) means: Book(x1)∧Book(x2)∧Author(x1, y)∧Author (x2, y)∧Title(x1,w) ∧Title(x2, w) è sameAs(x1, x2) 82
  • 83. KEYS IN THE OPEN WORLD Key in the Open World • Key discovery: For a set of properties that is a key, there is no instance that shares values with another instances 83 lastName firstName hasFriend Person1 Tompson Manuel Person2, Person3 Person2 Tompson Maria, Hellen Person1 Person3 David George Person2, Person4 Person4 Solgar Michel Person3
  • 84. KEYS IN THE OPEN WORLD Key in the Open World • Key discovery: For a set of properties that is a key, there is no instance that shares values with another instances • Ex. firstName is a key in the Open World since • There do exist any people that share first names 84 lastName firstName hasFriend Person1 Tompson Manuel Person2, Person3 Person2 Tompson Maria, Hellen Person1 Person3 David George Person2, Person4 Person4 Solgar Michel Person3
  • 85. KEYS IN THE OPEN WORLD Key in the Open World • Key discovery: For a set of properties that is a key, there is no instance that shares values with another instances • Ex. hasFriend is not a key in the Open World since • There exist at least two people that share a friend 85 lastName firstName hasFriend Person1 Tompson Manuel Person2, Person3 Person2 Tompson Maria, Hellen Person1 Person3 David George Person2, Person4 Person4 Solgar Michel Person3
  • 86. KEYS IN THE CLOSED WORLD Key in the Closed World • For each property that participate in a key, every set of values for a property and an instance should be unique 86 lastName firstName hasFriend Person1 Tompson Manuel Person2, Person3 Person2 Tompson Maria Person1 Person3 David George Person2, Person4 Person4 Solgar Michel Person3
  • 87. KEYS IN THE CLOSED WORLD Key in the Closed World • For each property that participate in a key, every set of values for a property and an instance should be unique • Ex. hasFriend is a key in the Closed World since {Person2,Person3} <> {Person1} <> {Person2,Person4} <> {Person3} 87 lastName firstName hasFriend Person1 Tompson Manuel Person2, Person3 Person2 Tompson Maria Person1 Person3 David George Person2, Person4 Person4 Solgar Michel Person3
  • 88. CLOSED WORLD APPROACHES VS OPEN WORLD APPROACHES 88
  • 89. KEY DISCOVERY APPROACHES CWA approaches • Keys and Pseudo-Keys Detection for Web Datasets Cleaning and Interlinking • ROCKER: A Refinement Operator for Key Discovery OWA approaches • KD2R: a Key Discovery method for reference reconciliation • SAKey: Scalable almost key discovery in RDF data • Linkkey: Data interlinking through robust Linkkey extraction • VICKEY: Conditional key discovery 89
  • 90. KEY DISCOVERY APPROACHES CWA approaches • Keys and Pseudo-Keys Detection for Web Datasets Cleaning and Interlinking • ROCKER: A Refinement Operator for Key Discovery OWA approaches • KD2R: a Key Discovery method for reference reconciliation • SAKey: Scalable almost key discovery in RDF data • Linkkey: Data interlinking through robust Linkkey extraction • VICKEY: Conditional key discovery 90
  • 91. KEYS AND PSEUDO-KEYS DETECTION FOR WEB DATASETS CLEANING AND INTERLINKING [Atencia et al. 12] Bottom-up approach that discovers • Keys in the Closed World • Pseudo keys - keys that tolerate exceptions 91 x P2P1 P3 P4 P1P2 P1P3 P1P4 P2P3 P2P4 P3P4 P1P2P3 P1P2P4 P2P3P4 P1P2P3P4
  • 92. KEYS AND PSEUDO-KEYS DETECTION FOR WEB DATASETS CLEANING AND INTERLINKING [Atencia et al. 12] Bottom-up approach that discovers • Keys in the Closed World • Pseudo keys - keys that tolerate exceptions 92 x P1
  • 93. KEYS AND PSEUDO-KEYS DETECTION FOR WEB DATASETS CLEANING AND INTERLINKING [Atencia et al. 12] Bottom-up approach that discovers • Keys in the Closed World • Pseudo keys - keys that tolerate exceptions 93 x P2P1
  • 94. KEYS AND PSEUDO-KEYS DETECTION FOR WEB DATASETS CLEANING AND INTERLINKING [Atencia et al. 12] Bottom-up approach that discovers • Keys in the Closed World • Pseudo keys - keys that tolerate exceptions 94 x P2P1 P3
  • 95. KEYS AND PSEUDO-KEYS DETECTION FOR WEB DATASETS CLEANING AND INTERLINKING [Atencia et al. 12] Top-down approach that discovers • Keys in the Closed World • Pseudo keys - keys that tolerate exceptions 95 x P2P1 P3 P4
  • 96. KEYS AND PSEUDO-KEYS DETECTION FOR WEB DATASETS CLEANING AND INTERLINKING [Atencia et al. 12] 96 x P2P1 P3 P4 Bottom-up approach that discovers • Keys in the Closed World • Pseudo keys - keys that tolerate exceptions If P4 is a key => P1P4, P2P4,.., P1P2P3P4 are also keys
  • 97. KEYS AND PSEUDO-KEYS DETECTION FOR WEB DATASETS CLEANING AND INTERLINKING [Atencia et al. 12] Bottom-up approach that discovers • Keys in the Closed World • Pseudo keys - keys that tolerate exceptions If P4 is a key => P1P4, P2P4,.., P1P2P3P4 are also keys 97 x P2P1 P3 P4 P1P2 P1P3 P1P4 P2P3 P2P4 P3P4
  • 98. KEYS AND PSEUDO-KEYS DETECTION FOR WEB DATASETS CLEANING AND INTERLINKING [Atencia et al. 12] Bottom-up approach that discovers • Keys in the Closed World • Pseudo keys - keys that tolerate exceptions If P4 is a key => P1P4, P2P4,.., P1P2P3P4 are also keys 98 x P2P1 P3 P4 P1P2 P1P3 P1P4 P2P3 P2P4 P3P4 P1P2P3 P1P2P4 P2P3P4
  • 99. KEYS AND PSEUDO-KEYS DETECTION FOR WEB DATASETS CLEANING AND INTERLINKING [Atencia et al. 12] Bottom-up approach that discovers • Keys in the Closed World • Pseudo keys - keys that tolerate exceptions If P4 is a key => P1P4, P2P4,.., P1P2P3P4 are also keys 99 x P2P1 P3 P4 P1P2 P1P3 P1P4 P2P3 P2P4 P3P4 P1P2P3 P1P2P4 P2P3P4 P1P2P3P4
  • 100. KEYS AND PSEUDO-KEYS DETECTION FOR WEB DATASETS CLEANING AND INTERLINKING [Atencia et al. 12] To verify if a set of properties is a key • Partition instances according to their sharing values • If each partition contains only one instances => Key Key quality measures • Support of a set of properties P: • Discriminability of a set of properties P (pseudo-keys): 100 support ! = # !"#$%"&'# !"#$%&'"! !" ! # !"" !"#$%"&'#
  • 101. [ATENCIA ET AL. 12]- EXAMPLE 101 Name Actor Director ReleaseDate Website Language film 1 “Ocean’s 11” “B. Pitt” “J. Roberts” “S. Soderbergh” “3/4/01” www.oceans11.com --- film 2 “Ocean’s 12” “B. Pitt” “J. Roberts” “S. Soderbergh” “R. Howard” “2/5/04” www.oceans12.com “english” film 3 “Ocean’s 13” “B. Pitt” “G. Clooney” “S. Soderbergh” “R. Howard” “30/6/07” www.oceans13.com “english” film 4 “The descendants” “N. Krause” “G. Clooney” “A. Payne” “15/9/11” --- “english” film 5 “Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.co m “english”
  • 102. [ATENCIA ET AL. 12]- EXAMPLE 102 Film3Film2Film1 Film5Film4 [Name]: key support 5/5 =1 Discriminability = 5/5 =1 Partitions using [Name] è Name Actor Director ReleaseDate Website Language film 1 “Ocean’s 11” “B. Pitt” “J. Roberts” “S. Soderbergh” “3/4/01” www.oceans11.com --- film 2 “Ocean’s 12” “B. Pitt” “J. Roberts” “S. Soderbergh” “R. Howard” “2/5/04” www.oceans12.com “english” film 3 “Ocean’s 13” “B. Pitt” “G. Clooney” “S. Soderbergh” “R. Howard” “30/6/07” www.oceans13.com “english” film 4 “The descendants” “N. Krause” “G. Clooney” “A. Payne” “15/9/11” --- “english” film 5 “Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.com “english”
  • 103. 103 Partitions using [Actor] è Name Actor Director ReleaseDate Website Language film 1 “Ocean’s 11” “B. Pitt” “J. Roberts” “S. Soderbergh” “3/4/01” www.oceans11.com --- film 2 “Ocean’s 12” “B. Pitt” “J. Roberts” “S. Soderbergh” “R. Howard” “2/5/04” www.oceans12.com “english” film 3 “Ocean’s 13” “B. Pitt” “G. Clooney” “S. Soderbergh” “R. Howard” “30/6/07” www.oceans13.com “english” film 4 “The descendants” “N. Krause” “G. Clooney” “A. Payne” “15/9/11” --- “english” film 5 “Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.co m “english” Film4 Film3Film1 Film2 Film5 [Actor]: ??? [ATENCIA ET AL. 12]- EXAMPLE
  • 104. 104 Partitions using [Actor] è Name Actor Director ReleaseDate Website Language film 1 “Ocean’s 11” “B. Pitt” “J. Roberts” “S. Soderbergh” “3/4/01” www.oceans11.com --- film 2 “Ocean’s 12” “B. Pitt” “J. Roberts” “S. Soderbergh” “R. Howard” “2/5/04” www.oceans12.com “english” film 3 “Ocean’s 13” “B. Pitt” “G. Clooney” “S. Soderbergh” “R. Howard” “30/6/07” www.oceans13.com “english” film 4 “The descendants” “N. Krause” “G. Clooney” “A. Payne” “15/9/11” --- “english” film 5 “Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.co m “english” Film4 Film3Film1 Film2 Film5 [Actor]: pseudo key Support = 5/5 =1 Discriminability = 3/4=0.75 [ATENCIA ET AL. 12]- EXAMPLE
  • 105. 105 Partitions using [Actor,Director] è Name Actor Director ReleaseDate Website Language film 1 “Ocean’s 11” “B. Pitt” “J. Roberts” “S. Soderbergh” “3/4/01” www.oceans11.com --- film 2 “Ocean’s 12” “B. Pitt” “J. Roberts” “S. Soderbergh” “R. Howard” “2/5/04” www.oceans12.com “english” film 3 “Ocean’s 13” “B. Pitt” “G. Clooney” “S. Soderbergh” “R. Howard” “30/6/07” www.oceans13.com “english” film 4 “The descendants” “N. Krause” “G. Clooney” “A. Payne” “15/9/11” --- “english” film 5 “Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.co m “english” Film3Film2Film1 Film5Film4 [Actor, Director]: key Support = 4/5 =0.8 Discriminability = 5/5=1 [ATENCIA ET AL. 12]- EXAMPLE
  • 106. KEY DISCOVERY APPROACHES CWA approaches • Keys and Pseudo-Keys Detection for Web Datasets Cleaning and Interlinking • ROCKER: A Refinement Operator for Key Discovery OWA approaches • KD2R: a Key Discovery method for reference reconciliation • SAKey: Scalable almost key discovery in RDF data • Linkkey: Data interlinking through robust Linkkey extraction • VICKEY: Conditional key discovery 106
  • 107. SAKEY [Symeonidou et al. 14] SAKey: Scalable Almost Key discovery approach for: • Incomplete data (optimistic heuristic) • Errors • Duplicates • Large datasets Discovers almost keys • Sets of properties that are not keys due to few exceptions 107
  • 108. ALMOST KEY DISCOVERY STRATEGY • Naive automatic way to discover keys • Examine all the possible combinations of properties • Scan all instances for each candidate key • Example: Class described by 15 properties è215 = 32767 candidate keys • Discover keys efficiently by: • Reducing the combinations • Partially scanning the data 108
  • 109. KEY DISCOVERY SAKey: Key discovery approach for very large RDF datasets that may contain erroneous data or duplicates Region Producer Colour Wine1 Bordeaux Dupont White Wine2 Bordeaux Baudin Rose Wine3 Languedoc Dupont Red Wine4 Languedoc Faure Red 109 Producer is a key?
  • 110. KEY DISCOVERY SAKey: Key discovery approach for very large RDF datasets that may contain erroneous data or duplicates Region Producer Colour Wine1 Bordeaux Dupont White Wine2 Bordeaux Baudin Rose Wine3 Languedoc Dupont Red Wine4 Languedoc Faure Red 110 Region is a non key? • Interested only in maximal non keys: • All the sets of properties that are not maximal non keys are keys • Example: class described by the properties p1, p2, p3, p4 Maximal non key = {{p1, p2}} keys = {{p3}, {p4}}è
  • 112. KEY DISCOVERY SAKey: Key discovery approach for very large RDF datasets that may contain erroneous data or duplicates • n-almost key: A set of properties for which at most n instances are identical values Region Producer Colour Wine1 Bordeaux Dupont White Wine2 Bordeaux Baudin Rose Wine3 Languedoc Dupont Red Wine4 Languedoc Faure Red • Examples of keys {Region, Producer}: 0-almost key 112 [Symeonidou et al. 14]
  • 113. KEY DISCOVERY SAKey: Key discovery approach for very large RDF datasets that may contain erroneous data or duplicates • n-almost key: A set of properties for which at most n instances are identical values Region Producer Colour Wine1 Bordeaux Dupont White Wine2 Bordeaux Baudin Rose Wine3 Languedoc Dupont Red Wine4 Languedoc Faure Red • Examples of keys {Region, Producer}: 0-almost key {Producer}: 2-almost key Tool available in https://www.lri.fr/sakey 113 [Symeonidou et al. 14]
  • 114. N-ALMOST KEYS Exception of a key: an instance that shares values with another instance for a given set of properties P 114 Films HasName HasActor HasDirector ReleaseDate HasWebsite HasLanguage f1 “Ocean’s 11” “B. Pitt” “J. Roberts” “S. Soderbergh” “3/4/01” www.oceans11.com --- f2 “Ocean’s 12” “B. Pitt” “G. Clooney” “J. Roberts” “S. Soderbergh” “R. Howard” “2/5/04” www.oceans12.com --- f3 “Ocean’s 13” “B. Pitt” “G. Clooney” “S. Soderbergh” “R. Howard” “30/6/07” www.oceans13.com --- f4 “The descendants” “N. Krause” “G. Clooney” “A. Payne” “15/9/11” www.descendants.com “english” f5 “Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.com “english” f6 “Ocean’s 12“ --- “R. Howard” “2/5/04” --- --- [Symeonidou et al. 14]
  • 115. N-ALMOST KEYS Exception of a key: an instance that shares values with another instance for a given set of properties P • f1, f2 and f3 are three exceptions for the property set {HasActor} 115 Films HasName HasActor HasDirector ReleaseDate HasWebsite HasLanguage f1 “Ocean’s 11” “B. Pitt” “J. Roberts” “S. Soderbergh” “3/4/01” www.oceans11.com --- f2 “Ocean’s 12” “B. Pitt” “G. Clooney” “J. Roberts” “S. Soderbergh” “R. Howard” “2/5/04” www.oceans12.com --- f3 “Ocean’s 13” “B. Pitt” “G. Clooney” “S. Soderbergh” “R. Howard” “30/6/07” www.oceans13.com --- f4 “The descendants” “N. Krause” “G. Clooney” “A. Payne” “15/9/11” www.descendants.com “english” f5 “Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.com “english” f6 “Ocean’s 12“ --- “R. Howard” “2/5/04” --- --- [Symeonidou et al. 14]
  • 116. N-ALMOST KEYS Exception of a key: an instance that shares values with another instance for a given set of properties P • f1, f2 and f3 are three exceptions for the property set {HasActor} Exception Set EP: set of exceptions for P • EP = {f1, f2, f3} È {f2, f3, f4} = {f1, f2, f3, f4} for {HasActor} 116 Films HasName HasActor HasDirector ReleaseDate HasWebsite HasLanguage f1 “Ocean’s 11” “B. Pitt” “J. Roberts” “S. Soderbergh” “3/4/01” www.oceans11.com --- f2 “Ocean’s 12” “B. Pitt” “G. Clooney” “J. Roberts” “S. Soderbergh” “R. Howard” “2/5/04” www.oceans12.com --- f3 “Ocean’s 13” “B. Pitt” “G. Clooney” “S. Soderbergh” “R. Howard” “30/6/07” www.oceans13.com --- f4 “The descendants” “N. Krause” “G. Clooney” “A. Payne” “15/9/11” www.descendants.com “english” f5 “Bourne Identity“ “D. Liman” --- “12/6/12” www.bourneIdentity.com “english” f6 “Ocean’s 12“ --- “R. Howard” “2/5/04” --- --- [Symeonidou et al. 14]
  • 117. N-ALMOST KEYS n-almost key: a set of properties where |EP|≤ n • {HasActor} is a 4-almost key 117 [Symeonidou et al. 14]
  • 118. N-ALMOST KEYS n-almost key: a set of properties where |EP|≤ n • {HasActor} is a 4-almost key n-non key: a set of properties where |EP|≥ n • Using all the maximal n-non keys we can derive all the minimal (n-1)-almost keys 118 [Symeonidou et al. 14]
  • 119. N-ALMOST KEYS n-almost key: a set of properties where |EP|≤ n • {HasActor} is a 4-almost key n-non key: a set of properties where |EP|≥ n • Using all the maximal n-non keys we can derive all the minimal (n-1)-almost keys 119 All combinations of properties [Symeonidou et al. 14]
  • 120. N-ALMOST KEYS n-almost key: a set of properties where |EP|≤ n • {HasActor} is a 4-almost key n-non key: a set of properties where |EP|≥ n • Using all the maximal n-non keys we can derive all the minimal (n-1)-almost keys 120 All sets of properties that contain at least 5 exceptions All combinations of properties 5-non keys [Symeonidou et al. 14]
  • 121. N-ALMOST KEYS n-almost key: a set of properties where |EP|≤ n • {HasActor} is a 4-almost key n-non key: a set of properties where |EP|≥ n • Using all the maximal n-non keys we can derive all the minimal (n-1)-almost keys 121 All sets of properties that contain at least 5 exceptions All combinations of properties All sets of properties that contain less than 5 exceptions 5-non keys 4-almost keys [Symeonidou et al. 14]
  • 122. N-NON KEY DISCOVERY: INITIAL MAP HasActor {{f1, f2}, {f1, f2, f3}, {f2, f3, f4}, {f4}, {f5}} HasDirector {{f1,f2,f3}, {f2, f3, f6}, {f4}} ReleaseDate {{f1}, {f2, f6}, {f3}, {f4}, {f5}} HasName {{f1}, {f2, f6}, {f3}, {f4}, {f5}} HasLanguage {{f4, f5}} HasWebsite {{f1}, {f2}, {f3}, {f4}, {f5}, {f6}} “B. Pitt” “G. Clooney” “D. Liman”“N. Krause”“J. Roberts”“S. Soderbergh” 122 [Symeonidou et al. 14]
  • 123. N-NON KEY DISCOVERY: DATA FILTERING Singleton filtering 123 HasActor {{f1, f2}, {f1, f2, f3}, {f2, f3, f4}, {f4}, {f5}} HasDirector {{f1,f2,f3}, {f2, f3, f6}, {f4}} ReleaseDate {{f1}, {f2, f6}, {f3}, {f4}, {f5}} HasName {{f1}, {f2, f6}, {f3}, {f4}, {f5}} HasLanguage {{f4, f5}} HasWebsite {{f1}, {f2}, {f3}, {f4}, {f5}, {f6}} “B. Pitt” “G. Clooney” “D. Liman”“N. Krause”“J. Roberts”“S. Soderbergh” [Symeonidou et al. 14]
  • 124. N-NON KEY DISCOVERY: DATA FILTERING Singleton filtering 124 HasActor {{f1, f2}, {f1, f2, f3}, {f2, f3, f4}, {f4}, {f5}} HasDirector {{f1,f2,f3}, {f2, f3, f6}, {f4}} ReleaseDate {{f1}, {f2, f6}, {f3}, {f4}, {f5}} HasName {{f1}, {f2, f6}, {f3}, {f4}, {f5}} HasLanguage {{f4, f5}} HasWebsite {{f1}, {f2}, {f3}, {f4}, {f5}, {f6}} “B. Pitt” “G. Clooney” “D. Liman”“N. Krause”“J. Roberts”“S. Soderbergh” [Symeonidou et al. 14]
  • 125. N-NON KEY DISCOVERY: DATA FILTERING Singleton filtering 125 HasActor {{f1, f2}, {f1, f2, f3}, {f2, f3, f4}, {f4}, {f5}} HasDirector {{f1,f2,f3}, {f2, f3, f6}, {f4}} ReleaseDate {{f1}, {f2, f6}, {f3}, {f4}, {f5}} HasName {{f1}, {f2, f6}, {f3}, {f4}, {f5}} HasLanguage {{f4, f5}} HasWebsite {{f1}, {f2}, {f3}, {f4}, {f5}, {f6}} “B. Pitt” “G. Clooney” “D. Liman”“N. Krause”“J. Roberts”“S. Soderbergh” Single key [Symeonidou et al. 14]
  • 126. N-NON KEY DISCOVERY: DATA FILTERING Singleton filtering 126 HasActor {{f1, f2, f3}, {f2, f3, f4}} HasDirector {{f1,f2,f3}, {f2, f3, f6}} ReleaseDate {{f2, f6}} HasName {{f2, f6}} HasLanguage {{f4, f5}} [Symeonidou et al. 14]
  • 127. N-NON KEY DISCOVERY: POTENTIAL N-NON KEYS Combinations of properties not needed be explored • Incomplete data • Properties referring to different classes Lake Mountain NaturalPlace depth mountainRange 127 rdfs:subclassOfrdfs:subclassOf [Symeonidou et al. 14]
  • 128. N-NON KEY DISCOVERY: POTENTIAL N-NON KEYS Combinations of properties not needed be explored • Incomplete data • Properties referring to different classes • Potential n-non keys: Sets of properties that possibly refer to n-non keys 128 Lake Mountain NaturalPlace depth mountainRange rdfs:subclassOfrdfs:subclassOf [Symeonidou et al. 14]
  • 129. N-NON KEY DISCOVERY HasActor • {f1, f2, f3} È {f2, f3, f4} = {f1, f2, f3, f4} => 4-non key Composite n-non keys • Intersections between sets of different properties 129 HasActor {{f1, f2, f3}, {f2, f3, f4}} HasDirector {{f1,f2,f3}, {f2, f3, f6}} ReleaseDate {{f2, f6}} HasName {{f2, f6}} HasLanguage {{f4, f5}} [Symeonidou et al. 14]
  • 131. 131 N-NON KEY DISCOVERY {hasActor, director} è 3-non key [Symeonidou et al. 14]
  • 132. 132 N-NON KEY DISCOVERY {hasActor, director} è 3-non key … [Symeonidou et al. 14]
  • 133. KEY MERGE: SEVERAL DATASETS Goal: Keys valid in both datasets • More sure keys Intuition: Computation of Cartesian product of sets of keys • Keep only minimal keys 133 [firstName, LastName] [hasFriend] [firstName] KeysA KeysB [Symeonidou et al. 14]
  • 134. KEY MERGE: SEVERAL DATASETS Goal: Keys valid in both datasets • More sure keys Intuition: Computation of Cartesian product of sets of keys • Keep only minimal keys 134 [firstName, LastName] [hasFriend] [firstName] KeysA KeysB [firstName, LastName] [hasFriend, firstName] KeysAB [Symeonidou et al. 14]
  • 135. DATA LINKING USING ALMOST KEYS Goal: Compare linking results using almost keys with different n Evaluation of linking using • Recall • Precision • F-Measure Datasets • OAEI 2010 • OAEI 2013 Conclusion • Linking results using n-almost keys are the better than using keys 135 [Symeonidou et al. 14]
  • 136. EXAMPLE: DATA LINKING USING ALMOST KEYS OAEI 2013 - Person • BirthName, BirthDate, award, comment, label, BirthPlace, almaMater, doctoralAdvisor Almost keys Recall Precision F-Measure 0-almost key {BirthDate, award} 9.3% 100% 17% 2-almost key {BirthDate} 32.5% 98.6% 49% # exceptions Recall Precision F-measure 0, 1 25.6% 100% 41% 2, 3 47.6% 98.1% 64.2% 4, 5 47.9% 96.3% 63.9% 6, ..., 16 48.1% 96.3% 64.1% 17 49.3% 82.8% 61.8% 136 [Symeonidou et al. 14]
  • 137. KD2R VS. SAKEY - NON KEY DISCOVERY Class # triples # Instances #Properties KD2R Runtime SAKey Runtime (n=0) DB:Website 8506 2870 66 13min 1s YA:Building 114783 54384 17 26s 9s DB:BodyOfWater 1068428 34000 200 outOfMem. 37s DB:NaturalPlace 1604348 49913 243 outOfMem. 1min10s Dbpedia class= DB:NaturalPlace 137 [Symeonidou et al. 14]
  • 138. KD2R VS. SAKEY - KEY DERIVATION Class # non keys # keys KD2R SAKey (n=0) DB:Lake 50 480 1min10s 1s DB:Mountain 49 821 8min 1s DB:BodyOfWater 220 3846 > 1 day 66s DB:NaturalPlace 302 7011 > 2 days 5min Dbpedia class= DB:BodyOfWater 138 [Symeonidou et al. 14]
  • 139. KEY DISCOVERY APPROACHES CWA approaches • Keys and Pseudo-Keys Detection for Web Datasets Cleaning and Interlinking • ROCKER: A Refinement Operator for Key Discovery OWA approaches • KD2R: a Key Discovery method for reference reconciliation • SAKey: Scalable almost key discovery in RDF data • Linkkey: Data interlinking through robust Linkkey extraction • VICKEY: Conditional key discovery 139
  • 140. LINKKEY 140 Ontology1 Ontology2 Data of classA Data of classB Linkkey Discovery Given a pair of classes in two datasets conforming to two different ontologies: 1) Discover Linkkeys – maximal sets of property pairs that can link instances of two different classes [Atencia et al. 2014]
  • 141. LINKKEY 141 Person1 Literal LastName Literal Literal Person2 Literal LN Literal Literal Given a pair of classes in two datasets conforming to two different ontologies: 1) Discover Linkkeys – maximal sets of property pairs that can link instances of two different classes [Atencia et al. 2014]
  • 142. LINKKEY [ATENCIA ET AL. 2014] 142 Person1 Literal LastName Literal Literal Person2 Literal LN Literal Literal Owl:EquivalentClass Given a pair of classes in two datasets conforming to two different ontologies: 1) Discover Linkkeys – maximal sets of property pairs that can link instances of two different classes
  • 143. LINKKEY [ATENCIA ET AL. 2014] 143 Person1 Literal LastName Literal Literal Person2 Literal LN Literal Literal Owl:EquivalentClass LastName Nationality Profession P11 Tompson Greek P12 Dupont French Researcher LN NL PR P21 Tompson Greek P22 Tompson Greek, French Reseacher Given a pair of classes in two datasets conforming to two different ontologies: 1) Discover Linkkeys – maximal sets of property pairs that can link instances of two different classes
  • 144. LINKKEY [ATENCIA ET AL. 2014] 144 Person1 Literal LastName Literal Literal Person2 Literal LN Literal Literal Owl:EquivalentClass Ex. {<LastName,LN>,<Nationality,NL>}, {<Profession,PR>,<Nationality,NL>} LastName Nationality Profession P11 Tompson Greek P12 Dupont French Researcher LN NL PR P21 Tompson Greek P22 Tompson Greek, French Reseacher Given a pair of classes in two datasets conforming to two different ontologies: 1) Discover Linkkeys – maximal sets of property pairs that can link instances of two different classes
  • 145. LINKKEY [ATENCIA ET AL. 2014] 145 Given a pair of classes in two datasets conforming to two different ontologies: 1) Discover Linkkeys – maximal sets of property pairs that can link instances of two different classes 2) Linkkey quality evaluation: 1) use measures of discriminability and coverage for non supervised linkkey extraction 2) an approximation of precision and recall for the supervised case. Ex. {<LastName,LN>,<Nationality,NL>}, {<Profession,PR>,<Nationality,NL>} LastName Nationality Profession P11 Tompson Greek P12 Dupont French Researcher LN NL PR P21 Tompson Greek P22 Tompson Greek, French Reseacher
  • 146. CONDITIONAL KEYS COLLABORATION WITH: LUIS GALARAGA (INRIA RENNES) NATHALIE PERNELLE (LRI, UNIV. PARIS SUD), FABIAN SUCHANEK (TELECOM PARISTECH) DANAI SYMEONIDOU (INRA, MONTPELLIER) 146 Symeonidou et al. 2017
  • 147. KEYS IN KNOWLEDGE BASES To obtain keys found in Knowledge Bases (KBs) • Different automatic key discovery approaches (eg. SAKey, KD2R) Key limitations • Cases of datasets having no keys • Keys are generic, i.e., they are true for every instance of a class in a dataset FirstName LastName Gender Lab Nationality instance1 Claude Dupont Female Paris-Sud France instance2 Claude Dupont Male Paris-Sud Belgium instance3 Juan Rodríguez Male INRA Spain, Italy instance4 Juan Salvez Male INRA Spain instance5 Anna Georgiou Female INRA Greece, France instance6 Pavlos Markou Male Paris-Sud Greece instance7 Marie Legendre Female INRA France Instances of the class Person Key: {LastName, Gender} 147 Symeonidou et al. 2017
  • 148. PROBLEM STATEMENT Motivating example • In many countries of the world, a student can be supervised by multiple supervisors • In German Universities, a student can be supervised by only one professor • {supervises} key for the instances of the class Professor with “condition” that they are in a German university Approximate keys [Sakey] express keys that do not uniquely identify every instance of a class in a dataset Approximate keys do not express the conditions under which they are true 148 Symeonidou et al. 2017
  • 149. CONDITIONAL KEYS BY VICKEY Conditional key: a key, valid for instances of a class satisfying a specific condition • Condition part: pairs of property and value Eg. {Lab=INRA}, {Gender=Male}, {Gender=Female, Lab=INRA} etc. • Key part: a set of properties FirstName LastName Gender Lab Nationality instance1 Claude Dupont Female Paris-Sud France instance2 Claude Dupont Male Paris-Sud Belgium instance3 Juan Rodríguez Male INRA Spain, Italy instance4 Juan Salvez Male INRA Spain instance5 Anna Georgiou Female INRA Greece, France instance6 Pavlos Markou Male Paris-Sud Greece instance7 Marie Legendre Female INRA France {LastName} is a key under the condition {Lab=INRA} Instances of the class Person 149 Symeonidou et al. 2017
  • 150. CONDITIONAL KEY QUALITY MEASURES Support: #instances both satisfying condition part and instantiating key part Coverage: Support/#all_Instances Support: 4 Coverage: 4/7 = 0.57 FirstName LastName Gender Lab Nationality instance1 Claude Dupont Female Paris-Sud France instance2 Claude Dupont Male Paris-Sud Belgium instance3 Juan Rodríguez Male INRA Spain, Italy instance4 Juan Salvez Male INRA Spain instance5 Anna Georgiou Female INRA Greece, France instance6 Pavlos Markou Male Paris-Sud Greece instance7 Marie Legendre Female INRA France {LastName} is a key under the condition {Lab=INRA} Instances of the class Person 150 Symeonidou et al. 2017
  • 151. VICKEY: MINING EFFICIENTLY CONDITIONAL KEYS Goal of VICKEY • Given all instances of a class in a dataset, a min_support, a min_coverage, mine all minimal conditional keys with • support ≥ min_support • coverage ≥ min_coverage Exponential search space (O(|V||P|)with • V, the set of objects of a given class in the dataset • P, the set of properties of a given class in the dataset 151 Symeonidou et al. 2017 Conditional keys can be efficiently discovered from the non-keys
  • 152. VICKEY: MINING EFFICIENTLY CONDITIONAL KEYS Non-key: a set of properties where two instances share at least one value for each property in the non-key (SAKey) Given a maximal non-key • Condition part: subset of properties of the non-key • Key part: subset of the remaining properties of the non-key FirstName LastName Gender Lab Nationality instance1 Claude Dupont Female Paris-Sud France instance2 Claude Dupont Male Paris-Sud Belgium instance3 Juan Rodríguez Male INRA Spain, Italy instance4 Juan Salvez Male INRA Spain instance5 Anna Georgiou Female INRA Greece, France instance6 Pavlos Markou Male Paris-Sud Greece instance7 Marie Legendre Female INRA France Non-key {FirstName, Gender, Lab, Nationality} Instances of the class Person 152 Symeonidou et al. 2017
  • 153. CONDITIONAL KEY GRAPH EXPLORATION • Given a non-key • Step 1: Discover all minimal conditional keys with condition of size 1 • Step 2: Discover all minimal conditional keys with condition of size 2 • … • Step n: Discover all minimal conditional keys with condition of size n 153 Symeonidou et al. 2017
  • 154. • Given a non-key • Step 1: Discover all minimal conditional keys with condition of size 1 {p=a} • Step 2: Discover all minimal conditional keys with condition of size 2 {p1=a1^p2=a2} • … • Step n: Discover all minimal conditional keys with condition of size n {p1=a1^…^pn=an} FirstName Lab FirstName Lab FirstName Nationality Lab Nationality FirstName Nationalit y Lab Gender=Female Nationality Condition Conditional key graph Non-key {FirstName, Gender, Lab, Nationality} 154 CONDITIONAL KEY GRAPH EXPLORATION Symeonidou et al. 2017
  • 155. • Given a non-key • Step 1: Discover all minimal conditional keys with condition of size 1 {p=a} • Step 2: Discover all minimal conditional keys with condition of size 2 {p1=a1^p2=a2} • … • Step n: Discover all minimal conditional keys with condition of size n {p1=a1^…^pn=an} FirstName Lab FirstName Lab FirstName Nationalit y Lab Nationality FirstName Nationalit y Lab Gender=Female Nationality Condition Conditional key graph Key {FirstName} for cond {Gender=Female} Non-key {FirstName, Gender, Lab, Nationality} 155 CONDITIONAL KEY GRAPH EXPLORATION Symeonidou et al. 2017
  • 156. • Given a non-key • Step 1: Discover all minimal conditional keys with condition of size 1 {p=a} • Step 2: Discover all minimal conditional keys with condition of size 2 {p1=a1^p2=a2} • … • Step n: Discover all minimal conditional keys with condition of size n {p1=a1^…^pn=an} Gender=Female Condition Conditional key graph firstName firstName lab firstName nationalit y lab nationalit y firstName nationalit y lab FirstNam e FirstName Lab FirstName Nationalit y FirstName Nationalit y Lab Lab Nationalit y NationalityLab Key {FirstName} for cond {Gender=Female} Non-key {FirstName, Gender, Lab, Nationality} 156 CONDITIONAL KEY GRAPH EXPLORATION Symeonidou et al. 2017
  • 157. • Given a non-key • Step 1: Discover all minimal conditional keys with condition of size 1 {p=a} • Step 2: Discover all minimal conditional keys with condition of size 2 {p1=a1^p2=a2} • … • Step n: Discover all minimal conditional keys with condition of size n {p1=a1^…^pn=an} Lab=INRA Condition Conditional key graph Key {FirstName} for cond {Gender=Female} FirstNam e FirstName Gender FirstName Nationalit y Gender Nationalit y FirstName Nationalit y Gender NationalityGender Non-key {FirstName, Gender, Lab, Nationality} 157 CONDITIONAL KEY GRAPH EXPLORATION Symeonidou et al. 2017
  • 158. • Given a non-key • Step 1: Discover all minimal conditional keys with condition of size 1 {p=a} • Step 2: Discover all minimal conditional keys with condition of size 2 {p1=a1^p2=a2} • … • Step n: Discover all minimal conditional keys with condition of size n {p1=a1^…^pn=an} Condition Conditional key graph Gender=Female ^ Lab=INRA Nationality 158 CONDITIONAL KEY GRAPH EXPLORATION Symeonidou et al. 2017
  • 159. EXPERIMENTAL EVALUATION Evaluation of VICKEY’s scalability in large datasets Evaluation of conditional keys in data linking 159 Symeonidou et al. 2017
  • 160. VICKEY’S SCALABILITY Run VICKEY in large datasets Compare the runtime results with AMIE - a generic rule mining approach adapted to mine conditional keys • Both approaches take as input non-keys mined by SAKey to reduce the search space of conditional key mining Coverage 1% 160 Symeonidou et al. 2017
  • 161. RUNTIME RESULTS - VICKEY VS. AMIE[2] Class* #Triples #Instances #Properties #NonKeys VICKEY AMIE #ConditionalKeys Actor 57.2k 5.8k 71 137 4.52m 12.58h 311 Album 786.1k 85.3k 39 68 1.53h 3.90h 304 Book 258.4k 30.0k 51 95 11.84h >1d 419 Film 832.1k 82.6k 74 132 1.37h 3.64h 185 Mountain 127.8k 16.4k 58 47 2.86m 23.57m 257 Museum 12.9k 1.9k 65 17 1.46s 6.45s 58 Organization 1.82M 178.7k 553 3221 26.32h > 36h 28 Scientist 258.5k 19.7k 73 309 27.67m > 1d 582 University 85.5k 8.7k 89 140 14.45h >1d 941 *All used classes are obtained from DBpedia 161 Symeonidou et al. 2017
  • 162. DATA LINKING - VICKEY VS. SAKEY[1] Goal: link two datasets using • Classical keys discovered by SAKey[1] • Conditional keys discovered by VICKEY • Supervises(Prof1, stud1)^Supervises(Prof2, stud1)^teachesIn(Prof1, “Gemany”)^teachesIn(Prof2, “Gemany”)=> sameAs(Prof1, Prof2) • Both classical keys and conditional keys Evaluate obtained links using the existing goal-standard with • Recall • Precision • F-Measure 162 Symeonidou et al. 2017
  • 163. DATA LINKING - VICKEY VS. SAKEY[1] Link two knowledge bases containing information of Wikipedia • Yago[3] • DBpedia[4] Used classes of DBpedia and Yago • Actor • Album • Book • Film • Mountain • Museum • Organization • Scientist • University 163 Symeonidou et al. 2017
  • 164. DATA LINKING - VICKEY VS. SAKEY[1] Class Recall Precision F-Measure Actor Keys[1]* 0.27 0.99 0.43 Conditional keys** 0.57 0.99 0.73 Keys[1]+Conditional keys 0.6 0.99 0.75 Album Keys[1] 0 1 0.00 Conditional keys 0.15 0.99 0.26 Keys[1]+Conditional keys 0.15 0.99 0.26 Film Keys[1] 0.04 0.99 0.08 Conditional keys 0.38 0.96 0.54 Keys[1]+Conditional keys 0.39 0.98 0.55 Museum Keys[1] 0.12 1 0.21 Conditional keys 0.25 1 0.40 Keys[1]+Conditional keys 0.31 1 0.47 x 869 x 1.75 x 7.1 x 2.19 *Keys[1] from SAKey **Conditional keys from VICKEY 164 Symeonidou et al. 2017
  • 165. SUMMARY Keys and almost keys: • Better linking results in terms of recall with n-almost keys than with sure keys • Optimistic heuristic gets better results than pessimistic one • Can handle size classes much larger than KD2R (DB: Natural Place more than 16 million triplets) Conditional keys: • Better linking results in terms of recall when conditional keys are used. • Aside data linking, may be used to refine a class hierarchy Improvements: • Numerical values • Dataset heterogeneity When no (almost-)keys neither conditional keys are discoverable from the dataset è Possible solution: Referring expressions 165
  • 166. LINK INVALIDATION / REQUALIFICATION WORK IN COLLABORATION WITH: NATHALIE PERNELLE (LRI, UNIV. PARIS SUD), LAURA PAPALEO (GENOVA PROVINCE, ITALY) JOE RAAD (PHD STUDENT, AGROPARISTECH & UNIVERSITÉ PARIS SUD. ) 166
  • 167. • owl:sameAs, indicates that two different descriptions refer to the same entity • owl:sameAs semantics is too strict • Reflexive, symmetric, transitive and • Property sharing: " X " Y owl:sameAs(X, Y)Ù p(X, Z) Þ p(Y, Z) • Automatic data linking tools do not guarantee 100% precision, because of: • Errors, missing information, data freshness, … • [Halpin et al. 2010] showed that 37% of owl:sameAs links randomly selected among 250 identity links between books were incorrect. • Problem: how to (semi)-automatically invalidate/requalify owl:sameAs links? IDENTITY CRISIS b1 b2 owl:sameAs b4 b3 owl:sameAs b1 b5 owl:sameAs b5 b6 owl:sameAs 167
  • 168. IDENTITY CRISIS: SOME SOLUTIONS b1 b2 owl:sameAs b4 b3 owl:sameAs b1 b5 owl:sameAs b5 b6 owl:sameAs 168 • [Halpin et al. 2010]: propose ontology of identity and invalidation of identity links by crowdsourcing. • [de Melo 2013]: uses the Unique Name Assumption and the transitivity of links to detect inconsistencies in the data. • [Papaleo et al. 2014, Papageorgiou et al. 2017]: exploit some ontology axioms to logically/numerically detect invalid identity links. • [Raad et al. 2017] computes contextual identity links for each pair of instances
  • 169. LINK INVALIDATION [QUALINCA ANR PROJECT (2012-2016)] WORK IN COLLABORATION WITH: - NATHALIE PERNELLE (LRI, UNIV. PARIS SUD) - LAURA PAPLEO(GENOVA PROVINCE, ITALY) 169
  • 170. LOGICAL AND NUMERICAL APPROACH FOR LINK INVALIDATION • Two ontology-based methods to detect invalid sameAs statements: a logical method and a numerical method • We build a contextual graph «around» each one of the two resources involved in the sameAs by exploiting ontology axioms on: • functionality and inverse functionality of properties and • local completeness of some properties (e.g., the author list of a book). • We analyse the descriptions provided in these contextual graphs to eventually detect inconsistencies or high dissimilarities. 170
  • 171. LOGICAL METHOD F is the set of RDF facts enriched by a set of ¬synVals facts in the form ¬synVals(w1, w2) w1 and w2, being literals and different. 171of 20 EXAMPLES: - notSynVals(‘231’,’100’) for a functional property numOfPages -notSynVals(‘New York’, ‘Paris’) for a functional property cityName … knowledge from expert or extracted. 171 [Papaleo et al. 2014]
  • 172. LOGICAL METHOD R the set of rules 172of 20 (inverse) functional properties local complete properties sameAs(x,y) ÙnumOfPages(x,w1) Ù numOfPages(y,w2) à SynVals(w1,w2) sameAs(x,y) Ù hasAuthor(x,w1) à hasAuthor(y,w1) 172 [Papaleo et al. 2014]
  • 173. LOGICAL INVALIDAITON • If the property nb-pages is declared as functional then: nbPages(b1, n1) Ù nbPages(b2, n2) Ù (b1=b2) Þ n1=n2. è owl:sameAs(b1, b2) est faux. b1 b2 a11 a21 owl:sameAs(b1, b2) ? author author A Semantic Web Primer titre A Semantic Web Primer titre 2007 pubYear 2007 pubYear G. Antoniou aName Grigoris Antoniou aName author a12 author a22 208 nbPages 288nbPages Paul Grauth aName P. Grauth aName 173 [Papaleo et al. 2014]
  • 174. b1 e1 b2 a11 e2 a21 owl:sameAs(b1, b2) ? author author publisherpublisher A Semantic Web Primer title A Semantic Web Primer title 2008 pubYear 2007 pubYear G. Antoniou aName Grigoris Antoniou aName author … a12 … author … a22 … r2n r21ref ref …r1m r11 ref ref … NUMERICAL METHOD: EXAMPLE • P= {title, pubYear, publisher, author, aName, pName} • Sim(“A Semantic Web …``, “A Semantic Web …``) = 1, Sim(“ 2007“, “2008“) = 0 • Sim(“MIT Press``, “MIT Press``) = 1, è CSim(e1, e2) = 1 • Sim(“Grigoris Antoniou``, “G. Antoniou``) = 0.5, ... è CSim(a11, a21) = 0.5, …. è CSim(a12, a22) = 0.5 MIT Press MIT Press pName pNam e • CSim(b1, b2) = 0.62 if agregation function is average and • CSim(b1, b2) = 0 if agregation function is minimum 174 [Papageorgio et al. 2017]
  • 175. COMPARAISON LOGICAL/NUMERICAL Logical method [Papaleo, Pernelle and Saïs (2014)] Numerical method (Agregation = average) [Papageorgiou, Pernelle and Saïs 2017] Datasets Precision Recall F-measure Precision Recall F-measure thresh old Person1 0.69 0.98 0.81 1.0 0.98 0.99 0.3 Person2 0.5 1.0 0.67 0.994 0.989 0.99 0.2 Restaurant 0.63 0.97 0.77 0.97 1.0 0.98 0.4 • Average gain of 23% F-measure (significant increase in precision, comparable recall) 175
  • 176. CONTEXTUAL IDENTITY [JOE RAAD PHD LIONES (2015-2018) PROJECT, FUNDED BY CDS PARIS SACLAY] WORK IN COLLABORATION WITH: - NATHALIE PERNELLE (LRI, UNIV. PARIS SUD), - JULIETTE DIBIE (AGROPARISTECH, INRA), - LILIANA IBANESCO (AGROPARISTECH, INRA), - CAROLINE PENICAUD (INRA PARIS) 176
  • 177. • owl:sameAs, indicates that two different descriptions refer to the same entity • owl:sameAs semantics is too strict • Reflexive, symmetric, transitive and • Property sharing: " X " Y owl:sameAs(X, Y)Ù p(X, Z) Þ p(Y, Z) • In some specific domains, the identity depends on the context. Examples: • same-components and same-quantity è same-juice. [Hypocaloric diet context] • proportionally-same-components è same-juice. [Shopping list context] • Problem: how to automatically compute the contextual identity links? CONTEXTUAL IDENTITY 177 [Raad et al. 2014]
  • 178. 178 CONTEXTUAL IDENTITY Proposition of a new contextual identity link • A new predicate to represent identity links valid in explicit contexts • Hierarchy of the contexts and the identity links • Defining identity contexts for each class (local contexts) • Defining identity contexts for each class and its connected graph (global context) [Raad et al. 2014]
  • 179. CONTEXTUAL IDENTITY Contextual Identity Link Example Πa(Juice) = { (Juice, {rdf:Type, expiryDate}, {isComposedOf}), (Banana, {rdf:Type}, {isComposedOf -1}), (Strawberry, {rdf:Type}, {hasAttribute, isComposedOf -1}), (Weight, {rdf:Type, hasValue, hasUnit}, {hasAttribute-1}) } identiConTo<Πa(Juice)>(juice1, juice2) 179 [Raad et al. 2014]
  • 180. CONTEXTUAL IDENTITY Global Context • π(ck): local context of the class ck =(ck, DPck, Opck) • src: a given source class of the ontology • G: connected sub-graph containing src • CG: set of classes belonging to G • Ck: class belonging to CG Π(src) =Uck ∈ CG π(ck) 180 [Raad et al. 2014]
  • 181. CONTEXTUAL IDENTITY Πa(Juice) ≤ Πc(Juice) Πb(Juice) ≤ Πc(Juice) Πa(Juice) Πb(Juice) Πc(Juice) … … each local context in Πc(Juice) is less specific or equal to its corresponding local context in Πa(Juice) 181 [Raad et al. 2014]
  • 182. DECIDE METHOD – DETECTION OF CONTEXTUAL IDENTITY It automatically detects and adds these contextual identity links in the knowledge graph DECIDE DEtection of Contextual IDEntity Knowledge Graph Source Class Unwanted Properties Paired Properties Necessary Properties For each pair of instances (i1, i2) of the source class set of the most specific global contexts in which (i1, i2) are identical 182 [Raad et al. 2014]
  • 183. EXPERIMENTS - LIONES PROJECT Transformation of Micro-organisms Digestion Process ≈ 1000 files ≈ 2000 files Guidelines CellExtraDry - Classes: ≈ 4 700 - Individuals: ≈ 415 000 - Statements: ≈ 1 700 000 Carredas - Classes : ≈ 5 000 - Individuals: ≈ 42 000 - Statements : ≈ 237 000 RDFRDF 183 [Raad et al. 2014]
  • 184. EXPERIMENTS - LIONES PROJECT Mixture DECIDE DEtection of Contextual IDEntity CellExtraDry + Carredas 184 [Raad et al. 2014]
  • 185. EXPERIMENTS - LIONES PROJECT identiConTo<Π(Mixture)25>(i, j): two mixtures have the same nature (e.g. humid) and the same components (water, yeast, mono-potassium, phosphate, and glucose), with each of these components having identical weights (value + unit of measure) Generation of rules for missing observations prediction 185 [Raad et al. 2014]
  • 186. INVALIDATION / REQUALIFICATION: SUMMARY • Contextual identity useful for profiling identity links • Useful for giving explanations for possible experts validation • Allow to generate missing values prediction rules. • Future directions: • Is Contextual difference relevant/useful? • How to make explicit incomplete information for a pair of resources? • Use when the ontologies are different • SameAs link invalidation • Improves the precision without decreasing the recall • May be used for error detection and/or anomalies in ontology axioms • Future directions: • Provide comprehensible explanations for experts • Use when the ontologies are different 186
  • 187. SOME FUTURE CHALLENGES q Data evolution q ontology axiom evolution q link evolution q temporal data linking q Data veracity è what is the truth? q Knowledge discovery in complex data q key graphs q causality rules q Explanation problem q provenance of the data and the data/knowledge processes 187
  • 189. REFERENCES (1) [Atencia et al. 2014] Data interlinking through robust Linkkey extraction. Atencia, Manuel, Jérôme David, and Jérôme Euzenat. ECAI, 2014. [Atencia et al.’12] Keys and Pseudo-Keys Detection for Web Datasets Cleansing and Interlinking. Manuel Atencia, Jérôme David, François Scharffe. In EKAW 2012 [Cohen et al. 2003] A comparison of string distance metrics for name-matching tasks. William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. In IIWEB@AAAI 2003. [Ferrara13] Evaluation of instance matching tools: The experience of OAEI. Alfio Ferrara, Andriy Nikolov, Jan Noessner, François Scharffe. OM@ISWC 2013 [Hu et al. 2011] A Self-Training Approach for Resolving Object Coreference on the Semantic Web. Wei Hu, Jianfeng Chen, Yuzhong Qu. In WWW 2011 [Kang et al. 2008] Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its Evaluation. Kang, Getoor, Shneiderman, Bilgic, Licamele, In IEEE Trans. Vis. Comput. Graph2008 189
  • 190. REFERENCES (2) [Lenat and Feigenbaum 1991] On the threshold of knowledge. Douglas B. Lenat and Edward A. Feigenbaum In Artificial Intelligence 47 (1991) [Nikolov et al’12] Unsupervised Learning of Link Discovery Configuration Andriy Nikolov, Mathieu d'Aquin, Enrico Motta. In ESWC 2012. [Pernelle et al.’13] An Automatic Key Discovery Approach for Data Linking. Nathalie Pernelle, Fatiha Saïs. and Danai Symeounidou. In Journal of Web Semantics [Saïs et al.07] L2R: a Logical method for Reference Reconciliation. Fatiha Saïs, Nathalie Pernelle and Marie-Christine Rousset. In AAAI 2007. [Saïs et al.09] Combining a Logical and a Numerical Method for Data Reconciliation. Fatiha Saïs., Nathalie Pernelle and Marie-Christine Rousset. In Journal of Data Semantics. [Shvaiko,Euzenat13] Ontology Matching: State of the Art and Future Challenges, Pavel Shvaiko, Jérôme Euzenat. In TKDE 2013 190
  • 191. REFERENCES (3) [Suchanek11] PARIS: Probabilistic Alignment of Relations, Instances, and Schema Fabian Suchanek, Serge Abiteboul, Pierre Senellart. In VLDB 2011. [Soru et al. 2015] ROCKER: a refinement operator for key discovery. Soru, Tommaso, Edgard Marx, and Axel-Cyrille Ngonga Ngomo. In WWW, 2015. [Symeonidou et al. 2014] SAKey: Scalable almost key discovery in RDF data. Symeonidou, Danai, Vincent Armant, Nathalie Pernelle, and Fatiha Saïs. In ISWC 2014. [Symeonidou et al. 2017] VICKEY: Mining Conditional Keys on RDF datasets . Danai Symeonidou, Luis Galarraga, Nathalie Pernelle, Fatiha Saïs and Fabian Suchanek. In ISWC 2017. [Papageorgiou et al. 2017] Approche numérique pour l'invalidation de liens d'identité (owl:SameAs). Dimitrios Christaras Papageorgiou, Nathalie Pernelle and Fatiha Saïs. In IC 2017. 191
  • 192. REFERENCES (4) [Papaleo et al. 2014] Logical Detection of Invalid SameAs Statements in RDF Data, Laura Papaleo, Nathalie Pernelle, Fatiha Saïs and Cyril Dumont. In EKAW 2014 [Volz et al'09] Silk – A Link Discovery Framework for the Web of Data. Julius Volz, Christian Bizer et al. In WWW 2009. [Zheng et al. 2013] Results for OAEI 2013 Qian Zheng, Chao Shao, Juanzi Li, Zhichun Wang and Linmei Hu. OM@ISWC 2010 192
  • 193. THANKS 193 N. Pernelle M.C. Rousset D. Symeonidou J. Raad L. Papaleo N. Niraula