ROCKER is a refinement-operator-based approach for finding keys of a class in an RDF dataset. This presentation was held at the 24th International World Wide Web Conference in Florence, Italy.
4. Unique descriptions of resources.
Entity search.
Data integration.
Linked data compression.
Link discovery.
Question answering.
Data quality.
Keys.
6. Background.
A key is a set of properties which can distinguish
all instances of a class in a knowledge base.
[Figure: example RDF graph with the films :Oceans_Eleven and :The_Mexican and the actors :Brad_Pitt and :Julia_Roberts, connected by four :hasActor edges; each resource carries an rdfs:label (“Ocean’s Eleven”, “The Mexican”, “Brad Pitt”, “Julia Roberts”).]
8. Background.
A key is a minimal key if none of its proper subsets is also a key.
Example for the class dbpedia-owl:Film:
candidate key              distinguishable resources   key?   min-key?
{rdfs:label}               2 / 2                       yes    yes
{:hasActor}                1 / 2                       no     no
{rdfs:label, :hasActor}    2 / 2                       yes    no
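The two definitions above can be sketched in a few lines of Python. This is only an illustration, not ROCKER's code: the names `FILMS`, `signature`, `is_key` and `is_minimal_key` are hypothetical, the data is a hand transcription of the film example (assuming both films share the same two actors), and two instances are assumed to be distinguished when their sets of object values differ on at least one property.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Instances = Dict[str, Dict[str, Set[str]]]

# Assumed transcription of the example: both films share the same two actors.
FILMS: Instances = {
    ":Oceans_Eleven": {
        "rdfs:label": {"Ocean's Eleven"},
        ":hasActor": {":Brad_Pitt", ":Julia_Roberts"},
    },
    ":The_Mexican": {
        "rdfs:label": {"The Mexican"},
        ":hasActor": {":Brad_Pitt", ":Julia_Roberts"},
    },
}


def signature(values: Dict[str, Set[str]], props: Iterable[str]) -> Tuple[FrozenSet[str], ...]:
    """Object-value sets of one instance, in a fixed property order."""
    return tuple(frozenset(values.get(p, ())) for p in sorted(props))


def is_key(instances: Instances, props: Iterable[str]) -> bool:
    """A key gives every instance a different signature."""
    props = frozenset(props)
    signatures = [signature(values, props) for values in instances.values()]
    return len(set(signatures)) == len(signatures)


def is_minimal_key(instances: Instances, props: Iterable[str]) -> bool:
    """A minimal key stops being a key as soon as any one property is dropped."""
    props = frozenset(props)
    return is_key(instances, props) and not any(
        is_key(instances, props - {p}) for p in props
    )


for candidate in ({"rdfs:label"}, {":hasActor"}, {"rdfs:label", ":hasActor"}):
    print(sorted(candidate), is_key(FILMS, candidate), is_minimal_key(FILMS, candidate))
```

Checking only the subsets with one property removed is enough here because, under this definition, adding a property to a key always yields another key.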
9. Background.
A set of properties is called an n-almost-key for a class if it can distinguish all except n instances of that class.
[Figure: values of :filmedIn for four films. :Interstellar: :Canada, :Iceland, :United_States. :Blade_Runner: :United_States, :United_Kingdom. :2001_A_Space_Odyssey: :United_States, :United_Kingdom. :WALL-E: no value. One instance is marked ✗.]
11. ROCKER’s score function.
The score function expresses the rate of distinguishable instances in a class, given a set of properties (i.e., a candidate key).
[Figure: the instances :Interstellar, :Blade_Runner, :2001_A_Space_Odyssey and :WALL-E, one of them marked ✗.]
$$\mathrm{score}(\{\texttt{:filmedIn}\}) = \frac{\left|\{\, s \in S : \forall s' \in S,\ s \neq s' \Rightarrow \mathrm{discr}(s, s', \{\texttt{:filmedIn}\}) \,\}\right|}{|S|} = .75$$
An n-almost-key has a score of at least
$$\alpha = \frac{|S| - n}{|S|}.$$
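As a worked instance of the threshold, take the four example films as S and allow n = 1 exception (the choice of n is only for illustration):

```latex
\alpha = \frac{|S| - n}{|S|} = \frac{4 - 1}{4} = .75
\qquad\Longrightarrow\qquad
\mathrm{score}(\{\texttt{:filmedIn}\}) = .75 \ge \alpha
```

With these numbers, the score of .75 reported for {:filmedIn} just reaches the threshold of a 1-almost-key, while a full key (n = 0) would need a score of 1.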
13. Contribution #1
A more complete definition of key.
All object values are considered (e.g., :United_States).
Null values are accepted (e.g., :WALL-E).
[Figure: :Interstellar is :filmedIn :Canada, :Iceland and :United_States; :Blade_Runner is :filmedIn :United_States and :United_Kingdom; :WALL-E has no :filmedIn value.]
14. Properties of a key.
Key monotonicity.
Adding a property to a key yields another key.
{:p1, :p2} ∪ {:p3} = {:p1, :p2, :p3}
Non-key monotonicity.
Removing a property from a non-key yields another non-key.
{:p1, :p2, :p4} ∖ {:p2} = {:p1, :p4}
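Both monotonicity properties reduce to subset tests, which is what makes them usable for pruning. The sketch below is a hypothetical illustration (the function names are not taken from ROCKER): a candidate is known to be a key if it contains a known key, and known to be a non-key if it is contained in a known non-key.

```python
from typing import FrozenSet, Iterable

Candidate = FrozenSet[str]


def implied_key(candidate: Candidate, known_keys: Iterable[Candidate]) -> bool:
    """Key monotonicity: any superset of a known key is itself a key."""
    return any(key <= candidate for key in known_keys)


def implied_non_key(candidate: Candidate, known_non_keys: Iterable[Candidate]) -> bool:
    """Non-key monotonicity: any subset of a known non-key is itself a non-key."""
    return any(candidate <= non_key for non_key in known_non_keys)


known_keys = [frozenset({":p1", ":p2"})]
known_non_keys = [frozenset({":p1", ":p2", ":p4"})]

print(implied_key(frozenset({":p1", ":p2", ":p3"}), known_keys))      # True
print(implied_non_key(frozenset({":p1", ":p4"}), known_non_keys))     # True
```

In both cases the candidate's status follows without computing its score.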
16. Proposed approach.
We adopt a refinement operator to refine candidates.
[Figure: refinement of candidates over the properties :p1, :p2, :p3, from ∅ through {:p1}, {:p2}, {:p3} and {:p1, :p2}, {:p1, :p3}, {:p2, :p3} to {:p1, :p2, :p3}.]
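One possible refinement operator, sketched here as an assumption rather than as ROCKER's actual operator, adds a single property that comes later in a fixed ordering than every property already in the candidate. Each candidate is then generated exactly once, so the candidates form a tree rooted at ∅. The names `refine` and `walk` are hypothetical.

```python
from typing import FrozenSet, List, Sequence

Candidate = FrozenSet[str]


def refine(candidate: Candidate, properties: Sequence[str]) -> List[Candidate]:
    """Children of a candidate: add one property ordered after all of its own."""
    start = max((properties.index(p) for p in candidate), default=-1) + 1
    return [candidate | {p} for p in properties[start:]]


def walk(properties: Sequence[str]) -> List[Candidate]:
    """Every node of the refinement tree, starting from the empty candidate."""
    nodes: List[Candidate] = []
    stack: List[Candidate] = [frozenset()]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(refine(node, properties))
    return nodes


props = [":p1", ":p2", ":p3"]
tree = walk(props)
print(len([node for node in tree if node]))  # non-empty candidates over 3 properties
```

For three properties the walk visits ∅ plus seven non-empty candidates, one for each non-empty subset.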
17. Proposed approach.
Pro. The score function induces a quasi-ordering ‘≼’ over the set of all candidates.
P ≼ Q means score(P) ≤ score(Q).
Contra. Visiting the whole refinement tree is intractable: n properties yield 2ⁿ − 1 nodes.
19. Solutions to intractability.
Prune branches using key monotonicity:
• for all descendants of a key;
• for all ancestors of a non-key.
Consider only a subset of popular properties.
Provide a “fast search” option which selects one of the multiple discovery strategies.
24. Algorithm.
[Flowchart, built up step by step over slides 24 to 31.
Steps: Frontier := {∅}; Sort by score; Refine pivot, remove pivot & add children to frontier; Next child; Add to keys; Add to !keys; Halt.
Decisions: Top el. score? (< α / ≥ α); Has children? (yes / no); Ancestor of !key? (true / false); Descendant of key? (yes / no); Score? (< α / ≥ α).]
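The flowchart's edges do not survive in this transcript, so the following is only a simplified best-first search built from the same ingredients (a frontier starting at ∅, sorting by score, refining a pivot, skipping descendants of known keys, and comparing scores against α). It is not a transcription of ROCKER's algorithm. The names `DATA`, `score`, `refine` and `find_minimal_keys` are hypothetical, the data is made up, and the stand-in score (fraction of distinct value signatures, with a missing value treated as an empty set) is an assumption chosen because it never decreases when a property is added.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Set

Candidate = FrozenSet[str]

# Made-up toy data: instance -> property -> set of object values.
DATA: Dict[str, Dict[str, Set[str]]] = {
    "a": {"p1": {"x"}, "p2": {"1"}, "p3": {"u"}},
    "b": {"p1": {"x"}, "p2": {"2"}, "p3": {"u"}},
    "c": {"p1": {"y"}, "p2": {"1"}, "p3": {"v"}},
    "d": {"p1": {"y"}, "p2": {"2"}, "p3": {"w"}},
}


def score(candidate: Candidate) -> float:
    """Stand-in score: share of distinct value signatures among all instances."""
    signatures = {
        tuple(frozenset(values.get(p, ())) for p in sorted(candidate))
        for values in DATA.values()
    }
    return len(signatures) / len(DATA)


def refine(candidate: Candidate, properties: Sequence[str]) -> List[Candidate]:
    """Children of a candidate: add one property ordered after all of its own."""
    start = max((properties.index(p) for p in candidate), default=-1) + 1
    return [candidate | {p} for p in properties[start:]]


def find_minimal_keys(
    properties: Sequence[str],
    score_fn: Callable[[Candidate], float],
    alpha: float = 1.0,
) -> List[Candidate]:
    keys: List[Candidate] = []
    frontier: List[Candidate] = [frozenset()]
    while frontier:
        frontier.sort(key=score_fn, reverse=True)  # best candidate first
        pivot = frontier.pop(0)
        for child in refine(pivot, properties):
            if any(key <= child for key in keys):
                continue  # descendant of a known key: a key, but not minimal
            if score_fn(child) >= alpha:
                keys.append(child)  # keys are not refined any further
            else:
                frontier.append(child)  # non-keys stay candidates for refinement
    # A key found early may later turn out to contain a smaller key.
    return [k for k in keys if not any(other < k for other in keys)]


result = find_minimal_keys(["p1", "p2", "p3"], score)
print([sorted(key) for key in result])
```

On this toy data the search returns {p1, p2} and {p2, p3}; {p1, p2, p3} is never scored because its parent {p1, p2} is already a key. Lowering `alpha` below 1 would make the same loop look for candidates that reach an n-almost-key threshold instead.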
40. Related work on key discovery.
Linkkey (Atencia et al., 2014)
• Tool able to retrieve keys.
• Relies on an incomplete definition of key.
• State of the art for small datasets.
SAKey (Symeonidou et al., 2014)
• Tool able to retrieve keys and n-almost-keys.
• Relies on an incomplete definition of key.
• State of the art on bigger datasets.
KD2R (Symeonidou et al., 2011)
• Tool able to retrieve keys.
• Relies on an incomplete definition of key.
46. Runtime by threshold.
Retrieve all candidates whose score is above a threshold α.
Results for dataset dbpedia:Monument.
[Chart: runtime (ms) by threshold α.]
48. Contributions.
Complete definition of keys by considering multi-object properties and null values.
Better scalability: faster execution on larger datasets and lower memory consumption.
Running ROCKER without restrictions is guaranteed to return minimal keys.
49. Info and future work.
ROCKER is part of LIMES, a link discovery framework. Its source code is online at http://github.com/AKSW/rocker.
A demo is currently under development to show how ROCKER can improve data quality by searching for n-almost-keys.
We will evaluate ROCKER within the link discovery workflow, i.e., how can keys help find good link specifications?
52. Tommaso Soru
PhD student at University of Leipzig
Room P905, Fakultät für Mathematik und Informatik
Augustusplatz 10, D-04109 Leipzig, Germany
tsoru@informatik.uni-leipzig.de
http://tommaso-soru.it
Proceedings
http://www.www2015.it/documents/proceedings/proceedings/p1025.pdf