Of Sampling and Smoothing: Approximating Distributions over Linked Open Data
1. Institute for Web Science & Technologies – WeST
Thomas Gottron
May 26th, 2014
PROFILES Workshop, Crete
2. Thomas Gottron PROFILES 26.5.2014, Approximating Distributions over LOD
Distributions over Linked Data
Probability to observe a certain pattern k: P(k) = ?
Pattern types, with the examples shown in the slide diagram:
Predicates: foaf:knows
RDF class types: rdf:type foaf:Person
Property Sets: subject ?x
Type Sets: ?y rdf:type foaf:Person, dbpedia:Actor
ECS: ?z, dbpedia:Actor, foaf:knows
3. Distributions over Linked Data
Effectively: Estimate a distribution over pattern instances k_i
Applications:
Query federation
Data Mining
Schema inferencing
[Figure: probability p over pattern instances k1, k2, k3, ..., kn]
4. Distributions over Linked Data
Using entire LOD cloud becomes less and less feasible
Solution:
Operate on a sample
Challenges:
How to sample?
How to deal with unobserved instances of a pattern?
[Figure: probability p over pattern instances k1, k2, k3, ..., kn]
5. Sampling Linked Open Data
6. Data Format
Linked Data as N-Quads: (s, p, o, c)
triple (s, p, o) – what is the information?
context URI c – where does it come from?
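To make the quad structure concrete, here is a minimal sketch of splitting one N-Quads line into (s, p, o, c). The regular expression and the name `parse_nquad` are illustrative assumptions, not part of the talk. The pattern is deliberately simplified: it requires an IRI as context and does not cover the full N-Quads grammar.

```python
import re

# Simplified N-Quads line pattern (an assumption for illustration only):
# subject = IRI or blank node, predicate = IRI,
# object = IRI, blank node, or literal, context = IRI.
NQUAD = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+'
    r'(<[^>]*>)\s+'
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)\s+'
    r'(<[^>]*>)\s*\.\s*$'
)

def parse_nquad(line):
    """Return (s, p, o, c) for a simple N-Quads line, or None if it does not match."""
    match = NQUAD.match(line)
    return match.groups() if match else None

line = ('<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> '
        '<http://example.org/bob> <http://example.org/doc1> .')
s, p, o, c = parse_nquad(line)
```

The triple (s, p, o) carries the statement itself, while c records the source it was retrieved from. For real data a dedicated RDF parser is the safer choice.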
7. Sampling Strategies
Triple (Edge) Based Sampling: sampling unit is a triple (s, p, o)
Unique Subject URI (Node) Based Sampling: sampling unit is a subject URI s
Context Based Sampling: sampling unit is a context c
For all sampling approaches:
Unbiased sampling based on uniform distribution
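The three strategies differ only in the unit that is drawn uniformly at random. The sketch below illustrates this on quads stored as Python tuples (s, p, o, c). The function names are hypothetical. It also assumes that for subject and context sampling all quads belonging to a chosen unit are kept, which the slides do not state explicitly.

```python
import random

def sample_triples(quads, rate, rng):
    """Triple (edge) based: draw individual quads uniformly without replacement."""
    return rng.sample(quads, round(rate * len(quads)))

def sample_by_unit(quads, rate, rng, unit):
    """Draw units uniformly, then keep every quad of a chosen unit (assumption)."""
    units = sorted({unit(q) for q in quads})
    chosen = set(rng.sample(units, round(rate * len(units))))
    return [q for q in quads if unit(q) in chosen]

def sample_subjects(quads, rate, rng):
    """Unique subject URI (node) based: the sampling unit is the subject s."""
    return sample_by_unit(quads, rate, rng, unit=lambda q: q[0])

def sample_contexts(quads, rate, rng):
    """Context based: the sampling unit is the context c."""
    return sample_by_unit(quads, rate, rng, unit=lambda q: q[3])

quads = [
    ("s1", "p", "o1", "c1"),
    ("s1", "p", "o2", "c2"),
    ("s2", "p", "o3", "c1"),
    ("s2", "q", "o4", "c2"),
]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
by_triple = sample_triples(quads, 0.5, rng)
by_subject = sample_subjects(quads, 0.5, rng)
by_context = sample_contexts(quads, 0.5, rng)
```

Note that the sampling rate refers to the chosen unit, so a 50% subject sample need not contain 50% of the triples when subjects differ in size.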
8. Smoothing Distributions
9. Obtaining a Distribution from an Index
k1 → d1,1 d1,2 d1,3 ...
k2 → d2,1 d2,2
k3 → d3,1 d3,2 d3,3 ...
...
kn → dn,1 dn,2 dn,3 ...
→ Index as a tuple (D, K, σ): data items D, keys K, and σ(k) the entries listed for key k
https://github.com/gottron/lod-index-models
10. Obtaining a Distribution from an Index
K      count |σ(k)|
k1     4
k2     2
k3     10
...    ...
kn     8
Relative frequencies:
$P(k) = \frac{|\sigma(k)|}{M}$, where $M$ is the total count over all keys in $K$
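The relative-frequency estimate can be sketched in a few lines. The dictionary below is a toy index using only the four counts visible on the slide (4, 2, 10, 8) and ignoring the elided rows, so M = 24 here is an assumption of the example. The name `relative_frequencies` and the placeholder data items are hypothetical.

```python
def relative_frequencies(index):
    """Map each key k to |sigma(k)| / M, where M is the total number of entries."""
    counts = {k: len(items) for k, items in index.items()}
    total = sum(counts.values())  # M
    return {k: count / total for k, count in counts.items()}

# Toy index: each key maps to its list of data items (placeholders).
index = {
    "k1": ["d"] * 4,
    "k2": ["d"] * 2,
    "k3": ["d"] * 10,
    "kn": ["d"] * 8,
}
p = relative_frequencies(index)
```

With these counts, k3 receives probability 10/24. Any key that never occurs in the index receives no probability mass at all.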
11. Unobserved patterns!
Unobserved pattern instance (e.g. predicate, type sets)
Adjusted relative frequencies
K       count
k1      4 + λ
k2      2 + λ
k3      10 + λ
...     ...
kn      8 + λ
<new>   0 + λ
$P(k) = \frac{|\sigma(k)| + \lambda}{M + \lambda |K|}$
12. Unobserved patterns!
Unobserved pattern instance (e.g. predicate, type sets)
Lidstone-Smoothing with parameter λ
Laplace-Smoothing (Add-One) for λ = 1
Every key in K, including <new> with count 0, receives + λ:
$P(k) = \frac{|\sigma(k)| + \lambda}{M + \lambda |K|}$
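A minimal sketch of Lidstone smoothing over the same toy counts follows. It assumes that K contains the unobserved key as well, as the `<new>` row in the slide table suggests. The function name `lidstone` and its `unobserved` parameter are hypothetical.

```python
def lidstone(counts, lam, unobserved=()):
    """Return (count + lam) / (M + lam * |K|) for every key, seen or unseen."""
    keys = list(counts) + [k for k in unobserved if k not in counts]
    total = sum(counts.values())        # M
    denom = total + lam * len(keys)     # M + lambda * |K|
    return {k: (counts.get(k, 0) + lam) / denom for k in keys}

counts = {"k1": 4, "k2": 2, "k3": 10, "kn": 8}
laplace = lidstone(counts, 1.0, unobserved=["<new>"])   # add-one smoothing
light = lidstone(counts, 0.01, unobserved=["<new>"])    # much weaker smoothing
```

In this toy case Laplace smoothing gives the unobserved key a probability of 1/29, while λ = 0.01 shifts far less mass away from the observed keys. Both results still sum to 1.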
14. Experimental Evaluation
Obtain different distributions based on:
Sampling:
• Strategy (triple, USU, context)
• Rate: 5% to 90%
Smoothing:
• Laplace
• Lidstone with λ = 0.5, λ = 0.1 and λ = 0.01
Compare to full data set
10 iterations
15. Comparing Distributions
Information theoretic measure for comparing distributions p and q:
Kullback-Leibler Divergence: $D_{KL}(P, Q) = H(P, Q) - H(P)$
Cross-Entropy of P and Q: $H(P, Q) = -\sum_{x} P(x) \, \mathrm{ld}(Q(x))$ (ld: logarithm to base 2)
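The two formulas translate directly into code. The sketch below represents distributions as dictionaries from pattern to probability, and the function names are hypothetical. It also shows why unobserved patterns matter: if Q assigns zero probability to a pattern that P considers possible, the cross-entropy and therefore the divergence become infinite.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) * log2(Q(x)); infinite if Q(x) = 0 where P(x) > 0."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # terms with P(x) = 0 contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total -= px * math.log2(qx)
    return total

def kl_divergence(p, q):
    """D_KL(P, Q) = H(P, Q) - H(P), with H(P) = H(P, P)."""
    return cross_entropy(p, q) - cross_entropy(p, p)

full = {"a": 0.5, "b": 0.5}           # stands in for the full-data distribution
estimate = {"a": 0.25, "b": 0.75}     # stands in for a sample-based estimate
unsmoothed = {"a": 1.0}               # never observed "b"

d_est = kl_divergence(full, estimate)
d_unsmoothed = kl_divergence(full, unsmoothed)
```

Here the full-data distribution plays the role of P and the estimate the role of Q. That assignment is an assumption of this sketch, since the slide does not say which distribution takes which role.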
16. Experimental Setup
Index construction / Estimation of distributions on samples of 5%, 10%, 20%, 30%, ..., 90% and on the full data set (100%)
"Deviation" of each sample-based distribution (5%, 10%, 20%, 30%, ..., 90%) from the distribution on the full data set (100%)
17. Impact of Sampling Strategy
[Figure panels: Predicates; RDF class types; Property sets; Type sets]
18. Impact of Smoothing
[Figure panels: Predicates, context sampling; Predicates, triple sampling; ECS, context sampling; ECS, USU sampling]
19. Conclusion
Summary
Baseline for sampling and smoothing techniques
Little difference between classical smoothing techniques
Quality of context-based sampling as realistic scenario
Other samplings suitable for generating VoID descriptions
Future Work
Smarter smoothing techniques
Inspired by Language Modelling
Specific for LOD
20. Thanks!
Contact:
Thomas Gottron
WeST – Institute for Web Science and Technologies
Universität Koblenz-Landau
gottron@uni-koblenz.de