Representing and comparing probabilities: Part 2
Arthur Gretton
Gatsby Computational Neuroscience Unit,
University College London
UAI, 2017
Testing against a probabilistic model
Statistical model criticism

    MMD(P, Q; F) = sup_{‖f‖_F ≤ 1} [E_Q f − E_P f]

[Figure: densities p(x) and q(x), with the MMD witness function f*(x)]

f*(x) is the witness function.

Can we compute MMD with samples from Q and a model P?
Problem: we usually can't compute E_P f in closed form.
Stein idea

To get rid of E_P f in

    sup_{‖f‖_F ≤ 1} [E_Q f − E_P f],

we define the Stein operator

    [T_p f](x) = (1/p(x)) (d/dx)(f(x) p(x)).

Then

    E_P [T_P f] = 0,

subject to appropriate boundary conditions. (Oates, Girolami, Chopin, 2016)
Stein idea: proof

    E_p [T_p f] = ∫ [ (1/p(x)) (d/dx)(f(x) p(x)) ] p(x) dx
                = ∫ (d/dx)(f(x) p(x)) dx
                = [f(x) p(x)]_{−∞}^{∞}
                = 0
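The identity above is easy to check numerically. The sketch below (not from the slides) verifies E_p[T_p f] ≈ 0 by Monte Carlo; the choices p = N(0, 1) and f = tanh are purely illustrative.

```python
import numpy as np

def stein_op(f, fprime, score):
    """[T_p f](x) = f'(x) + f(x) * d/dx log p(x), the expanded form of
    (1/p) d/dx (f p)."""
    return lambda x: fprime(x) + f(x) * score(x)

# Standard normal model: d/dx log p(x) = -x
rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)

f = np.tanh
fprime = lambda t: 1.0 - np.tanh(t) ** 2
Tf = stein_op(f, fprime, lambda t: -t)

# Sample average of T_p f under p: close to 0 by the Stein identity
residual = abs(Tf(x).mean())
```

The boundary condition f(x)p(x) → 0 as x → ±∞ holds here since f is bounded and p is Gaussian.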
Kernel Stein Discrepancy

Stein operator:

    T_p f = ∂_x f + f ∂_x (log p)

Kernel Stein Discrepancy (KSD):

    KSD(p, q; F) = sup_{‖g‖_F ≤ 1} [E_q T_p g − E_p T_p g] = sup_{‖g‖_F ≤ 1} E_q T_p g,

since E_p T_p g = 0.
[Figure: densities p(x) and q(x), with the Stein witness function g*(x)]
Kernel Stein Discrepancy

Closed-form expression for KSD: given Z, Z′ ∼ q, then

    KSD(p, q; F) = E_q h_p(Z, Z′),

where

    h_p(x, y) := ∂_x log p(x) ∂_y log p(y) k(x, y)
               + ∂_y log p(y) ∂_x k(x, y)
               + ∂_x log p(x) ∂_y k(x, y)
               + ∂_x ∂_y k(x, y),

and k is the RKHS kernel for F.
(Chwialkowski, Strathmann, G., ICML 2016; Liu, Lee, Jordan, ICML 2016)

KSD only depends on the kernel and ∂_x log p(x). We do not need to normalize p, or sample from it.
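As a sketch of how the closed form is used in practice, the following code estimates KSD² by the V-statistic (1/n²) Σ_{i,j} h_p(x_i, x_j) in one dimension. The Gaussian kernel, its bandwidth σ, and the N(0, 1) model are illustrative choices, not the slides' settings.

```python
import numpy as np

def ksd_v(x, score, sigma=1.0):
    """V-statistic estimate of KSD^2(p, q) from a 1-D sample x ~ q,
    using a Gaussian kernel and the model score d/dx log p."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * sigma**2))
    dkx = -d / sigma**2 * k                        # d/dx k(x, y)
    dky = d / sigma**2 * k                         # d/dy k(x, y)
    dkxy = (1 / sigma**2 - d**2 / sigma**4) * k    # d/dx d/dy k(x, y)
    s = score(x)
    # h_p(x_i, x_j), matching the closed form term by term
    h = s[:, None] * s[None, :] * k + s[None, :] * dkx + s[:, None] * dky + dkxy
    return h.mean()

rng = np.random.default_rng(1)
score = lambda t: -t                   # model p = N(0, 1)
ksd_match = ksd_v(rng.standard_normal(500), score)         # q = p: near 0
ksd_shift = ksd_v(rng.standard_normal(500) + 1.0, score)   # q shifted: larger
```

The V-statistic has a small positive bias of order 1/n even when q = p, which is why a test threshold (by wild bootstrap or spectral approximation) is needed in practice.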
Statistical model criticism: Chicago crime data

[Figures: crime data with fitted model density and Stein witness function]

Model 1: Gaussian mixture with two components, with its Stein witness function.
Model 2: Gaussian mixture with ten components, with its Stein witness function.

Code: https://github.com/karlnapf/kernel_goodness_of_fit
Kernel Stein Discrepancy

Further applications: evaluation of approximate MCMC methods.
(Chwialkowski, Strathmann, G., ICML 2016; Gorham, Mackey, ICML 2017)

What kernel to use? The inverse multiquadric kernel,

    k(x, y) = (c + ‖x − y‖₂²)^β   for   β ∈ (−1, 0),

which Gorham and Mackey show guarantees that the KSD detects non-convergence.
Testing statistical dependence
Dependence testing

Given: samples from a distribution P_XY.
Goal: are X and Y independent?

[Figure: dog photos (X) paired with caption text (Y)]

"Their noses guide them through life, and they're never happier than when following an interesting scent."
"A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose."
"A responsive, interactive pet, one that will blow in your ear and follow you everywhere."

Text from dogtime.com and petfinder.com
MMD as a dependence measure?

Could we use MMD?

    MMD(P_XY, P_X P_Y; H_κ),   with P := P_XY and Q := P_X P_Y

We don't have samples from Q := P_X P_Y, only pairs {(x_i, y_i)}_{i=1}^n i.i.d. ∼ P_XY.
- Solution: simulate Q with pairs (x_i, y_j) for j ≠ i.
What kernel κ to use for the RKHS H_κ?
MMD as a dependence measure

Kernel k on images with feature space F,
kernel l on captions with feature space G.
Kernel κ on image-text pairs: are images and captions similar?
MMD as a dependence measure

Given: samples from a distribution P_XY.
Goal: are X and Y independent?

    MMD²(P̂_XY, P̂_X P̂_Y; H_κ) := (1/n²) trace(KL)

(K, L column centered)
MMD as a dependence measure

Two questions:
- Why the product kernel? There are many ways to combine kernels - why not, e.g., a sum?
- Is there a more interpretable way of defining this dependence measure?
Finding covariance with smooth transformations

Illustration: two variables with no correlation but strong dependence.

[Figure: ring-shaped sample (correlation 0.00); smooth transformations f(x) and g(y); scatter of f(x) vs g(y) (correlation 0.90)]

Mapping the variables through well-chosen smooth transformations f and g reveals the dependence as linear correlation.
Define two spaces, one for each witness

Function in F (feature map φ; kernel for the RKHS F on X):

    f(x) = Σ_{j=1}^∞ f_j φ_j(x),    k(x, x′) = ⟨φ(x), φ(x′)⟩_F

Function in G (feature map ψ; kernel for the RKHS G on Y):

    g(y) = Σ_{j=1}^∞ g_j ψ_j(y),    l(y, y′) = ⟨ψ(y), ψ(y′)⟩_G
The constrained covariance

The constrained covariance is

    COCO(P_XY) = sup_{‖f‖_F ≤ 1, ‖g‖_G ≤ 1} cov[f(x), g(y)]
               = sup_{‖f‖_F ≤ 1, ‖g‖_G ≤ 1} E_xy [ (Σ_{j=1}^∞ f_j φ_j(x)) (Σ_{j=1}^∞ g_j ψ_j(y)) ]

Fine print: feature mappings φ(x) and ψ(y) assumed to have zero mean.

Rewriting:

    E_xy [f(x) g(y)] = [f_1 f_2 ⋯] E_xy( φ(x) ψ(y)^⊤ ) [g_1 g_2 ⋯]^⊤ = f^⊤ C_{φ(x)ψ(y)} g,

where C_{φ(x)ψ(y)} := E_xy( φ(x) ψ(y)^⊤ ) is the feature covariance.

COCO: the maximum singular value of the feature covariance C_{φ(x)ψ(y)}.
Computing COCO in practice

Given a sample {(x_i, y_i)}_{i=1}^n i.i.d. ∼ P_XY, what is the empirical COCO?

COCO is the largest eigenvalue γ_max of the generalized eigenvalue problem

    [ 0          (1/n) KL ] [α]        [ K  0 ] [α]
    [ (1/n) LK   0        ] [β]  =  γ  [ 0  L ] [β],

with K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j).

Witness functions (singular vectors):

    f(x) ∝ Σ_{i=1}^n α_i k(x_i, x),    g(y) ∝ Σ_{i=1}^n β_i l(y_i, y)

Fine print: kernels are computed with empirically centered features
φ(x) − (1/n) Σ_{i=1}^n φ(x_i) and ψ(y) − (1/n) Σ_{i=1}^n ψ(y_i).

AG, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schoelkopf, and N. Logothetis, AISTATS 2005
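A minimal numerical sketch of the computation above. Rather than forming the block generalized eigenproblem, it uses the equivalent fact that the empirical squared COCO values are the eigenvalues of K̃L̃/n² (the centred Gram matrices); the Gaussian kernel and the toy data are illustrative assumptions.

```python
import numpy as np

def gauss_gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def coco(x, y, sigma=1.0):
    """Empirical COCO: largest singular value of the centred feature
    covariance, computed as sqrt(lambda_max(K~ L~)) / n."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n       # centring matrix
    Kt = H @ gauss_gram(x, sigma) @ H
    Lt = H @ gauss_gram(y, sigma) @ H
    lam = np.linalg.eigvals(Kt @ Lt).real     # real, nonnegative up to roundoff
    return np.sqrt(max(lam.max(), 0.0)) / n

rng = np.random.default_rng(2)
x = rng.standard_normal(200)
coco_dep = coco(x, x**2 + 0.1 * rng.standard_normal(200))   # dependent, uncorrelated
coco_ind = coco(x, rng.standard_normal(200))                # independent baseline
```

The dependent pair (x, x² + noise) has essentially zero linear correlation, yet its COCO value is clearly larger than the independent baseline, which only reflects the finite-sample bias.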
What is a large dependence with COCO?

[Figures: a smooth density and a rough density, each shown with 500 samples]

The density takes the form:

    P_XY ∝ 1 + sin(ωx) sin(ωy)

Which of these is the more "dependent"?
Finding covariance with smooth transformations

[Figures for each case: samples, the smooth maps f and g, and the f(x) vs g(y) scatter]

Case ω = 1: sample correlation 0.31; after the smooth maps, correlation 0.50, COCO: 0.09
Case ω = 2: sample correlation 0.02; after the smooth maps, correlation 0.54, COCO: 0.07
Case ω = 3: sample correlation 0.03; after the smooth maps, correlation 0.44, COCO: 0.04
Case ω = 4: sample correlation 0.05; after the smooth maps, correlation 0.25, COCO: 0.02
As ω increases further: sample correlation 0.01; after the smooth maps, correlation 0.14, COCO: 0.02
Case ω = 0 (uniform noise - shows the bias): sample correlation 0.01; after the smooth maps, correlation 0.14, COCO: 0.02
Dependence largest when at "low" frequencies

- As dependence is encoded at higher frequencies, the smooth mappings f, g achieve lower linear dependence.
- Even for independent variables, COCO will not be zero at finite sample sizes, since some mild linear dependence will be found by f, g (bias).
- This bias will decrease with increasing sample size.
Can we do better than COCO?

A second example with zero correlation.

[Figures: samples, the singular-vector witness functions, and the resulting scatter plots]

First singular value of the feature covariance C_{φ(x)ψ(y)}: correlation 0.80, COCO_1: 0.11
Second singular value of the feature covariance C_{φ(x)ψ(y)}: correlation 0.37, COCO_2: 0.06

The second singular value captures dependence missed by the first, which motivates combining all of them.
The Hilbert-Schmidt Independence Criterion

Writing the i-th singular value of the feature covariance C_{φ(x)ψ(y)} as

    γ_i := COCO_i(P_XY; F, G),

define the Hilbert-Schmidt Independence Criterion (HSIC):

    HSIC²(P_XY; F, G) = Σ_{i=1}^∞ γ_i²

AG, O. Bousquet, A. Smola, and B. Schoelkopf, ALT 2005; AG, K. Fukumizu, C.H. Teo, L. Song, B. Schoelkopf, and A. Smola, NIPS 2007

HSIC is MMD with the product kernel!

    HSIC²(P_XY; F, G) = MMD²(P_XY, P_X P_Y; H_κ),

where κ((x, y), (x′, y′)) = k(x, x′) l(y, y′).
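The relation HSIC² = Σ_i γ_i² can be checked numerically: with centred Gram matrices K̃ and L̃, the squared COCO values are the eigenvalues of K̃L̃/n², and their sum equals (1/n²) trace(K̃L̃). A small sketch (Gaussian kernels and toy data are illustrative assumptions):

```python
import numpy as np

def gauss_gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

rng = np.random.default_rng(3)
x = rng.standard_normal(100)
y = x + 0.5 * rng.standard_normal(100)   # dependent toy pair

n = len(x)
H = np.eye(n) - np.ones((n, n)) / n      # centring matrix
Kt = H @ gauss_gram(x) @ H
Lt = H @ gauss_gram(y) @ H

hsic_trace = np.trace(Kt @ Lt) / n**2              # (1/n^2) trace(K~ L~)
coco_sq = np.linalg.eigvals(Kt @ Lt).real / n**2   # squared COCO values gamma_i^2
hsic_sum = coco_sq.sum()                           # equals hsic_trace
```

The two quantities agree to machine precision, since the trace of a matrix is the sum of its eigenvalues.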
Asymptotics of HSIC under independence

Given a sample {(x_i, y_i)}_{i=1}^n i.i.d. ∼ P_XY, what is the empirical HSIC?

Empirical HSIC (biased):

    HSIC = (1/n²) trace(KL)

K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j) (K and L computed with empirically centered features).

Statistical testing: given P_XY = P_X P_Y, what is the threshold c_α such that P(HSIC > c_α) < α for small α?

Asymptotics of HSIC when P_XY = P_X P_Y:

    n · HSIC →_D Σ_{l=1}^∞ λ_l z_l²,    z_l ∼ N(0, 1) i.i.d.,

where λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r}, and

    h_{ijqr} = (1/4!) Σ_{(t,u,v,w)}^{(i,j,q,r)} (k_{tu} l_{tu} + k_{tu} l_{vw} − 2 k_{tu} l_{tv}).
A statistical test

Given P_XY = P_X P_Y, what is the threshold c_α such that P(HSIC > c_α) < α for small α (the probability of a false positive)?

Original time series:
    X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
    Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10

Permutation:
    X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
    Y7 Y3 Y9 Y2 Y4 Y8 Y5 Y1 Y6 Y10

Null distribution via permutation:
- Compute HSIC for {x_i, y_σ(i)}_{i=1}^n for a random permutation σ of the indices {1, ..., n}. This gives HSIC for independent variables.
- Repeat for many different permutations, get the empirical CDF.
- The threshold c_α is the (1 − α) quantile of the empirical CDF.
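The permutation procedure can be sketched as follows. This is a toy implementation, not the authors' code; the Gaussian kernels, sample size, and permutation count are illustrative choices.

```python
import numpy as np

def gauss_gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def hsic_stat(K, L):
    """Biased empirical HSIC: (1/n^2) trace(K~ L~) with centred Gram matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ K @ H @ L) / n**2

def hsic_perm_test(x, y, n_perm=200, alpha=0.05, seed=0):
    """Reject independence if the observed HSIC exceeds the (1 - alpha)
    quantile of HSIC values computed under random permutations of y."""
    rng = np.random.default_rng(seed)
    K, L = gauss_gram(x), gauss_gram(y)
    stat = hsic_stat(K, L)
    # Permuting both rows and columns of L pairs x_i with y_perm(i)
    null = [hsic_stat(K, L[np.ix_(p, p)])
            for p in (rng.permutation(len(y)) for _ in range(n_perm))]
    return stat > np.quantile(null, 1 - alpha)

rng = np.random.default_rng(4)
x = rng.standard_normal(150)
reject_dep = hsic_perm_test(x, x + 0.3 * rng.standard_normal(150))
```

Note that the Gram matrices are computed once; only L is re-indexed per permutation.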
Application: dependence detection across languages

Testing task: detect dependence between English and French text.

X (English):
"Honourable senators, I have a question for the Leader of the Government in the Senate"
"No doubt there is great pressure on provincial and municipal governments"
"In fact, we have increased federal investments for early childhood development."

Y (French):
"Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat"
"Les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions"
"Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes"

Text from the aligned Hansards of the 36th parliament of Canada,
https://www.isi.edu/natural-language/download/hansard/
Application: dependence detection across languages

Testing task: detect dependence between English and French text.
k-spectrum kernel, k = 10, sample size n = 10.

    HSIC = (1/n²) trace(KL)

(K and L column centered)
Application: dependence detection across languages

Results (for α = 0.05):
- k-spectrum kernel: average Type II error 0
- Bag-of-words kernel: average Type II error 0.18

Settings: five-line extracts, averaged over 300 repetitions, for "Agriculture" transcripts. Similar results for the Fisheries and Immigration transcripts.
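For concreteness, the k-spectrum kernel is the inner product of substring-count feature vectors, i.e. it counts shared length-k substrings weighted by their frequencies. A minimal sketch; the strings and k = 5 here are toy choices, not the experiment's settings.

```python
from collections import Counter

def spectrum_kernel(s, t, k=10):
    """k-spectrum kernel: sum over all length-k substrings u of
    (count of u in s) * (count of u in t)."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[u] * ct[u] for u in cs)

a = "honourable senators, i have a question"
b = "honorables senateurs, ma question"
same = spectrum_kernel(a, a, k=5)    # a string matched with itself
cross = spectrum_kernel(a, b, k=5)   # aligned English/French pair
```

Even across languages, aligned sentences share character 5-grams ("senat", "quest", ...), so the cross value is positive, while unrelated sentences would typically share few or none.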
Testing higher order interactions
Detecting higher order interaction

How can we detect V-structures with pairwise weak individual dependence?

    X ⊥⊥ Y,  Y ⊥⊥ Z,  X ⊥⊥ Z

[Figures: the V-structure with X and Y parents of Z; scatter plots X1 vs Y1, Y1 vs Z1, X1 vs Z1, and X1·Y1 vs Z1]

    X, Y i.i.d. ∼ N(0, 1),    Z | X, Y ∼ sign(XY) · Exp(1/√2)

Fine print: faithfulness is violated here!
V-structure discovery

[Figure: the V-structure with X and Y parents of Z]

Assume X ⊥⊥ Y has been established. The V-structure can then be detected by:
- A consistent CI test: H0: X ⊥⊥ Y | Z [Fukumizu et al. 2008, Zhang et al. 2011]
- A factorisation test: H0: (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X (multiple standard two-variable tests)

How well do these work?
Detecting higher order interaction

Generalise the earlier example to p dimensions:

    X ⊥⊥ Y,  Y ⊥⊥ Z,  X ⊥⊥ Z

    X, Y i.i.d. ∼ N(0, 1),    Z | X, Y ∼ sign(XY) · Exp(1/√2),
    X_{2:p}, Y_{2:p}, Z_{2:p} i.i.d. ∼ N(0, I_{p−1})

Fine print: faithfulness is violated here!
V-structure discovery

[Figure: test power comparison]

CI test for X ⊥⊥ Y | Z from Zhang et al. (2011), and a factorisation test, n = 500.
Lancaster interaction measure

The Lancaster interaction measure of (X_1, ..., X_D) ∼ P is a signed measure Δ_L P that vanishes whenever P can be factorised non-trivially.

    D = 2:  Δ_L P = P_XY − P_X P_Y
    D = 3:  Δ_L P = P_XYZ − P_X P_YZ − P_Y P_XZ − P_Z P_XY + 2 P_X P_Y P_Z

For example, in the case X ⊥⊥ (Y, Z), Δ_L P = 0. In general,

    (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X  ⟹  Δ_L P = 0.

...so what might be missed? The converse fails:

    Δ_L P = 0  ⇏  (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X.

Example:

    P(0,0,0) = 0.2   P(0,0,1) = 0.1   P(1,0,0) = 0.1   P(1,0,1) = 0.1
    P(0,1,0) = 0.1   P(0,1,1) = 0.1   P(1,1,0) = 0.1   P(1,1,1) = 0.2
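The counterexample can be verified directly. The sketch below computes Δ_L P for this table and checks that it vanishes even though the distribution does not factorise (it checks X ⊥⊥ (Y, Z) as one representative factorisation):

```python
import numpy as np

# The example distribution on {0,1}^3, indexed P[x, y, z]
P = np.zeros((2, 2, 2))
P[0,0,0], P[0,0,1], P[1,0,0], P[1,0,1] = 0.2, 0.1, 0.1, 0.1
P[0,1,0], P[0,1,1], P[1,1,0], P[1,1,1] = 0.1, 0.1, 0.1, 0.2

# Singleton and pairwise marginals
Px, Py, Pz = P.sum((1, 2)), P.sum((0, 2)), P.sum((0, 1))
Pxy, Pxz, Pyz = P.sum(2), P.sum(1), P.sum(0)

# Lancaster interaction measure for D = 3
dL = (P
      - Px[:, None, None] * Pyz[None, :, :]
      - Py[None, :, None] * Pxz[:, None, :]
      - Pz[None, None, :] * Pxy[:, :, None]
      + 2 * Px[:, None, None] * Py[None, :, None] * Pz[None, None, :])

vanishes = np.allclose(dL, 0)                                     # Delta_L P = 0
factorises = np.allclose(P, Px[:, None, None] * Pyz[None, :, :])  # X indep of (Y, Z)?
```

Here Δ_L P is identically zero, yet P(0,0,0) = 0.2 while P_X(0) P_YZ(0,0) = 0.15, so the factorisation fails.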
A kernel test statistic using the Lancaster measure

Construct a test by estimating ‖μ_κ(Δ_L P)‖²_{H_κ}, where κ = k ⊗ l ⊗ m:

    ‖μ_κ(P_XYZ − P_XY P_Z − ⋯)‖²_{H_κ} = ⟨μ P_XYZ, μ P_XYZ⟩_{H_κ} − 2 ⟨μ P_XYZ, μ(P_XY P_Z)⟩_{H_κ} + ⋯
A kernel test statistic using the Lancaster measure

Table: V-statistic estimators of ⟨·, ·⟩_{H_κ} (without the terms in P_X P_Y P_Z). H is the centering matrix I − n⁻¹ 1 1^⊤.

Lancaster interaction statistic (D. Sejdinovic, AG, W. Bergsma, NIPS 2013):

    ‖μ̂_κ(Δ_L P)‖²_{H_κ} = (1/n²) ((HKH) ∘ (HLH) ∘ (HMH))₊₊,

the empirical joint central moment in the feature space (∘ is the entrywise product; (·)₊₊ sums all entries of the matrix).
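The statistic is straightforward to compute from the three centred Gram matrices. A sketch; the Gaussian kernels and the toy V-structure data (mirroring the earlier example) are illustrative assumptions, including treating 1/√2 as the exponential scale.

```python
import numpy as np

def gauss_gram(z, sigma=1.0):
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def lancaster_stat(x, y, z, sigma=1.0):
    """Empirical Lancaster statistic:
    (1/n^2) * sum of entries of (HKH) o (HLH) o (HMH), o = entrywise product."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    Kt = H @ gauss_gram(x, sigma) @ H
    Lt = H @ gauss_gram(y, sigma) @ H
    Mt = H @ gauss_gram(z, sigma) @ H
    return (Kt * Lt * Mt).sum() / n**2

rng = np.random.default_rng(5)
n = 300
x, y = rng.standard_normal(n), rng.standard_normal(n)
z_dep = np.sign(x * y) * rng.exponential(1 / np.sqrt(2), n)  # V-structure example
z_ind = rng.standard_normal(n)                               # no 3-way interaction
stat_dep = lancaster_stat(x, y, z_dep)
stat_ind = lancaster_stat(x, y, z_ind)
```

The statistic is always nonnegative (K̃∘L̃ is positive semidefinite by the Schur product theorem, and the sum equals trace((K̃∘L̃)M̃)); under the three-way interaction it is clearly larger than under full independence.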
V-structure discovery

[Figure: test power comparison]

Lancaster test, CI test for X ⊥⊥ Y | Z from Zhang et al. (2011), and a factorisation test, n = 500.
Interaction for D ≥ 4

An interaction measure valid for all D (Streitberg, 1990):

    Δ_S P = Σ_π (−1)^{|π|−1} (|π| − 1)! J_π P

For a partition π, J_π associates to the joint distribution the corresponding factorisation, e.g., J_{13|2|4} P = P_{X1X3} P_{X2} P_{X4}.

[Figure: the number of partitions of {1, ..., D} (the Bell numbers) grows super-exponentially, reaching ~10^19 by D = 25]
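The growth of the number of terms in Δ_S P can be computed directly; a small sketch of the Bell numbers via the Bell triangle:

```python
def bell_numbers(D):
    """Bell numbers B_1..B_D via the Bell triangle: B_D is the number of
    partitions of {1,...,D}, i.e. the number of terms in Streitberg's measure."""
    row = [1]
    out = [1]                      # B_1 = 1
    for _ in range(D - 1):
        nxt = [row[-1]]            # each row starts with the previous row's last entry
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
        out.append(row[-1])        # B_n is the last entry of row n
    return out
```

By D = 25 the count already exceeds 10^18, which is why the Streitberg measure is impractical to estimate directly for large D.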
Part 4: Advanced topics
Convolutional networks and graph networks through kernels
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
 
Nec 602 unit ii Random Variables and Random process
Nec 602 unit ii Random Variables and Random processNec 602 unit ii Random Variables and Random process
Nec 602 unit ii Random Variables and Random process
 
Nonparametric testing for exogeneity with discrete regressors and instruments
Nonparametric testing for exogeneity with discrete regressors and instrumentsNonparametric testing for exogeneity with discrete regressors and instruments
Nonparametric testing for exogeneity with discrete regressors and instruments
 
SYMMETRIC BILINEAR CRYPTOGRAPHY ON ELLIPTIC CURVE AND LIE ALGEBRA
SYMMETRIC BILINEAR CRYPTOGRAPHY ON ELLIPTIC CURVE  AND LIE ALGEBRASYMMETRIC BILINEAR CRYPTOGRAPHY ON ELLIPTIC CURVE  AND LIE ALGEBRA
SYMMETRIC BILINEAR CRYPTOGRAPHY ON ELLIPTIC CURVE AND LIE ALGEBRA
 
[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes
 
Conditional neural processes
Conditional neural processesConditional neural processes
Conditional neural processes
 
Iwsm2014 an analogy-based approach to estimation of software development ef...
Iwsm2014   an analogy-based approach to estimation of software development ef...Iwsm2014   an analogy-based approach to estimation of software development ef...
Iwsm2014 an analogy-based approach to estimation of software development ef...
 
A Distributed Tableau Algorithm for Package-based Description Logics
A Distributed Tableau Algorithm for Package-based Description LogicsA Distributed Tableau Algorithm for Package-based Description Logics
A Distributed Tableau Algorithm for Package-based Description Logics
 
Accelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference CompilationAccelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference Compilation
 
An optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slide
 
Lesson 26
Lesson 26Lesson 26
Lesson 26
 
AI Lesson 26
AI Lesson 26AI Lesson 26
AI Lesson 26
 
A new integer programming model for hp problem
A new integer programming model for hp problemA new integer programming model for hp problem
A new integer programming model for hp problem
 
MarcoCeze_defense
MarcoCeze_defenseMarcoCeze_defense
MarcoCeze_defense
 
Optimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-peripheryOptimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-periphery
 
PAGOdA poster
PAGOdA posterPAGOdA poster
PAGOdA poster
 
Robust Image Denoising in RKHS via Orthogonal Matching Pursuit
Robust Image Denoising in RKHS via Orthogonal Matching PursuitRobust Image Denoising in RKHS via Orthogonal Matching Pursuit
Robust Image Denoising in RKHS via Orthogonal Matching Pursuit
 
Sep logic slide
Sep logic slideSep logic slide
Sep logic slide
 
Dcs unit 2
Dcs unit 2Dcs unit 2
Dcs unit 2
 

Mehr von MLReview

Bayesian Non-parametric Models for Data Science using PyMC
 Bayesian Non-parametric Models for Data Science using PyMC Bayesian Non-parametric Models for Data Science using PyMC
Bayesian Non-parametric Models for Data Science using PyMCMLReview
 
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...MLReview
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative ModelsMLReview
 
PixelGAN Autoencoders
  PixelGAN Autoencoders  PixelGAN Autoencoders
PixelGAN AutoencodersMLReview
 
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
 OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNINGMLReview
 
Theoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning TheoryTheoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning TheoryMLReview
 
2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue SystemsMLReview
 
Deep Learning for Semantic Composition
Deep Learning for Semantic CompositionDeep Learning for Semantic Composition
Deep Learning for Semantic CompositionMLReview
 
Near human performance in question answering?
Near human performance in question answering?Near human performance in question answering?
Near human performance in question answering?MLReview
 
Tutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksTutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksMLReview
 
Real-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral GridReal-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral GridMLReview
 
Yoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherYoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherMLReview
 

Mehr von MLReview (12)

Bayesian Non-parametric Models for Data Science using PyMC
 Bayesian Non-parametric Models for Data Science using PyMC Bayesian Non-parametric Models for Data Science using PyMC
Bayesian Non-parametric Models for Data Science using PyMC
 
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative Models
 
PixelGAN Autoencoders
  PixelGAN Autoencoders  PixelGAN Autoencoders
PixelGAN Autoencoders
 
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
 OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
 
Theoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning TheoryTheoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning Theory
 
2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems
 
Deep Learning for Semantic Composition
Deep Learning for Semantic CompositionDeep Learning for Semantic Composition
Deep Learning for Semantic Composition
 
Near human performance in question answering?
Near human performance in question answering?Near human performance in question answering?
Near human performance in question answering?
 
Tutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksTutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial Networks
 
Real-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral GridReal-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral Grid
 
Yoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherYoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and Whither
 

Kürzlich hochgeladen

Application of Foraminiferal Ecology- Rahul.pptx
Application of Foraminiferal Ecology- Rahul.pptxApplication of Foraminiferal Ecology- Rahul.pptx
Application of Foraminiferal Ecology- Rahul.pptxRahulVishwakarma71547
 
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...Sérgio Sacani
 
Exploration Method’s in Archaeological Studies & Research
Exploration Method’s in Archaeological Studies & ResearchExploration Method’s in Archaeological Studies & Research
Exploration Method’s in Archaeological Studies & ResearchPrachya Adhyayan
 
Pests of tenai_Identification,Binomics_Dr.UPR
Pests of tenai_Identification,Binomics_Dr.UPRPests of tenai_Identification,Binomics_Dr.UPR
Pests of tenai_Identification,Binomics_Dr.UPRPirithiRaju
 
TORSION IN GASTROPODS- Anatomical event (Zoology)
TORSION IN GASTROPODS- Anatomical event (Zoology)TORSION IN GASTROPODS- Anatomical event (Zoology)
TORSION IN GASTROPODS- Anatomical event (Zoology)chatterjeesoumili50
 
Controlling Parameters of Carbonate platform Environment
Controlling Parameters of Carbonate platform EnvironmentControlling Parameters of Carbonate platform Environment
Controlling Parameters of Carbonate platform EnvironmentRahulVishwakarma71547
 
Main Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearMain Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearmarwaahmad357
 
Identification of Superclusters and Their Properties in the Sloan Digital Sky...
Identification of Superclusters and Their Properties in the Sloan Digital Sky...Identification of Superclusters and Their Properties in the Sloan Digital Sky...
Identification of Superclusters and Their Properties in the Sloan Digital Sky...Sérgio Sacani
 
Applied Biochemistry feedback_M Ahwad 2023.docx
Applied Biochemistry feedback_M Ahwad 2023.docxApplied Biochemistry feedback_M Ahwad 2023.docx
Applied Biochemistry feedback_M Ahwad 2023.docxmarwaahmad357
 
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdfSUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdfsantiagojoderickdoma
 
THE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptx
THE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptxTHE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptx
THE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptxAkinrotimiOluwadunsi
 
soft skills question paper set for bba ca
soft skills question paper set for bba casoft skills question paper set for bba ca
soft skills question paper set for bba caohsadfeeling
 
IB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptxIB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptxUalikhanKalkhojayev1
 
Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...
Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...
Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...Sérgio Sacani
 
Human brain.. It's parts and function.
Human brain.. It's parts and function. Human brain.. It's parts and function.
Human brain.. It's parts and function. MUKTA MANJARI SAHOO
 
Substances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening TestSubstances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening TestAkashDTejwani
 
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdfPests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdfPirithiRaju
 
geometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsgeometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsHassan Jolany
 
Unit 3, Herbal Drug Technology, B Pharmacy 6th Sem
Unit 3, Herbal Drug Technology, B Pharmacy 6th SemUnit 3, Herbal Drug Technology, B Pharmacy 6th Sem
Unit 3, Herbal Drug Technology, B Pharmacy 6th SemHUHam1
 
Alternative system of medicine herbal drug technology syllabus
Alternative system of medicine herbal drug technology syllabusAlternative system of medicine herbal drug technology syllabus
Alternative system of medicine herbal drug technology syllabusPradnya Wadekar
 

Kürzlich hochgeladen (20)

Application of Foraminiferal Ecology- Rahul.pptx
Application of Foraminiferal Ecology- Rahul.pptxApplication of Foraminiferal Ecology- Rahul.pptx
Application of Foraminiferal Ecology- Rahul.pptx
 
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
 
Exploration Method’s in Archaeological Studies & Research
Exploration Method’s in Archaeological Studies & ResearchExploration Method’s in Archaeological Studies & Research
Exploration Method’s in Archaeological Studies & Research
 
Pests of tenai_Identification,Binomics_Dr.UPR
Pests of tenai_Identification,Binomics_Dr.UPRPests of tenai_Identification,Binomics_Dr.UPR
Pests of tenai_Identification,Binomics_Dr.UPR
 
TORSION IN GASTROPODS- Anatomical event (Zoology)
TORSION IN GASTROPODS- Anatomical event (Zoology)TORSION IN GASTROPODS- Anatomical event (Zoology)
TORSION IN GASTROPODS- Anatomical event (Zoology)
 
Controlling Parameters of Carbonate platform Environment
Controlling Parameters of Carbonate platform EnvironmentControlling Parameters of Carbonate platform Environment
Controlling Parameters of Carbonate platform Environment
 
Main Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearMain Exam Applied biochemistry final year
Main Exam Applied biochemistry final year
 
Identification of Superclusters and Their Properties in the Sloan Digital Sky...
Identification of Superclusters and Their Properties in the Sloan Digital Sky...Identification of Superclusters and Their Properties in the Sloan Digital Sky...
Identification of Superclusters and Their Properties in the Sloan Digital Sky...
 
Applied Biochemistry feedback_M Ahwad 2023.docx
Applied Biochemistry feedback_M Ahwad 2023.docxApplied Biochemistry feedback_M Ahwad 2023.docx
Applied Biochemistry feedback_M Ahwad 2023.docx
 
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdfSUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
 
THE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptx
THE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptxTHE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptx
THE HISTOLOGY OF THE CARDIOVASCULAR SYSTEM 2024.pptx
 
soft skills question paper set for bba ca
soft skills question paper set for bba casoft skills question paper set for bba ca
soft skills question paper set for bba ca
 
IB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptxIB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptx
 
Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...
Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...
Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...
 
Human brain.. It's parts and function.
Human brain.. It's parts and function. Human brain.. It's parts and function.
Human brain.. It's parts and function.
 
Substances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening TestSubstances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening Test
 
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdfPests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
 
geometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsgeometric quantization on coadjoint orbits
geometric quantization on coadjoint orbits
 
Unit 3, Herbal Drug Technology, B Pharmacy 6th Sem
Unit 3, Herbal Drug Technology, B Pharmacy 6th SemUnit 3, Herbal Drug Technology, B Pharmacy 6th Sem
Unit 3, Herbal Drug Technology, B Pharmacy 6th Sem
 
Alternative system of medicine herbal drug technology syllabus
Alternative system of medicine herbal drug technology syllabusAlternative system of medicine herbal drug technology syllabus
Alternative system of medicine herbal drug technology syllabus
 

  • 13. Kernel Stein Discrepancy
Stein operator: T_p f = ∂_x f + f ∂_x log p.
Kernel Stein Discrepancy (KSD):
KSD(p, q; F) = sup_{‖g‖_F ≤ 1} [E_q T_p g − E_p T_p g] = sup_{‖g‖_F ≤ 1} E_q T_p g,
since E_p T_p g = 0.
[Plot: densities p(x), q(x) and the Stein witness g*(x).]
  • 14. Kernel Stein Discrepancy: closed-form expression
Given Z, Z′ ∼ q (Chwialkowski, Strathmann, G., ICML 2016; Liu, Lee, Jordan, ICML 2016):
KSD²(p, q; F) = E_q h_p(Z, Z′),
where
h_p(x, y) := ∂ log p(x) ∂ log p(y) k(x, y) + ∂ log p(y) ∂_x k(x, y) + ∂ log p(x) ∂_y k(x, y) + ∂_x ∂_y k(x, y),
and k is the RKHS kernel for F.
KSD depends only on the kernel and on ∂_x log p(x): there is no need to normalize p, or to sample from it.
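The closed-form statistic above is easy to compute. Below is a minimal numpy sketch (function names are mine, not from the tutorial's code) of the V-statistic estimate of KSD² for a one-dimensional standard normal model p, where the score is ∂_x log p(x) = −x, using a Gaussian kernel:

```python
import numpy as np

def ksd_vstat(X, sigma=1.0):
    """V-statistic estimate of KSD^2(p, q) for p = N(0, 1) in 1-D,
    from samples X ~ q, with a Gaussian kernel of width sigma."""
    X = np.asarray(X, dtype=float)
    d = X[:, None] - X[None, :]                       # pairwise x_i - x_j
    K = np.exp(-d**2 / (2 * sigma**2))                # k(x, y)
    s = -X                                            # score of N(0,1): d/dx log p(x) = -x
    dxK = -d / sigma**2 * K                           # d/dx k(x, y)
    dyK = d / sigma**2 * K                            # d/dy k(x, y)
    dxdyK = (1.0 / sigma**2 - d**2 / sigma**4) * K    # d^2/dxdy k(x, y)
    # h_p(x, y), assembled from the four terms on the slide
    H = (s[:, None] * s[None, :] * K
         + s[None, :] * dxK
         + s[:, None] * dyK
         + dxdyK)
    return H.mean()                                   # average over all pairs

rng = np.random.default_rng(0)
ksd_match = ksd_vstat(rng.normal(0.0, 1.0, 500))      # q = p: near zero
ksd_shift = ksd_vstat(rng.normal(2.0, 1.0, 500))      # q mean-shifted: clearly positive
```

The V-statistic is a squared RKHS norm of an empirical Stein feature mean, so it is always nonnegative; under q = p it concentrates near zero, while the shifted sample gives a value well away from zero.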
  • 16. Statistical model criticism: Chicago crime data. Model is a Gaussian mixture with two components.
  • 17. Statistical model criticism: Chicago crime data. Model is a Gaussian mixture with two components; Stein witness function shown.
  • 18. Statistical model criticism: Chicago crime data. Model is a Gaussian mixture with ten components.
  • 19. Statistical model criticism: Chicago crime data. Model is a Gaussian mixture with ten components; Stein witness function shown. Code: https://github.com/karlnapf/kernel_goodness_of_fit
  • 20–21. Kernel Stein Discrepancy: further applications. Evaluation of approximate MCMC methods (Chwialkowski, Strathmann, G., ICML 2016; Gorham, Mackey, ICML 2017).
What kernel to use? The inverse multiquadric kernel, k(x, y) = (c + ‖x − y‖₂²)^β with β ∈ (−1, 0).
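As a concrete instance, assuming the standard parameterisation with c > 0 and exponent β ∈ (−1, 0) (the range used by Gorham and Mackey; the function name is mine), the inverse multiquadric kernel is:

```python
import numpy as np

def imq_kernel(x, y, c=1.0, beta=-0.5):
    """Inverse multiquadric kernel (c + ||x - y||^2)^beta, with beta in (-1, 0).
    Its tails decay only polynomially, unlike the Gaussian kernel's."""
    x, y = np.atleast_1d(x), np.atleast_1d(y)
    return (c + np.sum((x - y) ** 2)) ** beta

# Maximal at x = y, decaying slowly with distance:
k0 = imq_kernel(0.0, 0.0)   # = c**beta = 1.0
k5 = imq_kernel(0.0, 5.0)   # = 26**(-0.5)
```

The slow polynomial decay is the point: Gorham and Mackey show the resulting KSD can detect non-convergence of a sample sequence where lighter-tailed kernels may fail.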
  • 24. Dependence testing. Given: samples from a distribution P_XY. Goal: are X and Y independent?
Sample captions (Y), paired with dog images (X):
“Their noses guide them through life, and they're never happier than when following an interesting scent.”
“A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose.”
“A responsive, interactive pet, one that will blow in your ear and follow you everywhere.”
Text from dogtime.com and petfinder.com.
  • 25–27. MMD as a dependence measure? Could we use MMD(P_XY, P_X P_Y; H), with P := P_XY and Q := P_X P_Y?
We don't have samples from Q := P_X P_Y, only pairs {(x_i, y_i)}_{i=1}^n i.i.d. ∼ P_XY.
Solution: simulate Q with the pairs (x_i, y_j) for j ≠ i.
What kernel to use for the RKHS H?
  • 28–29. MMD as a dependence measure. Kernel k on images with feature space F; kernel l on captions with feature space G. Kernel on image–text pairs: are images and captions similar?
  • 30–31. MMD as a dependence measure. Given: samples from a distribution P_XY. Goal: are X and Y independent?
MMD²(P̂_XY, P̂_X P̂_Y; H) := (1/n²) trace(KL) (K, L column-centered).
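The trace statistic above is a few lines of numpy. In the sketch below (names are mine), centering the Gram matrices with H = I − (1/n)11⊤ plays the role of using zero-mean features; the Gaussian kernel and bandwidth are illustrative choices:

```python
import numpy as np

def rbf_gram(Z, sigma=1.0):
    """Gaussian Gram matrix for 1-D data Z."""
    d2 = (Z[:, None] - Z[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def hsic_biased(x, y, sigma=1.0):
    """Biased estimate (1/n^2) trace(K~ L~), K~ and L~ centered Gram matrices."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    K = H @ rbf_gram(x, sigma) @ H
    L = H @ rbf_gram(y, sigma) @ H
    return np.trace(K @ L) / n**2

rng = np.random.default_rng(1)
x = rng.normal(size=300)
hsic_dep = hsic_biased(x, x + 0.2 * rng.normal(size=300))   # strongly dependent
hsic_ind = hsic_biased(x, rng.normal(size=300))             # independent
```

Since K̃ and L̃ are positive semidefinite, the statistic is nonnegative; it is small (but, as discussed later, biased away from zero) for independent pairs and large for dependent ones.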
  • 32. MMD as a dependence measure: two questions. Why the product kernel? There are many ways to combine kernels; why not, e.g., a sum? And is there a more interpretable way of defining this dependence measure?
  • 33–35. Finding covariance with smooth transformations. Illustration: two variables with no correlation but strong dependence.
[Plots: the raw variables (correlation 0.00); a smooth transformation of each variable; the transformed variables (correlation 0.90).]
  • 36. Define two spaces, one for each witness.
Function in F: f(x) = Σ_{j=1}^∞ f_j φ_j(x), with feature map φ(x); kernel for RKHS F on X: k(x, x′) = ⟨φ(x), φ(x′)⟩_F.
Function in G: g(y) = Σ_{j=1}^∞ g_j ψ_j(y), with feature map ψ(y); kernel for RKHS G on Y: l(y, y′) = ⟨ψ(y), ψ(y′)⟩_G.
  • 37–41. The constrained covariance:
COCO(P_XY) = sup_{‖f‖_F ≤ 1, ‖g‖_G ≤ 1} cov[f(x), g(y)]
           = sup_{‖f‖_F ≤ 1, ‖g‖_G ≤ 1} E_xy[(Σ_{j=1}^∞ f_j φ_j(x)) (Σ_{j=1}^∞ g_j ψ_j(y))].
Fine print: the feature maps φ(x) and ψ(y) are assumed to have zero mean.
Rewriting: E_xy[f(x) g(y)] = f⊤ C_{φ(x)ψ(y)} g, where C_{φ(x)ψ(y)} := E_xy[φ(x) ψ(y)⊤] is the feature covariance.
COCO is the largest singular value of the feature covariance C_{φ(x)ψ(y)}.
  • 42–49. Computing COCO in practice. Given a sample {(x_i, y_i)}_{i=1}^n i.i.d. ∼ P_XY, what is the empirical COCO?
COCO is the largest eigenvalue γ_max of the generalized eigenproblem
[ 0           (1/n) K̃ L̃ ] [α]       [ K̃  0 ] [α]
[ (1/n) L̃ K̃       0     ] [β]  = γ  [ 0  L̃ ] [β],
where K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j).
Witness functions (singular vectors): f(x) ∝ Σ_{i=1}^n α_i k(x_i, x), g(y) ∝ Σ_{i=1}^n β_i l(y_i, y).
Fine print: the kernels are computed with empirically centered features φ(x) − (1/n) Σ_{i=1}^n φ(x_i) and ψ(y) − (1/n) Σ_{i=1}^n ψ(y_i).
(AG, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schölkopf, N. Logothetis, AISTATS 2005)
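For invertible centered Gram matrices K̃, L̃, the generalized eigenproblem reduces (my derivation, sketched under that assumption) to L̃K̃α = n²γ²α, so the empirical COCO is (1/n)·√λ_max(K̃L̃). A minimal numpy sketch, with illustrative kernel choices:

```python
import numpy as np

def rbf_gram(Z, sigma=1.0):
    """Gaussian Gram matrix for 1-D data Z."""
    d2 = (Z[:, None] - Z[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def coco(x, y, sigma=1.0):
    """Empirical COCO: (1/n) * sqrt(lambda_max(K~ L~))."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    K = H @ rbf_gram(x, sigma) @ H
    L = H @ rbf_gram(y, sigma) @ H
    # K~ L~ has real, nonnegative spectrum (similar to a PSD matrix);
    # np.real discards numerical noise in the imaginary parts.
    lam_max = np.max(np.real(np.linalg.eigvals(K @ L)))
    return np.sqrt(lam_max) / n

rng = np.random.default_rng(2)
x = rng.normal(size=200)
coco_dep = coco(x, np.sin(2 * x) + 0.1 * rng.normal(size=200))  # nonlinear dependence
coco_ind = coco(x, rng.normal(size=200))                        # independent
```

As on the slides, the independent case does not give exactly zero at finite sample size (the bias discussed below), but the dependent case gives a clearly larger value.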
  • 50. What is a large dependence with COCO?
[Plots: a smooth density and 500 samples from it; a rough density and 500 samples from it.]
The density takes the form P_XY ∝ 1 + sin(ωx) sin(ωy). Which of these is the more “dependent”?
  • 51–56. Finding covariance with smooth transformations, for P_XY ∝ 1 + sin(ωx) sin(ωy):
ω = 1: raw correlation 0.31; transformed correlation 0.50; COCO 0.09.
ω = 2: raw correlation 0.02; transformed correlation 0.54; COCO 0.07.
ω = 3: raw correlation 0.03; transformed correlation 0.44; COCO 0.04.
ω = 4: raw correlation 0.05; transformed correlation 0.25; COCO 0.02.
Large ω: raw correlation 0.01; transformed correlation 0.14; COCO 0.02.
ω = 0 (uniform noise; shows the bias): raw correlation 0.01; transformed correlation 0.14; COCO 0.02.
  • 57. Dependence is largest at “low” frequencies. As the dependence is encoded at higher frequencies, the smooth mappings f, g achieve lower linear dependence. Even for independent variables, COCO will not be zero at finite sample sizes, since some mild linear dependence will be found by f, g (bias). This bias decreases with increasing sample size.
  • 58–60. Can we do better than COCO? A second example with zero correlation.
First singular value of the feature covariance C_{φ(x)ψ(y)}: transformed correlation 0.80; COCO₁ = 0.11.
Second singular value of C_{φ(x)ψ(y)}: transformed correlation 0.37; COCO₂ = 0.06.
  • 62. The Hilbert-Schmidt Independence Criterion. Writing the i-th singular value of the feature covariance C_{φ(x)ψ(y)} as γ_i := COCO_i(P_XY; F, G), define the Hilbert-Schmidt Independence Criterion (HSIC): HSIC²(P_XY; F, G) = Σ_{i=1}^∞ γ_i². AG, O. Bousquet, A. Smola, and B. Schoelkopf, ALT 2005; AG, K. Fukumizu, C.H. Teo, L. Song, B. Schoelkopf, and A. Smola, NIPS 2007. HSIC is MMD with a product kernel! HSIC²(P_XY; F, G) = MMD²(P_XY, P_X P_Y; H_κ), where κ((x, y), (x′, y′)) = k(x, x′) l(y, y′). 29/52
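The product-kernel identity above can be checked numerically: the biased HSIC statistic (1/n²) tr(KHLH) coincides exactly with the V-statistic estimate of MMD² between the paired sample and the product of marginals under κ = k·l. A minimal sketch, not from the slides — Gaussian kernels and all function names are illustrative:

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    # Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
    d2 = np.sum(x**2, 1)[:, None] + np.sum(x**2, 1)[None, :] - 2 * x @ x.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic_biased(K, L):
    # Biased HSIC statistic: (1/n^2) trace(K H L H), with H = I - (1/n) 1 1^T
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

def mmd2_product_kernel(K, L):
    # V-statistic MMD^2 between P_XY and P_X P_Y under kappa = k * l:
    # (1/n^2) sum K.*L  -  (2/n^3) sum KL  +  (1/n^4) (sum K)(sum L)
    n = K.shape[0]
    return (K * L).mean() - 2 * (K @ L).sum() / n**3 + K.mean() * L.mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
y = x + 0.5 * rng.normal(size=(50, 1))
K, L = gaussian_gram(x), gaussian_gram(y)
print(np.isclose(hsic_biased(K, L), mmd2_product_kernel(K, L)))  # True
```

The equality is an algebraic identity between the two V-statistics (expand H = I − n⁻¹𝟙𝟙ᵀ inside the trace), so it holds for any sample, not just this one.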
  • 66. Asymptotics of HSIC under independence. Given a sample {(x_i, y_i)}_{i=1}^n i.i.d. ∼ P_XY, what is empirical HSIC? Empirical HSIC (biased): HSIC = (1/n²) trace(KL), with K_ij = k(x_i, x_j) and L_ij = l(y_i, y_j) (K and L computed with empirically centered features). Statistical testing: given P_XY = P_X P_Y, what is the threshold c_α such that P(HSIC > c_α) < α for small α? Asymptotics of HSIC when P_XY = P_X P_Y: n·HSIC →_D Σ_{l=1}^∞ λ_l z_l², with z_l ∼ N(0, 1) i.i.d., where λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r} and h_{ijqr} = (1/4!) Σ_{(t,u,v,w)}^{(i,j,q,r)} (k_tu l_tu + k_tu l_vw − 2 k_tu l_tv), the sum running over all orderings (t, u, v, w) of (i, j, q, r). 30/52
  • 69. A statistical test. Given P_XY = P_X P_Y, what is the threshold c_α such that P(HSIC > c_α) < α for small α (prob. of false positive)? Original time series: X₁, ..., X₁₀ paired with Y₁, ..., Y₁₀. Permutation: X₁, ..., X₁₀ paired with Y₇, Y₃, Y₉, Y₂, Y₄, Y₈, Y₅, Y₁, Y₆, Y₁₀. Null distribution via permutation: compute HSIC for {x_i, y_{σ(i)}}_{i=1}^n for a random permutation σ of the indices {1, ..., n}. This gives HSIC for independent variables. Repeat for many different permutations to get an empirical CDF. The threshold c_α is the (1 − α) quantile of this empirical CDF. 31/52
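The permutation procedure above can be sketched in a few lines. An illustrative sketch, using Gaussian kernels and the biased statistic (1/n²) tr(KHLH); note that permuting the rows and columns of L is equivalent to shuffling the y sample:

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    d2 = np.sum(x**2, 1)[:, None] + np.sum(x**2, 1)[None, :] - 2 * x @ x.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic_stat(K, L):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

def hsic_permutation_test(x, y, alpha=0.05, n_perms=500, seed=0):
    rng = np.random.default_rng(seed)
    K, L = gaussian_gram(x), gaussian_gram(y)
    stat = hsic_stat(K, L)
    null = []
    for _ in range(n_perms):
        idx = rng.permutation(len(y))
        # permute rows AND columns of L = HSIC under a shuffled y sample
        null.append(hsic_stat(K, L[np.ix_(idx, idx)]))
    threshold = np.quantile(null, 1 - alpha)  # (1 - alpha) quantile of null CDF
    return stat > threshold  # True = reject independence

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 1))
print(hsic_permutation_test(x, x + 0.3 * rng.normal(size=(100, 1))))  # dependent pair: True
print(hsic_permutation_test(x, rng.normal(size=(100, 1))))            # independent pair
```

For the independent pair the test rejects with probability roughly α, which is exactly the Type I error the threshold is built to control.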
  • 70. Application: dependence detection across languages. Testing task: detect dependence between English and French text. English: “Honourable senators, I have a question for the Leader of the Government in the Senate”; “No doubt there is great pressure on provincial and municipal governments”; “In fact, we have increased federal investments for early childhood development.” French: “Honorables sénateurs, ma question s’adresse au leader du gouvernement au Sénat”; “Les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions”; “Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes”. Text from the aligned Hansards of the 36th parliament of Canada, https://www.isi.edu/natural-language/download/hansard/ 32/52
  • 71. Application: dependence detection across languages. Testing task: detect dependence between English and French text. k-spectrum kernel, k = 10, sample size n = 10. HSIC = (1/n²) trace(KL) (K and L column centered). 33/52
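A k-spectrum kernel counts shared length-k substrings between two strings. A minimal sketch — an assumed construction for illustration, not the exact implementation behind the slide's results:

```python
from collections import Counter

def spectrum_features(text, k=10):
    # Map a string to counts of its length-k substrings.
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

def spectrum_kernel(s, t, k=10):
    # Inner product between the two k-substring count vectors.
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    return sum(c * ft[sub] for sub, c in fs.items())

en = "honourable senators, i have a question for the leader"
fr = "honorables senateurs, ma question s'adresse au leader"
print(spectrum_kernel(en, fr))  # > 0: shared 10-grams such as " question "
print(spectrum_kernel(en, en))  # self-similarity, the largest value here
```

For the HSIC test, K_ij would be the spectrum kernel between English lines i and j, and L_ij between the corresponding French lines, each Gram matrix then centered as above.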
  • 72. Application: dependence detection across languages. Results (for α = 0.05): k-spectrum kernel, average Type II error 0; bag-of-words kernel, average Type II error 0.18. Settings: five-line extracts, averaged over 300 repetitions, for the “Agriculture” transcripts. Similar results for the Fisheries and Immigration transcripts. 34/52
  • 73. Testing higher order interactions 35/52
  • 76. Detecting higher order interaction. How to detect V-structures with pairwise weak individual dependence? X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z. [Figure: scatter plots of X₁ vs Y₁, Y₁ vs Z₁, X₁ vs Z₁, and X₁·Y₁ vs Z₁, for the V-structure X → Z ← Y.] X, Y i.i.d. ∼ N(0, 1); Z | X, Y ∼ sign(XY)·Exp(1/√2). Fine print: faithfulness is violated here! 36/52
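The construction above is easy to reproduce numerically; a sketch, where Exp(1/√2) is read as an exponential with that scale (an assumption about the slide's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = rng.normal(size=n)
# Z | X, Y ~ sign(XY) * Exp(1/sqrt(2)): the sign flip makes every pairwise
# correlation vanish, yet X*Y and Z are strongly dependent.
z = np.sign(x * y) * rng.exponential(scale=1 / np.sqrt(2), size=n)

print(np.corrcoef(x, z)[0, 1])      # near 0: X and Z look unrelated
print(np.corrcoef(y, z)[0, 1])      # near 0: Y and Z look unrelated
print(np.corrcoef(x * y, z)[0, 1])  # clearly positive: three-way interaction
```

This mirrors the figure: the pairwise panels look like noise, and only the X₁·Y₁ vs Z₁ panel reveals the structure.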
  • 77. V-structure discovery (X → Z ← Y). Assume X ⊥⊥ Y has been established. The V-structure can then be detected by: a consistent CI test, H₀: X ⊥⊥ Y | Z [Fukumizu et al. 2008, Zhang et al. 2011]; or a factorisation test, H₀: (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X (multiple standard two-variable tests). How well do these work? 37/52
  • 78. Detecting higher order interaction. Generalise the earlier example to p dimensions: X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z. [Figure: scatter plots as on slide 76.] X₁, Y₁ i.i.d. ∼ N(0, 1); Z₁ | X₁, Y₁ ∼ sign(X₁Y₁)·Exp(1/√2); X_{2:p}, Y_{2:p}, Z_{2:p} i.i.d. ∼ N(0, I_{p−1}). Fine print: faithfulness is violated here! 38/52
  • 79. V-structure discovery. CI test for X ⊥⊥ Y | Z from Zhang et al. (2011), and a factorisation test; n = 500. 39/52
  • 82. Lancaster interaction measure. The Lancaster interaction measure of (X₁, ..., X_D) ∼ P is a signed measure Δ_L P that vanishes whenever P can be factorised non-trivially. D = 2: Δ_L P = P_XY − P_X P_Y. D = 3: Δ_L P = P_XYZ − P_X P_YZ − P_Y P_XZ − P_Z P_XY + 2 P_X P_Y P_Z. [Diagram: the five terms of Δ_L P drawn as graphs over X, Y, Z.] 40/52
  • 83. Lancaster interaction measure. For D = 3: Δ_L P = P_XYZ − P_X P_YZ − P_Y P_XZ − P_Z P_XY + 2 P_X P_Y P_Z. [Diagram: in the case X ⊥⊥ (Y, Z), the terms cancel and Δ_L P = 0.] 40/52
  • 84. Lancaster interaction measure. Whenever the joint factorises non-trivially, the measure vanishes: (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X ⇒ Δ_L P = 0. ...so what might be missed? 40/52
  • 85. Lancaster interaction measure. The converse fails: Δ_L P = 0 ⇏ (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X. Example (binary variables): P(0,0,0) = 0.2, P(0,0,1) = 0.1, P(1,0,0) = 0.1, P(1,0,1) = 0.1, P(0,1,0) = 0.1, P(0,1,1) = 0.1, P(1,1,0) = 0.1, P(1,1,1) = 0.2. 40/52
  • 86. A kernel test statistic using the Lancaster measure. Construct a test by estimating ‖Δ_L P‖²_{H_κ}, the squared RKHS norm of the embedded signed measure, where κ = k ⊗ l ⊗ m: ‖P_XYZ − P_XY P_Z − ···‖²_{H_κ} = ⟨P_XYZ, P_XYZ⟩_{H_κ} − 2 ⟨P_XYZ, P_XY P_Z⟩_{H_κ} + ··· 41/52
  • 88. A kernel test statistic using the Lancaster measure. Table: V-statistic estimators of ⟨·, ·⟩_{H_κ} (without the terms involving P_X P_Y P_Z). H is the centering matrix I − n⁻¹𝟙𝟙ᵀ. Lancaster interaction statistic [D. Sejdinovic, AG, W. Bergsma, NIPS 2013]: ‖Δ_L P‖²_{H_κ} = (1/n²) ((HKH) ∘ (HLH) ∘ (HMH))₊₊, where ∘ is the elementwise product and (·)₊₊ sums all entries. This is the empirical joint central moment in the feature space. 42/52
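The statistic is a one-liner given the three centred Gram matrices. A minimal sketch, with Gaussian kernels and illustrative function names:

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    d2 = np.sum(x**2, 1)[:, None] + np.sum(x**2, 1)[None, :] - 2 * x @ x.T
    return np.exp(-d2 / (2 * sigma**2))

def lancaster_stat(x, y, z, sigma=1.0):
    # (1/n^2) * sum_{ij} (HKH)_ij (HLH)_ij (HMH)_ij: the squared norm of the
    # empirical joint central moment in feature space, hence non-negative.
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ gaussian_gram(x, sigma) @ H
    Lc = H @ gaussian_gram(y, sigma) @ H
    Mc = H @ gaussian_gram(z, sigma) @ H
    return (Kc * Lc * Mc).sum() / n**2

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))
y = rng.normal(size=(n, 1))
z = np.sign(x * y) * rng.exponential(scale=1 / np.sqrt(2), size=(n, 1))
print(lancaster_stat(x, y, z))                        # three-way dependent triple
print(lancaster_stat(x, y, rng.normal(size=(n, 1))))  # fully independent triple
```

As with HSIC, a permutation test (shuffling one or two of the three samples) would calibrate the rejection threshold in practice.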
  • 89. V-structure discovery. Lancaster test, CI test for X ⊥⊥ Y | Z from Zhang et al. (2011), and a factorisation test; n = 500. 43/52
  • 92. Interaction for D ≥ 4. An interaction measure valid for all D (Streitberg, 1990): Δ_S P = Σ_π (−1)^{|π|−1} (|π| − 1)! J_π P. For a partition π, J_π associates to the joint distribution the corresponding factorisation, e.g., J_{13|2|4} P = P_{X₁X₃} P_{X₂} P_{X₄}. [Figure: the number of partitions of {1, ..., D} grows as the Bell numbers.] 44/52
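The Bell-number growth in the figure follows the standard recurrence B_{n+1} = Σ_{k=0}^{n} C(n, k) B_k, which is easy to tabulate:

```python
from math import comb

def bell_numbers(d_max):
    # B[n] = number of partitions of an n-element set,
    # via the recurrence B_{n+1} = sum_k C(n, k) B_k, with B_0 = 1.
    B = [1]
    for n in range(d_max):
        B.append(sum(comb(n, k) * B[k] for k in range(n + 1)))
    return B

B = bell_numbers(10)
print(B[3])   # 5: the five terms appearing in Delta_L P for D = 3
print(B[10])  # 115975 factorisations to handle already at D = 10
```

This is why the Streitberg measure, though well-defined for every D, quickly becomes impractical to evaluate term by term.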
  • 93. Part 4: Advanced topics 45/52
  • 94. Advanced topics: testing on time series; testing for conditional dependence; regression and conditional mean embedding. 46/52
  • 99. Measures of divergence Sriperumbudur, Fukumizu, G, Schoelkopf, Lanckriet (2012) 51/52
  • 100. Co-authors. From Gatsby: Kacper Chwialkowski, Wittawat Jitkrittum, Bharath Sriperumbudur, Heiko Strathmann, Dougal Sutherland, Zoltan Szabo, Wenkai Xu. External collaborators: Kenji Fukumizu, Bernhard Schoelkopf, Alex Smola. Questions? 52/52