1) The document discusses using kernel methods to test statistical dependence between random variables. It proposes a measure called constrained covariance (COCO) that finds the maximum covariance between two random variables after transforming them using reproducing kernel Hilbert spaces.
2) COCO is defined as the supremum of the covariance between smooth transformations of the random variables, where the transformations live in unit balls of RKHSs. This captures dependencies that may not be detected by simple correlation.
3) Empirically, COCO is estimated as the largest singular value of the covariance between kernel embeddings of samples from the two random variables. This provides a way to compute the proposed dependence measure in practice from data samples.
3. Statistical model criticism
$$\mathrm{MMD}(P, Q) = \sup_{\|f\|_F \le 1} \left[ E_Q f - E_P f \right]$$
[Figure: densities $p(x)$ and $q(x)$ with the witness function $f^*(x)$]
$f^*(x)$ is the witness function
Can we compute MMD with samples from Q and a model P?
Problem: usually can't compute $E_P f$ in closed form.
4. Stein idea
To get rid of $E_P f$ in
$$\sup_{\|f\|_F \le 1} \left[ E_Q f - E_P f \right]$$
we define the Stein operator
$$[T_p f](x) = \frac{1}{p(x)} \frac{d}{dx}\big(f(x)\,p(x)\big)$$
Then
$$E_P\, T_p f = 0$$
subject to appropriate boundary conditions. (Oates, Girolami, Chopin, 2016)
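The identity $E_P\, T_p f = 0$ is easy to check numerically. Below is a minimal sketch (not from the slides), taking $p$ to be the standard normal and $f(x) = \sin(x)$, and approximating the expectation on a grid:

```python
import numpy as np

# Sketch: verify E_P[T_p f] = 0 for p = N(0, 1) and f(x) = sin(x).
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# [T_p f](x) = (1/p) d/dx (f p) = f'(x) + f(x) * d/dx log p(x),
# and for the standard normal, d/dx log p(x) = -x.
Tpf = np.cos(x) - np.sin(x) * x

integral = np.sum(Tpf * p) * dx  # grid approximation of E_P[T_p f]
print(abs(integral))  # close to 0
```

The boundary condition holds here because $f(x)\,p(x) \to 0$ as $x \to \pm\infty$.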
5. Stein idea: proof
$$E_p\,[T_p f] = \int \frac{1}{p(x)} \left[\frac{d}{dx}\big(f(x)\,p(x)\big)\right] p(x)\, dx = \int \frac{d}{dx}\big(f(x)\,p(x)\big)\, dx = \Big[f(x)\,p(x)\Big]_{-\infty}^{\infty} = 0$$
10. Kernel Stein Discrepancy
Stein operator
$$T_p f = \partial_x f + f\, \partial_x (\log p)$$
Kernel Stein Discrepancy (KSD)
$$\mathrm{KSD}(p, q; F) = \sup_{\|g\|_F \le 1} \left[ E_q\, T_p g - E_p\, T_p g \right] = \sup_{\|g\|_F \le 1} E_q\, T_p g,$$
since $E_p\, T_p g = 0$.
[Figure: densities $p(x)$ and $q(x)$ with the Stein witness function $g^*(x)$]
14. Kernel Stein discrepancy
Closed-form expression for KSD: given $Z, Z' \sim q$, then
(Chwialkowski, Strathmann, G., ICML 2016) (Liu, Lee, Jordan, ICML 2016)
$$\mathrm{KSD}^2(p, q; F) = E_q\, h_p(Z, Z')$$
where
$$h_p(x, y) := \partial_x \log p(x)\, \partial_y \log p(y)\, k(x, y) + \partial_y \log p(y)\, \partial_x k(x, y) + \partial_x \log p(x)\, \partial_y k(x, y) + \partial_x \partial_y k(x, y)$$
and $k$ is the kernel of the RKHS $F$.
Only depends on the kernel and $\partial_x \log p(x)$: no need to normalize $p$, or to sample from it.
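As a sketch of how the closed form is used in practice (kernel choice and parameters are assumptions, not from the slides): for $p = N(0,1)$ and a Gaussian kernel, every term of $h_p$ is available analytically, and the V-statistic $\frac{1}{n^2}\sum_{i,j} h_p(z_i, z_j)$ estimates $\mathrm{KSD}^2$:

```python
import numpy as np

def ksd_sq(z, sigma=1.0):
    """V-statistic estimate of KSD^2 against p = N(0, 1), Gaussian kernel."""
    d = z[:, None] - z[None, :]
    k = np.exp(-d**2 / (2 * sigma**2))
    sx = -z[:, None]                 # score d/dx log p(x) = -x, at x
    sy = -z[None, :]                 # score at y
    dxk = -d / sigma**2 * k          # d/dx k(x, y)
    dyk = d / sigma**2 * k           # d/dy k(x, y)
    dxdyk = (1 / sigma**2 - d**2 / sigma**4) * k  # d^2/dxdy k(x, y)
    h = sx * sy * k + sy * dxk + sx * dyk + dxdyk
    return h.mean()

rng = np.random.default_rng(1)
z_match = rng.normal(size=500)         # q = p: KSD^2 near zero
z_shift = rng.normal(size=500) + 1.0   # q = N(1, 1): KSD^2 clearly positive
print(ksd_sq(z_match), ksd_sq(z_shift))
```

Note that no normalizing constant of $p$ appears anywhere: only the score $-x$.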
19. Statistical model criticism
Chicago crime data
Model is Gaussian mixture with ten components
Stein witness function
Code: https://github.com/karlnapf/kernel_goodness_of_fit
20. Kernel Stein discrepancy
Further applications:
Evaluation of approximate MCMC methods.
(Chwialkowski, Strathmann, G., ICML 2016; Gorham, Mackey, ICML 2017)
What kernel to use?
The inverse multiquadric kernel,
$$k(x, y) = \left(c^2 + \|x - y\|^2\right)^{-1/2}$$
24. Dependence testing
Given: samples from a distribution $P_{XY}$
Goal: are $X$ and $Y$ independent?
[Figure: dog images ($X$) paired with breed descriptions ($Y$):]
"Their noses guide them through life, and they're never happier than when following an interesting scent."
"A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose."
"A responsive, interactive pet, one that will blow in your ear and follow you everywhere."
Text from dogtime.com and petfinder.com
25. MMD as a dependence measure?
Could we use MMD?
$$\mathrm{MMD}(\underbrace{P_{XY}}_{P},\ \underbrace{P_X P_Y}_{Q};\ H)$$
We don't have samples from $Q := P_X P_Y$, only pairs $\{(x_i, y_i)\}_{i=1}^n \overset{i.i.d.}{\sim} P_{XY}$.
Solution: simulate $Q$ with pairs $(x_i, y_j)$ for $j \ne i$.
What kernel to use for the RKHS $H$?
28. MMD as a dependence measure
Kernel $k$ on images with feature space $F$,
Kernel $l$ on captions with feature space $G$,
Kernel on image-text pairs: are images and captions similar?
30. MMD as a dependence measure
Given: samples from a distribution $P_{XY}$
Goal: are $X$ and $Y$ independent?
$$\mathrm{MMD}^2(\hat{P}_{XY}, \hat{P}_X \hat{P}_Y; H) := \frac{1}{n^2}\, \mathrm{trace}(KL)$$
($K$, $L$ column centered)
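A minimal numerical sketch of this statistic (assumptions, not from the slides: Gaussian kernels on both variables, centering via $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$):

```python
import numpy as np

def gram(z, sigma=1.0):
    """Gaussian kernel Gram matrix for a 1-D sample."""
    return np.exp(-(z[:, None] - z[None, :])**2 / (2 * sigma**2))

def mmd2_dep(x, y):
    """MMD^2(P_XY, P_X P_Y) with the product kernel: trace(KL)/n^2, K and L centered."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ gram(x) @ H
    L = H @ gram(y) @ H
    return np.trace(K @ L) / n**2

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y_dep = x**2 + 0.1 * rng.normal(size=n)  # dependent on x, yet uncorrelated with it
y_ind = rng.normal(size=n)               # independent of x
print(mmd2_dep(x, y_dep), mmd2_dep(x, y_ind))  # dependent pair gives the larger value
```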
32. MMD as a dependence measure
Two questions:
Why the product kernel? There are many ways to combine kernels: why not, e.g., a sum?
Is there a more interpretable way of defining this dependence
measure?
33. Finding covariance with smooth transformations
Illustration: two variables with no correlation but strong dependence.
[Figure: scatter plot of the two variables (correlation: 0.00), together with smooth transformations of each variable that expose the dependence]
36. Define two spaces, one for each witness
Function in $F$:
$$f(x) = \sum_{j=1}^{\infty} f_j\, \varphi_j(x)$$
Feature map $\varphi(x)$; kernel for RKHS $F$ on $\mathcal{X}$:
$$k(x, x') = \langle \varphi(x), \varphi(x') \rangle_F$$
Function in $G$:
$$g(y) = \sum_{j=1}^{\infty} g_j\, \psi_j(y)$$
Feature map $\psi(y)$; kernel for RKHS $G$ on $\mathcal{Y}$:
$$l(y, y') = \langle \psi(y), \psi(y') \rangle_G$$
38. The constrained covariance
The constrained covariance is
$$\mathrm{COCO}(P_{XY}) = \sup_{\substack{\|f\|_F \le 1 \\ \|g\|_G \le 1}} \mathrm{cov}\left[ \left( \sum_{j=1}^{\infty} f_j\, \varphi_j(x) \right) \left( \sum_{j=1}^{\infty} g_j\, \psi_j(y) \right) \right]$$
Fine print: feature mappings $\varphi(x)$ and $\psi(y)$ are assumed to have zero mean, so the covariance is the expectation $E_{xy}$ of the product.
Rewriting:
$$E_{xy}\left[ f(x)\, g(y) \right] = \begin{bmatrix} f_1 \\ f_2 \\ \vdots \end{bmatrix}^{\top} \underbrace{E_{xy} \left( \begin{bmatrix} \varphi_1(x) \\ \varphi_2(x) \\ \vdots \end{bmatrix} \begin{bmatrix} \psi_1(y) & \psi_2(y) & \cdots \end{bmatrix} \right)}_{C_{\varphi(x)\psi(y)}} \begin{bmatrix} g_1 \\ g_2 \\ \vdots \end{bmatrix}$$
COCO: the maximum singular value of the feature covariance $C_{\varphi(x)\psi(y)}$
42. Computing COCO in practice
Given sample $\{(x_i, y_i)\}_{i=1}^n \overset{i.i.d.}{\sim} P_{XY}$, what is empirical COCO?
COCO is the largest eigenvalue $\gamma_{\max}$ of
$$\begin{bmatrix} 0 & \frac{1}{n} KL \\ \frac{1}{n} LK & 0 \end{bmatrix},$$
where $K_{ij} = k(x_i, x_j)$ and $L_{ij} = l(y_i, y_j)$.
Witness functions (singular vectors):
$$f(x) \propto \sum_{i=1}^{n} \alpha_i\, k(x_i, x) \qquad g(y) \propto \sum_{i=1}^{n} \beta_i\, l(y_i, y)$$
Fine print: kernels are computed with empirically centered features $\varphi(x) - \frac{1}{n} \sum_{i=1}^n \varphi(x_i)$ and $\psi(y) - \frac{1}{n} \sum_{i=1}^n \psi(y_i)$.
A. Gretton, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schoelkopf, and N. Logothetis, AISTATS'05
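A numerical sketch of empirical COCO (Gaussian kernels are an assumption). Since $K$ and $L$ are symmetric, the block matrix is symmetric, and its largest eigenvalue coincides with $\frac{1}{n}$ times the largest singular value of $KL$, which the code checks:

```python
import numpy as np

def gram(z, sigma=1.0):
    return np.exp(-(z[:, None] - z[None, :])**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = np.sin(2 * x) + 0.1 * rng.normal(size=n)  # nonlinearly dependent on x

# Centered Gram matrices (fine print: empirically centered features).
H = np.eye(n) - np.ones((n, n)) / n
K = H @ gram(x) @ H
L = H @ gram(y) @ H

# Largest eigenvalue of [[0, KL/n], [LK/n, 0]].
A = np.block([[np.zeros((n, n)), K @ L / n],
              [L @ K / n, np.zeros((n, n))]])
coco = np.linalg.eigvalsh(A)[-1]

# Equivalent form: (1/n) * largest singular value of KL.
coco_svd = np.linalg.svd(K @ L, compute_uv=False)[0] / n
print(coco, coco_svd)  # the two agree
```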
50. What is a large dependence with COCO?
[Figure: four panels over $X$ and $Y$: a smooth density and 500 samples from it; a rough density and 500 samples from it]
Density takes the form:
$$P_{XY} \propto 1 + \sin(\omega x)\sin(\omega y)$$
Which of these is the more "dependent"?
57. Dependence largest when at "low" frequencies
As dependence is encoded at higher frequencies, the smooth mappings $f, g$ achieve lower linear dependence.
Even for independent variables, COCO will not be zero at finite sample sizes, since some mild linear dependence will be found by $f, g$ (bias).
This bias will decrease with increasing sample size.
58. Can we do better than COCO?
A second example with zero correlation.
First singular value of the feature covariance $C_{\varphi(x)\psi(y)}$:
[Figure: scatter plot (correlation: 0.00); first pair of transformed variables, correlation 0.80, $\mathrm{COCO}_1$: 0.11]
Second singular value of the feature covariance $C_{\varphi(x)\psi(y)}$:
[Figure: second pair of transformed variables, correlation 0.37, $\mathrm{COCO}_2$: 0.06]
61. The Hilbert-Schmidt Independence Criterion
Writing the $i$th singular value of the feature covariance $C_{\varphi(x)\psi(y)}$ as
$$\gamma_i := \mathrm{COCO}_i(P_{XY};\, F, G),$$
define the Hilbert-Schmidt Independence Criterion (HSIC)
$$\mathrm{HSIC}^2(P_{XY};\, F, G) = \sum_{i=1}^{\infty} \gamma_i^2.$$
A. Gretton, O. Bousquet, A. Smola, and B. Schoelkopf, ALT 2005; A. Gretton, K. Fukumizu, C.H. Teo, L. Song, B. Schoelkopf, and A. Smola, NIPS 2007.
HSIC is MMD with a product kernel!
$$\mathrm{HSIC}^2(P_{XY};\, F, G) = \mathrm{MMD}^2(P_{XY},\, P_X P_Y;\, H)$$
where the kernel on pairs is $\kappa\big((x, y), (x', y')\big) = k(x, x')\, l(y, y')$.
63. Asymptotics of HSIC under independence
Given sample $\{(x_i, y_i)\}_{i=1}^n \overset{i.i.d.}{\sim} P_{XY}$, what is empirical HSIC?
Empirical HSIC (biased)
$$\widehat{\mathrm{HSIC}} = \frac{1}{n^2}\, \mathrm{trace}(KL)$$
$K_{ij} = k(x_i, x_j)$ and $L_{ij} = l(y_i, y_j)$ ($K$ and $L$ computed with empirically centered features)
Statistical testing: given $P_{XY} = P_X P_Y$, what is the threshold $c_\alpha$ such that $P(\mathrm{HSIC} > c_\alpha) \le \alpha$ for small $\alpha$?
Asymptotics of HSIC when $P_{XY} = P_X P_Y$:
$$n \cdot \widehat{\mathrm{HSIC}} \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l\, z_l^2, \qquad z_l \overset{i.i.d.}{\sim} N(0, 1),$$
where $\lambda_l\, \psi_l(z_j) = \int h_{ijqr}\, \psi_l(z_i)\, dF_{i,q,r}$ and
$$h_{ijqr} = \frac{1}{4!} \sum_{(t,u,v,w)}^{(i,j,q,r)} \big( k_{tu} l_{tu} + k_{tu} l_{vw} - 2 k_{tu} l_{tv} \big)$$
67. A statistical test
Given $P_{XY} = P_X P_Y$, what is the threshold $c_\alpha$ such that $P(\mathrm{HSIC} > c_\alpha) \le \alpha$ (prob. of false positive)?
Original time series:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10
Permutation:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
Y7 Y3 Y9 Y2 Y4 Y8 Y5 Y1 Y6 Y10
Null distribution via permutation:
Compute HSIC for $\{x_i, y_{\sigma(i)}\}_{i=1}^n$ for a random permutation $\sigma$ of the indices $\{1, \ldots, n\}$. This gives HSIC for independent variables.
Repeat for many different permutations to get an empirical CDF.
The threshold $c_\alpha$ is the $1 - \alpha$ quantile of the empirical CDF.
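The permutation recipe above can be sketched in a few lines (assumed setup, not from the slides: Gaussian kernels, $\alpha = 0.05$, 200 permutations):

```python
import numpy as np

def gram(z, sigma=1.0):
    return np.exp(-(z[:, None] - z[None, :])**2 / (2 * sigma**2))

def hsic(x, y):
    """Biased empirical HSIC: trace(HKH HLH)/n^2."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ gram(x) @ H @ H @ gram(y) @ H) / n**2

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = x + 0.3 * rng.normal(size=n)  # strongly dependent pair

stat = hsic(x, y)
# Null distribution: recompute HSIC with y randomly permuted.
null = np.array([hsic(x, y[rng.permutation(n)]) for _ in range(200)])
c_alpha = np.quantile(null, 0.95)  # 1 - alpha quantile, alpha = 0.05
print(stat, c_alpha)  # reject independence when stat exceeds the threshold
```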
70. Application: dependence detection across languages
Testing task: detect dependence between English and French text
[Figure: aligned English ($X$) and French ($Y$) extracts, e.g.:]
X: "Honourable senators, I have a question for the Leader of the Government in the Senate"
Y: "Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat"
X: "No doubt there is great pressure on provincial and municipal governments"
Y: "Les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions"
X: "In fact, we have increased federal investments for early childhood development."
Y: "Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes"
Text from the aligned Hansards of the 36th Parliament of Canada,
https://www.isi.edu/natural-language/download/hansard/
71. Application: dependence detection across languages
Testing task: detect dependence between English and French text
$k$-spectrum kernel, $k = 10$, sample size $n = 10$
$$\widehat{\mathrm{HSIC}} = \frac{1}{n^2}\, \mathrm{trace}(KL)$$
($K$ and $L$ column centered)
72. Application: dependence detection across languages
Results (for $\alpha = 0.05$):
$k$-spectrum kernel: average Type II error 0
Bag-of-words kernel: average Type II error 0.18
Settings: five-line extracts, averaged over 300 repetitions, for "Agriculture" transcripts. Similar results for the Fisheries and Immigration transcripts.
74. Detecting higher order interaction
How to detect V-structures with pairwise weak individual dependence?
[Graph: V-structure X → Z ← Y]
$$X \perp\!\!\!\perp Y, \quad Y \perp\!\!\!\perp Z, \quad X \perp\!\!\!\perp Z$$
[Figure: scatter plots of X1 vs Y1, Y1 vs Z1, and X1 vs Z1 (no visible dependence), and of X1*Y1 vs Z1 (clear dependence)]
$$X, Y \overset{i.i.d.}{\sim} N(0, 1), \qquad Z \mid X, Y \sim \mathrm{sign}(XY)\, \mathrm{Exp}\!\left(\tfrac{1}{\sqrt{2}}\right)$$
Fine print: faithfulness violated here!
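The construction is easy to simulate. The sketch below (an illustration, taking $\mathrm{Exp}(1/\sqrt{2})$ to have mean $1/\sqrt{2}$, which is an assumption about the slide's parameterisation) checks that the pairwise sample correlations vanish while the product $XY$ predicts $Z$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
x = rng.normal(size=n)
y = rng.normal(size=n)
# Z | X, Y ~ sign(XY) * Exp(1/sqrt(2)); numpy's `scale` is the mean of the exponential.
z = np.sign(x * y) * rng.exponential(scale=1 / np.sqrt(2), size=n)

r_xz = np.corrcoef(x, z)[0, 1]       # ~ 0: no pairwise linear dependence
r_yz = np.corrcoef(y, z)[0, 1]       # ~ 0
r_xyz = np.corrcoef(x * y, z)[0, 1]  # clearly positive: a three-way interaction
print(r_xz, r_yz, r_xyz)
```

Correlation only probes linear dependence, but here the pairwise marginals are in fact independent by construction.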
77. V-structure discovery
[Graph: V-structure X → Z ← Y]
Assume $X \perp\!\!\!\perp Y$ has been established. The V-structure can then be detected by:
Consistent CI test: $H_0: X \perp\!\!\!\perp Y \mid Z$ [Fukumizu et al. 2008, Zhang et al. 2011]
Factorisation test: $H_0: (X, Y) \perp\!\!\!\perp Z \ \vee\ (X, Z) \perp\!\!\!\perp Y \ \vee\ (Y, Z) \perp\!\!\!\perp X$ (multiple standard two-variable tests)
How well do these work?
78. Detecting higher order interaction
Generalise the earlier example to $p$ dimensions:
$$X \perp\!\!\!\perp Y, \quad Y \perp\!\!\!\perp Z, \quad X \perp\!\!\!\perp Z$$
$$X_1, Y_1 \overset{i.i.d.}{\sim} N(0, 1), \qquad Z_1 \mid X_1, Y_1 \sim \mathrm{sign}(X_1 Y_1)\, \mathrm{Exp}\!\left(\tfrac{1}{\sqrt{2}}\right)$$
$$X_{2:p},\, Y_{2:p},\, Z_{2:p} \overset{i.i.d.}{\sim} N(0, I_{p-1})$$
Fine print: faithfulness violated here!
80. Lancaster interaction measure
The Lancaster interaction measure of $(X_1, \ldots, X_D) \sim P$ is a signed measure $\Delta_L P$ that vanishes whenever $P$ can be factorised non-trivially.
$$D = 2: \quad \Delta_L P = P_{XY} - P_X P_Y$$
$$D = 3: \quad \Delta_L P = P_{XYZ} - P_X P_{YZ} - P_Y P_{XZ} - P_Z P_{XY} + 2 P_X P_Y P_Z$$
[Figure: the five terms of $\Delta_L P$ drawn as graphs over $X$, $Y$, $Z$]
In the case $X \perp\!\!\!\perp (Y, Z)$, $\Delta_L P = 0$; likewise for the other non-trivial factorisations:
$$(X, Y) \perp\!\!\!\perp Z \ \vee\ (X, Z) \perp\!\!\!\perp Y \ \vee\ (Y, Z) \perp\!\!\!\perp X \ \Longrightarrow\ \Delta_L P = 0.$$
...so what might be missed?
$$\Delta_L P = 0 \ \not\Longrightarrow\ (X, Y) \perp\!\!\!\perp Z \ \vee\ (X, Z) \perp\!\!\!\perp Y \ \vee\ (Y, Z) \perp\!\!\!\perp X$$
Example:
P(0,0,0) = 0.2  P(0,0,1) = 0.1  P(1,0,0) = 0.1  P(1,0,1) = 0.1
P(0,1,0) = 0.1  P(0,1,1) = 0.1  P(1,1,0) = 0.1  P(1,1,1) = 0.2
86. A kernel test statistic using the Lancaster measure
Construct a test by estimating $\|\Delta_L P\|_H^2$, where the kernel on triples is $\kappa = k \otimes l \otimes m$:
$$\|P_{XYZ} - P_{XY} P_Z - \cdots\|_H^2 = \langle P_{XYZ}, P_{XYZ} \rangle_H - 2 \langle P_{XYZ}, P_{XY} P_Z \rangle_H - \cdots$$
87. A kernel test statistic using the Lancaster measure
Table: $V$-statistic estimators of $\langle \cdot, \cdot \rangle_H$ (without the terms $P_X P_Y P_Z$). $H$ is the centering matrix $I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$.
Lancaster interaction statistic: D. Sejdinovic, AG, W. Bergsma, NIPS 2013
$$\widehat{\|\Delta_L P\|_H^2} = \frac{1}{n^2} \big( (HKH) \circ (HLH) \circ (HMH) \big)_{++}$$
Empirical joint central moment in the feature space
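A sketch of the statistic on the earlier V-structure example (assumptions, not from the slides: Gaussian kernels, exponential with mean $1/\sqrt{2}$). The interacting triple should score higher than one where $Z$ is replaced by independent noise:

```python
import numpy as np

def gram(v, sigma=1.0):
    return np.exp(-(v[:, None] - v[None, :])**2 / (2 * sigma**2))

def lancaster_stat(x, y, z):
    """Empirical Lancaster statistic ((HKH) o (HLH) o (HMH))_{++} / n^2."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L, M = (H @ gram(v) @ H for v in (x, y, z))
    return (K * L * M).sum() / n**2  # elementwise products, summed over all entries

rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
y = rng.normal(size=n)
z = np.sign(x * y) * rng.exponential(scale=1 / np.sqrt(2), size=n)
z_ind = rng.normal(size=n)  # no three-way interaction
print(lancaster_stat(x, y, z), lancaster_stat(x, y, z_ind))
```

In a real test, the threshold would again come from a permutation null rather than from a direct comparison.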
90. Interaction for D ≥ 4
Interaction measure valid for all $D$ (Streitberg, 1990):
$$\Delta_S P = \sum_{\pi} (-1)^{|\pi| - 1}\, (|\pi| - 1)!\; J_\pi P$$
For a partition $\pi$, $J_\pi$ associates to the joint the corresponding factorisation, e.g., $J_{13|2|4} P = P_{X_1 X_3} P_{X_2} P_{X_4}$.
[Figure: the number of partitions of $\{1, \ldots, D\}$ (the Bell numbers) grows extremely fast, approaching $10^{19}$ by $D = 25$]
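The partition count driving this blow-up is the Bell number $B_D$, which satisfies the recurrence $B_{m+1} = \sum_{k=0}^{m} \binom{m}{k} B_k$; a short sketch:

```python
from math import comb

def bell(n):
    """Bell number B_n: the number of partitions of {1, ..., n}."""
    B = [1]  # B_0
    for m in range(n):
        # B_{m+1} = sum_k C(m, k) * B_k
        B.append(sum(comb(m, k) * B[k] for k in range(m + 1)))
    return B[n]

print([bell(d) for d in range(1, 7)])  # [1, 2, 5, 15, 52, 203]
print(bell(25))  # number of terms in the Streitberg sum at D = 25
```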