This document discusses using the maximum mean discrepancy (MMD) to compare probability distributions and samples. The MMD is defined as the distance between the mean embeddings of the distributions in a reproducing kernel Hilbert space. It can be used for two-sample testing (do two samples come from the same distribution?), goodness-of-fit testing (how well does a model fit the data?), and independence testing. The document outlines applications of the MMD, including troubleshooting generative adversarial networks, and shows how to construct empirical witness functions and compute the MMD from samples.
3. An example: two-sample tests
Have: two collections of samples X, Y from unknown distributions P and Q.
Goal: do P and Q differ?
[Figure: MNIST samples vs. samples from a GAN.]
Significant difference between GAN and MNIST?
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, Xi Chen, NIPS 2016.
4. Testing goodness of fit
Given: a model P and samples from Q.
Goal: is P a good fit for Q?
[Figure: Chicago crime data; the model is a Gaussian mixture with two components.]
5. Testing independence
Given: samples from a joint distribution $P_{XY}$.
Goal: are X and Y independent?
[Figure: dog photos (X) paired with text descriptions (Y). Text from dogtime.com and petfinder.com:]
"Their noses guide them through life, and they're never happier than when following an interesting scent."
"A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose."
"A responsive, interactive pet, one that will blow in your ear and follow you everywhere."
6. Outline: part 1
Two-sample testing
Test statistic: Maximum Mean Discrepancy (MMD)...
...as a difference in feature means
...as an integral probability metric (not just a technicality!)
Statistical testing with the MMD
Troubleshooting GANs with the MMD
7. Outline: part 2
Goodness-of-fit testing
The kernel Stein discrepancy
Dependence testing
Dependence using the MMD
Dependence using feature covariances
Statistical testing
Additional topics
10. Feature mean difference
Simple example: two Gaussians with different means.
Answer: a t-test.
[Figure: two Gaussian densities $P_X$ and $Q_X$ with different means.]
11. Feature mean difference
Two Gaussians with the same means but different variances.
Idea: look at the difference in means of features of the random variables.
In the Gaussian case: second-order features of the form $\varphi(x) = x^2$.
[Figure: two Gaussian densities $P_X$ and $Q_X$ with different variances; the densities of the feature $X^2$ differ visibly.]
13. Feature mean difference
Gaussian and Laplace distributions with the same mean and the same variance.
Difference in means using higher-order features... enter the RKHS.
[Figure: Gaussian and Laplace densities $P_X$ and $Q_X$.]
14. Infinitely many features using kernels
Kernels: dot products of features.
Feature map $\varphi(x) \in \mathcal{F}$,
$$\varphi(x) = [\ldots, \varphi_i(x), \ldots] \in \ell^2$$
For positive definite k,
$$k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{F}}$$
Infinitely many features $\varphi(x)$, yet the dot product is available in closed form!
Exponentiated quadratic kernel:
$$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$
Features: Gaussian Processes for Machine Learning, Rasmussen and Williams, Ch. 4.
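To make the closed-form dot product concrete, here is a minimal NumPy sketch of the exponentiated quadratic kernel between two sample sets. The bandwidth parametrization with sigma is an assumption (the slide omits the scaling), and the function name is mine:

```python
import numpy as np

def expquad_kernel(X, Y, sigma=1.0):
    """Exponentiated quadratic kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2)).

    X: (n, d) array, Y: (m, d) array; returns an (n, m) array.
    """
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y>; clip tiny negatives from round-off
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

# Each entry is a dot product of (infinitely many) features, evaluated in closed form
K = expquad_kernel(np.random.randn(5, 3), np.random.randn(4, 3), sigma=1.0)
```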
16. Infinitely many features of distributions
Given P a Borel probability measure on $\mathcal{X}$, define the feature map of the probability distribution P:
$$\mu_P = [\ldots, \mathbf{E}_P[\varphi_i(X)], \ldots]$$
For positive definite $k(x, x')$,
$$\langle \mu_P, \mu_Q \rangle_{\mathcal{F}} = \mathbf{E}_{P,Q}\, k(X, Y)$$
for $X \sim P$ and $Y \sim Q$.
Fine print: the feature map $\varphi(x)$ must be Bochner integrable for all probability measures considered. Always true if the kernel is bounded.
18. The maximum mean discrepancy
The maximum mean discrepancy is the distance between feature means:
$$\mathrm{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{F}}^2$$
$$= \langle \mu_P, \mu_P \rangle_{\mathcal{F}} + \langle \mu_Q, \mu_Q \rangle_{\mathcal{F}} - 2\,\langle \mu_P, \mu_Q \rangle_{\mathcal{F}}$$
$$= \underbrace{\mathbf{E}_P\, k(X, X')}_{(a)} + \underbrace{\mathbf{E}_Q\, k(Y, Y')}_{(a)} - 2\,\underbrace{\mathbf{E}_{P,Q}\, k(X, Y)}_{(b)}$$
(a) = within-distribution similarity, (b) = cross-distribution similarity.
21. Illustration of MMD
Dogs (= P) and fish (= Q) example revisited.
Each entry of the kernel matrix is one of $k(\mathrm{dog}_i, \mathrm{dog}_j)$, $k(\mathrm{dog}_i, \mathrm{fish}_j)$, or $k(\mathrm{fish}_i, \mathrm{fish}_j)$.
22. Illustration of MMD
An empirical estimate of the maximum mean discrepancy:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{dog}_i, \mathrm{dog}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{fish}_i, \mathrm{fish}_j) - \frac{2}{n^2} \sum_{i,j} k(\mathrm{dog}_i, \mathrm{fish}_j)$$
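A sketch of this estimator in NumPy, generalized to allow different sample sizes n and m (the slide's version has n = m); the kernel matrices could come from any positive definite kernel, for example the exponentiated quadratic sketch above:

```python
import numpy as np

def mmd2_unbiased(K_xx, K_yy, K_xy):
    """Empirical MMD^2 from precomputed kernel matrices, matching the sums above:
    diagonal (i = j) terms are excluded from the two within-sample averages."""
    n, m = K_xy.shape
    term_x = (K_xx.sum() - np.trace(K_xx)) / (n * (n - 1))
    term_y = (K_yy.sum() - np.trace(K_yy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * K_xy.mean()
```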
23. MMD as an integral probability metric
Are P and Q different?
[Figure: two draws of samples from P and Q; the difference is hard to see by eye.]
25. MMD as an integral probability metric
Integral probability metric:
Find a well-behaved function $f(x)$ to maximize
$$\mathbf{E}_P f(X) - \mathbf{E}_Q f(Y)$$
[Figure: a smooth function $f(x)$ over the sample domain.]
27. MMD as an integral probability metric
Maximum mean discrepancy: a smooth function distinguishing P from Q:
$$\mathrm{MMD}(P, Q; F) := \sup_{\|f\|_{\mathcal{F}} \leq 1} \left[ \mathbf{E}_P f(X) - \mathbf{E}_Q f(Y) \right]$$
(F is the unit ball in the RKHS $\mathcal{F}$)
[Figure: witness function f for Gaussian and Laplace densities.]
Functions in the RKHS are linear combinations of features:
$$f(x) = \langle f, \varphi(x) \rangle_{\mathcal{F}}$$
Expectations of functions are linear combinations of expected features:
$$\mathbf{E}_P[f(X)] = \langle f, \mathbf{E}_P \varphi(X) \rangle_{\mathcal{F}} = \langle f, \mu_P \rangle_{\mathcal{F}}$$
(always true if the kernel is bounded)
For a characteristic RKHS $\mathcal{F}$, $\mathrm{MMD}(P, Q; F) = 0$ iff $P = Q$.
Other choices for the witness function class:
Bounded continuous [Dudley, 2002]
Bounded variation 1 (Kolmogorov metric) [Müller, 1997]
Bounded Lipschitz (Wasserstein distances) [Dudley, 2002]
32. Integral prob. metric vs feature difference
The MMD:
$$\mathrm{MMD}(P, Q; F) = \sup_{f \in F} \left[ \mathbf{E}_P f(X) - \mathbf{E}_Q f(Y) \right]$$
$$= \sup_{f \in F} \langle f, \mu_P - \mu_Q \rangle_{\mathcal{F}} \qquad \text{(using } \mathbf{E}_P f(X) = \langle f, \mu_P \rangle_{\mathcal{F}} \text{)}$$
$$= \|\mu_P - \mu_Q\|_{\mathcal{F}}$$
The supremum is attained at the unit-norm witness $f^* \propto \mu_P - \mu_Q$.
Function view and feature view are equivalent.
38. Construction of MMD witness
Construction of the empirical witness function (proof: next slide!):
Observe $X = \{x_1, \ldots, x_n\} \sim P$
Observe $Y = \{y_1, \ldots, y_n\} \sim Q$
[Figure: the empirical witness function, evaluated at a location v.]
42. Derivation of empirical witness function
Recall the witness function expression:
$$f^* \propto \mu_P - \mu_Q$$
The empirical feature mean for P:
$$\hat{\mu}_P := \frac{1}{n} \sum_{i=1}^{n} \varphi(x_i)$$
The empirical witness function at v:
$$f^*(v) = \langle f^*, \varphi(v) \rangle_{\mathcal{F}} \propto \langle \hat{\mu}_P - \hat{\mu}_Q, \varphi(v) \rangle_{\mathcal{F}} = \frac{1}{n} \sum_{i=1}^{n} k(x_i, v) - \frac{1}{n} \sum_{i=1}^{n} k(y_i, v)$$
We don't need explicit feature coefficients $f^* := [f^*_1 \; f^*_2 \; \ldots]$.
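A minimal sketch of this construction in NumPy, evaluated on the Gaussian-vs-Laplace example from earlier slides (function names and the specific sigma are mine; the witness is computed up to the positive scaling the proportionality hides):

```python
import numpy as np

def expquad(A, B, sigma=1.0):
    """Exponentiated quadratic kernel matrix between row-sample arrays A (n, d), B (m, d)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def witness(V, X, Y, sigma=1.0):
    """Empirical witness at locations V, up to scaling:
    (1/n) sum_i k(x_i, v) - (1/n) sum_i k(y_i, v)."""
    return expquad(X, V, sigma).mean(axis=0) - expquad(Y, V, sigma).mean(axis=0)

# Example: Gaussian vs Laplace samples with matched mean and variance
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1))                         # N(0, 1)
Y = rng.laplace(scale=1 / np.sqrt(2), size=(2000, 1))  # Laplace with variance 1
grid = np.linspace(-4, 4, 201)[:, None]
w = witness(grid, X, Y)  # positive where P has more mass, negative where Q does
```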
48. A statistical test using MMD
The empirical MMD:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)$$
How does this help us decide whether P = Q?
Perspective from statistical hypothesis testing:
Null hypothesis $H_0$ ($P = Q$): we should see $\widehat{\mathrm{MMD}}^2$ "close to zero".
Alternative hypothesis $H_1$ ($P \neq Q$): we should see $\widehat{\mathrm{MMD}}^2$ "far from zero".
Want: a threshold c for $\widehat{\mathrm{MMD}}^2$ that gives false positive rate $\alpha$.
51. Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P \neq Q$
Draw n = 200 i.i.d. samples from P and Q.
Laplace distributions with different y-variances.
$$\sqrt{n} \cdot \widehat{\mathrm{MMD}}^2 = 1.2$$
[Figure: scatter plot of the two samples P and Q.]
53. Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P \neq Q$
Draw n = 200 new samples from P and Q.
Laplace distributions with different y-variances.
$$\sqrt{n} \cdot \widehat{\mathrm{MMD}}^2 = 1.5$$
55. Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P \neq Q$
Repeat this 150, 300, then 3000 times...
[Figure: the histogram of $\sqrt{n} \cdot \widehat{\mathrm{MMD}}^2$ values fills out into a smooth empirical distribution.]
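A sketch reproducing this experiment; the Laplace scale parameters below are illustrative assumptions, since the slides only say the y-variances differ:

```python
import numpy as np

def expquad(A, B, sigma=1.0):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2u(X, Y, sigma=1.0):
    """Unbiased empirical MMD^2, diagonal terms excluded from within-sample sums."""
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = expquad(X, X, sigma), expquad(Y, Y, sigma), expquad(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(0)
n, reps = 200, 3000
stats = []
for _ in range(reps):
    # Two 2-d Laplace distributions differing only in the y-axis scale (assumed values)
    X = rng.laplace(scale=[1.0, 1.0], size=(n, 2))
    Y = rng.laplace(scale=[1.0, 3.0], size=(n, 2))
    stats.append(np.sqrt(n) * mmd2u(X, Y))
stats = np.array(stats)  # histogram this to reproduce the empirical distribution
```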
58. Asymptotics of $\widehat{\mathrm{MMD}}^2$ when $P \neq Q$
When $P \neq Q$, the statistic is asymptotically normal:
$$\frac{\widehat{\mathrm{MMD}}^2 - \mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}} \xrightarrow{D} \mathcal{N}(0, 1),$$
where the variance $V_n(P, Q) = O(n^{-1})$.
[Figure: empirical PDF of the statistic with a Gaussian fit, for two Laplace distributions with different variances.]
60. Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P = Q$
Case of $P = Q = \mathcal{N}(0, 1)$.
[Figure: histogram of the statistic under the null, accumulated over repeated draws.]
65. Asymptotics of $\widehat{\mathrm{MMD}}^2$ when $P = Q$
Where $P = Q$, the statistic has the asymptotic distribution
$$n \, \widehat{\mathrm{MMD}}^2 \sim \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right]$$
where
$$\lambda_l \, \psi_l(x') = \int_{\mathcal{X}} \underbrace{\tilde{k}(x, x')}_{\text{centred}} \, \psi_l(x) \, dP(x)$$
and $z_l \sim \mathcal{N}(0, 2)$ i.i.d.
[Figure: the null distribution of $n \, \widehat{\mathrm{MMD}}^2$.]
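The slide only states the limit; one common way to use it in practice (my assumption: this is the standard spectral approach, not something spelled out here) is to estimate the $\lambda_l$ as eigenvalues of the centred Gram matrix divided by n, then simulate the truncated sum:

```python
import numpy as np

def null_samples_spectral(K, num_draws=5000, rng=None):
    """Approximate draws from the P = Q limit of n * MMD^2-hat,
    sum_l lambda_l (z_l^2 - 2) with z_l ~ N(0, 2) i.i.d.

    K: (n, n) Gram matrix on pooled samples (both drawn from the null).
    Eigenvalues of the doubly centred Gram matrix / n estimate the lambda_l."""
    rng = rng or np.random.default_rng()
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    lam = np.linalg.eigvalsh(H @ K @ H) / n      # estimated eigenvalues
    lam = lam[lam > 1e-12]                       # keep numerically positive ones
    z = rng.normal(0.0, np.sqrt(2.0), size=(num_draws, lam.size))
    return (z**2 - 2.0) @ lam                    # one draw per row
```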
66. A statistical test
A summary of the asymptotics:
[Figure: the null ($P = Q$) and alternative ($P \neq Q$) distributions of the statistic on common axes.]
67. A statistical test
Test construction: (G., Borgwardt, Rasch, Schoelkopf, and Smola, JMLR 2012)
[Figure: the test threshold c marked on the null distribution; reject $H_0$ when the statistic exceeds it.]
68. How do we get test threshold c?
Original empirical MMD for dogs and fish:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)$$
[Figure: block structure of the kernel matrix, with blocks $k(x_i, x_j)$, $k(y_i, y_j)$, and $k(x_i, y_j)$.]
69. How do we get test threshold c?
Permuted dog and fish samples ("merdogs"):
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(\tilde{x}_i, \tilde{x}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\tilde{y}_i, \tilde{y}_j) - \frac{2}{n^2} \sum_{i,j} k(\tilde{x}_i, \tilde{y}_j)$$
Permutation simulates $P = Q$.
[Figure: kernel matrix blocks $k(\tilde{x}_i, \tilde{x}_j)$, $k(\tilde{y}_i, \tilde{y}_j)$, $k(\tilde{x}_i, \tilde{y}_j)$ after permuting the pooled samples.]
71. Demonstration of permutation estimate of null
Null distribution estimated from 500 permutations, with $P = Q = \mathcal{N}(0, 1)$.
[Figure: permutation histogram overlaid on the null distribution.]
Use the $1 - \alpha$ quantile of the permutation distribution as the test threshold c.
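A minimal sketch of the permutation procedure (function and parameter names are mine); pass in any statistic, for example the mmd2u sketch above:

```python
import numpy as np

def permutation_test(X, Y, statistic, num_perms=500, alpha=0.05, rng=None):
    """Two-sample test: the threshold c is the (1 - alpha) quantile of the statistic
    recomputed on random re-partitions of the pooled samples, which simulates P = Q."""
    rng = rng or np.random.default_rng()
    pooled = np.concatenate([X, Y])
    n = len(X)
    observed = statistic(X, Y)
    null = np.empty(num_perms)
    for i in range(num_perms):
        idx = rng.permutation(len(pooled))
        null[i] = statistic(pooled[idx[:n]], pooled[idx[n:]])
    c = np.quantile(null, 1.0 - alpha)
    return observed, c, observed > c  # reject H0 if observed > c
```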
74. Optimizing kernel for test power
The power of our test ($\Pr_1$ denotes probability under $P \neq Q$):
$$\Pr_1\left( n \, \widehat{\mathrm{MMD}}^2 > \hat{c}_\alpha \right) \rightarrow \Phi\left( \frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}} - \frac{c_\alpha}{n \sqrt{V_n(P, Q)}} \right)$$
where
$\Phi$ is the CDF of the standard normal distribution,
$\hat{c}_\alpha$ is an estimate of the test threshold $c_\alpha$.
76. Optimizing kernel for test power
In the limit above, the first term $\mathrm{MMD}^2(P, Q) / \sqrt{V_n(P, Q)}$ grows as $O(n^{1/2})$, while the second term $c_\alpha / (n \sqrt{V_n(P, Q)})$ shrinks as $O(n^{-1/2})$.
The variance under $H_1$ decreases as $\sqrt{V_n(P, Q)} \sim O(n^{-1/2})$.
For large n, the second term is negligible!
77. Optimizing kernel for test power
To maximize test power, maximize
$$\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}$$
(Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., ICLR 2017)
Code: github.com/dougalsutherland/opt-mmd
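A crude sketch of this criterion for choosing a kernel bandwidth: it scores each candidate sigma by a mean-over-std proxy for $\mathrm{MMD}^2 / \sqrt{V_n}$ computed on disjoint blocks of a training split. This block proxy is my simplification; the ICLR 2017 paper uses an explicit variance estimator for the full U-statistic (see the linked opt-mmd code). Selecting the kernel on a split separate from the final test keeps the test level valid.

```python
import numpy as np

def expquad(A, B, sigma):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2u(X, Y, sigma):
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = expquad(X, X, sigma), expquad(Y, Y, sigma), expquad(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

def select_bandwidth(X_tr, Y_tr, sigmas, num_blocks=10):
    """Pick the sigma maximizing mean(block MMD^2) / std(block MMD^2),
    a proxy for the power criterion MMD^2 / sqrt(V_n), on a training split."""
    Xb, Yb = np.array_split(X_tr, num_blocks), np.array_split(Y_tr, num_blocks)
    best_sigma, best_score = None, -np.inf
    for s in sigmas:
        vals = np.array([mmd2u(xb, yb, s) for xb, yb in zip(Xb, Yb)])
        score = vals.mean() / (vals.std() + 1e-12)
        if score > best_score:
            best_sigma, best_score = s, score
    return best_sigma
```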
80. Troubleshooting for generative adversarial networks
[Figure: MNIST samples vs. samples from a GAN, with the learned ARD map highlighting discriminative pixels.]
Power for the optimized ARD kernel: 1.00 at $\alpha = 0.01$.
Power for the optimized RBF kernel: 0.57 at $\alpha = 0.01$.
82. MMD for GAN critic
Can you use the MMD as a critic to train GANs?
Can you train convolutional features as input to the MMD critic?
From ICML 2015: [paper excerpt shown as image]
From UAI 2015: [paper excerpt shown as image]
90. Distinguishing Feature(s)
New test statistic: $\mathrm{witness}^2$ evaluated at a single location $v^*$.
Linear time in the number n of samples...
...but how to choose the best feature $v^*$?
97. Variance of witness function
Variance at v = variance of X at v + variance of Y at v.
ME statistic:
$$\hat{\lambda}_n(v) := n \, \frac{\mathrm{witness}^2(v)}{\text{variance at } v}$$
Jitkrittum, Szabo, Chwialkowski, G., NIPS 2016
[Figure: $\mathrm{witness}^2(v)$ together with the variances of X and Y at v, and the resulting $\hat{\lambda}_n(v)$.]
The best location is the $v^*$ that maximizes $\hat{\lambda}_n$.
Improve performance using multiple locations $\{v^*_j\}_{j=1}^{J}$.
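A sketch of the single-location statistic as defined above. The NIPS 2016 test uses J locations, a regularized covariance, and a $\chi^2(J)$ threshold; the ridge term reg below is my addition for numerical stability, and the kernel and its bandwidth are assumptions:

```python
import numpy as np

def me_statistic(X, Y, v, sigma=1.0, reg=1e-5):
    """Single-location ME statistic: lambda_n(v) = n * witness(v)^2 / variance at v.

    witness(v) = mean_i k(x_i, v) - mean_i k(y_i, v); its variance is estimated as
    the empirical variance of k(X, v) plus that of k(Y, v)."""
    n = len(X)
    kx = np.exp(-((X - v)**2).sum(1) / (2.0 * sigma**2))  # k(x_i, v)
    ky = np.exp(-((Y - v)**2).sum(1) / (2.0 * sigma**2))  # k(y_i, v)
    wit = kx.mean() - ky.mean()
    var = kx.var() + ky.var() + reg
    return n * wit**2 / var
```

In use, one would scan candidate locations v on a held-out training split, keep the maximizer $v^*$, then compare $\hat{\lambda}_n(v^*)$ computed on the test split against the $(1 - \alpha)$ quantile of $\chi^2(1)$ (the J = 1 case of the threshold described on the next slide).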
104. Properties of the ME Test
Can use J features $V = \{v_1, \ldots, v_J\}$.
Under $H_0$: $P = Q$, $\hat{\lambda}_n(V)$ asymptotically follows $\chi^2(J)$ for any V.
The rejection threshold is $T_\alpha$, the $(1 - \alpha)$-quantile of $\chi^2(J)$.
Under $H_1$: $P \neq Q$, the statistic follows $\mathbb{P}_{H_1}(\hat{\lambda}_n)$ (unknown). But asymptotically $\hat{\lambda}_n \to \infty$: a consistent test.
Test power = probability of rejecting $H_0$ when $H_1$ is true.
[Figure: the $\chi^2(J)$ null density, the threshold $T_\alpha$, and the distribution of $\hat{\lambda}_n$ under $H_1$.]
Theorem: under $H_1$, optimizing V (by maximizing $\hat{\lambda}_n$) increases a lower bound on the test power.
Runtime: O(n) for both testing and optimization.
112. Distinguishing Positive/Negative Emotions
+ : happy, neutral, surprised.
− : afraid, angry, disgusted.
35 females and 35 males (Lundqvist et al., 1998); 48 × 34 = 1632 dimensions; pixel features; sample size: 402.
[Figure: test power for + vs. −, comparing a random feature location, the proposed (optimized) test, and the quadratic-time MMD test.]
The proposed test achieves maximum test power in time O(n).
Informative features: differences at the nose and smile lines.
[Figure: the learned feature location, highlighting the nose and smile-line regions.]
Jitkrittum, Szabo, Chwialkowski, G., NIPS 2016
Code: https://github.com/wittawatj/interpretable-test
118. Co-authors
From Gatsby:
Kacper Chwialkowski
Wittawat Jitkrittum
Bharath Sriperumbudur
Heiko Strathmann
Dougal Sutherland
Zoltan Szabo
Wenkai Xu
External collaborators:
Kenji Fukumizu
Bernhard Schoelkopf
Alex Smola
Questions?