Representing and comparing probabilities
Arthur Gretton
Gatsby Computational Neuroscience Unit,
University College London
UAI, 2017
1/51
Comparing two samples
Given: Samples from unknown distributions P and Q.
Goal: do P and Q differ?
2/51
An example: two-sample tests
Have: Two collections of samples X, Y from unknown distributions
P and Q.
Goal: do P and Q differ?
MNIST samples Samples from a GAN
Significant difference in GAN and MNIST?
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, Xi Chen, NIPS 2016. 3/51
Testing goodness of fit
Given: A model P and samples from Q.
Goal: is P a good fit for Q?
Chicago crime data
Model is Gaussian mixture with two components.
4/51
Testing independence
Given: Samples from a joint distribution P_XY
Goal: Are X and Y independent?
Their noses guide them through life, and they're never happier than when following an interesting scent.
A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose.
A responsive, interactive pet, one that will blow in your ear and follow you everywhere.
Text from dogtime.com and petfinder.com
5/51
Outline: part 1
Two sample testing
Test statistic: Maximum Mean Discrepancy (MMD)...
 ...as a difference in feature means
 ...as an integral probability metric (not just a technicality!)
Statistical testing with the MMD
Troubleshooting GANs with MMD
6/51
Outline: part 2
Goodness of fit testing
The kernel Stein discrepancy
Dependence testing
Dependence using the MMD
Dependence using feature covariances
Statistical testing
Additional topics
7/51
Maximum Mean Discrepancy
8/51
Feature mean difference
Simple example: 2 Gaussians with different means
Answer: t-test
[Figure: probability densities of two Gaussians P_X and Q_X with different means]
9/51
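For this simplest case, the test can be run directly; the sketch below is a minimal two-sample t-test with scipy. The sample sizes, the mean shift, and the equal-variance setting are illustrative assumptions, not values from the slides.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=200)  # samples from P
y = rng.normal(loc=0.5, scale=1.0, size=200)  # samples from Q (mean shifted)

# The t-test compares sample means, so it detects a difference in means only.
t_stat, p_value = stats.ttest_ind(x, y, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```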
Feature mean difference
Two Gaussians with same means, different variance
Idea: look at difference in means of features of the RVs
In the Gaussian case: second order features of the form $\varphi(x) = x^2$
[Figure: probability densities of two Gaussians with different variances]
[Figure: densities of the feature X², which differ between P and Q]
10/51
Feature mean difference
Gaussian and Laplace distributions
Same mean and same variance
Difference in means using higher order features...RKHS
[Figure: Gaussian and Laplace densities P_X and Q_X]
11/51
Infinitely many features using kernels
Kernels: dot products of features
Feature map $\varphi(x) \in \mathcal{F}$,
$$\varphi(x) = [\ldots\ \varphi_i(x)\ \ldots] \in \ell^2$$
For positive definite k,
$$k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{F}}$$
Infinitely many features φ(x), dot product in closed form!
Exponentiated quadratic kernel
$$k(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$
Features: Gaussian Processes for Machine Learning, Rasmussen and Williams, Ch. 4.
12/51
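As a concrete reference, here is a minimal numpy sketch of the exponentiated quadratic kernel and its Gram matrix; the bandwidth parameter sigma and the function names are assumptions for illustration.

```python
import numpy as np

def sq_dists(X, Y):
    """Pairwise squared Euclidean distances between rows of X and rows of Y."""
    d = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.maximum(d, 0.0)  # guard against tiny negative values from round-off

def rbf_kernel(X, Y, sigma=1.0):
    """Exponentiated quadratic kernel: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-sq_dists(X, Y) / (2.0 * sigma**2))
```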
Infinitely many features of distributions
Given P a Borel probability measure on $\mathcal{X}$, define the feature map of the
probability P,
$$\mu_P = [\ldots\ \mathbf{E}_P[\varphi_i(X)]\ \ldots]$$
For positive definite k(x, x'),
$$\langle \mu_P, \mu_Q \rangle_{\mathcal{F}} = \mathbf{E}_{P,Q}\, k(x, y)$$
for x ∼ P and y ∼ Q.
Fine print: feature map '(x) must be Bochner integrable for all probability measures considered.
Always true if kernel bounded.
13/51
The maximum mean discrepancy
The maximum mean discrepancy is the distance between feature
means:
$$\begin{aligned}
\mathrm{MMD}^2(P, Q) &= \|\mu_P - \mu_Q\|_{\mathcal{F}}^2 \\
&= \langle \mu_P, \mu_P \rangle_{\mathcal{F}} + \langle \mu_Q, \mu_Q \rangle_{\mathcal{F}} - 2\langle \mu_P, \mu_Q \rangle_{\mathcal{F}} \\
&= \underbrace{\mathbf{E}_P\, k(X, X')}_{(a)} + \underbrace{\mathbf{E}_Q\, k(Y, Y')}_{(a)} - 2\,\underbrace{\mathbf{E}_{P,Q}\, k(X, Y)}_{(b)}
\end{aligned}$$
(a)= within distrib. similarity, (b)= cross-distrib. similarity.
14/51
Illustration of MMD
Dogs (= P) and fish (= Q) example revisited
Each entry is one of k(dog_i, dog_j), k(dog_i, fish_j), or k(fish_i, fish_j)
15/51
Illustration of MMD
The maximum mean discrepancy:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)}\sum_{i \neq j} k(\mathrm{dog}_i, \mathrm{dog}_j) + \frac{1}{n(n-1)}\sum_{i \neq j} k(\mathrm{fish}_i, \mathrm{fish}_j) - \frac{2}{n^2}\sum_{i,j} k(\mathrm{dog}_i, \mathrm{fish}_j)$$
16/51
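A minimal numpy sketch of this estimator, assuming the within- and cross-sample Gram matrices have been computed for equal sample sizes n (e.g., with the kernel sketch earlier); the helper name is illustrative.

```python
import numpy as np

def mmd2_unbiased(K_xx, K_yy, K_xy):
    """Unbiased MMD^2 estimate from Gram matrices K_xx (n x n), K_yy (n x n),
    and K_xy (n x n). Diagonals are excluded from the within-sample sums,
    matching the 1/(n(n-1)) terms above."""
    n = K_xx.shape[0]
    term_xx = (K_xx.sum() - np.trace(K_xx)) / (n * (n - 1))
    term_yy = (K_yy.sum() - np.trace(K_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * K_xy.mean()
```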
MMD as an integral probability metric
Are P and Q different?
[Figure: samples from P and Q]
17/51
MMD as an integral probability metric
Integral probability metric:
Find a well behaved function f @xA to maximize
EP f @X A  EQ f @Y A
[Figure: a smooth function f(x)]
19/51
MMD as an integral probability metric
Maximum mean discrepancy: smooth function for P vs Q
$$\mathrm{MMD}(P, Q; F) := \sup_{\|f\| \le 1}\bigl[\mathbf{E}_P f(X) - \mathbf{E}_Q f(Y)\bigr]$$
(F = unit ball in the RKHS ℱ)
[Figure: witness function f for Gauss and Laplace densities]
Functions are linear combinations of features.
Expectations of functions are linear combinations of expected features:
$$\mathbf{E}_P[f(X)] = \langle f, \mathbf{E}_P\, \varphi(X) \rangle_{\mathcal{F}} = \langle f, \mu_P \rangle_{\mathcal{F}}$$
(always true if the kernel is bounded)
For a characteristic RKHS ℱ, MMD(P, Q; F) = 0 iff P = Q
Other choices for the witness function class:
Bounded continuous [Dudley, 2002]
Bounded variation 1 (Kolmogorov metric) [Müller, 1997]
Bounded Lipschitz (Wasserstein distances) [Dudley, 2002]
21/51
Integral prob. metric vs feature difference
The MMD:
$$\begin{aligned}
\mathrm{MMD}(P, Q; F) &= \sup_{f \in F}\bigl[\mathbf{E}_P f(X) - \mathbf{E}_Q f(Y)\bigr] \\
&= \sup_{f \in F} \langle f, \mu_P - \mu_Q \rangle_{\mathcal{F}} \qquad \text{(using } \mathbf{E}_P f(X) = \langle \mu_P, f \rangle_{\mathcal{F}}\text{)} \\
&= \|\mu_P - \mu_Q\|
\end{aligned}$$
[Figure: witness f for Gauss and Laplace densities; the supremum is attained at a function f* in the unit ball F]
Function view and feature view equivalent
22/51
Construction of MMD witness
Construction of empirical witness function (proof: next slide!)
Observe X = {x₁, ..., x_n} ∼ P
Observe Y = {y₁, ..., y_n} ∼ Q
[Figure: empirical witness function witness(v) evaluated at a location v]
23/51
Derivation of empirical witness function
Recall the witness function expression
$$f^* \propto \mu_P - \mu_Q$$
The empirical feature mean for P
$$\hat{\mu}_P := \frac{1}{n}\sum_{i=1}^{n} \varphi(x_i)$$
The empirical witness function at v
$$f^*(v) = \langle f^*, \varphi(v) \rangle_{\mathcal{F}} \propto \langle \hat{\mu}_P - \hat{\mu}_Q, \varphi(v) \rangle_{\mathcal{F}} = \frac{1}{n}\sum_{i=1}^{n} k(x_i, v) - \frac{1}{n}\sum_{i=1}^{n} k(y_i, v)$$
Don't need explicit feature coefficients $f^* := [f^*_1\ f^*_2\ \ldots]$
24/51
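A short sketch of the last expression: evaluating the (unnormalised) empirical witness on a set of locations, assuming a kernel(A, B) helper that returns the Gram matrix between the rows of A and the rows of B (e.g., the exponentiated quadratic sketch earlier). Names are illustrative.

```python
import numpy as np

def empirical_witness(V, X, Y, kernel):
    """Empirical witness at each row v of V (up to normalisation):
    f*(v) = (1/n) sum_i k(x_i, v) - (1/n) sum_i k(y_i, v)."""
    return kernel(X, V).mean(axis=0) - kernel(Y, V).mean(axis=0)
```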
Two-Sample Testing
25/51
A statistical test using MMD
The empirical MMD:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)}\sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)}\sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2}\sum_{i,j} k(x_i, y_j)$$
How does this help decide whether P = Q?
Perspective from statistical hypothesis testing:
Null hypothesis H₀ when P = Q: should see $\widehat{\mathrm{MMD}}^2$ "close to zero".
Alternative hypothesis H₁ when P ≠ Q: should see $\widehat{\mathrm{MMD}}^2$ "far from zero".
Want a threshold c_α for $\widehat{\mathrm{MMD}}^2$ to get false positive rate α.
26/51
Behaviour of $\widehat{\mathrm{MMD}}^2$ when P ≠ Q
Draw n = 200 i.i.d. samples from P and Q
Laplace with different y-variance.
$\sqrt{n} \times \widehat{\mathrm{MMD}}^2 = 1.2$
Draw n = 200 new samples from P and Q: $\sqrt{n} \times \widehat{\mathrm{MMD}}^2 = 1.5$
[Figure: scatter of the samples from P and Q, with the observed values of $\sqrt{n} \times \widehat{\mathrm{MMD}}^2$ collected in a histogram]
27/51
Behaviour of $\widehat{\mathrm{MMD}}^2$ when P ≠ Q
Repeat this 150, 300, then 3000 times ...
[Figure: histogram of $\sqrt{n} \times \widehat{\mathrm{MMD}}^2$ over the repetitions]
28/51
Asymptotics of $\widehat{\mathrm{MMD}}^2$ when P ≠ Q
When P ≠ Q, the statistic is asymptotically normal,
$$\frac{\widehat{\mathrm{MMD}}^2 - \mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}} \xrightarrow{D} \mathcal{N}(0, 1),$$
where the variance $V_n(P, Q) = O(n^{-1})$.
[Figure: empirical PDF of the statistic with a Gaussian fit; two Laplace distributions P_X and Q_X with different variances]
29/51
Behaviour of $\widehat{\mathrm{MMD}}^2$ when P = Q
What happens when P and Q are the same?
30/51
Behaviour of $\widehat{\mathrm{MMD}}^2$ when P = Q
Case of P = Q = N(0, 1)
[Figure: histogram of the statistic under P = Q]
31/51
Asymptotics of $\widehat{\mathrm{MMD}}^2$ when P = Q
When P = Q, the statistic has asymptotic distribution
$$n\,\widehat{\mathrm{MMD}}^2 \sim \sum_{l=1}^{\infty} \lambda_l\bigl[z_l^2 - 2\bigr]$$
where
$$\lambda_i\, \psi_i(x') = \int_{\mathcal{X}} \underbrace{\tilde{k}(x, x')}_{\text{centred}}\, \psi_i(x)\, dP(x), \qquad z_l \sim \mathcal{N}(0, 2)\ \text{i.i.d.}$$
[Figure: density of this asymptotic null distribution]
32/51
A statistical test
A summary of the asymptotics:
[Figure: summary of the asymptotic distributions of the MMD statistic]
Test construction: (G., Borgwardt, Rasch, Schoelkopf, and Smola, JMLR 2012)
33/51
How do we get test threshold c?
Original empirical MMD for dogs and fish:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)}\sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)}\sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2}\sum_{i,j} k(x_i, y_j)$$
[Figure: pooled Gram matrix with blocks k(x_i, x_j), k(x_i, y_j), and k(y_i, y_j)]
34/51
How do we get test threshold c?
Permuted dog and fish samples (merdogs):
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)}\sum_{i \neq j} k(\tilde{x}_i, \tilde{x}_j) + \frac{1}{n(n-1)}\sum_{i \neq j} k(\tilde{y}_i, \tilde{y}_j) - \frac{2}{n^2}\sum_{i,j} k(\tilde{x}_i, \tilde{y}_j)$$
Permutation simulates P = Q
[Figure: Gram matrix of the permuted samples, with blocks k(x̃_i, x̃_j), k(x̃_i, ỹ_j), and k(ỹ_i, ỹ_j)]
35/51
Demonstration of permutation estimate of null
Null distribution estimated from 500 permutations
P = Q = N(0, 1)
[Figure: permutation estimate of the null distribution of the statistic]
Use the 1 − α quantile of the permutation distribution for the test threshold ĉ_α.
36/51
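A hedged sketch of the permutation procedure: compute the statistic for the observed split, recompute it for random relabellings of the pooled sample (which simulates P = Q), and take the 1 − α quantile as the threshold. The interface (a pooled (2n × 2n) Gram matrix with the X rows first, equal sample sizes) is an assumption for illustration.

```python
import numpy as np

def mmd_permutation_test(K, n, n_perm=500, alpha=0.05, seed=None):
    """K: (2n x 2n) Gram matrix of the pooled sample [X; Y]."""
    rng = np.random.default_rng(seed)

    def mmd2(idx):
        ix, iy = idx[:n], idx[n:]
        Kxx, Kyy, Kxy = K[np.ix_(ix, ix)], K[np.ix_(iy, iy)], K[np.ix_(ix, iy)]
        t_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
        t_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
        return t_xx + t_yy - 2.0 * Kxy.mean()

    observed = mmd2(np.arange(2 * n))
    null = np.array([mmd2(rng.permutation(2 * n)) for _ in range(n_perm)])
    threshold = np.quantile(null, 1.0 - alpha)  # 1 - alpha quantile of the permutation null
    return observed, threshold, observed > threshold
```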
How to choose the best kernel
37/51
Optimizing kernel for test power
The power of our test (Pr₁ denotes probability under P ≠ Q):
$$\Pr_1\bigl(n\,\widehat{\mathrm{MMD}}^2 > \hat{c}_\alpha\bigr) \to \Phi\Biggl(\underbrace{\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}}_{O(n^{1/2})} - \underbrace{\frac{\hat{c}_\alpha}{n\sqrt{V_n(P, Q)}}}_{O(n^{-1/2})}\Biggr)$$
where Φ is the CDF of the standard normal distribution, and ĉ_α is an estimate of the test threshold c_α.
The variance under H₁ decreases as $\sqrt{V_n(P, Q)} \sim O(n^{-1/2})$, so for large n the second term is negligible.
To maximize test power, maximize
$$\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}$$
(Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., ICLR 2017)
Code: github.com/dougalsutherland/opt-mmd
38/51
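The linked repository contains the variance estimator used in the paper. Purely as an illustration, the sketch below uses a crude block-based proxy for MMD²/√V_n to compare candidate kernels on a held-out split; the block construction and all names are assumptions, not the estimator of Sutherland et al.

```python
import numpy as np

def power_criterion_proxy(X, Y, kernel, n_blocks=10):
    """Crude proxy for MMD^2 / sqrt(V_n): mean over standard deviation of
    block-wise unbiased MMD^2 estimates."""
    def mmd2_u(Xb, Yb):
        Kxx, Kyy, Kxy = kernel(Xb, Xb), kernel(Yb, Yb), kernel(Xb, Yb)
        m = Xb.shape[0]
        t_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
        t_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
        return t_xx + t_yy - 2.0 * Kxy.mean()

    vals = np.array([mmd2_u(Xb, Yb) for Xb, Yb in
                     zip(np.array_split(X, n_blocks), np.array_split(Y, n_blocks))])
    return vals.mean() / (vals.std(ddof=1) + 1e-12)

# Pick the bandwidth (or other kernel parameters) maximising the proxy on a
# held-out split, then run the permutation test on the remaining data.
```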
Graphical illustration
Reminder: maximising test power same as minimizing false negatives
[Figure: null and alternative distributions of the statistic with the test threshold]
39/51
Troubleshooting for generative adversarial networks
MNIST samples vs. samples from a GAN; ARD map
Power for optimized ARD kernel: 1.00 at α = 0.01
Power for optimized RBF kernel: 0.57 at α = 0.01
40/51
Troubleshooting generative adversarial networks
41/51
MMD for GAN critic
Can you use MMD as a critic to train GANs?
Can you train convolutional features as input to the MMD critic?
From ICML 2015:
From UAI 2015:
42/51
MMD for GAN critic: 2017 update
“Energy Distance Kernel”
43/51
An adaptive, linear time distribution
metric
44/51
Reminder: witness function for MMD
μ̂_P(v): mean embedding of P
μ̂_Q(v): mean embedding of Q
witness(v) = μ̂_P(v) − μ̂_Q(v)
45/51
Distinguishing Feature(s)
Take the square of the witness, witness²(v) (only worry about amplitude).
New test statistic: witness²(v) at a single location v*
Linear time in the number n of samples
... but how to choose the best feature v*?
Best feature = the v* that maximizes witness²(v)??
46/51
Distinguishing Feature(s)
witness²(v) at sample sizes n = 3, 50, and 500, and the population witness²(v) function (with densities P(x) and Q(y)): which peak should give the best location v*?
47/51
Variance of witness function
Variance at v = variance of X at v + variance of Y at v.
ME Statistic: $\hat{\lambda}_n(v) := n \cdot \dfrac{\mathrm{witness}^2(v)}{\text{variance at } v}$.
Jitkrittum, Szabo, Chwialkowski, G., NIPS 2016
[Figure: witness²(v) for densities P(x) and Q(y), the variances of X and of Y at v, and the resulting $\hat{\lambda}_n(v)$]
Best location is the v* that maximizes $\hat{\lambda}_n$.
Improve performance using multiple locations $\{v^*_j\}_{j=1}^{J}$
48/51
Properties of the ME Test
Can use J features V = {v₁, ..., v_J}.
Under H₀: P = Q, $\hat{\lambda}_n(V)$ asymptotically follows χ²(J) for any V.
  Rejection threshold is T_α = (1 − α)-quantile of χ²(J).
Under H₁: P ≠ Q, it follows some distribution $\mathbb{P}_{H_1}(\hat{\lambda}_n)$ (unknown).
  But asymptotically $\hat{\lambda}_n \to \infty$: a consistent test.
Test power = probability of rejecting H₀ when H₁ is true.
[Figure: χ²(J) null density, threshold T_α, and the distribution of $\hat{\lambda}_n$ under H₁]
Theorem: Under H₁, optimization of V (by maximizing $\hat{\lambda}_n$) increases the (lower bound on the) test power.
Runtime: O(n) for both testing and optimization.
49/51
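A sketch of the form of the ME statistic with J test locations, following the description above: with z_i = [k(x_i, v_j) − k(y_i, v_j)] for j = 1, ..., J, the statistic is n·z̄ᵀS⁻¹z̄, compared to the (1 − α) quantile of χ²(J). Equal sample sizes, the small regulariser, and the kernel helper are illustrative assumptions; the authors' implementation is in the interpretable-test repository linked below.

```python
import numpy as np
from scipy import stats

def me_statistic(X, Y, V, kernel, reg=1e-5):
    """ME test statistic at locations V (J x d), for equal sample sizes n."""
    n = X.shape[0]
    Z = kernel(X, V) - kernel(Y, V)               # (n, J) feature differences z_i
    zbar = Z.mean(axis=0)
    S = np.cov(Z, rowvar=False) + reg * np.eye(Z.shape[1])
    return n * zbar @ np.linalg.solve(S, zbar)

def me_test(X, Y, V, kernel, alpha=0.01, reg=1e-5):
    """Compare the statistic to the (1 - alpha) quantile of chi-squared(J)."""
    stat = me_statistic(X, Y, V, kernel, reg)
    threshold = stats.chi2.ppf(1.0 - alpha, V.shape[0])
    return stat, threshold, stat > threshold
```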
Distinguishing Positive/Negative Emotions
+ : happy, neutral, surprised
− : afraid, angry, disgusted
35 females and 35 males (Lundqvist et al., 1998).
48 × 34 = 1632 dimensions. Pixel features.
Sample size: 402.
[Figure: test power for + vs. −, comparing a random feature, the proposed (optimized) feature, and the quadratic-time MMD]
[Figure: learned feature]
The proposed test achieves maximum test power in time O(n).
Informative features: differences at the nose, and smile lines.
Jitkrittum, Szabo, Chwialkowski, G., NIPS 2016
Code: https://github.com/wittawatj/interpretable-test
50/51
Co-authors
From Gatsby:
Kacper Chwialkowski
Wittawat Jitkrittum
Bharath Sriperumbudur
Heiko Strathmann
Dougal Sutherland
Zoltan Szabo
Wenkai Xu
External collaborators:
Kenji Fukumizu
Bernhard Schoelkopf
Alex Smola
Questions?
51/51