This document discusses using the maximum mean discrepancy (MMD) to compare probability distributions and samples. The MMD is defined as the distance between the mean embeddings of the distributions in a reproducing kernel Hilbert space. It can be used for two-sample testing (do two samples come from the same distribution?), goodness-of-fit testing (how well does a model fit the data?), and independence testing. The document outlines applications of the MMD, including troubleshooting generative adversarial networks, and shows how to construct empirical witness functions and compute the MMD from samples.
3. An example: two-sample tests
Have: two collections of samples X, Y from unknown distributions P and Q.
Goal: do P and Q differ?
[Figure: MNIST samples vs. samples from a GAN.]
Significant difference between GAN and MNIST?
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, Xi Chen, NIPS 2016.
4. Testing goodness of fit
Given: a model P and samples from Q.
Goal: is P a good fit for Q?
[Figure: Chicago crime data; the model is a Gaussian mixture with two components.]
5. Testing independence
Given: samples from a joint distribution $P_{XY}$.
Goal: are X and Y independent?
[Figure: dog photos (X) paired with text descriptions (Y). Text from dogtime.com and petfinder.com:]
"Their noses guide them through life, and they're never happier than when following an interesting scent."
"A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose."
"A responsive, interactive pet, one that will blow in your ear and follow you everywhere."
6. Outline: part 1
Two-sample testing
Test statistic: Maximum Mean Discrepancy (MMD)...
...as a difference in feature means
...as an integral probability metric (not just a technicality!)
Statistical testing with the MMD
Troubleshooting GANs with the MMD
7. Outline: part 2
Goodness-of-fit testing
The kernel Stein discrepancy
Dependence testing
Dependence using the MMD
Dependence using feature covariances
Statistical testing
Additional topics
10. Feature mean difference
Simple example: two Gaussians with different means.
Answer: a t-test.
[Figure: two Gaussian densities $P_X$ and $Q_X$ with different means.]
11. Feature mean difference
Two Gaussians with the same means but different variances.
Idea: look at the difference in means of features of the random variables.
In the Gaussian case: second-order features of the form $\varphi(x) = x^2$.
[Figure: two Gaussian densities $P_X$ and $Q_X$ with different variances; the densities of the feature $X^2$ differ visibly.]
13. Feature mean difference
Gaussian and Laplace distributions with the same mean and the same variance.
Difference in means using higher-order features... enter the RKHS.
[Figure: Gaussian and Laplace densities $P_X$ and $Q_X$.]
14. Infinitely many features using kernels
Kernels: dot products of features.
Feature map $\varphi(x) \in \mathcal{F}$,
$$\varphi(x) = [\ldots, \varphi_i(x), \ldots] \in \ell^2$$
For positive definite k,
$$k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{F}}$$
Infinitely many features $\varphi(x)$, yet the dot product is available in closed form!
Exponentiated quadratic kernel:
$$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$
Features: Gaussian Processes for Machine Learning, Rasmussen and Williams, Ch. 4.
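To make the closed-form dot product concrete, here is a minimal NumPy sketch of the exponentiated quadratic kernel between two sample sets. The bandwidth parametrization with sigma is an assumption (the slide omits the scaling), and the function name is mine:

```python
import numpy as np

def expquad_kernel(X, Y, sigma=1.0):
    """Exponentiated quadratic kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2)).

    X: (n, d) array, Y: (m, d) array; returns an (n, m) array.
    """
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y>; clip tiny negatives from round-off
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

# Each entry is a dot product of (infinitely many) features, evaluated in closed form
K = expquad_kernel(np.random.randn(5, 3), np.random.randn(4, 3), sigma=1.0)
```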
16. Infinitely many features of distributions
Given P a Borel probability measure on $\mathcal{X}$, define the feature map of the probability distribution P:
$$\mu_P = [\ldots, \mathbf{E}_P[\varphi_i(X)], \ldots]$$
For positive definite $k(x, x')$,
$$\langle \mu_P, \mu_Q \rangle_{\mathcal{F}} = \mathbf{E}_{P,Q}\, k(X, Y)$$
for $X \sim P$ and $Y \sim Q$.
Fine print: the feature map $\varphi(x)$ must be Bochner integrable for all probability measures considered. Always true if the kernel is bounded.
18. The maximum mean discrepancy
The maximum mean discrepancy is the distance between feature means:
$$\mathrm{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{F}}^2$$
$$= \langle \mu_P, \mu_P \rangle_{\mathcal{F}} + \langle \mu_Q, \mu_Q \rangle_{\mathcal{F}} - 2\,\langle \mu_P, \mu_Q \rangle_{\mathcal{F}}$$
$$= \underbrace{\mathbf{E}_P\, k(X, X')}_{(a)} + \underbrace{\mathbf{E}_Q\, k(Y, Y')}_{(a)} - 2\,\underbrace{\mathbf{E}_{P,Q}\, k(X, Y)}_{(b)}$$
(a) = within-distribution similarity, (b) = cross-distribution similarity.
21. Illustration of MMD
Dogs (= P) and fish (= Q) example revisited.
Each entry of the kernel matrix is one of $k(\mathrm{dog}_i, \mathrm{dog}_j)$, $k(\mathrm{dog}_i, \mathrm{fish}_j)$, or $k(\mathrm{fish}_i, \mathrm{fish}_j)$.
22. Illustration of MMD
An empirical estimate of the maximum mean discrepancy:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{dog}_i, \mathrm{dog}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{fish}_i, \mathrm{fish}_j) - \frac{2}{n^2} \sum_{i,j} k(\mathrm{dog}_i, \mathrm{fish}_j)$$
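A sketch of this estimator in NumPy, generalized to allow different sample sizes n and m (the slide's version has n = m); the kernel matrices could come from any positive definite kernel, for example the exponentiated quadratic sketch above:

```python
import numpy as np

def mmd2_unbiased(K_xx, K_yy, K_xy):
    """Empirical MMD^2 from precomputed kernel matrices, matching the sums above:
    diagonal (i = j) terms are excluded from the two within-sample averages."""
    n, m = K_xy.shape
    term_x = (K_xx.sum() - np.trace(K_xx)) / (n * (n - 1))
    term_y = (K_yy.sum() - np.trace(K_yy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * K_xy.mean()
```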
23. MMD as an integral probability metric
Are P and Q different?
[Figure: two draws of samples from P and Q; the difference is hard to see by eye.]
25. MMD as an integral probability metric
Integral probability metric:
Find a well-behaved function $f(x)$ to maximize
$$\mathbf{E}_P f(X) - \mathbf{E}_Q f(Y)$$
[Figure: a smooth function $f(x)$ over the sample domain.]
27. MMD as an integral probability metric
Maximum mean discrepancy: a smooth function distinguishing P from Q:
$$\mathrm{MMD}(P, Q; F) := \sup_{\|f\|_{\mathcal{F}} \leq 1} \left[ \mathbf{E}_P f(X) - \mathbf{E}_Q f(Y) \right]$$
(F is the unit ball in the RKHS $\mathcal{F}$)
[Figure: witness function f for Gaussian and Laplace densities.]
Functions in the RKHS are linear combinations of features:
$$f(x) = \langle f, \varphi(x) \rangle_{\mathcal{F}}$$
Expectations of functions are linear combinations of expected features:
$$\mathbf{E}_P[f(X)] = \langle f, \mathbf{E}_P \varphi(X) \rangle_{\mathcal{F}} = \langle f, \mu_P \rangle_{\mathcal{F}}$$
(always true if the kernel is bounded)
For a characteristic RKHS $\mathcal{F}$, $\mathrm{MMD}(P, Q; F) = 0$ iff $P = Q$.
Other choices for the witness function class:
Bounded continuous [Dudley, 2002]
Bounded variation 1 (Kolmogorov metric) [Müller, 1997]
Bounded Lipschitz (Wasserstein distances) [Dudley, 2002]
32. Integral prob. metric vs feature difference
The MMD:
$$\mathrm{MMD}(P, Q; F) = \sup_{f \in F} \left[ \mathbf{E}_P f(X) - \mathbf{E}_Q f(Y) \right]$$
$$= \sup_{f \in F} \langle f, \mu_P - \mu_Q \rangle_{\mathcal{F}} \qquad \text{(using } \mathbf{E}_P f(X) = \langle f, \mu_P \rangle_{\mathcal{F}} \text{)}$$
$$= \|\mu_P - \mu_Q\|_{\mathcal{F}}$$
The supremum is attained at the unit-norm witness $f^* \propto \mu_P - \mu_Q$.
Function view and feature view are equivalent.
38. Construction of MMD witness
Construction of the empirical witness function (proof: next slide!):
Observe $X = \{x_1, \ldots, x_n\} \sim P$
Observe $Y = \{y_1, \ldots, y_n\} \sim Q$
[Figure: the empirical witness function, evaluated at a location v.]
42. Derivation of empirical witness function
Recall the witness function expression:
$$f^* \propto \mu_P - \mu_Q$$
The empirical feature mean for P:
$$\hat{\mu}_P := \frac{1}{n} \sum_{i=1}^{n} \varphi(x_i)$$
The empirical witness function at v:
$$f^*(v) = \langle f^*, \varphi(v) \rangle_{\mathcal{F}} \propto \langle \hat{\mu}_P - \hat{\mu}_Q, \varphi(v) \rangle_{\mathcal{F}} = \frac{1}{n} \sum_{i=1}^{n} k(x_i, v) - \frac{1}{n} \sum_{i=1}^{n} k(y_i, v)$$
We don't need explicit feature coefficients $f^* := [f^*_1 \; f^*_2 \; \ldots]$.
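A minimal sketch of this construction in NumPy, evaluated on the Gaussian-vs-Laplace example from earlier slides (function names and the specific sigma are mine; the witness is computed up to the positive scaling the proportionality hides):

```python
import numpy as np

def expquad(A, B, sigma=1.0):
    """Exponentiated quadratic kernel matrix between row-sample arrays A (n, d), B (m, d)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def witness(V, X, Y, sigma=1.0):
    """Empirical witness at locations V, up to scaling:
    (1/n) sum_i k(x_i, v) - (1/n) sum_i k(y_i, v)."""
    return expquad(X, V, sigma).mean(axis=0) - expquad(Y, V, sigma).mean(axis=0)

# Example: Gaussian vs Laplace samples with matched mean and variance
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1))                         # N(0, 1)
Y = rng.laplace(scale=1 / np.sqrt(2), size=(2000, 1))  # Laplace with variance 1
grid = np.linspace(-4, 4, 201)[:, None]
w = witness(grid, X, Y)  # positive where P has more mass, negative where Q does
```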
48. A statistical test using MMD
The empirical MMD:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)$$
How does this help us decide whether P = Q?
Perspective from statistical hypothesis testing:
Null hypothesis $H_0$ ($P = Q$): we should see $\widehat{\mathrm{MMD}}^2$ "close to zero".
Alternative hypothesis $H_1$ ($P \neq Q$): we should see $\widehat{\mathrm{MMD}}^2$ "far from zero".
Want: a threshold c for $\widehat{\mathrm{MMD}}^2$ that gives false positive rate $\alpha$.
51. Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P \neq Q$
Draw n = 200 i.i.d. samples from P and Q.
Laplace distributions with different y-variances.
$$\sqrt{n} \cdot \widehat{\mathrm{MMD}}^2 = 1.2$$
[Figure: scatter plot of the two samples P and Q.]
53. Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P \neq Q$
Draw n = 200 new samples from P and Q.
Laplace distributions with different y-variances.
$$\sqrt{n} \cdot \widehat{\mathrm{MMD}}^2 = 1.5$$
55. Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P \neq Q$
Repeat this 150, 300, then 3000 times...
[Figure: the histogram of $\sqrt{n} \cdot \widehat{\mathrm{MMD}}^2$ values fills out into a smooth empirical distribution.]
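A sketch reproducing this experiment; the Laplace scale parameters below are illustrative assumptions, since the slides only say the y-variances differ:

```python
import numpy as np

def expquad(A, B, sigma=1.0):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2u(X, Y, sigma=1.0):
    """Unbiased empirical MMD^2, diagonal terms excluded from within-sample sums."""
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = expquad(X, X, sigma), expquad(Y, Y, sigma), expquad(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(0)
n, reps = 200, 3000
stats = []
for _ in range(reps):
    # Two 2-d Laplace distributions differing only in the y-axis scale (assumed values)
    X = rng.laplace(scale=[1.0, 1.0], size=(n, 2))
    Y = rng.laplace(scale=[1.0, 3.0], size=(n, 2))
    stats.append(np.sqrt(n) * mmd2u(X, Y))
stats = np.array(stats)  # histogram this to reproduce the empirical distribution
```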
58. Asymptotics of $\widehat{\mathrm{MMD}}^2$ when $P \neq Q$
When $P \neq Q$, the statistic is asymptotically normal:
$$\frac{\widehat{\mathrm{MMD}}^2 - \mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}} \xrightarrow{D} \mathcal{N}(0, 1),$$
where the variance $V_n(P, Q) = O(n^{-1})$.
[Figure: empirical PDF of the statistic with a Gaussian fit, for two Laplace distributions with different variances.]
60. Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P = Q$
Case of $P = Q = \mathcal{N}(0, 1)$.
[Figure: histogram of the statistic under the null, accumulated over repeated draws.]
65. Asymptotics of $\widehat{\mathrm{MMD}}^2$ when $P = Q$
Where $P = Q$, the statistic has the asymptotic distribution
$$n \, \widehat{\mathrm{MMD}}^2 \sim \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right]$$
where
$$\lambda_l \, \psi_l(x') = \int_{\mathcal{X}} \underbrace{\tilde{k}(x, x')}_{\text{centred}} \, \psi_l(x) \, dP(x)$$
and $z_l \sim \mathcal{N}(0, 2)$ i.i.d.
[Figure: the null distribution of $n \, \widehat{\mathrm{MMD}}^2$.]
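The slide only states the limit; one common way to use it in practice (my assumption: this is the standard spectral approach, not something spelled out here) is to estimate the $\lambda_l$ as eigenvalues of the centred Gram matrix divided by n, then simulate the truncated sum:

```python
import numpy as np

def null_samples_spectral(K, num_draws=5000, rng=None):
    """Approximate draws from the P = Q limit of n * MMD^2-hat,
    sum_l lambda_l (z_l^2 - 2) with z_l ~ N(0, 2) i.i.d.

    K: (n, n) Gram matrix on pooled samples (both drawn from the null).
    Eigenvalues of the doubly centred Gram matrix / n estimate the lambda_l."""
    rng = rng or np.random.default_rng()
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    lam = np.linalg.eigvalsh(H @ K @ H) / n      # estimated eigenvalues
    lam = lam[lam > 1e-12]                       # keep numerically positive ones
    z = rng.normal(0.0, np.sqrt(2.0), size=(num_draws, lam.size))
    return (z**2 - 2.0) @ lam                    # one draw per row
```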
66. A statistical test
A summary of the asymptotics:
[Figure: the null ($P = Q$) and alternative ($P \neq Q$) distributions of the statistic on common axes.]
67. A statistical test
Test construction: (G., Borgwardt, Rasch, Schoelkopf, and Smola, JMLR 2012)
[Figure: the test threshold c marked on the null distribution; reject $H_0$ when the statistic exceeds it.]
68. How do we get test threshold c?
Original empirical MMD for dogs and fish:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)$$
[Figure: block structure of the kernel matrix, with blocks $k(x_i, x_j)$, $k(y_i, y_j)$, and $k(x_i, y_j)$.]
69. How do we get test threshold c?
Permuted dog and fish samples ("merdogs"):
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(\tilde{x}_i, \tilde{x}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\tilde{y}_i, \tilde{y}_j) - \frac{2}{n^2} \sum_{i,j} k(\tilde{x}_i, \tilde{y}_j)$$
Permutation simulates $P = Q$.
[Figure: kernel matrix blocks $k(\tilde{x}_i, \tilde{x}_j)$, $k(\tilde{y}_i, \tilde{y}_j)$, $k(\tilde{x}_i, \tilde{y}_j)$ after permuting the pooled samples.]
71. Demonstration of permutation estimate of null
Null distribution estimated from 500 permutations, with $P = Q = \mathcal{N}(0, 1)$.
[Figure: permutation histogram overlaid on the null distribution.]
Use the $1 - \alpha$ quantile of the permutation distribution as the test threshold c.
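A minimal sketch of the permutation procedure (function and parameter names are mine); pass in any statistic, for example the mmd2u sketch above:

```python
import numpy as np

def permutation_test(X, Y, statistic, num_perms=500, alpha=0.05, rng=None):
    """Two-sample test: the threshold c is the (1 - alpha) quantile of the statistic
    recomputed on random re-partitions of the pooled samples, which simulates P = Q."""
    rng = rng or np.random.default_rng()
    pooled = np.concatenate([X, Y])
    n = len(X)
    observed = statistic(X, Y)
    null = np.empty(num_perms)
    for i in range(num_perms):
        idx = rng.permutation(len(pooled))
        null[i] = statistic(pooled[idx[:n]], pooled[idx[n:]])
    c = np.quantile(null, 1.0 - alpha)
    return observed, c, observed > c  # reject H0 if observed > c
```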
74. Optimizing kernel for test power
The power of our test ($\Pr_1$ denotes probability under $P \neq Q$):
$$\Pr_1\left( n \, \widehat{\mathrm{MMD}}^2 > \hat{c}_\alpha \right) \rightarrow \Phi\left( \frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}} - \frac{c_\alpha}{n \sqrt{V_n(P, Q)}} \right)$$
where
$\Phi$ is the CDF of the standard normal distribution,
$\hat{c}_\alpha$ is an estimate of the test threshold $c_\alpha$.
76. Optimizing kernel for test power
In the limit above, the first term $\mathrm{MMD}^2(P, Q) / \sqrt{V_n(P, Q)}$ grows as $O(n^{1/2})$, while the second term $c_\alpha / (n \sqrt{V_n(P, Q)})$ shrinks as $O(n^{-1/2})$.
The variance under $H_1$ decreases as $\sqrt{V_n(P, Q)} \sim O(n^{-1/2})$.
For large n, the second term is negligible!
77. Optimizing kernel for test power
To maximize test power, maximize
$$\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}$$
(Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., ICLR 2017)
Code: github.com/dougalsutherland/opt-mmd
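A crude sketch of this criterion for choosing a kernel bandwidth: it scores each candidate sigma by a mean-over-std proxy for $\mathrm{MMD}^2 / \sqrt{V_n}$ computed on disjoint blocks of a training split. This block proxy is my simplification; the ICLR 2017 paper uses an explicit variance estimator for the full U-statistic (see the linked opt-mmd code). Selecting the kernel on a split separate from the final test keeps the test level valid.

```python
import numpy as np

def expquad(A, B, sigma):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2u(X, Y, sigma):
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = expquad(X, X, sigma), expquad(Y, Y, sigma), expquad(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

def select_bandwidth(X_tr, Y_tr, sigmas, num_blocks=10):
    """Pick the sigma maximizing mean(block MMD^2) / std(block MMD^2),
    a proxy for the power criterion MMD^2 / sqrt(V_n), on a training split."""
    Xb, Yb = np.array_split(X_tr, num_blocks), np.array_split(Y_tr, num_blocks)
    best_sigma, best_score = None, -np.inf
    for s in sigmas:
        vals = np.array([mmd2u(xb, yb, s) for xb, yb in zip(Xb, Yb)])
        score = vals.mean() / (vals.std() + 1e-12)
        if score > best_score:
            best_sigma, best_score = s, score
    return best_sigma
```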
80. Troubleshooting for generative adversarial networks
[Figure: MNIST samples vs. samples from a GAN, with the learned ARD map highlighting discriminative pixels.]
Power for the optimized ARD kernel: 1.00 at $\alpha = 0.01$.
Power for the optimized RBF kernel: 0.57 at $\alpha = 0.01$.
82. MMD for GAN critic
Can you use the MMD as a critic to train GANs?
Can you train convolutional features as input to the MMD critic?
From ICML 2015: [paper excerpt shown as image]
From UAI 2015: [paper excerpt shown as image]
90. Distinguishing Feature(s)
New test statistic: $\mathrm{witness}^2$ evaluated at a single location $v^*$.
Linear time in the number n of samples...
...but how to choose the best feature $v^*$?
97. Variance of witness function
Variance at v = variance of X at v + variance of Y at v.
ME statistic:
$$\hat{\lambda}_n(v) := n \, \frac{\mathrm{witness}^2(v)}{\text{variance at } v}$$
Jitkrittum, Szabo, Chwialkowski, G., NIPS 2016
[Figure: $\mathrm{witness}^2(v)$ together with the variances of X and Y at v, and the resulting $\hat{\lambda}_n(v)$.]
The best location is the $v^*$ that maximizes $\hat{\lambda}_n$.
Improve performance using multiple locations $\{v^*_j\}_{j=1}^{J}$.
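A sketch of the single-location statistic as defined above. The NIPS 2016 test uses J locations, a regularized covariance, and a $\chi^2(J)$ threshold; the ridge term reg below is my addition for numerical stability, and the kernel and its bandwidth are assumptions:

```python
import numpy as np

def me_statistic(X, Y, v, sigma=1.0, reg=1e-5):
    """Single-location ME statistic: lambda_n(v) = n * witness(v)^2 / variance at v.

    witness(v) = mean_i k(x_i, v) - mean_i k(y_i, v); its variance is estimated as
    the empirical variance of k(X, v) plus that of k(Y, v)."""
    n = len(X)
    kx = np.exp(-((X - v)**2).sum(1) / (2.0 * sigma**2))  # k(x_i, v)
    ky = np.exp(-((Y - v)**2).sum(1) / (2.0 * sigma**2))  # k(y_i, v)
    wit = kx.mean() - ky.mean()
    var = kx.var() + ky.var() + reg
    return n * wit**2 / var
```

In use, one would scan candidate locations v on a held-out training split, keep the maximizer $v^*$, then compare $\hat{\lambda}_n(v^*)$ computed on the test split against the $(1 - \alpha)$ quantile of $\chi^2(1)$ (the J = 1 case of the threshold described on the next slide).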
104. Properties of the ME Test
Can use J features $V = \{v_1, \ldots, v_J\}$.
Under $H_0$: $P = Q$, $\hat{\lambda}_n(V)$ asymptotically follows $\chi^2(J)$ for any V.
The rejection threshold is $T_\alpha$, the $(1 - \alpha)$-quantile of $\chi^2(J)$.
Under $H_1$: $P \neq Q$, the statistic follows $\mathbb{P}_{H_1}(\hat{\lambda}_n)$ (unknown). But asymptotically $\hat{\lambda}_n \to \infty$: a consistent test.
Test power = probability of rejecting $H_0$ when $H_1$ is true.
[Figure: the $\chi^2(J)$ null density, the threshold $T_\alpha$, and the distribution of $\hat{\lambda}_n$ under $H_1$.]
Theorem: under $H_1$, optimizing V (by maximizing $\hat{\lambda}_n$) increases a lower bound on the test power.
Runtime: O(n) for both testing and optimization.
112. Distinguishing Positive/Negative Emotions
+ : happy, neutral, surprised.
− : afraid, angry, disgusted.
35 females and 35 males (Lundqvist et al., 1998); 48 × 34 = 1632 dimensions; pixel features; sample size: 402.
[Figure: test power for + vs. −, comparing a random feature location, the proposed (optimized) test, and the quadratic-time MMD test.]
The proposed test achieves maximum test power in time O(n).
Informative features: differences at the nose and smile lines.
[Figure: the learned feature location, highlighting the nose and smile-line regions.]
Jitkrittum, Szabo, Chwialkowski, G., NIPS 2016
Code: https://github.com/wittawatj/interpretable-test
118. Co-authors
From Gatsby:
Kacper Chwialkowski
Wittawat Jitkrittum
Bharath Sriperumbudur
Heiko Strathmann
Dougal Sutherland
Zoltan Szabo
Wenkai Xu
External collaborators:
Kenji Fukumizu
Bernhard Schoelkopf
Alex Smola
Questions?