Measuring Sample Quality with Stein’s Method
Lester Mackey∗,†
Joint work with Jackson Gorham†
Microsoft Research∗
Stanford University†
August 29, 2017
Motivation: Large-scale Posterior Inference

Question: How do we scale Markov chain Monte Carlo (MCMC) posterior inference to massive datasets?

MCMC Benefit: Approximates intractable posterior expectations E_P[h(Z)] = ∫_X p(x) h(x) dx with asymptotically exact sample estimates E_Q[h(X)] = (1/n) Σ_{i=1}^n h(x_i)
Problem: Each point x_i requires iterating over the entire dataset!

Template solution: Approximate MCMC with subset posteriors
[Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]
Approximate a standard MCMC procedure in a manner that uses only a small subset of datapoints per sample
Reduced computational overhead leads to faster sampling and reduced Monte Carlo variance
Introduces asymptotic bias: the target distribution is not stationary
Hope that, for a fixed amount of sampling time, the variance reduction will outweigh the bias introduced

Introduces new challenges
How do we compare and evaluate samples from approximate MCMC procedures?
How do we select samplers and their tuning parameters?
How do we quantify the bias-variance trade-off explicitly?

Difficulty: Standard evaluation criteria like effective sample size, trace plots, and variance diagnostics assume convergence to the target distribution and do not account for asymptotic bias

This talk: Introduce new quality measures suitable for comparing the quality of approximate MCMC samples
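A minimal sketch (not part of the original slides) of the sample-average estimator above; the exact i.i.d. sampler and the test function h(x) = x^2 are illustrative stand-ins for MCMC output and a posterior functional of interest.

```python
# Sketch: the sample-average estimator E_Q[h(X)] = (1/n) * sum_i h(x_i).
# Illustrative stand-ins: P = N(0, 1) and h(x) = x^2, so the true value E_P[h(Z)] is 1.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)        # stand-in for sample points x_1, ..., x_n
h = lambda t: t ** 2               # test function of interest

estimate = h(x).mean()             # E_Q[h(X)]
print(estimate)                    # close to E_P[h(Z)] = 1 for this exact i.i.d. sampler
```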
Quality Measures for Samples

Challenge: Develop a measure suitable for comparing the quality of any two samples approximating a common target distribution

Given
Continuous target distribution P with support X = R^d and density p
p known up to normalization; integration under P is intractable
Sample points x_1, ..., x_n ∈ X
Define the discrete distribution Q_n with, for any function h, E_{Q_n}[h(X)] = (1/n) Σ_{i=1}^n h(x_i), used to approximate E_P[h(Z)]
We make no assumption about the provenance of the x_i

Goal: Quantify how well E_{Q_n} approximates E_P in a manner that
I. Detects when a sample sequence is converging to the target
II. Detects when a sample sequence is not converging to the target
III. Is computationally feasible
Integral Probability Metrics

Goal: Quantify how well E_{Q_n} approximates E_P

Idea: Consider an integral probability metric (IPM) [Müller, 1997]
d_H(Q_n, P) = sup_{h ∈ H} |E_{Q_n}[h(X)] − E_P[h(Z)]|
Measures the maximum discrepancy between sample and target expectations over a class of real-valued test functions H
When H is sufficiently large, convergence of d_H(Q_n, P) to zero implies (Q_n)_{n≥1} converges weakly to P (Requirement II)

Problem: Integration under P is intractable!
⇒ Most IPMs cannot be computed in practice

Idea: Only consider functions with E_P[h(Z)] known a priori to be 0
Then the IPM computation only depends on Q_n!
How do we select this class of test functions?
Will the resulting discrepancy measure track sample sequence convergence (Requirements I and II)?
How do we solve the resulting optimization problem in practice?
Stein's Method

Stein's method [1972] provides a recipe for controlling convergence:

1. Identify an operator T and a set G of functions g : X → R^d with E_P[(T g)(Z)] = 0 for all g ∈ G.
T and G together define the Stein discrepancy [Gorham and Mackey, 2015]
S(Q_n, T, G) ≜ sup_{g ∈ G} |E_{Q_n}[(T g)(X)]| = d_{T G}(Q_n, P),
an IPM-type measure with no explicit integration under P

2. Lower bound S(Q_n, T, G) by a reference IPM d_H(Q_n, P)
⇒ S(Q_n, T, G) → 0 only if (Q_n)_{n≥1} converges to P (Requirement II)
Performed once, in advance, for large classes of distributions

3. Upper bound S(Q_n, T, G) by any means necessary to demonstrate convergence to 0 (Requirement I)

Standard use: As an analytical tool to prove convergence
Our goal: Develop the Stein discrepancy into a practical quality measure
Identifying a Stein Operator T

Goal: Identify an operator T for which E_P[(T g)(Z)] = 0 for all g ∈ G

Approach: Generator method of Barbour [1988, 1990], Götze [1991]
Identify a Markov process (Z_t)_{t≥0} with stationary distribution P
Under mild conditions, its infinitesimal generator
(A u)(x) = lim_{t→0} (E[u(Z_t) | Z_0 = x] − u(x)) / t
satisfies E_P[(A u)(Z)] = 0

Overdamped Langevin diffusion: dZ_t = (1/2) ∇log p(Z_t) dt + dW_t
Generator: (A_P u)(x) = (1/2) ⟨∇u(x), ∇log p(x)⟩ + (1/2) ⟨∇, ∇u(x)⟩

Stein operator: (T_P g)(x) ≜ ⟨g(x), ∇log p(x)⟩ + ⟨∇, g(x)⟩
[Gorham and Mackey, 2015, Oates, Girolami, and Chopin, 2016]
Depends on P only through ∇log p; computable even if p cannot be normalized!
Multivariate generalization of the density method operator
(T g)(x) = g(x) (d/dx) log p(x) + g′(x) [Stein, Diaconis, Holmes, and Reinert, 2004]
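A quick Monte Carlo sanity check (not part of the original slides) that the Langevin Stein operator has mean zero under P; the target P = N(0, 1) and the function g(x) = sin(x) are illustrative choices.

```python
# Sketch: check E_P[(T_P g)(Z)] ~ 0 for the Langevin Stein operator.
# In one dimension, (T_P g)(x) = g(x) d/dx log p(x) + g'(x).
# Illustrative choices: P = N(0, 1), so d/dx log p(x) = -x, and g(x) = sin(x).
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=1_000_000)             # exact draws from P = N(0, 1)

def stein_op_g(x):
    return np.sin(x) * (-x) + np.cos(x)    # g(x) * score(x) + g'(x)

print(stein_op_g(z).mean())                # approximately 0, as E_P[(T_P g)(Z)] = 0
```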
Identifying a Stein Set G

Goal: Identify a set G for which E_P[(T_P g)(Z)] = 0 for all g ∈ G

Approach: Reproducing kernels k : X × X → R
A reproducing kernel k is symmetric (k(x, y) = k(y, x)) and positive semidefinite (Σ_{i,l} c_i c_l k(z_i, z_l) ≥ 0 for all z_i ∈ X, c_i ∈ R)
e.g., Gaussian kernel k(x, y) = e^{−(1/2)∥x−y∥₂²}
Yields the function space K_k = {f : f(x) = Σ_{i=1}^m c_i k(z_i, x), m ∈ N}
with norm ∥f∥_{K_k} = (Σ_{i,l=1}^m c_i c_l k(z_i, z_l))^{1/2}

We define the kernel Stein set of vector-valued g : X → R^d as
G_{k,∥·∥} ≜ {g = (g_1, ..., g_d) | ∥v∥_* ≤ 1 for v_j ≜ ∥g_j∥_{K_k}}.
Each g_j belongs to the reproducing kernel Hilbert space (RKHS) K_k
The component norms v_j ≜ ∥g_j∥_{K_k} are jointly bounded by 1
E_P[(T_P g)(Z)] = 0 for all g ∈ G_{k,∥·∥} whenever k ∈ C_b^{(1,1)} and ∇log p is integrable under P [Gorham and Mackey, 2017]
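A small numerical check (not part of the original slides) of the symmetry and positive semidefiniteness of kernel Gram matrices; the sample points and kernel parameters are arbitrary illustrative choices.

```python
# Sketch: a Gram matrix of a reproducing kernel is symmetric and positive semidefinite.
# Illustrative kernels: Gaussian and inverse multiquadric (IMQ); points are random.
import numpy as np

def gaussian_kernel(x, y):
    return np.exp(-0.5 * np.sum((x - y) ** 2))

def imq_kernel(x, y, c=1.0, beta=-0.5):
    return (c ** 2 + np.sum((x - y) ** 2)) ** beta

rng = np.random.default_rng(0)
z = rng.normal(size=(20, 3))                       # 20 points in R^3
for kern in (gaussian_kernel, imq_kernel):
    K = np.array([[kern(zi, zl) for zl in z] for zi in z])
    assert np.allclose(K, K.T)                     # symmetry
    print(np.linalg.eigvalsh(K).min())             # >= 0 up to numerical error
```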
Computing the Kernel Stein Discrepancy

Kernel Stein discrepancy (KSD): S(Q_n, T_P, G_{k,∥·∥})
Stein operator: (T_P g)(x) ≜ ⟨g(x), ∇log p(x)⟩ + ⟨∇, g(x)⟩
Stein set: G_{k,∥·∥} ≜ {g = (g_1, ..., g_d) | ∥v∥_* ≤ 1 for v_j ≜ ∥g_j∥_{K_k}}

Benefit: Computable in closed form [Gorham and Mackey, 2017]
S(Q_n, T_P, G_{k,∥·∥}) = ∥w∥ for w_j ≜ ((1/n²) Σ_{i,i′=1}^n k_0^j(x_i, x_{i′}))^{1/2}.
Reduces to parallelizable pairwise evaluations of the Stein kernels
k_0^j(x, y) ≜ (1 / (p(x) p(y))) ∇_{x_j} ∇_{y_j} (p(x) k(x, y) p(y))

Stein set choice inspired by the control functional kernels k_0 = Σ_{j=1}^d k_0^j of Oates, Girolami, and Chopin [2016]
When ∥·∥ = ∥·∥₂, recovers the KSD of Chwialkowski, Strathmann, and Gretton [2016], Liu, Lee, and Jordan [2016]
To ease notation, we will use G_k ≜ G_{k,∥·∥₂} in the remainder of the talk
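A hedged NumPy sketch (not part of the original slides) of this closed form, specialized to the IMQ base kernel k(x, y) = (c² + ∥x − y∥₂²)^β with equal 1/n sample weights and a user-supplied score function ∇log p. The Stein kernel k_0 = Σ_j k_0^j is expanded via the product rule rather than taken from any reference implementation.

```python
# Sketch: IMQ kernel Stein discrepancy S(Q_n, T_P, G_k) for an equally weighted sample.
# Inputs: x is an (n, d) array of sample points; score(x) returns grad log p(x_i) row-wise.
# Assumptions: IMQ base kernel k(x, y) = (c^2 + ||x - y||^2)^beta; the 1/n^2 normalization
# reflects the 1/n sample weights in Q_n. A sketch, not a reference implementation.
import numpy as np

def imq_ksd(x, score, c=1.0, beta=-0.5):
    n, d = x.shape
    b = score(x)                                    # (n, d) matrix of scores grad log p(x_i)
    diff = x[:, None, :] - x[None, :, :]            # pairwise differences x_i - x_i'
    sq = np.sum(diff ** 2, axis=-1)                 # squared distances ||x_i - x_i'||^2
    s = c ** 2 + sq
    k = s ** beta                                   # base kernel values
    # Stein kernel k_0(x, y) = sum_j k_0^j(x, y), expanded via the product rule:
    # k <b(x), b(y)> + <grad_x k, b(y)> + <grad_y k, b(x)> + sum_j d^2 k / dx_j dy_j
    term1 = k * (b @ b.T)
    term2 = 2 * beta * s ** (beta - 1) * np.einsum('ijk,jk->ij', diff, b)   # <grad_x k, b(y)>
    term3 = -2 * beta * s ** (beta - 1) * np.einsum('ijk,ik->ij', diff, b)  # <grad_y k, b(x)>
    term4 = -2 * beta * d * s ** (beta - 1) - 4 * beta * (beta - 1) * sq * s ** (beta - 2)
    k0 = term1 + term2 + term3 + term4              # n x n Stein kernel matrix
    return np.sqrt(k0.sum()) / n                    # S(Q_n, T_P, G_k)

# Example: standard Gaussian target, score(x) = -x
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(500, 2))
    print(imq_ksd(x, lambda x: -x))
```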
Detecting Non-convergence

Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P

In higher dimensions, KSDs based on common kernels fail to detect non-convergence, even for Gaussian targets P

Theorem (KSD fails with light kernel tails [Gorham and Mackey, 2017])
Suppose d ≥ 3, P = N(0, I_d), and α ≜ (1/2 − 1/d)^{−1}. If k(x, y) and its derivatives decay at an o(∥x − y∥₂^{−α}) rate as ∥x − y∥₂ → ∞, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P.

Gaussian (k(x, y) = e^{−(1/2)∥x−y∥₂²}) and Matérn kernels fail for d ≥ 3
Inverse multiquadric kernels (k(x, y) = (1 + ∥x − y∥₂²)^β) with β < −1 fail for d > 2β/(1 + β)
The violating sample sequences (Q_n)_{n≥1} are simple to construct
Problem: Kernels with light tails ignore excess mass in the tails
Detecting Non-convergence

Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P

Consider the inverse multiquadric (IMQ) kernel k(x, y) = (c² + ∥x − y∥₂²)^β for some β < 0, c ∈ R.
The IMQ KSD fails to detect non-convergence when β < −1
However, the IMQ KSD detects non-convergence when β ∈ (−1, 0)

Theorem (IMQ KSD detects non-convergence [Gorham and Mackey, 2017])
Suppose P ∈ 𝒫 and k(x, y) = (c² + ∥x − y∥₂²)^β for β ∈ (−1, 0). If S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.

𝒫 is the set of targets P with Lipschitz ∇log p and distant strong log concavity (⟨∇log(p(x)/p(y)), y − x⟩ / ∥x − y∥₂² ≥ κ for ∥x − y∥₂ ≥ r)
Includes Gaussian mixtures with common covariance, Bayesian logistic and Student's t regression with Gaussian priors, ...
Detecting Convergence

Goal: Show S(Q_n, T_P, G_k) → 0 when Q_n converges to P

Proposition (KSD detects convergence [Gorham and Mackey, 2017])
If k ∈ C_b^{(2,2)} and ∇log p is Lipschitz and square integrable under P, then S(Q_n, T_P, G_k) → 0 whenever the Wasserstein distance d_{W,∥·∥₂}(Q_n, P) → 0.

Covers Gaussian, Matérn, IMQ, and other common bounded kernels k
A Simple Example

[Figure: discrepancy value vs. number of sample points n for samples drawn i.i.d. from the mixture target P and from a single mixture component (left); recovered optimal Stein functions g and associated test functions h = T_P g (right)]

Left plot:
For the target p(x) ∝ e^{−(1/2)(x+1.5)²} + e^{−(1/2)(x−1.5)²}, compare an i.i.d. sample Q_n from P and an i.i.d. sample Q′_n from a single mixture component
Expect S(Q_n, T_P, G_k) → 0 and S(Q′_n, T_P, G_k) ↛ 0
Compare the IMQ KSD (β = −1/2, c = 1) with the Wasserstein distance

Right plot: For n = 10^3 sample points,
(Top) Recovered optimal Stein functions g ∈ G_k
(Bottom) Associated test functions h ≜ T_P g which best discriminate each sample from the target P
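The flavor of this example can be reproduced with the imq_ksd sketch above (not part of the original slides); the mixture score ∇log p is available in closed form, and the specific seeds and sample sizes are illustrative.

```python
# Sketch: compare an i.i.d. sample from the bimodal target with an i.i.d. sample from a
# single component, using the imq_ksd sketch above. Illustrative, not the slide code.
import numpy as np

def mixture_score(x):
    # grad log p for p(x) propto exp(-(x+1.5)^2/2) + exp(-(x-1.5)^2/2), in closed form
    a = np.exp(-0.5 * (x + 1.5) ** 2)
    b = np.exp(-0.5 * (x - 1.5) ** 2)
    return (-(x + 1.5) * a - (x - 1.5) * b) / (a + b)

rng = np.random.default_rng(0)
n = 1000
centers = rng.choice([-1.5, 1.5], size=n)
mixture_sample = (centers + rng.normal(size=n)).reshape(-1, 1)     # i.i.d. from P
single_component = (1.5 + rng.normal(size=n)).reshape(-1, 1)       # i.i.d. from one component

for name, x in [("mixture", mixture_sample), ("single component", single_component)]:
    print(name, imq_ksd(x, mixture_score))   # KSD should be smaller for the mixture sample
```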
Selecting Sampler Hyperparameters

[Figure: ESS (higher is better) and IMQ KSD (lower is better) as a function of the tolerance parameter ε, with bivariate (x_1, x_2) sample scatter plots for ε = 0 (n = 230), ε = 10^{-2} (n = 416), and ε = 10^{-1} (n = 1000)]

Approximate slice sampling [DuBois, Korattikara, Welling, and Smyth, 2014]
The tolerance parameter ε controls the number of datapoints evaluated
too small ⇒ too few sample points generated
too large ⇒ sampling from a very different distribution
Standard MCMC selection criteria like effective sample size (ESS) and asymptotic variance do not account for this bias
ESS is maximized at tolerance ε = 10^{-1}; the IMQ KSD is minimized at tolerance ε = 10^{-2}
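A sketch (not part of the original slides) of tolerance selection by minimum KSD; approximate_slice_sample is a hypothetical stand-in for an approximate MCMC routine run under a fixed likelihood-evaluation budget, and imq_ksd and the target score are as in the sketches above.

```python
# Sketch: pick the sampler tolerance that minimizes the KSD of the resulting sample.
# `approximate_slice_sample` is a hypothetical stand-in, not a real library call.
def select_tolerance(tolerances, approximate_slice_sample, score, budget):
    ksd_by_eps = {}
    for eps in tolerances:
        x = approximate_slice_sample(eps=eps, budget=budget)   # (n, d) sample points
        ksd_by_eps[eps] = imq_ksd(x, score)                    # quality of this sample
    return min(ksd_by_eps, key=ksd_by_eps.get), ksd_by_eps

# e.g., best_eps, _ = select_tolerance([0, 1e-4, 1e-3, 1e-2, 1e-1], sampler, score, budget=148_000)
```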
Selecting Samplers

Stochastic Gradient Fisher Scoring (SGFS) [Ahn, Korattikara, and Welling, 2012]
Approximate MCMC procedure designed for scalability
Approximates the Metropolis-adjusted Langevin algorithm but does not use a Metropolis-Hastings correction
Target P is not the stationary distribution

Goal: Choose between two variants
SGFS-f inverts a d × d matrix for each new sample point
SGFS-d inverts a diagonal matrix to reduce sampling time

MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]
10000 images, 51 features, binary label indicating whether the image shows a 7 or a 9
Bayesian logistic regression posterior P
[Figure: IMQ kernel Stein discrepancy vs. number of sample points n for SGFS-d and SGFS-f (left); SGFS-d and SGFS-f sample points with the best- and worst-aligned bivariate marginals (right)]

Left: IMQ KSD quality comparison for SGFS Bayesian logistic regression (no surrogate ground truth used)
Right: SGFS sample points (n = 5 × 10^4) with bivariate marginal means and 95% confidence ellipses (blue) that align best and worst with a surrogate ground truth sample (red)
Both suggest that the small speed-up of SGFS-d (0.0017s per sample vs. 0.0019s for SGFS-f) is outweighed by the loss in inferential accuracy
Beyond Sample Quality Comparison
One-sample hypothesis testing

Chwialkowski, Strathmann, and Gretton [2016] used the KSD S(Q_n, T_P, G_k) to test whether a sample was drawn from a target distribution P (see also Liu, Lee, and Jordan [2016])
Their test with the default Gaussian kernel k experienced considerable loss of power as the dimension d increased

We recreate their experiment with the IMQ kernel (β = −1/2, c = 1)
For n = 500, generate a sample (x_i)_{i=1}^n with x_i = z_i + u_i e_1, where z_i ∼ N(0, I_d) and u_i ∼ Unif[0, 1] are drawn i.i.d. Target P = N(0, I_d).
Compare with the standard normality test of Baringhaus and Henze [1988]

Table: Mean power of multivariate normality tests across 400 simulations

           d=2    d=5    d=10   d=15   d=20   d=25
B&H        1.0    1.0    1.0    0.91   0.57   0.26
Gaussian   1.0    1.0    0.88   0.29   0.12   0.02
IMQ        1.0    1.0    1.0    1.0    1.0    1.0
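A sketch (not part of the original slides) of the perturbed-Gaussian alternative from this experiment, evaluated with the imq_ksd sketch above. The actual goodness-of-fit test calibrates the KSD statistic (e.g., with a wild bootstrap), which is not reproduced here; only the raw discrepancy values are shown.

```python
# Sketch: generate the alternative x_i = z_i + u_i e_1 and compare raw IMQ KSD values
# against a genuine draw from the target P = N(0, I_d). Test calibration is omitted.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 15
z = rng.normal(size=(n, d))
x_alt = z.copy()
x_alt[:, 0] += rng.uniform(0.0, 1.0, size=n)       # perturb the first coordinate
x_null = rng.normal(size=(n, d))                    # a genuine sample from P

score = lambda x: -x                                # grad log p for N(0, I_d)
print("null:", imq_ksd(x_null, score))
print("alternative:", imq_ksd(x_alt, score))        # larger value flags the perturbation
```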
Beyond Sample Quality Comparison
Improving sample quality

Given sample points (x_i)_{i=1}^n, we can minimize the KSD S(Q̃_n, T_P, G_k) over all weighted samples Q̃_n = Σ_{i=1}^n q_n(x_i) δ_{x_i} for q_n a probability mass function
Liu and Lee [2016] do this with the Gaussian kernel k(x, y) = e^{−(1/h)∥x−y∥₂²}
Bandwidth h set to the median of the squared Euclidean distance between pairs of sample points
We recreate their experiment with the IMQ kernel k(x, y) = (1 + (1/h)∥x − y∥₂²)^{−1/2}
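A sketch (not part of the original slides) of the reweighting step as a quadratic minimization of qᵀK₀q over the probability simplex, where K₀ is the n × n Stein kernel matrix (the matrix summed inside the imq_ksd sketch; imq_stein_kernel_matrix below is a hypothetical helper exposing it). Liu and Lee [2016] use specialized solvers; plain SLSQP is shown only for concreteness.

```python
# Sketch: improve a sample by choosing simplex weights q that minimize the squared KSD
# q^T K0 q, with K0[i, i'] = k_0(x_i, x_i') the Stein kernel matrix.
import numpy as np
from scipy.optimize import minimize

def ksd_optimal_weights(K0):
    n = K0.shape[0]
    objective = lambda q: q @ K0 @ q                      # squared KSD of the weighted sample
    grad = lambda q: 2 * K0 @ q
    constraints = [{"type": "eq", "fun": lambda q: q.sum() - 1.0}]
    bounds = [(0.0, 1.0)] * n
    res = minimize(objective, np.full(n, 1.0 / n), jac=grad,
                   bounds=bounds, constraints=constraints, method="SLSQP")
    return res.x

# e.g., K0 = imq_stein_kernel_matrix(x, score)   # hypothetical helper returning k_0 matrix
#       q = ksd_optimal_weights(K0)              # KSD-optimal sample weights
```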
Improving Sample Quality

[Figure: average MSE ∥E_P[Z] − E_{Q̃_n}[X]∥₂² / d vs. dimension d ∈ {2, 10, 50, 75, 100} for the initial sample Q_n, Gaussian KSD weighting, and IMQ KSD weighting]

MSE averaged over 500 simulations (±2 standard errors)
Target P = N(0, I_d)
Starting sample Q_n = (1/n) Σ_{i=1}^n δ_{x_i} for x_i drawn i.i.d. from P, n = 100.
Future Directions

Many opportunities for future development

1. Improve scalability of the KSD while maintaining its convergence-determining properties
Low-rank, sparse, or stochastic approximations of the kernel matrix
Subsampling of likelihood terms in ∇log p

2. Addressing other inferential tasks
Design of control variates [Oates, Girolami, and Chopin, 2016, Oates and Girolami, 2016]
Variational inference [Liu and Wang, 2016, Liu and Feng, 2016]
Training generative adversarial networks [Wang and Liu, 2016]

3. Exploring the impact of Stein operator choice
An infinite number of operators T characterize P
How is the discrepancy impacted? How do we select the best T?
Thm: If ∇log p is bounded and k ∈ C_0^{(1,1)}, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P
Diffusion Stein operators (T g)(x) = (1/p(x)) ⟨∇, p(x) m(x) g(x)⟩ of Gorham, Duncan, Vollmer, and Mackey [2016] may be more appropriate for these heavy-tailed targets
References

S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proc. 29th ICML, ICML'12, 2012.
A. D. Barbour. Stein's method and Poisson process convergence. J. Appl. Probab., (Special Vol. 25A):175–184, 1988. A celebration of applied probability.
A. D. Barbour. Stein's method for diffusion approximations. Probab. Theory Related Fields, 84(3):297–322, 1990.
L. Baringhaus and N. Henze. A consistent test for multivariate normality based on the empirical characteristic function. Metrika, 35(1):339–348, 1988.
K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proc. 33rd ICML, ICML, 2016.
C. DuBois, A. Korattikara, M. Welling, and P. Smyth. Approximate slice sampling for Bayesian posterior inference. In Proc. 17th AISTATS, pages 185–193, 2014.
J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Adv. NIPS 28, pages 226–234. Curran Associates, Inc., 2015.
J. Gorham and L. Mackey. Measuring sample quality with kernels. arXiv:1703.01717, Mar. 2017.
J. Gorham, A. Duncan, S. Vollmer, and L. Mackey. Measuring sample quality with diffusions. arXiv:1611.06972, Nov. 2016.
F. Götze. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2):724–739, 1991.
A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. 31st ICML, ICML'14, 2014.
Q. Liu and Y. Feng. Two methods for wild variational inference. arXiv:1612.00081, 2016.
Q. Liu and J. Lee. Black-box importance sampling. arXiv:1610.05247, Oct. 2016. To appear in AISTATS 2017.
Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. arXiv:1608.04471, Aug. 2016.
Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. 33rd ICML, volume 48 of ICML, pages 276–284, 2016.
L. Mackey and J. Gorham. Multivariate Stein factors for a class of strongly log-concave distributions. arXiv:1512.07392, 2015.
A. Müller. Integral probability metrics and their generating classes of functions. Ann. Appl. Probab., 29(2):429–443, 1997.
C. Oates and M. Girolami. Control functionals for quasi-Monte Carlo integration. In Proc. 19th AISTATS, volume 51 of Proceedings of Machine Learning Research, pages 56–65, 2016.
C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.
C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability, Vol. II: Probability theory, pages 583–602. Univ. California Press, Berkeley, Calif., 1972.
C. Stein, P. Diaconis, S. Holmes, and G. Reinert. Use of exchangeable pairs in the analysis of simulations. In Stein's method: expository lectures and applications, volume 46 of IMS Lecture Notes Monogr. Ser., pages 1–26. Inst. Math. Statist., Beachwood, OH, 2004.
D. Wang and Q. Liu. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv:1611.01722, Nov. 2016.
M. Welling and Y. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
Comparing Discrepancies

[Figure: discrepancy value (IMQ KSD, graph Stein discrepancy, Wasserstein distance) vs. number of sample points n (left); discrepancy computation time in seconds vs. n for d = 1 and d = 4 (right)]

Left: Samples drawn i.i.d. from either the bimodal Gaussian mixture target p(x) ∝ e^{−(1/2)(x+1.5)²} + e^{−(1/2)(x−1.5)²} or a single mixture component.
Right: Discrepancy computation time using d cores in d dimensions.
The Importance of Kernel Choice

[Figure: Gaussian, Matérn, and inverse multiquadric KSDs vs. number of sample points n for dimensions d = 5, 8, 20, for a sample drawn i.i.d. from the target P and an off-target sample]

Target P = N(0, I_d)
Off-target Q_n has all ∥x_i∥₂ ≤ 2 n^{1/d} log n and ∥x_i − x_j∥₂ ≥ 2 log n
Gaussian and Matérn KSDs are driven to 0 by an off-target sequence that does not converge to P
The IMQ KSD (β = −1/2, c = 1) does not have this deficiency
Selecting Sampler Hyperparameters
Setup [Welling and Teh, 2011]

Consider the posterior distribution P induced by L datapoints y_l drawn i.i.d. from the Gaussian mixture likelihood
Y_l | X ∼ (1/2) N(X_1, 2) + (1/2) N(X_1 + X_2, 2)
under Gaussian priors on the parameters X ∈ R^2
X_1 ∼ N(0, 10) ⊥⊥ X_2 ∼ N(0, 1)
Draw 100 datapoints y_l with parameters (x_1, x_2) = (0, 1)
Induces a posterior with a second mode at (x_1, x_2) = (1, −1)
For a range of tolerance parameters ε, run approximate slice sampling for 148000 datapoint likelihood evaluations and store the resulting posterior sample Q_n
Use the minimum IMQ KSD (β = −1/2, c = 1) to select an appropriate ε
Compare with the standard MCMC parameter selection criterion, effective sample size (ESS), a measure of Markov chain autocorrelation
Compute the median of each diagnostic over 50 random sequences
Selecting Samplers
Setup

MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]
10000 images, 51 features, binary label indicating whether the image shows a 7 or a 9
Bayesian logistic regression posterior P
L independent observations (y_l, v_l) ∈ {1, −1} × R^d with P(Y_l = 1 | v_l, X) = 1 / (1 + exp(−⟨v_l, X⟩))
Flat improper prior on the parameters X ∈ R^d
Use the IMQ KSD (β = −1/2, c = 1) to compare SGFS-f to SGFS-d, drawing 10^5 sample points and discarding the first half as burn-in
For external support, compare bivariate marginal means and 95% confidence ellipses with a surrogate ground truth Hamiltonian Monte Carlo chain with 10^5 sample points [Ahn, Korattikara, and Welling, 2012]
Detecting Non-convergence

Goal: Show S(Q_n, T_P, G_k) → 0 only if (Q_n)_{n≥1} converges to P

Let 𝒫 be the set of targets P with Lipschitz ∇log p and distant strong log concavity (⟨∇log(p(x)/p(y)), y − x⟩ / ∥x − y∥₂² ≥ κ for ∥x − y∥₂ ≥ r)
Includes Gaussian mixtures with common covariance, Bayesian logistic and Student's t regression with Gaussian priors, ...
For a different Stein set G, Gorham, Duncan, Vollmer, and Mackey [2016] showed (Q_n)_{n≥1} converges to P if P ∈ 𝒫 and S(Q_n, T_P, G) → 0

New contribution [Gorham and Mackey, 2017]
Theorem (Univariate KSD detects non-convergence)
Suppose P ∈ 𝒫 and k(x, y) = Φ(x − y) for Φ ∈ C² with a non-vanishing generalized Fourier transform. If d = 1, then S(Q_n, T_P, G_k) → 0 only if (Q_n)_{n≥1} converges weakly to P.

Justifies use of the KSD with Gaussian, Matérn, or inverse multiquadric kernels k in the univariate case
The Importance of Tightness

Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P

A sequence (Q_n)_{n≥1} is uniformly tight if for every ε > 0, there is a finite number R(ε) such that sup_n Q_n(∥X∥₂ > R(ε)) ≤ ε
Intuitively, no mass in the sequence escapes to infinity

Theorem (KSD detects tight non-convergence [Gorham and Mackey, 2017])
Suppose that P ∈ 𝒫 and k(x, y) = Φ(x − y) for Φ ∈ C² with a non-vanishing generalized Fourier transform. If (Q_n)_{n≥1} is uniformly tight and S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.

Good news, but, ideally, the KSD would detect non-tight sequences automatically...
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 

Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applied Mathematics Opening Workshop, Measuring Sample Quality with Stein's Method - Lester Mackey, Aug 29, 2017

• 8-9. Motivation: Large-scale Posterior Inference
Template solution: Approximate MCMC with subset posteriors [Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]
Hope that for a fixed amount of sampling time, the variance reduction will outweigh the bias introduced
Introduces new challenges:
  How do we compare and evaluate samples from approximate MCMC procedures?
  How do we select samplers and their tuning parameters?
  How do we quantify the bias-variance trade-off explicitly?
Difficulty: Standard evaluation criteria like effective sample size, trace plots, and variance diagnostics assume convergence to the target distribution and do not account for asymptotic bias
This talk: Introduce new quality measures suitable for comparing the quality of approximate MCMC samples
• 10-16. Quality Measures for Samples
Challenge: Develop a measure suitable for comparing the quality of any two samples approximating a common target distribution
Given:
  Continuous target distribution P with support $\mathcal{X} = \mathbb{R}^d$ and density p; p is known only up to normalization, and integration under P is intractable
  Sample points $x_1, \ldots, x_n \in \mathcal{X}$; define the discrete distribution $Q_n$ with, for any function h, $E_{Q_n}[h(X)] = \frac{1}{n}\sum_{i=1}^n h(x_i)$, used to approximate $E_P[h(Z)]$
  We make no assumption about the provenance of the $x_i$
Goal: Quantify how well $E_{Q_n}$ approximates $E_P$ in a manner that
  I. Detects when a sample sequence is converging to the target
  II. Detects when a sample sequence is not converging to the target
  III. Is computationally feasible
• 17-22. Integral Probability Metrics
Goal: Quantify how well $E_{Q_n}$ approximates $E_P$
Idea: Consider an integral probability metric (IPM) [Müller, 1997]
  $d_{\mathcal{H}}(Q_n, P) = \sup_{h \in \mathcal{H}} |E_{Q_n}[h(X)] - E_P[h(Z)]|$
  Measures the maximum discrepancy between sample and target expectations over a class of real-valued test functions $\mathcal{H}$
  When $\mathcal{H}$ is sufficiently large, convergence of $d_{\mathcal{H}}(Q_n, P)$ to zero implies $(Q_n)_{n \ge 1}$ converges weakly to P (Requirement II)
Problem: Integration under P is intractable, so most IPMs cannot be computed in practice
Idea: Only consider functions with $E_P[h(Z)]$ known a priori to be 0; then the IPM computation depends only on $Q_n$!
  How do we select this class of test functions?
  Will the resulting discrepancy measure track sample sequence convergence (Requirements I and II)?
  How do we solve the resulting optimization problem in practice?
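To make the "zero-mean test functions" idea concrete, here is a minimal sketch (my addition, not from the talk): for the target P = N(0, 1), the functions h(x) = x, x^2 - 1, and x^3 - 3x all satisfy E_P[h(Z)] = 0, so an IPM restricted to this small illustrative class can be evaluated from the sample alone. The tiny test class and function names are choices of this sketch, not part of the original method.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small class of test functions h with E_P[h(Z)] = 0 when P = N(0, 1)
test_fns = [lambda x: x, lambda x: x ** 2 - 1, lambda x: x ** 3 - 3 * x]

def finite_ipm(sample):
    """sup_h |E_Qn[h(X)] - E_P[h(Z)]| over the finite class; the E_P term is 0 by construction."""
    return max(abs(np.mean(h(sample))) for h in test_fns)

on_target = rng.normal(size=2000)            # i.i.d. from P = N(0, 1)
off_target = 0.5 + rng.normal(size=2000)     # i.i.d. from a shifted distribution
print(finite_ipm(on_target), finite_ipm(off_target))   # small vs. roughly 0.5
```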
• 23-27. Stein’s Method
Stein’s method [1972] provides a recipe for controlling convergence:
1. Identify an operator $\mathcal{T}$ and a set $\mathcal{G}$ of functions $g : \mathcal{X} \to \mathbb{R}^d$ with $E_P[(\mathcal{T} g)(Z)] = 0$ for all $g \in \mathcal{G}$.
   $\mathcal{T}$ and $\mathcal{G}$ together define the Stein discrepancy [Gorham and Mackey, 2015]
   $S(Q_n, \mathcal{T}, \mathcal{G}) \triangleq \sup_{g \in \mathcal{G}} |E_{Q_n}[(\mathcal{T} g)(X)]| = d_{\mathcal{T}\mathcal{G}}(Q_n, P)$,
   an IPM-type measure with no explicit integration under P
2. Lower bound $S(Q_n, \mathcal{T}, \mathcal{G})$ by a reference IPM $d_{\mathcal{H}}(Q_n, P)$, so that $S(Q_n, \mathcal{T}, \mathcal{G}) \to 0$ only if $(Q_n)_{n \ge 1}$ converges to P (Req. II)
   Performed once, in advance, for large classes of distributions
3. Upper bound $S(Q_n, \mathcal{T}, \mathcal{G})$ by any means necessary to demonstrate convergence to 0 (Requirement I)
Standard use: As an analytical tool to prove convergence
Our goal: Develop the Stein discrepancy into a practical quality measure
• 28-30. Identifying a Stein Operator T
Goal: Identify an operator $\mathcal{T}$ for which $E_P[(\mathcal{T} g)(Z)] = 0$ for all $g \in \mathcal{G}$
Approach: Generator method of Barbour [1988, 1990], Götze [1991]
  Identify a Markov process $(Z_t)_{t \ge 0}$ with stationary distribution P
  Under mild conditions, its infinitesimal generator $(\mathcal{A}u)(x) = \lim_{t \to 0} (E[u(Z_t) \mid Z_0 = x] - u(x))/t$ satisfies $E_P[(\mathcal{A}u)(Z)] = 0$
Overdamped Langevin diffusion: $dZ_t = \frac{1}{2} \nabla \log p(Z_t)\, dt + dW_t$
Generator: $(\mathcal{A}_P u)(x) = \frac{1}{2} \langle \nabla u(x), \nabla \log p(x) \rangle + \frac{1}{2} \langle \nabla, \nabla u(x) \rangle$
Stein operator: $(\mathcal{T}_P g)(x) \triangleq \langle g(x), \nabla \log p(x) \rangle + \langle \nabla, g(x) \rangle$ [Gorham and Mackey, 2015, Oates, Girolami, and Chopin, 2016]
  Depends on P only through $\nabla \log p$; computable even if p cannot be normalized!
  Multivariate generalization of the density method operator $(\mathcal{T} g)(x) = g(x) \frac{d}{dx} \log p(x) + g'(x)$ [Stein, Diaconis, Holmes, and Reinert, 2004]
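As a quick numerical check of the Stein identity E_P[(T_P g)(Z)] = 0 (my addition, not from the slides), the sketch below evaluates the Langevin Stein operator at one smooth choice of g for a standard normal target; the particular g and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
Z = rng.normal(size=(100_000, d))          # i.i.d. draws from P = N(0, I_d)

score = lambda Z: -Z                        # grad log p(z) for the standard normal
g = lambda Z: np.tanh(Z)                    # a smooth vector-valued function g: R^d -> R^d
div_g = lambda Z: np.sum(1.0 - np.tanh(Z) ** 2, axis=1)   # divergence <grad, g(z)>

# Langevin Stein operator (T_P g)(z) = <g(z), grad log p(z)> + <grad, g(z)>
stein_values = np.sum(g(Z) * score(Z), axis=1) + div_g(Z)
print(stein_values.mean())                  # close to 0, as the Stein identity predicts
```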
• 31-34. Identifying a Stein Set G
Goal: Identify a set $\mathcal{G}$ for which $E_P[(\mathcal{T}_P g)(Z)] = 0$ for all $g \in \mathcal{G}$
Approach: Reproducing kernels $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
  A reproducing kernel k is symmetric ($k(x, y) = k(y, x)$) and positive semidefinite ($\sum_{i,l} c_i c_l k(z_i, z_l) \ge 0$ for all $z_i \in \mathcal{X}$, $c_i \in \mathbb{R}$), e.g., the Gaussian kernel $k(x, y) = e^{-\frac{1}{2}\|x - y\|_2^2}$
  Yields the function space $\mathcal{K}_k = \{f : f(x) = \sum_{i=1}^m c_i k(z_i, x),\, m \in \mathbb{N}\}$ with norm $\|f\|_{\mathcal{K}_k} = \sqrt{\sum_{i,l=1}^m c_i c_l k(z_i, z_l)}$
We define the kernel Stein set of vector-valued $g : \mathcal{X} \to \mathbb{R}^d$ as
  $\mathcal{G}_{k,\|\cdot\|} \triangleq \{g = (g_1, \ldots, g_d) \mid \|v\|_* \le 1 \text{ for } v_j \triangleq \|g_j\|_{\mathcal{K}_k}\}$
  Each $g_j$ belongs to the reproducing kernel Hilbert space (RKHS) $\mathcal{K}_k$, and the component norms $v_j \triangleq \|g_j\|_{\mathcal{K}_k}$ are jointly bounded by 1
  $E_P[(\mathcal{T}_P g)(Z)] = 0$ for all $g \in \mathcal{G}_{k,\|\cdot\|}$ whenever $k \in C_b^{(1,1)}$ and $\nabla \log p$ is integrable under P [Gorham and Mackey, 2017]
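A small numerical illustration (my addition) of the two defining kernel properties: build Gram matrices for the Gaussian and inverse multiquadric kernels at random points and confirm symmetry and, up to round-off, positive semidefiniteness via their eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # 50 points in R^3

def gaussian_kernel(x, y):
    return np.exp(-0.5 * np.sum((x - y) ** 2))

def imq_kernel(x, y, c=1.0, beta=-0.5):
    return (c ** 2 + np.sum((x - y) ** 2)) ** beta

for kernel in (gaussian_kernel, imq_kernel):
    K = np.array([[kernel(x, y) for y in X] for x in X])
    assert np.allclose(K, K.T)                           # symmetry
    print(kernel.__name__, np.linalg.eigvalsh(K).min())  # nonnegative up to round-off
```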
• 35-39. Computing the Kernel Stein Discrepancy
Kernel Stein discrepancy (KSD): $S(Q_n, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|})$
  Stein operator: $(\mathcal{T}_P g)(x) \triangleq \langle g(x), \nabla \log p(x) \rangle + \langle \nabla, g(x) \rangle$
  Stein set: $\mathcal{G}_{k,\|\cdot\|} \triangleq \{g = (g_1, \ldots, g_d) \mid \|v\|_* \le 1 \text{ for } v_j \triangleq \|g_j\|_{\mathcal{K}_k}\}$
Benefit: Computable in closed form [Gorham and Mackey, 2017]
  $S(Q_n, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|}) = \|w\|$ for $w_j \triangleq \sqrt{\tfrac{1}{n^2} \sum_{i,i'=1}^n k_0^j(x_i, x_{i'})}$
  Reduces to parallelizable pairwise evaluations of the Stein kernels $k_0^j(x, y) \triangleq \frac{1}{p(x)p(y)} \nabla_{x_j} \nabla_{y_j} \big(p(x)\, k(x, y)\, p(y)\big)$ (a numerical sketch follows below)
Stein set choice inspired by the control functional kernels $k_0 = \sum_{j=1}^d k_0^j$ of Oates, Girolami, and Chopin [2016]
When $\|\cdot\| = \|\cdot\|_2$, this recovers the KSD of Chwialkowski, Strathmann, and Gretton [2016] and Liu, Lee, and Jordan [2016]
To ease notation, we write $\mathcal{G}_k \triangleq \mathcal{G}_{k,\|\cdot\|_2}$ in the remainder of the talk
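The closed form above is straightforward to evaluate once the score ∇ log p is available. The sketch below is my own expansion of the Stein kernel k_0 = Σ_j k_0^j for the l2 Stein set and the IMQ base kernel, not the authors' reference implementation; the function names and the Gaussian-target usage example are illustrative.

```python
import numpy as np

def imq_ksd(X, score, c=1.0, beta=-0.5):
    """KSD of the sample X (n x d) against the target whose score function
    (gradient of log density) is `score`, using the IMQ base kernel
    k(x, y) = (c^2 + ||x - y||^2)^beta and the Langevin Stein operator."""
    n, d = X.shape
    S = score(X)                                   # n x d matrix of grad log p(x_i)
    diff = X[:, None, :] - X[None, :, :]           # pairwise differences, n x n x d
    r2 = np.sum(diff ** 2, axis=-1)                # squared pairwise distances
    base = c ** 2 + r2
    k = base ** beta                               # base kernel values
    gradx_k = 2 * beta * base[..., None] ** (beta - 1) * diff   # grad_x k(x_i, x_j)
    # sum_j d^2 k / dx_j dy_j for the IMQ kernel
    trace_term = (-4 * beta * (beta - 1) * base ** (beta - 2) * r2
                  - 2 * d * beta * base ** (beta - 1))
    k0 = (trace_term
          + np.einsum('ijk,jk->ij', gradx_k, S)    # <grad_x k, score(x_j)>
          - np.einsum('ijk,ik->ij', gradx_k, S)    # <grad_y k, score(x_i)>
          + k * (S @ S.T))                         # k(x, y) <score(x_i), score(x_j)>
    return np.sqrt(np.mean(k0))

# Usage sketch: standard normal target in d = 2, on-target vs. shifted sample
rng = np.random.default_rng(0)
score = lambda X: -X                               # grad log p for P = N(0, I_d)
on_target = rng.normal(size=(500, 2))
shifted = on_target + np.array([2.0, 0.0])
print(imq_ksd(on_target, score), imq_ksd(shifted, score))
```

The double sum makes the cost quadratic in n, but the pairwise Stein kernel evaluations are independent, which is what makes the computation parallelizable as the slide notes.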
• 40-42. Detecting Non-convergence
Goal: Show $S(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ only if $Q_n$ converges to P
In higher dimensions, KSDs based on common kernels fail to detect non-convergence, even for Gaussian targets P
Theorem (KSD fails with light kernel tails [Gorham and Mackey, 2017]): Suppose $d \ge 3$, $P = N(0, I_d)$, and $\alpha \triangleq (\frac{1}{2} - \frac{1}{d})^{-1}$. If $k(x, y)$ and its derivatives decay at an $o(\|x - y\|_2^{-\alpha})$ rate as $\|x - y\|_2 \to \infty$, then $S(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ for some $(Q_n)_{n \ge 1}$ not converging to P.
  The Gaussian kernel ($k(x, y) = e^{-\frac{1}{2}\|x - y\|_2^2}$) and Matérn kernels fail for $d \ge 3$
  Inverse multiquadric kernels ($k(x, y) = (1 + \|x - y\|_2^2)^{\beta}$) with $\beta < -1$ fail for $d > \frac{2\beta}{1 + \beta}$
  The violating sample sequences $(Q_n)_{n \ge 1}$ are simple to construct
Problem: Kernels with light tails ignore excess mass in the tails
• 43-44. Detecting Non-convergence
Goal: Show $S(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ only if $Q_n$ converges to P
Consider the inverse multiquadric (IMQ) kernel $k(x, y) = (c^2 + \|x - y\|_2^2)^{\beta}$ for some $\beta < 0$, $c \in \mathbb{R}$.
  The IMQ KSD fails to detect non-convergence when $\beta < -1$
  However, the IMQ KSD detects non-convergence when $\beta \in (-1, 0)$
Theorem (IMQ KSD detects non-convergence [Gorham and Mackey, 2017]): Suppose $P \in \mathcal{P}$ and $k(x, y) = (c^2 + \|x - y\|_2^2)^{\beta}$ for $\beta \in (-1, 0)$. If $S(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$, then $(Q_n)_{n \ge 1}$ converges weakly to P.
  $\mathcal{P}$ is the set of targets P with Lipschitz $\nabla \log p$ and distant strong log concavity ($\frac{\langle \nabla \log(p(x)/p(y)),\, y - x \rangle}{\|x - y\|_2^2} \ge \kappa$ for $\|x - y\|_2 \ge r$)
  Includes Gaussian mixtures with common covariance, Bayesian logistic and Student’s t regression with Gaussian priors, ...
• 45-46. Detecting Convergence
Goal: Show $S(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ when $Q_n$ converges to P
Proposition (KSD detects convergence [Gorham and Mackey, 2017]): If $k \in C_b^{(2,2)}$ and $\nabla \log p$ is Lipschitz and square integrable under P, then $S(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ whenever the Wasserstein distance $d_{W_{\|\cdot\|_2}}(Q_n, P) \to 0$.
Covers the Gaussian, Matérn, IMQ, and other common bounded kernels k
• 47. A Simple Example
[Figure: log-log plot of discrepancy value vs. number of sample points n for a sample drawn i.i.d. from the mixture target P and a sample drawn i.i.d. from a single mixture component.]
Left plot: For the target $p(x) \propto e^{-\frac{1}{2}(x + 1.5)^2} + e^{-\frac{1}{2}(x - 1.5)^2}$, compare an i.i.d. sample $Q_n$ from P with an i.i.d. sample $Q'_n$ from a single component
Expect $S(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ and $S(Q'_n, \mathcal{T}_P, \mathcal{G}_k) \not\to 0$
Compare the IMQ KSD ($\beta = -1/2$, $c = 1$) with the Wasserstein distance
• 48. A Simple Example
[Figure: (Top) recovered optimal Stein functions g; (Bottom) associated test functions h = T_P g, plotted over x for the mixture-target sample and the single-component sample.]
Right plot: For $n = 10^3$ sample points,
  (Top) Recovered optimal Stein functions $g \in \mathcal{G}_k$
  (Bottom) Associated test functions $h \triangleq \mathcal{T}_P g$ which best discriminate the sample $Q_n$ from the target P
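Reproducing this experiment qualitatively requires only the score of the mixture target; the sketch below (my addition) computes that score and prepares the two samples, which can then be passed to the imq_ksd helper sketched earlier (assumed to be in scope).

```python
import numpy as np

def mixture_score(X):
    """d/dx log p(x) for p(x) proportional to exp(-(x+1.5)^2/2) + exp(-(x-1.5)^2/2);
    X has shape (n, 1)."""
    x = X[:, 0]
    w1 = np.exp(-0.5 * (x + 1.5) ** 2)
    w2 = np.exp(-0.5 * (x - 1.5) ** 2)
    score = (-(x + 1.5) * w1 - (x - 1.5) * w2) / (w1 + w2)
    return score[:, None]

rng = np.random.default_rng(0)
n = 1000
means = rng.choice([-1.5, 1.5], size=n)            # i.i.d. from the mixture target
on_target = (means + rng.normal(size=n))[:, None]
off_target = (1.5 + rng.normal(size=n))[:, None]   # i.i.d. from a single component
# imq_ksd(on_target, mixture_score) should shrink as n grows;
# imq_ksd(off_target, mixture_score) should plateau away from zero.
```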
• 49. Selecting Sampler Hyperparameters
[Figure: ESS (higher is better) and IMQ KSD (lower is better) as functions of the tolerance parameter ε ∈ {0, 10^-4, 10^-3, 10^-2, 10^-1}; scatter plots of the resulting samples for ε = 0 (n = 230), ε = 10^-2 (n = 416), and ε = 10^-1 (n = 1000).]
Approximate slice sampling [DuBois, Korattikara, Welling, and Smyth, 2014]
  The tolerance parameter ε controls the number of datapoints evaluated
  Too small ⇒ too few sample points generated; too large ⇒ sampling from a very different distribution
Standard MCMC selection criteria like effective sample size (ESS) and asymptotic variance do not account for this bias
ESS is maximized at tolerance ε = 10^-1; the IMQ KSD is minimized at tolerance ε = 10^-2
• 50. Selecting Samplers
Stochastic Gradient Fisher Scoring (SGFS) [Ahn, Korattikara, and Welling, 2012]
  Approximate MCMC procedure designed for scalability
  Approximates the Metropolis-adjusted Langevin algorithm but does not use a Metropolis-Hastings correction
  The target P is not the stationary distribution
Goal: Choose between two variants
  SGFS-f inverts a d × d matrix for each new sample point
  SGFS-d inverts a diagonal matrix to reduce sampling time
MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]: 10000 images, 51 features, binary label indicating whether an image shows a 7 or a 9
Bayesian logistic regression posterior P
• 51. Selecting Samplers
[Figure: (Left) IMQ kernel Stein discrepancy vs. number of sample points n for SGFS-d and SGFS-f; (Right) bivariate marginals of SGFS-d and SGFS-f sample points with the best- and worst-aligned confidence ellipses.]
Left: IMQ KSD quality comparison for SGFS Bayesian logistic regression (no surrogate ground truth used)
Right: SGFS sample points ($n = 5 \times 10^4$) with bivariate marginal means and 95% confidence ellipses (blue) that align best and worst with a surrogate ground truth sample (red)
Both suggest the small speed-up of SGFS-d (0.0017s per sample vs. 0.0019s for SGFS-f) is outweighed by the loss in inferential accuracy
• 52-53. Beyond Sample Quality Comparison
One-sample hypothesis testing
  Chwialkowski, Strathmann, and Gretton [2016] used the KSD $S(Q_n, \mathcal{T}_P, \mathcal{G}_k)$ to test whether a sample was drawn from a target distribution P (see also Liu, Lee, and Jordan [2016])
  The test with the default Gaussian kernel k experienced considerable loss of power as the dimension d increased
We recreate their experiment with the IMQ kernel ($\beta = -\frac{1}{2}$, $c = 1$)
  For n = 500, generate a sample $(x_i)_{i=1}^n$ with $x_i = z_i + u_i e_1$, $z_i \overset{iid}{\sim} N(0, I_d)$ and $u_i \overset{iid}{\sim} \mathrm{Unif}[0, 1]$. Target $P = N(0, I_d)$.
  Compare with the standard normality test of Baringhaus and Henze [1988]
Table: Mean power of multivariate normality tests across 400 simulations
            d=2    d=5    d=10   d=15   d=20   d=25
  B&H       1.0    1.0    1.0    0.91   0.57   0.26
  Gaussian  1.0    1.0    0.88   0.29   0.12   0.02
  IMQ       1.0    1.0    1.0    1.0    1.0    1.0
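One simple way to calibrate such a test when, as here, the target N(0, I_d) can be sampled directly is to build a Monte Carlo null distribution for the KSD and reject at the (1 - α) quantile. The sketch below (my addition) does exactly that; it is a simplification rather than a reimplementation of the wild-bootstrap test of Chwialkowski, Strathmann, and Gretton [2016], and it assumes the imq_ksd helper sketched earlier is in scope.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, alpha, n_null = 10, 500, 0.05, 200
score = lambda X: -X                      # score of the target P = N(0, I_d)

# Observed sample: the z_i + u_i * e_1 perturbation from the experiment above
X = rng.normal(size=(n, d))
X[:, 0] += rng.uniform(size=n)
observed = imq_ksd(X, score)              # assumes the imq_ksd sketch above is in scope

# Monte Carlo null: KSD values for fresh i.i.d. samples drawn from P itself
null = np.array([imq_ksd(rng.normal(size=(n, d)), score) for _ in range(n_null)])
threshold = np.quantile(null, 1 - alpha)
print("reject H0:", observed > threshold)
```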
• 54. Beyond Sample Quality Comparison
Improving sample quality
  Given sample points $(x_i)_{i=1}^n$, we can minimize the KSD $S(\tilde{Q}_n, \mathcal{T}_P, \mathcal{G}_k)$ over all weighted samples $\tilde{Q}_n = \sum_{i=1}^n q_n(x_i) \delta_{x_i}$ for $q_n$ a probability mass function (a computational sketch follows below)
  Liu and Lee [2016] do this with the Gaussian kernel $k(x, y) = e^{-\frac{1}{h}\|x - y\|_2^2}$, with the bandwidth h set to the median of the squared Euclidean distance between pairs of sample points
  We recreate their experiment with the IMQ kernel $k(x, y) = (1 + \frac{1}{h}\|x - y\|_2^2)^{-1/2}$
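Because $S(\tilde{Q}_n, \mathcal{T}_P, \mathcal{G}_k)^2 = \sum_{i,i'} q_n(x_i) q_n(x_{i'}) k_0(x_i, x_{i'})$, the optimal weights solve a small quadratic program over the probability simplex. The sketch below (my addition) uses scipy's SLSQP solver and an IMQ Stein kernel with fixed c = 1, β = -1/2 on a Gaussian target for concreteness; it is not the exact procedure or bandwidth choice of Liu and Lee [2016], and it repeats the k_0 expansion so that it runs on its own.

```python
import numpy as np
from scipy.optimize import minimize

def imq_stein_kernel_matrix(X, score, c=1.0, beta=-0.5):
    """n x n matrix K0 with K0[i, j] = k_0(x_i, x_j) for the IMQ base kernel."""
    n, d = X.shape
    S = score(X)
    diff = X[:, None, :] - X[None, :, :]
    r2 = np.sum(diff ** 2, axis=-1)
    base = c ** 2 + r2
    k = base ** beta
    gradx_k = 2 * beta * base[..., None] ** (beta - 1) * diff
    trace = -4 * beta * (beta - 1) * base ** (beta - 2) * r2 - 2 * d * beta * base ** (beta - 1)
    return (trace
            + np.einsum('ijk,jk->ij', gradx_k, S)
            - np.einsum('ijk,ik->ij', gradx_k, S)
            + k * (S @ S.T))

rng = np.random.default_rng(0)
n, d = 100, 2
X = rng.normal(size=(n, d))                      # i.i.d. from the target P = N(0, I_d)
K0 = imq_stein_kernel_matrix(X, lambda X: -X)

# Minimize q^T K0 q over the probability simplex (q >= 0, sum q = 1)
res = minimize(lambda q: q @ K0 @ q, x0=np.full(n, 1.0 / n),
               jac=lambda q: 2 * K0 @ q, bounds=[(0.0, 1.0)] * n,
               constraints=[{"type": "eq", "fun": lambda q: q.sum() - 1.0}],
               method="SLSQP")
q = res.x
print("equal-weight KSD:", np.sqrt(np.mean(K0)), "optimized KSD:", np.sqrt(q @ K0 @ q))
```

With only n = 100 points a dense quadratic-program solver is adequate; for much larger n one would switch to a projected-gradient style update on the simplex.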
• 55. Improving Sample Quality
[Figure: average MSE $\|E_P Z - E_{\tilde{Q}_n} X\|_2^2$ vs. dimension d ∈ {2, 10, 50, 75, 100} for the initial sample $Q_n$, Gaussian KSD weighting, and IMQ KSD weighting.]
MSE averaged over 500 simulations (±2 standard errors)
Target $P = N(0, I_d)$
Starting sample $Q_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ for $x_i \overset{iid}{\sim} P$, n = 100
• 56. Future Directions
Many opportunities for future development
1. Improve the scalability of the KSD while maintaining its convergence-determining properties
   Low-rank, sparse, or stochastic approximations of the kernel matrix
   Subsampling of likelihood terms in $\nabla \log p$
2. Addressing other inferential tasks
   Design of control variates [Oates, Girolami, and Chopin, 2016, Oates and Girolami, 2016]
   Variational inference [Liu and Wang, 2016, Liu and Feng, 2016]
   Training generative adversarial networks [Wang and Liu, 2016]
3. Exploring the impact of Stein operator choice
   An infinite number of operators $\mathcal{T}$ characterize P. How is the discrepancy impacted? How do we select the best $\mathcal{T}$?
   Thm: If $\nabla \log p$ is bounded and $k \in C_0^{(1,1)}$, then $S(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ for some $(Q_n)_{n \ge 1}$ not converging to P
   The diffusion Stein operators $(\mathcal{T} g)(x) = \frac{1}{p(x)} \langle \nabla, p(x)\, m(x)\, g(x) \rangle$ of Gorham, Duncan, Vollmer, and Mackey [2016] may be more appropriate for these heavy-tailed targets
• 57. References I
S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proc. 29th ICML, 2012.
A. D. Barbour. Stein's method and Poisson process convergence. J. Appl. Probab., (Special Vol. 25A):175-184, 1988.
A. D. Barbour. Stein's method for diffusion approximations. Probab. Theory Related Fields, 84(3):297-322, 1990.
L. Baringhaus and N. Henze. A consistent test for multivariate normality based on the empirical characteristic function. Metrika, 35(1):339-348, 1988.
K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proc. 33rd ICML, 2016.
C. DuBois, A. Korattikara, M. Welling, and P. Smyth. Approximate slice sampling for Bayesian posterior inference. In Proc. 17th AISTATS, pages 185-193, 2014.
J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In Adv. NIPS 28, pages 226-234, 2015.
J. Gorham and L. Mackey. Measuring sample quality with kernels. arXiv:1703.01717, Mar. 2017.
J. Gorham, A. Duncan, S. Vollmer, and L. Mackey. Measuring sample quality with diffusions. arXiv:1611.06972, Nov. 2016.
F. Götze. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2):724-739, 1991.
A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. 31st ICML, 2014.
Q. Liu and Y. Feng. Two methods for wild variational inference. arXiv:1612.00081, 2016.
Q. Liu and J. Lee. Black-box importance sampling. arXiv:1610.05247, Oct. 2016. To appear in AISTATS 2017.
Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. arXiv:1608.04471, Aug. 2016.
Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. 33rd ICML, pages 276-284, 2016.
L. Mackey and J. Gorham. Multivariate Stein factors for a class of strongly log-concave distributions. arXiv:1512.07392, 2015.
A. Müller. Integral probability metrics and their generating classes of functions. Ann. Appl. Probab., 29(2):429-443, 1997.
• 58. References II
C. Oates and M. Girolami. Control functionals for quasi-Monte Carlo integration. In Proc. 19th AISTATS, volume 51 of PMLR, pages 56-65, 2016.
C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.
C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability, Vol. II: Probability theory, pages 583-602. Univ. California Press, 1972.
C. Stein, P. Diaconis, S. Holmes, and G. Reinert. Use of exchangeable pairs in the analysis of simulations. In Stein's method: expository lectures and applications, volume 46 of IMS Lecture Notes Monogr. Ser., pages 1-26. Inst. Math. Statist., Beachwood, OH, 2004.
D. Wang and Q. Liu. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv:1611.01722, Nov. 2016.
M. Welling and Y. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
• 59. Comparing Discrepancies
[Figure: (Left) discrepancy value (IMQ KSD, graph Stein discrepancy, Wasserstein distance) vs. number of sample points n for samples drawn i.i.d. from the mixture target P and from a single mixture component; (Right) discrepancy computation time in seconds vs. n for d = 1 and d = 4.]
Left: Samples drawn i.i.d. from either the bimodal Gaussian mixture target $p(x) \propto e^{-\frac{1}{2}(x + 1.5)^2} + e^{-\frac{1}{2}(x - 1.5)^2}$ or a single mixture component.
Right: Discrepancy computation time using d cores in d dimensions.
• 60. The Importance of Kernel Choice
[Figure: Gaussian, Matérn, and inverse multiquadric kernel Stein discrepancies vs. number of sample points n for an i.i.d. on-target sample and an off-target sample, in dimensions d = 5, 8, 20.]
Target $P = N(0, I_d)$
Off-target $Q_n$ has all $\|x_i\|_2 \le 2 n^{1/d} \log n$ and $\|x_i - x_j\|_2 \ge 2 \log n$
The Gaussian and Matérn KSDs are driven to 0 by an off-target sequence that does not converge to P
The IMQ KSD ($\beta = -\frac{1}{2}$, $c = 1$) does not have this deficiency
• 61. Selecting Sampler Hyperparameters
Setup [Welling and Teh, 2011]
  Consider the posterior distribution P induced by L datapoints $y_l$ drawn i.i.d. from the Gaussian mixture likelihood $Y_l \mid X \overset{iid}{\sim} \frac{1}{2} N(X_1, 2) + \frac{1}{2} N(X_1 + X_2, 2)$ under Gaussian priors on the parameters $X \in \mathbb{R}^2$: $X_1 \sim N(0, 10)$ independent of $X_2 \sim N(0, 1)$
  Draw L = 100 datapoints $y_l$ with parameters $(x_1, x_2) = (0, 1)$; this induces a posterior with a second mode at $(x_1, x_2) = (1, -1)$
For a range of tolerance parameters ε, run approximate slice sampling for 148000 datapoint likelihood evaluations and store the resulting posterior sample $Q_n$
Use the minimum IMQ KSD ($\beta = -\frac{1}{2}$, $c = 1$) to select an appropriate ε
Compare with a standard MCMC parameter selection criterion, effective sample size (ESS), a measure of Markov chain autocorrelation
Compute the median of each diagnostic over 50 random sequences
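Evaluating the KSD for this experiment requires only the posterior score ∇_x log p(x | y). A minimal sketch (my addition) is below, with the gradient written out from the priors and mixture likelihood stated above and with simulated data standing in for the experiment's draws.

```python
import numpy as np

def posterior_score(X, y):
    """grad_x log p(x | y) for the Welling-Teh mixture model: rows of X are (x1, x2),
    y holds the observed datapoints; priors X1 ~ N(0, 10), X2 ~ N(0, 1),
    likelihood Y | x ~ 0.5 N(x1, 2) + 0.5 N(x1 + x2, 2)."""
    x1, x2 = X[:, 0:1], X[:, 1:2]                    # shape (n, 1) each
    r1 = y[None, :] - x1                             # residuals under component 1, (n, L)
    r2 = y[None, :] - x1 - x2                        # residuals under component 2, (n, L)
    phi1 = np.exp(-r1 ** 2 / 4)                      # unnormalized N(., 2) densities
    phi2 = np.exp(-r2 ** 2 / 4)
    denom = phi1 + phi2
    d_x1 = -x1 / 10 + np.sum((phi1 * r1 / 2 + phi2 * r2 / 2) / denom, axis=1, keepdims=True)
    d_x2 = -x2 + np.sum((phi2 * r2 / 2) / denom, axis=1, keepdims=True)
    return np.hstack([d_x1, d_x2])

# Example: score evaluated at two parameter values for simulated data
rng = np.random.default_rng(0)
y = np.where(rng.random(100) < 0.5,
             rng.normal(0, np.sqrt(2), 100), rng.normal(1, np.sqrt(2), 100))
print(posterior_score(np.array([[0.0, 1.0], [1.0, -1.0]]), y))
```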
• 62. Selecting Samplers
Setup
  MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]: 10000 images, 51 features, binary label indicating whether an image shows a 7 or a 9
  Bayesian logistic regression posterior P: L independent observations $(y_l, v_l) \in \{1, -1\} \times \mathbb{R}^d$ with $P(Y_l = 1 \mid v_l, X) = 1/(1 + \exp(-\langle v_l, X \rangle))$ and a flat improper prior on the parameters $X \in \mathbb{R}^d$
Use the IMQ KSD ($\beta = -\frac{1}{2}$, $c = 1$) to compare SGFS-f to SGFS-d, drawing $10^5$ sample points and discarding the first half as burn-in
For external support, compare bivariate marginal means and 95% confidence ellipses with a surrogate ground truth Hamiltonian Monte Carlo chain with $10^5$ sample points [Ahn, Korattikara, and Welling, 2012]
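Here, too, the score needed by the KSD has a simple closed form. The sketch below (my addition) computes ∇_x log p(x | data) for the flat-prior Bayesian logistic regression stated above, with synthetic features standing in for the MNIST data.

```python
import numpy as np

def logistic_score(X, V, y):
    """grad_x log p(x | data) for Bayesian logistic regression with a flat prior:
    rows of X are parameter vectors, V is the (L, d) feature matrix, y in {+1, -1}^L."""
    margins = y[None, :] * (X @ V.T)                 # (n, L) values y_l <v_l, x>
    weights = 1.0 / (1.0 + np.exp(margins))          # sigma(-y_l <v_l, x>)
    return (weights * y[None, :]) @ V                # sum_l y_l sigma(-y_l <v_l, x>) v_l

# Example on synthetic data standing in for the MNIST features
rng = np.random.default_rng(0)
L, d = 200, 5
V = rng.normal(size=(L, d))
truth = rng.normal(size=d)
y = np.where(rng.random(L) < 1 / (1 + np.exp(-V @ truth)), 1.0, -1.0)
print(logistic_score(np.zeros((1, d)), V, y))        # score at the origin
```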
• 63. Detecting Non-convergence
Goal: Show $S(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ only if $(Q_n)_{n \ge 1}$ converges to P
Let $\mathcal{P}$ be the set of targets P with Lipschitz $\nabla \log p$ and distant strong log concavity ($\frac{\langle \nabla \log(p(x)/p(y)),\, y - x \rangle}{\|x - y\|_2^2} \ge \kappa$ for $\|x - y\|_2 \ge r$)
  Includes Gaussian mixtures with common covariance, Bayesian logistic and Student’s t regression with Gaussian priors, ...
For a different Stein set $\mathcal{G}$, Gorham, Duncan, Vollmer, and Mackey [2016] showed that $(Q_n)_{n \ge 1}$ converges to P if $P \in \mathcal{P}$ and $S(Q_n, \mathcal{T}_P, \mathcal{G}) \to 0$
New contribution [Gorham and Mackey, 2017]
Theorem (Univariate KSD detects non-convergence): Suppose $P \in \mathcal{P}$ and $k(x, y) = \Phi(x - y)$ for $\Phi \in C^2$ with a non-vanishing generalized Fourier transform. If d = 1, then $S(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ only if $(Q_n)_{n \ge 1}$ converges weakly to P.
Justifies the use of the KSD with Gaussian, Matérn, or inverse multiquadric kernels k in the univariate case
• 64. The Importance of Tightness
Goal: Show $S(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ only if $Q_n$ converges to P
A sequence $(Q_n)_{n \ge 1}$ is uniformly tight if for every $\epsilon > 0$ there is a finite number $R(\epsilon)$ such that $\sup_n Q_n(\|X\|_2 > R(\epsilon)) \le \epsilon$
  Intuitively, no mass in the sequence escapes to infinity
Theorem (KSD detects tight non-convergence [Gorham and Mackey, 2017]): Suppose that $P \in \mathcal{P}$ and $k(x, y) = \Phi(x - y)$ for $\Phi \in C^2$ with a non-vanishing generalized Fourier transform. If $(Q_n)_{n \ge 1}$ is uniformly tight and $S(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$, then $(Q_n)_{n \ge 1}$ converges weakly to P.
Good news, but, ideally, the KSD would detect non-tight sequences automatically...