Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
Dahua Lin
1 Justification of Basic Sampling Methods
Proposition 1. Let F be the cdf of a real-valued random variable with distribution D, and let
U ∼ Uniform([0, 1]). Then F^{-1}(U) ∼ D.
Proof. Let X = F^{-1}(U), where F^{-1}(u) := inf{t ∈ R : F(t) ≥ u} is the generalized inverse,
which coincides with the ordinary inverse when F is strictly increasing. It suffices to show that
the cdf of X is F. Since F is non-decreasing and right-continuous, F^{-1}(u) ≤ t if and only if
u ≤ F(t). Hence, for any t ∈ R,
P(X ≤ t) = P(F^{-1}(U) ≤ t) = P(U ≤ F(t)) = F(t). (1)
This completes the proof.
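As an illustration of Proposition 1 (our sketch, not part of the notes): for the exponential
distribution with rate λ, F(t) = 1 − e^{−λt} gives F^{-1}(u) = −ln(1 − u)/λ, so applying the
inverse cdf to uniform draws yields exponential samples. The function name and parameters below
are ours.

```python
import math
import random

def exponential_via_inverse_cdf(lam, n, seed=0):
    """Draw n samples from Exp(lam) by applying the inverse cdf
    F^{-1}(u) = -ln(1 - u) / lam to Uniform(0, 1) variates."""
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

samples = exponential_via_inverse_cdf(lam=2.0, n=100_000)
print(round(sum(samples) / len(samples), 2))  # close to 1 / lam = 0.5
```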
Proposition 2. Samples produced using rejection sampling have the desired distribution.
Proof. Each iteration actually generates two random variables: x and u, where u ∈ {0, 1} is the
indicator of acceptance. The joint distribution of x and u is given by
p̃(dx, u = 1) = a(u = 1|x) q(dx) = [p(x) / (M q(x))] q(x) µ(dx) = [p(x) / M] µ(dx). (2)
Here, a(u|x) is the conditional distribution of u given x, and µ is the base measure. On the other
hand, we have
Pr(u = 1) = ∫_Ω p̃(dx, u = 1) = ∫_Ω [p(x) / M] µ(dx) = 1 / M. (3)
Thus, the resultant distribution is
p̃(dx | u = 1) = p̃(dx, u = 1) / Pr(u = 1) = p(x) µ(dx). (4)
This completes the proof.
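As a concrete sketch of rejection sampling (our illustration; the target, proposal, and bound M
are chosen for the example): with target density p(x) = 6x(1 − x) (a Beta(2, 2)) and a uniform
proposal on [0, 1], M = 1.5 bounds p/q, since p attains its maximum 1.5 at x = 0.5.

```python
import random

def rejection_sample(p, q_sample, q_pdf, M, n, seed=0):
    """Rejection sampling: propose x ~ q, accept with probability p(x) / (M q(x))."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        x = q_sample(rng)
        if rng.random() <= p(x) / (M * q_pdf(x)):
            out.append(x)
    return out

# Target: Beta(2, 2) density p(x) = 6 x (1 - x); proposal: Uniform(0, 1).
p = lambda x: 6.0 * x * (1.0 - x)
samples = rejection_sample(p, lambda r: r.random(), lambda x: 1.0, M=1.5, n=50_000)
print(round(sum(samples) / len(samples), 2))  # mean of Beta(2, 2) is 0.5
```

By equation (3), the acceptance probability is 1/M = 2/3, so a smaller M means fewer wasted
proposals.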
2 Markov Chain Theory
Proposition 3. When the state space Ω is countable, we have
‖µ − ν‖_TV = (1/2) Σ_{x∈Ω} |µ(x) − ν(x)|. (5)
Proof. Let A = {x ∈ Ω : µ(x) ≥ ν(x)}. By definition, we have
‖µ − ν‖_TV ≥ |µ(A) − ν(A)| = µ(A) − ν(A), (6)
‖µ − ν‖_TV ≥ |µ(A^c) − ν(A^c)| = ν(A^c) − µ(A^c). (7)
We also have
µ(A) − ν(A) = Σ_{x∈A} (µ(x) − ν(x)) = Σ_{x∈A} |µ(x) − ν(x)|, (8)
ν(A^c) − µ(A^c) = Σ_{x∈A^c} (ν(x) − µ(x)) = Σ_{x∈A^c} |µ(x) − ν(x)|. (9)
Combining the equations above results in
‖µ − ν‖_TV ≥ (1/2) (µ(A) − ν(A) + ν(A^c) − µ(A^c))
= (1/2) (Σ_{x∈A} |µ(x) − ν(x)| + Σ_{x∈A^c} |µ(x) − ν(x)|)
= (1/2) Σ_{x∈Ω} |µ(x) − ν(x)|. (10)
Next we show the inequality in the other direction. For any A ⊂ Ω, we have
|µ(A^c) − ν(A^c)| = |(µ(Ω) − µ(A)) − (ν(Ω) − ν(A))| = |µ(A) − ν(A)|. (11)
Hence,
|µ(A) − ν(A)| = (1/2) (|µ(A) − ν(A)| + |µ(A^c) − ν(A^c)|)
≤ (1/2) (Σ_{x∈A} |µ(x) − ν(x)| + Σ_{x∈A^c} |µ(x) − ν(x)|)
= (1/2) Σ_{x∈Ω} |µ(x) − ν(x)|. (12)
As A is arbitrary, we can conclude that
‖µ − ν‖_TV = sup_A |µ(A) − ν(A)| ≤ (1/2) Σ_{x∈Ω} |µ(x) − ν(x)|. (13)
This completes the proof.
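Proposition 3 is easy to check by brute force on a small finite Ω (our illustrative sketch; the
two measures are arbitrary): the supremum over all 2^|Ω| subsets matches half the L1 distance.

```python
from itertools import chain, combinations

def tv_via_sup(mu, nu):
    """sup_A |mu(A) - nu(A)| by enumerating every subset A of Omega."""
    omega = list(mu)
    subsets = chain.from_iterable(combinations(omega, k) for k in range(len(omega) + 1))
    return max(abs(sum(mu[x] for x in A) - sum(nu[x] for x in A)) for A in subsets)

def tv_via_l1(mu, nu):
    """(1/2) * sum_x |mu(x) - nu(x)| as in Proposition 3."""
    return 0.5 * sum(abs(mu[x] - nu[x]) for x in mu)

mu = {"a": 0.5, "b": 0.3, "c": 0.2}
nu = {"a": 0.2, "b": 0.4, "c": 0.4}
print(round(tv_via_sup(mu, nu), 10), round(tv_via_l1(mu, nu), 10))  # 0.3 0.3
```

The maximizing subset is exactly A = {x : µ(x) ≥ ν(x)} from the proof, here {"a"}.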
Proposition 4. The total variation distance (µ, ν) → ‖µ − ν‖_TV is a metric.
Proof. To show that it is a metric, we verify the four properties that a metric needs to satisfy
one by one.
1. ‖µ − ν‖_TV is non-negative, as |µ(A) − ν(A)| is always non-negative.
2. When µ = ν, |µ(A) − ν(A)| is always zero, and hence ‖µ − ν‖_TV = 0. On the other
hand, when µ ≠ ν, there exists A ∈ S such that |µ(A) − ν(A)| > 0, and therefore
‖µ − ν‖_TV ≥ |µ(A) − ν(A)| > 0. Together we can conclude that ‖µ − ν‖_TV = 0 iff µ = ν.
3. ‖µ − ν‖_TV = ‖ν − µ‖_TV, as |µ(A) − ν(A)| = |ν(A) − µ(A)| holds for any measurable subset
A.
4. Next, we show that the total variation distance satisfies the triangle inequality.
Let µ, ν, η be three probability measures over Ω:
‖µ − ν‖_TV = sup_{A∈S} |µ(A) − ν(A)|
= sup_{A∈S} |µ(A) − η(A) + η(A) − ν(A)|
≤ sup_{A∈S} (|µ(A) − η(A)| + |η(A) − ν(A)|)
≤ sup_{A∈S} |µ(A) − η(A)| + sup_{A∈S} |η(A) − ν(A)|
= ‖µ − η‖_TV + ‖η − ν‖_TV. (14)
This completes the proof.
Proposition 5. Consider a Markov chain over a countable space Ω with transition probability
matrix P. Let π be a probability measure over Ω that is in detailed balance with P,
i.e. π(x) P(x, y) = π(y) P(y, x), ∀x, y ∈ Ω. Then π is invariant to P, i.e. π = πP.
Proof. With the assumption of detailed balance, we have
(πP)(y) = Σ_{x∈Ω} π(x) P(x, y) = Σ_{x∈Ω} π(y) P(y, x) = π(y) Σ_{x∈Ω} P(y, x) = π(y). (15)
Hence, π = πP, or in other words, π is invariant to P.
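Proposition 5 can be checked numerically on a small chain (our illustration; the specific matrix
is a hand-built toy example): detailed balance says the "flux" matrix π(x) P(x, y) is symmetric,
and invariance then follows.

```python
import numpy as np

# A 3-state birth-death chain in detailed balance with pi (hand-built toy example).
pi = np.array([0.2, 0.3, 0.5])
P = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.3, 0.7]])

# Detailed balance: pi(x) P(x, y) == pi(y) P(y, x), i.e. the flux matrix is symmetric.
flux = pi[:, None] * P
assert np.allclose(flux, flux.T)

# Proposition 5 then gives invariance: pi P = pi.
assert np.allclose(pi @ P, pi)
print("ok")
```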
Proposition 6. Let (X_t) be an ergodic Markov chain Markov(π, P) where π is in detailed
balance with P. Then, given an arbitrary sequence x_0, . . . , x_n ∈ Ω, we have
Pr(X_0 = x_0, . . . , X_n = x_n) = Pr(X_0 = x_n, . . . , X_n = x_0). (16)
Proof. First, we have
Pr(X_0 = x_0, . . . , X_n = x_n) = π(x_0) P(x_0, x_1) · · · P(x_{n−1}, x_n). (17)
On the other hand, by detailed balance, we have P(x, y) = π(y) P(y, x) / π(x), and thus
Pr(X_0 = x_n, . . . , X_n = x_0) = π(x_n) P(x_n, x_{n−1}) · · · P(x_1, x_0)
= π(x_n) · [π(x_{n−1}) P(x_{n−1}, x_n) / π(x_n)] · · · [π(x_0) P(x_0, x_1) / π(x_1)]
= π(x_0) P(x_0, x_1) · · · P(x_{n−1}, x_n). (18)
Comparing Eq.(17) and Eq.(18) results in the equality that we intend to prove.
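Proposition 6 can be checked directly on a small reversible chain (our toy example; the chain
below is in detailed balance with π by construction): the probability of any path equals that of
its reversal.

```python
# A 3-state birth-death chain in detailed balance with pi (a toy example of ours):
# pi(0) P(0,1) = 0.1 * 0.4 = 0.4 * 0.1 = pi(1) P(1,0), and similarly for states 1, 2.
pi = [0.1, 0.4, 0.5]
P = [[0.6, 0.4, 0.0],
     [0.1, 0.4, 0.5],
     [0.0, 0.4, 0.6]]

def path_prob(path):
    """Pr(X_0 = path[0], ..., X_n = path[n]) for the chain started from pi."""
    prob = pi[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= P[a][b]
    return prob

path = [0, 1, 2, 2, 1]
# Equation (16): forward and reversed paths have the same probability.
assert abs(path_prob(path) - path_prob(path[::-1])) < 1e-12
print("ok")
```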
Proposition 7. Over a measurable space (Ω, S), if a stochastic kernel P is reversible w.r.t. π,
then π is invariant to P.
Proof. Let π′ = πP; it suffices to show that π′(A) = π(A) for every A ∈ S under the reversibility
assumption, i.e. ∫∫ f(x, y) π(dx) P(x, dy) = ∫∫ f(y, x) π(dx) P(x, dy) for every integrable f.
Given any A ∈ S, let f_A(x, y) := 1(y ∈ A); then we have
π′(A) = ∫ π(dx) P(x, A)
= ∫ π(dx) ∫ f_A(x, y) P(x, dy)
= ∫∫ f_A(x, y) π(dx) P(x, dy)
= ∫∫ f_A(y, x) π(dx) P(x, dy) ...[reversibility]
= ∫∫ 1(x ∈ A) π(dx) P(x, dy)
= ∫ 1(x ∈ A) π(dx) ∫ P(x, dy)
= ∫ 1(x ∈ A) π(dx) = π(A). (19)
This completes the proof.
Proposition 8. Given a stochastic kernel P and a probability measure π over (Ω, S), suppose
both P_x and π are absolutely continuous w.r.t. a base measure µ, that is, π(dx) = π(x) µ(dx)
and P(x, dy) = P_x(dy) = p_x(y) µ(dy). Then P is reversible w.r.t. π if and only if
π(x) p_x(y) = π(y) p_y(x), a.e. (20)
Proof. First, assuming detailed balance, i.e. π(x) p_x(y) = π(y) p_y(x) a.e., we show reversibility:
∫∫ f(x, y) π(dx) P(x, dy) = ∫∫ f(x, y) π(x) p_x(y) µ(dx) µ(dy)
= ∫∫ f(x, y) π(y) p_y(x) µ(dx) µ(dy) ...[detailed balance]
= ∫∫ f(y, x) π(x) p_x(y) µ(dx) µ(dy) ...[exchange variables]
= ∫∫ f(y, x) π(dx) P(x, dy). (21)
Next, we show the converse. The definition of reversibility implies that
∫∫ f(x, y) π(dx) P(x, dy) = ∫∫ f(x, y) π(dy) P(y, dx). (22)
Hence,
∫∫ f(x, y) π(x) p_x(y) µ(dx) µ(dy) = ∫∫ f(x, y) π(y) p_y(x) µ(dx) µ(dy). (23)
As this holds for an arbitrary integrable function f, it implies that π(x) p_x(y) = π(y) p_y(x) a.e.
Proposition 9. Given a stochastic kernel P and a probability measure π over (Ω, S). If
P(x, dy) = m(x) I_x(dy) + p_x(y) µ(dy), where I_x is the point mass at x, and
π(x) p_x(y) = π(y) p_y(x) a.e., then P is reversible w.r.t. π.
Proof. Under the given conditions, we have
∫∫ f(x, y) π(dx) P(x, dy) = ∫∫ f(x, y) π(dx) (m(x) I_x(dy) + p_x(y) µ(dy))
= ∫∫ f(x, y) m(x) π(dx) I_x(dy) + ∫∫ f(x, y) p_x(y) π(dx) µ(dy)
= ∫ f(x, x) m(x) π(dx) + ∫∫ f(x, y) p_x(y) π(x) µ(dx) µ(dy). (24)
For the right hand side, we have
∫∫ f(y, x) π(dx) P(x, dy) = ∫∫ f(y, x) π(dx) (m(x) I_x(dy) + p_x(y) µ(dy))
= ∫∫ f(y, x) m(x) π(dx) I_x(dy) + ∫∫ f(y, x) p_x(y) π(dx) µ(dy)
= ∫ f(x, x) m(x) π(dx) + ∫∫ f(y, x) p_x(y) π(x) µ(dx) µ(dy)
= ∫ f(x, x) m(x) π(dx) + ∫∫ f(x, y) p_y(x) π(y) µ(dx) µ(dy). (25)
With π(x) p_x(y) = π(y) p_y(x), we can see that the left and right hand sides are equal. This
completes the proof.
3 Justification of MCMC Methods
Proposition 10. Samples produced using the Metropolis-Hastings algorithm have the desired
distribution, and the resultant chain is reversible.
Proof. It suffices to show that the M-H update is reversible w.r.t. π, which implies that π is
invariant. The stochastic kernel of the M-H update is given by
P(x, dy) = m(x) I(x, dy) + q(x, dy) a(x, y) = m(x) I(x, dy) + q_x(y) a(x, y) µ(dy). (26)
Here, µ is the base measure, I(x, dy) is the identity measure given by I(x, A) = 1(x ∈ A),
and m(x) is the probability that the proposal is rejected, which is given by
m(x) = 1 − ∫_Ω q(x, dy) a(x, y). (27)
Let g(x, y) = h(x) q_x(y) a(x, y), where h is the unnormalized target density, i.e. π(dx) ∝
h(x) µ(dx). With Proposition 9, it suffices to show that g(x, y) = g(y, x). Here, a(x, y) =
min{r(x, y), 1}. Also, from the definition r(x, y) = h(y) q_y(x) / (h(x) q_x(y)), it is easy to see
that r(x, y) = 1 / r(y, x). We first consider the case where r(x, y) ≤ 1 (thus r(y, x) ≥ 1); then
g(x, y) = h(x) q_x(y) a(x, y) = h(x) q_x(y) · [h(y) q_y(x) / (h(x) q_x(y))] = h(y) q_y(x), (28)
and
g(y, x) = h(y) q_y(x) a(y, x) = h(y) q_y(x). (29)
Hence, g(x, y) = g(y, x) when r(x, y) ≤ 1. Similarly, we can show that the equality holds when
r(x, y) ≥ 1. This completes the proof.
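To make the construction concrete, here is a minimal Metropolis-Hastings sketch (our
illustration, not from the notes), targeting a standard normal through its unnormalized density
h(x) = exp(−x²/2). The proposal used here happens to be symmetric, so the q terms in the
Hastings ratio cancel, but they are kept to match the general formula r(x, y) = h(y)q_y(x) /
(h(x)q_x(y)).

```python
import math
import random

def metropolis_hastings(h, q_sample, q_pdf, x0, n, seed=0):
    """M-H targeting the unnormalized density h, with proposal sampler
    q_sample(rng, x) and (possibly unnormalized) proposal density q_pdf(x, y)."""
    rng = random.Random(seed)
    x, out = x0, []
    for _ in range(n):
        y = q_sample(rng, x)
        r = (h(y) * q_pdf(y, x)) / (h(x) * q_pdf(x, y))  # Hastings ratio
        if rng.random() < min(r, 1.0):
            x = y  # accept; otherwise stay put (the m(x) I(x, dy) part of the kernel)
        out.append(x)
    return out

# Target: standard normal, h(x) = exp(-x^2 / 2) (normalizer not needed).
h = lambda x: math.exp(-0.5 * x * x)
# Gaussian random-walk proposal with stddev 1 (symmetric).
q_sample = lambda rng, x: rng.gauss(x, 1.0)
q_pdf = lambda x, y: math.exp(-0.5 * (y - x) ** 2)

chain = metropolis_hastings(h, q_sample, q_pdf, x0=0.0, n=200_000)
m = sum(chain) / len(chain)
print(round(m, 1))  # sample mean, near 0
```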
Proposition 11. The Metropolis algorithm is a special case of the Metropolis-Hastings
algorithm.
Proof. It suffices to show that when q is symmetric, i.e. q_x(y) = q_y(x), the acceptance rate
reduces to the form given in the Metropolis algorithm. Particularly, when q_x(y) = q_y(x), the
acceptance rate of the M-H algorithm is
a(x, y) = min{r(x, y), 1} = min{h(y) q_y(x) / (h(x) q_x(y)), 1} = min{h(y) / h(x), 1}. (30)
This completes the proof.
Proposition 12. The Gibbs sampling update is a special case of the Metropolis-Hastings update.
Proof. Without loss of generality, we assume the sample is comprised of two components: x =
(x_1, x_2). Consider a proposal that redraws the first component from its full conditional while
keeping the second fixed: q_x(dy) = π(dy_1 | x_2) I_{x_2}(dx_2). In this case, we have
r((x_1, x_2), (y_1, x_2)) = [π(y_1, x_2) π(x_1 | x_2)] / [π(x_1, x_2) π(y_1 | x_2)]
= [π(y_1, x_2) π(x_1, x_2)] / [π(x_1, x_2) π(y_1, x_2)] = 1. (31)
This implies that the candidate is always accepted. Also, generating a sample from q_x is
equivalent to drawing one from the conditional distribution π(· | x_2). This completes the
argument.
Proposition 13. Let K_1, . . . , K_m be stochastic kernels with invariant measure π, and let
q ∈ R^m be a probability vector. Then K = Σ_{i=1}^m q_i K_i is also a stochastic kernel with
invariant measure π. Moreover, if K_1, . . . , K_m are all reversible, then K is reversible.
Proof. First, it is easy to see that convex combinations of probability measures remain
probability measures. As an immediate consequence, K_x, a convex combination of K_i(x, ·), is
also a probability measure. Given a measurable subset A, K_i(·, A) is measurable for each i, and
so is their convex combination. Hence, we can conclude that K remains a stochastic kernel. Next,
we show that π is invariant to K, as
πK = π Σ_{i=1}^m q_i K_i = Σ_{i=1}^m q_i (π K_i) = Σ_{i=1}^m q_i π = π. (32)
This proves the first statement. Next, assume that K_1, . . . , K_m are reversible. Then for K,
we have
∫∫ f(x, y) π(dx) K(x, dy) = Σ_{i=1}^m q_i ∫∫ f(x, y) π(dx) K_i(x, dy)
= Σ_{i=1}^m q_i ∫∫ f(y, x) π(dx) K_i(x, dy) ...[reversibility of each K_i]
= ∫∫ f(y, x) π(dx) K(x, dy). (33)
This implies that K is also reversible, thus completing the proof.
Proposition 14. Let K_1, . . . , K_m be stochastic kernels with invariant measure π. Then K =
K_m ∘ · · · ∘ K_1 is also a stochastic kernel with invariant measure π.
Proof. Consider K = K_2 ∘ K_1, i.e. apply K_1 first and then K_2. To show that K is a
stochastic kernel, we first show that K_x(dy) = K(x, dy) is a probability measure. Given an
arbitrary measurable subset A, we have
K(x, A) = ∫ K_1(x, dy) K_2(y, A). (34)
As this is a bounded non-negative integral and K_2(y, A) is measurable in y, it constitutes a
measure. Also,
K(x, Ω) = ∫ K_1(x, dy) K_2(y, Ω) = ∫ K_1(x, dy) = 1. (35)
Hence, K(x, ·) is a probability measure, and thus K is a stochastic kernel. Next, we show that
π is invariant to K:
πK = π(K_2 ∘ K_1) = (πK_1)K_2 = πK_2 = π. (36)
We have proved the statement for a composition of two kernels K_2 ∘ K_1. By induction, we can
further extend it to any finite composition, thus completing the proof.
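Propositions 13 and 14 can be checked numerically on a small state space (our illustration; the
two kernels below are hand-built so that each leaves π invariant).

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])
# Two transition matrices that each leave pi invariant.
K1 = np.array([[0.7, 0.3, 0.0],
               [0.2, 0.3, 0.5],
               [0.0, 0.3, 0.7]])   # in detailed balance with pi
K2 = np.tile(pi, (3, 1))           # i.i.d. resampling from pi
assert np.allclose(pi @ K1, pi) and np.allclose(pi @ K2, pi)

# Proposition 13: a convex combination of the kernels keeps pi invariant.
K_mix = 0.4 * K1 + 0.6 * K2
assert np.allclose(K_mix.sum(axis=1), 1.0) and np.allclose(pi @ K_mix, pi)

# Proposition 14: so does the composition K2 o K1 (apply K1 first, then K2),
# whose matrix is the product K1 @ K2 under the row-vector convention.
K_comp = K1 @ K2
assert np.allclose(K_comp.sum(axis=1), 1.0) and np.allclose(pi @ K_comp, pi)
print("ok")
```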