Murphy: Machine learning A probabilistic perspective: Ch.9

CH.9 GENERALIZED LINEAR MODELS AND THE EXPONENTIAL FAMILY
D. YONEOKA
9.1. Introduction.
Outline.
• Properties of the exponential family
• Derive theorems and algorithms for applications
• How to build classifiers (GLMs)
9.2. The exponential family.
Why is this important?
• Finite-size sufficient statistics (→ the data can be compressed without loss of information)
• Existence of conjugate priors (see Sec. 9.2.5)
• Makes the fewest assumptions subject to some chosen constraints (see Sec. 9.2.6)
• Core of GLMs (see Sec. 9.3)
• Core of variational inference (see Sec. 21.2)
9.2.1. Definition.
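A pdf or pmf p(x|θ), for x = (x_1, ..., x_m) ∈ χ^m and θ ∈ Θ ⊆ R^d, is said to be in the exponential family if it has the form
$$
p(x|\theta) = \frac{1}{Z(\theta)}\, h(x) \exp[\theta^T \phi(x)] = h(x) \exp[\theta^T \phi(x) - A(\theta)],
$$
or, more generally, with θ entering through a function η(θ),
$$
p(x|\theta) = h(x) \exp[\eta(\theta)^T \phi(x) - A(\eta(\theta))],
$$
where
$$
Z(\theta) = \int_{\chi^m} h(x) \exp[\theta^T \phi(x)]\, dx, \qquad A(\theta) = \log Z(\theta).
$$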
We call
• θ: natural parameters or canonical parameters
• ϕ(x): vector of sufficient statistics
• Z(θ): partition function
• A(θ): log partition function or cumulant function
• h(x): scaling constant, often = 1
• η(θ): function mapping θ to the canonical parameters
Note:
• If dim(θ) < dim(η(θ)), it is called a curved exponential family, which means we have more sufficient statistics than parameters.
• If η(θ) = θ, the model is said to be in canonical form.
The natural parameter space is H = {θ : A(θ) < ∞}. Written in its canonical form, a density in the exponential family has some convexity properties. These properties are useful when manipulating moments and other functionals of the sufficient statistics.
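Theorem: H is a convex set, and A is a convex function on H.
Proof.
Let 0 < α < 1 and take θ and θ_1 in H. Write
$$
\begin{aligned}
A(\alpha\theta + (1-\alpha)\theta_1)
&= \log \int_{\chi^m} \exp\{(\alpha\theta + (1-\alpha)\theta_1)^T \phi(x)\}\, h(x)\, dx \\
&= \log \int_{\chi^m} \exp(\alpha\theta^T \phi(x))\, \exp((1-\alpha)\theta_1^T \phi(x))\, h(x)\, dx \\
&\le \log \left\{ \left[ \int_{\chi^m} \left( \exp(\alpha\theta^T \phi(x)) \right)^{1/\alpha} h(x)\, dx \right]^{\alpha} \left[ \int_{\chi^m} \left( \exp((1-\alpha)\theta_1^T \phi(x)) \right)^{1/(1-\alpha)} h(x)\, dx \right]^{1-\alpha} \right\} \\
&= \alpha A(\theta) + (1-\alpha) A(\theta_1) < \infty,
\end{aligned}
$$
where the inequality is Hölder's inequality with exponents 1/α and 1/(1 − α), applied with respect to the measure h(x)dx.
Thus αθ + (1 − α)θ_1 ∈ H, so H is convex, and the same bound shows that A is convex on H.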

9.2.2. Example.
9.2.2.1. Bernoulli, i.e., x ∈ {0, 1}.
$$
\mathrm{Ber}(x|\mu) = \mu^{x} (1-\mu)^{1-x} = \exp[x \log\mu + (1-x)\log(1-\mu)] = \exp[\phi(x)^T \theta],
$$
where ϕ(x) = [I(x = 1), I(x = 0)] and θ = [log µ, log(1 − µ)].
→ This representation is over-complete because I(x = 0) + I(x = 1) = 1.
→ θ is therefore not uniquely identifiable.
→ To obtain identifiability, we require the representation to be minimal.
Then,
$$
\mathrm{Ber}(x|\mu) = (1-\mu) \exp\left[ x \log\frac{\mu}{1-\mu} \right].
$$
Note: We can recover the mean parameter µ from the canonical parameter θ = log(µ/(1 − µ)) using
$$
\mu = \mathrm{sigm}(\theta) = \frac{1}{1 + e^{-\theta}}.
$$
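The conversion between the mean parameter and the canonical parameter can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are made up for illustration, and the log partition function A(θ) = log(1 + e^θ) is an assumption derived from the factor (1 − µ) in the minimal form above.

```python
import math

def logit(mu):
    """Canonical parameter theta = log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def sigm(theta):
    """Mean parameter mu = 1 / (1 + exp(-theta))."""
    return 1.0 / (1.0 + math.exp(-theta))

def bernoulli_expfam(x, theta):
    """Ber(x | theta) = exp(theta * x - A(theta)), A(theta) = log(1 + e^theta)."""
    log_partition = math.log1p(math.exp(theta))
    return math.exp(theta * x - log_partition)

mu = 0.3                 # arbitrary example value
theta = logit(mu)
p_one = bernoulli_expfam(1, theta)   # should equal mu
p_zero = bernoulli_expfam(0, theta)  # should equal 1 - mu
```

Here `p_one` and `p_zero` reproduce µ and 1 − µ, which confirms that the minimal form is normalized.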
9.2.2.2. Multinoulli, i.e., xk = I(x = k).
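$$
\mathrm{Cat}(x|\mu) = \prod_{k=1}^{K} \mu_k^{x_k} = \exp\left[ \sum_{k=1}^{K} x_k \log\mu_k \right]
\iff \mathrm{Cat}(x|\theta) = \exp[\theta^T \phi(x) - A(\theta)],
$$
where
$$
\theta = \left[ \log\frac{\mu_1}{\mu_K}, \ldots, \log\frac{\mu_{K-1}}{\mu_K} \right], \qquad
\phi(x) = [I(x = 1), \ldots, I(x = K-1)].
$$
As in the Bernoulli case, we can recover the mean parameters from the canonical parameters:
$$
\mu_k = \frac{e^{\theta_k}}{1 + \sum_{j=1}^{K-1} e^{\theta_j}}, \qquad
\mu_K = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\theta_j}}, \qquad
A(\theta) = \log\left( 1 + \sum_{j=1}^{K-1} e^{\theta_j} \right).
$$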
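The multinoulli mappings can be sketched in the same way. This is an illustrative standard-library example, not code from the notes; the helper names and the probability vector are hypothetical.

```python
import math

def mean_to_natural(mu):
    """theta_k = log(mu_k / mu_K) for k = 1, ..., K-1."""
    return [math.log(m / mu[-1]) for m in mu[:-1]]

def natural_to_mean(theta):
    """Invert the mapping; the last entry is mu_K = 1 / (1 + sum_j e^theta_j)."""
    denom = 1.0 + sum(math.exp(t) for t in theta)
    return [math.exp(t) / denom for t in theta] + [1.0 / denom]

def log_partition(theta):
    """A(theta) = log(1 + sum_j e^theta_j)."""
    return math.log(1.0 + sum(math.exp(t) for t in theta))

mu = [0.2, 0.5, 0.3]              # made-up class probabilities, K = 3
theta = mean_to_natural(mu)       # K - 1 = 2 canonical parameters
mu_back = natural_to_mean(theta)  # round trip back to mu
```

Note that A(θ) equals −log µ_K, so `log_partition(theta)` and `-math.log(mu[-1])` agree.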


9.2.2.3. Univariate Gaussian.
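$$
\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[ -\frac{1}{2\sigma^2}(x-\mu)^2 \right]
= \frac{1}{Z(\theta)} \exp(\theta^T \phi(x)),
$$
where
$$
\theta = \begin{pmatrix} \mu/\sigma^2 \\ -\frac{1}{2\sigma^2} \end{pmatrix}, \qquad
\phi(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix},
$$
$$
Z(\mu, \sigma^2) = \sqrt{2\pi}\,\sigma \exp\left[ \frac{\mu^2}{2\sigma^2} \right],
$$
$$
A(\theta) = \log Z = -\frac{\theta_1^2}{4\theta_2} - \frac{1}{2}\log(-2\theta_2) + \frac{1}{2}\log(2\pi).
$$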
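As a sanity check on the Gaussian parameterization, the sketch below evaluates the density both directly and through exp(θ^T ϕ(x) − A(θ)). It assumes A(θ) = log Z(θ), which fixes the sign of the (1/2) log(2π) term as positive; the function names and test values are illustrative.

```python
import math

def natural_params(mu, sigma2):
    """theta = (mu / sigma^2, -1 / (2 sigma^2))."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def log_partition(theta1, theta2):
    """A(theta) = -theta1^2 / (4 theta2) - 0.5 log(-2 theta2) + 0.5 log(2 pi)."""
    return (-theta1 ** 2 / (4.0 * theta2)
            - 0.5 * math.log(-2.0 * theta2)
            + 0.5 * math.log(2.0 * math.pi))

def pdf_expfam(x, mu, sigma2):
    """Density via exp(theta^T phi(x) - A(theta)) with phi(x) = (x, x^2)."""
    t1, t2 = natural_params(mu, sigma2)
    return math.exp(t1 * x + t2 * x * x - log_partition(t1, t2))

def pdf_direct(x, mu, sigma2):
    """Textbook Gaussian density for comparison."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

value_expfam = pdf_expfam(0.7, mu=1.0, sigma2=2.0)
value_direct = pdf_direct(0.7, mu=1.0, sigma2=2.0)
```

The two values agree to floating-point precision, which they would not if the last term of A(θ) had the opposite sign.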
9.2.2.4. Non-example.
For example, the uniform distribution and the Student t-distribution are not in the exponential family because they cannot be expressed in the required form.
9.2.3. Log partition function.
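A(θ) is called the cumulant function because its derivatives generate the cumulants of the sufficient statistics. For a scalar parameter, the first derivative is
$$
\begin{aligned}
\frac{dA}{d\theta}
&= \frac{d}{d\theta} \left( \log \int \exp(\theta\phi(x))\, h(x)\, dx \right)
= \frac{\int \phi(x) \exp(\theta\phi(x))\, h(x)\, dx}{\exp(A(\theta))} \\
&= \int \phi(x) \exp(\theta\phi(x) - A(\theta))\, h(x)\, dx
= \int \phi(x)\, p(x)\, dx = E[\phi(x)],
\end{aligned}
$$
i.e., the expectation of the sufficient statistics. The second derivative is
$$
\begin{aligned}
\frac{d^2A}{d\theta^2}
&= \int \phi(x) \exp(\theta\phi(x) - A(\theta))\, h(x)\, (\phi(x) - A'(\theta))\, dx
= \int \phi(x)\, p(x)\, (\phi(x) - A'(\theta))\, dx \\
&= \int \phi^2(x)\, p(x)\, dx - A'(\theta) \int \phi(x)\, p(x)\, dx
= E[\phi^2(x)] - E[\phi(x)]^2 \qquad (\because\ A'(\theta) = \tfrac{dA}{d\theta} = E[\phi(x)]) \\
&= \mathrm{Var}[\phi(x)],
\end{aligned}
$$
i.e., the variance of the sufficient statistics.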
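These two identities are easy to verify numerically. The sketch below uses the Bernoulli case, assuming A(θ) = log(1 + e^θ) and ϕ(x) = x, and compares finite-difference derivatives of A with the known mean µ and variance µ(1 − µ). The step size and the value of θ are arbitrary choices.

```python
import math

def A(theta):
    """Bernoulli log partition function, A(theta) = log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def first_derivative(f, t, h=1e-4):
    """Central finite-difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

def second_derivative(f, t, h=1e-4):
    """Central finite-difference approximation of f''(t)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)

theta = 0.8
mu = 1.0 / (1.0 + math.exp(-theta))   # E[phi(x)] for a Bernoulli
mean_fd = first_derivative(A, theta)  # approximates A'(theta)
var_fd = second_derivative(A, theta)  # approximates A''(theta)
```

Up to discretization error, `mean_fd` matches µ and `var_fd` matches µ(1 − µ).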
9.2.4. MLE for the exponential family.
The likelihood of an exponential family model has the general form
$$
p(D|\theta) = \left[ \prod_{i=1}^{N} h(x_i) \right] g(\theta)^N \exp\left( \eta(\theta)^T \left[ \sum_{i=1}^{N} \phi(x_i) \right] \right),
$$
where g(θ) = 1/Z(θ). The sufficient statistics are therefore N and
$$
\phi(D) = \left[ \sum_{i=1}^{N} \phi_1(x_i), \ldots, \sum_{i=1}^{N} \phi_K(x_i) \right].
$$
Pitman-Koopman-Darmois theorem
Under certain regularity conditions, the exponential family is the only family of distributions with finite sufficient statistics.
Note: One of the conditions required by this theorem is
• The support of the distribution does not depend on the distribution parameters.
– e.g., the uniform distribution p(x|θ) = (1/θ) I(0 ≤ x ≤ θ) does not satisfy this condition.
– Its sufficient statistics are N and max_i x_i.
– It has finite sufficient statistics, but its support depends on the parameter.
– → It is not in the exponential family.

Derivation of MLE.
• Consider only a canonical exponential family, η(θ) = θ. The log likelihood is
$$
\log p(D|\theta) = \theta^T \phi(D) - N A(\theta) + \text{const}.
$$
• Concavity of the log likelihood
– −A(θ) is concave in θ (its second derivative is non-positive)
– θ^T ϕ(D) is linear in θ
– → The log likelihood is concave
– → It has a unique global maximum
Setting the gradient of the log likelihood to zero, the MLE must satisfy
$$
\nabla_\theta \log p(D|\theta) = \phi(D) - N E[\phi(X)] = 0
\iff E[\phi(X)] = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i),
$$
which is called moment matching.
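Moment matching can be illustrated with the Bernoulli model, where ϕ(x) = x and E[ϕ(X)] = µ. The sketch below uses made-up coin flips and assumes A(θ) = log(1 + e^θ); it computes the MLE by matching the empirical mean and then checks that the gradient ϕ(D) − N A′(θ) vanishes there.

```python
import math

data = [1, 0, 1, 1, 0, 1, 0, 1]   # made-up coin flips
N = len(data)
phi_D = sum(data)                 # sufficient statistic phi(D) = sum_i x_i

# Moment matching: E[phi(X)] = mu must equal the empirical mean.
mu_hat = phi_D / N
theta_hat = math.log(mu_hat / (1.0 - mu_hat))  # canonical parameter at the MLE

def log_lik(theta):
    """theta * phi(D) - N * A(theta), with A(theta) = log(1 + e^theta)."""
    return theta * phi_D - N * math.log1p(math.exp(theta))

# Gradient phi(D) - N * A'(theta), where A'(theta) = sigm(theta).
grad_at_mle = phi_D - N / (1.0 + math.exp(-theta_hat))
```

Because the log likelihood is concave, a zero gradient at `theta_hat` means it is the global maximizer.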
9.2.5. Bayes for the exponential family.
Existence of conjugate priors
Informally, a prior p(θ|τ) is conjugate if it has the same form as the likelihood p(D|θ).
For this to make sense, the likelihood must have finite sufficient statistics, so that we can write p(D|θ) = p(s(D)|θ).
→ This suggests that conjugate priors exist only for the exponential family.

9.2.5.1. Likelihood.
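The likelihood of an exponential family model is given by
$$
p(D|\theta) \propto g(\theta)^N \exp(\eta(\theta)^T s_N),
$$
where $s_N = \sum_{i=1}^{N} s(x_i)$. In terms of the canonical parameters this becomes
$$
p(D|\eta) \propto \exp(N \eta^T \bar{s} - N A(\eta)),
$$
where $\bar{s} = \frac{1}{N} s_N$.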
9.2.5.2. Prior.
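The natural conjugate prior has the form
$$
p(\theta|\nu_0, \tau_0) \propto g(\theta)^{\nu_0} \exp(\eta(\theta)^T \tau_0),
$$
or, setting $\tau_0 = \nu_0 \bar{\tau}_0$ and using the canonical form,
$$
p(\eta|\nu_0, \bar{\tau}_0) \propto \exp(\nu_0 \eta^T \bar{\tau}_0 - \nu_0 A(\eta)).
$$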
9.2.5.3. Posterior.
The posterior is given by
$$
p(\theta|D) = p(\theta|\nu_N, \tau_N) = p(\theta|\nu_0 + N, \tau_0 + s_N).
$$
In canonical form, this becomes
$$
p(\eta|D) \propto \exp\left( \eta^T (\nu_0 \bar{\tau}_0 + N \bar{s}) - (\nu_0 + N) A(\eta) \right)
= p\left( \eta \,\Big|\, \nu_0 + N, \frac{\nu_0 \bar{\tau}_0 + N \bar{s}}{\nu_0 + N} \right).
$$
Note: the posterior hyper-parameters are a convex combination of
• the prior mean hyper-parameters
• the average of the sufficient statistics
9.2.5.4. Posterior predictive density.
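Define
• the future data as D′ = (x̃_1, ..., x̃_{N′})
• the past data as D = (x_1, ..., x_N)
• τ̃_0 = (ν_0, τ_0)
• s̃(D) = (N, s(D))
• s̃(D′) = (N′, s(D′))

So the prior becomes
$$
p(\theta|\tilde{\tau}_0) = \frac{1}{Z(\tilde{\tau}_0)}\, g(\theta)^{\nu_0} \exp(\eta(\theta)^T \tau_0).
$$
Hence
$$
\begin{aligned}
p(D'|D) &= \int p(D'|\theta)\, p(\theta|D)\, d\theta \\
&= \left[ \prod_{i=1}^{N'} h(\tilde{x}_i) \right] Z(\tilde{\tau}_0 + \tilde{s}(D))^{-1}
\int g(\theta)^{\nu_0 + N + N'} \exp\left( \sum_k \eta_k(\theta) \left( \tau_k + \sum_{i=1}^{N} s_k(x_i) + \sum_{i=1}^{N'} s_k(\tilde{x}_i) \right) \right) d\theta \\
&= \left[ \prod_{i=1}^{N'} h(\tilde{x}_i) \right] \frac{Z(\tilde{\tau}_0 + \tilde{s}(D) + \tilde{s}(D'))}{Z(\tilde{\tau}_0 + \tilde{s}(D))}.
\end{aligned}
$$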
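9.2.5.5. Example: Bernoulli distribution.
The likelihood is
$$
p(D|\theta) = (1-\theta)^N \exp\left( \log\left( \frac{\theta}{1-\theta} \right) \sum_i x_i \right).
$$
Hence the conjugate prior is
$$
p(\theta|\nu_0, \tau_0) \propto (1-\theta)^{\nu_0} \exp\left( \log\left( \frac{\theta}{1-\theta} \right) \tau_0 \right)
= \theta^{\tau_0} (1-\theta)^{\nu_0 - \tau_0}.
$$
So the posterior is
$$
p(\theta|D) \propto \theta^{\tau_0 + s} (1-\theta)^{\nu_0 - \tau_0 + N - s} = \theta^{\tau_n} (1-\theta)^{\nu_n - \tau_n},
$$
where τ_n = τ_0 + s, ν_n = ν_0 + N, and s = Σ_i I(x_i = 1) is the sufficient statistic (for Bernoulli data, the number of heads).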
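The hyper-parameter update for this Bernoulli example amounts to adding counts. The following sketch uses hypothetical prior values and made-up data; it also checks the convex-combination view of the posterior, in which τ_n/ν_n is a weighted average of the prior value τ_0/ν_0 and the empirical mean.

```python
def posterior_hyperparams(nu0, tau0, data):
    """Return (nu_n, tau_n) = (nu0 + N, tau0 + s) for 0/1 data."""
    return nu0 + len(data), tau0 + sum(data)

nu0, tau0 = 4.0, 1.0               # hypothetical prior: 4 pseudo-observations, 1 head
data = [1, 1, 0, 1, 1, 0, 1, 1]    # made-up coin flips: N = 8, s = 6

nu_n, tau_n = posterior_hyperparams(nu0, tau0, data)

# Convex-combination view: weight the prior mean parameter by nu0 / (nu0 + N).
N = len(data)
s_bar = sum(data) / N
weight = nu0 / (nu0 + N)
combo = weight * (tau0 / nu0) + (1.0 - weight) * s_bar
```

With these numbers the posterior is proportional to θ^7 (1 − θ)^5, and `combo` equals `tau_n / nu_n`.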
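How to derive the posterior predictive distribution.
Assume p(θ) = Beta(θ|α, β), so that the posterior is p(θ|D) = Beta(θ|α_n, β_n) with α_n = α + s and β_n = β + N − s, where s = s(D). Let the future data be D′ = (x̃_1, ..., x̃_m) with s′ = Σ_{i=1}^{m} I(x̃_i = 1). Then
$$
\begin{aligned}
p(D'|D) &= \int_0^1 p(D'|\theta)\, \mathrm{Beta}(\theta|\alpha_n, \beta_n)\, d\theta \\
&= \frac{\Gamma(\alpha_n + \beta_n)}{\Gamma(\alpha_n)\Gamma(\beta_n)} \int_0^1 \theta^{\alpha_n + s' - 1} (1-\theta)^{\beta_n + m - s' - 1}\, d\theta \\
&= \frac{\Gamma(\alpha_n + \beta_n)}{\Gamma(\alpha_n)\Gamma(\beta_n)} \cdot \frac{\Gamma(\alpha_{n+m})\Gamma(\beta_{n+m})}{\Gamma(\alpha_{n+m} + \beta_{n+m})},
\end{aligned}
$$
where α_{n+m} = α_n + s′ and β_{n+m} = β_n + (m − s′).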
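The ratio of Gamma functions is best evaluated in log space. The sketch below is one way to do that with `math.lgamma`; the function names and the posterior values α_n = 3, β_n = 2 are hypothetical.

```python
import math

def log_beta(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def posterior_predictive(alpha_n, beta_n, future):
    """p(D'|D) for a specific 0/1 sequence `future` under a Beta(alpha_n, beta_n) posterior."""
    m = len(future)
    s_new = sum(future)
    return math.exp(log_beta(alpha_n + s_new, beta_n + m - s_new)
                    - log_beta(alpha_n, beta_n))

alpha_n, beta_n = 3.0, 2.0   # hypothetical posterior hyper-parameters
p_next_head = posterior_predictive(alpha_n, beta_n, [1])
p_two_heads = posterior_predictive(alpha_n, beta_n, [1, 1])
```

For a single future observation this reduces to the posterior mean α_n/(α_n + β_n), and the probabilities of all sequences of a fixed length sum to one.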
9.2.6. Maximum entropy derivation of the exponential family.
This section gives one justification for the use of the exponential family.
The principle of maximum entropy (maxent)
We should pick the distribution with maximum entropy, $-\sum_x p(x) \log p(x)$, subject to the constraints that the moments of the distribution match the empirical moments of the specified functions.

Suppose all we know is $\sum_x f_k(x)\, p(x) = F_k$.
Adding the constraints p(x) ≥ 0 and $\sum_x p(x) = 1$, the Lagrangian for maximizing the entropy is
$$
J(p, \lambda) = -\sum_x p(x) \log p(x) + \lambda_0 \left( 1 - \sum_x p(x) \right) + \sum_k \lambda_k \left( F_k - \sum_x p(x) f_k(x) \right).
$$
Setting
$$
\frac{\partial J}{\partial p(x)} = -1 - \log p(x) - \lambda_0 - \sum_k \lambda_k f_k(x) = 0
$$
yields
$$
p(x) = \frac{1}{\exp(1 + \lambda_0)} \exp\left( -\sum_k \lambda_k f_k(x) \right).
$$
Using $\sum_x p(x) = 1$, we have
$$
1 = \sum_x p(x) = \frac{1}{\exp(1 + \lambda_0)} \sum_x \exp\left( -\sum_k \lambda_k f_k(x) \right).
$$
Hence the normalization constant Z = exp(1 + λ_0) is given by
$$
Z = \sum_x \exp\left( -\sum_k \lambda_k f_k(x) \right),
$$
which means the maxent distribution p(x) has the form of the exponential family, also known as the Gibbs distribution.
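To make the maxent result concrete, consider a single constraint f(x) = x on the faces of a die with a prescribed mean. The sketch below finds the multiplier λ by bisection, relying on the fact that the mean of p(x) ∝ exp(−λx) decreases as λ increases. The target mean 4.5, the search interval, and the function names are illustrative assumptions.

```python
import math

def gibbs(lam, values):
    """Maxent distribution p(x) = exp(-lam * x) / Z over the given values."""
    weights = [math.exp(-lam * v) for v in values]
    Z = sum(weights)
    return [w / Z for w in weights]

def mean_under(lam, values):
    """Expected value of x under the Gibbs distribution with multiplier lam."""
    return sum(v * p for v, p in zip(values, gibbs(lam, values)))

def solve_lambda(target, values, lo=-5.0, hi=5.0, iters=100):
    """Bisection on lam; the mean is a decreasing function of lam."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_under(mid, values) > target:
            lo = mid   # mean too high: lam must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

faces = [1, 2, 3, 4, 5, 6]
lam = solve_lambda(4.5, faces)   # constraint: sum_x x p(x) = 4.5
p = gibbs(lam, faces)
```

Since the target exceeds the uniform mean 3.5, the solution has λ < 0 and places more mass on the larger faces.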
9.3. Generalized linear model (GLM).
• Linear and logistic regression are examples of GLMs (McCullagh and Nelder 1989)
• GLMs are models in which the output density is in the exponential family
9.3.1. Basics.
Unconditional distribution for a scalar response variable:
$$
p(y_i|\theta, \sigma^2) = \exp\left[ \frac{y_i \theta - A(\theta)}{\sigma^2} + c(y_i, \sigma^2) \right],
$$
where
• σ² is the dispersion parameter
• θ is the natural parameter
• A is the log partition function
• c is a normalization constant
To convert from the mean parameter to the natural parameter, we can use a function Ψ,
θ = Ψ(µ),
which is uniquely determined by the form of the exponential family distribution (its inverse is µ = Ψ⁻¹(θ) = A′(θ)).
In addition,
• Mean function: µ_i = g⁻¹(η_i), where η_i = w^T x_i
• Link function: g(µ_i) = w^T x_i
• When g = Ψ, g is called the canonical link function, and θ_i = η_i
9.3.2. ML and MAP estimation.
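The log likelihood has the following form:
$$
\ell(w) = \log p(D|w) = \frac{1}{\sigma^2} \sum_{i=1}^{N} \ell_i,
$$
where $\ell_i \triangleq \theta_i y_i - A(\theta_i)$. We can compute the gradient using the chain rule as follows:
$$
\begin{aligned}
\frac{d\ell_i}{dw_j}
&= \frac{d\ell_i}{d\theta_i} \frac{d\theta_i}{d\mu_i} \frac{d\mu_i}{d\eta_i} \frac{d\eta_i}{dw_j} \\
&= (y_i - A'(\theta_i)) \frac{d\theta_i}{d\mu_i} \frac{d\mu_i}{d\eta_i} x_{ij} \\
&= (y_i - \mu_i) \frac{d\theta_i}{d\mu_i} \frac{d\mu_i}{d\eta_i} x_{ij}.
\end{aligned}
$$
If we use a canonical link, θ_i = η_i, this simplifies to
$$
\nabla_w \ell(w) = \frac{1}{\sigma^2} \left[ \sum_{i=1}^{N} (y_i - \mu_i)\, x_i \right].
$$
For improved efficiency, we should use a second-order method. With a canonical link, the Hessian is given by
$$
H = -\frac{1}{\sigma^2} \sum_{i=1}^{N} \frac{d\mu_i}{d\theta_i}\, x_i x_i^T = -\frac{1}{\sigma^2} X^T S X,
$$
where $S = \mathrm{diag}\left( \frac{d\mu_1}{d\theta_1}, \ldots, \frac{d\mu_N}{d\theta_N} \right)$ is a diagonal weighting matrix. Specifically, we have the following Newton update (iteratively reweighted least squares):
$$
w_{t+1} = (X^T S_t X)^{-1} X^T S_t z_t, \qquad
z_t = \theta_t + S_t^{-1} (y - \mu_t),
$$
where θ_t = X w_t and µ_t = g⁻¹(η_t).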
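The Newton update can be sketched for logistic regression, which is a Bernoulli GLM with canonical link, σ² = 1, µ_i = sigm(θ_i) and dµ_i/dθ_i = µ_i(1 − µ_i). To stay within the standard library, the example below uses one feature plus an intercept and solves the 2×2 weighted least squares system by hand. The data are made up and deliberately not linearly separable, so that the MLE is finite.

```python
import math

def sigm(t):
    return 1.0 / (1.0 + math.exp(-t))

def irls_logistic(xs, ys, iters=25):
    """IRLS for the model theta_i = w0 + w1 * x_i with a logistic mean function."""
    w0, w1 = 0.0, 0.0
    for _ in range(iters):
        # Accumulate X^T S X = [[a, b], [b, c]] and X^T S z = [r0, r1].
        a = b = c = r0 = r1 = 0.0
        for x, y in zip(xs, ys):
            theta = w0 + w1 * x
            mu = sigm(theta)
            s = mu * (1.0 - mu)          # d mu / d theta
            z = theta + (y - mu) / s     # working response z_t
            a += s
            b += s * x
            c += s * x * x
            r0 += s * z
            r1 += s * x * z
        det = a * c - b * b
        # Solve the 2x2 system (X^T S X) w = X^T S z.
        w0, w1 = (c * r0 - b * r1) / det, (a * r1 - b * r0) / det
    return w0, w1

xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]   # made-up inputs
ys = [0, 0, 1, 0, 1, 1]                 # made-up labels (not separable)
w0, w1 = irls_logistic(xs, ys)
```

At convergence the canonical-link gradient Σ_i (y_i − µ_i) x_i is zero for both the intercept and the slope.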
Note 1: If we extend this to handle non-canonical links,
• the Hessian has another form
• the expected Hessian (Fisher information matrix) has the same form as (9.92)
– using the expected Hessian instead of the actual one is called the Fisher scoring method
Note 2: To perform MAP estimation with a Gaussian prior, see Section 8.3.6.
9.3.3. Bayesian inference.
When a GLM is fit in a Bayesian manner, inference is usually done with MCMC. See Dey et al. (2000) for details.
9.4. Probit regression.
In this section, we focus on the probit regression case, where
$$
g^{-1}(\eta) = \Phi(\eta) = \int_{-\infty}^{\eta} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{t^2}{2} \right) dt.
$$
9.4.1. ML/MAP estimation using gradient-based optimization.
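Let µ_i = w^T x_i and let ỹ_i ∈ {−1, +1}. Then the gradient of the log likelihood for a single case is given by
$$
g_i \triangleq \frac{d}{dw} \log p(\tilde{y}_i|w^T x_i)
= \frac{d\mu_i}{dw} \frac{d}{d\mu_i} \log p(\tilde{y}_i|w^T x_i)
= x_i\, \frac{\tilde{y}_i\, \phi(\mu_i)}{\Phi(\tilde{y}_i \mu_i)},
$$
where ϕ is the standard normal pdf and Φ is its cdf, and the Hessian is given by
$$
H_i = \frac{d^2}{dw^2} \log p(\tilde{y}_i|w^T x_i)
= -x_i \left( \frac{\phi(\mu_i)^2}{\Phi(\tilde{y}_i \mu_i)^2} + \frac{\tilde{y}_i \mu_i\, \phi(\mu_i)}{\Phi(\tilde{y}_i \mu_i)} \right) x_i^T.
$$
Note: If we use the prior p(w) = N(0, V_0), the penalized log likelihood $\sum_i \log p(\tilde{y}_i|w^T x_i) - \frac{1}{2} w^T V_0^{-1} w$ has
• Gradient: $\sum_i g_i - V_0^{-1} w$
• Hessian: $\sum_i H_i - V_0^{-1}$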
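The probit gradient and Hessian can be checked against finite differences. The sketch below restricts to a scalar weight and a scalar input so that only the standard library is needed; Φ is computed from `math.erf`, and the numerical values are arbitrary.

```python
import math

def norm_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def log_lik(w, x, y_tilde):
    """log p(y_tilde | w x) = log Phi(y_tilde * w * x), y_tilde in {-1, +1}."""
    return math.log(norm_cdf(y_tilde * w * x))

def grad(w, x, y_tilde):
    """g = x * y_tilde * phi(mu) / Phi(y_tilde * mu), with mu = w x."""
    mu = w * x
    return x * y_tilde * norm_pdf(mu) / norm_cdf(y_tilde * mu)

def hess(w, x, y_tilde):
    """H = -x * (r^2 + y_tilde * mu * r) * x, with r = phi(mu) / Phi(y_tilde * mu)."""
    mu = w * x
    r = norm_pdf(mu) / norm_cdf(y_tilde * mu)
    return -x * (r * r + y_tilde * mu * r) * x

w, x, y_tilde = 0.7, 1.5, -1     # arbitrary test point
h = 1e-5
grad_fd = (log_lik(w + h, x, y_tilde) - log_lik(w - h, x, y_tilde)) / (2.0 * h)
hess_fd = (grad(w + h, x, y_tilde) - grad(w - h, x, y_tilde)) / (2.0 * h)
```

Both analytic expressions agree with their finite-difference counterparts, and the Hessian is negative, consistent with a concave log likelihood.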
9.4.2. Latent variable interpretation.
Assume we can observe only a binary choice, which is decided by comparing two latent utilities:
$$
u_{0i} \triangleq w_0^T x_i + \delta_{0i}, \qquad
u_{1i} \triangleq w_1^T x_i + \delta_{1i}, \qquad
y_i = I(u_{1i} > u_{0i}),
$$
where the δ's are error terms. This is called a random utility model or RUM (McFadden 1974; Train 2009).
Define z_i = u_{1i} − u_{0i}, w = w_1 − w_0 and ϵ_i = δ_{1i} − δ_{0i}. If the δ's have a Gaussian distribution, then so does ϵ_i, and we can rewrite the model as
$$
z_i \triangleq w^T x_i + \epsilon_i, \qquad
\epsilon_i \sim \mathcal{N}(0, 1), \qquad
y_i = I(z_i \ge 0).
$$
This is called a difference RUM or dRUM model.
When we marginalize out z_i, we recover the probit model:
$$
\begin{aligned}
p(y_i = 1|x_i, w) &= \int I(z_i \ge 0)\, \mathcal{N}(z_i|w^T x_i, 1)\, dz_i \\
&= p(w^T x_i + \epsilon \ge 0) = p(\epsilon \ge -w^T x_i) \\
&= 1 - \Phi(-w^T x_i) = \Phi(w^T x_i).
\end{aligned}
$$
Note: Interestingly, if we use a Gumbel distribution for the δ's, we induce a logistic distribution for ϵ_i, and the model reduces to logistic regression.
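The marginalization argument can be illustrated by simulation: draw ϵ ~ N(0, 1), form z = w^T x + ϵ, and compare the frequency of z ≥ 0 with Φ(w^T x). This is a Monte Carlo sketch with made-up weights and inputs, so the agreement is only approximate.

```python
import math
import random

def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def simulate_drum(w, x, num_draws=200_000, seed=0):
    """Fraction of draws with z = w^T x + eps >= 0, where eps ~ N(0, 1)."""
    rng = random.Random(seed)
    eta = sum(wi * xi for wi, xi in zip(w, x))
    hits = sum(1 for _ in range(num_draws) if eta + rng.gauss(0.0, 1.0) >= 0.0)
    return hits / num_draws

w = [0.5, -1.0]   # hypothetical weights
x = [1.0, 0.2]    # hypothetical input, so w^T x = 0.3
freq = simulate_drum(w, x)
prob = norm_cdf(sum(wi * xi for wi, xi in zip(w, x)))
```

With 200,000 draws the standard error is about 0.001, so `freq` and `prob` should agree to roughly two decimal places.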
9.4.3. Ordinal probit regression.
One advantage of the latent variable interpretation is that it extends easily to ordinal probit regression,
i.e., to the case where the response variable is ordinal, that is, it can take on C discrete values that can be ordered.
Example: C = 3.
Partition the real line into 3 intervals: (−∞, 0], (0, γ], (γ, ∞).
We can vary the parameter γ to ensure the right relative amount of probability mass falls in each interval, so as to match the empirical frequencies of each class label.
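For C = 3, the class probabilities implied by the three intervals can be written with the normal cdf. The sketch below assumes the latent variable is z = η + ϵ with ϵ ~ N(0, 1), as in the dRUM model, where η stands for w^T x; the values of η and γ are arbitrary.

```python
import math

def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def ordinal_probit_probs(eta, gamma):
    """Class probabilities for C = 3 with z = eta + eps, eps ~ N(0, 1)."""
    p1 = norm_cdf(-eta)                          # P(z <= 0)
    p2 = norm_cdf(gamma - eta) - norm_cdf(-eta)  # P(0 < z <= gamma)
    p3 = 1.0 - norm_cdf(gamma - eta)             # P(z > gamma)
    return [p1, p2, p3]

probs = ordinal_probit_probs(eta=0.4, gamma=1.5)
wider = ordinal_probit_probs(eta=0.4, gamma=2.5)   # larger gamma widens the middle interval
```

Increasing γ moves probability mass from the third class to the second, which is how γ can be tuned to match the empirical class frequencies.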

9.4.4. Multinomial probit models.
Consider the case where the response variable can take on C unordered categorical values, y_i ∈ {1, ..., C}. The multinomial probit model is defined as follows:
$$
z_{ic} = w^T x_{ic} + \epsilon_{ic}, \qquad
\epsilon \sim \mathcal{N}(0, R), \qquad
y_i = \arg\max_c z_{ic}.
$$
Since only relative utilities matter, we constrain R to be a correlation matrix.
Note: If we use y_{ic} = I(z_{ic} > 0) instead of y_i = argmax_c z_{ic}, we get a model known as multivariate probit.
9.5. Multi-task learning.
If we can assume the input-output mapping is similar across the models, we can get better performance by fitting all the parameters at the same time.
This is also called
• multi-task learning (Caruana 1998)
• transfer learning (Raina et al. 2005)
• learning to learn (Thrun and Pratt 1997)
In statistics, this is tackled using hierarchical Bayesian models (Bakker and Heskes 2003), although there are other possible methods (Chai 2010).
9.5.1. Hierarchical Bayes for multi-task learning.
Let y_{ij} be the response of the i'th item in group j, for i = 1 : N_j and j = 1 : J.
The goal of multi-task learning
The goal is to fit the models p(y_j|x_j) for all j.

Suppose E[y_{ij}|x_{ij}] = g(x_{ij}^T β_j), β_j ∼ N(β_*, σ_j² I) and β_* ∼ N(µ, σ_*² I). For simplicity, assume that µ = 0 and that σ_j² and σ_*² are all known.
The overall log probability has the form
$$
\log p(D|\beta) + \log p(\beta) = \sum_j \left[ \log p(D_j|\beta_j) - \frac{\|\beta_j - \beta_*\|^2}{2\sigma_j^2} \right] - \frac{\|\beta_*\|^2}{2\sigma_*^2}.
$$
We can perform MAP estimation of β = (β_{1:J}, β_*) using standard gradient methods. Alternatively, we can use an iterative optimization scheme.
Note:
• The likelihood and prior are both log-concave, so we are guaranteed to converge to the global optimum
• Once the models are trained, we can discard β_* and use each model separately.
9.5.2. Application to personalized email spam filtering.
The goal is to fit one classifier β_j per user j to filter that user's spam email. We can make two copies of each feature x_i, one concatenated with the user id and one not:
$$
E[y_i|x_i, u] = (\beta_*, w_1, \ldots, w_J)^T [x_i, I(u = 1) x_i, \ldots, I(u = J) x_i],
$$
where u is the user id. In other words,
$$
E[y_i|x_i, u = j] = (\beta_* + w_j)^T x_i.
$$
Thus β_* will be estimated from everyone's email, whereas w_j will just be estimated from user j's email.
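The augmented feature construction can be written out directly. In the sketch below, the helper names, the weights and the feature vector are all made up; it checks that the dot product with the augmented vector equals (β_* + w_j)^T x_i for the active user.

```python
def augment(x, user, num_users):
    """Build [x, I(u=1) x, ..., I(u=J) x]; users are indexed from 1."""
    out = list(x)
    for j in range(1, num_users + 1):
        out.extend(x if j == user else [0.0] * len(x))
    return out

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

beta_star = [0.5, -1.0]                      # shared weights (hypothetical)
w = {1: [0.1, 0.0], 2: [-0.3, 0.4]}          # per-user offsets (hypothetical)
weights = beta_star + w[1] + w[2]            # stacked vector (beta_*, w_1, w_2)

x = [2.0, 1.0]
pred_augmented = dot(weights, augment(x, user=2, num_users=2))
pred_direct = dot([b + v for b, v in zip(beta_star, w[2])], x)
```

Both predictions coincide, so a single linear model on the augmented features is equivalent to a shared model plus a per-user correction.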
To see the correspondence with the hierarchical Bayesian model above, define w_j = β_j − β_*. Then the log probability of the original model can be rewritten as
$$
\sum_j \left[ \log p(D_j|\beta_* + w_j) - \frac{\|w_j\|^2}{2\sigma_j^2} \right] - \frac{\|\beta_*\|^2}{2\sigma_*^2}.
$$
Note: If we assume σ_j² = σ_*², the effect is the same as using the augmented feature trick, with the same regularizer strength for both w_j and β_*. However, one typically gets better performance by not requiring that σ_j² be equal to σ_*² (Finkel and Manning 2009).

9.5.3. Application to domain adaptation.
Domain adaptation is the problem of training a set of classifiers on data drawn from different distributions.
Finkel and Manning (2009) used the above hierarchical Bayesian model for two NLP tasks. They report
• Reasonably large improvements over fitting separate models to each dataset
• Small improvements over the approach of pooling all the data and fitting a single model
9.5.4. Other kinds of prior.
Let us consider the possibility of using priors other than the Gaussian.
For example, consider the task of conjoint analysis. We can use multi-task feature selection (Lenk et al. 1996; Argyriou et al. 2008): we use a sparsity-promoting prior on β_j, rather than a Gaussian prior.
Negative transfer: If we pool the parameters across tasks that are qualitatively different, the performance can be worse than without pooling.
→ To overcome this problem, use a more flexible prior such as a mixture of Gaussians, which gives results that are more robust to prior misspecification (Xue et al. 2007; Jacob et al. 2008).
9.6. Generalized linear mixed models.
As above, we can allow some parameters to vary across groups (β_j) and others to be tied across groups (α), which gives the form:
$$
E[y_{ij}|x_{ij}, x_j] = g\left( \phi_1(x_{ij})^T \beta_j + \phi_2(x_j)^T \beta'_j + \phi_3(x_{ij})^T \alpha + \phi_4(x_j)^T \alpha' \right).
$$
GLMM: Frequentists call the terms β_j random effects, since they vary randomly across groups, and they call α a fixed effect.
9.6.1. Example: semi-parametric GLMM for medical data.
Suppose y_{ij} is the amount of spinal bone mineral density (SBMD) for person j at measurement i. Let x_{ij} be the age of the person, and let x_j be their ethnicity.
Because the effect of age is nonlinear, we use a semi-parametric model, which combines linear regression with non-parametric regression (Ruppert et al. 2003). Because there is also variation across individuals within each group, we use a mixed effects model.
Specifically, we will use
• ϕ_1(x_{ij}) = 1 to account for the random effect of each person
• ϕ_2(x_j) = 0 since no other coefficients are person-specific
• ϕ_3(x_{ij}) = [b_k(x_{ij})], where b_k is the k'th spline basis function (see Section 15.4.6.2), to account for the nonlinear effect of age
• ϕ_4(x_j) = [I(x_j = w), I(x_j = a), I(x_j = b), I(x_j = h)] to account for the effect of the different ethnicities
• a linear link function
The overall model is
$$
E[y_{ij}|x_{ij}, x_j] = \beta_j + \alpha^T b(x_{ij}) + \epsilon_{ij} + \alpha'_w I(x_j = w) + \alpha'_a I(x_j = a) + \alpha'_b I(x_j = b) + \alpha'_h I(x_j = h),
$$
where ϵ_{ij} ∼ N(0, σ_y²).
This means
• α contains the non-parametric part of the model related to age
• α′ contains the parametric part of the model related to ethnicity
• β_j is a random offset for person j
With Gaussian priors on the regression coefficients, we can perform posterior inference to compute p(α, α′, β, σ²|D) (see Sec. 9.6.2 for the computational details). We can also perform significance testing by computing p(α′_g − α′_w|D) for each ethnic group g relative to some baseline (here w).
9.6.2. Computational Issues.
Difficulties in GLMMs
• p(y_{ij}|θ) may not be conjugate to the prior p(θ), where θ = (α, β)
• There are two levels of unknowns in the model, namely the regression coefficients θ and the parameters of the prior, η = (µ, σ)
For fully Bayesian inference, we can use variational Bayes (Hall et al. 2011; see Sec. 21.5) or MCMC (Gelman and Hill 2007; see Sec. 24.1).
Alternatively, we can use empirical Bayes. In the context of a GLMM, the EM algorithm is useful, where
• E step: compute p(θ|η, D)
• M step: optimize η
Note:
• The E step needs an approximation
– Numerical quadrature
– Monte Carlo (Breslow and Clayton 1993)
• A faster approach is to use variational EM (Braun and McAuliffe 2010)
• Among frequentists, GEE (generalized estimating equations) is popular.
– Not recommended, because it is not as statistically efficient as likelihood-based methods (see Sec. 6.4.3)
– It provides only estimates of the population parameters α, not of the random effects β_j
