CH.9 GENERALIZED LINEAR MODELS AND THE EXPONENTIAL FAMILY
D. YONEOKA
9.1. Introduction.
Outline.
• Properties of the exponential family
• Derive theorems and algorithms for applications
• How to build classifiers (GLMs)
9.2. The exponential family.
Why is this important?
• Finite-size sufficient statistics (→ able to compress the data without loss of information)
• Existence of conjugate priors (see Sec. 9.2.5)
• Makes the least set of assumptions subject to some user-chosen constraints (see Sec. 9.2.6)
• Core of GLMs (see Sec. 9.3)
• Core of variational inference (see Sec. 21.2)
9.2.1. Definition.
Let p(x|θ) be a pdf or pmf for x = (x1, . . . , xm) ∈ χ^m and θ ∈ Θ ⊆ R^d. It is said to be in the exponential family if it has the form
p(x|θ) = (1/Z(θ)) h(x) exp[θ^T ϕ(x)]
= h(x) exp[θ^T ϕ(x) − A(θ)]
= h(x) exp[η(θ)^T ϕ(x) − A(η(θ))],
where
Z(θ) = ∫_{χ^m} h(x) exp[θ^T ϕ(x)] dx
A(θ) = log Z(θ)
We use the following terminology:
• θ: natural parameter or canonical parameter
• ϕ(x): sufficient statistics
• Z(θ): partition function
• A(θ): log partition function or cumulant function
• h(x): scaling constant, often = 1
• η(θ): mapping from θ to the canonical parameters
Note:
• If dim(θ) < dim(η(θ)), it is called a curved exponential family, which means we have more sufficient statistics than parameters.
• If η(θ) = θ, the model is said to be in canonical form.
The natural parameter space is H = {θ : A(θ) < ∞}. Written in canonical form, an exponential-family density has some convexity properties, which are useful when manipulating moments and other functionals of the sufficient statistics.
Theorem: H is a convex set, and A is a convex function on H.
Proof.
It suffices to show that A is convex. Let 0 < α < 1 and take θ and θ1 in H. Write
A(αθ + (1 − α)θ1) = log ∫_{χ^m} exp{(αθ + (1 − α)θ1)^T ϕ(x)} h(x) dx
= log ∫_{χ^m} exp(αθ^T ϕ(x)) exp((1 − α)θ1^T ϕ(x)) h(x) dx
≤ log [∫_{χ^m} (exp(αθ^T ϕ(x)))^{1/α} h(x) dx]^α [∫_{χ^m} (exp((1 − α)θ1^T ϕ(x)))^{1/(1−α)} h(x) dx]^{1−α}   (Hölder's inequality)
= αA(θ) + (1 − α)A(θ1) < ∞
Thus αθ + (1 − α)θ1 ∈ H, and the theorem holds. □
9.2.2. Example.
9.2.2.1. Bernoulli, i.e., x ∈ {0, 1}.
Ber(x|µ) = µ^x (1 − µ)^{1−x} = exp[x log(µ) + (1 − x) log(1 − µ)] = exp[ϕ(x)^T θ],
where ϕ(x) = [I(x = 0), I(x = 1)] and θ = [log(1 − µ), log(µ)].
→ but this representation is over-complete, because I(x = 0) + I(x = 1) = 1
→ θ is not uniquely identifiable.
→ To obtain identifiability, we require the representation to be minimal.
Then,
Ber(x|µ) = (1 − µ) exp(x log(µ/(1 − µ)))
Note: We can recover the mean parameter µ from the canonical parameter θ = log(µ/(1 − µ)) using
µ = sigm(θ) = 1/(1 + e^{−θ})
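As a quick numerical check, here is a minimal NumPy sketch (my own illustration, not from the text) of the mapping between the mean parameter µ and the canonical parameter θ for the Bernoulli:

```python
import numpy as np

def sigm(theta):
    """Logistic (sigmoid) function: canonical parameter -> mean parameter."""
    return 1.0 / (1.0 + np.exp(-theta))

def logit(mu):
    """Inverse mapping: mean parameter mu -> canonical parameter theta."""
    return np.log(mu / (1.0 - mu))

mu = 0.3
theta = logit(mu)                    # canonical parameter log(mu / (1 - mu))
assert np.isclose(sigm(theta), mu)   # recover mu = sigm(theta)

# The minimal exponential-family form Ber(x|mu) = (1 - mu) * exp(x * theta)
for x in (0, 1):
    p_exp_fam = (1.0 - mu) * np.exp(x * theta)
    p_direct = mu**x * (1.0 - mu)**(1 - x)
    assert np.isclose(p_exp_fam, p_direct)
```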
9.2.2.2. Multinoulli, i.e., xk = I(x = k).
Cat(x|µ) = ∏_{k=1}^{K} µk^{xk} = exp[∑_{k=1}^{K} xk log µk] ⇐⇒ Cat(x|θ) = exp[θ^T ϕ(x) − A(θ)],
where
θ = [log(µ1/µK), . . . , log(µ_{K−1}/µK)]
ϕ(x) = [I(x = 1), . . . , I(x = K − 1)].
As in the Bernoulli case, we can recover the mean parameters as follows;
µk = e^{θk} / (1 + ∑_{j=1}^{K−1} e^{θj})
µK = 1 / (1 + ∑_{j=1}^{K−1} e^{θj})
A(θ) = log(1 + ∑_{j=1}^{K−1} e^{θj})
9.2.2.3. Univariate Gaussian.
N(x|µ, σ²) = (1/(2πσ²)^{1/2}) exp[−(x − µ)²/(2σ²)] = (1/Z(θ)) exp(θ^T ϕ(x)),
where
θ = (µ/σ², −1/(2σ²))^T
ϕ(x) = (x, x²)^T
Z(µ, σ²) = √(2π) σ exp[µ²/(2σ²)]
A(θ) = −θ1²/(4θ2) − (1/2) log(−2θ2) + (1/2) log(2π)
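The following short NumPy check (an illustrative sketch with made-up values, not from the text) confirms that this exponential-family parameterization reproduces the Gaussian density and that A(θ) = log Z:

```python
import numpy as np

mu, sigma2 = 1.5, 0.7
theta = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])   # natural parameters
Z = np.sqrt(2.0 * np.pi) * np.sqrt(sigma2) * np.exp(mu**2 / (2.0 * sigma2))

x = 0.3
phi = np.array([x, x**2])                                # sufficient statistics
p_exp_fam = np.exp(theta @ phi) / Z
p_direct = np.exp(-(x - mu)**2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
assert np.isclose(p_exp_fam, p_direct)

# Log partition function expressed in terms of theta
A = -theta[0]**2 / (4.0 * theta[1]) - 0.5 * np.log(-2.0 * theta[1]) + 0.5 * np.log(2.0 * np.pi)
assert np.isclose(A, np.log(Z))
```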
9.2.2.4. Non-examples.
For example, the uniform distribution and the Student t-distribution are not in the exponential family, because they cannot be expressed in the required form.
9.2.3. Log partition function.
A(θ) is called the cumulant function, because the derivatives of A can be used to generate the cumulants of the sufficient statistics. For a scalar parameter,
dA/dθ = d/dθ ( log ∫ exp(θϕ(x)) h(x) dx )
= ∫ ϕ(x) exp(θϕ(x)) h(x) dx / exp(A(θ))
= ∫ ϕ(x) exp(θϕ(x) − A(θ)) h(x) dx
= ∫ ϕ(x) p(x) dx = E[ϕ(x)] = expectation of the sufficient statistics
d²A/dθ² = ∫ ϕ(x) exp(θϕ(x) − A(θ)) h(x) (ϕ(x) − A′(θ)) dx = ∫ ϕ(x) p(x) (ϕ(x) − A′(θ)) dx
= ∫ ϕ²(x) p(x) dx − A′(θ) ∫ ϕ(x) p(x) dx = E[ϕ²(x)] − E[ϕ(x)]²   (∵ A′(θ) = dA/dθ = E[ϕ(x)])
= Var[ϕ(x)] = variance of the sufficient statistics
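A quick numerical illustration (a sketch assuming the Bernoulli in minimal form, where A(θ) = log(1 + e^θ) and ϕ(x) = x): the first two derivatives of A should match the mean and variance of the sufficient statistic.

```python
import numpy as np

def A(theta):
    """Log partition function of the Bernoulli in minimal form: A(theta) = log(1 + e^theta)."""
    return np.log1p(np.exp(theta))

theta = 0.4
eps = 1e-4
dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)               # numerical first derivative
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2  # numerical second derivative

mu = 1.0 / (1.0 + np.exp(-theta))   # E[phi(x)] = E[x] = mu
var = mu * (1.0 - mu)               # Var[phi(x)] = mu * (1 - mu)

assert np.isclose(dA, mu, atol=1e-6)
assert np.isclose(d2A, var, atol=1e-5)
```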
9.2.4. MLE for the exponential family.
The likelihood of an exponential family model has the form
p(D|θ) = [∏_{i=1}^{N} h(xi)] g(θ)^N exp(η(θ)^T [∑_{i=1}^{N} ϕ(xi)]),
so the sufficient statistics are N and
ϕ(D) = [∑_{i=1}^{N} ϕ1(xi), . . . , ∑_{i=1}^{N} ϕK(xi)]
Pitman-Koopman-Darmois theorem
Under certain conditions, the exponential family is the only family of distributions with finite sufficient statistics.
Note: one of the conditions required in this theorem is
• The support of the distribution must not depend on the distribution parameters.
– e.g., the uniform distribution p(x|θ) = (1/θ) I(0 ≤ x ≤ θ) does not satisfy this condition.
– Its sufficient statistics are N and max_i xi
– It has finite sufficient statistics, but the support of the distribution depends on the parameter
– → Not in the exponential family
Derivation of MLE.
• Consider a canonical exponential family, η(θ) = θ; the log likelihood is
log p(D|θ) = θ^T ϕ(D) − N A(θ)
• Concavity of log likelihood
– Second derivative of −A(θ) is non-positive
– θ^T ϕ(D) is linear in θ
– → Log likelihood is concave
– → Has a unique global maximum
Setting the gradient of the log likelihood to zero, the MLE must satisfy
∇θ log p(D|θ) = ϕ(D) − N E[ϕ(X)] = 0 ⇐⇒ E[ϕ(X)] = (1/N) ∑_{i=1}^{N} ϕ(xi),
which is called moment matching.
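For the Bernoulli, this moment-matching condition gives the familiar MLE µ̂ = (1/N) ∑ xi. A tiny sketch (my own illustration with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=10_000)          # synthetic Bernoulli data with true mu = 0.3

# Moment matching: E[phi(X)] = empirical mean of the sufficient statistic phi(x) = x
mu_hat = x.mean()
theta_hat = np.log(mu_hat / (1.0 - mu_hat))    # corresponding canonical parameter

print(mu_hat, theta_hat)   # mu_hat should be close to 0.3
```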
9.2.5. Bayes for the exponential family.
Existence of a conjugate prior
A conjugate prior exists
⇐⇒ the prior p(θ|τ) has the same form as the likelihood p(D|θ)
⇐⇒ the prior is in the exponential family
Note: For this to make sense, we require p(D|θ) = p(s(D)|θ), i.e., the likelihood must have finite sufficient statistics.
9.2.5.1. Likelihood.
The likelihood of the exponential family is given by
p(D|θ) ∝ g(θ)^N exp(η(θ)^T sN),
where sN = ∑_{i=1}^{N} s(xi). In terms of the canonical parameters this becomes
p(D|θ) ∝ exp(N η^T s̄ − N A(η)),
where s̄ = (1/N) sN.
9.2.5.2. Prior.
The natural conjugate prior has the form
p(θ|ν0, τ0) ∝ g(θ)^{ν0} exp(η(θ)^T τ0),
or, writing τ0 = ν0 τ̄0, we get
p(θ|ν0, τ̄0) ∝ exp(ν0 η^T τ̄0 − ν0 A(η))
9.2.5.3. Posterior.
The posterior is given by
p(θ|D) = p(θ|νN, τN) = p(θ|ν0 + N, τ0 + sN).
In canonical form, this becomes
p(η|D) ∝ exp(η^T (ν0 τ̄0 + N s̄) − (ν0 + N) A(η))
= p(η|ν0 + N, (ν0 τ̄0 + N s̄)/(ν0 + N))
Note: the posterior hyper-parameters are a convex combination of the prior mean hyper-parameters and the average of the sufficient statistics.
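As a concrete sketch of this update rule (my own illustration, using a Bernoulli likelihood with the conjugate prior written in (ν, τ) form; hyper-parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.7, size=50)        # Bernoulli data
N, sN = len(x), x.sum()                  # sufficient statistics (N, sum of x)

nu0, tau0 = 4.0, 2.0                     # prior hyper-parameters
nuN, tauN = nu0 + N, tau0 + sN           # conjugate update: just add the sufficient statistics

# For the Bernoulli this is theta^tauN * (1 - theta)^(nuN - tauN),
# i.e. a Beta(tauN + 1, nuN - tauN + 1) posterior; its mean is a prior/data compromise
post_mean = (tauN + 1) / (nuN + 2)
print(post_mean)
```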
9.2.5.4. Posterior predictive density.
Define
• the future data as D′ = (x̃1, . . . , x̃N′)
• the past data as D = (x1, . . . , xN)
• τ̃0 = (ν0, τ0)
• s̃(D) = (N, s(D))
• s̃(D′) = (N′, s(D′))
So the prior becomes
p(θ|τ̃0) = (1/Z(τ̃0)) g(θ)^{ν0} exp(η(θ)^T τ0)
Hence
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ
= [∏_{i=1}^{N′} h(x̃i)] Z(τ̃0 + s̃(D))^{−1} ∫ g(θ)^{ν0+N+N′} exp(∑_k ηk(θ)(τk + ∑_{i=1}^{N} sk(xi) + ∑_{i=1}^{N′} sk(x̃i))) dθ
= [∏_{i=1}^{N′} h(x̃i)] Z(τ̃0 + s̃(D) + s̃(D′)) / Z(τ̃0 + s̃(D))
9.2.5.5. Example: Bernoulli distribution.
The likelihood is
p(D|θ) = (1 − θ)^N exp(log(θ/(1 − θ)) ∑_i xi)
Hence the conjugate prior is
p(θ|ν0, τ0) ∝ (1 − θ)^{ν0} exp(log(θ/(1 − θ)) τ0) = θ^{τ0} (1 − θ)^{ν0−τ0}
So the posterior is
p(θ|D) ∝ θ^{τ0+s} (1 − θ)^{ν0−τ0+n−s} = θ^{τn} (1 − θ)^{νn−τn},
where s = ∑_i I(xi = 1) is the sufficient statistic (for the Bernoulli, the number of heads).
How to derive the posterior predictive distribution.
Assume p(θ) = Beta(θ|α, β), let s = s(D), let the future data be D′ = (x̃1, . . . , x̃m), and let s′ = ∑_{i=1}^{m} I(x̃i = 1). Then
p(D′|D) = ∫_0^1 p(D′|θ) Beta(θ|αn, βn) dθ
= (Γ(αn + βn)/(Γ(αn)Γ(βn))) ∫_0^1 θ^{αn+s′−1} (1 − θ)^{βn+m−s′−1} dθ
= (Γ(αn + βn)/(Γ(αn)Γ(βn))) · (Γ(αn+m)Γ(βn+m)/Γ(αn+m + βn+m)),
where αn = α + s, βn = β + n − s, αn+m = αn + s′ and βn+m = βn + m − s′.
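A minimal numeric sketch of this formula (illustrative only; the posterior parameters and future counts are made up, scipy's log-gamma is used for stability, and the result is checked against a Monte Carlo average over posterior draws):

```python
import numpy as np
from scipy.special import gammaln

def log_post_pred(s_new, m, a_n, b_n):
    """log p(D'|D) for m future Bernoulli trials containing s_new ones, given a Beta(a_n, b_n) posterior."""
    return (gammaln(a_n + b_n) - gammaln(a_n) - gammaln(b_n)
            + gammaln(a_n + s_new) + gammaln(b_n + m - s_new)
            - gammaln(a_n + b_n + m))

a_n, b_n = 3.0, 5.0      # posterior Beta parameters after seeing the past data (assumed)
m, s_new = 4, 3          # future data: 4 trials with 3 ones, in a fixed order

exact = np.exp(log_post_pred(s_new, m, a_n, b_n))

# Monte Carlo check: average the likelihood of the future sequence over posterior draws
rng = np.random.default_rng(0)
theta = rng.beta(a_n, b_n, size=200_000)
mc = np.mean(theta**s_new * (1 - theta)**(m - s_new))
assert np.isclose(exact, mc, rtol=1e-2)
```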
9.2.6. Maximum entropy derivation of the exponential family.
This gives one justification for the use of the exponential family.
The principle of maximum entropy (maxent)
We should pick the distribution with maximum entropy −∑_{i=1}^{n} pi log pi, subject to the constraints that the moments of the distribution match the empirical moments of the specified functions.
Suppose all we know is ∑_x fk(x) p(x) = Fk. With the constraints p(x) ≥ 0 and ∑_x p(x) = 1, the Lagrangian for maximizing the entropy is
J(p, λ) = −∑_x p(x) log p(x) + λ0(1 − ∑_x p(x)) + ∑_k λk(Fk − ∑_x p(x) fk(x))
Setting ∂J/∂p(x) = 0 yields
p(x) = (1/exp(1 + λ0)) exp(−∑_k λk fk(x))
Using ∑_x p(x) = 1, we have
1 = ∑_x p(x) = (1/exp(1 + λ0)) ∑_x exp(−∑_k λk fk(x))
Hence the normalization constant Z = exp(1 + λ0) is given by
Z = ∑_x exp(−∑_k λk fk(x)),
which means the maxent distribution p(x) has the form of an exponential family, also known as the Gibbs distribution.
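The multipliers λk that satisfy the moment constraints can be found by convex optimization. Below is a small gradient-descent sketch on a finite support (entirely my own illustration; the feature f(x) = x and target moment F = 2.0 are made up):

```python
import numpy as np

xs = np.arange(6)            # finite support {0, ..., 5}
f = xs.astype(float)         # single feature f(x) = x
F = 2.0                      # desired moment E[f(X)] = 2.0 (assumed target)

lam = 0.0                    # Lagrange multiplier for the moment constraint
for _ in range(5000):
    logits = -lam * f
    p = np.exp(logits - logits.max())
    p /= p.sum()             # maxent distribution p(x) proportional to exp(-lam * f(x))
    grad = F - p @ f         # gradient of the convex dual log Z(lam) + lam * F
    lam -= 0.1 * grad        # gradient step; moves E_p[f] toward F

assert np.isclose(p @ f, F, atol=1e-4)
```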
9.3. Generalized linear model (GLM).
• Linear and logistic regression are examples of GLMs (McCullagh and Nelder 1989)
• Models in which the output density is in the exponential family
9.3.1. Basics.
Unconditional distribution for a scalar response variable:
p(yi|θ, σ²) = exp[(yiθ − A(θ))/σ² + c(yi, σ²)],
where
• σ² is the dispersion parameter
• θ is the natural parameter
• A is the log partition function
• c is a normalization constant
To convert from the mean parameter to the natural parameter, we can use the function
θ = Ψ(µ),
which is uniquely determined by the form of the exponential family distribution.
In addition,
• Mean function: µi = g^{−1}(w^T xi)
• Link function: g(µi) = ηi = w^T xi
• When g = Ψ, g is called the canonical link function
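For instance, for Bernoulli outputs the canonical link is the logit, and the GLM reduces to logistic regression. A tiny sketch (the weights and input are made-up illustrative values):

```python
import numpy as np

def canonical_link(mu):            # g = Psi: logit, maps mean parameter to natural parameter
    return np.log(mu / (1.0 - mu))

def mean_function(eta):            # g^{-1}: logistic, maps linear predictor to mean
    return 1.0 / (1.0 + np.exp(-eta))

w = np.array([0.5, -1.0])          # assumed weights
x = np.array([1.0, 2.0])           # one input vector
eta = w @ x                        # linear predictor
mu = mean_function(eta)            # E[y|x] under the canonical (logit) link
assert np.isclose(canonical_link(mu), eta)
```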
9.3.2. ML and MAP estimation.
The log likelihood has the following form:
l(w) = log p(D|w) = (1/σ²) ∑_{i=1}^{N} li,
where li = θi yi − A(θi). We can compute the gradient vector using the chain rule as follows:
dli/dwj = (dli/dθi)(dθi/dµi)(dµi/dηi)(dηi/dwj)
= (yi − A′(θi)) (dθi/dµi)(dµi/dηi) xij
= (yi − µi) (dθi/dµi)(dµi/dηi) xij
If we use a canonical link, θi = ηi, this simplifies to
∇w l(w) = (1/σ²) ∑_{i=1}^{N} (yi − µi) xi.
For improved efficiency, we should use a second-order method. If we use a canonical link, the Hessian is given by
H = −(1/σ²) ∑_{i=1}^{N} (dµi/dθi) xi xi^T = −(1/σ²) X^T S X,
where S = diag(dµ1/dθ1, . . . , dµN/dθN) is a diagonal weighting matrix. Specifically, we have the following Newton (IRLS) update:
w_{t+1} = (X^T St X)^{−1} X^T St zt
zt = θt + St^{−1}(y − µt),
where θt = X wt and µt = g^{−1}(ηt).
Note 1: If we extend this to handle non-canonical links,
• The Hessian has another form
• The expected Hessian (the Fisher information matrix) has the same form as (9.92)
– Using the expected Hessian instead of the actual one is called the Fisher scoring method
Note 2: To perform MAP estimation with a Gaussian prior, see Section 8.3.6.
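Below is a compact IRLS sketch for logistic regression (canonical link, σ² = 1), following the Newton update above; the data and settings are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])  # design matrix with bias column
w_true = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ w_true)))          # Bernoulli responses

w = np.zeros(D)
for _ in range(20):                      # IRLS / Newton iterations
    eta = X @ w                          # linear predictor (= natural parameter under the canonical link)
    mu = 1.0 / (1.0 + np.exp(-eta))      # mean function
    s = mu * (1.0 - mu)                  # d mu_i / d theta_i  (diagonal of S)
    z = eta + (y - mu) / s               # working responses z_t
    # Weighted least squares solve: w = (X^T S X)^{-1} X^T S z
    w = np.linalg.solve(X.T @ (s[:, None] * X), X.T @ (s * z))

print(w)   # should be close to w_true
```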
9.3.3. Bayesian inference.
Bayesian inference for GLMs is usually carried out with MCMC. See Dey et al. 2000 for details.
9.4. Probit regression.
In this section, we focus on the probit regression case, where
g^{−1}(η) = Φ(η) = ∫_{−∞}^{η} (1/√(2π)) exp(−t²/2) dt
9.4.1. ML/MAP estimation using gradient-based optimization.
Let µi = wT
xi, and let ˜y ∈ {−1, +1}, then the gradient of the log likelihood is given by
gi
d
dw
log p(˜yi|wT
xi) =
dµi
dw
d
dµi
log p(˜yi|wT
xi) = xi
˜yiϕ(µi)
Φ(˜yiµi)
and the Hessian is given by
Hi =
d
dw2
log p(˜y|wT
xi) = −xi
(
ϕ(µi)2
Φ(˜yµi)2
+
˜yiµiϕ(µi)
Φ(˜yiµi)
)
xT
i
Note: If we use a Gaussian prior p(w) = N(0, V0), the corresponding terms are added;
• The gradient of the log posterior: ∑_i gi − V0^{−1} w
• The Hessian of the log posterior: ∑_i Hi − V0^{−1}
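A small sketch of gradient-based ML fitting for probit regression using these formulas (simulated data with assumed weights; scipy's norm provides ϕ and Φ):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])           # design matrix with bias column
w_true = np.array([0.3, 1.2])
y = np.where(X @ w_true + rng.normal(size=N) > 0, 1.0, -1.0)    # probit data, labels in {-1, +1}

w = np.zeros(2)
for _ in range(2000):
    mu = X @ w
    # Gradient formula above: g = sum_i x_i * ytilde_i * phi(mu_i) / Phi(ytilde_i * mu_i)
    g = X.T @ (y * norm.pdf(mu) / norm.cdf(y * mu))
    w += 0.5 * g / N                                            # gradient ascent on the average log likelihood

print(w)   # should be close to w_true
```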
9.4.2. Latent variable interpretation.
Assume we only observe an action, which is decided based on latent utilities:
u0i = w0^T xi + δ0i
u1i = w1^T xi + δ1i
yi = I(u1i ≥ u0i),
where the δ's are error terms. This is called a random utility model or RUM (McFadden 1974; Train 2009).
Define zi = u1i − u0i = w^T xi + ϵi, where w = w1 − w0 and ϵi = δ1i − δ0i. If the δ's have a Gaussian distribution, then so does ϵi, and we can write
zi = w^T xi + ϵi
ϵi ∼ N(0, 1)
yi = 1 ⇐⇒ zi ≥ 0
This is called a difference RUM or dRUM model.
When we marginalize out zi, we recover the probit model:
p(yi = 1|xi, w) = ∫ I(zi ≥ 0) N(zi|w^T xi, 1) dzi
= p(w^T xi + ϵi ≥ 0) = p(ϵi ≥ −w^T xi)
= 1 − Φ(−w^T xi) = Φ(w^T xi)
Note: Interestingly, if we use a Gumbel distribution for the δ's, we induce a logistic distribution for ϵi, and the model reduces to logistic regression.
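A quick simulation sketch of the dRUM view (weights and input are made-up values): sampling the latent zi and thresholding at zero reproduces Φ(w^T x) as the probability of yi = 1.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
w = np.array([0.4, -0.8])                # assumed weights
x = np.array([1.0, 0.5])                 # one input

eta = w @ x
z = eta + rng.normal(size=1_000_000)     # latent utility difference z = w^T x + eps, eps ~ N(0, 1)
p_mc = np.mean(z >= 0)                   # Monte Carlo estimate of P(y = 1)
assert np.isclose(p_mc, norm.cdf(eta), atol=3e-3)
```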
9.4.3. Ordinal probit regression.
One advantage of the latent variable interpretation is its extendability to ordinal probit regression:
i.e., it is easy to extend to the case where the response variable is ordinal, that is, it can take on C discrete ordered values.
Example: if C = 3, partition the real line into 3 intervals: (−∞, 0], (0, γ], (γ, ∞).
We can vary the parameter γ to ensure the right relative amount of probability mass falls in each interval, so as to match the empirical frequencies of each class label.
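For C = 3 with cut-points 0 and γ, the class probabilities follow directly from the latent Gaussian; a small sketch with assumed values of the linear predictor and cut-point:

```python
import numpy as np
from scipy.stats import norm

eta = 0.7      # linear predictor w^T x (assumed)
gamma = 1.5    # upper cut-point (assumed)

p1 = norm.cdf(0.0 - eta)                     # P(y = 1) = P(z <= 0)
p2 = norm.cdf(gamma - eta) - norm.cdf(-eta)  # P(y = 2) = P(0 < z <= gamma)
p3 = 1.0 - norm.cdf(gamma - eta)             # P(y = 3) = P(z > gamma)
assert np.isclose(p1 + p2 + p3, 1.0)
```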
9.4.4. Multinomial probit models.
Consider the case where the response variable can take on C unordered categorical values, yi ∈ {1, . . . , C}. The multinomial probit model is defined as follows:
zic = w^T xic + ϵic
ϵ ∼ N(0, R)
yi = argmax_c zic
Since only relative utilities matter, we constrain R to be a correlation matrix.
Note: if we use yic = I(zic > 0) instead of yi = argmax_c zic, we get a model known as multivariate probit.
9.5. Multi-task learning.
If we can assume the input-output mapping is similar across models, we can get better performance by fitting all the parameters at the same time.
This is also called
• multi-task learning (Caruana 1998)
• transfer learning (Raina et al. 2005)
• learning to learn (Thrun and Pratt 1997)
In statistics, this is tackled using hierarchical Bayesian models (Bakker and Heskes 2003), although there are other possible methods (Chai 2010).
9.5.1. Hierarchical Bayes for multi-task learning.
Let yij be the response of the i’th item in group j, for i = 1 : Nj and j = 1 : J.
The goal of multi-task learning
The goal is to fit the models p(yj|xj) for all j.
Suppose E[yij|xij] = g(xij^T βj), with βj ∼ N(β∗, σj² I) and β∗ ∼ N(µ, σ∗² I). In addition, assume for simplicity that µ = 0 and that σj² and σ∗² are all known.
The overall log probability has the form
log p(D|β) + log p(β) = ∑_j [ log p(Dj|βj) − ||βj − β∗||²/(2σj²) ] − ||β∗||²/(2σ∗²)
We can perform MAP estimation of β = (β_{1:J}, β∗) using standard gradient methods. Alternatively, we can perform iterative optimization.
Note:
• The likelihood and prior are convex, so we are guaranteed to converge to the global optimum
• Once the models are trained, we can discard β∗ and use each model separately.
9.5.2. Application to personalized email spam filtering.
The goal is to find each βj so as to filter each individual user's spam email. We can make two copies of each feature xi, one concatenated with the user id and one not:
E[yij|xi, u] = (β∗, w1, . . . , wJ)^T [xi, I(u = 1)xi, . . . , I(u = J)xi],
where u is the user id. In other words,
E[yij|xi, u = j] = (β∗ + wj)^T xi.
Thus β∗ will be estimated from everyone's email, whereas wj will just be estimated from user j's email.
To connect this with the above hierarchical Bayesian model, define wj = βj − β∗. Then the log probability of the original model can be rewritten as
∑_j [ log p(Dj|β∗ + wj) − ||wj||²/(2σj²) ] − ||β∗||²/(2σ∗²)
Note: if we assume σj² = σ∗², the effect is the same as using the augmented feature trick, with the same regularizer strength for both wj and β∗. However, one typically gets better performance by not requiring that σj² be equal to σ∗² (Finkel and Manning 2009).
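A sketch of the augmented-feature construction mentioned above (entirely illustrative; the dimensions, user ids, and weights are made up). A single regularized weight vector then holds the shared β∗ together with the per-user offsets wj:

```python
import numpy as np

def augment(x, user_id, n_users):
    """Build [x, I(u=1)x, ..., I(u=J)x]; one weight vector then holds (beta_star, w_1, ..., w_J)."""
    D = x.shape[0]
    out = np.zeros(D * (n_users + 1))
    out[:D] = x                                 # shared copy, paired with beta_star
    out[D * user_id: D * (user_id + 1)] = x     # copy for user j = user_id (1-based), paired with w_j
    return out

D, J = 4, 3
x = np.ones(D)
xa = augment(x, user_id=2, n_users=J)

beta_star = np.full(D, 0.5)
w = [np.zeros(D), np.full(D, 0.1), np.zeros(D)]   # w_1, w_2, w_3
theta = np.concatenate([beta_star] + w)

# theta^T augment(x, j) equals (beta_star + w_j)^T x
assert np.isclose(theta @ xa, (beta_star + w[1]) @ x)
```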
9.5.3. Application to domain adaptation.
Domain adaptation is the problem of training a set of classifiers on data drawn from different distributions.
(Finkel and Manning 2009) used the above hierarchical Bayesian model for two NLP tasks. They report
• Reasonably large improvements over fitting separate models to each dataset
• Small improvements over the approach of pooling all the data and fitting a single model
9.5.4. Other kinds of prior.
Let’s consider the possibility to use other prior that Gaussian.
For example, consider the task of conjoint analysis. We can use multi-task feature selection (Lenk et al. 1996; Agryriou
et al. 2008): we use a sparsity-promoting prior on βj, rather than a Gaussian prior.
Negative transfer: If we pool the parameters across tasks that are qualitatively different, the performance will be worse
than not using pooling.
→ To overcome this problem, use more flexible prior such as a mixture of Gaussians, for which we can get more robust
result against prior misspecification. (Xue et al. 2007; Jacob et al. 2008).
9.6. Generalized linear mixed models.
Similarly to the above, we can allow some parameters to vary across groups, βj, and others to be tied across groups, α, which gives a model of the form
E[yij|xij, xj] = g( ϕ1(xij)^T βj + ϕ2(xj)^T β′j + ϕ3(xij)^T α + ϕ4(xj)^T α′ )
GLMM: Frequentists call the terms βj random effects, since they vary randomly across groups, but they call α a fixed effect.
9.6.1. Example: semi-parametric GLMM for medical data.
Suppose yij is the amount of spinal bone mineral density (SBMD) for person j at measurement i. Let xij be the age of the person, and let xj be their ethnicity.
Here we use a semi-parametric model, which combines linear regression with non-parametric regression (Ruppert et al. 2003), because we also see variation across individuals within each group.
Specifically, we will use
• ϕ1(xij) = 1, to account for the random effect of each person
• ϕ2(xj) = 0, since no other coefficients are person-specific
• ϕ3(xij) = [bk(xij)], where bk is the k'th spline basis function (see Section 15.4.6.2), to account for the nonlinear effect of age
• ϕ4(xj) = [I(xj = w), I(xj = a), I(xj = b), I(xj = h)], to account for the effect of the different ethnicities
• a linear link function
The overall model is
yij = βj + α^T b(xij) + α′_w I(xj = w) + α′_a I(xj = a) + α′_b I(xj = b) + α′_h I(xj = h) + ϵij,
where ϵij ∼ N(0, σ²_y).
This means
• α contains the non-parametric part of the model related to age
• α′ contains the parametric part of the model related to ethnicity
• βj is a random offset for person j
Using Gaussian priors on the regression coefficients, we can perform posterior inference to compute p(α, α′, β, σ²|D) (Sec. 9.6.2). We can also perform significance testing, by computing p(α′_g − α′_w|D) for each ethnic group g relative to some baseline (here, white).
9.6.2. Computational issues.
Difficulties in GLMMs:
• p(yij|θ) may not be conjugate to the prior p(θ), where θ = (α, β)
• There are two levels of unknowns in the model, namely the regression coefficients θ and the parameters of the prior, η = (µ, σ)
For fully Bayesian inference, we can use variational Bayes (Hall et al. 2011; see Sec. 21.5) or MCMC (Gelman and Hill 2007; see Sec. 24.1).
Alternatively, we can use empirical Bayes. In the context of a GLMM, the EM algorithm is useful, where
• E step: compute p(θ|η, D)
• M step: optimize η
Note:
• The E step needs an approximation
– Numerical quadrature
– Monte Carlo (Breslow and Clayton 1993)
• A faster approach is to use variational EM (Braun and McAuliffe 2010)
• Among frequentists, GEE (generalized estimating equations) is popular.
– Not recommended, because it is not as statistically efficient as likelihood-based methods (see Section 6.4.3)
– It provides only estimates of the population parameters α, not the random effects βj
Weitere ähnliche Inhalte

Was ist angesagt?

Lesson 25: Evaluating Definite Integrals (slides)
Lesson 25: Evaluating Definite Integrals (slides)Lesson 25: Evaluating Definite Integrals (slides)
Lesson 25: Evaluating Definite Integrals (slides)Matthew Leingang
 
The dual geometry of Shannon information
The dual geometry of Shannon informationThe dual geometry of Shannon information
The dual geometry of Shannon informationFrank Nielsen
 
ABC based on Wasserstein distances
ABC based on Wasserstein distancesABC based on Wasserstein distances
ABC based on Wasserstein distancesChristian Robert
 
Lesson 27: Evaluating Definite Integrals
Lesson 27: Evaluating Definite IntegralsLesson 27: Evaluating Definite Integrals
Lesson 27: Evaluating Definite IntegralsMatthew Leingang
 
Formulas
FormulasFormulas
FormulasZHERILL
 
Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)Matthew Leingang
 
Classification with mixtures of curved Mahalanobis metrics
Classification with mixtures of curved Mahalanobis metricsClassification with mixtures of curved Mahalanobis metrics
Classification with mixtures of curved Mahalanobis metricsFrank Nielsen
 
Harmonic Analysis and Deep Learning
Harmonic Analysis and Deep LearningHarmonic Analysis and Deep Learning
Harmonic Analysis and Deep LearningSungbin Lim
 
A new Perron-Frobenius theorem for nonnegative tensors
A new Perron-Frobenius theorem for nonnegative tensorsA new Perron-Frobenius theorem for nonnegative tensors
A new Perron-Frobenius theorem for nonnegative tensorsFrancesco Tudisco
 
Darmon Points: an Overview
Darmon Points: an OverviewDarmon Points: an Overview
Darmon Points: an Overviewmmasdeu
 
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...Frank Nielsen
 
Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...
Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...
Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...Francesco Tudisco
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Valentin De Bortoli
 
Lesson 21: Curve Sketching (slides)
Lesson 21: Curve Sketching (slides)Lesson 21: Curve Sketching (slides)
Lesson 21: Curve Sketching (slides)Matthew Leingang
 
Treewidth and Applications
Treewidth and ApplicationsTreewidth and Applications
Treewidth and ApplicationsASPAK2014
 
Roots equations
Roots equationsRoots equations
Roots equationsoscar
 

Was ist angesagt? (19)

Lesson 25: Evaluating Definite Integrals (slides)
Lesson 25: Evaluating Definite Integrals (slides)Lesson 25: Evaluating Definite Integrals (slides)
Lesson 25: Evaluating Definite Integrals (slides)
 
The dual geometry of Shannon information
The dual geometry of Shannon informationThe dual geometry of Shannon information
The dual geometry of Shannon information
 
ABC based on Wasserstein distances
ABC based on Wasserstein distancesABC based on Wasserstein distances
ABC based on Wasserstein distances
 
Lesson 27: Evaluating Definite Integrals
Lesson 27: Evaluating Definite IntegralsLesson 27: Evaluating Definite Integrals
Lesson 27: Evaluating Definite Integrals
 
Matrix calculus
Matrix calculusMatrix calculus
Matrix calculus
 
Formulas
FormulasFormulas
Formulas
 
Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Classification with mixtures of curved Mahalanobis metrics
Classification with mixtures of curved Mahalanobis metricsClassification with mixtures of curved Mahalanobis metrics
Classification with mixtures of curved Mahalanobis metrics
 
Harmonic Analysis and Deep Learning
Harmonic Analysis and Deep LearningHarmonic Analysis and Deep Learning
Harmonic Analysis and Deep Learning
 
A new Perron-Frobenius theorem for nonnegative tensors
A new Perron-Frobenius theorem for nonnegative tensorsA new Perron-Frobenius theorem for nonnegative tensors
A new Perron-Frobenius theorem for nonnegative tensors
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Darmon Points: an Overview
Darmon Points: an OverviewDarmon Points: an Overview
Darmon Points: an Overview
 
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
 
Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...
Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...
Nodal Domain Theorem for the p-Laplacian on Graphs and the Related Multiway C...
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...
 
Lesson 21: Curve Sketching (slides)
Lesson 21: Curve Sketching (slides)Lesson 21: Curve Sketching (slides)
Lesson 21: Curve Sketching (slides)
 
Treewidth and Applications
Treewidth and ApplicationsTreewidth and Applications
Treewidth and Applications
 
Roots equations
Roots equationsRoots equations
Roots equations
 

Ähnlich wie Murphy: Machine learning A probabilistic perspective: Ch.9

Tensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantificationTensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantificationAlexander Litvinenko
 
SOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMS
SOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMSSOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMS
SOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMSTahia ZERIZER
 
Introduction to modern Variational Inference.
Introduction to modern Variational Inference.Introduction to modern Variational Inference.
Introduction to modern Variational Inference.Tomasz Kusmierczyk
 
Error control coding bch, reed-solomon etc..
Error control coding   bch, reed-solomon etc..Error control coding   bch, reed-solomon etc..
Error control coding bch, reed-solomon etc..Madhumita Tamhane
 
Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...Alexander Litvinenko
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionCharles Deledalle
 
Divergence center-based clustering and their applications
Divergence center-based clustering and their applicationsDivergence center-based clustering and their applications
Divergence center-based clustering and their applicationsFrank Nielsen
 
Slides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingSlides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingFrank Nielsen
 
Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Sangwoo Mo
 
Tensor train to solve stochastic PDEs
Tensor train to solve stochastic PDEsTensor train to solve stochastic PDEs
Tensor train to solve stochastic PDEsAlexander Litvinenko
 
On learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodOn learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodFrank Nielsen
 
Hyperfunction method for numerical integration and Fredholm integral equation...
Hyperfunction method for numerical integration and Fredholm integral equation...Hyperfunction method for numerical integration and Fredholm integral equation...
Hyperfunction method for numerical integration and Fredholm integral equation...HidenoriOgata
 
lesson10-thechainrule034slides-091006133832-phpapp01.pptx
lesson10-thechainrule034slides-091006133832-phpapp01.pptxlesson10-thechainrule034slides-091006133832-phpapp01.pptx
lesson10-thechainrule034slides-091006133832-phpapp01.pptxJohnReyManzano2
 
Mathematics for machine learning calculus formulasheet
Mathematics for machine learning calculus formulasheetMathematics for machine learning calculus formulasheet
Mathematics for machine learning calculus formulasheetNishant Upadhyay
 
Group theory notes
Group theory notesGroup theory notes
Group theory notesmkumaresan
 

Ähnlich wie Murphy: Machine learning A probabilistic perspective: Ch.9 (20)

Tensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantificationTensor Train data format for uncertainty quantification
Tensor Train data format for uncertainty quantification
 
Equivariance
EquivarianceEquivariance
Equivariance
 
SOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMS
SOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMSSOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMS
SOLVING BVPs OF SINGULARLY PERTURBED DISCRETE SYSTEMS
 
The integral
The integralThe integral
The integral
 
Introduction to modern Variational Inference.
Introduction to modern Variational Inference.Introduction to modern Variational Inference.
Introduction to modern Variational Inference.
 
Error control coding bch, reed-solomon etc..
Error control coding   bch, reed-solomon etc..Error control coding   bch, reed-solomon etc..
Error control coding bch, reed-solomon etc..
 
Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...
 
stochastic processes assignment help
stochastic processes assignment helpstochastic processes assignment help
stochastic processes assignment help
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - Introduction
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Divergence center-based clustering and their applications
Divergence center-based clustering and their applicationsDivergence center-based clustering and their applications
Divergence center-based clustering and their applications
 
Slides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingSlides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processing
 
Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)
 
Tensor train to solve stochastic PDEs
Tensor train to solve stochastic PDEsTensor train to solve stochastic PDEs
Tensor train to solve stochastic PDEs
 
On learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodOn learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihood
 
Hyperfunction method for numerical integration and Fredholm integral equation...
Hyperfunction method for numerical integration and Fredholm integral equation...Hyperfunction method for numerical integration and Fredholm integral equation...
Hyperfunction method for numerical integration and Fredholm integral equation...
 
lesson10-thechainrule034slides-091006133832-phpapp01.pptx
lesson10-thechainrule034slides-091006133832-phpapp01.pptxlesson10-thechainrule034slides-091006133832-phpapp01.pptx
lesson10-thechainrule034slides-091006133832-phpapp01.pptx
 
Mathematics for machine learning calculus formulasheet
Mathematics for machine learning calculus formulasheetMathematics for machine learning calculus formulasheet
Mathematics for machine learning calculus formulasheet
 
A Note on TopicRNN
A Note on TopicRNNA Note on TopicRNN
A Note on TopicRNN
 
Group theory notes
Group theory notesGroup theory notes
Group theory notes
 

Mehr von Daisuke Yoneoka

Sequential Kernel Association Test (SKAT) for rare and common variants
Sequential Kernel Association Test (SKAT) for rare and common variantsSequential Kernel Association Test (SKAT) for rare and common variants
Sequential Kernel Association Test (SKAT) for rare and common variantsDaisuke Yoneoka
 
Higher criticism, SKAT and SKAT-o for whole genome studies
Higher criticism, SKAT and SKAT-o for whole genome studiesHigher criticism, SKAT and SKAT-o for whole genome studies
Higher criticism, SKAT and SKAT-o for whole genome studiesDaisuke Yoneoka
 
Deep directed generative models with energy-based probability estimation
Deep directed generative models with energy-based probability estimationDeep directed generative models with energy-based probability estimation
Deep directed generative models with energy-based probability estimationDaisuke Yoneoka
 
Supervised PCAとその周辺
Supervised PCAとその周辺Supervised PCAとその周辺
Supervised PCAとその周辺Daisuke Yoneoka
 
ML: Sparse regression CH.13
 ML: Sparse regression CH.13 ML: Sparse regression CH.13
ML: Sparse regression CH.13Daisuke Yoneoka
 
セミパラメトリック推論の基礎
セミパラメトリック推論の基礎セミパラメトリック推論の基礎
セミパラメトリック推論の基礎Daisuke Yoneoka
 
第七回統計学勉強会@東大駒場
第七回統計学勉強会@東大駒場第七回統計学勉強会@東大駒場
第七回統計学勉強会@東大駒場Daisuke Yoneoka
 
第二回統計学勉強会@東大駒場
第二回統計学勉強会@東大駒場第二回統計学勉強会@東大駒場
第二回統計学勉強会@東大駒場Daisuke Yoneoka
 
第四回統計学勉強会@東大駒場
第四回統計学勉強会@東大駒場第四回統計学勉強会@東大駒場
第四回統計学勉強会@東大駒場Daisuke Yoneoka
 
第五回統計学勉強会@東大駒場
第五回統計学勉強会@東大駒場第五回統計学勉強会@東大駒場
第五回統計学勉強会@東大駒場Daisuke Yoneoka
 
第三回統計学勉強会@東大駒場
第三回統計学勉強会@東大駒場第三回統計学勉強会@東大駒場
第三回統計学勉強会@東大駒場Daisuke Yoneoka
 
第一回統計学勉強会@東大駒場
第一回統計学勉強会@東大駒場第一回統計学勉強会@東大駒場
第一回統計学勉強会@東大駒場Daisuke Yoneoka
 
ブートストラップ法とその周辺とR
ブートストラップ法とその周辺とRブートストラップ法とその周辺とR
ブートストラップ法とその周辺とRDaisuke Yoneoka
 
Rで学ぶデータサイエンス第13章(ミニマックス確率マシン)
Rで学ぶデータサイエンス第13章(ミニマックス確率マシン)Rで学ぶデータサイエンス第13章(ミニマックス確率マシン)
Rで学ぶデータサイエンス第13章(ミニマックス確率マシン)Daisuke Yoneoka
 
Rで学ぶデータサイエンス第1章(判別能力の評価)
Rで学ぶデータサイエンス第1章(判別能力の評価)Rで学ぶデータサイエンス第1章(判別能力の評価)
Rで学ぶデータサイエンス第1章(判別能力の評価)Daisuke Yoneoka
 

Mehr von Daisuke Yoneoka (19)

MCMC法
MCMC法MCMC法
MCMC法
 
PCA on graph/network
PCA on graph/networkPCA on graph/network
PCA on graph/network
 
Sequential Kernel Association Test (SKAT) for rare and common variants
Sequential Kernel Association Test (SKAT) for rare and common variantsSequential Kernel Association Test (SKAT) for rare and common variants
Sequential Kernel Association Test (SKAT) for rare and common variants
 
Higher criticism, SKAT and SKAT-o for whole genome studies
Higher criticism, SKAT and SKAT-o for whole genome studiesHigher criticism, SKAT and SKAT-o for whole genome studies
Higher criticism, SKAT and SKAT-o for whole genome studies
 
Deep directed generative models with energy-based probability estimation
Deep directed generative models with energy-based probability estimationDeep directed generative models with energy-based probability estimation
Deep directed generative models with energy-based probability estimation
 
独立成分分析 ICA
独立成分分析 ICA独立成分分析 ICA
独立成分分析 ICA
 
Supervised PCAとその周辺
Supervised PCAとその周辺Supervised PCAとその周辺
Supervised PCAとその周辺
 
Sparse models
Sparse modelsSparse models
Sparse models
 
ML: Sparse regression CH.13
 ML: Sparse regression CH.13 ML: Sparse regression CH.13
ML: Sparse regression CH.13
 
セミパラメトリック推論の基礎
セミパラメトリック推論の基礎セミパラメトリック推論の基礎
セミパラメトリック推論の基礎
 
第七回統計学勉強会@東大駒場
第七回統計学勉強会@東大駒場第七回統計学勉強会@東大駒場
第七回統計学勉強会@東大駒場
 
第二回統計学勉強会@東大駒場
第二回統計学勉強会@東大駒場第二回統計学勉強会@東大駒場
第二回統計学勉強会@東大駒場
 
第四回統計学勉強会@東大駒場
第四回統計学勉強会@東大駒場第四回統計学勉強会@東大駒場
第四回統計学勉強会@東大駒場
 
第五回統計学勉強会@東大駒場
第五回統計学勉強会@東大駒場第五回統計学勉強会@東大駒場
第五回統計学勉強会@東大駒場
 
第三回統計学勉強会@東大駒場
第三回統計学勉強会@東大駒場第三回統計学勉強会@東大駒場
第三回統計学勉強会@東大駒場
 
第一回統計学勉強会@東大駒場
第一回統計学勉強会@東大駒場第一回統計学勉強会@東大駒場
第一回統計学勉強会@東大駒場
 
ブートストラップ法とその周辺とR
ブートストラップ法とその周辺とRブートストラップ法とその周辺とR
ブートストラップ法とその周辺とR
 
Rで学ぶデータサイエンス第13章(ミニマックス確率マシン)
Rで学ぶデータサイエンス第13章(ミニマックス確率マシン)Rで学ぶデータサイエンス第13章(ミニマックス確率マシン)
Rで学ぶデータサイエンス第13章(ミニマックス確率マシン)
 
Rで学ぶデータサイエンス第1章(判別能力の評価)
Rで学ぶデータサイエンス第1章(判別能力の評価)Rで学ぶデータサイエンス第1章(判別能力の評価)
Rで学ぶデータサイエンス第1章(判別能力の評価)
 

Kürzlich hochgeladen

Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 

Kürzlich hochgeladen (20)

Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 

Murphy: Machine learning A probabilistic perspective: Ch.9

  • 1. CH.9 GENERALIZED LINEAR MODELS AND THE EXPONENTIAL FAMILY D. YONEOKA 9.1. Introduction. Outline. • Property of exponential family • Derive theorem and algorithm for appliucations • How to generate classiffier (GLM) 9.2. The exponential family. Why this is important? • Finite-size sufficient statistics (→ Able to compress information w/o loss) • Existence of conjugate priors (See Sec.9.2.5) • Least set of assumptions subject to some constraints (See Sec.9.2.6) • Core of GLM (See Sec.9.3) • Core of variation inference (See Sec.21.2) 9.2.1. Definition. Let pdf or pmt p(x|θ), for x = (x1, . . . , xm) ∈ χm and θ ∈ Θ ⊆ Rd , and it is said to be exponential family if as follow; p(x|θ) = 1 Z(θ) h(x) exp[θϕ(x)] = h(x) exp[θϕ(x) − A(θ)] = h(x) exp[η(θT )ϕ(x) − A(η(θ))] where Z(θ) = ∫ χm h(x) exp[θϕ(x)] A(θ) = logZ(θ) Let we call; • θ: natural parameter or canonical parameter • ϕ(x): sufficient statistic • Z(θ): partition function • A(θ): log partition function or cumulant function • h(x): scaling constant, offen = 1 • η(θ): mapping of θ to the canonical parameters Note; • If dim(θ) < dim(η(θ)), it is called a curved exponential family, that means we have more sufficient statistics than parameters. • If dim(θ) = dim(η(θ)), it is called canonical form. Natural parameter space is H = {θ : A(θ) < ∞} Written in its canonical form, a density in exponential family has some convexity properties. These convexity properties are useful in manipulating with moments and other functionals of sufficient statistics. 1
  • 2. 2 D. YONEOKA Theorem: H is a convex set Proof. We should prove that A is a convex function. Let 0 α ∞ and take θ and θ1 in H. Write A(αθ + (1 − α)θ1) = log ∫ χm exp {(αθ + (1 − α)θ1)ϕ(x)} h(x)dx = log ∫ χm exp(αθϕ(x))h(x)dx ∫ χm exp((1 − α)θ1ϕ(x))h(x)dx ≤ log [∫ χm ( exp(αθϕ(x) )1 α h(x)dx ]α [∫ χm ( exp((1 − α)θ1ϕ(x) ) 1 1−α h(x)dx ]1−α = αA(θ) + (1 − α)A(θ1) (from Holder’s inequality) ≤ ∞ Thus αA(θ) + (1 − α)A(θ1) ∈ H and the theorem holds. 9.2.2. Example. 9.2.2.1. Bernoulli, i.e., x ∈ {0, 1}. Ber(x|µ) = µx (1 − µ)x = exp[x log(µ) + (1 − x) log(1 − µ)] = exp[ϕ(x)T θ], where ϕ(x) = [I(x = 0), I(x = 1)] and θ = [log(µ), log(1 − µ)]. → but over-complete because I(x = 0) + I(x = 1)] = 1 → θ can not be identifiable. → To aquaire the identifiability, add assumption that θ is minimal Then, Ber(x|µ) = (1 − µ) exp(x log( µ 1 − µ )) Note: We can recover the mean parameter µ from canonical parameter θ = log µ 1 − µ using µ = sigm(θ) = 1 1 + e−θ 9.2.2.2. Multinoulli, i.e., xk = I(x = k). Cat(x|µ) = K∏ k=1 µxk k = exp   K∑ k=1 xk log µk   ⇐⇒ Cat(x|θ) = exp[θT ϕ(x) − A(θ)], where θ = [log µ1 µK , . . . , log µ1 µK ] ϕ(x) = [I(x = 1), . . . , I(x = K − 1)]. As with the above case, we can find as follow; µk = eθk 1 + ∑K−1 j=1 eθj µK = 1 ∑K−1 j=1 eθj A(θ) = log  1 + K−1∑ j=1 eθj   9.2.2.3. Univariate Gaussian. N(x|µ, σ2 ) = 1 (2πσ2)1/2 exp[− 1 2σ2 (x − µ)2 ] = 1 Z(θ) exp(θT ϕ(x)),
  • 3. CH.9 GENERALIZED LINEAR MODELS AND THE EXPONENTIAL FAMILY 3 where θ = ( µ/σ2 −1 2σ2 ) ϕ(x) = ( x x2 ) Z(µ, σ2 ) = √ 2πσ exp[ µ2 2σ2 ] A(θ) = −θ2 1 4θ2 − 1 2 log(−2θ2) − 1 2 log(2π) 9.2.2.4. Non-example. For example, uniform distribution and t-distribution are not exponential family because these can not be expressed in the required form. 9.2.3. Log partition function. A(θ) is called cumulant function, which means derivatives of A can be used to generate the cumulant of the sufficient statistics, i.e., dA dθ = d dθ ( log ∫ exp(θϕ(x))h(x)dx ) = ∫ ϕ(x) exp(θϕ(x))h(x)dx exp(A(θ)) = ∫ ϕ(x) exp(θϕ(x) − A(θ))h(x)dx = ∫ ϕ(x)p(x)dx = E[ϕ(x)] = Expectation of the sufficient statistics d2 A dθ2 = ∫ ϕ(x) exp(θϕ(x) − A(θ))h(x)(ϕ(x) − A ′ (θ))dx = ∫ ϕ(x)p(x)(ϕ(x) − A ′ (θ))dx = ∫ ϕ2 (x)p(x)dx − A ′ (θ) ∫ ϕ(x)p(x)dx = E[ϕ2 (x)] − E[ϕ(x)]2 (∵ A ′ (θ) = dA dθ = E[ϕ(x)]) = Var[ϕ(x)] = Variance of the sufficient statistics 9.2.4. MLE for the exponential family. The likelihood of the exponential family can generally be expressed as follow; p(D|θ) =   N∏ i=1 h(xi)   g(θ)N exp  η(θ)T [ N∑ i=1 ϕ(xi)]   , on which the sufficient statistics indicates ϕ(D) = [ N∑ i=1 ϕ1(xi), . . . , N∑ i=1 ϕK(xi)] Pittman-Koopman-Darmois theorem Under certain condition, the exponential family is the only family of distribution with finite sufficient statistics. Note; One of the condition required in this theorem • The support of the distribution is not depend on the distribution parameters. – e.g., Uniform distribution p(x|θ) = 1 θ I(0 ≤ x ≤ θ) do not fit this condition. – The sufficient statistics; N and max xi – Have finite sufficient statistics but the support of the distribution is depend on the parameter – → Not exponential family Derivation of MLE. • Now we consider only a canonical exponential family, η(θ) = θ, the log likelihood is follow; log p(D|θ) = θT ϕ(D) − NA(θ)
  • 4. 4 D. YONEOKA • Concavity of log likelihood – Second derivative of −A(θ) is non-positive – θT ϕ(D) is linear in θ – → Log likelihood is concave – → Has a unique global maximum Set the gradient of the log likelihood = 0 MLE must satisfy ∇θ log p(D|θ) = ϕ(D) − NE[ϕ(X)] = 0 ⇐⇒ E[ϕ(X] = 1 N N∑ i=1 ϕ(xi), which is called moment matching. 9.2.5. Bayes for the exponential family. Existence of Conjugate prior Conjugate prior exists ⇐⇒ prior p(θ|τ) has the same form as the likelihood p(D|θ) ⇐⇒ prior is exponential family Note: To make sense, we require p(D|θ) = p(s(D)|θ), which means likelihood have finite sufficient statistics. 9.2.5.1. Likelihood. The likelihood of exponential family is given by p(D|θ) ∝ g(θ)N exp(η(θ)T sN), where sN = ∑N i=1. or p(D|θ) ∝ exp(NηT ¯s) − NA(η), where ¯s = 1 N sN 9.2.5.2. Prior. The natural conjugate prior has the form p(D|ν0, τ0) ∝ g(θ)ν0 exp(η(θ)T τ0) or by setting τ0 = ν0 ¯τ0 we get p(D|ν0, τ0) ∝ exp(ν0ηT ˆτ0 − ν0A(η)) 9.2.5.3. Posterior. The posterior is given by p(D|θ) = p(D|νN, τN) = p(D|ν0 + N, τ0 + sN). In the canonical form, this becomes p(D|θ) ∝ exp(η(ν0 ¯τ0 + N ¯s) − (ν0 + N)A(η)) = p(η|ν0 + N, ν0 ¯τ0 + N ¯s ν0 + N ) Note: the posterior hyper-parameters are • convex combination of the prior mean hyper-parameters • average of the sufficient statistics 9.2.5.4. Posterior predictive density. Define • the future data as D ′ = ( ˜x1, . . . , ˜xN′ ) • past data as D = (x1, . . . , xN′ ) • ˜τ0 = (ν0, τ0) • ˜s(D) = (N, s(D)) • ˜s(D ′ ) = (N ′ , s(D ′ ))
  • 5. CH.9 GENERALIZED LINEAR MODELS AND THE EXPONENTIAL FAMILY 5 So the prior becomes p(θ| ˜τ0) = 1 Z( ˜τ0) g(θ)ν0 exp(η(θ)T τ0) Hence p(D ′ |D) = ∫ p(D ′ |θ)p(θ|D)dθ =   N ′ ∏ i=1 h( ˜xi)   Z( ˜τ0 + ˜s(D))−1 ∫ g(θ)ν0+N+N ′ exp   ∑ k ηk(θ)(τk + N∑ i=1 sk(xi) + N ′ ∑ i=1 sk( ˜xi)))   dθ =   N ′ ∏ i=1 h( ˜xi)   Z( ˜τ0 + ˜s(D) + ˜s(D ′ )) Z( ˜τ0 + ˜s(D)) 9.2.5.5. Eample: Bernoulli distribution. The likelihood is p(D|θ) = (1 − θ)N exp  log( θ 1 − θ ) ∑ i xi   Hence conjugate prior is p(θ|ν0, τ0) ∝ (1 − θ)ν 0 exp ( log( θ 1 − θ )τ0 ) = θτ0 (1 − θ)ν0−τ0 So the posterior is p(θ|D) ∝ θτ0+s (1 − θ)ν0−τ0+n−s = θτn (1 − θ)νn−τn , where s = ∑ i I(xi = 1) is sufficient statistics (when bernoulli, it means num of heads). How to derive posterior predictive distribution. Assume p(θ) = Beta(θ|α, β) and let s = s(D)and future data as D ′ = ( ˜x1, . . . , ˜xm) and s ′ = ∑m i=1 I(˜xi = 1) p(D ′ |D) = ∫ 1 0 p(D ′ |θ|Beta(α, β))dθ = Γ(αn + βn) Γ(αn)Γ(βn) ∫ 1 0 θα0+s′ (1 − θ)βn+m−s′ −1 dθ = Γ(αn + βn) Γ(αn)Γ(βn) Γ(αn+m)Γ(βn+m) Γ(αn+m + βn+m) 9.2.6. Maximum entropy derivation of the exponential family. Explain one of justification for use of exponential family. The principal of maximum entropy or maxent We should pick up the distribution with maximum entropy ”− ∑n i=1 pi log pi”, subject to the constraints that the moments of the distribution match the empirical moments of the specified functions. Suppose all we know is ∑ x fk(x)p(x) = Fk. Set the constraint as p(x) 0, ∑ x p(x) = 1and Lagrangian to minimize the entropy as J(p, λ) − ∑ x p(x) log p(x) + λ0(1 − ∑ x ) + ∑ x λk(Fk − ∑ x p(x) fk(x)) Setting ∂J ∂p(x) = 0 yields p(x) = 1 exp(1 + λ0) exp(− ∑ k λk fk(x)) Using ∑ x p(x) = 1, we have 1 = ∑ x p(x) = 1 exp(1 + λ0) ∑ x exp(− ∑ k λk fk(x))
  • 6. 6 D. YONEOKA Hence the normalization constant Z = exp(1 + λ0) is given by Z = ∑ x exp(− ∑ k λk fk(x)), which means the maxent distribution p(x) has the form of the exponential family, also known as the Gibbs distribution. 9.3. Generalized linear model (GLM). • Linear and logistic regression are one of example of GLM (McCullagh and Nelder 1989) • Models in which the output density is int the exponential family 9.3.1. Basics. Unconditional distribution for scalar response variable: p(y|θ, σ2 ) = exp [ yiθ − A(θ) σ2 + c(yi, σ2 ) ] , where • σ2 is the dispersion parameter • θ is the natural parameter • A is the partition function • c is the normalization constant To convert from the mean parameter to the natural parameter, we can use a link function θ = Ψ(µ), which function is uniquely determined by the form of the exponential family distribution. In addition to that, • Link function: µi = g−1 (wT xi) • Mean function: g(µi) = wT xi • when g = Ψ, it is called the canonical link function 9.3.2. ML and MAP estimation. The log likelohood has the following form: l(w) = log p(D|w) = 1 σ2 N∑ i=1 li, where li θiyi − A(θi). We can compute the gradient vector using the chain rule as follow: dli dwi = dli dθi dθi dµi dµi dηi dηi dwi = (yi − A′ (θi)) dθi dµi dµi dηi xij = (yi − µi) dθi dµi dµi dηi xij If we use a canonical link, θi = ηi, this simplifies to ∇wl(w) = 1 σ2   N∑ i=1 (yi − µi)xi   . In addition to that, for improved efficiency, we should use a second-order method. If we used a canonical link, the Hessian is given by H = 1 σ2 N∑ i=1 dµi dθi xixT i = 1 σ2 XT SX, where S = diag( dµ1 dθ1 , . . . , dµN dθN ) is a diagonal weighting matrix. Specifically, we have the following Newton update: wt+1 = (XT StX)−1 XT Stzt zt = θt + S−1 t (y − µt), where θt = Xwt and µt = g−1 (ηt). Note1: If we extend to handle non-canonical links
Note 1: If we extend to handle non-canonical links,
• The Hessian takes a different form
• The expected Hessian (the Fisher information matrix) has the same form as (9.92)
  – Using the expected Hessian instead of the actual one is called the Fisher scoring method

Note 2: To perform MAP estimation with a Gaussian prior, see Section 8.3.6.

9.3.3. Bayesian inference.
When a GLM is estimated in a Bayesian manner, inference is usually done with MCMC. See Dey et al. 2000 for details.

9.4. Probit regression.
In this section, we focus on probit regression, where

g^{-1}(η) = Φ(η) = ∫_{−∞}^{η} (1/√(2π)) exp(−t²/2) dt

9.4.1. ML/MAP estimation using gradient-based optimization.
Let µi = w^T xi and ỹi ∈ {−1, +1}. The gradient of the log likelihood for one case is

gi ≜ (d/dw) log p(ỹi|w^T xi) = (dµi/dw) (d/dµi) log p(ỹi|w^T xi) = xi ỹi ϕ(µi) / Φ(ỹi µi)

and the Hessian is

Hi = (d²/dw²) log p(ỹi|w^T xi) = −xi ( ϕ(µi)² / Φ(ỹi µi)² + ỹi µi ϕ(µi) / Φ(ỹi µi) ) xi^T

Note: If we use a prior p(w) = N(0, V0),
• The gradient of the penalized log likelihood: ∑_i gi + 2 V0^{-1} w
• The Hessian of the penalized log likelihood: ∑_i Hi + 2 V0^{-1}

9.4.2. Latent variable interpretation.
Assume we can observe only an action, which is decided based on latent utilities:

u0i ≜ w0^T xi + δ0i
u1i ≜ w1^T xi + δ1i
yi = I(u1i > u0i),

where the δ's are error terms. This is called a random utility model or RUM (McFadden 1974; Train 2009).

If the δ's have a Gaussian distribution, then so does ϵi ≜ δ1i − δ0i. Defining zi ≜ u1i − u0i, we can rewrite the model as

zi ≜ w^T xi + ϵi, ϵi ∼ N(0, 1)
yi = 1 ⟺ zi ≥ 0

This is called a difference RUM or dRUM model. When we marginalize out zi, we recover the probit model:

p(yi = 1|xi, w) = ∫ I(zi ≥ 0) N(zi|w^T xi, 1) dzi = p(w^T xi + ϵ ≥ 0) = p(ϵ ≥ −w^T xi) = 1 − Φ(−w^T xi) = Φ(w^T xi)

Note: Interestingly, if we use a Gumbel distribution for the δ's, we induce a logistic distribution for ϵi, and the model reduces to logistic regression.

9.4.3. Ordinal probit regression.
One advantage of the latent variable interpretation is its extendability to ordinal probit regression, i.e., it is easy to extend to the case where the response variable is ordinal, that is, where it can take on C discrete values.

Example: If C = 3, partition the real line into 3 intervals: (−∞, 0], (0, γ], (γ, ∞). We can vary the parameter γ to ensure that the right relative amount of probability mass falls in each interval, so as to match the empirical frequencies of each class label.
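A small sketch of gradient-based fitting for probit regression using the per-case gradient above (assuming SciPy is available; the step size, ridge strength lam, and function names are illustrative assumptions). The ratio ϕ(µi)/Φ(ỹi µi) is computed in log space for numerical stability, and the MAP variant simply adds the gradient of a N(0, (1/lam) I) log prior, a convenient stand-in for a generic Gaussian prior.

import numpy as np
from scipy.stats import norm

def probit_loglik_grad(w, X, y_pm):
    """Log likelihood and gradient for probit regression with labels in {-1, +1}.

    Follows the expressions above: log p = sum_i log Phi(y_i mu_i), with
    grad_i = x_i * y_i * phi(mu_i) / Phi(y_i mu_i), where mu_i = w^T x_i."""
    mu = X @ w
    z = y_pm * mu
    loglik = norm.logcdf(z).sum()
    ratio = np.exp(norm.logpdf(mu) - norm.logcdf(z))   # phi(mu_i) / Phi(y_i mu_i)
    grad = X.T @ (y_pm * ratio)
    return loglik, grad

def fit_probit_map(X, y_pm, lam=1e-2, lr=1e-2, n_iter=2000):
    """Plain gradient ascent on the log posterior with a N(0, (1/lam) I) prior."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        _, g = probit_loglik_grad(w, X, y_pm)
        w += lr * (g - lam * w)   # gradient of the Gaussian log prior is -lam * w
    return w

# Usage (illustrative): w_hat = fit_probit_map(X, np.where(y == 1, 1, -1))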
9.4.4. Multinomial probit models.
Consider the case where the response variable can take on C unordered categorical values, yi ∈ {1, . . . , C}. The multinomial probit model is defined as follows:

zic = w^T xic + ϵic
ϵ ∼ N(0, R)
yi = argmax_c zic

Since only relative utilities matter, we constrain R to be a correlation matrix.

Note: If we use yic = I(zic > 0) instead of yi = argmax_c zic, we get a model known as multivariate probit.

9.5. Multi-task learning.
If we can assume the input–output mapping is similar across models, we can get better performance by fitting all the parameters at the same time. This is also called
• multi-task learning (Caruana 1998)
• transfer learning (Raina et al. 2005)
• learning to learn (Thrun and Pratt 1997)

In statistics, this is tackled using hierarchical Bayesian models (Bakker and Heskes 2003), although there are other possible methods (Chai 2010).

9.5.1. Hierarchical Bayes for multi-task learning.
Let yij be the response of the i'th item in group j, for i = 1 : Nj and j = 1 : J.

The goal of multi-task learning: fit the models p(yj|xj) for all j.

Suppose E[yij|xij] = g(xij^T βj), βj ∼ N(β*, σj² I), and β* ∼ N(µ, σ*² I). In addition, assume for simplicity that µ = 0 and that σj² and σ*² are all known. The overall log probability has the form

log p(D|β) + log p(β) = ∑_j [ log p(Dj|βj) − ||βj − β*||² / (2σj²) ] − ||β*||² / (2σ*²)

We can perform MAP estimation of β = (β_{1:J}, β*) using standard gradient methods. Alternatively, we can use an iterative optimization method.

Note:
• The likelihood and prior are convex, so we are guaranteed to converge to the global optimum
• Once the models are trained, we can discard β* and use each model separately

9.5.2. Application to personalized email spam filtering.
The goal is to find each βj in order to filter each individual user's spam email. We can make two copies of each feature xi, one concatenated with the user id and one not:

E[yi|xi, u] = (β*, w1, . . . , wJ)^T [xi, I(u = 1) xi, . . . , I(u = J) xi],

where u is the user id. In other words,

E[yi|xi, u = j] = (β* + wj)^T xi.

Thus β* will be estimated from everyone's email, whereas wj will be estimated only from user j's email. To connect this with the hierarchical Bayesian model above, define wj = βj − β*. Then the log probability of the original model can be rewritten as

∑_j [ log p(Dj|β* + wj) − ||wj||² / (2σj²) ] − ||β*||² / (2σ*²)

Note: If we assume σj² = σ*², the effect is the same as using the augmented feature trick with the same regularizer strength for both wj and β*. However, one typically gets better performance by not requiring that σj² be equal to σ*² (Finkel and Manning 2009).
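A sketch of MAP estimation for the hierarchical multi-task model above, assuming a Gaussian likelihood with unit noise variance and a single σj² shared across groups for brevity (all function names and defaults are illustrative, not from the text). It simply minimizes the negative of the log probability written above with a generic optimizer.

import numpy as np
from scipy.optimize import minimize

def fit_multitask_map(Xs, ys, sigma2_j=1.0, sigma2_star=1.0):
    """MAP estimate of (beta_1..J, beta_*) for the hierarchical linear model above.

    Minimizes sum_j [ ||y_j - X_j beta_j||^2 / 2 + ||beta_j - beta_*||^2 / (2 sigma2_j) ]
              + ||beta_*||^2 / (2 sigma2_star),
    i.e. the negative of the log probability above, under a Gaussian likelihood
    with unit noise variance (an illustrative choice)."""
    J, d = len(Xs), Xs[0].shape[1]

    def objective(theta):
        betas = theta[:J * d].reshape(J, d)
        beta_star = theta[J * d:]
        obj = np.sum(beta_star ** 2) / (2 * sigma2_star)
        for X, y, b in zip(Xs, ys, betas):
            obj += 0.5 * np.sum((y - X @ b) ** 2)          # per-group data fit
            obj += np.sum((b - beta_star) ** 2) / (2 * sigma2_j)  # shrink toward beta_*
        return obj

    res = minimize(objective, np.zeros(J * d + d), method="L-BFGS-B")
    return res.x[:J * d].reshape(J, d), res.x[J * d:]

# Usage (illustrative): betas, beta_star = fit_multitask_map([X1, X2], [y1, y2])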
9.5.3. Application to domain adaptation.
Domain adaptation is the problem of training a set of classifiers on data drawn from different distributions. (Finkel and Manning 2009) used the above hierarchical Bayesian model for two NLP tasks. They report
• Reasonably large improvements over fitting separate models to each dataset
• Small improvements over the approach of pooling all the data and fitting a single model

9.5.4. Other kinds of prior.
Let us consider using priors other than Gaussians. For example, consider the task of conjoint analysis. We can use multi-task feature selection (Lenk et al. 1996; Argyriou et al. 2008): we use a sparsity-promoting prior on βj rather than a Gaussian prior.

Negative transfer: if we pool the parameters across tasks that are qualitatively different, performance will be worse than not pooling at all. To overcome this problem, use a more flexible prior, such as a mixture of Gaussians, which is more robust against prior misspecification (Xue et al. 2007; Jacob et al. 2008).

9.6. Generalized linear mixed models.
Similarly to the above, we can allow the parameters to vary across groups, βj, or to be tied across groups, α, which gives the form

E[yij|xij, xj] = g( ϕ1(xij)^T βj + ϕ2(xj)^T β'j + ϕ3(xij)^T α + ϕ4(xj)^T α' )

This is a GLMM. Frequentists call the terms βj random effects, since they vary randomly across groups, whereas they call α a fixed effect.

9.6.1. Example: semi-parametric GLMM for medical data.
Suppose yij is the amount of spinal bone mineral density (SBMD) for person j at measurement i. Let xij be the person's age, and let xj be their ethnicity. Here we use a semi-parametric model, which combines linear regression with non-parametric regression (Ruppert et al. 2003), because we also see variation across individuals within each group. Specifically, we will use
• ϕ1(xij) = 1 to account for the random effect of each person
• ϕ2(xj) = 0, since no other coefficients are person-specific
• ϕ3(xij) = [bk(xij)], where bk is the k'th spline basis function (see Section 15.4.6.2), to account for the nonlinear effect of age
• ϕ4(xj) = [I(xj = w), I(xj = a), I(xj = b), I(xj = h)] to account for the effect of the different ethnicities
• a linear link function

The overall model (a design-matrix sketch follows at the end of this subsection) is

E[yij|xij, xj] = βj + α^T b(xij) + α'w I(xj = w) + α'a I(xj = a) + α'b I(xj = b) + α'h I(xj = h) + ϵij,

where ϵij ∼ N(0, σy²). This means
• α contains the non-parametric part of the model, related to age
• α' contains the parametric part of the model, related to ethnicity
• βj is a random offset for person j

We can perform posterior inference to compute p(α, α', β, σ²|D), where the priors on the regression coefficients are Gaussian (Sec. 9.6.2). We can also perform significance testing, by computing p(α'g − α'w|D) for each ethnic group g relative to some baseline (here, white).
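To make the semi-parametric GLMM design concrete, here is a sketch that builds the three blocks of covariates and fits them by penalized least squares, which corresponds to MAP estimation under Gaussian priors with a linear link (the truncated-linear spline basis, the penalty strengths, and the group codes are illustrative assumptions, not the book's exact choices).

import numpy as np

def glmm_design(age, person_id, ethnicity, knots, groups=("w", "a", "b", "h")):
    """Builds the design blocks for the SBMD-style semi-parametric GLMM above.

    Blocks (illustrative layout): per-person intercepts (the random offsets beta_j),
    a truncated-linear spline basis in age (the non-parametric effect alpha), and
    ethnicity indicators (the parametric effect alpha'). The basis
    b_k(x) = max(x - knot_k, 0) stands in for the spline basis of Section 15.4.6.2."""
    age = np.asarray(age, dtype=float)
    persons = np.unique(person_id)
    Z = (np.asarray(person_id)[:, None] == persons).astype(float)            # random intercepts
    B = np.column_stack([age] + [np.maximum(age - k, 0.0) for k in knots])   # spline in age
    E = np.column_stack([(np.asarray(ethnicity) == g).astype(float) for g in groups])
    return Z, B, E

def fit_penalized(y, Z, B, E, lam_person=10.0, lam_spline=1.0):
    """Ridge-style MAP fit: shrink the random offsets beta_j (and the spline weights)."""
    X = np.hstack([Z, B, E])
    pen = np.concatenate([np.full(Z.shape[1], lam_person),
                          np.full(B.shape[1], lam_spline),
                          np.zeros(E.shape[1])])
    return np.linalg.solve(X.T @ X + np.diag(pen), X.T @ y)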
9.6.2. Computational issues.
Difficulties in GLMMs:
• The likelihood p(yij|θ) may not be conjugate to the prior p(θ), where θ = (α, β)
• There are two levels of unknowns in the model, namely the regression coefficients θ and the parameters of the prior, η = (µ, σ)

For fully Bayesian inference, one can adopt variational Bayes (Hall et al. 2011; Sec. 21.5) or MCMC (Gelman and Hill 2007; Sec. 24.1). Alternatively, one can use empirical Bayes. In the context of a GLMM, the EM algorithm is useful, where
• E step: compute p(θ|η, D)
• M step: optimize η

Note:
• The E step needs an approximation
  – Numerical quadrature
  – Monte Carlo (Breslow and Clayton 1993); see the toy sketch below
• A faster approach is to use variational EM (Braun and McAuliffe 2010)
• Among frequentists, GEE (generalized estimating equations) is popular
  – Not recommended, because it is not as statistically efficient as likelihood-based methods (see Section 6.4.3)
  – It provides only estimates of the population parameters α, but not of the random effects βj
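A toy Monte Carlo EM sketch of the empirical Bayes recipe above, using a Gaussian random-intercept model so every step stays simple (the model, sample sizes, and names are illustrative assumptions; a real GLMM would need the Monte Carlo or quadrature approximation precisely because p(θ|η, D) is not available in closed form).

import numpy as np

def mcem_random_intercept(y_groups, sigma2_y=1.0, n_samples=200, n_iter=50, seed=0):
    """Monte Carlo EM sketch for empirical Bayes in a toy random-intercept model.

    Model: y_ij ~ N(theta_j, sigma2_y), theta_j ~ N(mu, sigma2); eta = (mu, sigma2)
    is optimized. Here p(theta_j | eta, D_j) happens to be Gaussian, so sampling is
    only used to illustrate the structure of a Monte Carlo E step."""
    rng = np.random.default_rng(seed)
    mu, sigma2 = 0.0, 1.0
    for _ in range(n_iter):
        # E step: draw samples from p(theta_j | eta, D_j) for every group j.
        samples = []
        for y in y_groups:
            prec = 1.0 / sigma2 + len(y) / sigma2_y
            mean = (mu / sigma2 + np.sum(y) / sigma2_y) / prec
            samples.append(rng.normal(mean, np.sqrt(1.0 / prec), size=n_samples))
        samples = np.stack(samples)                # shape (J, n_samples)
        # M step: maximize the expected complete-data log likelihood over eta.
        mu = samples.mean()
        sigma2 = np.mean((samples - mu) ** 2)
    return mu, sigma2

# Usage (illustrative): mu_hat, sigma2_hat = mcem_random_intercept([y1, y2, y3])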