Introduction
Binary Variables
Multinomial Variables
The Gaussian Distribution
The Exponential Family
Nonparametric Methods
Probability Distributions for ML
Sung-Yub Kim
Dept of IE, Seoul National University
January 29, 2017
Sung-Yub Kim Probability Distributions for ML
Bishop, C. M. Pattern Recognition and Machine Learning. Information Science and Statistics, Springer, 2006.
Murphy, K. P. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning, MIT Press, 2012.
Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.
Purpose: density estimation.
Assumption: the data points are independent and identically distributed (i.i.d.).
Parametric and Nonparametric
Parametric estimation is more intuitive but makes very strong assumptions about the form of the distribution. Nonparametric estimation also has parameters, but they control model complexity rather than the form of the distribution.
Bernoulli and Binomial Distributions
MLE of the Bernoulli parameter
The Beta Distribution
Bayesian inference on binary variables
Difference between prior and posterior
Bernoulli Distribution (Ber(θ))
The Bernoulli distribution has a single parameter θ, the success probability of a trial. Its PMF is

Ber(x|θ) = θ^{I(x=1)} (1 − θ)^{I(x=0)}

Binomial Distribution (Bin(n, θ))
The binomial distribution has two parameters: n, the number of trials, and θ, the success probability. Its PMF is

Bin(k|n, θ) = (n choose k) θ^k (1 − θ)^{n−k}
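As a minimal sketch (not part of the original slides), the two PMFs can be written directly in Python:

```python
from math import comb

def bernoulli_pmf(x, theta):
    # Ber(x|theta) = theta^{I(x=1)} (1 - theta)^{I(x=0)}
    return theta if x == 1 else 1.0 - theta

def binomial_pmf(k, n, theta):
    # Bin(k|n, theta) = C(n, k) theta^k (1 - theta)^{n-k}
    return comb(n, k) * theta**k * (1.0 - theta)**(n - k)
```

Summing `binomial_pmf` over k = 0, ..., n returns 1, as a PMF must.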
Likelihood of the Data
By the i.i.d. assumption, we get

p(D|µ) = Π_{n=1}^N p(x_n|µ) = Π_{n=1}^N µ^{x_n} (1 − µ)^{1−x_n}   (1)

Log-likelihood of the Data
Taking the logarithm, we get

ln p(D|µ) = Σ_{n=1}^N ln p(x_n|µ) = Σ_{n=1}^N {x_n ln µ + (1 − x_n) ln(1 − µ)}   (2)

MLE
Since a maximizer is a stationary point, we get

µ_ML := µ̂ = (1/N) Σ_{n=1}^N x_n   (3)
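A minimal sketch (not from the original slides) of Eqs. (2) and (3): the MLE is the sample mean, and it does maximize the log-likelihood.

```python
from math import log

def log_likelihood(data, mu):
    # Eq. (2): ln p(D|mu) = sum_n [x_n ln mu + (1 - x_n) ln(1 - mu)]
    return sum(x * log(mu) + (1 - x) * log(1 - mu) for x in data)

def bernoulli_mle(data):
    # Eq. (3): the stationary point of the log-likelihood is the sample mean
    return sum(data) / len(data)
```

For example, `bernoulli_mle([1, 0, 1, 1])` gives 0.75, and the log-likelihood at 0.75 exceeds the log-likelihood at any other value of µ.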
Prior Distribution
The weak point of MLE is that it can overfit the data. To overcome this deficiency, we introduce a prior distribution. At the same time, the prior should have a simple interpretation and useful analytical properties.
Conjugate Prior
A conjugate prior for a likelihood is a prior such that, given that likelihood, the posterior belongs to the same family of distributions as the prior. In this case we need a prior proportional to powers of µ and (1 − µ), so we choose the Beta distribution

Beta(µ|a, b) = [Γ(a + b) / (Γ(a)Γ(b))] µ^{a−1} (1 − µ)^{b−1}   (4)

The Beta distribution has two parameters a, b, which count how many times each class occurred (the effective number of observations). One can easily verify that the posterior is again a Beta distribution.
Posterior Distribution
A short calculation gives

p(µ|m, l, a, b) = [Γ(m + l + a + b) / (Γ(m + a)Γ(l + b))] µ^{m+a−1} (1 − µ)^{l+b−1}   (5)

where m and l are the observed counts of the two classes.
Bayesian Inference
Now we can perform Bayesian inference on binary variables. We want to know

p(x = 1|D) = ∫_0^1 p(x = 1|µ) p(µ|D) dµ = ∫_0^1 µ p(µ|D) dµ = E[µ|D]   (6)

Therefore we get

p(x = 1|D) = (m + a) / (m + a + l + b)   (7)

If the observed counts m and l are sufficiently large, this estimate is asymptotically identical to the MLE; this property is very general.
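Eq. (7) is one line of code; the example below (a sketch, not from the slides) also shows the asymptotic agreement with the MLE for large counts.

```python
def posterior_predictive(m, l, a, b):
    # Eq. (7): p(x=1|D) = (m + a) / (m + a + l + b)
    # m, l: observed counts of the two classes; a, b: Beta prior pseudo-counts
    return (m + a) / (m + a + l + b)
```

With a uniform Beta(1, 1) prior, 6 successes and 4 failures give 7/12; with 6000 and 4000 the answer is already within 0.001 of the MLE 0.6.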
Since

E_θ[θ] = E_D[E_θ[θ|D]]   (8)

we know that the posterior mean of θ, averaged over the distribution generating the data, is equal to the prior mean of θ.
Also, since

Var_θ[θ] = E_D[Var_θ[θ|D]] + Var_D[E_θ[θ|D]]   (9)

we know that, on average, the posterior variance of θ is smaller than the prior variance.
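Eq. (8) can be verified exactly for the Beta-Bernoulli model by enumerating all possible datasets: averaging the posterior mean over the marginal distribution of the data recovers the prior mean. The numbers a, b, N below are illustrative assumptions.

```python
from math import comb, gamma

def beta_fn(a, b):
    # Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

a, b, N = 2.0, 3.0, 5
prior_mean = a / (a + b)

def marginal(m):
    # Marginal probability of m successes in N trials under the Beta(a, b)
    # prior (the Beta-binomial distribution)
    return comb(N, m) * beta_fn(m + a, N - m + b) / beta_fn(a, b)

def post_mean(m):
    # Posterior mean given m successes, Eq. (7) with l = N - m
    return (m + a) / (N + a + b)

# E_D[E[theta|D]]: average the posterior mean over all datasets
avg_post_mean = sum(marginal(m) * post_mean(m) for m in range(N + 1))
```

`avg_post_mean` equals `prior_mean` = 0.4 up to floating-point error, illustrating Eq. (8).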
Multinomial and Multinoulli Distributions
MLE of the Multinoulli parameters
The Dirichlet Distribution and Bayesian Inference
Multinomial Distribution (Mu(x|n, θ))
The multinomial differs from the binomial in the dimension of the output and of θ. In the binomial, k is the number of successes; in the multinomial, the j-th entry of x counts how many times state j occurred. Hence the binomial is the multinomial with x and θ of dimension 2.

Mu(x|n, θ) = (n choose x_0, ..., x_{K−1}) Π_{j=0}^{K−1} θ_j^{x_j}

Multinoulli Distribution (Mu(x|1, θ))
Sometimes we are interested in the special case n = 1, called the multinoulli (categorical) distribution:

Mu(x|1, θ) = Π_{j=0}^{K−1} θ_j^{I(x_j=1)}
Likelihood of the Data
By the i.i.d. assumption, we get

p(D|µ) = Π_{n=1}^N Π_{k=1}^K µ_k^{x_{nk}} = Π_{k=1}^K µ_k^{Σ_n x_{nk}} = Π_{k=1}^K µ_k^{m_k}   (10)

where m_k = Σ_n x_{nk} (the sufficient statistics).
Log-likelihood of the Data
Taking the logarithm, we get

ln p(D|µ) = Σ_{k=1}^K m_k ln µ_k   (11)

MLE
Therefore, for the MLE we need to solve the constrained optimization problem

max { Σ_{k=1}^K m_k ln µ_k | Σ_{k=1}^K µ_k = 1 }   (12)
MLE (cont.)
We already know that a stationary point of the Lagrangian is a necessary condition for a constrained optimum. Therefore

∇_µ L(µ; λ) = 0, ∂L(µ; λ)/∂λ = 0   (13)

where

L(µ; λ) = Σ_{k=1}^K m_k ln µ_k + λ(Σ_{k=1}^K µ_k − 1)   (14)

Therefore, we get

µ_k^{ML} = m_k / N   (15)
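Eq. (15) says the multinoulli MLE is just normalized counts; a minimal sketch (not from the slides), with observations encoded as state indices 0, ..., K−1 rather than one-hot vectors:

```python
from collections import Counter

def multinoulli_mle(data, K):
    # Eq. (15): mu_k^{ML} = m_k / N, where m_k counts occurrences of state k
    counts = Counter(data)
    N = len(data)
    return [counts[k] / N for k in range(K)]
```

For example, the observations [0, 1, 1, 2] with K = 3 give the estimate [0.25, 0.5, 0.25].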
Dirichlet Distribution
By the same intuition as for the Beta distribution, we obtain the conjugate prior for the multinoulli:

Dir(µ|α) = [Γ(α_0) / (Γ(α_1) ··· Γ(α_K))] Π_{k=1}^K µ_k^{α_k−1}   (16)

where α_0 = Σ_k α_k.
Bayesian Inference
By the same argument as in the binomial case, we get the posterior

p(µ|D, α) = Dir(µ|α + m) = [Γ(α_0 + N) / (Γ(α_1 + m_1) ··· Γ(α_K + m_K))] Π_{k=1}^K µ_k^{α_k+m_k−1}   (17)
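The Dirichlet update in Eq. (17) is pure bookkeeping: add the observed counts to the prior pseudo-counts. A sketch (not from the slides):

```python
def dirichlet_posterior(alpha, counts):
    # Eq. (17): posterior Dir(mu | alpha + m), i.e. add observed counts m_k
    # to the prior pseudo-counts alpha_k
    return [a + m for a, m in zip(alpha, counts)]

def posterior_mean(alpha_post):
    # E[mu_k | D] = (alpha_k + m_k) / (alpha_0 + N)
    total = sum(alpha_post)
    return [a / total for a in alpha_post]
```

With a uniform Dir(1, 1, 1) prior and counts (2, 3, 5), the posterior is Dir(3, 4, 6) and its mean sums to 1.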
Uni- and Multivariate Gaussians
Basic Properties
Conditional and Marginal Distributions
Inference for the Gaussian
Student's t-distribution
Univariate Gaussian Distribution (N(x|µ, σ²) = N(x|µ, β⁻¹))

N(x|µ, σ²) = [1/√(2πσ²)] exp(−(x − µ)²/(2σ²))   (18)

N(x|µ, β⁻¹) = √(β/(2π)) exp(−(β/2)(x − µ)²)   (19)

Multivariate Gaussian Distribution (N(x|µ, Σ) = N(x|µ, Λ⁻¹))

N(x|µ, Σ) = [1/((2π)^{D/2} det(Σ)^{1/2})] exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))   (20)

N(x|µ, Λ⁻¹) = [det(Λ)^{1/2}/(2π)^{D/2}] exp(−(1/2)(x − µ)ᵀ Λ (x − µ))   (21)
Mahalanobis Distance
By the eigenvalue decomposition Σu_i = λ_i u_i, we get

Δ² = (x − µ)ᵀ Σ⁻¹ (x − µ) = Σ_{i=1}^D y_i²/λ_i   (22)

where y_i = u_iᵀ(x − µ).
Change of Variables in the Gaussian
From the above, we get

p(y) = p(x)|J_{y→x}| = Π_{j=1}^D [1/(2πλ_j)^{1/2}] exp{−y_j²/(2λ_j)}   (23)

which is a product of D independent univariate Gaussian distributions.
First and Second Moments of the Gaussian
Using the above, we get

E[x] = µ, E[xxᵀ] = µµᵀ + Σ   (24)
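Eq. (22) can be checked numerically: computing Δ² through the eigendecomposition of Σ gives the same value as the direct quadratic form. The matrices below are illustrative assumptions.

```python
import numpy as np

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])  # illustrative covariance
mu = np.array([1.0, -1.0])
x = np.array([2.0, 0.0])

# Direct computation: Delta^2 = (x - mu)^T Sigma^{-1} (x - mu)
d2_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

# Via eigendecomposition, Eq. (22): y_i = u_i^T (x - mu),
# Delta^2 = sum_i y_i^2 / lambda_i
lam, U = np.linalg.eigh(Sigma)
y = U.T @ (x - mu)
d2_evd = np.sum(y**2 / lam)
```

Both routes agree to machine precision, which is exactly the content of Eq. (22).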
Limitations of the Gaussian and Solutions
There are two main limitations of the Gaussian. First, we have to infer many covariance parameters. Second, we cannot represent multi-modal distributions. Therefore, we define some auxiliary concepts.
Diagonal Covariance

Σ = diag(s²)   (25)

Isotropic Covariance

Σ = σ²I   (26)

Mixture Model

p(x) = Σ_{k=1}^K π_k p(x|k)   (27)
Partition of the Mahalanobis Distance
First, partition the covariance matrix and the precision matrix:

Σ = [Σ_aa Σ_ab; Σ_ba Σ_bb], Σ⁻¹ = Λ = [Λ_aa Λ_ab; Λ_ba Λ_bb]   (28)

where the aa and bb blocks are symmetric and Σ_ab = Σ_baᵀ.
Now, partition the Mahalanobis distance:

(x − µ)ᵀ Σ⁻¹ (x − µ) = (x_a − µ_a)ᵀ Λ_aa (x_a − µ_a) + (x_a − µ_a)ᵀ Λ_ab (x_b − µ_b) + (x_b − µ_b)ᵀ Λ_ba (x_a − µ_a) + (x_b − µ_b)ᵀ Λ_bb (x_b − µ_b)   (29)

Schur Complement
As in Gaussian elimination, we can eliminate blocks of a block matrix using the Schur complement:

[A B; C D]⁻¹ = [M, −MBD⁻¹; −D⁻¹CM, D⁻¹ + D⁻¹CMBD⁻¹]   (30)

where M = (A − BD⁻¹C)⁻¹.
Schur Complement (cont.)
Therefore, we get

Λ_aa = (Σ_aa − Σ_ab Σ_bb⁻¹ Σ_ba)⁻¹   (31)

Λ_ab = −(Σ_aa − Σ_ab Σ_bb⁻¹ Σ_ba)⁻¹ Σ_ab Σ_bb⁻¹   (32)

Conditional Distribution
Therefore, we get

x_a|x_b ~ N(x|µ_{a|b}, Σ_{a|b})   (33)

where

µ_{a|b} = µ_a + Σ_ab Σ_bb⁻¹ (x_b − µ_b)   (34)

Σ_{a|b} = Σ_aa − Σ_ab Σ_bb⁻¹ Σ_ba   (35)

Marginal Distribution
Integrating out x_b, the marginal distribution of x_a satisfies

ln p(x_a) = −(1/2) x_aᵀ (Λ_aa − Λ_ab Λ_bb⁻¹ Λ_ba) x_a + x_aᵀ (Λ_aa − Λ_ab Λ_bb⁻¹ Λ_ba) µ_a + const   (36)

Therefore, we get

x_a ~ N(x|µ_a, Σ_aa)   (37)
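A numerical sketch of Eqs. (34)-(35) for a 2-D Gaussian with scalar blocks (all numbers are illustrative assumptions, not from the slides); it also cross-checks Eq. (31), which says Σ_{a|b} is the inverse of the aa block of the precision matrix.

```python
import numpy as np

mu = np.array([0.0, 1.0])            # (mu_a, mu_b)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])       # partitioned into scalar blocks
xb = 2.0                             # observed value of x_b

Saa, Sab, Sba, Sbb = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]

# Eq. (34): mu_{a|b} = mu_a + Sigma_ab Sigma_bb^{-1} (x_b - mu_b)
mu_a_given_b = mu[0] + Sab / Sbb * (xb - mu[1])
# Eq. (35): Sigma_{a|b} = Sigma_aa - Sigma_ab Sigma_bb^{-1} Sigma_ba
Sigma_a_given_b = Saa - Sab / Sbb * Sba

# Eq. (31) cross-check: Sigma_{a|b} = 1 / Lambda_aa
Lambda = np.linalg.inv(Sigma)
```

Here µ_{a|b} = 0.8 and Σ_{a|b} = 1.36, and 1/Λ_aa reproduces Σ_{a|b}.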
Given a marginal Gaussian for x and a conditional Gaussian for y given x in the form

x ~ N(x|µ, Λ⁻¹)   (38)

y|x ~ N(y|Ax + b, L⁻¹)   (39)

the marginal distribution of y and the conditional distribution of x given y are given by

y ~ N(y|Aµ + b, L⁻¹ + AΛ⁻¹Aᵀ)   (40)

x|y ~ N(x|Σ{AᵀL(y − b) + Λµ}, Σ)   (41)

where

Σ = (Λ + AᵀLA)⁻¹   (42)
Log-likelihood of the Data
By the same argument as for categorical data, the Gaussian log-likelihood is

ln p(D|µ, Σ) = −(ND/2) ln 2π − (N/2) ln|Σ| − (1/2) Σ_{n=1}^N (x_n − µ)ᵀ Σ⁻¹ (x_n − µ)   (43)

and this log-likelihood depends on the data only through the quantities

Σ_{n=1}^N x_n, Σ_{n=1}^N x_n x_nᵀ   (44)

called the sufficient statistics.
MLE for the Gaussian
Since the MLE maximizes the log-likelihood, we get

µ_ML = (1/N) Σ_{n=1}^N x_n   (45)

Σ_ML = (1/N) Σ_{n=1}^N (x_n − µ_ML)(x_n − µ_ML)ᵀ   (46)
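Eqs. (45)-(46) in a few lines of NumPy (a sketch, not from the slides); note the 1/N normalization, which makes Σ_ML the biased sample covariance.

```python
import numpy as np

def gaussian_mle(X):
    # X: (N, D) data matrix.
    # Eq. (45): sample mean; Eq. (46): biased (1/N) sample covariance.
    N = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / N
    return mu, Sigma
```

For the two points (0, 0) and (2, 2) this gives µ_ML = (1, 1) and Σ_ML with every entry equal to 1.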
Sequential Estimation
Since the Gaussian MLE is available analytically, we can compute it sequentially:

µ_ML^{(N)} = µ_ML^{(N−1)} + (1/N)(x_N − µ_ML^{(N−1)})   (47)

Robbins-Monro Algorithm
By the same intuition, we can generalize sequential learning. The Robbins-Monro algorithm finds a root θ* such that f(θ) = E[z|θ] = 0. Its iteration can be written as

θ^{(N)} = θ^{(N−1)} − a_{N−1} z(θ^{(N−1)})   (48)

where z(θ^{(N−1)}) is the observed value of z when θ takes the value θ^{(N−1)}, and {a_N} is a sequence satisfying

lim_{N→∞} a_N = 0, Σ_{N=1}^∞ a_N = ∞, Σ_{N=1}^∞ a_N² < ∞   (49)
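The sequential update of Eq. (47) never stores the data, only the running estimate; a minimal sketch (not from the slides):

```python
def sequential_mean(xs):
    # Eq. (47): mu^{(N)} = mu^{(N-1)} + (1/N)(x_N - mu^{(N-1)})
    mu = 0.0
    for n, x in enumerate(xs, start=1):
        mu = mu + (x - mu) / n
    return mu
```

The result coincides with the batch mean of Eq. (45), e.g. 2.5 for the stream 1, 2, 3, 4.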
Generalized Sequential Learning
We can apply the RM algorithm to sequential learning. In this case, our f(θ) is the gradient of the expected log-likelihood, so we take

z(θ) = −(∂/∂θ) ln p(x|θ)   (50)

In the Gaussian case, we set a_N = σ²/N.
Bayesian Inference for the Mean Given the Variance
Since the Gaussian likelihood takes the form of the exponential of a quadratic form in µ, we can choose a prior that is also Gaussian. Therefore, if we choose the prior

µ ~ N(µ|µ_0, σ_0²)   (51)

we get the posterior

µ|D ~ N(µ|µ_N, σ_N²)   (52)

where

µ_N = [σ²/(Nσ_0² + σ²)] µ_0 + [Nσ_0²/(Nσ_0² + σ²)] µ_ML, 1/σ_N² = 1/σ_0² + N/σ²   (53)
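Eq. (53) as a sketch (not from the slides); with a very flat prior the posterior mean reduces to the MLE, and the posterior precision is the prior precision plus N data precisions.

```python
def posterior_mean_params(data, mu0, sigma0_sq, sigma_sq):
    # Eq. (53): posterior N(mu | mu_N, sigma_N^2) for the mean of a Gaussian
    # with known variance sigma^2 and prior N(mu0, sigma0^2)
    N = len(data)
    mu_ml = sum(data) / N
    mu_N = (sigma_sq * mu0 + N * sigma0_sq * mu_ml) / (N * sigma0_sq + sigma_sq)
    sigma_N_sq = 1.0 / (1.0 / sigma0_sq + N / sigma_sq)
    return mu_N, sigma_N_sq
```

With data [1, 2, 3], σ² = 1 and an almost-flat prior (σ_0² = 10¹²), µ_N is essentially the MLE 2.0 and σ_N² ≈ 1/3.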
Bayesian Inference for the Mean Given the Variance (cont.)
1. The posterior mean compromises between the prior mean and the MLE.
2. The posterior precision is the precision of the prior plus one contribution of the data precision for each observed data point.
3. If we take σ_0² → ∞, the posterior mean reduces to the MLE.
Bayesian Inference for the Variance Given the Mean
The Gaussian likelihood is proportional to the product of a power of the precision and the exponential of a linear function of the precision, so we choose the Gamma prior, defined by

Gam(λ|a_0, b_0) = [1/Γ(a_0)] b_0^{a_0} λ^{a_0−1} exp(−b_0 λ)   (54)

Then we get the posterior

λ|D ~ Gam(λ|a_N, b_N)   (55)

where

a_N = a_0 + N/2, b_N = b_0 + (N/2) σ_ML²   (56)
Bayesian Inference for the Variance Given the Mean (cont.)
1. The parameter 2a_0 can be interpreted as the effective number of prior observations.
2. The parameter b_0/a_0 can be interpreted as the effective prior variance of those observations.
Bayesian Inference for the Mean and Variance Jointly
Applying the same argument to the mean and variance together, we get the prior

p(µ, λ) = N(µ|µ_0, (βλ)⁻¹) Gam(λ|a, b)   (57)

where

µ_0 = c/β, a = 1 + β/2, b = d − c²/(2β)   (58)

Note that the precision of µ is a linear function of λ.
For the multivariate case, we similarly get the prior

p(µ, Λ|µ_0, β, W, ν) = N(µ|µ_0, (βΛ)⁻¹) W(Λ|W, ν)   (59)

where W is the Wishart distribution.
Univariate t-distribution
If we integrate out the precision, given a Gamma prior on the precision, we get the t-distribution:

St(x|µ, λ, ν) = [Γ(ν/2 + 1/2)/Γ(ν/2)] (λ/(πν))^{1/2} [1 + λ(x − µ)²/ν]^{−ν/2−1/2}   (60)

where ν = 2a (the degrees of freedom) and λ = a/b.
We can think of the t-distribution as an infinite mixture of Gaussians. Since the t-distribution has fatter tails than the Gaussian, it yields more robust estimates.
Multivariate t-distribution
The multivariate case of the infinite mixture of Gaussians gives the multivariate t-distribution:

St(x|µ, Λ, ν) = [Γ(ν/2 + D/2)/Γ(ν/2)] [det(Λ)^{1/2}/(πν)^{D/2}] [1 + Δ²/ν]^{−ν/2−D/2}   (61)

where Δ² = (x − µ)ᵀΛ(x − µ).
Distributions in the Exponential Family
Sigmoid and Softmax
MLE for the Exponential Family
Conjugate Priors for the Exponential Family
Noninformative Priors
The Exponential Family
The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form

p(x|η) = g(η) h(x) exp{ηᵀu(x)}   (62)

where η are the natural parameters of the distribution and u(x) is a function of x. The function g(η) can be interpreted as the normalization factor.
Logistic Sigmoid
For the Bernoulli distribution, the model parameter is µ while the natural parameter is η. The two are connected by

η = ln(µ/(1 − µ)), µ := σ(η) = exp(η)/(1 + exp(η))   (63)

We call σ(η) the sigmoid function.
Softmax Function
By the same argument, for the multinoulli the relationship between the parameters and the natural parameters is the softmax function:

µ_k = exp(η_k) / Σ_{j=1}^K exp(η_j)   (64)

Note that in this case u(x) = x, h(x) = 1, g(η) = (Σ_{j=1}^K exp(η_j))⁻¹.
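Numerically stable implementations of Eqs. (63) and (64) (a sketch, not from the slides): the sigmoid is branched on the sign of η, and the softmax is shifted by max(η), which leaves Eq. (64) unchanged because the shift cancels in the ratio.

```python
import numpy as np

def sigmoid(eta):
    # Eq. (63): mu = exp(eta) / (1 + exp(eta)), written to avoid overflow
    return np.where(eta >= 0,
                    1.0 / (1.0 + np.exp(-np.abs(eta))),
                    np.exp(-np.abs(eta)) / (1.0 + np.exp(-np.abs(eta))))

def softmax(eta):
    # Eq. (64), shifted by max(eta) for numerical stability
    z = np.exp(eta - np.max(eta))
    return z / z.sum()
```

`sigmoid(0.0)` is 0.5, and `softmax` always returns a probability vector that sums to 1.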
Gaussian
The Gaussian can also be written as a member of the exponential family with

u(x) = (x, x²)ᵀ   (65)

η = (η_1, η_2)ᵀ = (µ/σ², −1/(2σ²))ᵀ   (66)

g(η) = (−2η_2)^{1/2} exp(η_1²/(4η_2))   (67)
Estimating the Natural Parameters
We can generalize the MLE argument to the exponential family. First, consider the log-likelihood of the data:

ln p(D|η) = Σ_{n=1}^N ln h(x_n) + N ln g(η) + ηᵀ Σ_{n=1}^N u(x_n)   (68)

Next, we find the stationary point of the log-likelihood:

N ∇_η ln g(η) + Σ_{n=1}^N u(x_n) = 0   (69)

Therefore, the MLE satisfies

−∇_η ln g(η_ML) = (1/N) Σ_{n=1}^N u(x_n)   (70)

We see that the MLE depends on the data only through Σ_n u(x_n), which is therefore called the sufficient statistic of the exponential family.
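Eq. (70) can be instantiated for the Bernoulli distribution. Writing it in the form of Eq. (62) gives u(x) = x and g(η) = 1/(1 + exp(η)), so −d/dη ln g(η) = σ(η), and Eq. (70) says σ(η_ML) equals the mean of the sufficient statistic; the sketch below (not from the slides) recovers µ_ML by inverting the sigmoid.

```python
from math import log, exp

# Bernoulli as an exponential family: u(x) = x, g(eta) = 1/(1 + exp(eta)),
# so Eq. (70) reads sigmoid(eta_ML) = (1/N) sum_n x_n.
data = [1, 1, 0, 1]
suff_stat_mean = sum(data) / len(data)               # (1/N) sum_n u(x_n)
eta_ml = log(suff_stat_mean / (1 - suff_stat_mean))  # invert the sigmoid
mu_ml = 1 / (1 + exp(-eta_ml))                       # recover the mean parameter
```

The recovered mean parameter equals the sample mean 0.75, matching the direct Bernoulli MLE of Eq. (3).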
Conjugate Prior
For any member of the exponential family, there exists a conjugate prior that can be written in the form

p(η|χ, ν) = f(χ, ν) g(η)^ν exp{ν ηᵀχ}   (71)

where f(χ, ν) is a normalization factor and g(η) is the same function as in the exponential family.
Posterior Distribution
If we choose this conjugate prior, we get

p(η|D, χ, ν) ∝ g(η)^{ν+N} exp{ηᵀ(Σ_{n=1}^N u(x_n) + νχ)}   (72)

Therefore, the parameter ν can be interpreted as the effective number of pseudo-observations in the prior, each of which has the value χ for the sufficient statistic u(x).
Noninformative Priors
We may seek a form of prior distribution, called a noninformative prior, that is intended to have as little influence on the posterior distribution as possible.
Generalizations of Noninformative Priors
This leads to two generalizations: the principle of transformation groups, as in the Jeffreys prior, and the principle of maximum entropy.
Histogram Technique
Kernel Density Estimation
Nearest-Neighbour Methods
Histogram Technique
Standard histograms simply partition x into distinct bins of width Δ_i and count the number n_i of observations of x falling in bin i. To turn this count into a normalized probability density, we divide by the total number N of observations and by the width Δ_i of the bin, obtaining the probability value for each bin:

p_i = n_i/(N Δ_i)   (73)

Limitations of Histograms
The estimated density has discontinuities that are due to the bin edges rather than any property of the underlying distribution that generated the data. The histogram approach also scales poorly with dimensionality.
Lessons from Histograms
First, to estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point. Second, the value of the smoothing parameter should be neither too large nor too small in order to obtain good results.
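Eq. (73) in NumPy (a sketch, not from the slides); the bin edges are an illustrative assumption:

```python
import numpy as np

def histogram_density(data, edges):
    # Eq. (73): p_i = n_i / (N * Delta_i), where Delta_i is the bin width
    counts, _ = np.histogram(data, bins=edges)
    widths = np.diff(edges)
    return counts / (len(data) * widths)
```

By construction the estimate integrates to 1: Σ_i p_i Δ_i = 1.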
Motivation
For large N, the binomial distribution of the number K of data points falling within a small region R is sharply peaked around its mean, so

K ≈ NP   (74)

If, however, we also assume that the region R is sufficiently small that the probability density p(x) is roughly constant over it, then

P ≈ p(x)V   (75)

where V is the volume of R. Therefore,

p(x) = K/(NV)   (76)

Note that these assumptions pull in opposite directions: R must be sufficiently small that the density is approximately constant over the region, yet sufficiently large that the number K of points falling inside it makes the binomial distribution sharply peaked.
Kernel Density Estimation (KDE)
If we fix V and determine K from the data, we obtain the kernel approach. For instance, fix V = 1 and count the data points with the function

k(u) = 1 if |u_i| ≤ 1/2 for i = 1, ..., D; 0 otherwise   (77)

called a Parzen window. In this case, we count

K = Σ_{n=1}^N k((x − x_n)/h)   (78)

which leads to the density estimate

p(x) = (1/N) Σ_{n=1}^N (1/h^D) k((x − x_n)/h)   (79)

We can also use other kernels, such as the Gaussian kernel, in which case we get

p(x) = (1/N) Σ_{n=1}^N [1/(2πh²)^{D/2}] exp{−‖x − x_n‖²/(2h²)}   (80)
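Eq. (80) in one dimension (D = 1) is an equally weighted mixture of Gaussians of width h centred on the data points; a sketch (not from the slides), with the data and bandwidth as illustrative assumptions:

```python
import numpy as np

def gaussian_kde(x, data, h):
    # Eq. (80) with D = 1: average of Gaussian bumps of width h
    # centred on the data points
    z = (x - data) / h
    return np.mean(np.exp(-0.5 * z**2) / (np.sqrt(2.0 * np.pi) * h))
```

Because each bump integrates to 1, so does the whole estimate, which a crude Riemann sum over a wide grid confirms.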
Limitations of KDE
One of the difficulties with the kernel approach to density estimation is that the kernel width h is fixed for all kernels. In regions of high data density, a large value of h may lead to over-smoothing, while in regions of low data density, a small value of h may lead to overfitting. Thus the optimal choice of h may depend on the location within the data space.
Nearest Neighbours (NN)
Therefore, we consider fixing K and using the data to find an appropriate V; this is the K-nearest-neighbour (K-NN) method. In this case, the value of K governs the degree of smoothing, and K must be optimized as a hyper-parameter.
Error of K-NN
Note that for sufficiently large N, the error rate of the nearest-neighbour classifier is never more than twice the minimum achievable error rate of an optimal classifier.
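The K-NN density idea is Eq. (76) with V grown until it contains K points; a 1-D sketch (not from the slides), where the "volume" is the length 2r_K of the interval reaching the K-th nearest neighbour:

```python
import numpy as np

def knn_density(x, data, K):
    # Eq. (76): p(x) = K / (N * V), with V = 2 * r_K the length of the
    # smallest interval around x containing the K nearest data points
    dists = np.sort(np.abs(data - x))
    r_k = dists[K - 1]
    return K / (len(data) * 2.0 * r_k)
```

For the points 0, 1, 2, 3 and K = 2, the estimate at x = 1 is 2/(4 · 2 · 1) = 0.25. Unlike KDE, this estimate is not a proper density (its integral diverges), which is a known caveat of the method.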