Combining Models - In these slides, we look at a way to combine the answers from various weak classifiers to build a robust classifier. The slides cover the following subjects:
1. Model Combination vs. Bayesian Model Averaging
2. Bootstrap Data Sets
3. And, as the cherry on top, AdaBoost
1. Machine Learning for Data Mining
Combining Models
Andres Mendez-Vazquez
July 20, 2015
1 / 65
Outline
1 Introduction
2 Bayesian Model Averaging
Model Combination Vs. Bayesian Model Averaging
3 Committees
Bootstrap Data Sets
4 Boosting
AdaBoost Development
Cost Function
Selection Process
Selecting New Classifiers
Using our Deriving Trick
AdaBoost Algorithm
Some Remarks and Implementation Issues
Explanation about AdaBoost’s behavior
Example
2 / 65
Introduction
Observation
It is often found that improved performance can be obtained by combining
multiple classifiers together in some way.
Example: Committees
We might train L different classifiers and then make predictions using the
average of the predictions made by each classifier.
Example: Boosting
It involves training multiple models in sequence in which the error function
used to train a particular model depends on the performance of the
previous models.
3 / 65
In addition
Instead of averaging the predictions of a set of models
You can use an alternative form of combination that selects one of the
models to make the prediction.
Where
The choice of model is a function of the input variables.
Thus
Different models become responsible for making decisions in different
regions of the input space.
4 / 65
Example: Decision Trees
We can have decision trees on top of the models
Given a set of models, one model is chosen to make the decision in a certain
region of the input space.
Limitation: It is based on hard splits in which only one model is responsible for
making predictions for any given value.
Thus, it is better to soften the combination
Suppose we have M classifiers, each giving a conditional distribution p (t|x, k), where:
x is the input variable.
t is the target variable.
k = 1, 2, ..., M indexes the classifiers.
6 / 65
This is used in mixture distributions
Thus (Mixture of Experts)
p (t|x) = \sum_{k=1}^{M} \pi_k (x) p (t|x, k)   (1)
where \pi_k (x) = p (k|x) are the input-dependent mixing coefficients.
This type of model
It can be viewed as a mixture distribution in which the component densities
and the mixing coefficients are conditioned on the input variables; such
models are known as mixtures of experts.
7 / 65
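A minimal sketch of Eq. (1). The experts and the gate are hypothetical stand-ins: here the gate p(k|x) is a softmax over linear scores, one common but not unique choice; the slides do not prescribe a particular gating model.

```python
import numpy as np

def mixture_of_experts(x, experts, gate_weights):
    """Combine M expert conditional distributions p(t|x,k) with
    input-dependent mixing coefficients pi_k(x) = p(k|x), as in Eq. (1).

    experts: list of M callables, each returning p(t|x,k) as an array over t.
    gate_weights: (M, d) array; the gate is a softmax over linear scores
    (an illustrative assumption, not the only possible gate)."""
    scores = gate_weights @ x                # one score per expert, shape (M,)
    pi = np.exp(scores - scores.max())       # numerically stable softmax
    pi /= pi.sum()                           # pi_k(x), sums to 1
    # p(t|x) = sum_k pi_k(x) p(t|x,k)
    return sum(p * expert(x) for p, expert in zip(pi, experts))
```

With zero gate weights the mixing coefficients are uniform, so the output is the plain average of the expert distributions.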
Example
For this, consider the following
A mixture of Gaussians with a binary latent variable z indicating which
component a point belongs to:
p (x, z)   (2)
Thus
The corresponding density over the observed variable x is obtained by
marginalization:
p (x) = \sum_{z} p (x, z)   (3)
10 / 65
In the case of Gaussian distributions
We have
p (x) = \sum_{k=1}^{M} \pi_k N (x|\mu_k, \Sigma_k)   (4)
Now, for independent, identically distributed data X = {x1, x2, ..., xN }
p (X) = \prod_{n=1}^{N} p (x_n) = \prod_{n=1}^{N} \sum_{z_n} p (x_n, z_n)   (5)
Now, suppose
We have several different models indexed by h = 1, ..., H with prior
probabilities p(h).
For instance, one model might be a mixture of Gaussians and another
model might be a mixture of Cauchy distributions.
11 / 65
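Equations (4) and (5) can be sketched directly, here in the 1-D case for brevity (the slides state the general case with covariances Σ_k; the diagonal/scalar restriction is our simplification):

```python
import numpy as np

def gmm_density(x, pis, mus, sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, sigma_k^2), Eq. (4), 1-D components."""
    x, pis, mus, sigmas = map(np.asarray, (x, pis, mus, sigmas))
    norm = 1.0 / (np.sqrt(2 * np.pi) * sigmas)
    comps = norm * np.exp(-0.5 * ((x - mus) / sigmas) ** 2)
    return float(np.sum(pis * comps))

def log_likelihood(X, pis, mus, sigmas):
    """log p(X) = sum_n log sum_{z_n} p(x_n, z_n), Eq. (5), i.i.d. data."""
    return sum(np.log(gmm_density(x, pis, mus, sigmas)) for x in X)
```

With a single component (pi = 1, mu = 0, sigma = 1), `gmm_density(0.0, ...)` reduces to the standard normal density at zero, 1/sqrt(2*pi).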
Bayesian Model Averaging
The Marginal Distribution is
p (X) = \sum_{h=1}^{H} p (X, h) = \sum_{h=1}^{H} p (X|h) p (h)   (6)
Observations
1 This is an example of Bayesian model averaging.
2 The summation over h means that just one model is responsible for
generating the whole data set.
3 The probability distribution over h simply reflects our uncertainty as to
which is the correct model to use.
Thus
As the size of the data set increases, this uncertainty reduces, and the
posterior probabilities p(h|X) become increasingly focused on just one of
the models.
12 / 65
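Equation (6) and the posterior p(h|X) it induces can be sketched in a few lines (the per-model likelihoods p(X|h) are assumed to be given; computing them is model-specific):

```python
import numpy as np

def bma_marginal(likelihoods, priors):
    """p(X) = sum_h p(X|h) p(h), Eq. (6).
    likelihoods[h] = p(X|h), priors[h] = p(h)."""
    likelihoods, priors = np.asarray(likelihoods), np.asarray(priors)
    return float(np.sum(likelihoods * priors))

def posterior_over_models(likelihoods, priors):
    """p(h|X) = p(X|h) p(h) / p(X): the distribution that, as the data
    set grows, concentrates on just one of the H models."""
    joint = np.asarray(likelihoods) * np.asarray(priors)
    return joint / joint.sum()
```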
The Differences
Bayesian model averaging
The whole data set is generated by a single model.
Model combination
Different data points within the data set can potentially be generated from
different values of the latent variable z and hence by different components.
13 / 65
Committees
Idea
The simplest way to construct a committee is to average the predictions of
a set of individual models.
Thinking as a frequentist
This comes from considering the trade-off between bias and variance.
This decomposes the error of the model into
The bias component, which arises from differences between the model
and the true function to be predicted.
The variance component, which represents the sensitivity of the model
to the individual data points.
14 / 65
For example
We can do the following
When we average a set of low-bias models, we obtain accurate
predictions of the underlying sinusoidal function from which the data were
generated.
15 / 65
However
Big Problem
We have normally a single data set
Thus
We need to introduce certain variability between the different committee
members.
One approach
You can use bootstrap data sets.
16 / 65
What to do with the Bootstrap Data Set
Given a Regression Problem
Here, we are trying to predict the value of a single continuous variable.
What?
Suppose our original data set consists of N data points X = {x1, ..., xN } .
Thus
We can create a new data set XB by drawing N points at random from X,
with replacement, so that some points in X may be replicated in XB,
whereas other points in X may be absent from XB.
18 / 65
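The bootstrap step just described can be sketched as (the function name is ours):

```python
import numpy as np

def bootstrap_data_set(X, rng):
    """Create X_B by drawing N points from X at random WITH replacement:
    some points of X may be replicated in X_B, others may be absent."""
    N = len(X)
    idx = rng.integers(0, N, size=N)   # N indices, repeats allowed
    return X[idx]
```

Usage: `X_B = bootstrap_data_set(X, np.random.default_rng(0))` yields a data set of the same size as `X`, drawn only from points of `X`.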
Thus
Use each of them to train a copy y_m (x) of a predictive regression
model to predict a single continuous variable
Then,
y_{com} (x) = \frac{1}{M} \sum_{m=1}^{M} y_m (x)   (7)
This is also known as Bootstrap Aggregation or Bagging.
20 / 65
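Equation (7) as code: a bagged committee is simply the average of the M bootstrap-trained regressors (the models here are arbitrary callables; training them is assumed to have happened already):

```python
def bagged_prediction(x, models):
    """y_com(x) = (1/M) * sum_m y_m(x), Eq. (7): average the M regression
    models trained on the M bootstrap data sets."""
    return sum(m(x) for m in models) / len(models)
```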
What do we do with these samples?
Now, assume a true regression function h (x) and an estimate y_m (x)
y_m (x) = h (x) + \epsilon_m (x)   (8)
The average sum-of-squares error over the data takes the form
E_x [ \{y_m (x) − h (x)\}^2 ] = E_x [ \epsilon_m (x)^2 ]   (9)
What is E_x?
It denotes a frequentist expectation with respect to the distribution of the
input vector x.
21 / 65
Meaning
Thus, the average error is
E_{AV} = \frac{1}{M} \sum_{m=1}^{M} E_x [ \epsilon_m (x)^2 ]   (10)
Similarly, the expected error over the committee is
E_{COM} = E_x [ \{ \frac{1}{M} \sum_{m=1}^{M} (y_m (x) − h (x)) \}^2 ] = E_x [ \{ \frac{1}{M} \sum_{m=1}^{M} \epsilon_m (x) \}^2 ]   (11)
22 / 65
Assumption
Assume that the errors have zero mean and are uncorrelated:
E_x [ \epsilon_m (x) ] = 0
E_x [ \epsilon_m (x) \epsilon_l (x) ] = 0, for m ≠ l
23 / 65
Then
We have that
E_{COM} = \frac{1}{M^2} E_x [ ( \sum_{m=1}^{M} \epsilon_m (x) )^2 ]
= \frac{1}{M^2} E_x [ \sum_{m=1}^{M} \epsilon_m^2 (x) + \sum_{h=1}^{M} \sum_{k=1, k \neq h}^{M} \epsilon_h (x) \epsilon_k (x) ]
= \frac{1}{M^2} ( E_x [ \sum_{m=1}^{M} \epsilon_m^2 (x) ] + E_x [ \sum_{h=1}^{M} \sum_{k=1, k \neq h}^{M} \epsilon_h (x) \epsilon_k (x) ] )
= \frac{1}{M^2} ( E_x [ \sum_{m=1}^{M} \epsilon_m^2 (x) ] + \sum_{h=1}^{M} \sum_{k=1, k \neq h}^{M} E_x [ \epsilon_h (x) \epsilon_k (x) ] )
= \frac{1}{M^2} E_x [ \sum_{m=1}^{M} \epsilon_m^2 (x) ]   (the cross terms vanish by the uncorrelated assumption)
= \frac{1}{M} \cdot \frac{1}{M} \sum_{m=1}^{M} E_x [ \epsilon_m^2 (x) ]
24 / 65
We finally obtain
We obtain
E_{COM} = \frac{1}{M} E_{AV}   (12)
Looks great, BUT!!!
Unfortunately, it depends on the key assumption that the errors due to the
individual models are uncorrelated.
25 / 65
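Equation (12) can be checked empirically. This sketch samples the error signals as independent zero-mean Gaussians (independence is stronger than the uncorrelated assumption in the derivation, but convenient for simulation) and compares the two error measures:

```python
import numpy as np

# Empirical check of E_COM = E_AV / M under zero-mean, uncorrelated errors.
rng = np.random.default_rng(42)
M, n_points = 10, 200_000
eps = rng.normal(0.0, 1.0, size=(M, n_points))   # samples of eps_m(x)

e_av = np.mean(eps ** 2)                 # (1/M) sum_m E[eps_m(x)^2], Eq. (10)
e_com = np.mean(eps.mean(axis=0) ** 2)   # E[((1/M) sum_m eps_m(x))^2], Eq. (11)

print(e_com, e_av / M)   # both approximately 1/M = 0.1
```

In practice the errors of models trained on overlapping bootstrap samples are far from uncorrelated, which is exactly the caveat raised on this slide.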
Thus
The Reality!!!
The errors are typically highly correlated, and the reduction in overall error
is generally small.
Something Notable
However, it can be shown that the expected committee error will not
exceed the expected error of the constituent models, so
E_{COM} ≤ E_{AV}   (13)
We need something better
A more sophisticated technique known as boosting.
26 / 65
Boosting
What does Boosting do?
It combines several classifiers to produce a form of committee.
We will describe AdaBoost
“Adaptive Boosting,” developed by Freund and Schapire (1995).
28 / 65
Sequential Training
Main difference between boosting and committee methods
The base classifiers are trained in sequence.
Explanation
Consider a two-class classification problem:
1 Samples x_1, x_2, ..., x_N
2 Binary labels t_1, t_2, ..., t_N ∈ {−1, 1}
29 / 65
Cost Function
Now
You want to put together a set of M experts able to recognize the most
difficult inputs in an accurate way!!!
Thus
For each pattern x_i, each expert classifier outputs a classification
y_j (x_i) ∈ {−1, 1}
The final decision of the committee of M experts is sign (C (x_i)), where
C (x_i) = \alpha_1 y_1 (x_i) + \alpha_2 y_2 (x_i) + ... + \alpha_M y_M (x_i)   (14)
30 / 65
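The committee decision sign(C(x)) of Eq. (14) in code (the experts are arbitrary callables returning labels in {−1, +1}; how their votes α_j are chosen is exactly what the following slides develop):

```python
def committee_decision(x, experts, alphas):
    """Return sign(C(x)) with C(x) = sum_j alpha_j * y_j(x), Eq. (14).
    Each expert maps x to a label in {-1, +1}."""
    C = sum(a * y(x) for a, y in zip(alphas, experts))
    return 1 if C >= 0 else -1
```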
Getting the correct classifiers
We want the following
Review possible member classifiers.
Select them if they have certain properties.
Assign a weight to their contribution to the set of experts.
32 / 65
Now
Selection is done in the following way
Test the classifiers in the pool using a training set T of N
multidimensional data points x_i:
For each point x_i we have a label t_i = 1 or t_i = −1.
Assigning a cost to actions
We test and rank all classifiers in the expert pool by
Charging a cost exp {β} any time a classifier fails (a miss).
Charging a cost exp {−β} any time a classifier provides the right
label (a hit).
33 / 65
Remarks about β
We require β > 0
Thus, misses are penalized more heavily than hits.
Although
It looks strange to penalize hits, but all that matters is that the penalty for
a success is smaller than the penalty for a miss: exp {−β} < exp {β}.
Why?
Exercise: show that if we assign cost a to misses and cost b to hits, where
a > b > 0, we can rewrite such costs as a = c^d and b = c^{−d} for constants
c and d. Thus, the exponential form does not compromise generality.
34 / 65
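A sketch of the argument behind this exercise (up to a common positive scale factor, which does not change the ranking of classifiers):

```latex
\text{Let } s = \sqrt{ab} > 0. \text{ Then}
\quad \frac{a}{s} = \sqrt{\frac{a}{b}} = e^{\beta},
\qquad \frac{b}{s} = \sqrt{\frac{b}{a}} = e^{-\beta},
\qquad \text{with } \beta = \tfrac{1}{2}\ln\frac{a}{b} > 0 \text{ since } a > b.
```

Dividing all costs by the constant s leaves the comparison between classifiers unchanged, so general miss/hit costs a > b > 0 reduce to the exponential pair exp{β}, exp{−β}.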
Exponential Loss Function
Something Notable
This kind of error function, different from the usual squared Euclidean
distance to the classification target, is called an exponential loss function.
AdaBoost uses the exponential loss as its error criterion.
35 / 65
The Matrix S
When we test the M classifiers in the pool, we build a matrix S
We record the misses (with a ONE) and hits (with a ZERO) of each
classifier.
Row i in the matrix is reserved for the data point x_i; column m is
reserved for the mth classifier in the pool.

          Classifiers
        1    2   · · ·   M
x_1     0    1   · · ·   1
x_2     0    0   · · ·   1
x_3     1    1   · · ·   0
...    ...  ...          ...
x_N     0    0   · · ·   0

36 / 65
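Building S from the pool's predictions is a one-liner (the function name is ours):

```python
import numpy as np

def miss_matrix(predictions, labels):
    """Build S with S[i, m] = 1 if classifier m misses data point x_i,
    and 0 on a hit.
    predictions: (N, M) array of predicted labels in {-1, +1}.
    labels: (N,) array of targets t_i in {-1, +1}."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels).reshape(-1, 1)   # broadcast over the M columns
    return (predictions != labels).astype(int)
```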
Main Idea
What does AdaBoost want to do?
The main idea of AdaBoost is to proceed systematically by extracting one
classifier from the pool in each of M iterations.
Thus
The elements in the data set are weighted according to their current
relevance (or urgency) at each iteration.
Thus, at the beginning of the iterations
All data samples are assigned the same weight: just 1, or 1/N if we want
the weights to sum to 1.
37 / 65
The process of the weights
As the selection progresses
The more difficult samples, those where the committee still performs badly, are assigned larger and larger weights.
Basically
The selection process concentrates on choosing new classifiers for the committee, focusing on those which can help with the still misclassified examples.
Then
The best classifiers are those which can provide new insights to the committee.
Classifiers being selected should complement each other in an optimal way.
38 / 65
Selecting New Classifiers
What we want
In each iteration, we rank all classifiers so that we can select the current best out of the pool.
At the mth iteration
We have already included m − 1 classifiers in the committee and we want to select the next one.
Thus, we have the following cost function, which is actually the output of the committee

C^{(m−1)}(x_i) = α_1 y_1(x_i) + α_2 y_2(x_i) + · · · + α_{m−1} y_{m−1}(x_i)   (15)
39 / 65
Thus, we have that
Extending the cost function
C^{(m)}(x_i) = C^{(m−1)}(x_i) + α_m y_m(x_i)   (16)
At the first iteration m = 1
C^{(m−1)} is the zero function.
Thus, the total cost or total error is defined as the exponential error
E = Σ_{i=1}^{N} exp{−t_i [C^{(m−1)}(x_i) + α_m y_m(x_i)]}   (17)
40 / 65
Thus
We want to determine
α_m and y_m in an optimal way.
Thus, rewriting
E = Σ_{i=1}^{N} w_i^{(m)} exp{−t_i α_m y_m(x_i)}   (18)
Where, for i = 1, 2, ..., N
w_i^{(m)} = exp{−t_i C^{(m−1)}(x_i)}   (19)
41 / 65
Thus
In the first iteration, w_i^{(1)} = 1 for i = 1, ..., N.
During later iterations, the vector w^{(m)} represents the weight assigned to each data point in the training set at iteration m.
Splitting equation (Eq. 18)
E = Σ_{t_i = y_m(x_i)} w_i^{(m)} exp{−α_m} + Σ_{t_i ≠ y_m(x_i)} w_i^{(m)} exp{α_m}   (20)
Meaning
The total cost is the weighted cost of all hits plus the weighted cost of all misses.
42 / 65
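The split in Eq. 20 can be checked numerically: summing the exponential penalties point by point (Eq. 18) gives the same total as grouping hits and misses. A small sketch (the weights, targets, and outputs below are made-up illustration):

```python
import numpy as np

t = np.array([1, 1, -1, -1])           # targets t_i
y = np.array([1, -1, -1, 1])           # candidate classifier outputs y_m(x_i)
w = np.array([0.4, 0.1, 0.3, 0.2])     # current weights w_i^(m)
alpha = 0.5                            # some fixed alpha_m

# Direct form (Eq. 18)
E_direct = np.sum(w * np.exp(-t * alpha * y))

# Split form (Eq. 20): hits pay exp(-alpha), misses pay exp(+alpha)
hits, misses = (t == y), (t != y)
E_split = w[hits].sum() * np.exp(-alpha) + w[misses].sum() * np.exp(alpha)

assert np.isclose(E_direct, E_split)
```

The two weighted sums W_c and W_e of the next slide are exactly `w[hits].sum()` and `w[misses].sum()` here.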
Now
Writing the first summand as W_c exp{−α_m} and the second as W_e exp{α_m}
E = W_c exp{−α_m} + W_e exp{α_m}   (21)
Now, for the selection of y_m the exact value of α_m > 0 is irrelevant
Since, for a fixed α_m, minimizing E is equivalent to minimizing exp{α_m} E, and
exp{α_m} E = W_c + W_e exp{2α_m}   (22)
43 / 65
Then
Since exp{2α_m} > 1, we can rewrite (Eq. 22)
exp{α_m} E = W_c + W_e − W_e + W_e exp{2α_m}   (23)
Thus
exp{α_m} E = (W_c + W_e) + W_e (exp{2α_m} − 1)   (24)
Now
W_c + W_e is the total sum W of the weights of all data points, which is constant in the current iteration.
44 / 65
Thus
The right-hand side of the equation is minimized
When at the mth iteration we pick the classifier with the lowest total cost W_e (that is, the lowest weighted error).
Intuitively
The next selected y_m should be the one with the lowest penalty given the current set of weights.
45 / 65
Deriving against the weight α_m
Going back to the original E, we can use the derivative trick
∂E/∂α_m = −W_c exp{−α_m} + W_e exp{α_m}   (25)
Setting the derivative to zero and multiplying by exp{α_m}
−W_c + W_e exp{2α_m} = 0   (26)
The optimal value is thus
α_m = (1/2) ln(W_c/W_e)   (27)
46 / 65
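The closed form in Eq. 27 is easy to sanity-check against a brute-force search over α. A sketch (the particular values of W_c and W_e and the search grid are arbitrary):

```python
import numpy as np

Wc, We = 0.7, 0.3                      # weighted hits and weighted misses
E = lambda a: Wc * np.exp(-a) + We * np.exp(a)   # Eq. 21

alpha_star = 0.5 * np.log(Wc / We)     # Eq. 27

# Brute force over a dense grid: no point beats the closed form
grid = np.linspace(-3, 3, 10001)
assert E(alpha_star) <= E(grid).min() + 1e-9
print(alpha_star)  # about 0.4236
```

At the optimum the two summands balance: E(α_m) = 2·sqrt(W_c·W_e).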
Now
Making the total sum of all weights
W = W_c + W_e   (28)
We can rewrite the previous equation as
α_m = (1/2) ln((W − W_e)/W_e) = (1/2) ln((1 − e_m)/e_m)   (29)
With the percentage rate of error given the weights of the data points
e_m = W_e/W   (30)
47 / 65
What about the weights?
Using the equation
w_i^{(m)} = exp{−t_i C^{(m−1)}(x_i)}   (31)
And because we have α_m and y_m(x_i)
w_i^{(m+1)} = exp{−t_i C^{(m)}(x_i)}
            = exp{−t_i [C^{(m−1)}(x_i) + α_m y_m(x_i)]}
            = w_i^{(m)} exp{−t_i α_m y_m(x_i)}
48 / 65
Sequential Training
Thus
AdaBoost trains a new classifier using a data set in which the weighting
coefficients are adjusted according to the performance of the previously
trained classifier so as to give greater weight to the misclassified data
points.
49 / 65
Outline
1 Introduction
2 Bayesian Model Averaging
Model Combination Vs. Bayesian Model Averaging
3 Committees
Bootstrap Data Sets
4 Boosting
AdaBoost Development
Cost Function
Selection Process
Selecting New Classifiers
Using our Deriving Trick
AdaBoost Algorithm
Some Remarks and Implementation Issues
Explanation about AdaBoost’s behavior
Example
51 / 65
AdaBoost Algorithm
Step 1
Initialize w_i^{(1)} to 1/N.
Step 2
For m = 1, 2, ..., M
1 Select the weak classifier y_m(x) that minimizes the weighted error function
arg min_{y_m} Σ_{i=1}^{N} w_i^{(m)} I(y_m(x_i) ≠ t_i) = arg min_{y_m} Σ_{t_i ≠ y_m(x_i)} w_i^{(m)} = arg min_{y_m} W_e   (32)
Where I is an indicator function.
2 Evaluate
e_m = [Σ_{n=1}^{N} w_n^{(m)} I(y_m(x_n) ≠ t_n)] / [Σ_{n=1}^{N} w_n^{(m)}]   (33)
52 / 65
AdaBoost Algorithm
Step 3
Set the α_m weight to
α_m = (1/2) ln((1 − e_m)/e_m)   (34)
Now update the weights of the data for the next iteration
If t_i ≠ y_m(x_i), i.e. a miss
w_i^{(m+1)} = w_i^{(m)} exp{α_m} = w_i^{(m)} [(1 − e_m)/e_m]^{1/2}   (35)
If t_i = y_m(x_i), i.e. a hit
w_i^{(m+1)} = w_i^{(m)} exp{−α_m} = w_i^{(m)} [e_m/(1 − e_m)]^{1/2}   (36)
53 / 65
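Steps 1 to 3 fit in a few lines of code. A minimal sketch using one-dimensional decision stumps as the weak-classifier family (the stump search and the names `adaboost`/`predict` are illustrative choices, not part of the slides):

```python
import numpy as np

def adaboost(X, t, M):
    """X: (N,) features, t: (N,) labels in {-1, +1}.
    Returns the committee as a list of (alpha, threshold, sign) stumps."""
    N = len(X)
    w = np.full(N, 1.0 / N)                       # Step 1: uniform weights
    committee = []
    for m in range(M):
        # Step 2: pick the stump with the lowest weighted error e_m (Eq. 33)
        best = None
        for thr in X:
            for s in (1, -1):
                y = s * np.sign(X - thr + 1e-12)
                e = w[y != t].sum() / w.sum()
                if best is None or e < best[0]:
                    best = (e, thr, s)
        e_m, thr, s = best
        if e_m >= 0.5:                            # no better than chance: stop
            break
        if e_m == 0.0:                            # perfect stump: take it and stop
            committee.append((1.0, thr, s))
            break
        alpha = 0.5 * np.log((1 - e_m) / e_m)     # Step 3, Eq. 34
        y = s * np.sign(X - thr + 1e-12)
        w = w * np.exp(-alpha * t * y)            # Eqs. 35-36 in one expression
        committee.append((alpha, thr, s))
    return committee

def predict(committee, X):
    agg = sum(a * s * np.sign(X - thr + 1e-12) for a, thr, s in committee)
    return np.sign(agg)
```

Note the update `w * exp(−α_m t_i y_m(x_i))` covers both cases of Step 3 at once, since t_i y_m(x_i) is +1 on a hit and −1 on a miss.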
Observations
First
The first base classifier is obtained by the usual procedure of training a single classifier.
Second
From (Eq. 35) and (Eq. 36), we can see that the weighting coefficients are increased for data points that are misclassified.
Third
The quantity e_m represents a weighted measure of the error rate.
Thus α_m gives more weight to the more accurate classifiers.
55 / 65
In addition
The pool of classifiers in Step 1 can be substituted by a family of classifiers
One whose members are trained to minimize the error function given the current weights.
Now
If instead a finite set of classifiers is given, we only need to test the classifiers once for each data point.
The Scouting Matrix S
It can be reused at each iteration by multiplying the transposed vector of weights w^{(m)} with S to obtain W_e for each machine.
56 / 65
We have then
The following
(W_e^{(1)}, W_e^{(2)}, · · · , W_e^{(M)}) = (w^{(m)})^T S   (38)
This allows us to reformulate the weight update step such that
Only misses lead to weight modification.
Note
The weight vector w^{(m)} is constructed iteratively.
It could be recomputed completely at every iteration, but the iterative construction is more efficient and simpler to implement.
57 / 65
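Eq. 38 is just a matrix-vector product over the scouting matrix. A sketch with a small illustrative S and weight vector (the numbers are made up):

```python
import numpy as np

# Scouting matrix for N = 4 points and M = 3 classifiers (1 = miss, 0 = hit)
S = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 1, 0],
              [0, 0, 0]])
w = np.array([0.1, 0.2, 0.3, 0.4])   # current weights w^(m)

We_all = w @ S                        # (w^(m))^T S: one W_e per classifier
print(We_all)                         # [0.3 0.4 0.3]
best = np.argmin(We_all)              # classifier with the lowest weighted error
```

One product thus ranks the whole pool for the current iteration, as required by Step 2 of the algorithm.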
We have the following cases
We have the following
If e_m → 1, we have that none of the samples were correctly classified!!!
Thus
We get that lim_{e_m→1} (1 − e_m)/e_m = 0, so α_m → −∞.
Finally
We get that for every misclassified sample w_i^{(m+1)} = w_i^{(m)} exp{α_m} → 0.
We only need to reverse the answers to get the perfect classifier and select it as the only committee member.
60 / 65
Thus, we have
If e_m → 1/2
We have α_m → 0; thus, whether the sample is well or badly classified:
exp{−α_m t_i y_m(x_i)} → 1   (39)
So the weights do not change at all.
What about e_m → 0?
We have that α_m → +∞.
Thus, for samples always correctly classified
w_i^{(m+1)} = w_i^{(m)} exp{−α_m t_i y_m(x_i)} → 0
Thus, the only committee member we need is m; we do not need m + 1.
61 / 65
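The three regimes above are easy to see numerically from Eq. 34 alone (a sketch; the extreme e_m values are arbitrary stand-ins for the limits):

```python
import numpy as np

alpha = lambda em: 0.5 * np.log((1 - em) / em)   # Eq. 34

print(alpha(0.5))       # 0.0: a random-guess classifier leaves the weights unchanged
print(alpha(1e-6))      # large positive: a near-perfect classifier dominates the committee
print(alpha(1 - 1e-6))  # large negative: an always-wrong classifier, flip its answers
```

The symmetry alpha(e) = −alpha(1 − e) is exactly the "reverse the answers" remark from the e_m → 1 case.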
Outline
1 Introduction
2 Bayesian Model Averaging
Model Combination Vs. Bayesian Model Averaging
3 Committees
Bootstrap Data Sets
4 Boosting
AdaBoost Development
Cost Function
Selection Process
Selecting New Classifiers
Using our Deriving Trick
AdaBoost Algorithm
Some Remarks and Implementation Issues
Explanation about AdaBoost’s behavior
Example
62 / 65