Predicting Real-valued Outputs: An introduction to regression
- 1. Predicting Real-valued Outputs: an introduction to Regression
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.
This lecture is reordered material from the Neural Nets lecture and the "Favorite Regression Algorithms" lecture.
Copyright © 2001, 2003, Andrew W. Moore
Single-Parameter Linear Regression
Copyright © 2001, 2003, Andrew W. Moore 2
- 2. Linear Regression
DATASET
inputs    outputs
x1 = 1      y1 = 1
x2 = 3      y2 = 2.2
x3 = 2      y3 = 2
x4 = 1.5    y4 = 1.9
x5 = 4      y5 = 3.1
(Figure: the datapoints with a fitted line through the origin; over a horizontal run of 1 the line rises by w.)
Linear regression assumes that the expected value of the output given an input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.
Copyright © 2001, 2003, Andrew W. Moore 3
1-parameter linear regression
Assume that the data is formed by
y_i = w x_i + noise_i
where…
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²
p(y|w,x) has a normal distribution with
• mean wx
• variance σ²
Copyright © 2001, 2003, Andrew W. Moore 4
- 3. Bayesian Linear Regression
p(y|w,x) = Normal(mean wx, var σ²)
We have a set of datapoints (x1,y1), (x2,y2), …, (xn,yn) which are EVIDENCE about w.
We want to infer w from the data: p(w | x1, x2, x3, …, xn, y1, y2, …, yn)
• You can use BAYES rule to work out a posterior distribution for w given the data.
• Or you could do Maximum Likelihood Estimation.
Copyright © 2001, 2003, Andrew W. Moore 5
Maximum likelihood estimation of w
Asks the question:
“For which value of w is this data most likely to have
happened?”
<=>
For what w is
p(y1, y2…yn |x1, x2, x3,…xn, w) maximized?
<=>
For what w is ∏_{i=1}^n p(y_i | w, x_i) maximized?
Copyright © 2001, 2003, Andrew W. Moore 6
- 4. For what w is ∏_{i=1}^n p(y_i | w, x_i) maximized?
For what w is ∏_{i=1}^n exp(−½ ((y_i − w x_i)/σ)²) maximized?
For what w is ∑_{i=1}^n −½ ((y_i − w x_i)/σ)² maximized?
For what w is ∑_{i=1}^n (y_i − w x_i)² minimized?
Copyright © 2001, 2003, Andrew W. Moore 7
Linear Regression
The maximum likelihood w is the one that minimizes the sum-of-squares of residuals:
E(w) = ∑_i (y_i − w x_i)²
     = ∑_i y_i² − (2 ∑_i x_i y_i) w + (∑_i x_i²) w²
(Figure: E(w) is a parabola in w.)
We want to minimize a quadratic function of w.
Copyright © 2001, 2003, Andrew W. Moore 8
- 5. Linear Regression
Easy to show the sum of squares is minimized when
w = ∑ x_i y_i / ∑ x_i²
The maximum likelihood model is Out(x) = wx.
We can use it for prediction.
Copyright © 2001, 2003, Andrew W. Moore 9
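For concreteness, here is a minimal Python sketch of this closed-form estimate applied to the toy dataset from slide 2 (the code and variable names are illustrative additions, not part of the original slides):

import numpy as np

# Toy dataset from the earlier slide: five (x, y) pairs.
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Maximum likelihood slope for Out(x) = w*x:  w = sum(x_i*y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x * x)

print(f"estimated w = {w:.4f}")
print("prediction at x = 2.5:", w * 2.5)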
Linear Regression
Easy to show the sum of squares is minimized when
w = ∑ x_i y_i / ∑ x_i²
The maximum likelihood model is Out(x) = wx.
We can use it for prediction.
Note: In Bayesian stats you'd have ended up with a probability distribution of w (a curve p(w) over w), and predictions would have given a probability distribution of the expected output.
Often useful to know your confidence. Max likelihood can give some kinds of confidence too.
Copyright © 2001, 2003, Andrew W. Moore 10
- 6. Multivariate Linear Regression
Copyright © 2001, 2003, Andrew W. Moore 11
Multivariate Regression
What if the inputs are vectors?
(Figure: a 2-d input example — several datapoints plotted in the x1–x2 plane.)
Dataset has form
x1  y1
x2  y2
x3  y3
:   :
xR  yR
Copyright © 2001, 2003, Andrew W. Moore 12
- 7. Multivariate Regression
Write matrix X and vector Y thus:

X = [ x11  x12  ...  x1m ]        Y = [ y1 ]
    [ x21  x22  ...  x2m ]            [ y2 ]
    [  :    :         :  ]            [ :  ]
    [ xR1  xR2  ...  xRm ]            [ yR ]

(so the k'th row of X is the input vector xk; there are R datapoints and each input has m components)
The linear regression model assumes a vector w such that
Out(x) = wᵀx = w1 x[1] + w2 x[2] + … + wm x[m]
The max. likelihood w is w = (XᵀX)⁻¹(XᵀY)
Copyright © 2001, 2003, Andrew W. Moore 13
Multivariate Regression
Write matrix X and vector Y thus:

X = [ x11  x12  ...  x1m ]        Y = [ y1 ]
    [ x21  x22  ...  x2m ]            [ y2 ]
    [  :    :         :  ]            [ :  ]
    [ xR1  xR2  ...  xRm ]            [ yR ]

(so the k'th row of X is the input vector xk; there are R datapoints and each input has m components)
The linear regression model assumes a vector w such that
Out(x) = wᵀx = w1 x[1] + w2 x[2] + … + wm x[m]
The max. likelihood w is w = (XᵀX)⁻¹(XᵀY)
IMPORTANT EXERCISE: PROVE IT !!!!!
Copyright © 2001, 2003, Andrew W. Moore 14
- 8. Multivariate Regression (con't)
The max. likelihood w is w = (XᵀX)⁻¹(XᵀY)
XᵀX is an m × m matrix: its (i,j)'th element is ∑_{k=1}^R x_ki x_kj
XᵀY is an m-element vector: its i'th element is ∑_{k=1}^R x_ki y_k
Copyright © 2001, 2003, Andrew W. Moore 15
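The normal-equations formula above is easy to exercise directly. A minimal Python sketch (illustrative only, with made-up data, not part of the original slides):

import numpy as np

rng = np.random.default_rng(0)
R, m = 50, 3                      # R datapoints, m input components
X = rng.normal(size=(R, m))       # R x m input matrix
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=R)   # noisy outputs

# Maximum likelihood weights: w = (X^T X)^{-1} (X^T y)
w = np.linalg.solve(X.T @ X, X.T @ y)   # solving the linear system is safer than forming the inverse

print("estimated w:", w)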
Constant Term in Linear Regression
Copyright © 2001, 2003, Andrew W. Moore 16
- 9. What about a constant term?
We may expect
linear data that does
not go through the
origin.
Statisticians and
Neural Net Folks all
agree on a simple
obvious hack.
Can you guess??
Copyright © 2001, 2003, Andrew W. Moore 17
The constant term
• The trick is to create a fake input "X0" that always takes the value 1

Before:            After:
X1  X2  Y          X0  X1  X2  Y
2   4   16         1   2   4   16
3   4   17         1   3   4   17
5   5   20         1   5   5   20

Before: Y = w1X1 + w2X2 …has to be a poor model
After:  Y = w0X0 + w1X1 + w2X2 = w0 + w1X1 + w2X2 …has a fine constant term
In this example, you should be able to see the MLE w0, w1 and w2 by inspection.
Copyright © 2001, 2003, Andrew W. Moore 18
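The hack is one line of code: prepend a column of ones and reuse the multivariate formula. A small Python sketch of that idea on the toy table above (illustrative, not from the original slides):

import numpy as np

X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
y = np.array([16.0, 17.0, 20.0])

# Fake input X0 = 1 for every record, then the usual normal equations.
X_aug = np.column_stack([np.ones(len(X)), X])
w = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

print("w0, w1, w2 =", w)   # the intercept w0 is the first entry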
- 10. Heteroscedasticity…
Linear Regression with varying noise
Copyright © 2001, 2003, Andrew W. Moore 19
Regression with varying noise
• Suppose you know the variance of the noise that was added to each datapoint.

x_i   y_i   σ_i²
½     ½     4
1     1     1
2     1     1/4
2     3     4
3     2     1/4

(Figure: the datapoints plotted for x, y in [0, 3], each with its own error bar of size σ = 2, 1, 1/2, 2 or 1/2.)
Assume y_i ~ N(w x_i, σ_i²).
What's the MLE estimate of w?
Copyright © 2001, 2003, Andrew W. Moore 20
- 11. MLE estimation with varying noise
argmax_w log p(y_1, y_2, …, y_R | x_1, x_2, …, x_R, σ_1², σ_2², …, σ_R², w)
   = argmin_w ∑_{i=1}^R (y_i − w x_i)² / σ_i²          (assuming independence among the noise terms, then plugging in the equation for the Gaussian and simplifying)
   = the w such that ∑_{i=1}^R x_i (y_i − w x_i) / σ_i² = 0          (setting dLL/dw equal to zero)
   = ( ∑_{i=1}^R x_i y_i / σ_i² ) / ( ∑_{i=1}^R x_i² / σ_i² )          (trivial algebra)
Copyright © 2001, 2003, Andrew W. Moore 21
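A minimal Python sketch of the closed-form weighted estimate above, applied to the five datapoints in the table (illustrative, not part of the original slides):

import numpy as np

x      = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
y      = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
sigma2 = np.array([4.0, 1.0, 0.25, 4.0, 0.25])   # known noise variance for each datapoint

# MLE for y_i ~ N(w*x_i, sigma_i^2):  w = sum(x*y/sigma^2) / sum(x^2/sigma^2)
w = np.sum(x * y / sigma2) / np.sum(x * x / sigma2)

print(f"weighted MLE w = {w:.4f}")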
This is Weighted Regression
• We are asking to minimize the weighted sum of squares
argmin_w ∑_{i=1}^R (y_i − w x_i)² / σ_i²
where the weight for the i'th datapoint is 1/σ_i²
(Figure: the same datapoints and error bars as on the previous slide.)
Copyright © 2001, 2003, Andrew W. Moore 22
- 12. Non-linear Regression
Copyright © 2001, 2003, Andrew W. Moore 23
Non-linear Regression
• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

x_i   y_i
½     ½
1     2.5
2     3
3     2
3     3

(Figure: the datapoints plotted for x, y in [0, 3].)
Assume y_i ~ N(√(w + x_i), σ²).
What's the MLE estimate of w?
Copyright © 2001, 2003, Andrew W. Moore 24
- 13. Non-linear MLE estimation
argmax_w log p(y_1, y_2, …, y_R | x_1, x_2, …, x_R, σ, w)
   = argmin_w ∑_{i=1}^R (y_i − √(w + x_i))²          (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
   = the w such that ∑_{i=1}^R (y_i − √(w + x_i)) / √(w + x_i) = 0          (setting dLL/dw equal to zero)
Copyright © 2001, 2003, Andrew W. Moore 25
Non-linear MLE estimation
argmax_w log p(y_1, y_2, …, y_R | x_1, x_2, …, x_R, σ, w)
   = argmin_w ∑_{i=1}^R (y_i − √(w + x_i))²          (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
   = the w such that ∑_{i=1}^R (y_i − √(w + x_i)) / √(w + x_i) = 0          (setting dLL/dw equal to zero)
We're down the algebraic toilet. So guess what we do?
Copyright © 2001, 2003, Andrew W. Moore 26
- 14. Non-linear MLE estimation
argmax_w log p(y_1, y_2, …, y_R | x_1, x_2, …, x_R, σ, w)
   = argmin_w ∑_{i=1}^R (y_i − √(w + x_i))²          (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
   = the w such that ∑_{i=1}^R (y_i − √(w + x_i)) / √(w + x_i) = 0          (setting dLL/dw equal to zero)
We're down the algebraic toilet. So guess what we do?
Common (but not the only) approach: Numerical Solutions:
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg-Marquardt
• Newton's Method
Also, special-purpose statistical-optimization-specific tricks such as E.M. (See the Gaussian Mixtures lecture for an introduction.)
Copyright © 2001, 2003, Andrew W. Moore 27
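As an illustration of the numerical route, here is a minimal gradient-descent sketch for the one-parameter example above, y_i ~ N(√(w + x_i), σ²), using the toy data from the previous slide (the learning rate, starting point and code are illustrative choices, not from the original slides):

import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0, 3.0])
y = np.array([0.5, 2.5, 3.0, 2.0, 3.0])

def sse(w):
    # Sum of squared residuals for the model Out(x) = sqrt(w + x).
    return np.sum((y - np.sqrt(w + x)) ** 2)

def grad(w):
    # d/dw of the sum of squares: -sum((y_i - sqrt(w + x_i)) / sqrt(w + x_i))
    r = np.sqrt(w + x)
    return -np.sum((y - r) / r)

w, lr = 1.0, 0.05          # initial guess and learning rate (both arbitrary)
for _ in range(2000):
    w -= lr * grad(w)

print(f"w = {w:.4f}, sum of squares = {sse(w):.4f}")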
Polynomial Regression
Copyright © 2001, 2003, Andrew W. Moore 28
- 15. Polynomial Regression
So far we've mainly been dealing with linear regression:

X1  X2  Y
3   2   7        x1 = (3,2)…  y1 = 7…
1   1   3
:   :   :

X = [ 3  2 ]     y = [ 7 ]
    [ 1  1 ]         [ 3 ]
    [ :  : ]         [ : ]

Z = [ 1  3  2 ]      zk = (1, xk1, xk2)
    [ 1  1  1 ]
    [ :  :  : ]

β = (ZᵀZ)⁻¹(Zᵀy)
yest = β0 + β1 x1 + β2 x2
Copyright © 2001, 2003, Andrew W. Moore 29
Quadratic Regression
It's trivial to do linear fits of fixed nonlinear basis functions:

X1  X2  Y
3   2   7        x1 = (3,2)…  y1 = 7…
1   1   3
:   :   :

X = [ 3  2 ]     y = [ 7 ]
    [ 1  1 ]         [ 3 ]
    [ :  : ]         [ : ]

Z = [ 1  3  2  9  6  4 ]      z = (1, x1, x2, x1², x1x2, x2²)
    [ 1  1  1  1  1  1 ]
    [ :                ]

β = (ZᵀZ)⁻¹(Zᵀy)
yest = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x1x2 + β5 x2²
Copyright © 2001, 2003, Andrew W. Moore 30
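A minimal Python sketch of this "linear fit of fixed nonlinear basis functions" idea for two inputs (illustrative data and names, not from the original slides):

import numpy as np

def quadratic_basis(X):
    # Map each row (x1, x2) to z = (1, x1, x2, x1^2, x1*x2, x2^2).
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(100, 2))
y = 1 + 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0]**2 + 0.1 * rng.normal(size=100)

Z = quadratic_basis(X)
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)    # beta = (Z^T Z)^{-1} (Z^T y)

print("beta =", np.round(beta, 3))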
- 16. Quadratic Regression
It's trivial to do linear fits of fixed nonlinear basis functions.
Each component of the z vector is called a term.
Each column of the Z matrix is called a term column.
How many terms in a quadratic regression with m inputs?
• 1 constant term
• m linear terms
• (m+1)-choose-2 = m(m+1)/2 quadratic terms
(m+2)-choose-2 terms in total = O(m²)
Note that solving β = (ZᵀZ)⁻¹(Zᵀy) is thus O(m⁶)
Copyright © 2001, 2003, Andrew W. Moore 31
Qth-degree polynomial Regression

X1  X2  Y
3   2   7        x1 = (3,2)…  y1 = 7…
1   1   3
:   :   :

X = [ 3  2 ]     y = [ 7 ]
    [ 1  1 ]         [ 3 ]
    [ :  : ]         [ : ]

Z = [ 1  3  2  9  6  … ]      z = (all products of powers of inputs in which the sum of powers is Q or less)
    [ 1  1  1  1  1  … ]
    [ :                ]

β = (ZᵀZ)⁻¹(Zᵀy)
yest = β0 + β1 x1 + …
Copyright © 2001, 2003, Andrew W. Moore 32
- 17. m inputs, degree Q: how many terms?
= the number of unique terms of the form x1^q1 x2^q2 … xm^qm where ∑_{i=1}^m q_i ≤ Q
= the number of unique terms of the form 1^q0 x1^q1 x2^q2 … xm^qm where ∑_{i=0}^m q_i = Q
= the number of lists of non-negative integers [q0, q1, q2, …, qm] in which ∑ q_i = Q
= the number of ways of placing Q red disks on a row of squares of length Q+m = (Q+m)-choose-Q
Example: Q=11, m=4:  q0=2, q1=2, q2=0, q3=4, q4=3
Copyright © 2001, 2003, Andrew W. Moore 33
Radial Basis Functions
Copyright © 2001, 2003, Andrew W. Moore 34
- 18. Radial Basis Functions (RBFs)

X1  X2  Y
3   2   7        x1 = (3,2)…  y1 = 7…
1   1   3
:   :   :

X = [ 3  2 ]     y = [ 7 ]
    [ 1  1 ]         [ 3 ]
    [ :  : ]         [ : ]

Z = [ …  …  … ]      z = (list of radial basis function evaluations)
    [ …  …  … ]

β = (ZᵀZ)⁻¹(Zᵀy)
yest = β0 + β1 x1 + …
Copyright © 2001, 2003, Andrew W. Moore 35
1-d RBFs
(Figure: three bump-shaped basis functions along the x axis, centred at c1, c2, c3.)
yest = β1 φ1(x) + β2 φ2(x) + β3 φ3(x)
where
φi(x) = KernelFunction( | x − ci | / KW )
Copyright © 2001, 2003, Andrew W. Moore 36
- 19. Example
(Figure: the same three basis functions, now scaled by their coefficients.)
yest = 2φ1(x) + 0.05φ2(x) + 0.5φ3(x)
where
φi(x) = KernelFunction( | x − ci | / KW )
Copyright © 2001, 2003, Andrew W. Moore 37
RBFs with Linear Regression
All the ci's are held constant (initialized randomly or on a grid in m-dimensional input space).
KW is also held constant (initialized to be large enough that there's decent overlap between basis functions*).
*Usually much better than the crappy overlap on my diagram
(Figure: the three scaled basis functions centred at c1, c2, c3.)
yest = 2φ1(x) + 0.05φ2(x) + 0.5φ3(x)
where
φi(x) = KernelFunction( | x − ci | / KW )
Copyright © 2001, 2003, Andrew W. Moore 38
- 20. RBFs with Linear Regression
All the ci's are held constant (initialized randomly or on a grid in m-dimensional input space).
KW is also held constant (initialized to be large enough that there's decent overlap between basis functions*).
*Usually much better than the crappy overlap on my diagram
yest = 2φ1(x) + 0.05φ2(x) + 0.5φ3(x)
where
φi(x) = KernelFunction( | x − ci | / KW )
Then given Q basis functions, define the matrix Z such that Zkj = KernelFunction( | xk − cj | / KW ), where xk is the kth vector of inputs.
And as before, β = (ZᵀZ)⁻¹(Zᵀy)
Copyright © 2001, 2003, Andrew W. Moore 39
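A minimal Python sketch of RBF regression with fixed centres and a fixed kernel width, using a Gaussian bump as the KernelFunction (the data, centres and width are illustrative choices, not from the original slides):

import numpy as np

def rbf_design_matrix(x, centers, kw):
    # Z[k, j] = KernelFunction(|x_k - c_j| / kw), here a Gaussian bump.
    d = np.abs(x[:, None] - centers[None, :]) / kw
    return np.exp(-0.5 * d**2)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, size=80))
y = np.sin(x) + 0.1 * rng.normal(size=80)

centers = np.linspace(0, 10, 8)        # c_j held constant, on a grid
kw = 1.5                               # KW held constant, wide enough to overlap

Z = rbf_design_matrix(x, centers, kw)
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)      # beta = (Z^T Z)^{-1} (Z^T y)

y_est = Z @ beta
print("training RMS error:", np.sqrt(np.mean((y - y_est) ** 2)))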
RBFs with NonLinear Regression
Allow the ci's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).
KW is allowed to adapt to the data. (Some folks even let each basis function have its own KWj, permitting fine detail in dense regions of input space.)
yest = 2φ1(x) + 0.05φ2(x) + 0.5φ3(x)
where
φi(x) = KernelFunction( | x − ci | / KW )
But how do we now find all the βj's, ci's and KW?
Copyright © 2001, 2003, Andrew W. Moore 40
- 21. RBFs with NonLinear Regression
Allow the ci's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).
KW is allowed to adapt to the data. (Some folks even let each basis function have its own KWj, permitting fine detail in dense regions of input space.)
yest = 2φ1(x) + 0.05φ2(x) + 0.5φ3(x)
where
φi(x) = KernelFunction( | x − ci | / KW )
But how do we now find all the βj's, ci's and KW?
Answer: Gradient Descent
Copyright © 2001, 2003, Andrew W. Moore 41
RBFs with NonLinear Regression
Allow the ci's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).
KW is allowed to adapt to the data. (Some folks even let each basis function have its own KWj, permitting fine detail in dense regions of input space.)
yest = 2φ1(x) + 0.05φ2(x) + 0.5φ3(x)
where
φi(x) = KernelFunction( | x − ci | / KW )
But how do we now find all the βj's, ci's and KW?
Answer: Gradient Descent
(But I'd like to see, or hope someone's already done, a hybrid, where the ci's and KW are updated with gradient descent while the βj's use matrix inversion.)
Copyright © 2001, 2003, Andrew W. Moore 42
- 22. Radial Basis Functions in 2-d
Two inputs. Outputs (heights sticking out of the page) not shown.
(Figure: the x1–x2 plane with a centre marked and its sphere of significant influence.)
Copyright © 2001, 2003, Andrew W. Moore 43
Happy RBFs in 2-d
Blue dots denote coordinates of input vectors.
(Figure: the input points in the x1–x2 plane with RBF centres and their spheres of significant influence.)
Copyright © 2001, 2003, Andrew W. Moore 44
- 23. Crabby RBFs in 2-d
What's the problem in this example?
Blue dots denote coordinates of input vectors.
(Figure: the input points in the x1–x2 plane with RBF centres and their spheres of significant influence.)
Copyright © 2001, 2003, Andrew W. Moore 45
More crabby RBFs
And what's the problem in this example?
Blue dots denote coordinates of input vectors.
(Figure: the input points in the x1–x2 plane with RBF centres and their spheres of significant influence.)
Copyright © 2001, 2003, Andrew W. Moore 46
- 24. Hopeless!
Even before seeing the data, you should understand that this is a disaster!
(Figure: RBF centres and their spheres of significant influence in the x1–x2 plane.)
Copyright © 2001, 2003, Andrew W. Moore 47
Unhappy
Even before seeing the data, you should understand that this isn't good either.
(Figure: RBF centres and their spheres of significant influence in the x1–x2 plane.)
Copyright © 2001, 2003, Andrew W. Moore 48
- 25. Robust Regression
Copyright © 2001, 2003, Andrew W. Moore 49
Robust Regression
(Figure: a scatter of datapoints in the x–y plane.)
Copyright © 2001, 2003, Andrew W. Moore 50
- 26. Robust Regression
This is the best fit that Quadratic Regression can manage.
(Figure: the quadratic fit drawn through the datapoints.)
Copyright © 2001, 2003, Andrew W. Moore 51
Robust Regression
…but this is what we'd probably prefer.
(Figure: the preferred fit drawn through the datapoints.)
Copyright © 2001, 2003, Andrew W. Moore 52
- 27. LOESS-based Robust Regression
After the initial fit, score each datapoint according to how well it's fitted…
(Figure: the fit with one point labelled "You are a very good datapoint.")
Copyright © 2001, 2003, Andrew W. Moore 53
LOESS-based Robust Regression
After the initial fit, score each datapoint according to how well it's fitted…
(Figure: the fit with points labelled "You are a very good datapoint." and "You are not too shabby.")
Copyright © 2001, 2003, Andrew W. Moore 54
- 28. LOESS-based Robust Regression
After the initial fit, score each datapoint according to how well it's fitted…
(Figure: the fit with points labelled "You are a very good datapoint.", "You are not too shabby." and "But you are pathetic.")
Copyright © 2001, 2003, Andrew W. Moore 55
Robust Regression
For k = 1 to R…
• Let (xk, yk) be the kth datapoint
• Let yestk be the predicted value of yk
• Let wk be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:
wk = KernelFn([yk − yestk]²)
Copyright © 2001, 2003, Andrew W. Moore 56
- 29. Robust Regression
For k = 1 to R…
• Let (xk, yk) be the kth datapoint
• Let yestk be the predicted value of yk
• Let wk be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:
wk = KernelFn([yk − yestk]²)
Then redo the regression using weighted datapoints.
Weighted regression was described earlier in the "varying noise" section, and is also discussed in the "Memory-based Learning" lecture.
Guess what happens next?
Copyright © 2001, 2003, Andrew W. Moore 57
Robust Regression
For k = 1 to R…
• Let (xk, yk) be the kth datapoint
• Let yestk be the predicted value of yk
• Let wk be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:
wk = KernelFn([yk − yestk]²)
Then redo the regression using weighted datapoints.
I taught you how to do this in the "Instance-based" lecture (only then the weights depended on distance in input-space).
Repeat the whole thing until converged!
Copyright © 2001, 2003, Andrew W. Moore 58
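A minimal Python sketch of that reweight-and-refit loop, using a quadratic model, weighted least squares, and (as one illustrative choice of KernelFn) a Gaussian of the squared residuals scaled by their median — the data, kernel and scale rule below are my own illustrations, not from the original slides:

import numpy as np

def fit_weighted_quadratic(x, y, w):
    # Weighted least squares for yest = b0 + b1*x + b2*x^2.
    Z = np.column_stack([np.ones_like(x), x, x**2])
    W = np.diag(w)
    return np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)

rng = np.random.default_rng(3)
x = np.linspace(0, 4, 60)
y = 1 + 0.5 * x - 0.2 * x**2 + 0.1 * rng.normal(size=60)
y[::15] += 3.0                      # plant a few gross outliers

weights = np.ones_like(x)
for _ in range(5):                  # repeat until (approximately) converged
    beta = fit_weighted_quadratic(x, y, weights)
    resid2 = (y - np.column_stack([np.ones_like(x), x, x**2]) @ beta) ** 2
    scale = np.median(resid2) + 1e-12          # illustrative choice of residual scale
    weights = np.exp(-resid2 / (2 * scale))    # KernelFn: badly fitted points get tiny weight

print("robust beta:", np.round(beta, 3))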
- 30. Robust Regression---what we're doing
What regular regression does:
Assume yk was originally generated using the following recipe:
yk = β0 + β1 xk + β2 xk² + N(0, σ²)
Computational task is to find the Maximum Likelihood β0, β1 and β2.
Robust Regression---what we're doing
What LOESS robust regression does:
Assume yk was originally generated using the following recipe:
With probability p:
yk = β0 + β1 xk + β2 xk² + N(0, σ²)
But otherwise:
yk ~ N(µ, σhuge²)
Computational task is to find the Maximum Likelihood β0, β1, β2, p, µ and σhuge.
Copyright © 2001, 2003, Andrew W. Moore 60
- 31. Robust Regression---what we're doing
What LOESS robust regression does:
Assume yk was originally generated using the following recipe:
With probability p:
yk = β0 + β1 xk + β2 xk² + N(0, σ²)
But otherwise:
yk ~ N(µ, σhuge²)
Computational task is to find the Maximum Likelihood β0, β1, β2, p, µ and σhuge.
Mysteriously, the reweighting procedure does this computation for us.
Your first glimpse of two spectacular letters: E.M.
Copyright © 2001, 2003, Andrew W. Moore 61
Regression Trees
Copyright © 2001, 2003, Andrew W. Moore 62
- 32. Regression Trees
• “Decision trees for regression”
Copyright © 2001, 2003, Andrew W. Moore 63
A regression tree leaf
Predict age = 47
Mean age of records
matching this leaf node
Copyright © 2001, 2003, Andrew W. Moore 64
- 33. A one-split regression tree
Gender?
Female Male
Predict age = 39 Predict age = 36
Copyright © 2001, 2003, Andrew W. Moore 65
Choosing the attribute to split on
Gender   Rich?   Num. Children   Num. Beany Babies   Age
Female   No      2               1                   38
Male     No      0               0                   24
Male     Yes     0               5+                  72
:        :       :               :                   :
• We can’t use
information gain.
• What should we use?
Copyright © 2001, 2003, Andrew W. Moore 66
- 34. Choosing the attribute to split on
Gender   Rich?   Num. Children   Num. Beany Babies   Age
Female   No      2               1                   38
Male     No      0               0                   24
Male     Yes     0               5+                  72
:        :       :               :                   :
MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value.
If we're told x=j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x=j. Call this mean quantity µ_{y|x=j}.
Then…
MSE(Y|X) = (1/R) ∑_{j=1}^{N_X} ∑_{k such that x_k = j} (y_k − µ_{y|x=j})²
Copyright © 2001, 2003, Andrew W. Moore 67
Choosing the attribute to split on
Gender   Rich?   Num. Children   Num. Beany Babies   Age
Female   No      2               1                   38
Male     No      0               0                   24
Male     Yes     0               5+                  72
:        :       :               :                   :
Regression tree attribute selection: greedily choose the attribute that minimizes MSE(Y|X).
Guess what we do about real-valued inputs?
Guess how we prevent overfitting.
MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value.
If we're told x=j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x=j. Call this mean quantity µ_{y|x=j}.
Then…
MSE(Y|X) = (1/R) ∑_{j=1}^{N_X} ∑_{k such that x_k = j} (y_k − µ_{y|x=j})²
Copyright © 2001, 2003, Andrew W. Moore 68
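A minimal Python sketch of that greedy selection step — compute MSE(Y|X) for each categorical attribute and pick the smallest (the toy records below are made up for illustration, not from the original slides):

import numpy as np

def mse_given_x(x_column, y):
    # MSE(Y|X) = (1/R) * sum over values j of sum over records with x=j of (y_k - mean_j)^2
    total = 0.0
    for j in np.unique(x_column):
        y_j = y[x_column == j]
        total += np.sum((y_j - y_j.mean()) ** 2)
    return total / len(y)

# Toy records loosely modelled on the slide's table (made-up values).
gender = np.array(["F", "M", "M", "F", "M", "F"])
rich   = np.array(["No", "No", "Yes", "No", "Yes", "No"])
age    = np.array([38.0, 24.0, 72.0, 31.0, 65.0, 29.0])

scores = {"Gender": mse_given_x(gender, age), "Rich?": mse_given_x(rich, age)}
best = min(scores, key=scores.get)
print(scores, "-> split on", best)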
- 35. Pruning Decision
…property-owner = Yes
Gender? ("Do I deserve to live?")
Female: Predict age = 39          Male: Predict age = 36
# property-owning females = 56712        # property-owning males = 55800
Mean age among POFs = 39                 Mean age among POMs = 36
Age std dev among POFs = 12              Age std dev among POMs = 11.5
Use a standard Chi-squared test of the null hypothesis "these two populations have the same mean" and Bob's your uncle.
Copyright © 2001, 2003, Andrew W. Moore 69
Linear Regression Trees
…property-owner = Yes
Also known as "Model Trees"
Gender?
Female: Predict age = 26 + 6 * NumChildren − 2 * YearsEducation
Male:   Predict age = 24 + 7 * NumChildren − 2.5 * YearsEducation
Leaves contain linear functions (trained using linear regression on all records matching that leaf).
Split attribute chosen to minimize MSE of regressed children.
Pruning with a different Chi-squared test.
Copyright © 2001, 2003, Andrew W. Moore 70
- 36. Linear Regression Trees
…property-owner = Yes
Also known as "Model Trees"
Gender?
Female: Predict age = 26 + 6 * NumChildren − 2 * YearsEducation
Male:   Predict age = 24 + 7 * NumChildren − 2.5 * YearsEducation
Leaves contain linear functions (trained using linear regression on all records matching that leaf).
Split attribute chosen to minimize MSE of regressed children.
Pruning with a different Chi-squared test.
Detail: You typically ignore any categorical attribute that has been tested higher up in the tree during the regression, but use all untested attributes, and use real-valued attributes even if they've been tested above.
Copyright © 2001, 2003, Andrew W. Moore 71
Test your understanding
Assuming regular regression trees, can you sketch a
graph of the fitted function yest(x) over this diagram?
(Figure: a scatter of datapoints in the x–y plane.)
Copyright © 2001, 2003, Andrew W. Moore 72
- 37. Test your understanding
Assuming linear regression trees, can you sketch a graph
of the fitted function yest(x) over this diagram?
(Figure: a scatter of datapoints in the x–y plane.)
Copyright © 2001, 2003, Andrew W. Moore 73
Multilinear Interpolation
Copyright © 2001, 2003, Andrew W. Moore 74
- 38. Multilinear Interpolation
Consider this dataset. Suppose we wanted to create a
continuous and piecewise linear fit to the data
(Figure: a scatter of datapoints in the x–y plane.)
Copyright © 2001, 2003, Andrew W. Moore 75
Multilinear Interpolation
Create a set of knot points: selected X-coordinates
(usually equally spaced) that cover the data
(Figure: the datapoints with knot points q1, q2, q3, q4, q5 marked along the x axis.)
Copyright © 2001, 2003, Andrew W. Moore 76
- 39. Multilinear Interpolation
We are going to assume the data was generated by a
noisy version of a function that can only bend at the
knots. Here are 3 examples (none fits the data well)
(Figure: three candidate piecewise linear functions bending only at the knots q1 … q5.)
Copyright © 2001, 2003, Andrew W. Moore 77
How to find the best fit?
Idea 1: Simply perform a separate regression in each
segment for each part of the curve
What’s the problem with this idea?
(Figure: the data divided into segments at the knots q1 … q5.)
Copyright © 2001, 2003, Andrew W. Moore 78
- 40. How to find the best fit?
Let's look at what goes on in the red segment (between knots q2 and q3):
yest(x) = ((q3 − x) / w) h2 + ((x − q2) / w) h3,   where w = q3 − q2
(Figure: the segment between q2 and q3, with knot heights h2 and h3.)
Copyright © 2001, 2003, Andrew W. Moore 79
How to find the best fit?
In the red segment… yest(x) = h2 φ2(x) + h3 φ3(x)
where φ2(x) = 1 − (x − q2)/w,   φ3(x) = 1 − (q3 − x)/w
(Figure: φ2(x) drawn over the segment, with knot heights h2 and h3.)
Copyright © 2001, 2003, Andrew W. Moore 80
- 41. How to find the best fit?
In the red segment… yest(x) = h2 φ2(x) + h3 φ3(x)
where φ2(x) = 1 − (x − q2)/w,   φ3(x) = 1 − (q3 − x)/w
(Figure: φ2(x) and φ3(x) drawn over the segment, with knot heights h2 and h3.)
Copyright © 2001, 2003, Andrew W. Moore 81
How to find the best fit?
In the red segment… yest(x) = h2 φ2(x) + h3 φ3(x)
where φ2(x) = 1 − |x − q2| / w,   φ3(x) = 1 − |x − q3| / w
(Figure: φ2(x) and φ3(x) drawn over the segment, with knot heights h2 and h3.)
Copyright © 2001, 2003, Andrew W. Moore 82
- 42. How to find the best fit?
In the red segment… yest(x) = h2 φ2(x) + h3 φ3(x)
where φ2(x) = 1 − |x − q2| / w,   φ3(x) = 1 − |x − q3| / w
(Figure: φ2(x) and φ3(x) drawn over the segment, with knot heights h2 and h3.)
Copyright © 2001, 2003, Andrew W. Moore 83
How to find the best fit?
In general: yest(x) = ∑_{i=1}^{NK} h_i φ_i(x)
where φ_i(x) = 1 − |x − q_i| / w   if |x − q_i| < w,   0 otherwise
(Figure: the triangular basis functions over the knots, with heights h2 and h3 shown.)
Copyright © 2001, 2003, Andrew W. Moore 84
- 43. How to find the best fit?
In general: yest(x) = ∑_{i=1}^{NK} h_i φ_i(x)
where φ_i(x) = 1 − |x − q_i| / w   if |x − q_i| < w,   0 otherwise
And this is simply a basis function regression problem!
We know how to find the least squares h_i's!
(Figure: the triangular basis functions over the knots q1 … q5.)
Copyright © 2001, 2003, Andrew W. Moore 85
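A minimal Python sketch of fitting the knot heights by least squares with those triangular ("tent") basis functions (the data and knot choices below are illustrative, not from the original slides):

import numpy as np

def tent_basis(x, knots, w):
    # phi_i(x) = 1 - |x - q_i|/w if |x - q_i| < w, else 0; one column per knot.
    d = np.abs(x[:, None] - knots[None, :]) / w
    return np.where(d < 1.0, 1.0 - d, 0.0)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 4, size=120))
y = np.sin(2 * x) + 0.1 * rng.normal(size=120)

knots = np.linspace(0, 4, 5)          # q1 .. q5, equally spaced
w = knots[1] - knots[0]               # knot spacing

Phi = tent_basis(x, knots, w)
h = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # least squares knot heights

print("knot heights:", np.round(h, 3))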
In two dimensions…
Blue dots show locations of input vectors (outputs not depicted).
(Figure: the input points scattered in the x1–x2 plane.)
Copyright © 2001, 2003, Andrew W. Moore 86
- 44. In two dimensions…
Blue dots show locations of input vectors (outputs not depicted).
Each purple dot is a knot point. It will contain the height of the estimated surface.
(Figure: the input points in the x1–x2 plane, with a grid of knot points overlaid.)
Copyright © 2001, 2003, Andrew W. Moore 87
In two dimensions…
Blue dots show locations of input vectors (outputs not depicted).
Each purple dot is a knot point. It will contain the height of the estimated surface.
But how do we do the interpolation to ensure that the surface is continuous?
(Figure: four neighbouring knot points with heights 9, 3, 7 and 8.)
Copyright © 2001, 2003, Andrew W. Moore 88
- 45. In two dimensions…
Blue dots show locations of input vectors (outputs not depicted).
Each purple dot is a knot point. It will contain the height of the estimated surface.
But how do we do the interpolation to ensure that the surface is continuous?
To predict the value here…
(Figure: a query point inside the cell whose corner knots have heights 9, 3, 7 and 8.)
Copyright © 2001, 2003, Andrew W. Moore 89
In two dimensions…
Blue dots show locations of input vectors (outputs not depicted).
Each purple dot is a knot point. It will contain the height of the estimated surface.
But how do we do the interpolation to ensure that the surface is continuous?
To predict the value here…
First interpolate its value on two opposite edges…
(Figure: the query point inside the cell with corner heights 9, 3, 7 and 8; the two edge interpolations give 7 and 7.33.)
Copyright © 2001, 2003, Andrew W. Moore 90
- 46. In two dimensions…
Blue dots show locations of input vectors (outputs not depicted).
Each purple dot is a knot point. It will contain the height of the estimated surface.
But how do we do the interpolation to ensure that the surface is continuous?
To predict the value here…
First interpolate its value on two opposite edges…
Then interpolate between those two values.
(Figure: edge interpolations 7 and 7.33 between the corner heights 9, 3, 7, 8; interpolating between them gives the prediction 7.05.)
Copyright © 2001, 2003, Andrew W. Moore 91
In two dimensions…
Blue dots show locations of input vectors (outputs not depicted).
Each purple dot is a knot point. It will contain the height of the estimated surface.
But how do we do the interpolation to ensure that the surface is continuous?
To predict the value here… first interpolate its value on two opposite edges, then interpolate between those two values.
(Figure: edge interpolations 7 and 7.33 between the corner heights 9, 3, 7, 8; the final prediction is 7.05.)
Notes:
This can easily be generalized to m dimensions.
It should be easy to see that it ensures continuity.
The patches are not linear.
Copyright © 2001, 2003, Andrew W. Moore 92
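A minimal Python sketch of that edge-then-middle interpolation for one 2-d cell, using the corner heights from the figure (the query position tx, ty is my own guess chosen to roughly reproduce the 7.05 in the figure; it is not given on the slide):

def bilinear(h_bl, h_br, h_tl, h_tr, tx, ty):
    # tx, ty in [0, 1]: fractional position of the query point inside the cell.
    # First interpolate along two opposite edges (bottom and top)...
    bottom = (1 - tx) * h_bl + tx * h_br
    top    = (1 - tx) * h_tl + tx * h_tr
    # ...then interpolate between those two values.
    return (1 - ty) * bottom + ty * top

# Corner knot heights as in the slide's figure: top row 9, 3; bottom row 7, 8.
print(round(bilinear(7.0, 8.0, 9.0, 3.0, tx=1/3, ty=0.85), 2))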
- 47. Doing the regression
Given data, how do we find the optimal knot heights?
Happily, it's simply a two-dimensional basis function problem. (Working out the basis functions is tedious, unilluminating, and easy.)
What's the problem in higher dimensions?
(Figure: the knot grid with heights 9, 3, 7, 8 over the x1–x2 plane.)
Copyright © 2001, 2003, Andrew W. Moore 93
MARS: Multivariate Adaptive Regression Splines
Copyright © 2001, 2003, Andrew W. Moore 94
- 48. MARS
• Multivariate Adaptive Regression Splines
• Invented by Jerry Friedman (one of
Andrew’s heroes)
• Simplest version:
Let’s assume the function we are learning is of the
following form:
yest(x) = ∑_{k=1}^m g_k(x_k)
Instead of a linear combination of the inputs, it’s a linear
combination of non-linear functions of individual inputs
Copyright © 2001, 2003, Andrew W. Moore 95
MARS
yest(x) = ∑_{k=1}^m g_k(x_k)
Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.
Idea: Each g_k is one of these:
(Figure: a piecewise linear function over knots q1 … q5.)
Copyright © 2001, 2003, Andrew W. Moore 96
- 49. MARS
yest(x) = ∑_{k=1}^m g_k(x_k)
Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs:
yest(x) = ∑_{k=1}^m ∑_{j=1}^{NK} h^k_j φ^k_j(x_k)
where φ^k_j(x) = 1 − |x_k − q^k_j| / w_k   if |x_k − q^k_j| < w_k,   0 otherwise
q^k_j: the location of the j'th knot in the k'th dimension
h^k_j: the regressed height of the j'th knot in the k'th dimension
w_k: the spacing between knots in the k'th dimension
(Figure: the knots q1 … q5 along one input dimension.)
Copyright © 2001, 2003, Andrew W. Moore 97
That's not complicated enough!
• Okay, now let's get serious. We'll allow arbitrary "two-way interactions":
yest(x) = ∑_{k=1}^m g_k(x_k) + ∑_{k=1}^m ∑_{t=k+1}^m g_kt(x_k, x_t)
The function we're learning is allowed to be a sum of non-linear functions over all one-d and 2-d subsets of attributes.
Can still be expressed as a linear combination of basis functions.
Thus learnable by linear regression.
Full MARS: Uses cross-validation to choose a subset of subspaces, knot resolution and other parameters.
Copyright © 2001, 2003, Andrew W. Moore 98
- 50. If you like MARS…
…See also CMAC (Cerebellar Model Articulated
Controller) by James Albus (another of
Andrew’s heroes)
• Many of the same gut-level intuitions
• But entirely in a neural-network, biologically
plausible way
• (All the low dimensional functions are by
means of lookup tables, trained with a delta-
rule and using a clever blurred update and
hash-tables)
Copyright © 2001, 2003, Andrew W. Moore 99
Where are we now?
Inputs → Inference Engine → p(E1|E2):   Joint DE
Inputs → Classifier → Predict category:   Dec Tree, Gauss/Joint BC, Gauss Naïve BC
Inputs → Density Estimator → Probability:   Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE
Inputs → Regressor → Predict real no.:   Linear Regression, Polynomial Regression, RBFs, Robust Regression, Regression Trees, Multilinear Interp, MARS
Copyright © 2001, 2003, Andrew W. Moore 100
- 51. Citations
Radial Basis Functions
T. Poggio and F. Girosi, Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks, Science, 247, 978-982, 1989.
LOESS
W. S. Cleveland, Robust Locally Weighted Regression and Smoothing Scatterplots, Journal of the American Statistical Association, 74(368), 829-836, December 1979.
Regression Trees etc
L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984.
J. R. Quinlan, Combining Instance-Based and Model-Based Learning, Machine Learning: Proceedings of the Tenth International Conference, 1993.
MARS
J. H. Friedman, Multivariate Adaptive Regression Splines, Department for Statistics, Stanford University, Technical Report No. 102, 1988.
Copyright © 2001, 2003, Andrew W. Moore 101