Predicting Real-valued Outputs: An Introduction to Regression

                                                       Andrew W. Moore
                                                       Professor
                                                       School of Computer Science
                                                       Carnegie Mellon University
                                                       www.cs.cmu.edu/~awm
                                                       awm@cs.cmu.edu
                                                       412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

This material is reordered from the Neural Nets lecture and the "Favorite Regression Algorithms" lecture.
                    Copyright © 2001, 2003, Andrew W. Moore




                                  Single-
                                Parameter
                                  Linear
                                Regression
  Copyright © 2001, 2003, Andrew W. Moore                                                                 2
Linear Regression
                                                      DATASET

                                           inputs          outputs
                                           x1 = 1          y1 = 1
                                           x2 = 3          y2 = 2.2
                                           x3 = 2          y3 = 2
                                           x4 = 1.5        y4 = 1.9
                                           x5 = 4          y5 = 3.1

[Figure: scatter plot of the five datapoints with a fitted line through the origin; a unit run along x corresponds to a rise of w]
Linear regression assumes that the expected value of
the output given an input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.
Copyright © 2001, 2003, Andrew W. Moore                              3




      1-parameter linear regression
Assume that the data is formed by
                 yi = wxi + noisei

where…
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²

p(y|w,x) has a normal distribution with
• mean wx
• variance σ²
Copyright © 2001, 2003, Andrew W. Moore                              4
Bayesian Linear Regression
                p(y|w,x) = Normal(mean wx, var σ²)

We have a set of datapoints (x1,y1) (x2,y2) … (xn,yn)
which are EVIDENCE about w.

We want to infer w from the data.
             p(w|x1, x2, x3,…xn, y1, y2…yn)
•You can use BAYES rule to work out a posterior
distribution for w given the data.
•Or you could do Maximum Likelihood Estimation

Copyright © 2001, 2003, Andrew W. Moore                        5




  Maximum likelihood estimation of w

Asks the question:
“For which value of w is this data most likely to have
  happened?”
      <=>
For what w is
  p(y1, y2…yn |x1, x2, x3,…xn, w) maximized?
      <=>
For what w is

    ∏_{i=1}^n p(y_i | w, x_i)   maximized?
Copyright © 2001, 2003, Andrew W. Moore                        6
For what w is

    ∏_{i=1}^n p(y_i | w, x_i)   maximized?

For what w is

    ∏_{i=1}^n exp( −½ ((y_i − w x_i)/σ)² )   maximized?

For what w is

    Σ_{i=1}^n −½ ((y_i − w x_i)/σ)²   maximized?

For what w is

    Σ_{i=1}^n (y_i − w x_i)²   minimized?

 Copyright © 2001, 2003, Andrew W. Moore                                                   7




                               Linear Regression


The maximum likelihood w is the one that minimizes the sum-of-squares of residuals:

    E(w) = Σ_i (y_i − w x_i)²
         = (Σ_i y_i²) − 2 (Σ_i x_i y_i) w + (Σ_i x_i²) w²

[Figure: E(w) plotted against w — a parabola]

We want to minimize a quadratic function of w.
 Copyright © 2001, 2003, Andrew W. Moore                                                   8
Linear Regression
Easy to show the sum of squares is minimized when

    w = (Σ x_i y_i) / (Σ x_i²)

The maximum likelihood model is Out(x) = wx.

We can use it for prediction.
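
A minimal numerical sketch of this estimate (my own illustration, not from the slides; assumes numpy and re-uses the five-point dataset from the earlier slide):

    import numpy as np

    # Dataset from the earlier slide
    x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
    y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

    # Maximum-likelihood slope for the no-intercept model Out(x) = w x
    w = np.sum(x * y) / np.sum(x ** 2)

    print("w =", w)                          # roughly 0.83 for this data
    print("prediction at x = 2.5:", w * 2.5)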
Copyright © 2001, 2003, Andrew W. Moore                                                    9




                          Linear Regression
Easy to show the sum of squares is minimized when

    w = (Σ x_i y_i) / (Σ x_i²)

The maximum likelihood model is Out(x) = wx.

We can use it for prediction.

Note: In Bayesian stats you'd have ended up with a prob dist of w [figure: p(w) against w], and predictions would have given a prob dist of expected output.

Often useful to know your confidence. Max likelihood can give some kinds of confidence too.
Copyright © 2001, 2003, Andrew W. Moore                                                10
Multivariate
                     Linear
                   Regression
Copyright © 2001, 2003, Andrew W. Moore                                           11




                 Multivariate Regression
What if the inputs are vectors?

[Figure: a 2-d input example — datapoints in the (x1, x2) plane, each labelled with its output value]

  Dataset has form

    x1    y1
    x2    y2
    x3    y3
    :     :
    xR    yR
Copyright © 2001, 2003, Andrew W. Moore                                           12
Multivariate Regression
Write matrix X and vector Y thus:

    X = [ .....x_1..... ]   =  [ x_11  x_12  ...  x_1m ]
        [ .....x_2..... ]      [ x_21  x_22  ...  x_2m ]
        [       :       ]      [   :     :          :  ]
        [ .....x_R..... ]      [ x_R1  x_R2  ...  x_Rm ]

    Y = [ y_1 ]
        [ y_2 ]
        [  :  ]
        [ y_R ]

(There are R datapoints. Each input has m components.)

The linear regression model assumes a vector w such that

    Out(x) = wᵀx = w_1 x[1] + w_2 x[2] + … + w_m x[m]

The max. likelihood w is  w = (XᵀX)⁻¹(XᵀY)

Copyright © 2001, 2003, Andrew W. Moore                                    13




                 Multivariate Regression
Write matrix X and vector Y thus:

    X = [ .....x_1..... ]   =  [ x_11  x_12  ...  x_1m ]
        [ .....x_2..... ]      [ x_21  x_22  ...  x_2m ]
        [       :       ]      [   :     :          :  ]
        [ .....x_R..... ]      [ x_R1  x_R2  ...  x_Rm ]

    Y = [ y_1 ]
        [ y_2 ]
        [  :  ]
        [ y_R ]

(There are R datapoints. Each input has m components.)

The linear regression model assumes a vector w such that

    Out(x) = wᵀx = w_1 x[1] + w_2 x[2] + … + w_m x[m]

The max. likelihood w is  w = (XᵀX)⁻¹(XᵀY)

                                                   IMPORTANT EXERCISE: PROVE IT !!!!!

Copyright © 2001, 2003, Andrew W. Moore                                    14
Multivariate Regression (con’t)

The max. likelihood w is  w = (XᵀX)⁻¹(XᵀY)

XᵀX is an m × m matrix: its (i, j)'th element is  Σ_{k=1}^R x_ki x_kj

XᵀY is an m-element vector: its i'th element is  Σ_{k=1}^R x_ki y_k
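
A small sketch of the matrix solution (my own illustration; assumes numpy, and the toy dataset is made up):

    import numpy as np

    # R = 5 datapoints, m = 2 input components (made-up data)
    X = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.0],
                  [4.0, 3.0],
                  [5.0, 2.5]])
    Y = np.array([3.1, 2.2, 4.0, 7.1, 7.6])

    # Maximum-likelihood weights: w = (X^T X)^-1 (X^T Y)
    # (solving the linear system is preferable to forming the inverse explicitly)
    w = np.linalg.solve(X.T @ X, X.T @ Y)

    print("w =", w)
    print("prediction for x = (2, 1):", np.array([2.0, 1.0]) @ w)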


Copyright © 2001, 2003, Andrew W. Moore                       15




         Constant Term
           in Linear
          Regression
Copyright © 2001, 2003, Andrew W. Moore                       16
What about a constant term?
We may expect
linear data that does
not go through the
origin.

Statisticians and
Neural Net Folks all
agree on a simple
obvious hack.

Can you guess??
Copyright © 2001, 2003, Andrew W. Moore                                               17




                         The constant term
• The trick is to create a fake input “X0” that
  always takes the value 1

      X1   X2   Y                                               X0   X1   X2   Y
      2    4    16                                              1    2    4    16
      3    4    17                                              1    3    4    17
      5    5    20                                              1    5    5    20

    Before:                                                     After:
    Y = w1 X1 + w2 X2                                           Y = w0 X0 + w1 X1 + w2 X2
    …has to be a poor model                                       = w0 + w1 X1 + w2 X2
                                                                …has a fine constant term

    (In this example, you should be able to see the MLE w0, w1 and w2 by inspection.)
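
A sketch of the hack in code (my own illustration, assuming numpy), using the little table above; the solver recovers the coefficients you can see by inspection:

    import numpy as np

    # Original inputs and outputs from the table
    X = np.array([[2.0, 4.0],
                  [3.0, 4.0],
                  [5.0, 5.0]])
    Y = np.array([16.0, 17.0, 20.0])

    # Prepend the fake input X0 that always takes the value 1
    X_aug = np.column_stack([np.ones(len(X)), X])

    # Ordinary max-likelihood solution; w[0] is now the constant term w0
    w = np.linalg.lstsq(X_aug, Y, rcond=None)[0]
    print("w0, w1, w2 =", w)   # w0 = 10, w1 = 1, w2 = 1 for this table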
Copyright © 2001, 2003, Andrew W. Moore                                               18
          Heteroscedasticity...

          Linear
      Regression with
       varying noise
Copyright © 2001, 2003, Andrew W. Moore                                              19




           Regression with varying noise
• Suppose you know the variance of the noise that was added to each datapoint.

    x_i    y_i    σ_i²
    ½      ½      4
    1      1      1
    2      1      1/4
    2      3      4
    3      2      1/4

[Figure: the five datapoints plotted, each drawn with a bar indicating its noise level σ_i]

           Assume    y_i ~ N(w x_i, σ_i²)

What's the MLE estimate of w?
Copyright © 2001, 2003, Andrew W. Moore                                              20
MLE estimation with varying noise
argmax_w  log p(y_1, y_2, …, y_R | x_1, x_2, …, x_R, σ_1², σ_2², …, σ_R², w)

  = argmin_w  Σ_{i=1}^R (y_i − w x_i)² / σ_i²
        [Assuming independence among noise and then plugging in equation for Gaussian and simplifying]

  = the w such that  Σ_{i=1}^R x_i (y_i − w x_i) / σ_i² = 0
        [Setting dLL/dw equal to zero]

  = ( Σ_{i=1}^R x_i y_i / σ_i² )  /  ( Σ_{i=1}^R x_i² / σ_i² )
        [Trivial algebra]
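
A sketch of this weighted estimate in code (my own illustration, assuming numpy), using the table above:

    import numpy as np

    x      = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
    y      = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
    sigma2 = np.array([4.0, 1.0, 0.25, 4.0, 0.25])   # per-point noise variances

    # MLE for y_i ~ N(w x_i, sigma_i^2):
    #   w = sum(x_i y_i / sigma_i^2) / sum(x_i^2 / sigma_i^2)
    w = np.sum(x * y / sigma2) / np.sum(x ** 2 / sigma2)
    print("weighted MLE w =", w)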
Copyright © 2001, 2003, Andrew W. Moore                                                                21




             This is Weighted Regression
• We are asking to minimize the weighted sum of squares

    argmin_w  Σ_{i=1}^R (y_i − w x_i)² / σ_i²

[Figure: the same five datapoints with their σ_i bars]

where the weight for the i'th datapoint is  1/σ_i²
Copyright © 2001, 2003, Andrew W. Moore                                                                22
Non-linear
                     Regression
Copyright © 2001, 2003, Andrew W. Moore                                            23




                      Non-linear Regression
• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

    x_i    y_i
    ½      ½
    1      2.5
    2      3
    3      2
    3      3

[Figure: the five datapoints plotted]

  Assume    y_i ~ N(√(w + x_i), σ²)

What's the MLE estimate of w?
Copyright © 2001, 2003, Andrew W. Moore                                            24
Non-linear MLE estimation
argmax_w  log p(y_1, y_2, …, y_R | x_1, x_2, …, x_R, σ, w)

  = argmin_w  Σ_{i=1}^R (y_i − √(w + x_i))²
        [Assuming i.i.d. and then plugging in equation for Gaussian and simplifying]

  = the w such that  Σ_{i=1}^R (y_i − √(w + x_i)) / √(w + x_i) = 0
        [Setting dLL/dw equal to zero]

Copyright © 2001, 2003, Andrew W. Moore                                                                  25




                Non-linear MLE estimation
argmax_w  log p(y_1, y_2, …, y_R | x_1, x_2, …, x_R, σ, w)

  = argmin_w  Σ_{i=1}^R (y_i − √(w + x_i))²
        [Assuming i.i.d. and then plugging in equation for Gaussian and simplifying]

  = the w such that  Σ_{i=1}^R (y_i − √(w + x_i)) / √(w + x_i) = 0
        [Setting dLL/dw equal to zero]

We're down the algebraic toilet.

So guess what we do?
Copyright © 2001, 2003, Andrew W. Moore                                                                  26
Non-linear MLE estimation
argmax_w  log p(y_1, y_2, …, y_R | x_1, x_2, …, x_R, σ, w)

  = argmin_w  Σ_{i=1}^R (y_i − √(w + x_i))²

  = the w such that  Σ_{i=1}^R (y_i − √(w + x_i)) / √(w + x_i) = 0

We're down the algebraic toilet. So guess what we do?

Common (but not only) approach: Numerical Solutions:
• Line Search
• Simulated Annealing
• Gradient Descent (sketched below)
• Conjugate Gradient
• Levenberg Marquardt
• Newton's Method

Also, special purpose statistical-optimization-specific tricks such as E.M. (See Gaussian Mixtures lecture for introduction)
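
As one concrete option, here is a plain gradient-descent sketch for this particular model (my own illustration, not from the slides; assumes numpy, the five-point dataset above, and the model y_i ~ N(√(w + x_i), σ²)):

    import numpy as np

    x = np.array([0.5, 1.0, 2.0, 3.0, 3.0])
    y = np.array([0.5, 2.5, 3.0, 2.0, 3.0])

    def loss(w):
        # Negative log-likelihood up to constants: the sum of squared residuals
        return np.sum((y - np.sqrt(w + x)) ** 2)

    def grad(w):
        # d/dw (y_i - sqrt(w+x_i))^2 = -(y_i - sqrt(w+x_i)) / sqrt(w+x_i)
        r = np.sqrt(w + x)
        return -np.sum((y - r) / r)

    w, lr = 1.0, 0.05            # initial guess and learning rate
    for _ in range(2000):
        w -= lr * grad(w)
    print("w =", w, " loss =", loss(w))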
Copyright © 2001, 2003, Andrew W. Moore                                                           27




                      Polynomial
                      Regression
Copyright © 2001, 2003, Andrew W. Moore                                                           28
Polynomial Regression
So far we’ve mainly been dealing with linear regression
X1 X2 Y               X= 3 2        y= 7

3      2       7                               1       1         3
1      1       3                               :       :         :
:      :       :                           x1=(3,2)..          y1=7..
           Z= 1                               y=
                        3       2                  7
                1       1       1                  3
                :               :                                    β=(ZTZ)-1(ZTy)
                                                   :
           z1=(1,3,2)..             y1=7..
                                                               yest = β0+ β1 x1+ β2 x2
           zk=(1,xk1,xk2)
Copyright © 2001, 2003, Andrew W. Moore                                                  29




                    Quadratic Regression
It’s trivial to do linear fits of fixed nonlinear basis functions
X1 X2 Y                     X= 3 2        y= 7

3      2       7                               1       1         3
1      1       3                               :       :         :
:  :           :                           x1=(3,2)..          y1=7..
                                                       y=
   1           3       2       9          6    4
Z=
                                                           7
   1           1       1       1          1    1
                                                                     β=(ZTZ)-1(ZTy)
                                                           3
       :                                       :
                                                           :
                                                               yest = β0+ β1 x1+ β2 x2+
z=(1 , x1, x2 , x1                  2,    x1x2,x22,)
                                                               β3 x12 + β4 x1x2 + β5 x22

Copyright © 2001, 2003, Andrew W. Moore                                                  30
Quadratic Regression
It’s trivial to do linear fitsof afixed nonlinear basisterm.
          Each component of z vector is called a functions
X1 X2 Each column of X= 3 matrix is called a term column
            Y                             y= 7
                            the Z 2
3      2 How many terms in 1 quadratic regression with m
           7               a1           3
       1 inputs?
1          3               ::           :
   : •1 constant term 1=(3,2)..
:      :             x             y1=7..
   1 •m linear terms6 4 y=
       329
Z=
                                7
   1 •(m+1)-choose-2 =1
       1 1 1 1 m(m+1)/2 quadratic terms
                                        β=(ZT -1 T
     (m+2)-choose-2 terms in total = O(m2) Z) (Z y)
                                3
   :                    :
                                :
                                     y = β0+ β1 x1+ β2 x2+
                                    est
z=(1 , x1, x2 , x12, x1x2,x22,) -1 T
      Note that solving β=(ZTZ) (Z y) is thusβO(m6)+ β x 2
                                     β3 x12 + 4 x1x2  52
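
A sketch of the quadratic-term construction (my own illustration; assumes numpy, and the rows beyond the two shown above are made up):

    import numpy as np

    X = np.array([[3.0, 2.0],
                  [1.0, 1.0],
                  [2.0, 3.0],
                  [4.0, 1.0]])
    y = np.array([7.0, 3.0, 9.0, 8.0])

    def quadratic_terms(x1, x2):
        # z = (1, x1, x2, x1^2, x1*x2, x2^2): 1 constant + m linear + m(m+1)/2 quadratic terms
        return [1.0, x1, x2, x1 ** 2, x1 * x2, x2 ** 2]

    Z = np.array([quadratic_terms(x1, x2) for x1, x2 in X])

    # beta = (Z^T Z)^-1 (Z^T y); lstsq also copes when Z^T Z is singular
    # (here there are only 4 records for 6 terms)
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    print("beta =", beta)
    print("fitted values:", Z @ beta)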

Copyright © 2001, 2003, Andrew W. Moore                                             31




Qth-degree polynomial Regression
    X1  X2  Y
    3   2   7        X = [ 3  2 ]      y = [ 7 ]
    1   1   3            [ 1  1 ]          [ 3 ]
    :   :   :            [ :  : ]          [ : ]

    x_1 = (3,2), …                     y_1 = 7, …

    Z = [ 1  3  2  9  6  … ]    y = [ 7 ]
        [ 1  1  1  1  1  … ]        [ 3 ]
        [ :                 ]       [ : ]

    z = (all products of powers of inputs in which the sum of powers is Q or less)

    β = (ZᵀZ)⁻¹(Zᵀy)
    y^est = β_0 + β_1 x_1 + …

Copyright © 2001, 2003, Andrew W. Moore                                             32
m inputs, degree Q: how many terms?
= the number of unique terms of the form  x_1^{q_1} x_2^{q_2} … x_m^{q_m}  where  Σ_{i=1}^m q_i ≤ Q

= the number of unique terms of the form  1^{q_0} x_1^{q_1} x_2^{q_2} … x_m^{q_m}  where  Σ_{i=0}^m q_i = Q

= the number of lists of non-negative integers [q_0, q_1, q_2, …, q_m] in which Σ q_i = Q

= the number of ways of placing Q red disks on a row of squares of length Q+m  =  (Q+m)-choose-Q

[Figure: Q=11, m=4 example — a row of disks and dividers encoding q_0=2, q_1=2, q_2=0, q_3=4, q_4=3]
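
A quick check of this count (my own snippet), comparing the closed form (Q+m)-choose-Q with a direct enumeration of exponent tuples:

    from math import comb
    from itertools import product

    def n_terms_direct(m, Q):
        # Count exponent tuples (q1,...,qm) with q1 + ... + qm <= Q
        return sum(1 for qs in product(range(Q + 1), repeat=m) if sum(qs) <= Q)

    for m, Q in [(2, 2), (3, 5), (4, 11)]:
        assert n_terms_direct(m, Q) == comb(Q + m, Q)
        print("m =", m, "Q =", Q, "terms =", comb(Q + m, Q))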
Copyright © 2001, 2003, Andrew W. Moore                                  33




                  Radial Basis
                   Functions
Copyright © 2001, 2003, Andrew W. Moore                                  34
Radial Basis Functions (RBFs)
    X1  X2  Y
    3   2   7        X = [ 3  2 ]      y = [ 7 ]
    1   1   3            [ 1  1 ]          [ 3 ]
    :   :   :            [ :  : ]          [ : ]

    x_1 = (3,2), …                     y_1 = 7, …

    Z = [ …  …  …  …  …  … ]    y = [ 7 ]
        [ …  …  …  …  …  … ]        [ 3 ]
        [ …  …  …  …  …  … ]        [ : ]

    z = (list of radial basis function evaluations)

    β = (ZᵀZ)⁻¹(Zᵀy)
    y^est = β_0 + β_1 x_1 + …

Copyright © 2001, 2003, Andrew W. Moore                                          35




                                          1-d RBFs

[Figure: three bump-shaped basis functions along the x-axis, centred at c1, c2, c3]

    y^est = β_1 φ_1(x) + β_2 φ_2(x) + β_3 φ_3(x)
    where
    φ_i(x) = KernelFunction( |x − c_i| / KW )




Copyright © 2001, 2003, Andrew W. Moore                                          36
Example

[Figure: the same three basis functions, scaled by the coefficients below]

    y^est = 2 φ_1(x) + 0.05 φ_2(x) + 0.5 φ_3(x)
    where
    φ_i(x) = KernelFunction( |x − c_i| / KW )




Copyright © 2001, 2003, Andrew W. Moore                                                  37




         RBFs with Linear Regression
All c_i's are held constant (initialized randomly or on a grid in m-dimensional input space).

KW also held constant (initialized to be large enough that there's decent overlap between basis functions*).

*Usually much better than the crappy overlap on my diagram

[Figure: three basis functions centred at c1, c2, c3]

    y^est = 2 φ_1(x) + 0.05 φ_2(x) + 0.5 φ_3(x)
    where
    φ_i(x) = KernelFunction( |x − c_i| / KW )




Copyright © 2001, 2003, Andrew W. Moore                                                  38
RBFs with Linear Regression
All c_i's are held constant (initialized randomly or on a grid in m-dimensional input space).

KW also held constant (initialized to be large enough that there's decent overlap between basis functions*).

*Usually much better than the crappy overlap on my diagram

[Figure: three basis functions centred at c1, c2, c3]

    y^est = 2 φ_1(x) + 0.05 φ_2(x) + 0.5 φ_3(x)
    where
    φ_i(x) = KernelFunction( |x − c_i| / KW )

Then, given Q basis functions, define the matrix Z such that Z_kj = KernelFunction( |x_k − c_j| / KW ), where x_k is the kth vector of inputs.

And as before, β = (ZᵀZ)⁻¹(Zᵀy)
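
A sketch of the fixed-center RBF fit (my own illustration; assumes numpy, 1-d inputs, a Gaussian choice of KernelFunction, and made-up data and centers):

    import numpy as np

    x = np.array([0.1, 0.5, 1.2, 2.0, 2.8, 3.5])   # 1-d inputs
    y = np.array([1.0, 1.4, 2.3, 2.0, 0.8, 0.3])

    centers = np.array([0.5, 2.0, 3.5])             # c_j's fixed on a grid
    KW = 1.5                                        # kernel width, held constant

    def kernel(d):
        # One common choice of KernelFunction: a Gaussian bump
        return np.exp(-0.5 * d ** 2)

    # Z_kj = KernelFunction(|x_k - c_j| / KW)
    Z = kernel(np.abs(x[:, None] - centers[None, :]) / KW)

    # As before, beta = (Z^T Z)^-1 (Z^T y)
    beta = np.linalg.solve(Z.T @ Z, Z.T @ y)
    print("beta =", beta)
    print("fitted values:", Z @ beta)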

Copyright © 2001, 2003, Andrew W. Moore                                             39




  RBFs with NonLinear Regression

Allow the c_i's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).

KW allowed to adapt to the data. (Some folks even let each basis function have its own KW_j, permitting fine detail in dense regions of input space.)

[Figure: three basis functions centred at c1, c2, c3]

    y^est = 2 φ_1(x) + 0.05 φ_2(x) + 0.5 φ_3(x)
    where
    φ_i(x) = KernelFunction( |x − c_i| / KW )

But how do we now find all the β_j's, c_i's and KW?



Copyright © 2001, 2003, Andrew W. Moore                                             40
RBFs with NonLinear Regression

Allow the c_i's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).

KW allowed to adapt to the data. (Some folks even let each basis function have its own KW_j, permitting fine detail in dense regions of input space.)

[Figure: three basis functions centred at c1, c2, c3]

    y^est = 2 φ_1(x) + 0.05 φ_2(x) + 0.5 φ_3(x)
    where
    φ_i(x) = KernelFunction( |x − c_i| / KW )

But how do we now find all the β_j's, c_i's and KW?

                                                                Answer: Gradient Descent

Copyright © 2001, 2003, Andrew W. Moore                                                     41




  RBFs with NonLinear Regression

Allow the c_i's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).

KW allowed to adapt to the data. (Some folks even let each basis function have its own KW_j, permitting fine detail in dense regions of input space.)

[Figure: three basis functions centred at c1, c2, c3]

    y^est = 2 φ_1(x) + 0.05 φ_2(x) + 0.5 φ_3(x)
    where
    φ_i(x) = KernelFunction( |x − c_i| / KW )

But how do we now find all the β_j's, c_i's and KW?

                                                                Answer: Gradient Descent

(But I'd like to see, or hope someone's already done, a hybrid, where the c_i's and KW are updated with gradient descent while the β_j's use matrix inversion.)

Copyright © 2001, 2003, Andrew W. Moore                                                     42
Radial Basis Functions in 2-d
Two inputs. Outputs (heights sticking out of page) not shown.

[Figure: the (x1, x2) input plane; one basis-function center is marked, with its sphere of significant influence drawn around it]
 Copyright © 2001, 2003, Andrew W. Moore                 43




                          Happy RBFs in 2-d
Blue dots denote coordinates of input vectors.

[Figure: the (x1, x2) plane; the centers and their spheres of significant influence cover the blue input points well]
 Copyright © 2001, 2003, Andrew W. Moore                 44
Crabby RBFs in 2-d                         What's the problem in this example?

Blue dots denote coordinates of input vectors.

[Figure: the (x1, x2) plane with the blue input points, the centers and their spheres of significant influence]
 Copyright © 2001, 2003, Andrew W. Moore                     45




More crabby RBFs                           And what's the problem in this example?

Blue dots denote coordinates of input vectors.

[Figure: the (x1, x2) plane with the blue input points, the centers and their spheres of significant influence]
 Copyright © 2001, 2003, Andrew W. Moore                     46
Hopeless!                                  Even before seeing the data, you should understand that this is a disaster!

[Figure: the (x1, x2) plane with the centers and their spheres of significant influence]
Copyright © 2001, 2003, Andrew W. Moore                                        47




Unhappy                                    Even before seeing the data, you should understand that this isn't good either..

[Figure: the (x1, x2) plane with the centers and their spheres of significant influence]
Copyright © 2001, 2003, Andrew W. Moore                                        48
Robust
                     Regression
Copyright © 2001, 2003, Andrew W. Moore       49




                          Robust Regression




[Figure: scatter plot of y against x]
Copyright © 2001, 2003, Andrew W. Moore       50
Robust Regression
                                          This is the best fit that
                                          Quadratic Regression can
                                          manage




[Figure: the best quadratic fit drawn through the scatter plot of y against x]
Copyright © 2001, 2003, Andrew W. Moore                               51




                          Robust Regression
                                          …but this is what we’d
                                          probably prefer




[Figure: the fit we'd probably prefer, drawn through the scatter plot of y against x]
Copyright © 2001, 2003, Andrew W. Moore                               52
LOESS-based Robust Regression
                                          After the initial fit, score
                                          each datapoint according to
                                          how well it’s fitted…




[Figure: the scatter plot, with one point annotated "You are a very good datapoint."]
Copyright © 2001, 2003, Andrew W. Moore                             53




   LOESS-based Robust Regression
                                          After the initial fit, score
                                          each datapoint according to
                                          how well it’s fitted…




[Figure: the scatter plot, with points annotated "You are a very good datapoint." and "You are not too shabby."]
Copyright © 2001, 2003, Andrew W. Moore                             54
LOESS-based Robust Regression
                                                      After the initial fit, score
                                                      each datapoint according to
                                                      how well it’s fitted…
[Figure: the scatter plot, with points annotated "You are a very good datapoint.", "You are not too shabby." and "But you are pathetic."]
Copyright © 2001, 2003, Andrew W. Moore                                             55




                          Robust Regression
[Figure: the scatter plot of y against x]

For k = 1 to R…
• Let (x_k, y_k) be the kth datapoint
• Let y^est_k be the predicted value of y_k
• Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

    w_k = KernelFn( [y_k − y^est_k]² )



Copyright © 2001, 2003, Andrew W. Moore                                             56
Robust Regression
[Figure: the scatter plot of y against x]

For k = 1 to R…
• Let (x_k, y_k) be the kth datapoint
• Let y^est_k be the predicted value of y_k
• Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

    w_k = KernelFn( [y_k − y^est_k]² )

Then redo the regression using weighted datapoints.
Weighted regression was described earlier in the "varying noise" section, and is also discussed in the "Memory-based Learning" Lecture.

Guess what happens next?



Copyright © 2001, 2003, Andrew W. Moore                                              57




                          Robust Regression
[Figure: the scatter plot of y against x]

For k = 1 to R…
• Let (x_k, y_k) be the kth datapoint
• Let y^est_k be the predicted value of y_k
• Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

    w_k = KernelFn( [y_k − y^est_k]² )

Then redo the regression using weighted datapoints.
I taught you how to do this in the "Instance-based" lecture (only then the weights depended on distance in input-space).

Repeat whole thing until converged!
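
A sketch of the whole reweight-and-refit loop (my own illustration; assumes numpy, a quadratic model, a Gaussian choice of KernelFn, and made-up data containing one outlier):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.0, 1.8, 3.2, 4.1, 9.5, 5.2])       # the 9.5 is an outlier

    Z = np.column_stack([np.ones_like(x), x, x ** 2])  # quadratic term columns
    w = np.ones_like(y)                                 # datapoint weights

    for _ in range(10):                                 # repeat until (roughly) converged
        # Weighted regression: scale each row of the system by sqrt(w_k)
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)[0]
        y_est = Z @ beta
        # w_k = KernelFn([y_k - y_est_k]^2): large if the point fits well, small if badly
        w = np.exp(-0.5 * (y - y_est) ** 2)

    print("beta =", beta)
    print("final weights:", np.round(w, 3))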

Copyright © 2001, 2003, Andrew W. Moore                                              58
Robust Regression---what we’re
              doing
  What regular regression does:
  Assume y_k was originally generated using the following recipe:

      y_k = β_0 + β_1 x_k + β_2 x_k² + N(0, σ²)

  Computational task is to find the Maximum Likelihood β_0, β_1 and β_2.




Copyright © 2001, 2003, Andrew W. Moore              59




  Robust Regression---what we’re
              doing
  What LOESS robust regression does:
  Assume y_k was originally generated using the following recipe:

      With probability p:
          y_k = β_0 + β_1 x_k + β_2 x_k² + N(0, σ²)
      But otherwise:
          y_k ~ N(µ, σ_huge²)

  Computational task is to find the Maximum Likelihood β_0, β_1, β_2, p, µ and σ_huge.

Copyright © 2001, 2003, Andrew W. Moore              60
Robust Regression---what we’re
              doing
  What LOESS robust regression does:
  Assume y_k was originally generated using the following recipe:

      With probability p:
          y_k = β_0 + β_1 x_k + β_2 x_k² + N(0, σ²)
      But otherwise:
          y_k ~ N(µ, σ_huge²)

  Computational task is to find the Maximum Likelihood β_0, β_1, β_2, p, µ and σ_huge.

  Mysteriously, the reweighting procedure does this computation for us.

  Your first glimpse of two spectacular letters: E.M.

Copyright © 2001, 2003, Andrew W. Moore                                  61




                     Regression
                       Trees
Copyright © 2001, 2003, Andrew W. Moore                                  62
Regression Trees
• “Decision trees for regression”




Copyright © 2001, 2003, Andrew W. Moore                               63




                    A regression tree leaf


                               Predict age = 47


                                             Mean age of records
                                            matching this leaf node




Copyright © 2001, 2003, Andrew W. Moore                               64
A one-split regression tree


                                          Gender?
                              Female                     Male

              Predict age = 39                  Predict age = 36




Copyright © 2001, 2003, Andrew W. Moore                            65




 Choosing the attribute to split on
   Gender   Rich?   Num. Children   Num. Beany Babies   Age
   Female   No      2               1                   38
   Male     No      0               0                   24
   Male     Yes     0               5+                  72
   :        :       :               :                   :

                                                • We can’t use
                                                  information gain.
                                                • What should we use?



Copyright © 2001, 2003, Andrew W. Moore                            66
Choosing the attribute to split on
   Gender   Rich?   Num. Children   Num. Beany Babies   Age
   Female   No      2               1                   38
   Male     No      0               0                   24
   Male     Yes     0               5+                  72
   :        :       :               :                   :

MSE(Y|X) = The expected squared error if we must predict a record's Y value given only knowledge of the record's X value.

If we're told x = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x = j. Call this mean quantity µ_{y|x=j}.

Then…

    MSE(Y|X) = (1/R) Σ_{j=1}^{N_X}  Σ_{k such that x_k = j}  (y_k − µ_{y|x=j})²
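
A sketch of this split score in code (my own illustration; assumes numpy, and the last two records are made up to extend the table above):

    import numpy as np

    gender = np.array(["Female", "Male", "Male", "Female", "Male"])
    age    = np.array([38.0, 24.0, 72.0, 31.0, 29.0])   # Y values

    def mse_given(x, y):
        # MSE(Y|X) = (1/R) * sum over values j of sum over records with x=j of (y_k - mu_{y|x=j})^2
        total = 0.0
        for j in np.unique(x):
            yj = y[x == j]
            total += np.sum((yj - yj.mean()) ** 2)
        return total / len(y)

    print("MSE(Age | Gender) =", mse_given(gender, age))
    # Greedy tree-growing would compute this for every candidate attribute
    # and split on the one giving the smallest MSE(Y|X).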
 Copyright © 2001, 2003, Andrew W. Moore                                       67




  Choosing the attribute to split on
   Gender   Rich?   Num. Children   Num. Beany Babies   Age
   Female   No      2               1                   38
   Male     No      0               0                   24
   Male     Yes     0               5+                  72
   :        :       :               :                   :

Regression tree attribute selection: greedily choose the attribute that minimizes MSE(Y|X).

Guess what we do about real-valued inputs?

Guess how we prevent overfitting.

MSE(Y|X) = The expected squared error if we must predict a record's Y value given only knowledge of the record's X value.

If we're told x = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x = j. Call this mean quantity µ_{y|x=j}.

Then…

    MSE(Y|X) = (1/R) Σ_{j=1}^{N_X}  Σ_{k such that x_k = j}  (y_k − µ_{y|x=j})²
 Copyright © 2001, 2003, Andrew W. Moore                                       68
Pruning Decision
               …property-owner = Yes
                                                             Do I deserve
                                                               to live?
                                          Gender?
                              Female                  Male

                 Predict age = 39                   Predict age = 36
 # property-owning females = 56712             # property-owning males = 55800
 Mean age among POFs = 39                      Mean age among POMs = 36
 Age std dev among POFs = 12                   Age std dev among POMs = 11.5

           Use a standard Chi-squared test of the null-
           hypothesis “these two populations have the same
           mean” and Bob’s your uncle.

Copyright © 2001, 2003, Andrew W. Moore                                          69




                Linear Regression Trees
               …property-owner = Yes
                                                              Also known as
                                                              “Model Trees”
                                          Gender?
                              Female                  Male

      Predict age =                             Predict age =
      26 + 6 * NumChildren -                    24 + 7 * NumChildren -
      2 * YearsEducation                        2.5 * YearsEducation

Leaves contain linear                        Split attribute chosen to minimize
functions (trained using                     MSE of regressed children.
linear regression on all
                                             Pruning with a different Chi-
records matching that leaf)
                                             squared
Copyright © 2001, 2003, Andrew W. Moore                                          70
Linear Regression Trees
               …property-owner = Yes
                                                               Also known as
                                                               "Model Trees"
                                           Gender?
                               Female                  Male

       Predict age =                             Predict age =
       26 + 6 * NumChildren −                    24 + 7 * NumChildren −
       2 * YearsEducation                        2.5 * YearsEducation

Leaves contain linear functions (trained using linear regression on all records matching that leaf).

Split attribute chosen to minimize MSE of regressed children.

Pruning with a different Chi-squared

Detail: You typically ignore any categorical attribute that has been tested higher up in the tree during the regression. But use all untested attributes, and use real-valued attributes even if they've been tested above.
Copyright © 2001, 2003, Andrew W. Moore                                71




                Test your understanding
   Assuming regular regression trees, can you sketch a
   graph of the fitted function yest(x) over this diagram?




               y



                             x
Copyright © 2001, 2003, Andrew W. Moore                                72
Test your understanding
   Assuming linear regression trees, can you sketch a graph
   of the fitted function yest(x) over this diagram?




               y



                             x
Copyright © 2001, 2003, Andrew W. Moore                       73




                Multilinear
               Interpolation
Copyright © 2001, 2003, Andrew W. Moore                       74
Multilinear Interpolation
   Consider this dataset. Suppose we wanted to create a
   continuous and piecewise linear fit to the data




               y



                              x
Copyright © 2001, 2003, Andrew W. Moore                       75




                   Multilinear Interpolation
   Create a set of knot points: selected X-coordinates
   (usually equally spaced) that cover the data




               y


                         q1               q2   q3   q4   q5
                              x
Copyright © 2001, 2003, Andrew W. Moore                       76
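
As a small illustration of the knot-creation step, the sketch below lays down equally spaced knots covering a made-up 1-d dataset; the input values and the number of knots are arbitrary assumptions.

```python
# A small sketch of the knot-creation step: equally spaced X-coordinates that
# cover the data. The inputs and the number of knots are made-up assumptions.
import numpy as np

x = np.array([0.3, 0.7, 1.1, 1.8, 2.4, 3.0, 3.9, 4.6])   # hypothetical 1-d inputs
n_knots = 5
knots = np.linspace(x.min(), x.max(), n_knots)            # q1 ... q5
w = knots[1] - knots[0]                                   # spacing between adjacent knots
print(knots, w)
```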
Multilinear Interpolation
   We are going to assume the data was generated by a
   noisy version of a function that can only bend at the
   knots. Here are 3 examples (none fits the data well)




               y


                         q1               q2        q3       q4        q5
                              x
Copyright © 2001, 2003, Andrew W. Moore                                             77




                How to find the best fit?
   Idea 1: Simply perform a separate regression in each
   segment for each part of the curve
                                               What’s the problem with this idea?




               y


                         q1               q2        q3       q4        q5
                              x
Copyright © 2001, 2003, Andrew W. Moore                                             78
How to find the best fit?
Let’s look at what goes on in the red segment
                  y^est(x) = ((q3 − x) / w) · h2 + ((x − q2) / w) · h3        where w = q3 − q2



                     h2




               y
                     h3


                          q1              q2    q3           q4            q5
                               x
Copyright © 2001, 2003, Andrew W. Moore                                                 79




                How to find the best fit?
In the red segment…                              y^est(x) = h2 · φ2(x) + h3 · φ3(x)

                              where φ2(x) = 1 − (x − q2) / w ,   φ3(x) = 1 − (q3 − x) / w



                     h2




               y
                                                                                φ2(x)
                     h3


                          q1              q2    q3           q4            q5
                               x
Copyright © 2001, 2003, Andrew W. Moore                                                 80
How to find the best fit?
In the red segment…                              y^est(x) = h2 · φ2(x) + h3 · φ3(x)

                              where φ2(x) = 1 − (x − q2) / w ,   φ3(x) = 1 − (q3 − x) / w



                     h2




               y
                                                                                 φ2(x)
                     h3


                          q1              q2    q3            q4            q5
                                                                    φ3(x)
                               x
Copyright © 2001, 2003, Andrew W. Moore                                                     81




                How to find the best fit?
In the red segment…                              y^est(x) = h2 · φ2(x) + h3 · φ3(x)

                          where φ2(x) = 1 − |x − q2| / w ,   φ3(x) = 1 − |x − q3| / w



                     h2




               y
                                                                                 φ2(x)
                     h3


                          q1              q2    q3            q4            q5
                                                                    φ3(x)
                               x
Copyright © 2001, 2003, Andrew W. Moore                                                     82
How to find the best fit?
In the red segment…                              y^est(x) = h2 · φ2(x) + h3 · φ3(x)

                          where φ2(x) = 1 − |x − q2| / w ,   φ3(x) = 1 − |x − q3| / w



                     h2




               y
                                                                                  φ2(x)
                     h3


                          q1              q2    q3            q4             q5
                                                                     φ3(x)
                               x
Copyright © 2001, 2003, Andrew W. Moore                                                     83




                How to find the best fit?
In general                                       y^est(x) = Σ_{i=1}^{NK} hi · φi(x)

                          where φi(x) = 1 − |x − qi| / w   if |x − qi| < w
                                        0                  otherwise

                     h2




               y
                                                                                  φ2(x)
                     h3


                          q1              q2    q3            q4             q5
                                                                     φ3(x)
                               x
Copyright © 2001, 2003, Andrew W. Moore                                                     84
How to find the best fit?
 In general                                      y^est(x) = Σ_{i=1}^{NK} hi · φi(x)

                          where φi(x) = 1 − |x − qi| / w   if |x − qi| < w
                                        0                  otherwise

                       h2    And this is simply a basis function
                                    regression problem!

                 y               We know how to find the least
                                                                                   φ2(x)
                        h3               squares hi’s!

                            q1              q2   q3           q4              q5
                                                                     φ3(x)
                                 x
  Copyright © 2001, 2003, Andrew W. Moore                                                  85
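
Since this is just basis-function regression, a minimal sketch of the whole fit is shown below: build the hat-shaped φ_i from the definition above and solve for the knot heights h_i by least squares. The data, knot positions and the use of numpy's lstsq are illustrative assumptions, not part of the slides.

```python
# A minimal sketch of the basis-function regression for multilinear
# interpolation: build the "hat" basis phi_i(x) from the slides and solve
# for the knot heights h_i by least squares. Data and knots are made up.
import numpy as np

def hat_basis(x, knots, w):
    """Return the N x NK matrix with entries phi_i(x_n) = max(0, 1 - |x_n - q_i| / w)."""
    d = np.abs(x[:, None] - knots[None, :]) / w
    return np.clip(1.0 - d, 0.0, None)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 4.0, size=40)
y = np.sin(x) + 0.1 * rng.standard_normal(40)      # hypothetical noisy data

knots = np.linspace(0.0, 4.0, 5)                   # q1 ... q5, equally spaced
w = knots[1] - knots[0]

Phi = hat_basis(x, knots, w)                       # the design matrix
h, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # least-squares knot heights

# The fitted curve is continuous and piecewise linear, bending only at the knots.
y_est = hat_basis(np.array([1.3]), knots, w) @ h
print(h, y_est)
```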




                       In two dimensions…
Blue dots show
locations of input
vectors (outputs
not depicted)




                 x2



                                 x1
  Copyright © 2001, 2003, Andrew W. Moore                                                  86
In two dimensions…           Each purple dot
                                                    is a knot point.
Blue dots show                                      It will contain
locations of input                                  the height of
vectors (outputs                                    the estimated
not depicted)                                       surface




                 x2



                               x1
  Copyright © 2001, 2003, Andrew W. Moore                        87




                       In two dimensions…           Each purple dot
                                                    is a knot point.
Blue dots show                                      It will contain
locations of input                                  the height of
vectors (outputs                                    the estimated
                                            9   3
not depicted)                                       surface
                                                    But how do we
                                                    do the
                                                    interpolation to
                                                    ensure that the
                                                    surface is
                                            7   8
                                                    continuous?


                 x2



                               x1
  Copyright © 2001, 2003, Andrew W. Moore                        88
In two dimensions…                  Each purple dot
                                                           is a knot point.
Blue dots show                                                   It will contain
locations of input                                               the height of
vectors (outputs                                  9          3   the estimated
not depicted)                                                    surface
   To predict the
   value here…
                                                           But how do we
                                                           do the
                                                           interpolation to
                                                           ensure that the
                                                           surface is
                                            7          8
                                                           continuous?


                 x2



                               x1
  Copyright © 2001, 2003, Andrew W. Moore                               89




                       In two dimensions…                  Each purple dot
                                                           is a knot point.
Blue dots show                                             It will contain
locations of input                                         the height of
vectors (outputs                                           the estimated
                                             9      7       3
not depicted)                                              surface
                                                           But how do we
                                                           do the
   To predict the
                                                           interpolation to
   value here…                                             ensure that the
   First interpolate                                       surface is
                                            7          8
   its value on two                                        continuous?
   opposite edges…                              7.33
                 x2



                               x1
  Copyright © 2001, 2003, Andrew W. Moore                               90
In two dimensions…                            Each purple dot
                                                                     is a knot point.
Blue dots show                                                       It will contain
locations of input                                                   the height of
vectors (outputs                                                     the estimated
                                             9      7        3
not depicted)                                                        surface
   To predict the                                                    But how do we
   value here…                                                       do the
                                                7.05                 interpolation to
   First interpolate
                                                                     ensure that the
   its value on two
                                                                     surface is
                                            7            8
   opposite edges…
                                                                     continuous?
   Then interpolate
                                                7.33
   between those
   two values
                x2




                               x1
  Copyright © 2001, 2003, Andrew W. Moore                                         91




                       In two dimensions…                                  Each purple dot
                                                                           is a knot point.
Blue dots show                                                             It will contain
locations of input                                                         the height of
vectors (outputs                                   9      7        3       the estimated
not depicted)                                                              surface
   To predict the                                      7.05                But how do we
   value here…                                                             do the
   First interpolate                                                       interpolation to
   its value on two                                7           8           ensure that the
   opposite edges…                                     7.33                surface is
   Then interpolate                                                        continuous?
   between those
   two values            Notes:
                x2       This can easily be generalized to m dimensions.
                         It should be easy to see that it ensures continuity.
                         The patches are not linear.
                              x1
  Copyright © 2001, 2003, Andrew W. Moore                                         92
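
Here is a minimal sketch of the two-stage interpolation just described: first interpolate along two opposite edges of the cell, then interpolate between those two edge values. The corner heights 9, 3, 7, 8 come from the slides; the fractional position of the query point inside the cell is a made-up assumption.

```python
# A minimal sketch of the two-stage (bilinear) interpolation: interpolate
# along the top and bottom edges of the cell, then between the two edge
# values. Corner heights come from the slide; fx and fy (the fractional
# position of the query point inside the cell) are made-up assumptions.

def bilinear(h_tl, h_tr, h_bl, h_br, fx, fy):
    """fx, fy in [0, 1]: fractional position across the cell (left-right, top-bottom)."""
    top = h_tl + fx * (h_tr - h_tl)       # interpolate along the top edge
    bottom = h_bl + fx * (h_br - h_bl)    # interpolate along the bottom edge
    return top + fy * (bottom - top)      # then interpolate between the two edges

# With fx = 1/3 the bottom-edge value is 7 + (8 - 7)/3 ≈ 7.33, matching the
# slide; fy here is chosen only for illustration.
print(bilinear(9, 3, 7, 8, fx=1/3, fy=0.15))
```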
Doing the regression          Given data, how
                                                  do we find the
                                                  optimal knot
                                                  heights?
                                                  Happily, it’s
                                          9   3   simply a two-
                                                  dimensional
                                                  basis function
                                                  problem.
                                                  (Working out
                                          7   8   the basis
                                                  functions is
                                                  tedious,
               x2                                 unilluminating,
                                                  and easy)
                                                  What’s the
                                                  problem in
                                                  higher
                             x1                   dimensions?
Copyright © 2001, 2003, Andrew W. Moore                         93
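
One way to set up the two-dimensional basis-function problem (the slide leaves the details as "tedious, unilluminating, and easy") is to take each basis function to be the product of a 1-d hat function in x1 and a 1-d hat function in x2, which reproduces the bilinear interpolation above. The sketch below does this with made-up data and knot grids. It also hints at the answer to the slide's closing question: in m dimensions the number of knot points, and hence of basis functions, grows exponentially.

```python
# A sketch of the 2-d regression: take the basis functions to be products of
# 1-d "hat" functions in each dimension, then solve for the knot heights by
# ordinary least squares. All data, knot grids and spacings are made up.
import numpy as np

def hat(u, q, w):
    return np.clip(1.0 - np.abs(u[:, None] - q[None, :]) / w, 0.0, None)

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 2))                  # 2-d inputs
y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + 0.05 * rng.standard_normal(200)

q1 = np.linspace(0.0, 1.0, 4)                             # knots in x1
q2 = np.linspace(0.0, 1.0, 4)                             # knots in x2
w1, w2 = q1[1] - q1[0], q2[1] - q2[0]

# Tensor-product basis: one column per (knot in x1, knot in x2) pair.
B1 = hat(X[:, 0], q1, w1)                                 # 200 x 4
B2 = hat(X[:, 1], q2, w2)                                 # 200 x 4
Phi = (B1[:, :, None] * B2[:, None, :]).reshape(len(X), -1)   # 200 x 16

heights, *_ = np.linalg.lstsq(Phi, y, rcond=None)         # one height per knot point
print(heights.reshape(4, 4))
```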




    MARS: Multivariate
   Adaptive Regression
         Splines
Copyright © 2001, 2003, Andrew W. Moore                         94
MARS
• Multivariate Adaptive Regression Splines
• Invented by Jerry Friedman (one of
  Andrew’s heroes)
• Simplest version:

            Let’s assume the function we are learning is of the
             following form:
                                    y^est(x) = Σ_{k=1}^{m} g_k(x_k)
 Instead of a linear combination of the inputs, it’s a linear
 combination of non-linear functions of individual inputs

Copyright © 2001, 2003, Andrew W. Moore                                               95




                                          MARS
                                                       y^est(x) = Σ_{k=1}^{m} g_k(x_k)
     Instead of a linear combination of the inputs, it’s a linear
     combination of non-linear functions of individual inputs



                                                                       Idea: Each
                                                                       gk is one of
                                                                            these

     y


               q1             q2          q3          q4          q5
                    x
Copyright © 2001, 2003, Andrew W. Moore                                               96
MARS
                                                      y^est(x) = Σ_{k=1}^{m} g_k(x_k)

     Instead of a linear combination of the inputs, it’s a linear
     combination of non-linear functions of individual inputs

     y^est(x) = Σ_{k=1}^{m} Σ_{j=1}^{NK} h_kj · φ_kj(x_k)

     where φ_kj(x) = 1 − |x_k − q_kj| / w_k    if |x_k − q_kj| < w_k
                     0                         otherwise

     q_kj : the location of the j’th knot in the k’th dimension
     h_kj : the regressed height of the j’th knot in the k’th dimension
     w_k  : the spacing between knots in the k’th dimension

                q1          q2          q3          q4          q5
                     x
Copyright © 2001, 2003, Andrew W. Moore                                                       97
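
A minimal sketch of this simplest MARS version follows: each g_k is itself a sum of 1-d hat basis functions, so all the h_kj coefficients come out of a single least-squares solve. The data, the number of knots per dimension and the knot spacings are made-up assumptions.

```python
# A minimal sketch of the "simplest version" of MARS from these slides: an
# additive model in which each g_k is a sum of 1-d hat basis functions, so
# the whole fit is one linear least-squares problem. Data and knots are made up.
import numpy as np

def hat(u, q, w):
    return np.clip(1.0 - np.abs(u[:, None] - q[None, :]) / w, 0.0, None)

rng = np.random.default_rng(2)
m, NK = 3, 6                                             # input dimensions, knots per dimension
X = rng.uniform(0.0, 1.0, size=(300, m))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 - X[:, 2] + 0.05 * rng.standard_normal(300)

# Stack the per-dimension hat features side by side: one block of NK columns
# per input dimension, giving m * NK coefficients h_kj in total.
blocks = []
for k in range(m):
    q = np.linspace(0.0, 1.0, NK)                        # q_k1 ... q_kNK
    w = q[1] - q[0]                                      # w_k
    blocks.append(hat(X[:, k], q, w))
Phi = np.hstack(blocks)                                  # 300 x (m * NK)

h, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # the h_kj of the slide
print(h.reshape(m, NK))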




   That’s not complicated enough!
• Okay, now let’s get serious. We’ll allow
  arbitrary “two-way interactions”:
       y^est(x) = Σ_{k=1}^{m} g_k(x_k) + Σ_{k=1}^{m} Σ_{t=k+1}^{m} g_kt(x_k, x_t)

                                          Can still be expressed as a linear
   The function we’re                     combination of basis functions
learning is allowed to be
                                          Thus learnable by linear regression
   a sum of non-linear
 functions over all one-d                 Full MARS: Uses cross-validation to
   and 2-d subsets of                     choose a subset of subspaces, knot
        attributes                        resolution and other parameters.

Copyright © 2001, 2003, Andrew W. Moore                                                       98
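
A sketch of the two-way-interaction version: alongside the 1-d hat-basis blocks, add columns that are products of hat functions from two different inputs; the model is still linear in its coefficients, so ordinary least squares applies. The subset selection and cross-validation of full MARS are not shown, and the data and knot choices are made-up assumptions.

```python
# A sketch of the "two-way interaction" extension: add columns that are
# products of hat functions from two different input dimensions. The fit is
# still plain linear regression; full MARS's selection of subspaces and knots
# (e.g. by cross-validation) is not shown. Data and knots are made up.
import numpy as np

def hat(u, q, w):
    return np.clip(1.0 - np.abs(u[:, None] - q[None, :]) / w, 0.0, None)

rng = np.random.default_rng(3)
m, NK = 3, 4
X = rng.uniform(0.0, 1.0, size=(400, m))
y = X[:, 0] * X[:, 1] + np.sin(3 * X[:, 2]) + 0.05 * rng.standard_normal(400)

q = np.linspace(0.0, 1.0, NK)
w = q[1] - q[0]
B = [hat(X[:, k], q, w) for k in range(m)]               # per-dimension 1-d blocks

columns = B[:]                                           # all one-d terms g_k
for k in range(m):
    for t in range(k + 1, m):                            # all pairs k < t: g_kt terms
        columns.append((B[k][:, :, None] * B[t][:, None, :]).reshape(len(X), -1))

Phi = np.hstack(columns)
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(Phi.shape, coef.shape)
```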
If you like MARS…
…See also CMAC (Cerebellar Model Articulated
 Controller) by James Albus (another of
 Andrew’s heroes)
                 • Many of the same gut-level intuitions
                 • But entirely in a neural-network, biologically
                   plausible way
                    • (All the low dimensional functions are by
                      means of lookup tables, trained with a delta-
                      rule and using a clever blurred update and
                      hash-tables)


Copyright © 2001, 2003, Andrew W. Moore                                                  99




                           Where are we now?
 Inputs →  Inference Engine    Learn p(E1|E2)      Joint DE

 Inputs →  Classifier          Predict category    Dec Tree, Gauss/Joint BC, Gauss Naïve BC,

 Inputs →  Density Estimator   Probability         Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE

 Inputs →  Regressor           Predict real no.    Linear Regression, Polynomial Regression, RBFs,
                                                   Robust Regression, Regression Trees, Multilinear
                                                   Interp, MARS


Copyright © 2001, 2003, Andrew W. Moore                                                  100
Citations

Radial Basis Functions
  T. Poggio and F. Girosi, Regularization Algorithms for Learning That Are
  Equivalent to Multilayer Networks, Science, 247, 978-982, 1989

LOESS
  W. S. Cleveland, Robust Locally Weighted Regression and Smoothing
  Scatterplots, Journal of the American Statistical Association, 74(368),
  829-836, December 1979

Regression Trees etc
  L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification
  and Regression Trees, Wadsworth, 1984
  J. R. Quinlan, Combining Instance-Based and Model-Based Learning, Machine
  Learning: Proceedings of the Tenth International Conference, 1993

MARS
  J. H. Friedman, Multivariate Adaptive Regression Splines, Department of
  Statistics, Stanford University, Technical Report No. 102, 1988



Copyright © 2001, 2003, Andrew W. Moore                                        101

Weitere ähnliche Inhalte

Andere mochten auch

Parameter Estimation for Semiparametric Models with CMARS and Its Applications
Parameter Estimation for Semiparametric Models with CMARS and Its ApplicationsParameter Estimation for Semiparametric Models with CMARS and Its Applications
Parameter Estimation for Semiparametric Models with CMARS and Its ApplicationsSSA KPI
 
Evolution of regression ols to gps to mars
Evolution of regression   ols to gps to marsEvolution of regression   ols to gps to mars
Evolution of regression ols to gps to marsSalford Systems
 
Presentation Machine Learning
Presentation Machine LearningPresentation Machine Learning
Presentation Machine LearningPeriklis Gogas
 
Introduction to MARS (1999)
Introduction to MARS (1999)Introduction to MARS (1999)
Introduction to MARS (1999)Salford Systems
 
Data Science - Part XV - MARS, Logistic Regression, & Survival Analysis
Data Science -  Part XV - MARS, Logistic Regression, & Survival AnalysisData Science -  Part XV - MARS, Logistic Regression, & Survival Analysis
Data Science - Part XV - MARS, Logistic Regression, & Survival AnalysisDerek Kane
 

Andere mochten auch (6)

Parameter Estimation for Semiparametric Models with CMARS and Its Applications
Parameter Estimation for Semiparametric Models with CMARS and Its ApplicationsParameter Estimation for Semiparametric Models with CMARS and Its Applications
Parameter Estimation for Semiparametric Models with CMARS and Its Applications
 
Evolution of regression ols to gps to mars
Evolution of regression   ols to gps to marsEvolution of regression   ols to gps to mars
Evolution of regression ols to gps to mars
 
Presentation Machine Learning
Presentation Machine LearningPresentation Machine Learning
Presentation Machine Learning
 
Introduction to MARS (1999)
Introduction to MARS (1999)Introduction to MARS (1999)
Introduction to MARS (1999)
 
Tokyowebmining41
Tokyowebmining41Tokyowebmining41
Tokyowebmining41
 
Data Science - Part XV - MARS, Logistic Regression, & Survival Analysis
Data Science -  Part XV - MARS, Logistic Regression, & Survival AnalysisData Science -  Part XV - MARS, Logistic Regression, & Survival Analysis
Data Science - Part XV - MARS, Logistic Regression, & Survival Analysis
 

Ähnlich wie Predicting Real-valued Outputs: An introduction to regression

Information Gain
Information GainInformation Gain
Information Gainguest32311f
 
Lesson 27: Integration by Substitution, part II (Section 10 version)
Lesson 27: Integration by Substitution, part II (Section 10 version)Lesson 27: Integration by Substitution, part II (Section 10 version)
Lesson 27: Integration by Substitution, part II (Section 10 version)Matthew Leingang
 
Lesson 22: Optimization I (Section 10 Version)
Lesson 22: Optimization I (Section 10 Version)Lesson 22: Optimization I (Section 10 Version)
Lesson 22: Optimization I (Section 10 Version)Matthew Leingang
 
IVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methodsIVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methodsCharles Deledalle
 
Lecture6
Lecture6Lecture6
Lecture6voracle
 
2021 03-01-on the relationship between self-attention and convolutional layers
2021 03-01-on the relationship between self-attention and convolutional layers2021 03-01-on the relationship between self-attention and convolutional layers
2021 03-01-on the relationship between self-attention and convolutional layersJAEMINJEONG5
 
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskinComputer Science Club
 
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...Kirill Eremenko
 
A simple confidence interval for meta analysis
A simple confidence interval for meta analysisA simple confidence interval for meta analysis
A simple confidence interval for meta analysisrsd kol abundjani
 
Lesson 19: Double Integrals over General Regions
Lesson 19: Double Integrals over General RegionsLesson 19: Double Integrals over General Regions
Lesson 19: Double Integrals over General RegionsMatthew Leingang
 
Large variance and fat tail of damage by natural disaster
Large variance and fat tail of damage by natural disasterLarge variance and fat tail of damage by natural disaster
Large variance and fat tail of damage by natural disasterHang-Hyun Jo
 
Lesson 19: Optimization Problems
Lesson 19: Optimization ProblemsLesson 19: Optimization Problems
Lesson 19: Optimization ProblemsMatthew Leingang
 
Multimodal pattern matching algorithms and applications
Multimodal pattern matching algorithms and applicationsMultimodal pattern matching algorithms and applications
Multimodal pattern matching algorithms and applicationsXavier Anguera
 
Eight Regression Algorithms
Eight Regression AlgorithmsEight Regression Algorithms
Eight Regression Algorithmsguestfee8698
 
Lesson 22: Optimization I (Section 4 version)
Lesson 22: Optimization I (Section 4 version)Lesson 22: Optimization I (Section 4 version)
Lesson 22: Optimization I (Section 4 version)Matthew Leingang
 

Ähnlich wie Predicting Real-valued Outputs: An introduction to regression (20)

VC dimensio
VC dimensioVC dimensio
VC dimensio
 
Information Gain
Information GainInformation Gain
Information Gain
 
Lesson 27: Integration by Substitution, part II (Section 10 version)
Lesson 27: Integration by Substitution, part II (Section 10 version)Lesson 27: Integration by Substitution, part II (Section 10 version)
Lesson 27: Integration by Substitution, part II (Section 10 version)
 
Lesson 22: Optimization I (Section 10 Version)
Lesson 22: Optimization I (Section 10 Version)Lesson 22: Optimization I (Section 10 Version)
Lesson 22: Optimization I (Section 10 Version)
 
IVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methodsIVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methods
 
Electromagnetic.pdf
Electromagnetic.pdfElectromagnetic.pdf
Electromagnetic.pdf
 
Lecture6
Lecture6Lecture6
Lecture6
 
2021 03-01-on the relationship between self-attention and convolutional layers
2021 03-01-on the relationship between self-attention and convolutional layers2021 03-01-on the relationship between self-attention and convolutional layers
2021 03-01-on the relationship between self-attention and convolutional layers
 
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
 
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...
 
A simple confidence interval for meta analysis
A simple confidence interval for meta analysisA simple confidence interval for meta analysis
A simple confidence interval for meta analysis
 
Lesson 19: Double Integrals over General Regions
Lesson 19: Double Integrals over General RegionsLesson 19: Double Integrals over General Regions
Lesson 19: Double Integrals over General Regions
 
Partitions
PartitionsPartitions
Partitions
 
Large variance and fat tail of damage by natural disaster
Large variance and fat tail of damage by natural disasterLarge variance and fat tail of damage by natural disaster
Large variance and fat tail of damage by natural disaster
 
Morphing Image
Morphing Image Morphing Image
Morphing Image
 
Derivatives
DerivativesDerivatives
Derivatives
 
Lesson 19: Optimization Problems
Lesson 19: Optimization ProblemsLesson 19: Optimization Problems
Lesson 19: Optimization Problems
 
Multimodal pattern matching algorithms and applications
Multimodal pattern matching algorithms and applicationsMultimodal pattern matching algorithms and applications
Multimodal pattern matching algorithms and applications
 
Eight Regression Algorithms
Eight Regression AlgorithmsEight Regression Algorithms
Eight Regression Algorithms
 
Lesson 22: Optimization I (Section 4 version)
Lesson 22: Optimization I (Section 4 version)Lesson 22: Optimization I (Section 4 version)
Lesson 22: Optimization I (Section 4 version)
 

Mehr von guestfee8698

Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machinesguestfee8698
 
Hidden Markov Models
Hidden Markov ModelsHidden Markov Models
Hidden Markov Modelsguestfee8698
 
K-means and Hierarchical Clustering
K-means and Hierarchical ClusteringK-means and Hierarchical Clustering
K-means and Hierarchical Clusteringguestfee8698
 
Short Overview of Bayes Nets
Short Overview of Bayes NetsShort Overview of Bayes Nets
Short Overview of Bayes Netsguestfee8698
 
A Short Intro to Naive Bayesian Classifiers
A Short Intro to Naive Bayesian ClassifiersA Short Intro to Naive Bayesian Classifiers
A Short Intro to Naive Bayesian Classifiersguestfee8698
 
Learning Bayesian Networks
Learning Bayesian NetworksLearning Bayesian Networks
Learning Bayesian Networksguestfee8698
 
Inference in Bayesian Networks
Inference in Bayesian NetworksInference in Bayesian Networks
Inference in Bayesian Networksguestfee8698
 
Instance-based learning (aka Case-based or Memory-based or non-parametric)
Instance-based learning (aka Case-based or Memory-based or non-parametric)Instance-based learning (aka Case-based or Memory-based or non-parametric)
Instance-based learning (aka Case-based or Memory-based or non-parametric)guestfee8698
 
Gaussian Bayes Classifiers
Gaussian Bayes ClassifiersGaussian Bayes Classifiers
Gaussian Bayes Classifiersguestfee8698
 
Probability for Data Miners
Probability for Data MinersProbability for Data Miners
Probability for Data Minersguestfee8698
 

Mehr von guestfee8698 (13)

PAC Learning
PAC LearningPAC Learning
PAC Learning
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Hidden Markov Models
Hidden Markov ModelsHidden Markov Models
Hidden Markov Models
 
K-means and Hierarchical Clustering
K-means and Hierarchical ClusteringK-means and Hierarchical Clustering
K-means and Hierarchical Clustering
 
Short Overview of Bayes Nets
Short Overview of Bayes NetsShort Overview of Bayes Nets
Short Overview of Bayes Nets
 
A Short Intro to Naive Bayesian Classifiers
A Short Intro to Naive Bayesian ClassifiersA Short Intro to Naive Bayesian Classifiers
A Short Intro to Naive Bayesian Classifiers
 
Learning Bayesian Networks
Learning Bayesian NetworksLearning Bayesian Networks
Learning Bayesian Networks
 
Inference in Bayesian Networks
Inference in Bayesian NetworksInference in Bayesian Networks
Inference in Bayesian Networks
 
Bayesian Networks
Bayesian NetworksBayesian Networks
Bayesian Networks
 
Instance-based learning (aka Case-based or Memory-based or non-parametric)
Instance-based learning (aka Case-based or Memory-based or non-parametric)Instance-based learning (aka Case-based or Memory-based or non-parametric)
Instance-based learning (aka Case-based or Memory-based or non-parametric)
 
Cross-Validation
Cross-ValidationCross-Validation
Cross-Validation
 
Gaussian Bayes Classifiers
Gaussian Bayes ClassifiersGaussian Bayes Classifiers
Gaussian Bayes Classifiers
 
Probability for Data Miners
Probability for Data MinersProbability for Data Miners
Probability for Data Miners
 

Kürzlich hochgeladen

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 

Kürzlich hochgeladen (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 

Predicting Real-valued Outputs: An introduction to regression

  • 1. Predicting Real-valued outputs: an introduction to Regression Andrew W. Moore Note to other teachers and users of these slides. Andrew would be delighted Professor if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them School of Computer Science to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in Carnegie Mellon University your own lecture, please include this l teria d ma message, or the following link to the rdere Nets source repository of Andrew’s tutorials: is reo www.cs.cmu.edu/~awm l This he Neura “Favorite http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully t d the rithms” awm@cs.cmu.edu from re an received. o lectu ssion Alg 412-268-7599 e Regr e r lectu Copyright © 2001, 2003, Andrew W. Moore Single- Parameter Linear Regression Copyright © 2001, 2003, Andrew W. Moore 2
  • 2. Linear Regression DATASET inputs outputs x1 = 1 y1 = 1 x2 = 3 y2 = 2.2 ↑ x3 = 2 y3 = 2 w ↓ ←1→ x4 = 1.5 y4 = 1.9 x5 = 4 y5 = 3.1 Linear regression assumes that the expected value of the output given an input, E[y|x], is linear. Simplest case: Out(x) = wx for some unknown w. Given the data, we can estimate w. Copyright © 2001, 2003, Andrew W. Moore 3 1-parameter linear regression Assume that the data is formed by yi = wxi + noisei where… • the noise signals are independent • the noise has a normal distribution with mean 0 and unknown variance σ2 p(y|w,x) has a normal distribution with • mean wx • variance σ2 Copyright © 2001, 2003, Andrew W. Moore 4
  • 3. Bayesian Linear Regression p(y|w,x) = Normal (mean wx, var σ2) We have a set of datapoints (x1,y1) (x2,y2) … (xn,yn) which are EVIDENCE about w. We want to infer w from the data. p(w|x1, x2, x3,…xn, y1, y2…yn) •You can use BAYES rule to work out a posterior distribution for w given the data. •Or you could do Maximum Likelihood Estimation Copyright © 2001, 2003, Andrew W. Moore 5 Maximum likelihood estimation of w Asks the question: “For which value of w is this data most likely to have happened?” <=> For what w is p(y1, y2…yn |x1, x2, x3,…xn, w) maximized? <=> For what w is n ∏ p( y w, x ) maximized? i i i =1 Copyright © 2001, 2003, Andrew W. Moore 6
  • 4. For what w is n ∏ p ( y i w , x i ) maximized? i =1 For what w is n 1 ∏ exp(− 2 ( yi − wxi ) 2 ) maximized? σ i =1 For what w is 2 1  y − wx i  n ∑ −i  maximized? σ 2  i =1 For what w is 2 n ∑ (y ) − wx minimized? i i i =1 Copyright © 2001, 2003, Andrew W. Moore 7 Linear Regression The maximum likelihood w is the one that E(w) w minimizes sum- Ε = ∑( yi − wxi ) 2 of-squares of residuals i (∑ x )w = ∑ yi − (2∑ xi yi )w + 2 2 2 i i We want to minimize a quadratic function of w. Copyright © 2001, 2003, Andrew W. Moore 8
  • 5. Linear Regression Easy to show the sum of squares is minimized when ∑x y i i w= ∑x 2 i The maximum likelihood Out(x) = wx model is We can use it for prediction Copyright © 2001, 2003, Andrew W. Moore 9 Linear Regression Easy to show the sum of squares is minimized when ∑x y p(w) i i w= w ∑x 2 Note: i In Bayesian stats you’d have ended up with a prob dist of w The maximum likelihood Out(x) = wx model is And predictions would have given a prob dist of expected output Often useful to know your confidence. We can use it for Max likelihood can give some kinds of prediction confidence too. Copyright © 2001, 2003, Andrew W. Moore 10
  • 6. Multivariate Linear Regression Copyright © 2001, 2003, Andrew W. Moore 11 Multivariate Regression What if the inputs are vectors? 3. .4 6. 2-d input 5 . example .8 x2 . 10 x1 Dataset has form x1 y1 x2 y2 x3 y3 .: : . xR yR Copyright © 2001, 2003, Andrew W. Moore 12
  • 7. Multivariate Regression Write matrix X and Y thus:  .....x1 .....   x11 x1m   y1  x12 ... .....x .....  x  y  x22 ... x2 m   =  21 x= y =  2 2    M M M     .....x R .....  xR1 xR 2 ... xRm   yR  (there are R datapoints. Each input has m components) The linear regression model assumes a vector w such that Out(x) = wTx = w1x[1] + w2x[2] + ….wmx[D] The max. likelihood w is w = (XTX) -1(XTY) Copyright © 2001, 2003, Andrew W. Moore 13 Multivariate Regression Write matrix X and Y thus:  .....x1 .....   x11 x1m   y1  x12 ... .....x .....  x ... x2 m   x22  y =  y2   =  21 x= 2    M M M     .....x R .....  xR1 xR 2 ... xRm   yR  IMPORTANT EXERCISE: (there are R datapoints. Each input hasPROVE IT !!!!! m components) The linear regression model assumes a vector w such that Out(x) = wTx = w1x[1] + w2x[2] + ….wmx[D] The max. likelihood w is w = (XTX) -1(XTY) Copyright © 2001, 2003, Andrew W. Moore 14
  • 8. Multivariate Regression (con’t) The max. likelihood w is w = (XTX)-1(XTY) R ∑x x ki kj XTX is an m x m matrix: i,j’th elt is k =1 R ∑x y XTY is an m-element vector: i’th elt ki k k =1 Copyright © 2001, 2003, Andrew W. Moore 15 Constant Term in Linear Regression Copyright © 2001, 2003, Andrew W. Moore 16
  • 9. What about a constant term? We may expect linear data that does not go through the origin. Statisticians and Neural Net Folks all agree on a simple obvious hack. Can you guess?? Copyright © 2001, 2003, Andrew W. Moore 17 The constant term • The trick is to create a fake input “X0” that always takes the value 1 X1 X2 Y X0 X1 X2 Y 2 4 16 1 2 4 16 3 4 17 1 3 4 17 5 5 20 1 5 5 20 Before: After: Y=w1X1+ w2X2 Y= w0X0+w1X1+ w2X2 …has to be a poor = w0+w1X1+ w2X2 In this example, You should be able model …has a fine constant to see the MLE w0 , w1 and w2 by term inspection Copyright © 2001, 2003, Andrew W. Moore 18
  • 10. . .. i ty astic ed rosc e H et Linear Regression with varying noise Copyright © 2001, 2003, Andrew W. Moore 19 Regression with varying noise • Suppose you know the variance of the noise that was added to each datapoint. y=3 σ=2 σi2 xi yi ½ ½ 4 y=2 σ=1/2 1 1 1 σ=1 2 1 1/4 y=1 σ=1/2 σ=2 2 3 4 y=0 3 2 1/4 x=0 x=1 x=2 x=3 MLE the w? yi ~ N ( wxi , σ i2 ) at’s f W h at e o Assume stim e Copyright © 2001, 2003, Andrew W. Moore 20
  • 11. MLE estimation with varying noise argmax log p( y , y ,..., y | x1 , x2 ,..., xR , σ 12 , σ 2 ,..., σ R , w) = 2 2 1 2 R w Assuming independence ( yi − wxi ) 2 R among noise and then argmin ∑ = plugging in equation for σ i2 Gaussian and simplifying. i =1 w   x ( y − wx ) Setting dLL/dw R  w such that ∑ i i 2 i = 0  = equal to zero   σi   i =1  R xi yi  Trivial algebra ∑ 2     i =1 σ i   R xi2  ∑ 2   σ  i =1 i  Copyright © 2001, 2003, Andrew W. Moore 21 This is Weighted Regression • We are asking to minimize the weighted sum of squares y=3 σ=2 ( yi − wxi ) 2 R argmin ∑ σ i2 y=2 i =1 σ=1/2 w σ=1 y=1 σ=1/2 σ=2 y=0 x=0 x=1 x=2 x=3 1 where weight for i’th datapoint is σ i2 Copyright © 2001, 2003, Andrew W. Moore 22
  • 12. Non-linear Regression Copyright © 2001, 2003, Andrew W. Moore 23 Non-linear Regression • Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g: y=3 xi yi ½ ½ y=2 1 2.5 2 3 y=1 3 2 y=0 3 3 x=0 x=1 x=2 x=3 MLE yi ~ N ( w + xi , σ ) the w? 2 at’s f W h at e o Assume stim e Copyright © 2001, 2003, Andrew W. Moore 24
  • 13. Non-linear MLE estimation argmax log p( y , y ,..., y | x1 , x2 ,..., xR , σ , w) = 1 2 R w Assuming i.i.d. and argmin ∑ (y ) R then plugging in 2 − w + xi = equation for Gaussian i and simplifying. i =1 w   y − w + xi Setting dLL/dw R  w such that ∑ i = 0 = equal to zero   w + xi   i =1 Copyright © 2001, 2003, Andrew W. Moore 25 Non-linear MLE estimation argmax log p( y , y ,..., y | x1 , x2 ,..., xR , σ , w) = 1 2 R w Assuming i.i.d. and argmin ∑ (y ) R then plugging in 2 − w + xi = equation for Gaussian i and simplifying. i =1 w   y − w + xi Setting dLL/dw R  w such that ∑ i = 0 = equal to zero   w + xi   i =1 We’re down the algebraic toilet t wha ess u So g e do? w Copyright © 2001, 2003, Andrew W. Moore 26
  • 14. Non-linear MLE estimation argmax log p( y , y ,..., y | x1 , x2 ,..., xR , σ , w) = 1 2 R w Assuming i.i.d. and argmin ∑ ( ) R Common (but not only) approach: then plugging in 2 yi − w + xi = equation for Gaussian Numerical Solutions: and simplifying. i =1 w • Line Search • Simulated Annealing   yi − w + xi Setting dLL/dw R ∑  w such = 0 = • Gradient Descent that equal to zero   w + xi   • Conjugate Gradient i =1 • Levenberg Marquart We’re down the • Newton’s Method algebraic toilet Also, special purpose statistical- t wha optimization-specific tricks such as uess ? So g e do E.M. (See Gaussian Mixtures lecture for introduction) w Copyright © 2001, 2003, Andrew W. Moore 27 Polynomial Regression Copyright © 2001, 2003, Andrew W. Moore 28
  • 15. Polynomial Regression So far we’ve mainly been dealing with linear regression X1 X2 Y X= 3 2 y= 7 3 2 7 1 1 3 1 1 3 : : : : : : x1=(3,2).. y1=7.. Z= 1 y= 3 2 7 1 1 1 3 : : β=(ZTZ)-1(ZTy) : z1=(1,3,2).. y1=7.. yest = β0+ β1 x1+ β2 x2 zk=(1,xk1,xk2) Copyright © 2001, 2003, Andrew W. Moore 29 Quadratic Regression It’s trivial to do linear fits of fixed nonlinear basis functions X1 X2 Y X= 3 2 y= 7 3 2 7 1 1 3 1 1 3 : : : : : : x1=(3,2).. y1=7.. y= 1 3 2 9 6 4 Z= 7 1 1 1 1 1 1 β=(ZTZ)-1(ZTy) 3 : : : yest = β0+ β1 x1+ β2 x2+ z=(1 , x1, x2 , x1 2, x1x2,x22,) β3 x12 + β4 x1x2 + β5 x22 Copyright © 2001, 2003, Andrew W. Moore 30
  • 16. Quadratic Regression It’s trivial to do linear fitsof afixed nonlinear basisterm. Each component of z vector is called a functions X1 X2 Each column of X= 3 matrix is called a term column Y y= 7 the Z 2 3 2 How many terms in 1 quadratic regression with m 7 a1 3 1 inputs? 1 3 :: : : •1 constant term 1=(3,2).. : : x y1=7.. 1 •m linear terms6 4 y= 329 Z= 7 1 •(m+1)-choose-2 =1 1 1 1 1 m(m+1)/2 quadratic terms β=(ZT -1 T (m+2)-choose-2 terms in total = O(m2) Z) (Z y) 3 : : : y = β0+ β1 x1+ β2 x2+ est z=(1 , x1, x2 , x12, x1x2,x22,) -1 T Note that solving β=(ZTZ) (Z y) is thusβO(m6)+ β x 2 β3 x12 + 4 x1x2 52 Copyright © 2001, 2003, Andrew W. Moore 31 Qth-degree polynomial Regression X1 X2 Y X= 3 y= 7 2 3 2 7 1 1 3 1 1 3 : : : : : : x1=(3,2).. y1=7.. y= 7 1 3 2 9 6 … Z= 3 1 1 1 1 1 … β=(ZTZ)-1(ZTy) : : … yest = β0+ z=(all products of powers of inputs in which sum of powers is q or less,) β1 x1+… Copyright © 2001, 2003, Andrew W. Moore 32
  • 17. m inputs, degree Q: how many terms? = the number of unique terms of the form m x1q1 x2 2 ...xmm where ∑ qi ≤ Q q q i =1 = the number of unique terms of the m form 1q0 x1q1 x2 2 ...xmm where ∑ qi = Q q q i =0 = the number of lists of non-negative integers [q0,q1,q2,..qm] Σqi = Q in which = the number of ways of placing Q red disks on a row of squares of length Q+m = (Q+m)-choose-Q Q=11, m=4 q0=2 q1=2 q2=0 q3=4 q4=3 Copyright © 2001, 2003, Andrew W. Moore 33 Radial Basis Functions Copyright © 2001, 2003, Andrew W. Moore 34
  • 18. Radial Basis Functions (RBFs) X1 X2 Y X= 3 y= 7 2 3 2 7 1 1 3 1 1 3 : : : : :: x1=(3,2).. y1=7.. … … … … … … y= 7 Z= 3 ……………… β=(ZTZ)-1(ZTy) : ……………… yest = β0+ z=(list of radial basis function evaluations) β1 x1+… Copyright © 2001, 2003, Andrew W. Moore 35 1-d RBFs y c1 c1 c1 x yest = β1 φ1(x) + β2 φ2(x) + β3 φ3(x) where φi(x) = KernelFunction( | x - ci | / KW) Copyright © 2001, 2003, Andrew W. Moore 36
  • 19. Example y c1 c1 c1 x yest = 2φ1(x) + 0.05φ2(x) + 0.5φ3(x) where φi(x) = KernelFunction( | x - ci | / KW) Copyright © 2001, 2003, Andrew W. Moore 37 RBFs with Linear Regression KW also held constant (initialized to be large enough that there’s decent c Ally i ’s are held constant overlap between basis (initialized randomly or functions* on a grid in m- c1 c1 c1 x dimensional input space) *Usually much better than the crappy overlap on my diagram yest = 2φ1(x) + 0.05φ2(x) + 0.5φ3(x) where φi(x) = KernelFunction( | x - ci | / KW) Copyright © 2001, 2003, Andrew W. Moore 38
  • 20. RBFs with Linear Regression KW also held constant (initialized to be large enough that there’s decent c Ally i ’s are held constant overlap between basis (initialized randomly or functions* on a grid in m- c1 c1 c1 x dimensional input space) *Usually much better than the crappy overlap on my diagram yest = 2φ1(x) + 0.05φ2(x) + 0.5φ3(x) where φi(x) = KernelFunction( | x - ci | / KW) then given Q basis functions, define the matrix Z such that Zkj = KernelFunction( | xk - ci | / KW) where xk is the kth vector of inputs And as before, β=(ZTZ)-1(ZTy) Copyright © 2001, 2003, Andrew W. Moore 39 RBFs with NonLinear Regression Allow the ci ’s to adapt to KW allowed to adapt to the data. y data (initialized the (Some folks even let each basis function have its own randomly or on a grid in c1 c c1 KWj,permitting fine detail in m-dimensional input 1 x dense regions of input space) space) yest = 2φ1(x) + 0.05φ2(x) + 0.5φ3(x) where φi(x) = KernelFunction( | x - ci | / KW) But how do we now find all the βj’s, ci ’s and KW ? Copyright © 2001, 2003, Andrew W. Moore 40
• 21. RBFs with NonLinear Regression: allow the ci's to adapt to the data (initialized randomly, or on a grid in m-dimensional input space), and allow KW to adapt too (some folks even let each basis function have its own KWj, permitting fine detail in dense regions of input space). But how do we now find all the βj's, ci's and KW? Answer: gradient descent. (But I'd like to see, or hope someone's already done, a hybrid, where the ci's and KW are updated with gradient descent while the βj's use matrix inversion.)
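A rough sketch of that hybrid (purely illustrative, not an implementation from the slides): the centers and KW take gradient steps on the sum of squared errors, with gradients approximated by finite differences to keep the code short, while β is re-solved in closed form at every step.

    import numpy as np

    def design(X, centers, kw):
        """Gaussian RBF design matrix: Z[k, j] = exp(-0.5 * (||x_k - c_j|| / kw)^2)."""
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return np.exp(-0.5 * (d / kw) ** 2)

    def sse_and_beta(X, y, centers, kw):
        """Solve beta in closed form for the current centers/kw; return SSE and beta."""
        Z = design(X, centers, kw)
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r = y - Z @ beta
        return float(r @ r), beta

    def hybrid_fit(X, y, centers, kw, lr=1e-3, eps=1e-4, steps=100):
        centers = centers.astype(float).copy()
        for _ in range(steps):
            base, _ = sse_and_beta(X, y, centers, kw)
            grad_c = np.zeros_like(centers)
            for idx in np.ndindex(*centers.shape):          # finite-difference gradients
                c2 = centers.copy(); c2[idx] += eps
                grad_c[idx] = (sse_and_beta(X, y, c2, kw)[0] - base) / eps
            grad_kw = (sse_and_beta(X, y, centers, kw + eps)[0] - base) / eps
            centers -= lr * grad_c                          # gradient step on the centers
            kw = max(kw - lr * grad_kw, 1e-3)               # keep the kernel width positive
        return sse_and_beta(X, y, centers, kw)[1], centers, kw

In practice one would use analytic gradients (or automatic differentiation), but the alternation itself is the point: nonlinear parameters by gradient descent, linear weights by matrix inversion.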
• 22. Radial Basis Functions in 2-d: two inputs; outputs (heights sticking out of the page) are not shown. Each center has a sphere of significant influence in the (x1, x2) plane.
Happy RBFs in 2-d: blue dots denote the coordinates of the input vectors; the centers and their spheres of significant influence are shown as before.
• 23. Crabby RBFs in 2-d: what's the problem in this example? (Blue dots denote the coordinates of the input vectors; each center's sphere of significant influence is shown.)
More crabby RBFs: and what's the problem in this example?
• 24. Hopeless! Even before seeing the data, you should understand that this arrangement of centers is a disaster.
Unhappy: even before seeing the data, you should understand that this one isn't good either.
• 25. Robust Regression (section title), followed by a scatterplot of y against x.
• 26. Robust Regression: this is the best fit that Quadratic Regression can manage on that data…
…but this is what we'd probably prefer.
• 27. LOESS-based Robust Regression: after the initial fit, score each datapoint according to how well it is fitted. (In the cartoon: "You are a very good datapoint." "You are not too shabby.")
• 28. LOESS-based Robust Regression, continued: "…but you are pathetic."
Robust Regression: for k = 1 to R, let (xk, yk) be the kth datapoint, let yk^est be the predicted value of yk, and let wk be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly: wk = KernelFn([yk − yk^est]^2).
• 29. Robust Regression, continued: then redo the regression using the weighted datapoints. (Weighted regression was described earlier in the "vary noise" section, and is also discussed in the "Memory-based Learning" lecture; I taught you how to do this in the "Instance-based" lecture, only there the weights depended on distance in input space rather than on residual size.) Guess what happens next? Repeat the whole thing until converged!
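A compact sketch of this loop (illustrative only; the quadratic basis, the Gaussian choice of KernelFn and its width are assumptions):

    import numpy as np

    def quad_basis(x):
        return np.column_stack([np.ones_like(x), x, x**2])

    def robust_quadratic_fit(x, y, kernel_width=1.0, iters=10):
        """LOESS-style robust fit: alternate weighted least squares and reweighting."""
        Z = quad_basis(x)
        w = np.ones_like(y)                       # start with all points trusted equally
        for _ in range(iters):
            W = np.sqrt(w)[:, None]
            beta, *_ = np.linalg.lstsq(W * Z, np.sqrt(w) * y, rcond=None)   # weighted LS
            resid = y - Z @ beta
            # w_k = KernelFn([y_k - y_k^est]^2); a Gaussian kernel is one common choice
            w = np.exp(-resid**2 / (2.0 * kernel_width**2))
        return beta, w

    rng = np.random.default_rng(1)
    x = np.linspace(-3, 3, 60)
    y = 1.0 + 0.5 * x - 0.3 * x**2 + 0.2 * rng.standard_normal(60)
    y[::15] += 8.0                                # a few gross outliers
    beta, w = robust_quadratic_fit(x, y)          # outliers end up with near-zero weights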
• 30. Robust Regression — what we're doing. What regular regression does: assume yk was originally generated using the recipe yk = β0 + β1 xk + β2 xk^2 + N(0, σ^2); the computational task is to find the Maximum Likelihood β0, β1 and β2.
What LOESS robust regression does: assume yk was originally generated using the recipe "with probability p: yk = β0 + β1 xk + β2 xk^2 + N(0, σ^2); but otherwise yk ~ N(μ, σhuge^2)"; the computational task is to find the Maximum Likelihood β0, β1, β2, p, μ and σhuge.
• 31. Robust Regression — what we're doing, continued: mysteriously, the reweighting procedure does this computation for us. Your first glimpse of two spectacular letters: E.M.
Regression Trees (section title).
• 32. Regression Trees: "decision trees for regression".
A regression tree leaf: Predict age = 47 — the mean age of the records matching this leaf node.
• 33. A one-split regression tree: Gender? If Female, predict age = 39; if Male, predict age = 36.
Choosing the attribute to split on — example records:
Gender | Rich? | Num. Children | Num. Beany Babies | Age
Female | No    | 2             | 1                 | 38
Male   | No    | 0             | 0                 | 24
Male   | Yes   | 0             | 5+                | 72
:      | :     | :             | :                 | :
We can't use information gain. What should we use?
• 34. Choosing the attribute to split on, continued. MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value. If we're told x = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x = j; call this mean quantity μ_{y|x=j}. Then
MSE(Y|X) = (1/R) Σ_{j=1..N_X} Σ_{k such that x_k = j} (y_k − μ_{y|x=j})^2.
Regression tree attribute selection: greedily choose the attribute that minimizes MSE(Y|X). Guess what we do about real-valued inputs? Guess how we prevent overfitting?
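A small sketch of that split criterion on made-up records (illustrative, not from the slides):

    import numpy as np

    def mse_given_attribute(x, y):
        """MSE(Y|X): for each value j of X, sum (y_k - mean of y where x_k = j)^2,
        then divide by the total number of records R."""
        y = np.asarray(y, dtype=float)
        total = 0.0
        for j in np.unique(x):
            yj = y[np.asarray(x) == j]
            total += np.sum((yj - yj.mean()) ** 2)
        return total / len(y)

    gender = ["F", "M", "M", "F", "M"]
    rich   = ["No", "No", "Yes", "Yes", "No"]
    age    = [38, 24, 72, 45, 30]

    # Greedy selection: split on whichever attribute gives the smaller MSE(Y|X)
    scores = {"Gender": mse_given_attribute(gender, age),
              "Rich?":  mse_given_attribute(rich, age)}
    best_attribute = min(scores, key=scores.get)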
• 35. Pruning decision (…property-owner = Yes): the Gender? split asks, "Do I deserve to live?" Female: predict age = 39; Male: predict age = 36. Number of property-owning females = 56712, number of property-owning males = 55800; mean age among POFs = 39, among POMs = 36; age std dev among POFs = 12, among POMs = 11.5. Use a standard Chi-squared test of the null hypothesis "these two populations have the same mean", and Bob's your uncle.
Linear Regression Trees (…property-owner = Yes), also known as "Model Trees": Gender? Female: predict age = 26 + 6·NumChildren − 2·YearsEducation; Male: predict age = 24 + 7·NumChildren − 2.5·YearsEducation. Leaves contain linear functions (trained using linear regression on all records matching that leaf); the split attribute is chosen to minimize the MSE of the regressed children; pruning uses a different Chi-squared test.
• 36. Linear Regression Trees, continued (…property-owner = Yes), with the same leaf regressions as above. Detail: you typically ignore any categorical attribute that has already been tested higher up in the tree, but real-valued attributes are used in the leaf regressions even if they've been tested above.
Test your understanding: assuming regular regression trees, can you sketch a graph of the fitted function y^est(x) over this diagram?
• 37. Test your understanding: assuming linear regression trees, can you sketch a graph of the fitted function y^est(x) over this diagram?
Multilinear Interpolation (section title).
• 38. Multilinear Interpolation: consider this dataset. Suppose we wanted to create a continuous, piecewise linear fit to the data.
Create a set of knot points: selected x-coordinates q1, q2, q3, q4, q5 (usually equally spaced) that cover the data.
  • 39. Multilinear Interpolation We are going to assume the data was generated by a noisy version of a function that can only bend at the knots. Here are 3 examples (none fits the data well) y q1 q2 q3 q4 q5 x Copyright © 2001, 2003, Andrew W. Moore 77 How to find the best fit? Idea 1: Simply perform a separate regression in each segment for each part of the curve What’s the problem with this idea? y q1 q2 q3 q4 q5 x Copyright © 2001, 2003, Andrew W. Moore 78
• 40. How to find the best fit? Let's look at what goes on in the red segment between q2 and q3, whose knot heights are h2 and h3:
y^est(x) = ((q3 − x)/w) h2 + ((x − q2)/w) h3, where w = q3 − q2.
Equivalently, in the red segment, y^est(x) = h2 φ2(x) + h3 φ3(x), where φ2(x) = 1 − (x − q2)/w and φ3(x) = 1 − (q3 − x)/w.
• 41. How to find the best fit? Rewriting with absolute values, so each φ becomes a symmetric tent centered on its knot: in the red segment, y^est(x) = h2 φ2(x) + h3 φ3(x), where φ2(x) = 1 − |x − q2|/w and φ3(x) = 1 − |x − q3|/w.
• 42. How to find the best fit? In the red segment, y^est(x) = h2 φ2(x) + h3 φ3(x) with φ2(x) = 1 − |x − q2|/w and φ3(x) = 1 − |x − q3|/w. In general,
y^est(x) = Σ_{i=1..N_K} h_i φ_i(x), where φ_i(x) = 1 − |x − q_i|/w if |x − q_i| < w, and 0 otherwise.
• 43. How to find the best fit? In general, y^est(x) = Σ_{i=1..N_K} h_i φ_i(x) with the tent functions φ_i above — and this is simply a basis function regression problem! We know how to find the least-squares h_i's.
In two dimensions… blue dots show the locations of the input vectors (outputs not depicted).
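A short sketch of the 1-d fit (illustrative; the knot grid and the toy data are made up): build the tent-function design matrix and solve for the knot heights by least squares.

    import numpy as np

    def tent_basis(x, knots):
        """Phi[k, i] = max(0, 1 - |x_k - q_i| / w), with w the knot spacing."""
        w = knots[1] - knots[0]                        # assumes equally spaced knots
        return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots[None, :]) / w)

    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(0, 10, 80))
    y = np.sin(x) + 0.1 * rng.standard_normal(80)

    knots = np.linspace(0, 10, 6)                      # q1..q6, equally spaced
    Phi = tent_basis(x, knots)
    h, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # least-squares knot heights
    y_est = Phi @ h                                    # continuous, piecewise linear fit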
• 44. In two dimensions… each purple dot is a knot point; it will contain the height of the estimated surface. (Blue dots show the locations of the input vectors; outputs not depicted.)
But how do we do the interpolation to ensure that the surface is continuous? In the example cell, the four surrounding knot heights are 9, 3, 7 and 8.
• 45. In two dimensions… to predict the value at a query point inside that cell: first interpolate its value on two opposite edges of the cell (in the example, 7 on the edge between the knots with heights 9 and 3, and 7.33 on the edge between the knots with heights 7 and 8)…
• 46. In two dimensions… then interpolate between those two edge values (7.33 and 7 in the example) to get the prediction at the query point: 7.05.
Notes: this can easily be generalized to m dimensions; it should be easy to see that it ensures continuity; the patches are not linear.
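A tiny sketch of that two-step interpolation inside one cell (illustrative; the query point is chosen so the numbers match the slide's example of 7.33, 7 and 7.05):

    def bilinear(q, cell, heights):
        """Interpolate inside one cell.
        cell = (x1_lo, x1_hi, x2_lo, x2_hi); heights = knot heights at the four
        corners, ordered (lo,lo), (hi,lo), (lo,hi), (hi,hi)."""
        x1_lo, x1_hi, x2_lo, x2_hi = cell
        t1 = (q[0] - x1_lo) / (x1_hi - x1_lo)
        t2 = (q[1] - x2_lo) / (x2_hi - x2_lo)
        bottom = (1 - t1) * heights[0] + t1 * heights[1]   # interpolate along one edge
        top    = (1 - t1) * heights[2] + t1 * heights[3]   # ...and the opposite edge
        return (1 - t2) * bottom + t2 * top                # then between those two values

    # Corner heights 7, 8 (one edge) and 9, 3 (opposite edge), as in the example;
    # this query point reproduces the slide's edge values 7.33 and 7, and result 7.05
    print(bilinear(q=(1/3, 0.85), cell=(0.0, 1.0, 0.0, 1.0), heights=(7, 8, 9, 3)))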
• 47. Doing the regression: given data, how do we find the optimal knot heights? Happily, it's simply a two-dimensional basis function problem. (Working out the basis functions is tedious, unilluminating, and easy.) What's the problem in higher dimensions?
MARS: Multivariate Adaptive Regression Splines (section title).
• 48. MARS — Multivariate Adaptive Regression Splines. Invented by Jerry Friedman (one of Andrew's heroes). Simplest version: let's assume the function we are learning is of the form
y^est(x) = Σ_{k=1..m} g_k(x_k).
Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of the individual inputs. Idea: each g_k is one of the piecewise linear, knot-based functions from the previous section (knots q1 … q5 along the x_k axis).
• 49. MARS, continued:
y^est(x) = Σ_{k=1..m} Σ_{j=1..N_K} h^k_j φ^k_j(x_k), where φ^k_j(x) = 1 − |x_k − q^k_j| / w_k if |x_k − q^k_j| < w_k, and 0 otherwise.
Here q^k_j is the location of the j'th knot in the k'th dimension, h^k_j is the regressed height of the j'th knot in the k'th dimension, and w_k is the spacing between knots in the k'th dimension.
That's not complicated enough! Okay, now let's get serious and allow arbitrary "two-way interactions":
y^est(x) = Σ_{k=1..m} g_k(x_k) + Σ_{k=1..m} Σ_{t=k+1..m} g_{kt}(x_k, x_t).
This can still be expressed as a linear combination of basis functions, and is thus learnable by linear regression: the function we're learning is allowed to be a sum of non-linear functions over all one-d and two-d subsets of the attributes. Full MARS uses cross-validation to choose a subset of subspaces, the knot resolution and other parameters.
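A minimal sketch of the "simplest version" only — the additive model with fixed, equally spaced knots (illustrative; full MARS's adaptive knot placement, interaction terms and cross-validation are not shown):

    import numpy as np

    def tent(x, knots):
        """Per-dimension tent basis: column j is 1 - |x - q_j|/w, clipped at 0."""
        w = knots[1] - knots[0]
        return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots[None, :]) / w)

    def additive_design(X, n_knots=6):
        """Stack the tent bases of every input dimension side by side."""
        cols = []
        for k in range(X.shape[1]):
            knots = np.linspace(X[:, k].min(), X[:, k].max(), n_knots)
            cols.append(tent(X[:, k], knots))
        return np.hstack(cols)

    rng = np.random.default_rng(3)
    X = rng.uniform(-2, 2, size=(200, 3))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(200)

    Z = additive_design(X)
    h, *_ = np.linalg.lstsq(Z, y, rcond=None)   # all knot heights via one linear regression
    y_est = Z @ h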
• 50. If you like MARS… see also CMAC (Cerebellar Model Articulated Controller) by James Albus (another of Andrew's heroes). Many of the same gut-level intuitions, but entirely in a neural-network, biologically plausible way. (All the low-dimensional functions are implemented by means of lookup tables, trained with a delta rule and using a clever blurred update and hash tables.)
Where are we now? A recap of the tools so far: Joint DE as an inference engine for p(E1|E2); Decision Tree, Gauss/Joint BC and Gauss Naïve BC as classifiers (inputs → predicted category); Joint DE, Naïve DE, Gauss/Joint DE and Gauss Naïve DE as density estimators (inputs → probability); and Linear Regression, Polynomial Regression, RBFs, Robust Regression, Regression Trees, Multilinear Interpolation and MARS as regressors (inputs → predicted real number).
• 51. Citations.
Radial Basis Functions: T. Poggio and F. Girosi, "Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks", Science, 247, 978-982, 1989.
LOESS: W. S. Cleveland, "Robust Locally Weighted Regression and Smoothing Scatterplots", Journal of the American Statistical Association, 74(368), 829-836, December 1979.
Regression Trees etc.: L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984. J. R. Quinlan, "Combining Instance-Based and Model-Based Learning", Machine Learning: Proceedings of the Tenth International Conference, 1993.
MARS: J. H. Friedman, "Multivariate Adaptive Regression Splines", Department of Statistics, Stanford University, Technical Report No. 102, 1988.