Regression and Classification with Neural Networks

                                                   Andrew W. Moore
                                                   Professor
                                                   School of Computer Science
                                                   Carnegie Mellon University
                                                   www.cs.cmu.edu/~awm
                                                   awm@cs.cmu.edu
                                                   412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted
if you found this source material useful in giving your own lectures. Feel free
to use these slides verbatim, or to modify them to fit your own needs.
PowerPoint originals are available. If you make use of a significant portion of
these slides in your own lecture, please include this message, or the following
link to the source repository of Andrew's tutorials:
http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully
received.

                    Copyright © 2001, 2003, Andrew W. Moore                   Sep 25th, 2001




                                       Linear Regression
                                                                          DATASET
                                                                 inputs         outputs
                                                               x1 = 1          y1 = 1
                                                               x2 = 3          y2 = 2.2
                                                               x3 = 2          y3 = 2
                                                               x4 = 1.5        y4 = 1.9
                                                               x5 = 4          y5 = 3.1

  [Figure: the datapoints plotted with a line of slope w through the origin (rise w per unit run).]

  Linear regression assumes that the expected value of
  the output given an input, E[y|x], is linear.
  Simplest case: Out(x) = wx for some unknown w.
  Given the data, we can estimate w.
  Copyright © 2001, 2003, Andrew W. Moore                                       Neural Networks: Slide 2
1-parameter linear regression
Assume that the data is formed by
                 yi = wxi + noisei

where…
• the noise signals are independent
• the noise has a normal distribution with mean 0
  and unknown variance σ2

P(y|w,x) has a normal distribution with
• mean wx
• variance σ2
Copyright © 2001, 2003, Andrew W. Moore         Neural Networks: Slide 3




          Bayesian Linear Regression
                P(y|w,x) = Normal (mean wx, var σ2)

We have a set of datapoints (x1,y1) (x2,y2) … (xn,yn)
which are EVIDENCE about w.

We want to infer w from the data.
             P(w|x1, x2, x3,…xn, y1, y2…yn)
• You can use BAYES rule to work out a posterior
  distribution for w given the data.
• Or you could do Maximum Likelihood Estimation.

Copyright © 2001, 2003, Andrew W. Moore         Neural Networks: Slide 4
Maximum likelihood estimation of w

 Asks the question:
 “For which value of w is this data most likely to have
   happened?”
       <=>
 For what w is
   P(y1, y2…yn |x1, x2, x3,…xn, w) maximized?
       <=>
 For what w is   ∏_{i=1}^{n} P(yi | w, xi)   maximized?
 Copyright © 2001, 2003, Andrew W. Moore                                        Neural Networks: Slide 5




For what w is   ∏_{i=1}^{n} P(yi | w, xi)   maximized?

For what w is   ∏_{i=1}^{n} exp( −½ ((yi − wxi)/σ)² )   maximized?

For what w is   ∑_{i=1}^{n} −½ ((yi − wxi)/σ)²   maximized?

For what w is   ∑_{i=1}^{n} (yi − wxi)²   minimized?


 Copyright © 2001, 2003, Andrew W. Moore                                        Neural Networks: Slide 6
Linear Regression


The maximum
  likelihood w is                          [Figure: E(w) plotted against w, a parabola.]
  the one that
  minimizes sum-                           Ε = ∑_i (yi − wxi)²
  of-squares of
  residuals                                  = ∑_i yi² − ( 2 ∑_i xi yi ) w + ( ∑_i xi² ) w²

   We want to minimize a quadratic function of w.
Copyright © 2001, 2003, Andrew W. Moore                                   Neural Networks: Slide 7




                          Linear Regression
Easy to show the sum of
  squares is minimized
  when
                 w = ( ∑ xi yi ) / ( ∑ xi² )

The maximum likelihood
  model is                 Out(x) = wx

 We can use it for
  prediction
Copyright © 2001, 2003, Andrew W. Moore                                   Neural Networks: Slide 8
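
A minimal numeric sketch of the formula above (not part of the original slides; the variable names are mine), applied to the Slide 2 dataset:

# A sketch (not from the slides): MLE slope for Out(x) = wx on the Slide 2 data,
# using w = sum(xi*yi) / sum(xi^2).
x = [1.0, 3.0, 2.0, 1.5, 4.0]
y = [1.0, 2.2, 2.0, 1.9, 3.1]

w = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

print("w =", w)                      # about 0.83
print("Out(2.5) =", w * 2.5)         # prediction for a new input
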
Linear Regression
Easy to show the sum of
  squares is minimized
  when
                 w = ( ∑ xi yi ) / ( ∑ xi² )
                                                        [Figure: a probability distribution p(w) over w.]
The maximum likelihood
  model is                 Out(x) = wx                  Note: In Bayesian stats you'd have
                                                        ended up with a prob dist of w.
 We can use it for                                      And predictions would have given a prob
  prediction                                            dist of expected output.

                                                        Often useful to know your confidence.
                                                        Max likelihood can give some kinds of
                                                        confidence too.

Copyright © 2001, 2003, Andrew W. Moore                                             Neural Networks: Slide 9




                 Multivariate Regression
What if the inputs are vectors?

  [Figure: a 2-d input example: datapoints scattered in the (x1, x2) plane, each labelled
   with its output value.]

  Dataset has form
                                                   x1                   y1
                                                   x2                   y2
                                                   x3                   y3
                                                    :                    :
                                                   xR                   yR
Copyright © 2001, 2003, Andrew W. Moore                                            Neural Networks: Slide 10
Multivariate Regression
Write matrix X and Y thus:

         [ .....x1..... ]   [ x11  x12  ...  x1m ]        [ y1 ]
     x = [ .....x2..... ] = [ x21  x22  ...  x2m ]    y = [ y2 ]
         [      ⋮       ]   [         ⋮          ]        [  ⋮ ]
         [ .....xR..... ]   [ xR1  xR2  ...  xRm ]        [ yR ]

 (there are R datapoints. Each input has m components)
 The linear regression model assumes a vector w such that
             Out(x) = wTx = w1x[1] + w2x[2] + …. + wmx[m]
 The max. likelihood w is w = (XTX)-1(XTY)

Copyright © 2001, 2003, Andrew W. Moore                        Neural Networks: Slide 11




                 Multivariate Regression
Write matrix X and Y thus:

         [ .....x1..... ]   [ x11  x12  ...  x1m ]        [ y1 ]
     x = [ .....x2..... ] = [ x21  x22  ...  x2m ]    y = [ y2 ]
         [      ⋮       ]   [         ⋮          ]        [  ⋮ ]
         [ .....xR..... ]   [ xR1  xR2  ...  xRm ]        [ yR ]

 (there are R datapoints. Each input has m components)
 The linear regression model assumes a vector w such that
             Out(x) = wTx = w1x[1] + w2x[2] + …. + wmx[m]
 The max. likelihood w is w = (XTX)-1(XTY)
                                                   IMPORTANT EXERCISE: PROVE IT !!!!!

Copyright © 2001, 2003, Andrew W. Moore                        Neural Networks: Slide 12
Multivariate Regression (con’t)

The max. likelihood w is w = (XTX)-1(XTY)

XTX is an m x m matrix:  i,j'th elt is   ∑_{k=1}^{R} xki xkj

XTY is an m-element vector:  i'th elt is   ∑_{k=1}^{R} xki yk


Copyright © 2001, 2003, Andrew W. Moore                       Neural Networks: Slide 13
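
A sketch of the max. likelihood formula w = (XTX)-1(XTY) in numpy (not from the slides; the data below is invented purely for illustration):

import numpy as np

# Hypothetical data: R = 6 datapoints, m = 2 input components. Any numbers would do.
X = np.array([[2., 4.], [3., 4.], [5., 5.], [1., 2.], [4., 1.], [2., 2.]])
y = np.array([16., 17., 20., 8., 9., 8.])

# Normal equations: solve (X^T X) w = (X^T Y) rather than inverting explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)

print("w =", w)
print("prediction for x = [3, 3]:", np.array([3., 3.]) @ w)
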




       What about a constant term?
We may expect
linear data that does
not go through the
origin.

Statisticians and
Neural Net Folks all
agree on a simple
obvious hack.

Can you guess??
Copyright © 2001, 2003, Andrew W. Moore                       Neural Networks: Slide 14
The constant term
• The trick is to create a fake input "X0" that
  always takes the value 1

   Before:                                   After:
     X1   X2   Y                               X0   X1   X2   Y
     2    4    16                              1    2    4    16
     3    4    17                              1    3    4    17
     5    5    20                              1    5    5    20

   Y = w1X1 + w2X2                           Y = w0X0 + w1X1 + w2X2
   …has to be a poor model                     = w0 + w1X1 + w2X2
                                             …has a fine constant term

   (In this example, you should be able to see the MLE w0, w1 and w2 by inspection.)
Copyright © 2001, 2003, Andrew W. Moore                                                Neural Networks: Slide 15
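
A sketch of the fake-input hack on the Slide 15 data (my own illustration, not part of the slides):

import numpy as np

# The three rows of the Slide 15 table: inputs X1, X2 and output Y.
X = np.array([[2., 4.], [3., 4.], [5., 5.]])
y = np.array([16., 17., 20.])

# The hack: prepend a fake input X0 that always takes the value 1.
X0 = np.hstack([np.ones((3, 1)), X])

# Exactly the same normal equations as before; w[0] is now the constant term w0.
w0, w1, w2 = np.linalg.solve(X0.T @ X0, X0.T @ y)
print(w0, w1, w2)    # 10, 1, 1 : the "by inspection" answer Y = 10 + X1 + X2
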




           Regression with varying noise
• Suppose you know the variance of the noise that
  was added to each datapoint.
     xi         yi           σi²
     ½          ½            4
     1          1            1
     2          1            1/4
     2          3            4
     3          2            1/4

  [Figure: the five datapoints plotted with error bars of width σ = 2, 1, ½, 2, ½.]

          Assume   yi ~ N( wxi, σi² )                What's the MLE
                                                     estimate of w?
Copyright © 2001, 2003, Andrew W. Moore                                                Neural Networks: Slide 16
MLE estimation with varying noise
 argmax log p(y1, y2, …, yR | x1, x2, …, xR, σ1², σ2², …, σR², w)
    w
                                                                  Assuming i.i.d. and
      =  argmin  ∑_{i=1}^{R} (yi − wxi)² / σi²                    then plugging in the
            w                                                     equation for the Gaussian
                                                                  and simplifying.

                                                                  Setting dLL/dw
      =  ( w such that  ∑_{i=1}^{R} xi (yi − wxi) / σi² = 0 )     equal to zero


      =  ( ∑_{i=1}^{R} xi yi / σi² ) / ( ∑_{i=1}^{R} xi² / σi² )  Trivial algebra
Copyright © 2001, 2003, Andrew W. Moore                                                 Neural Networks: Slide 17




             This is Weighted Regression
• We are asking to minimize the weighted sum of
  squares
     argmin  ∑_{i=1}^{R} (yi − wxi)² / σi²
        w

  [Figure: the same varying-noise datapoints, each drawn with its error bar σi.]

  where the weight for the i'th datapoint is  1/σi²
Copyright © 2001, 2003, Andrew W. Moore                                                 Neural Networks: Slide 18
Weighted Multivariate Regression

The max. likelihood w is w = (WXTWX)-1(WXTWY)

(WXTWX) is an m x m matrix:  i,j'th elt is   ∑_{k=1}^{R} xki xkj / σk²

(WXTWY) is an m-element vector:  i'th elt is   ∑_{k=1}^{R} xki yk / σk²


Copyright © 2001, 2003, Andrew W. Moore                               Neural Networks: Slide 19
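
A sketch of weighted least squares on the Slide 16 data (mine, not from the slides), showing both the closed-form 1-parameter estimate and the matrix form with W carrying the 1/σi weights:

import numpy as np

# The Slide 16 data: inputs, outputs, and each point's noise variance sigma_i^2.
x   = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
y   = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
var = np.array([4.0, 1.0, 0.25, 4.0, 0.25])

# Closed form for the 1-parameter model y_i ~ N(w*x_i, sigma_i^2).
w = np.sum(x * y / var) / np.sum(x * x / var)
print("weighted MLE w =", w)

# Matrix version: scale each row of X and Y by 1/sigma_i, then solve as usual.
W  = np.diag(1.0 / np.sqrt(var))               # W applies the 1/sigma_i weights
WX = W @ x.reshape(-1, 1)
Wy = W @ y
print(np.linalg.solve(WX.T @ WX, WX.T @ Wy))   # same value as above
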




                      Non-linear Regression
• Suppose you know that y is related to a function of x in
  such a way that the predicted values have a non-linear
  dependence on w, e.g.:

    xi         yi
    ½          ½
    1          2.5
    2          3
    3          2
    3          3

  [Figure: the five datapoints plotted against x.]

          Assume   yi ~ N( √(w + xi), σ² )           What's the MLE
                                                     estimate of w?
Copyright © 2001, 2003, Andrew W. Moore                               Neural Networks: Slide 20
Non-linear MLE estimation
 argmax log p(y1, y2, …, yR | x1, x2, …, xR, σ, w)
    w
                                                                          Assuming i.i.d. and
      =  argmin  ∑_{i=1}^{R} ( yi − √(w + xi) )²                          then plugging in the
            w                                                             equation for the Gaussian
                                                                          and simplifying.

                                                                          Setting dLL/dw
      =  ( w such that  ∑_{i=1}^{R} ( yi − √(w + xi) ) / √(w + xi) = 0 )  equal to zero




Copyright © 2001, 2003, Andrew W. Moore                                                  Neural Networks: Slide 21




            Non-linear MLE estimation
 argmax log p(y1, y2, …, yR | x1, x2, …, xR, σ, w)
    w
                                                                          Assuming i.i.d. and
      =  argmin  ∑_{i=1}^{R} ( yi − √(w + xi) )²                          then plugging in the
            w                                                             equation for the Gaussian
                                                                          and simplifying.

                                                                          Setting dLL/dw
      =  ( w such that  ∑_{i=1}^{R} ( yi − √(w + xi) ) / √(w + xi) = 0 )  equal to zero


                                                             We're down the
                                                             algebraic toilet.
                                                                      So guess what we do?
Copyright © 2001, 2003, Andrew W. Moore                                                  Neural Networks: Slide 22
Non-linear MLE estimation
 argmax log p(y1, y2, …, yR | x1, x2, …, xR, σ, w)
    w
                                                                          Assuming i.i.d. and
      =  argmin  ∑_{i=1}^{R} ( yi − √(w + xi) )²                          then plugging in the
            w                                                             equation for the Gaussian
                                                                          and simplifying.

                                                                          Setting dLL/dw
      =  ( w such that  ∑_{i=1}^{R} ( yi − √(w + xi) ) / √(w + xi) = 0 )  equal to zero

Common (but not only) approach:                              We're down the
Numerical Solutions:                                         algebraic toilet.
• Line Search                                                         So guess what we do?
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg-Marquardt
• Newton's Method

Also, special purpose statistical-
    optimization-specific tricks such as
    E.M. (See Gaussian Mixtures lecture
    for introduction)
Copyright © 2001, 2003, Andrew W. Moore                                        Neural Networks: Slide 23




         GRADIENT DESCENT
Suppose we have a scalar function   f(w): ℜ → ℜ

We want to find a local minimum.
Assume our current weight is w

GRADIENT DESCENT RULE:           w ← w − η ∂f(w)/∂w

η is called the LEARNING RATE. A small positive
  number, e.g. η = 0.05

Copyright © 2001, 2003, Andrew W. Moore                                        Neural Networks: Slide 24
GRADIENT DESCENT
Suppose we have a scalar function   f(w): ℜ → ℜ

We want to find a local minimum.
Assume our current weight is w

GRADIENT DESCENT RULE:           w ← w − η ∂f(w)/∂w

η is called the LEARNING RATE. A small positive      Recall Andrew's favorite
  number, e.g. η = 0.05                              default value for anything
               QUESTION: Justify the Gradient Descent Rule
Copyright © 2001, 2003, Andrew W. Moore                               Neural Networks: Slide 25




 Gradient Descent in “m” Dimensions
Given   f(w): ℜm → ℜ

        ∇f(w) = [ ∂f(w)/∂w1, …, ∂f(w)/∂wm ]T     points in the direction of steepest ascent.
                                                 |∇f(w)| is the gradient in that direction.

   GRADIENT DESCENT RULE:      w ← w − η ∇f(w)

   Equivalently     wj ← wj − η ∂f(w)/∂wj      ….where wj is the jth weight
                                               “just like a linear feedback system”
Copyright © 2001, 2003, Andrew W. Moore                               Neural Networks: Slide 26
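
A generic sketch of the rule w ← w − η∇f(w) (mine, not from the slides), applied to a simple quadratic f whose gradient is written out by hand:

import numpy as np

def gradient_descent(grad, w0, eta=0.05, steps=200):
    """Repeatedly apply the rule  w <- w - eta * grad(w)."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Example f(w) = (w1 - 3)^2 + 2*(w2 + 1)^2, whose gradient is easy to write down.
grad_f = lambda w: np.array([2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)])

print(gradient_descent(grad_f, w0=[0.0, 0.0]))    # approaches the minimum at (3, -1)
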
What’s all this got to do with Neural
           Nets, then, eh??
For supervised learning, neural nets are also models with
vectors of w parameters in them. They are now called
weights.
As before, we want to compute the weights to minimize sum-
of-squared residuals.
             Which turns out, under “Gaussian i.i.d noise”
             assumption to be max. likelihood.
Instead of explicitly solving for max. likelihood weights, we
use GRADIENT DESCENT to SEARCH for them.
                          “Why?” you ask, a querulous expression in your eyes.
                          “Aha!!” I reply: “We’ll see later.”
Copyright © 2001, 2003, Andrew W. Moore                                    Neural Networks: Slide 27




                        Linear Perceptrons
They are multivariate linear models:

                                          Out(x) = wTx

And “training” consists of minimizing sum-of-squared residuals
  by gradient descent.

                       E  =  ∑_k ( Out(xk) − yk )²

                          =  ∑_k ( wΤxk − yk )²


QUESTION: Derive the perceptron training rule.
Copyright © 2001, 2003, Andrew W. Moore                                    Neural Networks: Slide 28
Linear Perceptron Training Rule
  E = ∑_{k=1}^{R} (yk − wTxk)²

   Gradient descent tells us
   we should update w
   thusly if we wish to
   minimize E:

       wj ← wj − η ∂E/∂wj

  So what's  ∂E/∂wj ?
Copyright © 2001, 2003, Andrew W. Moore                                      Neural Networks: Slide 29




   Linear Perceptron Training Rule
  E = ∑_{k=1}^{R} (yk − wTxk)²               ∂E/∂wj = ∑_{k=1}^{R} ∂/∂wj (yk − wTxk)²

   Gradient descent tells us                        = ∑_{k=1}^{R} 2(yk − wTxk) ∂/∂wj (yk − wTxk)
   we should update w
   thusly if we wish to                             = −2 ∑_{k=1}^{R} δk ∂/∂wj wTxk
   minimize E:                                                        …where…  δk = yk − wTxk
       wj ← wj − η ∂E/∂wj
                                                    = −2 ∑_{k=1}^{R} δk ∂/∂wj ∑_{i=1}^{m} wi xki
  So what's  ∂E/∂wj ?
                                                    = −2 ∑_{k=1}^{R} δk xkj

Copyright © 2001, 2003, Andrew W. Moore                                      Neural Networks: Slide 30
Linear Perceptron Training Rule
  E = ∑_{k=1}^{R} (yk − wTxk)²

   Gradient descent tells us we should update w
   thusly if we wish to minimize E:

       wj ← wj − η ∂E/∂wj
                                                   wj ← wj + 2η ∑_{k=1}^{R} δk xkj
…where…
   ∂E/∂wj = −2 ∑_{k=1}^{R} δk xkj
                                                   We frequently neglect the 2 (meaning
                                                   we halve the learning rate)
Copyright © 2001, 2003, Andrew W. Moore                                Neural Networks: Slide 31




The “Batch” perceptron algorithm
1) Randomly initialize weights w1 w2 … wm

2) Get your dataset (append 1’s to the inputs if
   you don’t want to go through the origin).

3) for i = 1 to R
                            δi := yi − wΤxi

4) for j = 1 to m
                            wj ← wj + η ∑_{i=1}^{R} δi xij

5) if ∑ δi² stops improving then stop. Else loop
   back to 3.
Copyright © 2001, 2003, Andrew W. Moore                                Neural Networks: Slide 32
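
A sketch of the batch algorithm above with invented data (not from the slides); the comments map each line back to steps 1-5:

import numpy as np

def batch_perceptron(X, y, eta=0.01, max_epochs=10000, tol=1e-12):
    """Steps 1-5 of the batch algorithm for a linear perceptron Out(x) = w.x"""
    X = np.hstack([X, np.ones((X.shape[0], 1))])    # step 2: append 1's to the inputs
    w = np.random.randn(X.shape[1]) * 0.01          # step 1: random initial weights
    prev_sse = np.inf
    for _ in range(max_epochs):
        delta = y - X @ w                           # step 3: delta_i = y_i - w.x_i
        w = w + eta * (X.T @ delta)                 # step 4: w_j += eta * sum_i delta_i * x_ij
        sse = np.sum(delta ** 2)                    # step 5: stop when sum(delta_i^2)
        if prev_sse - sse < tol:                    #         stops improving
            break
        prev_sse = sse
    return w

# Made-up noisy linear data, just to exercise the routine.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + 0.01 * rng.normal(size=100)
print(batch_perceptron(X, y))      # should land near [2, -1, 0.5]
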
δi ← yi − wΤxi                                   A RULE KNOWN BY
wj ← wj + η δi xij                                 MANY NAMES

              The LMS Rule
              The Widrow Hoff rule
              The delta rule
              The adaline rule
              Classical conditioning


Copyright © 2001, 2003, Andrew W. Moore                              Neural Networks: Slide 33




If data is voluminous and arrives fast

Input-output pairs (x,y) come streaming in very
quickly. THEN
           Don’t bother remembering old ones.
           Just keep using new ones.

                                    observe (x,y)

                                     δ ← y − wΤx
                                      ∀j w j ← w j + η δ x j

Copyright © 2001, 2003, Andrew W. Moore                              Neural Networks: Slide 34
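
A sketch of the streaming rule above (the data stream and the names below are invented, not from the slides):

import numpy as np

def online_update(w, x, y, eta=0.05):
    """One streaming step: delta = y - w.x, then w_j <- w_j + eta * delta * x_j."""
    delta = y - w @ x
    return w + eta * delta * x

# Simulate a fast stream of (x, y) pairs from a made-up true weight vector.
rng = np.random.default_rng(1)
w_true = np.array([1.5, -2.0, 0.3])
w = np.zeros(3)
for _ in range(5000):
    x = rng.normal(size=3)
    y = w_true @ x + 0.05 * rng.normal()
    w = online_update(w, x, y)          # use the pair once, then forget it

print(w)    # close to w_true (the noise keeps it jittering slightly)
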
Gradient Descent vs Matrix Inversion
       for Linear Perceptrons
GD Advantages (MI disadvantages):
• Biologically plausible
• With very very many attributes each iteration costs only O(mR). If
  fewer than m iterations needed we’ve beaten Matrix Inversion
• More easily parallelizable (or implementable in wetware)?
GD Disadvantages (MI advantages):
• It’s moronic
• It’s essentially a slow implementation of a way to build the XTX matrix
  and then solve a set of linear equations
• If m is small it’s especially outrageous. If m is large then the direct
  matrix inversion method gets fiddly but not impossible if you want to
  be efficient.
• Hard to choose a good learning rate
• Matrix inversion takes predictable time. You can’t be sure when
  gradient descent will stop.
Copyright © 2001, 2003, Andrew W. Moore                   Neural Networks: Slide 35




Gradient Descent vs Matrix Inversion
       for Linear Perceptrons
GD Advantages (MI disadvantages):
• Biologically plausible
• With very very many attributes each iteration costs only O(mR). If
  fewer than m iterations needed we’ve beaten Matrix Inversion
• More easily parallelizable (or implementable in wetware)?
GD Disadvantages (MI advantages):
• It’s moronic
• It’s essentially a slow implementation of a way to build the XTX matrix
  and then solve a set of linear equations
• If m is small it’s especially outrageous. If m is large then the direct
  matrix inversion method gets fiddly but not impossible if you want to
  be efficient.
• Hard to choose a good learning rate
• Matrix inversion takes predictable time. You can’t be sure when
  gradient descent will stop.

                        But we’ll soon see that GD
                        has an important extra
                        trick up its sleeve
Copyright © 2001, 2003, Andrew W. Moore                     Neural Networks: Slide 37




        Perceptrons for Classification
What if all outputs are 0’s or 1’s ?


                                           or



      We can do a linear fit.
      Our prediction is 0 if out(x)≤1/2
                                          1 if out(x)>1/2
      WHAT’S THE BIG PROBLEM WITH THIS???

Copyright © 2001, 2003, Andrew W. Moore                     Neural Networks: Slide 38
Perceptrons for Classification
What if all outputs are 0’s or 1’s ?


                                           or


                                                          Blue = Out(x)
      We can do a linear fit.
      Our prediction is 0 if out(x)≤½
                                          1 if out(x)>½
      WHAT’S THE BIG PROBLEM WITH THIS???

Copyright © 2001, 2003, Andrew W. Moore                         Neural Networks: Slide 39




        Perceptrons for Classification
What if all outputs are 0’s or 1’s ?


                                           or


                                                          Blue = Out(x)
      We can do a linear fit.
      Our prediction is 0 if out(x)≤½              Green = Classification
                        1 if out(x)>½



Copyright © 2001, 2003, Andrew W. Moore                         Neural Networks: Slide 40
Classification with Perceptrons I
Don't minimize   ∑_i (yi − wΤxi)².

Minimize number of misclassifications instead. [Assume outputs are
   +1 & -1, not +1 & 0]
                                 ∑_i ( yi − Round(wΤxi) )

where Round(x) =   -1 if x<0                                 NOTE: CUTE &
                    1 if x≥0                              NON OBVIOUS WHY
                                                            THIS WORKS!!

The gradient descent rule can be changed to:
            if (xi,yi) correctly classed, don't change
            if wrongly predicted as 1                        w ← w − xi
            if wrongly predicted as -1                       w ← w + xi

Copyright © 2001, 2003, Andrew W. Moore                                     Neural Networks: Slide 41




     Classification with Perceptrons II:
              Sigmoid Functions



    Least squares fit useless
                                                                   This fit would classify much
                                                                   better. But not a least
                                                                   squares fit.




Copyright © 2001, 2003, Andrew W. Moore                                     Neural Networks: Slide 42
Classification with Perceptrons II:
              Sigmoid Functions



    Least squares fit useless
                                                         This fit would classify much
SOLUTION:                                                better. But not a least
                                                         squares fit.
Instead of             Out(x) = wTx
We’ll use              Out(x) = g(wTx)
where g( x) : ℜ → (0,1) is a
  squashing function
Copyright © 2001, 2003, Andrew W. Moore                          Neural Networks: Slide 43




                                   The Sigmoid
  g(h) = 1 / (1 + exp(−h))


  Note that if you rotate this
     curve through 180°
 centered on (0,1/2) you get
       the same curve.

           i.e. g(h) = 1 − g(−h)


                                          Can you prove this?
Copyright © 2001, 2003, Andrew W. Moore                          Neural Networks: Slide 44
The Sigmoid
  g(h) = 1 / (1 + exp(−h))



Now we choose w to minimize

     ∑_{i=1}^{R} [ yi − Out(xi) ]²  =  ∑_{i=1}^{R} [ yi − g(wΤxi) ]²


Copyright © 2001, 2003, Andrew W. Moore                            Neural Networks: Slide 45
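
A small sketch (mine, not the slides'): the sigmoid g, a numerical check of g(h) = 1 − g(−h) from Slide 44, and the squashed sum-of-squares error of Slide 45 on invented 0/1 data:

import numpy as np

def g(h):
    """The sigmoid squashing function: maps the reals into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-h))

# Numerical check of the rotation property g(h) = 1 - g(-h) from Slide 44.
h = np.linspace(-5.0, 5.0, 11)
assert np.allclose(g(h), 1.0 - g(-h))

def sse(w, X, y):
    """sum_i (y_i - g(w.x_i))^2 : the quantity Slide 45 says we minimize."""
    return np.sum((y - g(X @ w)) ** 2)

# Tiny invented 0/1-labelled dataset; the second column is the constant input.
X = np.array([[0.5, 1.0], [1.5, 1.0], [2.5, 1.0], [3.5, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(sse(np.array([2.0, -4.0]), X, y))   # a reasonable w gives a small error
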




         Linear Perceptron Classification
                    Regions
  [Figure: datapoints in the (X1, X2) plane, three labelled 0 and three labelled 1.]

                                                  Out(x) = g( wT(x,1) )
         We'll use the model
                                                         = g( w1x1 + w2x2 + w0 )
         Which region of above diagram classified with +1, and
          which with 0 ??
Copyright © 2001, 2003, Andrew W. Moore                            Neural Networks: Slide 46
Gradient descent with sigmoid on a perceptron

First, notice  g'(x) = g(x)(1 − g(x))

Because  g(x) = 1/(1 + e−x),  so

    g'(x) = e−x / (1 + e−x)²
          = [ 1/(1 + e−x) ] · [ e−x/(1 + e−x) ]
          = [ 1/(1 + e−x) ] · [ 1 − 1/(1 + e−x) ]
          = g(x)(1 − g(x))

    Out(x) = g( ∑_k wk xk )

    Ε = ∑_i ( yi − g( ∑_k wk xik ) )²

    ∂Ε/∂wj = ∑_i 2( yi − g( ∑_k wk xik ) ) · ( −∂/∂wj g( ∑_k wk xik ) )
           = ∑_i −2( yi − g( ∑_k wk xik ) ) g'( ∑_k wk xik ) ∂/∂wj ∑_k wk xik
           = ∑_i −2 δi g(neti)(1 − g(neti)) xij

    where  δi = yi − Out(xi)   and   neti = ∑_k wk xik

The sigmoid perceptron update rule:

    wj ← wj + η ∑_{i=1}^{R} δi gi (1 − gi) xij        where  gi = g( ∑_{j=1}^{m} wj xij )   and   δi = yi − gi
  Copyright © 2001, 2003, Andrew W. Moore                                                               Neural Networks: Slide 47
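
A sketch of the sigmoid perceptron update rule above on invented 0/1 data (not from the slides):

import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_sigmoid_perceptron(X, y, eta=0.1, epochs=5000):
    """Gradient descent on sum_i (y_i - g(w.x_i))^2 using the update rule above."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        gi = g(X @ w)                                  # g_i = g(sum_j w_j x_ij)
        delta = y - gi                                 # delta_i = y_i - g_i
        # w_j <- w_j + eta * sum_i delta_i * g_i * (1 - g_i) * x_ij
        w = w + eta * (X.T @ (delta * gi * (1.0 - gi)))
    return w

# Invented data: one real input plus a constant-1 input, with 0/1 labels.
X = np.array([[0.5, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [3.5, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
w = train_sigmoid_perceptron(X, y)
print(w)
print(g(X @ w))     # the outputs move toward the 0/1 labels
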




      Other Things about Perceptrons

  • Invented and popularized by Rosenblatt (1962)

  • Even with sigmoid nonlinearity, correct
    convergence is guaranteed

  • Stable behavior for overconstrained and
    underconstrained problems




  Copyright © 2001, 2003, Andrew W. Moore                                                               Neural Networks: Slide 48
Perceptrons and Boolean Functions
If inputs are all 0’s and 1’s and outputs are all 0’s and 1’s…

• Can learn the function x1 ∧ x2

• Can learn the function x1 ∨ x2.
• Can learn any conjunction of literals, e.g.
     x1 ∧ ~x2 ∧ ~x3 ∧ x4 ∧ x5

           QUESTION: WHY?


 Copyright © 2001, 2003, Andrew W. Moore             Neural Networks: Slide 49




    Perceptrons and Boolean Functions
 • Can learn any disjunction of literals
      e.g. x1 ∨ ~x2 ∨ ~x3 ∨ x4 ∨ x5

 • Can learn majority function
      f(x1,x2 … xn) = 1 if n/2 xi’s or more are = 1
                        0 if less than n/2 xi’s are = 1

 • What about the exclusive or function?
       f(x1,x2) = x1 ⊕ x2 =
      (x1 ∧ ~x2) ∨ (~ x1 ∧ x2)




 Copyright © 2001, 2003, Andrew W. Moore             Neural Networks: Slide 50
Multilayer Networks
The class of functions representable by perceptrons
  is limited
                            Out(x) = g( wΤx ) = g( ∑_j wj xj )


                                                           Use a wider
                                                           representation !


    Out(x) = g( ∑_j Wj g( ∑_k wjk xk ) )          This is a nonlinear function
                                                  of a linear combination
                                                  of non linear functions
                                                  of linear combinations of inputs
Copyright © 2001, 2003, Andrew W. Moore                                      Neural Networks: Slide 51




               A 1-HIDDEN LAYER NET
  NINPUTS = 2                                                    NHIDDEN = 3

  [Figure: x1 and x2 feed three hidden units through weights w11, w21, w31 and
   w12, w22, w32; the hidden units feed the output unit through weights W1, W2, W3.]

          v1 = g( ∑_{k=1}^{NINS} w1k xk )

          v2 = g( ∑_{k=1}^{NINS} w2k xk )              Out = g( ∑_{k=1}^{NHID} Wk vk )

          v3 = g( ∑_{k=1}^{NINS} w3k xk )
Copyright © 2001, 2003, Andrew W. Moore                                      Neural Networks: Slide 52
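
A sketch of the forward pass for the 1-hidden-layer net above (NINPUTS = 2, NHIDDEN = 3); the code and the weight values are mine, invented just to make it run:

import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def forward(x, w, W):
    """x: the 2 inputs; w: 3x2 input-to-hidden weights w_jk; W: 3 hidden-to-output weights."""
    v = g(w @ x)          # v_j = g(sum_k w_jk x_k), the three hidden units
    return g(W @ v)       # Out = g(sum_k W_k v_k)

# Invented weight values, purely for illustration.
w = np.array([[ 1.0, -0.5],
              [ 0.3,  0.8],
              [-1.2,  0.4]])
W = np.array([0.7, -1.1, 0.5])

print(forward(np.array([2.0, 1.0]), w, W))
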
OTHER NEURAL NETS

  [Figure: a net with inputs 1, x1, x2, x3]          2-Hidden layers + Constant Term

                                          “JUMP” CONNECTIONS

  [Figure: a net whose inputs x1, x2 feed both the hidden layer and the output directly]

                                          Out = g( ∑_{k=1}^{NINS} w0k xk  +  ∑_{k=1}^{NHID} Wk vk )
Copyright © 2001, 2003, Andrew W. Moore                               Neural Networks: Slide 53




                            Backpropagation

            Out(x) = g( ∑_j Wj g( ∑_k wjk xk ) )

            Find a set of weights {Wj}, {wjk}
            to minimize

                 ∑_i ( yi − Out(xi) )²

            by gradient descent.
                                              That's it!
                                              That's the backpropagation
                                              algorithm.

Copyright © 2001, 2003, Andrew W. Moore                               Neural Networks: Slide 54
Backpropagation Convergence
Convergence to a global minimum is not
guaranteed.
                •In practice, this is not a problem, apparently.
Tweaking to find the right number of hidden
units, or a useful learning rate η, is more
hassle, apparently.

IMPLEMENTING BACKPROP:  Differentiate the monster sum-square residual.
Write down the Gradient Descent Rule.  It turns out to be easier & computationally
efficient to use lots of local variables with names like hj, ok, vj, neti etc…



Copyright © 2001, 2003, Andrew W. Moore                               Neural Networks: Slide 55
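
A sketch of "backprop = gradient descent on the sum of squared residuals" for a 1-hidden-layer net (my own code, not Andrew's), written in the local-variable style mentioned above and with the factor of 2 absorbed into η; the XOR task is an invented example:

import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def backprop_epoch(X, y, w, W, eta=0.5):
    """One pass of gradient descent on sum_i (y_i - Out(x_i))^2 for a 1-hidden-layer net.
    The factor of 2 is absorbed into eta, as on Slide 31."""
    for x, target in zip(X, y):
        v = g(w @ x)                                    # hidden activations v_j = g(net_j)
        out = g(W @ v)                                  # network output
        delta_out = (target - out) * out * (1.0 - out)  # output-layer local error
        delta_hid = delta_out * W * v * (1.0 - v)       # error pushed back through W and g'
        W += eta * delta_out * v                        # hidden-to-output update
        w += eta * np.outer(delta_hid, x)               # input-to-hidden update
    return w, W

# Invented task: XOR of two 0/1 inputs (a constant input supplies the hidden biases).
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3))        # 4 hidden units, 3 inputs (including the constant)
W = rng.normal(size=4)
for _ in range(20000):
    w, W = backprop_epoch(X, y, w, W)
print([round(float(g(W @ g(w @ x))), 2) for x in X])
# usually close to [0, 1, 1, 0]; squared-error backprop can get stuck in a
# local minimum, in which case a different random seed helps
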




            Choosing the learning rate
• This is a subtle art.
• Too small: can take days instead of minutes
  to converge
• Too large: diverges (MSE gets larger and
  larger while the weights increase and
  usually oscillate)
• Sometimes the “just right” value is hard to
  find.



Copyright © 2001, 2003, Andrew W. Moore                               Neural Networks: Slide 56
Learning-rate problems




[Figure illustrating learning-rate problems. From J. Hertz, A. Krogh, and R. G. Palmer,
 Introduction to the Theory of Neural Computation, Addison-Wesley, 1994.]




Copyright © 2001, 2003, Andrew W. Moore                        Neural Networks: Slide 57




  Improving Simple Gradient Descent
Momentum
Don’t just change weights according to the current datapoint.
Re-use changes from earlier iterations.
       Let ∆w(t) = weight changes at time t.
       Let −η ∂Ε/∂w be the change we would make with
           regular gradient descent.
  Instead we use

             ∆w(t+1) = −η ∂Ε/∂w + α ∆w(t)
             w(t+1)  = w(t) + ∆w(t)

    Momentum damps oscillations.          (α is the momentum parameter)
    A hack? Well, maybe.

Copyright © 2001, 2003, Andrew W. Moore                        Neural Networks: Slide 58
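
A sketch of the momentum update above (mine, not the slides'; the η and α values are arbitrary):

import numpy as np

def gd_with_momentum(grad, w0, eta=0.05, alpha=0.9, steps=300):
    """dw(t+1) = -eta * dE/dw + alpha * dw(t);  then w <- w + dw(t+1)."""
    w = np.array(w0, dtype=float)
    dw = np.zeros_like(w)
    for _ in range(steps):
        dw = -eta * grad(w) + alpha * dw    # re-use the change from the last iteration
        w = w + dw
    return w

# An ill-conditioned quadratic bowl: with this eta, plain gradient descent would
# oscillate forever along the steep w[1] direction; momentum settles it down.
grad_f = lambda w: np.array([2.0 * (w[0] - 3.0), 40.0 * (w[1] + 1.0)])
print(gd_with_momentum(grad_f, [0.0, 0.0]))    # heads to the minimum at (3, -1)
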
Momentum illustration




Copyright © 2001, 2003, Andrew W. Moore                         Neural Networks: Slide 59




  Improving Simple Gradient Descent
Newton’s method
            E(w + h) = E(w) + hT ∂E/∂w + ½ hT (∂²E/∂w²) h + O(|h|³)


  If we neglect the O(h3) terms, this is a quadratic form

     Quadratic form fun facts:
     If y = c + bT x - 1/2 xT A x
     And if A is SPD
     Then
     xopt = A-1b is the value of x that maximizes y
Copyright © 2001, 2003, Andrew W. Moore                         Neural Networks: Slide 60
Improving Simple Gradient Descent
Newton’s method
            E(w + h) = E(w) + hT ∂E/∂w + ½ hT (∂²E/∂w²) h + O(|h|³)


  If we neglect the O(h3) terms, this is a quadratic form

                     w ← w − (∂²E/∂w²)−1 ∂E/∂w

   This should send us directly to the global minimum if the
   function is truly quadratic.
   And it might get us close if it’s locally quadraticish

Copyright © 2001, 2003, Andrew W. Moore                         Neural Networks: Slide 61
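
A sketch (mine) of the Newton update w ← w − (∂²E/∂w²)⁻¹ ∂E/∂w on a purely quadratic E, where a single step lands exactly on the global minimum, as the slide claims:

import numpy as np

# A purely quadratic E(w) = 1/2 w^T A w - b^T w, so dE/dw = A w - b and d2E/dw2 = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])        # symmetric positive definite
b = np.array([1.0, -1.0])

w = np.array([10.0, -7.0])                    # arbitrary starting point
grad = A @ w - b
hess = A
w_newton = w - np.linalg.solve(hess, grad)    # one application of the update above

print(w_newton, np.linalg.solve(A, b))        # identical: the global minimum A^-1 b
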




  Improving Simple Gradient Descent
Newton’s method
            E(w + h) = E(w) + hT ∂E/∂w + ½ hT (∂²E/∂w²) h + O(|h|³)


  If we neglect the O(h3) terms, this is a quadratic form

                     w ← w − (∂²E/∂w²)−1 ∂E/∂w

   This should send us directly to the global minimum if the
   function is truly quadratic.
   And it might get us close if it’s locally quadraticish

   BUT (and it’s a big but)…
   That second derivative matrix can be
   expensive and fiddly to compute.
   If we’re not already in the quadratic bowl,
   we’ll go nuts.

Copyright © 2001, 2003, Andrew W. Moore                         Neural Networks: Slide 62
Improving Simple Gradient Descent
Conjugate Gradient
  Another method which attempts to exploit the “local
  quadratic bowl” assumption
  But does so while only needing to use ∂E/∂w, and not ∂²E/∂w².

   It is also more stable than Newton’s method if the local
   quadratic bowl assumption is violated.
   It’s complicated, outside our scope, but it often works well.
   More details in Numerical Recipes in C.
Copyright © 2001, 2003, Andrew W. Moore                Neural Networks: Slide 63
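[Editor's note, not from the original slides: one way to try a conjugate-gradient minimizer without implementing it yourself is SciPy's nonlinear CG routine, which, as the slide says, needs only the gradient. The toy error function and its gradient below are made up for illustration.]

    import numpy as np
    from scipy.optimize import minimize

    target = np.array([1.0, -2.0])

    def E(w):
        # A made-up, mildly non-quadratic error surface
        return np.sum((w - target) ** 2) + 0.1 * np.sum(w ** 4)

    def dE_dw(w):
        return 2.0 * (w - target) + 0.4 * w ** 3

    result = minimize(E, x0=np.zeros(2), jac=dE_dw, method='CG')
    print(result.x, result.fun)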




               BEST GENERALIZATION
Intuitively, you want to use the smallest,
    simplest net that seems to fit the data.

            HOW TO FORMALIZE THIS INTUITION?

      1.     Don’t. Just use intuition
      2.     Bayesian Methods Get it Right
      3.     Statistical Analysis explains what’s going on
      4.     Cross-validation
                                           (Discussed in the next lecture)

Copyright © 2001, 2003, Andrew W. Moore                Neural Networks: Slide 64
What You Should Know
• How to implement multivariate Least-
  squares linear regression.
• Derivation of least squares as max.
  likelihood estimator of linear coefficients
• The general gradient descent rule




Copyright © 2001, 2003, Andrew W. Moore            Neural Networks: Slide 65




                 What You Should Know
• Perceptrons
           Linear output, least squares
           Sigmoid output, least squares


• Multilayer nets
           The idea behind back prop
            Awareness of better minimization methods


• Generalization. What it means.


Copyright © 2001, 2003, Andrew W. Moore            Neural Networks: Slide 66
APPLICATIONS
To Discuss:

      • What can non-linear regression be useful for?

      • What can neural nets (used as non-linear
        regressors) be useful for?

      • What are the advantages of N. Nets for
        nonlinear regression?

      • What are the disadvantages?


Copyright © 2001, 2003, Andrew W. Moore      Neural Networks: Slide 67




         Other Uses of Neural Nets…
• Time series with recurrent nets
• Unsupervised learning (clustering, principal
  components, and non-linear versions
  thereof)
• Combinatorial optimization with Hopfield
  nets, Boltzmann Machines
• Evaluation function learning (in
  reinforcement learning)



Copyright © 2001, 2003, Andrew W. Moore      Neural Networks: Slide 68

