Intro to Classification: Logistic Regression & SVM

Kriti Puniyani
Carnegie Mellon University
      kriti@cmu.edu
About me
  Graduate student at Carnegie Mellon University
  Statistical machine learning
    Topic models
    Sparse network learning
    Optimization
  Domains of interest
    Social media analysis
    Systems biology
    Genetics
    Sentiment analysis
    Text processing




Machine learning
  Getting computers to “learn with experience”
  Learn : to be able to predict “unseen” things.


  Many applications
    Search
    Machine translation
    Speech recognition
    Vision : identify cars, people, sky, apples
    Robot control


  Introductions :
    http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
    http://videolectures.net/mlss2010_lawrence_mlfcs/
Classification

  Is this the digit “9” ?

  Will this patient survive ?




  Will this user click on my ad ?



Predict the next coin toss

  Task : predict the next toss, given the data

  Data : THTTTTHHTHTHTTT

  Model 1 : coin is tossed with probability p (of being tails)
  Model 2 : toss depends on wind condition W, starting pose S, torque T

  p (Model 1) and W, S, T (Model 2) are the model parameters
Predict the next coin toss

  Data : THTTTTHHTHTHTTT

  Learning fits the parameters to the data :

  Model 1 : p = 2/3
  Model 2 : W = 12.2, S = 1, T = 0.23
Predict the next coin toss

  Inference uses the learned parameters to predict : “I predict the
  next toss to be T”

  Model 1 : p = 2/3
  Model 2 : W = 12.2, S = 1, T = 0.23
Inference

  Parameter : p = 2/3

  Sampling each prediction from the model (toss a coin with p) :
  Predicted next 9 tosses : ….H H H T T T T T T
  Observed next 9 tosses  : ….T T T T T T H H H
  Accuracy = 2/9

  Always predicting the more likely outcome :
  Predicted next 9 tosses : ….T T T T T T T T T
  Observed next 9 tosses  : ….T T T T T T H H H
  Accuracy = 6/9

  Inference rule :
    if p > 0.5, always predict T ;
    if p < 0.5, always predict H.
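
A minimal R sketch of Model 1 (an illustration in the spirit of the slides,
not code from the deck) : learning is just counting, and inference applies
the rule above.

  # Learning : MLE of p from the observed tosses (10 tails out of 15)
  tosses <- strsplit("THTTTTHHTHTHTTT", "")[[1]]
  p <- mean(tosses == "T")           # 10/15 = 2/3
  # Inference : always predict the more likely outcome
  ifelse(p > 0.5, "T", "H")          # "T"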
The anatomy of classification
  1.  What is the data (features X, label y) ? ★★★
  2.  What is the model ? Model parameterization (w)
  3.  Inference : Given X, w, predict the label.
  4.  Learning  : Given (X,y) pairs, learn the “best” w
        Define “best” – maximize an objective function

  Train time : (X,Y) pairs  →  Learning   →  w
  Test time  : (X, ?)       →  Inference  →  predicted Y
Logistic Regression




Predict speaker success
  X = Number of hours spent in preparation
  Y = Was the speaker “good”?




Prediction : Y = I( X > h ), where I(a) = 1 if a == TRUE and 0 if a == FALSE
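
As a one-line R sketch (the hours X and threshold h below are made-up
numbers, not from the deck) :

  X <- c(2, 8, 15, 40)               # hours of preparation (made up)
  h <- 10                            # assumed threshold
  Y <- as.integer(X > h)             # I( X > h ) : gives 0 0 1 1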
Predict speaker success
                     Y = I ( X > h)

  Learning the threshold h is difficult.
  Not robust




  P(Y | w, X) = 1 / (1 + e^(-(wX + w0)))

  Compare with the hard threshold Y = I( X > 10 ).
Logistic (sigmoidal) function

[Figure: the logistic curve 1 / (1 + e^(-z)), an S-shaped function from 0 to 1.]
Extend to d dimensions

  P(Y | w, X) = 1 / (1 + e^(-(w1 X1 + w2 X2 + ... + wd Xd + w0)))

  P(Y | w, X) = 1 / (1 + e^(-(w.X + w0)))
Logistic regression
  Model parameter : w

    P(Y = 1 | w, X) = 1 / (1 + e^(-(wX + w0)))

  Example : Given X = 0.9 , w = 1.2
  => wX = 1.08, P(Y=1|X=0.9) = 0.7465 ~ 0.75
  Toss a coin with p=3/4

  Example : Given X = -1.1 , w = 1.2
  => wX = -1.32, P(Y=1|X=-1.1) = 0.2108 ~ 0.2
  Toss a coin with p=1/5
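
R's built-in plogis(z) = 1/(1+e^(-z)) reproduces the arithmetic above
(a quick check, taking w0 = 0) :

  w <- 1.2
  plogis(w *  0.9)    # 0.7465 : toss a coin with p ~ 3/4
  plogis(w * -1.1)    # 0.2108 : toss a coin with p ~ 1/5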
Another view of logistic regression
  Log odds : ln [ p/(1-p) ] = wX + ε

  p / (1-p) = e^(wX)

  p = (1-p) e^(wX)

  p (1 + e^(wX)) = e^(wX)

  p = e^(wX) / (1 + e^(wX)) = 1 / (1 + e^(-wX))

  Logistic regression is a “linear regression” between the log-odds of
  an event and the features (X)
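
A quick numeric check of the log-odds view, reusing the example from the
previous slide (w0 = 0) :

  p <- plogis(1.08)    # P(Y=1) when wX = 1.08
  log(p / (1 - p))     # recovers 1.08 : the log-odds are linear in X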
The anatomy of classification
1.  What is the data (features X, label y) ?                ✔
2.  What is the model ? Model parameterization (w)          ✔
3.  Inference : Given X, w, predict the label.              ✔
4.  Learning : Given (X,y) pairs, learn the “best” w
      Define “best” – maximize an objective function
Learning : Finding the best w

  Data : (X1, Y1), …, (Xn, Yn)

  Expressing the Conditional Log Likelihood :

    If yi == 1, max P(yi=1 | xi, w)
    If yi == 0, max P(yi=0 | xi, w)

  Maximize the log-likelihood :

    l(w) = Σi ln P(yi | xi, w)
Learning : Example

  Data : (5, 0), (11, 1), (25, 1)

  l(w) = ln P(y=0 | x=5, w) + ln P(y=1 | x=11, w) + ln P(y=1 | x=25, w)

  P(Y=1 | X, w) is a logistic function, and P(y=1|x) + P(y=0|x) = 1, so :

  l(w) = ln( 1 − 1/(1 + e^(-(5w + w0))) )
       + ln( 1/(1 + e^(-(11w + w0))) )
       + ln( 1/(1 + e^(-(25w + w0))) )
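
A minimal R sketch of this example (fixing the intercept w0 at 0 for
simplicity ; not code from the deck) : evaluate l(w) on a grid and pick
the best w.

  x <- c(5, 11, 25)
  y <- c(0, 1, 1)
  loglik <- function(w) {
    p <- plogis(w * x)                        # P(y=1 | x, w), with w0 = 0
    sum(y * log(p) + (1 - y) * log(1 - p))    # sum of ln P(yi | xi, w)
  }
  ws <- seq(-1, 1, by = 0.01)
  ws[which.max(sapply(ws, loglik))]           # grid maximizer of l(w)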
Optimization : Pick the “best” w




1.    Weka
2.    Matlab : w = mnrfit(X,Y)
3.    R : w <- glm(Y~X, family=binomial(link="logit"))
4.    IRLS : http://www.cs.cmu.edu/~ggordon/IRLS-example/logistic.m
5.    Implement your own
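
A minimal end-to-end sketch with R's glm (the data here is synthetic,
made up for illustration) :

  set.seed(1)
  X <- c(rnorm(50, mean = 0), rnorm(50, mean = 2))   # 1-d feature
  Y <- rep(c(0, 1), each = 50)                       # binary label
  fit <- glm(Y ~ X, family = binomial(link = "logit"))
  coef(fit)                                          # w0 (intercept) and w
  p <- predict(fit, newdata = data.frame(X = 1), type = "response")
  as.integer(p > 0.5)                                # inference by "rounding"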
Decision surface is linear

[Figure: points labeled Y=0 on one side of a straight boundary and Y=1 on
the other; the few points on the wrong side are errors.]
Decision surface is linear




http://www.cs.technion.ac.il/~rani/LocBoost/
So far …
  Logistic regression is a binary classifier (multinomial
   version exists)
  P(Y=1|X,w) is a logistic function
  Inference : Compute P(Y=1|X,w), and do “rounding”.
  Parameter learnt by maximizing log-likelihood of data.
  Decision surface is linear (kernelized version exists)




Improvements in the model

  Prevent over-fitting          →  Regularization
  Maximize accuracy directly    →  SVMs
  Non-linear decision surface   →  Kernel Trick
  Multi-label data
Occam’s razor

The simplest explanation is most likely the correct one.
New and improved learning
  “Best” w == maximize log-likelihood
  Maximum Log-likelihood Estimate (MLE)

Small concern … over-fitting :
if the data is linearly separable, ||w|| → ∞
L2 regularization

  ||w||₂² = Σᵢ wᵢ²

  max_w  l(w) − λ ||w||₂²

  Prevents over-fitting
  “Pushes” parameters towards zero
  Equivalent to a prior on the parameters :
    Normal distribution (0 mean, unit covariance)

  λ : tuning parameter (e.g. 0.1)
Patient Diagnosis
  Y = disease
  X = [age, weight, BP, blood sugar, MRI, genetic tests …]


  We don't expect all “features” to be relevant.


  Weight vector w should be “mostly zeros”.




L1 regularization

  ||w||₁ = Σᵢ |wᵢ|

  max_w  l(w) − λ ||w||₁

  Prevents over-fitting
  “Pushes” parameters to zero
  Equivalent to a prior on the parameters :
    Laplace distribution

  As λ increases, more features get zero weight (deemed irrelevant)
L1 v/s L2 example

  MLE estimate : [ 11 0.8 ]

  L2 estimate  : [ 10 0.6 ]   →  shrinkage

  L1 estimate  : [ 10.2 0 ]   →  sparsity

  Mini-conclusion :
    L2 optimization is fast, L1 tends to be slower. If you have the
     computational resources, try both (at the same time) !
    ALWAYS run logistic regression with at least some regularization.
    Corollary : ALWAYS run logistic regression on features that have
     been standardized (zero mean, unit variance)
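
The shrinkage v/s sparsity contrast is easy to reproduce with the glmnet
package (a sketch on made-up data ; glmnet is one standard implementation,
not one named on these slides) :

  library(glmnet)
  set.seed(1)
  X <- matrix(rnorm(100 * 5), ncol = 5)       # 5 features ...
  Y <- rbinom(100, 1, plogis(2 * X[, 1]))     # ... only the first matters
  # alpha = 1 gives the L1 penalty, alpha = 0 the L2 penalty ;
  # standardize = TRUE rescales features to zero mean, unit variance.
  l1 <- glmnet(X, Y, family = "binomial", alpha = 1, lambda = 0.1,
               standardize = TRUE)
  l2 <- glmnet(X, Y, family = "binomial", alpha = 0, lambda = 0.1,
               standardize = TRUE)
  print(coef(l1))    # mostly exact zeros : sparsity
  print(coef(l2))    # small but non-zero : shrinkage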
So far …
  Logistic regression
    Model
    Inference
    Learning via maximum likelihood
    L1 and L2 regularization




  Next …. SVMs !



Why did we use probability again?
  Aim : Maximize “accuracy”


  Logistic regression : Indirect method that maximizes
  likelihood of data.

  A more direct approach is to maximize accuracy itself.


        Support Vector Machines (SVMs)


Maximize the margin

[Figure: a linear separator positioned to maximize the margin between the
two classes.]
Geometry review

[Figure: the line 2x1 + x2 − 2 = 0 separating Y = 1 from Y = −1.]

For a point on the line :
  (0.5, 1) : d = 2*0.5 + 1 − 2 = 0

Signed “distance” to the line from (x1⁰, x2⁰) :
  d = 2 x1⁰ + x2⁰ − 2
Geometry review

[Figure: the line 2x1 + x2 − 2 = 0 separating Y = 1 from Y = −1.]

  (1, 2.5) : d = 2*1 + 2.5 − 2 = 2.5 > 0
  y (w.x + b) = 1 * 2.5 = 2.5 > γ
Geometry review

[Figure: the line 2x1 + x2 − 2 = 0 separating Y = 1 from Y = −1.]

  (0.5, 0.5) : d = 2*0.5 + 0.5 − 2 = −0.5 < 0
  y (w.x + b) = y * d = −1 * −0.5 = 0.5
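
The three worked examples above, in a few lines of R :

  w <- c(2, 1) ; b <- -2                  # the line 2*x1 + x2 - 2 = 0
  d      <- function(x) sum(w * x) + b    # signed "distance"
  margin <- function(x, y) y * d(x)       # y * (w.x + b)
  d(c(0.5, 1))              #  0.0 : on the line
  margin(c(1, 2.5),    1)   #  2.5 : correct side of the boundary
  margin(c(0.5, 0.5), -1)   #  0.5 : also correct (negative side, y = -1)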
Support Vector Machines

Normalized margin – canonical hyperplanes

[Figure: canonical hyperplanes with margin points x+ and x−.]

  Support vectors are the points touching the margins.
Slack variables

  w.x = Σⱼ w(j) x(j)

  SVMs are made robust by adding “slack variables” ξᵢ that allow
  training error to be non-zero.
  One slack variable per data point ; ξᵢ == 0 for correctly classified
  points.

  Maximize the margin :   max  γ − C Σᵢ ξᵢ
Slack variables

  max  γ − C Σᵢ ξᵢ

Need to tune C :
  high C == minimize mis-classifications
  low C == maximize margin
SVM summary
  Model :     w.x + b > 0       if y = +1
               w.x + b < 0       if y = -1

  Inference : ŷ = sign(w.x+b)


  Learning : Maximize { (margin) - C ( slack-variables) }




                  Next … Kernel SVMs


The kernel trick
  Why a linear separator ? What if the data looks like below ?

[Figure: two classes arranged so that no straight line separates them.]

  The kernel trick allows you to use SVMs with non-linear separators.

  Different kernels :
  1.  Polynomial
  2.  Gaussian
  3.  Exponential
Logistic v/s Linear SVM

[Figure: logistic regression (left) and a linear SVM (right) fit to the
same non-linearly-separable data.]

  Error ~ 40% in both cases
Kernel SVM with polynomial kernel of degree 2

  Polynomial kernels of degree 2/4 do very well, but degree 3/5 do
  very badly.

  The Gaussian kernel has a tuning parameter (bandwidth). Performance
  depends on picking the right bandwidth.

  Error = 7%
SVMs summary
  Maximize the margin between positive and negative examples.
  The kernel trick is widely implemented, allowing non-linear decision
   surfaces.
  Not probabilistic.

  Software :
    SVM-light http://svmlight.joachims.org/
    LIBSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/
    Weka, Matlab, R
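
A minimal sketch with the e1071 package (an R wrapper around LIBSVM,
listed above ; the data is made up) :

  library(e1071)
  set.seed(1)
  X <- matrix(rnorm(200 * 2), ncol = 2)
  Y <- factor(ifelse(rowSums(X^2) > 1, 1, -1))     # circular boundary
  linear <- svm(X, Y, kernel = "linear", cost = 1)
  rbf    <- svm(X, Y, kernel = "radial", gamma = 0.5, cost = 1)
  mean(predict(linear, X) == Y)   # poor : no linear separator exists
  mean(predict(rbf, X) == Y)      # high : Gaussian kernel fits the circle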
Demo


       http://www.cs.technion.ac.il/~rani/LocBoost




Which to use ?
  Linear SVMs and logistic regression behave very similarly in most
   cases.
  Kernelized SVMs work better than linear SVMs (mostly).
  Kernelized logistic regression is possible, but implementations are
   not easily available.
Recommendations
1.  First, try logistic regression. Easy, fast, stable. No “tuning”
    parameters.
2.  Alternatively, you can first try linear SVMs, but you need
    to tune “C”
3.  If results are “good enough”, stop.
4.  Else try SVMs with Gaussian kernels.
     Need to tune bandwidth, C – by using validation data.

If you have more time/computational resources, try random
     forests as well.


** Recommendations are opinions of the presenter, and not known facts.

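
A sketch of step 4's tuning with e1071's cross-validation helper (the
data and grids are illustrative assumptions, not from the deck) :

  library(e1071)
  set.seed(1)
  X <- matrix(rnorm(200 * 2), ncol = 2)
  Y <- factor(ifelse(rowSums(X^2) > 1, 1, -1))
  tuned <- tune.svm(X, Y, kernel = "radial",
                    gamma = c(0.1, 0.5, 1, 2),     # bandwidth grid
                    cost  = c(0.1, 1, 10))         # C grid
  tuned$best.parameters      # gamma and C picked by cross-validation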
In conclusion …


    Logistic Regression
    Support Vector Machines


       Other classification approaches …

    Random forests / decision trees
    Naïve Bayes
    Nearest Neighbors
    Boosting (Adaboost)
Thank you
Questions?




Kriti Puniyani
Carnegie Mellon University
      kriti@cmu.edu
Is this athlete doing drugs ?
  X = Blood-test-to-detect-drugs
  Y = Doped athlete ?


  Two types of errors :
    Athlete is doped, we predict “NO” : false negative
    Athlete is NOT doped, we predict “YES” : false positive


  Penalize false positives more than false negatives




Outline
  What is classification ?
     Parameters, data, inference, learning
     Predicting coin tosses (0-dimensional X)
  Logistic Regression
     Predicting “speaker success” (1-dimensional X)
     Formulation, optimization
     Decision surface is linear
     Interpreting coefficients
     Hypothesis testing
     Evaluating the performance of the model
     Why is it called “regression” : log-odds
     L2 regularization
     Patient survival (d-dimensional X)
     L1 regularization
  Support Vector Machines
     Linear SVMs + formulation
     What are “support vectors”
     The kernel trick
  Demo : logistic regression v/s SVMs v/s kernel tricks
Overfitting a more serious problem

  2x + y − 2 = 0             w = [2 1 -2]
  4x + 2y − 4 = 0            w = [4 2 -4]
  400x + 200y − 400 = 0      w = [400 200 -400]

All three are the same decision boundary, but ||w|| can be scaled up
without bound ; on separable data the likelihood only increases as it
grows.

 ⇒ Absolutely need L2 regularization
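
A tiny check of this effect (made-up separable data, intercept omitted) :
scaling w up only increases the unregularized log-likelihood, so the MLE
diverges.

  x <- c(-2, -1, 1, 2)
  y <- c( 0,  0, 1, 1)                         # linearly separable
  loglik <- function(w) sum(log(plogis((2 * y - 1) * w * x)))
  sapply(c(1, 2, 100), loglik)                 # climbs toward 0 as w grows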
