Classifiers Ensembles
Machine Learning and Data Mining (Unit 16)




Prof. Pier Luca Lanzi
References                                     2



  Jiawei Han and Micheline Kamber, “Data Mining: Concepts and
  Techniques”, The Morgan Kaufmann Series in Data Management
  Systems (Second Edition)
  Tom M. Mitchell, “Machine Learning”, McGraw Hill, 1997
  Pang-Ning Tan, Michael Steinbach, Vipin Kumar,
  “Introduction to Data Mining”, Addison Wesley
  Ian H. Witten, Eibe Frank, “Data Mining: Practical Machine
  Learning Tools and Techniques with Java Implementations”,
  2nd Edition




                  Prof. Pier Luca Lanzi
Outline                                   3



  What is the general idea?

  Which ensemble methods?
    Bagging
    Boosting
    Random Forest




                  Prof. Pier Luca Lanzi
Ensemble Methods                                  4



  Construct a set of classifiers from the training data

  Predict class label of previously unseen records by
  aggregating predictions made by multiple classifiers
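
  As a minimal illustration of the aggregation step, the sketch below (Python, with made-up
  labels) combines the predictions of several already-trained base classifiers by majority vote:

  from collections import Counter

  def majority_vote(predictions):
      """Given one predicted label per base classifier, return the most common label."""
      return Counter(predictions).most_common(1)[0][0]

  # Hypothetical predictions of five base classifiers for one unseen record
  votes = ["spam", "ham", "spam", "spam", "ham"]
  print(majority_vote(votes))   # -> "spam"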




                   Prof. Pier Luca Lanzi
What is the General Idea?                5




                 Prof. Pier Luca Lanzi
Building models ensembles                       6



  Basic idea
     Build different “experts”, let them vote

  Advantage
    Often improves predictive performance

  Disadvantage
     Usually produces output that is very hard to analyze

  However, there are approaches that aim to produce a single
  comprehensible structure




                   Prof. Pier Luca Lanzi
Why does it work?                                  7



  Suppose there are 25 base classifiers

  Each classifier has error rate Δ = 0.35

  Assume classifiers are independent

  Probability that the ensemble classifier makes a wrong
  prediction:

                 Σ_{i=13..25} C(25, i) · Δ^i · (1 − Δ)^(25−i) = 0.06
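
  The 0.06 figure can be checked directly; a short Python computation of the binomial tail
  above, assuming 25 independent classifiers, each with error rate Δ = 0.35:

  from math import comb

  eps, n = 0.35, 25
  # The ensemble errs when at least 13 of the 25 independent classifiers err
  p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
  print(round(p_wrong, 3))   # ~0.06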




                      Prof. Pier Luca Lanzi
How to generate an ensemble?               8



  Bootstrap Aggregating (Bagging)

  Boosting

  Random Forests




                   Prof. Pier Luca Lanzi
What is Bagging? (Bootstrap Aggregation)                                9

  Analogy: Diagnosis based on multiple doctors’ majority vote

  Training
     Given a set D of d tuples, at each iteration i, a training set Di of
     d tuples is sampled with replacement from D (i.e., bootstrap)
     A classifier model Mi is learned for each training set Di

  Classification: classify an unknown sample X
     Each classifier Mi returns its class prediction
     The bagged classifier M* counts the votes and
     assigns the class with the most votes to X

  Prediction: can be applied to the prediction of continuous values by
  taking the average value of each prediction for a given test tuple




                     Prof. Pier Luca Lanzi
What is Bagging?                                  10



  Combining predictions by voting/averaging
    Simplest way
    Each model receives equal weight

  “Idealized” version:
     Sample several training sets of size n
     (instead of just having one training set of size n)
     Build a classifier for each training set
     Combine the classifiers’ predictions




                   Prof. Pier Luca Lanzi
More on bagging                                11



  Bagging works because it reduces variance
  by voting/averaging
     Note: in some pathological hypothetical situations the
     overall error might increase
     Usually, the more classifiers the better

  Problem: we only have one dataset!
  Solution: generate new ones of size n
  by Bootstrap, i.e., sampling from it with replacement

  Can help a lot if data is noisy

  Can also be applied to numeric prediction
    Aside: bias-variance decomposition originally only known
    for numeric prediction



                    Prof. Pier Luca Lanzi
Bagging classifiers                         12




  Model generation
  Let n be the number of instances in the training data
  For each of t iterations:
         Sample n instances from training set
                 (with replacement)
         Apply learning algorithm to the sample
         Store resulting model




  Classification
  For each of the t models:
         Predict class of instance using model
  Return class that is predicted most often
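
  A runnable sketch of the pseudocode above, assuming scikit-learn is available, NumPy arrays
  for X and y, and a decision tree as the (unstable) base learner:

  import numpy as np
  from collections import Counter
  from sklearn.tree import DecisionTreeClassifier

  def bagging_fit(X, y, t=25, random_state=0):
      """Model generation: learn t trees, each from a bootstrap sample of the data."""
      rng = np.random.default_rng(random_state)
      n = len(X)
      models = []
      for _ in range(t):
          idx = rng.integers(0, n, size=n)              # sample n instances with replacement
          models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
      return models

  def bagging_predict(models, x):
      """Classification: each model votes; return the class predicted most often."""
      votes = [m.predict(x.reshape(1, -1))[0] for m in models]
      return Counter(votes).most_common(1)[0][0]

  scikit-learn's BaggingClassifier packages the same idea in a single class, averaging the
  base models' probability estimates instead of counting hard votes when the base learner
  supports it.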




                   Prof. Pier Luca Lanzi
Bias-variance decomposition                      13



  Used to analyze how much selection of any specific training
  set affects performance

  Assume infinitely many classifiers,
  built from different training sets of size n

  For any learning scheme,
     Bias = expected error of the combined
     classifier on new data
     Variance = expected error due to
     the particular training set used

  Total expected error ≈ bias + variance
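
  To make the decomposition concrete, here is a small simulation sketch (assumptions: a known
  target function, regression trees as the learning scheme, and squared loss, i.e. the
  numeric-prediction setting mentioned above): many models are trained on independently drawn
  training sets, and bias² and variance are estimated from their predictions.

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  rng = np.random.default_rng(0)
  f = lambda x: np.sin(3 * x)                  # hypothetical true target function
  x_test = np.linspace(0, 2, 200)

  preds = []
  for _ in range(200):                         # many training sets of size n
      x_tr = rng.uniform(0, 2, size=50)
      y_tr = f(x_tr) + rng.normal(0, 0.3, size=50)
      model = DecisionTreeRegressor(max_depth=3).fit(x_tr.reshape(-1, 1), y_tr)
      preds.append(model.predict(x_test.reshape(-1, 1)))

  preds = np.array(preds)                      # shape: (models, test points)
  avg = preds.mean(axis=0)                     # "combined" model = average prediction
  bias2 = np.mean((avg - f(x_test)) ** 2)      # error of the combined model on new data
  variance = np.mean(preds.var(axis=0))        # error due to the particular training set
  print(bias2, variance, bias2 + variance)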




                    Prof. Pier Luca Lanzi
When does Bagging work?                           14



  Learning algorithm is unstable, if small changes to the
  training set cause large changes in the learned classifier

  If the learning algorithm is unstable, then
  Bagging almost always improves performance

  Bagging stable classifiers is not a good idea

  Which ones are unstable?
    Neural nets, decision trees, regression trees, linear
    regression

  Which ones are stable?
    K-nearest neighbors


                   Prof. Pier Luca Lanzi
Why Bagging works?                                   15



  Let T = {<xn, yn>} be the set of training data containing N examples

  Let {Tk} be a sequence of training sets, each containing N examples
  independently sampled from T; for instance, each Tk can be generated
  from T by bootstrap sampling

  Let P be the underlying distribution of T

  Bagging replaces the prediction φ(x,T) of the single model learned from T
  with the majority of the predictions given by the models {φ(x,Tk)}

                         φA(x,P) = ET(φ(x,Tk))

  The algorithm is unstable if perturbing the learning set can cause
  significant changes in the predictor constructed

  Bagging can improve the accuracy of unstable predictors



                     Prof. Pier Luca Lanzi
Why Bagging works?                                     16



  It is possible to prove that

                  (y − ET(φ(x,T)))^2 ≀ ET((y − φ(x,T))^2)

  Thus, Bagging produces a smaller error

  How much smaller the error is depends on how unequal the two
  sides of the following inequality are,

                      [ET(φ(x,T))]^2 ≀ ET(φ^2(x,T))

  If the algorithm is stable, the two sides will be nearly equal
  The more highly variable the φ(x,T) are, the more improvement
  aggregation produces

  However, φA always improves on φ
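
  The second inequality is simply the statement that a variance is non-negative, since
  ET(φ²) − [ET(φ)]² = VarT(φ) ≄ 0; a quick numerical check in Python, with an assumed
  distribution standing in for the predictions φ(x,Tk):

  import numpy as np

  rng = np.random.default_rng(1)
  phi = rng.normal(loc=0.7, scale=0.4, size=100_000)   # simulated predictions phi(x, Tk)
  lhs = phi.mean() ** 2                                # [ET(phi(x,T))]^2
  rhs = (phi ** 2).mean()                              # ET(phi^2(x,T))
  print(lhs <= rhs, rhs - lhs)                         # gap ~= Var(phi) ~= 0.16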



                      Prof. Pier Luca Lanzi
Bagging with costs                             17



  Bagging unpruned decision trees is known to produce good
  probability estimates
     Where, instead of voting, the individual classifiers'
     probability estimates are averaged
     Note: this can also improve the success rate

  Can use this with minimum-expected cost approach
  for learning problems with costs

  Problem: not interpretable
     MetaCost re-labels training data using bagging with costs
     and then builds single tree
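
  A minimal sketch of the probability-averaging and minimum-expected-cost ideas above; the
  class order, probability estimates, and cost matrix are hypothetical:

  import numpy as np

  # Averaged probability estimates of the bagged trees for one instance,
  # over the (hypothetical) classes [legit, fraud]
  p = np.mean([[0.7, 0.3], [0.55, 0.45], [0.8, 0.2]], axis=0)

  # cost[true class][predicted class]: missing a fraud is 10x worse than a false alarm
  cost = np.array([[0, 1],
                   [10, 0]])

  expected_cost = p @ cost              # expected cost of each possible prediction
  prediction = expected_cost.argmin()   # 1 -> predict "fraud" even though p[fraud] < 0.5
  print(p, expected_cost, prediction)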




                  Prof. Pier Luca Lanzi
Randomization                                 18



  Can randomize learning algorithm instead of input

  Some algorithms already have a random component:
  e.g. initial weights in neural net

  Most algorithms can be randomized, e.g. greedy algorithms:
     Pick one of the N best options at random instead of always
     picking the best option
    E.g.: attribute selection in decision trees

  More generally applicable than bagging:
  e.g. random subsets in nearest-neighbor scheme

  Can be combined with bagging


                  Prof. Pier Luca Lanzi
What is Boosting?                                     19



  Analogy: Consult several doctors, based on a combination of
  weighted diagnoses—weight assigned based on the previous
  diagnosis accuracy
  How does boosting work?
     Weights are assigned to each training tuple
     A series of k classifiers is iteratively learned
     After a classifier Mi is learned, the weights are updated to allow
     the subsequent classifier, Mi+1, to pay more attention to the
     training tuples that were misclassified by Mi
     The final M* combines the votes of each individual classifier,
     where the weight of each classifier's vote is a function of its
     accuracy
  The boosting algorithm can be extended for the prediction of
  continuous values
  Comparing with bagging: boosting tends to achieve greater
  accuracy, but it also risks overfitting the model to misclassified data



                     Prof. Pier Luca Lanzi
What is the Basic Idea?                              20



  Suppose there are just 5 training examples {1,2,3,4,5}

  Initially each example has a 0.2 (1/5) probability of being sampled

  1st round of boosting samples (with replacement) 5 examples:
  {2, 4, 4, 3, 2} and builds a classifier from them

  Suppose examples 2, 3, 5 are correctly predicted by this classifier,
  and examples 1, 4 are wrongly predicted:
     Weight of examples 1 and 4 is increased,
     Weight of examples 2, 3, 5 is decreased

  2nd round of boosting samples again 5 examples, but now
  examples 1 and 4 are more likely to be sampled

  And so on, until some convergence is achieved

                     Prof. Pier Luca Lanzi
Boosting                                           21



  Also uses voting/averaging

  Weights models according to performance

  Iterative: new models are influenced
  by the performance of previously built ones
      Encourage new model to become an “expert” for instances
      misclassified by earlier models
      Intuitive justification: models should be experts that
      complement each other

  Several variants
     Boosting by sampling,
     the weights are used to sample the data for training
     Boosting by weighting,
     the weights are used by the learning algorithm


                    Prof. Pier Luca Lanzi
AdaBoost.M1                                 22



 Model generation
 Assign equal weight to each training instance
 For t iterations:
   Apply learning algorithm to weighted dataset,
              store resulting model
   Compute model’s error e on weighted dataset
   If e = 0 or e ≄ 0.5:
     Terminate model generation
   For each instance in dataset:
     If classified correctly by model:
        Multiply instance’s weight by e/(1-e)
   Normalize weight of all instances



 Classification
 Assign weight = 0 to all classes
 For each of the t (or less) models:
        For the class this model predicts
               add -log(e/(1-e)) to this class’s weight
 Return class with highest weight
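
  A compact Python sketch of AdaBoost.M1 by weighting, assuming a scikit-learn decision stump
  that accepts per-instance weights; it closely follows the pseudocode above, including the
  e/(1-e) reweighting and the -log(e/(1-e)) vote weights (X and y are assumed to be NumPy arrays).

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  def adaboost_m1_fit(X, y, t=50):
      n = len(X)
      w = np.full(n, 1.0 / n)                       # equal weight to each training instance
      models, alphas = [], []
      for _ in range(t):
          m = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
          wrong = m.predict(X) != y
          e = w[wrong].sum() / w.sum()              # model's error on the weighted dataset
          if e == 0 or e >= 0.5:                    # perfect or no better than chance: stop
              break
          w[~wrong] *= e / (1 - e)                  # shrink weights of correct instances
          w /= w.sum()                              # normalize
          models.append(m)
          alphas.append(-np.log(e / (1 - e)))       # this model's vote weight
      return models, alphas

  def adaboost_m1_predict(models, alphas, x):
      scores = {}                                   # accumulated weight per class
      for m, a in zip(models, alphas):
          c = m.predict(x.reshape(1, -1))[0]
          scores[c] = scores.get(c, 0.0) + a
      return max(scores, key=scores.get)            # class with highest weight

  scikit-learn's AdaBoostClassifier implements the closely related SAMME variant of this
  algorithm.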

                  Prof. Pier Luca Lanzi
Example: AdaBoost                                 23



  Base classifiers: C1, C2, 
, CT

  Error rate:

       Δi = (1/N) Σ_{j=1..N} wj · ÎŽ(Ci(xj) ≠ yj)

  Importance of a classifier:

       αi = (1/2) ln((1 − Δi) / Δi)
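
  For instance, with the error rate Δi = 0.35 used earlier, the importance is
  αi = Âœ ln(0.65/0.35) ≈ 0.31; a classifier with Δi close to 0.5 gets αi close to 0,
  while a very accurate classifier receives a large positive weight.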



                      Prof. Pier Luca Lanzi
Example: AdaBoost                                        24



  Weight update:

       wi^(j+1) = (wi^(j) / Zj) · exp(−αj)   if Cj(xi) = yi
       wi^(j+1) = (wi^(j) / Zj) · exp(+αj)   if Cj(xi) ≠ yi

       where Zj is the normalization factor

  If any intermediate rounds produce error rate higher than
  50%, the weights are reverted back to 1/n and the
  resampling procedure is repeated

  Classification:

       C*(x) = argmax_y Σ_{j=1..T} αj · ÎŽ(Cj(x) = y)

                      Prof. Pier Luca Lanzi
Illustrating AdaBoost                              25




             (Figure: the data points used for training, each shown with its initial, equal weight)




                    Prof. Pier Luca Lanzi
Illustrating AdaBoost                    26




                 Prof. Pier Luca Lanzi
Adaboost (Freund and Schapire, 1997)                        27



   Given a set of d class-labeled tuples, (X1, y1), 
, (Xd, yd)
   Initially, all the weights of tuples are set the same (1/d)
   Generate k classifiers in k rounds. At round i,
        Tuples from D are sampled (with replacement) to form a
        training set Di of the same size
        Each tuple’s chance of being selected is based on its weight
        A classification model Mi is derived from Di
        Its error rate is calculated using Di as a test set
        If a tuple is misclassified, its weight is increased, otherwise it is
        decreased
   Error rate: err(Xj) is the misclassification error of tuple Xj.
   Classifier Mi error rate is the sum of the weights of the
   misclassified tuples:
                      error(Mi) = Σ_{j=1..d} wj × err(Xj)

   The weight of classifier Mi’s vote is

                      log((1 − error(Mi)) / error(Mi))
                      Prof. Pier Luca Lanzi
What is a Random Forest?                         28



  Random forests (RF) are a combination of tree predictors

  Each tree depends on the values of a random vector sampled
  independently and with the same distribution for all trees in
  the forest

  The generalization error of a forest of tree classifiers depends
  on the strength of the individual trees in the forest and the
  correlation between them

  Using a random selection of features to split each node yields
  error rates that compare favorably to Adaboost, and are
  more robust with respect to noise




                   Prof. Pier Luca Lanzi
How do Random Forests Work?                             29




D = training set                            F = set of tests
k = nb of trees in forest                   n = nb of tests

for i = 1 to k do:
   build data set Di by sampling with replacement from D
   learn tree Ti (Tilde) from Di:
      at each node:
         choose best split from random subset of F of size n
      allow aggregates and refinement of aggregates in tests


make predictions according to majority vote of the set of k trees.




                    Prof. Pier Luca Lanzi
Random Forests                                 30



  Ensemble method tailored for decision tree classifiers

  Creates k decision trees, where each tree is independently
  generated based on random decisions

  Bagging using decision trees can be seen as a special case of
  random forests where the random decisions are the random
  creations of the bootstrap samples




                   Prof. Pier Luca Lanzi
Two examples of random decisions in decision forests                     31

  At each internal tree node, randomly select F attributes, and
  evaluate just those attributes to choose the partitioning attribute
      Tends to produce trees larger than trees where all attributes are
      considered for selection at each node, but different classes will
      be eventually assigned to different leaf nodes, anyway
      Saves processing time in the construction of each individual
      tree, since just a subset of attributes is considered at each
      internal node

  At each internal tree node, evaluate the quality of all possible
  partitioning attributes, but randomly select one of the F best
  attributes to label that node (based on InfoGain, etc.)
      Unlike the previous approach, does not save processing time




                     Prof. Pier Luca Lanzi
Properties of Random Forests                         32



  Easy to use (“off-the-shelf”), only 2 parameters
  (number of trees, % of variables considered for each split)
  Very high accuracy
  No overfitting when a large number of trees is selected (choose it high)
  Insensitive to the choice of split % (~20%)
  Returns an estimate of variable importance
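
  As a usage sketch (the dataset is synthetic and the parameter values are illustrative),
  scikit-learn's RandomForestClassifier exposes exactly these two knobs as n_estimators and
  max_features, and also returns the variable-importance estimate mentioned above:

  from sklearn.ensemble import RandomForestClassifier
  from sklearn.datasets import make_classification

  X, y = make_classification(n_samples=500, n_features=20, random_state=0)

  rf = RandomForestClassifier(
      n_estimators=500,      # number of trees: choose high, adding more trees does not overfit
      max_features=0.2,      # ~20% of the variables considered at each split
      oob_score=True,        # out-of-bag estimate of the generalization error (next slide)
      random_state=0,
  ).fit(X, y)

  print(rf.oob_score_)            # accuracy estimated on the out-of-bag cases
  print(rf.feature_importances_)  # impurity-based estimate of variable importance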




                     Prof. Pier Luca Lanzi
Out of the bag                                       33



  For every tree grown, about one-third of the cases are out-of-bag
  (out of the bootstrap sample). Abbreviated oob.

  Put these oob cases down the corresponding tree and get response
  estimates for them.

  For each case n, average (or take a plurality vote over) the response estimates
  over all the times that n was oob to get a test-set estimate ŷn for yn.

  Averaging the loss over all n gives the test-set
  estimate of prediction error.

  The only adjustable parameter in RF is m, the number of variables considered at each split.

  The default value for m is √M, but RF is not sensitive to the value of
  m over a wide range.
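
  The one-third figure comes from the bootstrap itself: the probability that a given case is
  never drawn in N samples with replacement is (1 − 1/N)^N, which tends to e^(−1) ≈ 0.368 as
  N grows, so roughly a third of the cases are out-of-bag for each tree.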


                     Prof. Pier Luca Lanzi
Variable Importance                                 34



  Because of the need to know which variables are important in the
  classification, RF has three different ways of looking at variable
  importance

  Measure 1
     To estimate the importance of the mth variable, in the oob
     cases for the kth tree, randomly permute all values of the mth
     variable
     Put these altered oob x-values down the tree and get
     classifications.
     Proceed as though computing a new internal error rate.
     The amount by which this new error exceeds the original test
     set error is defined as the importance of the mth variable.
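
  scikit-learn provides a test-set analogue of this permutation measure; the sketch below
  (hypothetical synthetic data; the permutation is done on held-out data rather than on each
  tree's own oob cases) illustrates the idea:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.inspection import permutation_importance
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=500, n_features=20, random_state=0)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
  rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

  # For each variable: permute its values, re-score the forest,
  # and report how much the accuracy drops
  result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
  print(result.importances_mean)   # larger drop = more important variable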




                    Prof. Pier Luca Lanzi
Variable Importance                                  35



  For the nth case in the data, its margin at the end of a run is the
  proportion of votes for its true class minus the maximum of the
  proportion of votes for each of the other classes

  The 2nd measure of importance of the mth variable is the average
  lowering of the margin across all cases when the mth variable is
  randomly permuted as in method 1

  The third measure is the count of how many margins are lowered
  minus the number of margins raised




                     Prof. Pier Luca Lanzi
Summary of Random Forests                     36



  Random forests are an effective tool in prediction.
  Forests give results competitive with boosting and adaptive
  bagging, yet do not progressively change the training set.
  Random inputs and random features produce good results in
  classification, less so in regression.
  For larger data sets, we can gain accuracy by combining
  random features with boosting.




                  Prof. Pier Luca Lanzi
Summary                                        37



 Ensembles in general improve predictive accuracy
    Good results reported for most application domains,
     unlike algorithm variations whose success is more
     dependent on the application domain/dataset

 Improvement in accuracy, but interpretability decreases
   Much more difficult for the user to interpret an ensemble
   of classification models than a single classification model

 Diversity of the base classifiers in the ensemble is important
    Trade-off between each base classifier’s error and
    diversity
    Maximizing classifier diversity tends to increase the error
    of each individual base classifier


                  Prof. Pier Luca Lanzi
