Hands on Classification:
Decision Trees and Random
Forests
Predictive Analytics Meetup Group
Machine Learning Workshop
December 2, 2012




Daniel Gerlanc, Managing Director
Enplus Advisors, Inc.
www.enplusadvisors.com
dgerlanc@enplusadvisors.com
© Daniel Gerlanc, 2012.
All rights reserved.


If you'd like to use this material for any
purpose, please contact
dgerlanc@enplusadvisors.com
What You'll Learn

• Intuition behind decision trees and
  random forests
• Implementation in R
• Assessing the results
Dataset

• Chemical Analysis of Italian Wines
• http://www.parvus.unige.it/
• 178 records, 14 attributes
Follow along
> library(mlclass)
> data(wine)
> str(wine)
'data.frame':	178 obs. of 14 variables:
 $ Type      : Factor w/ 2 levels "Grig","No": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alcohol   : num  14.2 13.2 13.2 14.4 13.2 ...
 $ Malic     : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ Ash       : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ Alcalinity: num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
What are Decision Trees?


• Model for partitioning an input space
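Concretely, a fitted tree is nothing more than nested threshold tests on single input variables, one per split — which is also why its decision boundaries are axis-aligned. A minimal sketch in Python (the variable names come from the wine data, but the thresholds here are made up for illustration, not the ones rpart would learn):

```python
def classify(alcohol, malic):
    """Toy two-split decision tree with made-up thresholds.

    Each internal node tests one variable against one threshold,
    so every split partitions the input space with an
    axis-aligned boundary.
    """
    if alcohol >= 13.0:    # first split, on Alcohol
        return "Grig"
    elif malic >= 2.0:     # second split, on Malic, inside the low-Alcohol region
        return "No"
    else:
        return "Grig"

print(classify(14.2, 1.71))  # lands in the high-Alcohol region
```

Fitting a tree amounts to choosing which variable and which threshold to test at each node.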
What's partitioning?




See rf-1.R
Create the 1st split.

[Plot: the first split divides the input space into a "G" region and a "Not G" region.]

See rf-1.R
Create the 2nd Split

[Plot: the second split carves an additional "G" region out of the "Not G" side.]

See rf-1.R
Create more splits…

[Plot: further splits produce alternating "G" and "Not G" regions.]

I drew this one in.
Another view of partitioning




See rf-2.R
Use R to do the partitioning.

 tree.1 <- rpart(Type ~ ., data=wine)
 prp(tree.1, type=4, extra=2)




• See the 'rpart' and 'rpart.plot' R packages.
• Many parameters available to control the fit.



 See rf-2.R
Make predictions on a test dataset

 predict(tree.1, newdata=wine, type="vector")
How'd it do?
Guessing: 60.11%
CART: 94.38% Accuracy
 • Precision: 92.96% (66 / 71)
 • Sensitivity/Recall: 92.96% (66 / 71)


                       Actual
Predicted         Grig         No

Grig           (1) 66      (3)   5

No             (2)  5      (4) 102
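The figures above fall straight out of the confusion matrix. As a sanity check, here is the same arithmetic with the counts from the table (the 60.11% "guessing" baseline is just always predicting the majority class, "No"):

```python
# Confusion matrix counts from the slide.
tp, fp = 66, 5    # predicted Grig: correct vs. wrong
fn, tn = 5, 102   # predicted No:   missed Grig vs. correct

total = tp + fp + fn + tn         # 178 records
accuracy = (tp + tn) / total      # 168/178 ~ 94.38%
precision = tp / (tp + fp)        # 66/71  ~ 92.96%
recall = tp / (tp + fn)           # 66/71  ~ 92.96%
baseline = (tn + fp) / total      # always guess "No": 107/178 ~ 60.11%

print(round(accuracy * 100, 2), round(precision * 100, 2),
      round(recall * 100, 2), round(baseline * 100, 2))
```

Precision and recall coincide here only because the two off-diagonal counts happen to be equal.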
Decision Tree Problems

• Overfitting the data
• May not use all relevant features
• Perpendicular decision boundaries
Random Forests


One Decision Tree  →  Many Decision Trees (Ensemble)
Random Forest Fixes

• Overfitting the data
• May not use all relevant features
• Perpendicular decision boundaries
Building RF

For each tree:
  Sample from the data
  At each split, sample from the available variables
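The two sampling steps can be sketched in a few lines (an illustrative Python sketch of the idea, not the randomForest internals):

```python
import random

def bootstrap_sample(rows):
    """Per-tree data sample: draw n rows with replacement."""
    return [random.choice(rows) for _ in rows]

def candidate_features(features, mtry):
    """Per-split variable sample: consider only a random subset."""
    return random.sample(features, mtry)

random.seed(42)
rows = list(range(10))
features = ["Alcohol", "Malic", "Ash", "Alcalinity"]
print(len(bootstrap_sample(rows)), candidate_features(features, 2))
```

Each tree is grown on its own bootstrap sample, and every split inside that tree sees a fresh random subset of mtry variables.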
Bootstrap Sampling
Sample Attributes at each split
Motivations for RF

• Create uncorrelated trees
• Variance reduction
• Subspace exploration
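The variance-reduction claim is easy to verify numerically: the average of m independent, equally noisy estimates has about 1/m the variance of a single estimate. A small stdlib-only simulation, with Gaussian noise as a stand-in for tree predictions:

```python
import random
import statistics

random.seed(0)
m = 25          # number of "trees" averaged together
trials = 2000

# Variance of one noisy predictor vs. an ensemble average of m of them.
single = [random.gauss(0, 1) for _ in range(trials)]
averaged = [statistics.fmean(random.gauss(0, 1) for _ in range(m))
            for _ in range(trials)]

v1 = statistics.pvariance(single)     # close to 1
vm = statistics.pvariance(averaged)   # close to 1/25
print(v1, vm)
```

Correlated trees would cap this reduction, which is exactly why the bootstrap and per-split variable sampling that decorrelate the trees matter.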
Random Forests
rffit.1 <- randomForest(Type ~ ., data=wine)




See rf-3.R
RF Parameters in R
Most important parameters are:

Variable   Description                              Default

ntree      Number of trees                          500

mtry       Number of variables randomly selected    • square root of # predictors for classification
           at each split                            • # predictors / 3 for regression

nodesize   Minimum number of records in a           • 1 for classification
           terminal node                            • 5 for regression

sampsize   Number of records to select in each      • 63.2% of the data
           bootstrap sample
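The 63.2% figure for sampsize is not arbitrary: a bootstrap sample of n records drawn with replacement contains, in expectation, a fraction 1 − (1 − 1/n)^n ≈ 1 − 1/e ≈ 63.2% of the distinct records. A quick check at n = 178:

```python
import math

n = 178  # records in the wine dataset
expected_unique = 1 - (1 - 1 / n) ** n   # fraction of distinct records expected
limit = 1 - 1 / math.e                   # the n -> infinity limit, ~0.632

print(round(expected_unique, 4), round(limit, 4))
```

The remaining ~36.8% of records are "out-of-bag" for that tree, which is what makes the out-of-bag error estimate possible.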
How'd it do?
Guessing Accuracy: 60.11%
Random Forest: 98.31% Accuracy
 • Precision: 95.77% (68 / 71)
 • Sensitivity/Recall: 100% (68 / 68)


                       Actual
Predicted         Grig         No

Grig           (1) 68      (3)   3

No             (2)  0      (4) 107
Tuning RF: Grid Search
This is the default.




      See rf-4.R
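Grid search simply refits the model for every combination of candidate parameter values and keeps the best one. A schematic sketch of that loop (the candidate values are hypothetical, and oob_accuracy is a placeholder for fitting a forest and reading its out-of-bag accuracy):

```python
from itertools import product

def oob_accuracy(mtry, nodesize):
    """Placeholder: in practice, fit a randomForest with these
    parameters and return its out-of-bag accuracy."""
    return 1 - abs(mtry - 4) * 0.01 - abs(nodesize - 1) * 0.005

# Hypothetical candidate values for the two most influential parameters.
grid = {"mtry": [2, 4, 6], "nodesize": [1, 5, 10]}
combos = [dict(zip(grid, vals)) for vals in product(*grid.values())]

best = max(combos, key=lambda p: oob_accuracy(**p))
print(len(combos), best)
```

Nine combinations means nine full forest fits, which is why the next slide calls tuning expensive.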
Tuning is Expensive
Benefits of RF

• Good performance with default settings
• Relatively easy to make parallel
• Many implementations
 • R, Weka, RapidMiner, Mahout
References

•   A. Liaw and M. Wiener (2002). Classification and Regression by
    randomForest. R News 2(3), 18--22.

•   Breiman, Leo. Classification and Regression Trees. Belmont, Calif:
    Wadsworth International Group, 1984. Print.

•   Breiman, Leo and Adele Cutler. Random forests.
    http://www.stat.berkeley.edu/~breiman/RandomForests/cc_contact.htm


Speaker notes

  1. John, Dave, and I have spoken a bit about the motivations for using Machine Learning techniques.