This document discusses random decision forests and how they work at scale. It begins with an overview of decision trees: an individual decision tree is prone to overfitting, and a random decision forest addresses this by growing many decision trees on randomly selected subsets of the data and features, then averaging the predictions of the ensemble for greater accuracy than any single tree. The document demonstrates how to build random decision forests using Spark MLlib and discusses the hyperparameters, such as the number of trees and the feature subset strategy, that can be tuned.
Entropy:
Information theory concept – Claude Shannon
Measures the mixed-ness, or unpredictability, of a population: entropy = -Σ p_i log2(p_i), where p_i is the proportion of class i
Lower is better
Gini:
1 minus the probability that a random guess of class i (made with probability p_i) is correct: Gini = 1 - Σ p_i^2
Lower is better
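To make these two measures concrete, here is a minimal Scala sketch (my own illustration, not Spark API) that computes each from per-class counts:

def entropy(counts: Seq[Long]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2) // log base 2, measured in bits
  }.sum
}

def gini(counts: Seq[Long]): Double = {
  val total = counts.sum.toDouble
  // p * p is the probability that a random guess of that class is correct
  1.0 - counts.map { c => val p = c / total; p * p }.sum
}

// A 50/50 population is maximally impure:
// entropy(Seq(50, 50)) == 1.0 and gini(Seq(50, 50)) == 0.5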
Table of Contents
Introduction
Why should I know this?
Concept – Example
Describe the Iris Data Set
Explain what machine learning has to do with the Iris data set
Physical representation of the decision tree concept
Interpreting the Model
Building Decision Trees
Introduce data set
Implement a Decision tree with Data Set
Explain how it worked
Split the data
Calculate Entropy
Cache the training data
Show current accuracy
Benchmark against Random
Confusion Matrix
Tuning the Model (Hyperparameters)
Impurity
Maximum Depth
Maximum Bins
Try multiple parameters
Can the data be improved?
Random Decision Forests
Wisdom of the crowd
Introduce concept of Decision Forests
How to create Diversity of opinion
Handicap each tree
Subset of features
Aggregating Results
How to implement that in Spark
Introduction
Good Evening,
Today I would like to talk about Random Decision Forests and how we can implement them at scale using the Hadoop ecosystem. I believe this is an important discussion because of the method's popularity and its potential impact for most organizations.
Let’s Begin
Concept – Example
Before we can begin talking about Random Decision Forests, I believe it is important to start with decision trees. If you have ever had the opportunity to play "21 Questions," you know it is a game where one individual picks an object or place, and the competing individual slowly asks binary questions, attempting to shrink the possibilities. This often begins with asking whether it is a person or a thing. Generally, these questions are mutually exclusive and collectively exhaustive. The inquisitive individual then asks follow-up questions to narrow the possibilities again and again, until they know the answer or have run out of available questions.
This is a perfect example of a decision tree. The decision tree is one of the most widely used classification techniques; according to a recent survey, it is among the most commonly implemented techniques today.
The utilization of this technique ranges from agriculture to physics. In physics, decision trees have been used to detect physical particles and to classify particle signatures. In other sectors, decision trees have been used to build personal learning assistants and to classify sleep patterns.
Because decision trees can be used for both classification and regression, the possibilities are almost endless.
Pros vs. Cons
Pros - The decision tree is used in so many problems because of its many strengths and short list of disadvantages. One of the key benefits of decision trees is that they implicitly perform variable screening, or feature selection, as they are built. Additionally, the technique is computationally inexpensive, and it requires very little effort from the user during data preparation. Lastly, the best feature of using trees for analytics: they are easy to interpret and explain to executives! This is increasingly important in fields where individuals have to explain why their model made a specific decision at a certain point in time.
Cons - Because of the technique's overall flexibility, models using a decision tree are prone to overfitting. Overfitting occurs when the model fits the noise in the training data rather than the underlying relationship, so it looks accurate on the data it has seen but generalizes poorly to new data. In addition to being easily overfit, a single tree often falls short of the predictive performance of other available techniques. Lastly, decision trees can be significantly affected by small changes in the data over time: a slightly different training set can produce a very different tree.
Describe the Iris Data Set
The Iris data set contains 150 flower measurements: four numeric features per flower (sepal length and width, petal length and width), each labeled with one of three species. We don't need to understand the relationships in order to model them; we can learn them empirically from the data set.
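As a sketch of how the Iris data might be loaded into Spark for the examples that follow (the file name iris.csv, its row layout, and the label mapping are my assumptions, not part of the talk):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Assumes rows like: 5.1,3.5,1.4,0.2,Iris-setosa
// and that sc is the SparkContext provided by spark-shell.
val speciesToLabel =
  Map("Iris-setosa" -> 0.0, "Iris-versicolor" -> 1.0, "Iris-virginica" -> 2.0)

val data = sc.textFile("iris.csv").map { line =>
  val fields = line.split(',')
  LabeledPoint(speciesToLabel(fields.last), Vectors.dense(fields.init.map(_.toDouble)))
}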
Explain what machine learning has to do with the Iris data set
Attempt to plot the Iris data set
Graphical example of k-NN on Iris
Physical representation of the decision tree concept
Interpreting the Model
Introduction to Decision Trees
Give Example
Building Decision Trees
Introduce data set
Implement a Decision tree with Data Set
Explain how it worked
Split the data
Cache the training data
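A minimal sketch of these steps with Spark MLlib's RDD-based API, assuming data is the RDD of LabeledPoints from the Iris sketch above (the parameter values are illustrative):

import org.apache.spark.mllib.tree.DecisionTree

// Split the data: 80% for training, 20% held out for evaluation.
val Array(trainData, testData) = data.randomSplit(Array(0.8, 0.2))

// Cache the training data: building the tree makes repeated passes over it.
trainData.cache()

// Train a decision tree: 3 classes, no categorical features,
// Gini impurity, maximum depth 4, up to 32 bins.
val model = DecisionTree.trainClassifier(
  trainData, 3, Map[Int, Int](), "gini", 4, 32)

// The model prints as nested if/else rules, which is what makes
// trees easy to interpret and explain.
println(model.toDebugString)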
Show current accuracy
Benchmark against Random
Confusion Matrix
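Continuing the sketch: held-out accuracy, a random-guessing benchmark to put that number in context, and the confusion matrix. The classProbabilities helper is my own illustration, not Spark API:

import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Pair each held-out example's predicted label with its true label.
val predictionsAndLabels = testData.map(p => (model.predict(p.features), p.label))
val metrics = new MulticlassMetrics(predictionsAndLabels)
println(metrics.accuracy)

// Random benchmark: guessing class i with the probability p_i observed in
// the training data yields expected accuracy sum_i pTrain(i) * pTest(i).
def classProbabilities(data: RDD[LabeledPoint]): Array[Double] = {
  val counts = data.map(_.label).countByValue().toArray.sortBy(_._1).map(_._2)
  counts.map(_.toDouble / counts.sum)
}
val randomAccuracy =
  classProbabilities(trainData).zip(classProbabilities(testData))
    .map { case (trainP, testP) => trainP * testP }.sum
println(randomAccuracy)

// Rows are true classes, columns are predicted classes; a good model
// concentrates counts on the diagonal.
println(metrics.confusionMatrix)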
Tuning the Model (Hyperparameters)
Impurity:
Measures how much a candidate decision decreases impurity, using Gini impurity (vs. entropy)
Maximum Depth:
Sets the maximum depth of the tree, i.e., the number of chained decisions allowed before a prediction is returned
Maximum Bins:
Sets how many candidate split values are tried per feature; more bins mean more candidate decision rules, at a higher computational cost
Fix the training data
Try multiple parameters
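One way to try multiple parameters is a simple grid search, retraining on each combination and comparing held-out accuracy (the grid values here are illustrative; in practice the tuning set should be kept separate from the final test set):

import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.tree.DecisionTree

val evaluations =
  for (impurity <- Array("gini", "entropy");
       maxDepth <- Array(1, 5, 10);
       maxBins  <- Array(10, 40, 100)) yield {
    val model = DecisionTree.trainClassifier(
      trainData, 3, Map[Int, Int](), impurity, maxDepth, maxBins)
    val accuracy = new MulticlassMetrics(
      testData.map(p => (model.predict(p.features), p.label))).accuracy
    ((impurity, maxDepth, maxBins), accuracy)
  }

// Print every combination, best accuracy first.
evaluations.sortBy(-_._2).foreach(println)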
Can the data be improved?
Random Decision Forests
Wisdom of the crowd
Introduce concept of Decision Forests
Ensemble Learning
How to create Diversity of opinion
Handicap each tree
Subset of features
Aggregating Results
Hyperparameters in a forest
How to implement that in Spark
Determining results from Random Forest
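A minimal sketch of a random decision forest in Spark MLlib, reusing the trainData/testData split from above. The forest-specific hyperparameters are numTrees (how many handicapped trees to grow) and featureSubsetStrategy (how many features each split may consider); the values shown are illustrative:

import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.tree.RandomForest

val forest = RandomForest.trainClassifier(
  trainData,
  3,               // number of classes
  Map[Int, Int](), // no categorical features
  20,              // numTrees: grow 20 trees, each on a random sample of the data
  "auto",          // featureSubsetStrategy: let MLlib pick the subset size
  "gini",          // per-tree impurity
  4,               // per-tree maximum depth
  32,              // per-tree maximum bins
  11)              // random seed, fixed for reproducibility

// To classify an example, every tree votes and the forest returns the
// majority class; aggregating many decorrelated trees is what reduces
// the overfitting of any single tree.
val predictionsAndLabels = testData.map(p => (forest.predict(p.features), p.label))
println(new MulticlassMetrics(predictionsAndLabels).accuracy)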