This document discusses random decision forests and how they work at scale. It begins with an overview of decision trees: an individual decision tree is prone to overfitting, and a random decision forest addresses this by growing many decision trees on randomly selected subsets of the data and features, then averaging the predictions of the ensemble for greater accuracy than any single tree. The document demonstrates how to build random decision forests using Spark MLlib and discusses the hyperparameters, such as the number of trees and the feature subset strategy, that can be tuned.
Entropy:
Information theory concept – Claude Shannon
Measures the mixed-ness, or unpredictability, of a population: entropy = -Σ p_i log2(p_i), where p_i is the proportion of class i
Lower is better
Gini:
1 minus the probability that a random guess of class i (made with probability p_i) is correct: Gini = 1 - Σ p_i^2
Lower is better
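To make these two measures concrete, here is a minimal Scala sketch (my own illustration, not Spark API) that computes each from per-class counts:

def entropy(counts: Seq[Long]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2) // log base 2, measured in bits
  }.sum
}

def gini(counts: Seq[Long]): Double = {
  val total = counts.sum.toDouble
  // p * p is the probability that a random guess of that class is correct
  1.0 - counts.map { c => val p = c / total; p * p }.sum
}

// A 50/50 population is maximally impure:
// entropy(Seq(50, 50)) == 1.0 and gini(Seq(50, 50)) == 0.5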
Table of Contents
Introduction
Why should I know this?
Concept – Example
Describe the Iris Data Set
Explain what machine learning has to do with the Iris data set
Physical representation of the decision tree concept
Interpreting the Model
Building Decision Trees
Introduce data set
Implement a Decision tree with Data Set
Explain how it worked
Split the data
Calculate Entropy
Cache the training data
Show current accuracy
Benchmark against Random
Confusion Matrix
Tuning the Model (Hyperparameters)
Impurity
Maximum Depth
Maximum Bins
Try multiple parameters
Can the data be improved?
Random Decision Forests
Wisdom of the crowd
Introduce concept of Decision Forests
How to create Diversity of opinion
Handicap each tree
Subset of features
Aggregating Results
How to implement that in Spark
Introduction
Good Evening,
Today I would like to talk about Random Decision Forests and how we can implement them at scale using the Hadoop ecosystem. I believe this is an important discussion because of the method's popularity and its potential impact for most organizations.
Let’s Begin
Concept – Example
Before we can begin talking about Random Decision Forests, I believe it is important to start with decision trees. If you have ever had the opportunity to play "21 Questions," you know it is a game where one individual picks an object or place, and the competing individual slowly asks binary questions, attempting to shrink the possibilities. This often begins with asking whether it is a person or a thing. Generally, these questions are mutually exclusive and collectively exhaustive. The inquisitive individual then asks follow-up questions to narrow the possibilities again and again, until they know the answer or have run out of available questions.
This is a perfect example of a decision tree. The decision tree is one of the most widely used classification techniques; according to a recent survey, it is among the most commonly implemented techniques today.
The utilization of this technique ranges from agriculture to physics. In physics, decision trees have been used to detect physical particles and to classify particle signatures. In other sectors, decision trees have been used to build personal learning assistants and to classify sleep patterns.
Because decision trees can be used for both classification and regression, the possibilities are almost endless.
Pros vs. Cons
Pros - The decision tree is used in so many problems because of its many strengths and short list of disadvantages. One of the key benefits of decision trees is that they implicitly perform variable screening, or feature selection, as they are built. Additionally, the technique is computationally inexpensive, and it requires very little effort from the user during data preparation. Lastly, the best feature of using trees for analytics: they are easy to interpret and explain to executives! This is increasingly important in fields where individuals have to explain why their model made a specific decision at a certain point in time.
Cons - Because of the technique's overall flexibility, models using a decision tree are prone to overfitting. Overfitting occurs when the model fits the noise in the training data rather than the underlying relationship, so it looks accurate on the data it has seen but generalizes poorly to new data. In addition to being easily overfit, a single tree often falls short of the predictive performance of other available techniques. Lastly, decision trees can be significantly affected by small changes in the data over time: a slightly different training set can produce a very different tree.
Describe the Iris Data Set
The Iris data set contains 150 flower measurements: four numeric features per flower (sepal length and width, petal length and width), each labeled with one of three species. We don't need to understand the relationships in order to model them; we can learn them empirically from the data set.
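As a sketch of how the Iris data might be loaded into Spark for the examples that follow (the file name iris.csv, its row layout, and the label mapping are my assumptions, not part of the talk):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Assumes rows like: 5.1,3.5,1.4,0.2,Iris-setosa
// and that sc is the SparkContext provided by spark-shell.
val speciesToLabel =
  Map("Iris-setosa" -> 0.0, "Iris-versicolor" -> 1.0, "Iris-virginica" -> 2.0)

val data = sc.textFile("iris.csv").map { line =>
  val fields = line.split(',')
  LabeledPoint(speciesToLabel(fields.last), Vectors.dense(fields.init.map(_.toDouble)))
}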
Explain what machine learning has to do with the Iris data set
Attempt to plot the Iris data set
Graphical example of k-NN on Iris
Physical representation of the decision tree concept
Interpreting the Model
Introduction to Decision Trees
Give Example
Building Decision Trees
Introduce data set
Implement a Decision tree with Data Set
Explain how it worked
Split the data
Cache the training data
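A minimal sketch of these steps with Spark MLlib's RDD-based API, assuming data is the RDD of LabeledPoints from the Iris sketch above (the parameter values are illustrative):

import org.apache.spark.mllib.tree.DecisionTree

// Split the data: 80% for training, 20% held out for evaluation.
val Array(trainData, testData) = data.randomSplit(Array(0.8, 0.2))

// Cache the training data: building the tree makes repeated passes over it.
trainData.cache()

// Train a decision tree: 3 classes, no categorical features,
// Gini impurity, maximum depth 4, up to 32 bins.
val model = DecisionTree.trainClassifier(
  trainData, 3, Map[Int, Int](), "gini", 4, 32)

// The model prints as nested if/else rules, which is what makes
// trees easy to interpret and explain.
println(model.toDebugString)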
Show current accuracy
Benchmark against Random
Confusion Matrix
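Continuing the sketch: held-out accuracy, a random-guessing benchmark to put that number in context, and the confusion matrix. The classProbabilities helper is my own illustration, not Spark API:

import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Pair each held-out example's predicted label with its true label.
val predictionsAndLabels = testData.map(p => (model.predict(p.features), p.label))
val metrics = new MulticlassMetrics(predictionsAndLabels)
println(metrics.accuracy)

// Random benchmark: guessing class i with the probability p_i observed in
// the training data yields expected accuracy sum_i pTrain(i) * pTest(i).
def classProbabilities(data: RDD[LabeledPoint]): Array[Double] = {
  val counts = data.map(_.label).countByValue().toArray.sortBy(_._1).map(_._2)
  counts.map(_.toDouble / counts.sum)
}
val randomAccuracy =
  classProbabilities(trainData).zip(classProbabilities(testData))
    .map { case (trainP, testP) => trainP * testP }.sum
println(randomAccuracy)

// Rows are true classes, columns are predicted classes; a good model
// concentrates counts on the diagonal.
println(metrics.confusionMatrix)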
Tuning the Model (Hyperparameters)
Impurity:
Measures how much a candidate decision decreases impurity, using Gini impurity (vs. entropy)
Maximum Depth:
Sets the maximum depth of the tree, i.e., the number of chained decisions allowed before a prediction is returned
Maximum Bins:
Sets how many candidate split values are tried per feature; more bins mean more candidate decision rules, at a higher computational cost
Fix the training data
Try multiple parameters
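One way to try multiple parameters is a simple grid search, retraining on each combination and comparing held-out accuracy (the grid values here are illustrative; in practice the tuning set should be kept separate from the final test set):

import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.tree.DecisionTree

val evaluations =
  for (impurity <- Array("gini", "entropy");
       maxDepth <- Array(1, 5, 10);
       maxBins  <- Array(10, 40, 100)) yield {
    val model = DecisionTree.trainClassifier(
      trainData, 3, Map[Int, Int](), impurity, maxDepth, maxBins)
    val accuracy = new MulticlassMetrics(
      testData.map(p => (model.predict(p.features), p.label))).accuracy
    ((impurity, maxDepth, maxBins), accuracy)
  }

// Print every combination, best accuracy first.
evaluations.sortBy(-_._2).foreach(println)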
Can the data be improved?
Random Decision Forests
Wisdom of the crowd
Introduce concept of Decision Forests
Ensemble Learning
How to create Diversity of opinion
Handicap each tree
Subset of features
Aggregating Results
Hyperparameters in a forest
How to implement that in Spark
Determining results from Random Forest
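A minimal sketch of a random decision forest in Spark MLlib, reusing the trainData/testData split from above. The forest-specific hyperparameters are numTrees (how many handicapped trees to grow) and featureSubsetStrategy (how many features each split may consider); the values shown are illustrative:

import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.tree.RandomForest

val forest = RandomForest.trainClassifier(
  trainData,
  3,               // number of classes
  Map[Int, Int](), // no categorical features
  20,              // numTrees: grow 20 trees, each on a random sample of the data
  "auto",          // featureSubsetStrategy: let MLlib pick the subset size
  "gini",          // per-tree impurity
  4,               // per-tree maximum depth
  32,              // per-tree maximum bins
  11)              // random seed, fixed for reproducibility

// To classify an example, every tree votes and the forest returns the
// majority class; aggregating many decorrelated trees is what reduces
// the overfitting of any single tree.
val predictionsAndLabels = testData.map(p => (forest.predict(p.features), p.label))
println(new MulticlassMetrics(predictionsAndLabels).accuracy)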