Accelerating Random Forests in Scikit-Learn 
Gilles Louppe 
Université de Liège, Belgium
August 29, 2014 
1 / 26
Motivation 
... and many more applications ! 
2 / 26
About 
Scikit-Learn
• Machine learning library for Python
• Classical and well-established algorithms
• Emphasis on code quality and usability
Myself
• @glouppe
• PhD student (Liège, Belgium)
• Core developer on Scikit-Learn since 2011
  Chief tree hugger
3 / 26
Outline 
1 Basics 
2 Scikit-Learn implementation 
3 Python improvements 
4 / 26
Machine Learning 101
• Data comes as...
  A set of samples $\mathcal{L} = \{(x_i, y_i) \mid i = 0, \dots, N-1\}$, with
  feature vector $x \in \mathbb{R}^p$ (= input), and
  response $y \in \mathbb{R}$ (regression) or $y \in \{0, 1\}$ (classification) (= output)
• Goal is to...
  Find a function $\hat{y} = \varphi(x)$
  such that the error $L(y, \hat{y})$ on new (unseen) $x$ is minimal
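As a minimal sketch of the setting above, the snippet below learns a function phi on one part of a toy sample set and measures the 0-1 error on samples it has not seen. The one-feature threshold model, the toy data, and all names (`fit_threshold`, `zero_one_loss`) are illustrative assumptions, not part of the talk.

```python
def zero_one_loss(y_true, y_pred):
    """Average 0-1 loss L(y, y_hat): the fraction of wrong predictions."""
    return sum(a != b for a, b in zip(y_true, y_pred)) / len(y_true)


def fit_threshold(samples):
    """Learn phi(x) = 1 if x[0] > v else 0, picking v with minimal training error."""
    best_v, best_err = None, float("inf")
    for v in sorted(x[0] for x, _ in samples):
        err = sum((x[0] > v) != bool(y) for x, y in samples)
        if err < best_err:
            best_v, best_err = v, err
    return lambda x: int(x[0] > best_v)


# Toy learning set with p = 1 feature: y = 1 exactly when the feature exceeds 0.5.
data = [((i / 100,), int(i > 50)) for i in range(100)]
train, test = data[0::2], data[1::2]  # even indices to learn, odd ones held out

phi = fit_threshold(train)
train_error = zero_one_loss([y for _, y in train], [phi(x) for x, _ in train])
test_error = zero_one_loss([y for _, y in test], [phi(x) for x, _ in test])
```

The quantity of interest is `test_error`, since it is computed on samples that played no role in choosing the threshold.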
5 / 26
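Decision Trees
[Figure: a binary decision tree with nodes $t_1, \dots, t_{17}$. An input $x$ enters at $t_1$; each split node tests $X_t \le v_t$ (shown for $t_1$, $t_3$, $t_6$, $t_{10}$); each leaf node outputs $p(Y = c \mid X = x)$.]
$t \in \varphi$ : nodes of the tree $\varphi$
$X_t$ : split variable at $t$
$v_t \in \mathbb{R}$ : split threshold at $t$
$\varphi(x) = \arg\max_{c \in \mathcal{Y}} p(Y = c \mid X = x)$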
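The prediction rule above can be sketched as a walk from the root to a leaf. The parallel-list layout and the tiny hand-built tree below are hypothetical, chosen only for illustration; they are not scikit-learn's internal data structure.

```python
# Hypothetical array-based tree: node 0 is the root, -1 marks "no child" (a leaf).
feature = [0, -1, 1, -1, -1]             # X_t: index of the split variable at node t
threshold = [0.5, 0.0, 2.0, 0.0, 0.0]    # v_t: split threshold at node t
left = [1, -1, 3, -1, -1]                # child taken when X_t <= v_t
right = [2, -1, 4, -1, -1]               # child taken otherwise
proba = [None, [0.9, 0.1], None, [0.2, 0.8], [0.6, 0.4]]  # p(Y = c | X = x) at leaves


def predict_proba(x):
    """Route x down the tree and return the class distribution of its leaf."""
    t = 0
    while left[t] != -1:
        t = left[t] if x[feature[t]] <= threshold[t] else right[t]
    return proba[t]


def predict(x):
    """phi(x): the class with the largest estimated probability."""
    p = predict_proba(x)
    return max(range(len(p)), key=p.__getitem__)
```

For instance, `predict([0.7, 1.0])` goes right at the root, left at node 2, and ends in the leaf with distribution `[0.2, 0.8]`.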
6 / 26
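Random Forests
[Figure: an input $x$ is passed to each tree $\varphi_1, \dots, \varphi_M$; the per-tree estimates $p_{\varphi_m}(Y = c \mid X = x)$ are combined ($\Sigma$) into the ensemble estimate $p_\psi(Y = c \mid X = x)$.]
Ensemble of $M$ randomized decision trees $\varphi_m$
$\psi(x) = \arg\max_{c \in \mathcal{Y}} \frac{1}{M} \sum_{m=1}^{M} p_{\varphi_m}(Y = c \mid X = x)$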
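The averaging rule above can be sketched in a few lines. The three "trees" below are hypothetical one-split stumps written as plain functions, used only to make the arithmetic visible.

```python
def make_stump(threshold, p_left, p_right):
    """Hypothetical one-split tree: returns p(Y = c | X = x) as a list over classes."""
    return lambda x: p_left if x[0] <= threshold else p_right


def forest_predict_proba(trees, x):
    """Average the class distributions of the M trees."""
    per_tree = [tree(x) for tree in trees]
    n_classes = len(per_tree[0])
    return [sum(p[c] for p in per_tree) / len(trees) for c in range(n_classes)]


def forest_predict(trees, x):
    """psi(x): the class with the largest averaged probability."""
    p = forest_predict_proba(trees, x)
    return max(range(len(p)), key=p.__getitem__)


trees = [
    make_stump(0.5, [0.9, 0.1], [0.2, 0.8]),
    make_stump(0.3, [0.7, 0.3], [0.3, 0.7]),
    make_stump(0.6, [0.8, 0.2], [0.1, 0.9]),
]
p = forest_predict_proba(trees, [0.4])  # averages [0.9, 0.1], [0.3, 0.7], [0.8, 0.2]
```

Here the second stump alone would vote for class 1, but the averaged distribution favors class 0.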
7 / 26
Learning from data 
function BuildDecisionTree($\mathcal{L}$)
    Create node $t$
    if the stopping criterion is met for $t$ then
        $\hat{y}_t$ = some constant value
    else
        Find the best partition $\mathcal{L} = \mathcal{L}_L \cup \mathcal{L}_R$
        $t_L$ = BuildDecisionTree($\mathcal{L}_L$)
        $t_R$ = BuildDecisionTree($\mathcal{L}_R$)
    end if
    return $t$
end function
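The pseudocode above can be sketched in Python as follows. This is a minimal illustration and not scikit-learn's implementation: the Gini criterion, the exhaustive threshold search, the majority-class leaf value, the dict-based nodes, and every function name are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_partition(samples):
    """Try every (feature, threshold) pair; return the one with the lowest weighted Gini."""
    best = None  # (score, feature, threshold, left_samples, right_samples)
    n = len(samples)
    for j in range(len(samples[0][0])):
        # Dropping the largest value guarantees a non-empty right part.
        for v in sorted({x[j] for x, _ in samples})[:-1]:
            left = [s for s in samples if s[0][j] <= v]
            right = [s for s in samples if s[0][j] > v]
            score = (len(left) * gini([y for _, y in left])
                     + len(right) * gini([y for _, y in right])) / n
            if best is None or score < best[0]:
                best = (score, j, v, left, right)
    return best


def build_decision_tree(samples, min_samples_split=2):
    labels = [y for _, y in samples]
    split = None
    if len(samples) >= min_samples_split and gini(labels) > 0.0:
        split = best_partition(samples)
    if split is None:  # stopping criterion met: the leaf predicts the majority class
        return {"value": Counter(labels).most_common(1)[0][0]}
    _, j, v, left, right = split
    return {"feature": j, "threshold": v,
            "left": build_decision_tree(left, min_samples_split),
            "right": build_decision_tree(right, min_samples_split)}


def predict(tree, x):
    while "value" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["value"]


samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR-like toy set
tree = build_decision_tree(samples)
```

The exhaustive search re-partitions the samples for every candidate threshold, which is simple but slow; it is meant to mirror the pseudocode, not to be efficient.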
8 / 26
Outline 
1 Basics 
2 Scikit-Learn implementation 
3 Python improvements 
9 / 26
History 
Time for building a Random Forest (relative to version 0.10)
Version        0.10  0.11  0.12  0.13  0.14  0.15
Relative time  1     0.99  0.98  0.33  0.11  0.04
0.10 : January 2012
• First sketch at sklearn.tree and sklearn.ensemble
• Random Forests and Extremely Randomized Trees modules
0.11 : May 2012
• Gradient Boosted Regression Trees module
• Out-of-bag estimates in Random Forests
0.12 : October 2012
• Multi-output decision trees
0.13 : February 2013
• Speed improvements
  Rewriting from Python to Cython
• Support of sample weights
• Totally randomized trees embedding
0.14 : August 2013
• Complete rewrite of sklearn.tree
  Refactoring
  Cython enhancements
• AdaBoost module
0.15 : August 2014
• Further speed and memory improvements
  Better algorithms
  Cython enhancements
• Better parallelism
• Bagging module
10 / 26
Implementation overview
• Modular implementation, designed with a strict separation of concerns
  Builders : for building and connecting nodes into a tree
  Splitters : for finding a split
  Criteria : for evaluating the goodness of a split
  Tree : dedicated data structure
• Efficient algorithmic formulation [See Louppe, 2014]
  Tips. An efficient algorithm is better than a bad one, even if the implementation of the latter is strongly optimized.
  Dedicated sorting procedure
  Efficient evaluation of consecutive splits
• Close to the metal, carefully coded, implementation
  2300+ lines of Python, 3000+ lines of Cython, 1700+ lines of tests
# But we kept it stupid simple for users!
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
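One common way to make the evaluation of consecutive splits cheap is to sort a feature once and then sweep over the thresholds, moving a single sample from the right partition to the left at each step. The sketch below illustrates that idea with a Gini criterion; the function names and the midpoint-threshold convention are assumptions, and it is not the scikit-learn code itself.

```python
from collections import Counter


def gini_from_counts(counts, n):
    """Gini impurity computed from per-class counts over n samples."""
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split on a single feature."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)  # sort once
    left, right = Counter(), Counter(labels)
    best = (None, float("inf"))
    for rank, i in enumerate(order[:-1], start=1):
        # Moving sample i from right to left is a constant-time count update.
        left[labels[i]] += 1
        right[labels[i]] -= 1
        next_value = values[order[rank]]
        if values[i] == next_value:
            continue  # no threshold can separate two equal values
        score = (rank * gini_from_counts(left, rank)
                 + (n - rank) * gini_from_counts(right, n - rank)) / n
        if score < best[1]:
            best = ((values[i] + next_value) / 2.0, score)
    return best


result = best_threshold([2.0, 1.0, 4.0, 3.0], [0, 0, 1, 1])
```

Compared with re-partitioning the samples for every candidate threshold, the sweep touches each sample once after the initial sort.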
11 / 26
Development cycle 
User feedback 
Benchmarks 
Profiling 
Algorithmic and code improvements
Implementation 
Peer review 
12 / 26
Continuous benchmarks
• During code review, changes in the tree codebase are monitored with benchmarks.
• Ensure performance and code quality.
• Avoid code complexification if it is not worth it.
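A benchmark check of this kind could look like the sketch below, which times a baseline and a candidate with the standard `timeit` module and flags the candidate when it is clearly slower. The helper names, the 1.5x tolerance, and the two toy functions are hypothetical; the talk does not describe the actual tooling.

```python
import timeit


def best_time(func, number=20, repeat=3):
    """Best average time per call, in seconds, over several repeats."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


def is_regression(baseline, candidate, tolerance=1.5):
    """Flag the candidate if it is more than `tolerance` times slower than the baseline."""
    return best_time(candidate) > tolerance * best_time(baseline)


data = list(range(500, 0, -1))


def baseline():
    return sorted(data)


def candidate():
    # Deliberately worse: repeated min() and remove() make this quadratic.
    rest, out = list(data), []
    while rest:
        smallest = min(rest)
        rest.remove(smallest)
        out.append(smallest)
    return out


flagged = is_regression(baseline, candidate)
```

Taking the minimum over repeats, rather than the mean, reduces the influence of noise from other processes on the machine.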
13 / 26
Outline 
1 Basics 
2 Scikit-Learn implementation 
3 Python improvements 
14 / 26
Disclaimer. Early optimization is the root of all evil. 
(This took us several years to get it right.) 
15 / 26
Profiling
Use profiling tools for identifying bottlenecks.
In [1]: clf = DecisionTreeClassifier() 
# Timer 
In [2]: %timeit clf.fit(X, y) 
1000 loops, best of 3: 394 µs per loop
# memory_profiler 
In [3]: %memit clf.fit(X, y) 
peak memory: 48.98 MiB, increment: 0.00 MiB 
# cProfile 
In [4]: %prun clf.fit(X, y) 
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
390/32    0.003    0.000    0.004    0.000  _tree.pyx:1257(introsort)
  4719    0.001    0.000    0.001    0.000  _tree.pyx:1229(swap)
     8    0.001    0.000    0.006    0.001  _tree.pyx:1041(node_split)
   405    0.000    0.000    0.000    0.000  _tree.pyx:123(impurity_improvement)
     1    0.000    0.000    0.007    0.007  tree.py:93(fit)
     2    0.000    0.000    0.000    0.000  {method 'argsort' of 'numpy.ndarray'
   405    0.000    0.000    0.000    0.000  _tree.pyx:294(update)
... 
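Outside IPython, similar timing and call statistics can be collected with the standard library alone. The sketch below uses `timeit`, `cProfile`, and `pstats` on a stand-in function `work`, which is a hypothetical placeholder for a call such as `clf.fit(X, y)`.

```python
import cProfile
import io
import pstats
import timeit


def work(n=2000):
    """Stand-in workload; replace with the call you want to profile."""
    return sorted((i * 7919) % 1009 for i in range(n))


# Timer: best average time per call over three repeats of 100 calls.
best = min(timeit.repeat(work, number=100, repeat=3)) / 100

# cProfile: record one call and collect per-function statistics.
profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)  # five most expensive entries
report = buffer.getvalue()
print(f"{best * 1e6:.1f} us per call")
print(report)
```

Sorting by cumulative time puts the functions that dominate the total run time at the top of the report.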
16 / 26
Profiling (cont.)
# line_profiler 
In [5]: %lprun -f DecisionTreeClassifier.fit clf.fit(X, y) 
Line  % Time  Line Contents
=================================
...
256   4.5     self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_)
257
258           # Use BestFirst if max_leaf_nodes given; use DepthFirst otherwise
259   0.4     if max_leaf_nodes < 0:
260   0.5         builder = DepthFirstTreeBuilder(splitter, min_samples_split,
261   0.6                                         self.min_samples_leaf,
262           else:
263               builder = BestFirstTreeBuilder(splitter, min_samples_split,
264                                              self.min_samples_leaf, max_depth,
265                                              max_leaf_nodes)
266
267   22.4    builder.build(self.tree_, X, y, sample_weight)
... 
17 / 26
Call graph 
python -m cProfile -o profile.prof script.py 
gprof2dot -f pstats profile.prof -o graph.dot 
18 / 26
Python is slow :-(
• Python overhead is too large for high-performance code.
• Whenever feasible, use high-level operations (e.g., SciPy or NumPy operations on arrays) to limit Python calls and rely on highly-optimized code.
    import numpy as np

    def dot_python(a, b):  # Pure Python (2.09 ms)
        s = 0
        for i in range(a.shape[0]):
            s += a[i] * b[i]
        return s

    np.dot(a, b)  # NumPy (5.97 µs)
• Otherwise (and only then!), write compiled C extensions (e.g., using Cython) for critical parts.
    cpdef dot_mv(double[::1] a, double[::1] b):  # Cython (7.06 µs)
        cdef double s = 0
        cdef int i
        for i in range(a.shape[0]):
            s += a[i] * b[i]
        return s
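To reproduce this kind of comparison on your own machine, a sketch along the following lines can be used. The array size and the `timeit` settings are arbitrary assumptions, and the measured times will differ from the figures quoted on the slide.

```python
import timeit

import numpy as np


def dot_python(a, b):
    """Dot product with an explicit Python loop over array elements."""
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i] * b[i]
    return s


a = np.arange(1000, dtype=np.float64)
b = np.ones(1000, dtype=np.float64)

# Best average time per call, in seconds.
t_python = min(timeit.repeat(lambda: dot_python(a, b), number=20, repeat=3)) / 20
t_numpy = min(timeit.repeat(lambda: np.dot(a, b), number=20, repeat=3)) / 20
print(f"pure Python: {t_python * 1e6:.1f} us, NumPy: {t_numpy * 1e6:.1f} us")
```

Both versions compute the same value; only the place where the loop runs (the Python interpreter versus compiled code) changes.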
19 / 26
Stay close to the metal
• Use the right data type for the right operation.
• Avoid repeated access (if at all) to Python objects.
  Trees are represented by single arrays.
  Tips. In Cython, check for hidden Python overhead. Limit yellow lines as much as possible!
    cython -a tree.pyx
20 / 26

Weitere ähnliche Inhalte

Was ist angesagt?

Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...Jonas Traub
 
Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNNShuai Zhang
 
Semi supervised learning machine learning made simple
Semi supervised learning  machine learning made simpleSemi supervised learning  machine learning made simple
Semi supervised learning machine learning made simpleDevansh16
 
Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...
Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...
Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...Simplilearn
 
[PR12] PR-050: Convolutional LSTM Network: A Machine Learning Approach for Pr...
[PR12] PR-050: Convolutional LSTM Network: A Machine Learning Approach for Pr...[PR12] PR-050: Convolutional LSTM Network: A Machine Learning Approach for Pr...
[PR12] PR-050: Convolutional LSTM Network: A Machine Learning Approach for Pr...Taegyun Jeon
 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningShubhmay Potdar
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes ClassifierYiqun Hu
 
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Model-Agnostic Meta-Learning for Fast Adaptation of Deep NetworksModel-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Model-Agnostic Meta-Learning for Fast Adaptation of Deep NetworksYoonho Lee
 
Multi Layer Perceptron & Back Propagation
Multi Layer Perceptron & Back PropagationMulti Layer Perceptron & Back Propagation
Multi Layer Perceptron & Back PropagationSung-ju Kim
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsDerek Kane
 
Derivation of Convolutional Neural Network from Fully Connected Network Step-...
Derivation of Convolutional Neural Network from Fully Connected Network Step-...Derivation of Convolutional Neural Network from Fully Connected Network Step-...
Derivation of Convolutional Neural Network from Fully Connected Network Step-...Ahmed Gad
 
activelearning.ppt
activelearning.pptactivelearning.ppt
activelearning.pptbutest
 
Introduction to Neural Networks with Python
Introduction to Neural Networks with PythonIntroduction to Neural Networks with Python
Introduction to Neural Networks with PythondataHacker. rs
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachinePulse
 
Knapsack problem solved by Genetic Algorithms
Knapsack problem solved by Genetic AlgorithmsKnapsack problem solved by Genetic Algorithms
Knapsack problem solved by Genetic AlgorithmsStelios Krasadakis
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural NetworkKnoldus Inc.
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CARTXueping Peng
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-LearningKuppusamy P
 

Was ist angesagt? (20)

Support Vector machine
Support Vector machineSupport Vector machine
Support Vector machine
 
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...
 
Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNN
 
Semi supervised learning machine learning made simple
Semi supervised learning  machine learning made simpleSemi supervised learning  machine learning made simple
Semi supervised learning machine learning made simple
 
Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...
Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...
Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...
 
[PR12] PR-050: Convolutional LSTM Network: A Machine Learning Approach for Pr...
[PR12] PR-050: Convolutional LSTM Network: A Machine Learning Approach for Pr...[PR12] PR-050: Convolutional LSTM Network: A Machine Learning Approach for Pr...
[PR12] PR-050: Convolutional LSTM Network: A Machine Learning Approach for Pr...
 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter Tuning
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes Classifier
 
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Model-Agnostic Meta-Learning for Fast Adaptation of Deep NetworksModel-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
 
Multi Layer Perceptron & Back Propagation
Multi Layer Perceptron & Back PropagationMulti Layer Perceptron & Back Propagation
Multi Layer Perceptron & Back Propagation
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
 
Post pruning
Post pruning Post pruning
Post pruning
 
Derivation of Convolutional Neural Network from Fully Connected Network Step-...
Derivation of Convolutional Neural Network from Fully Connected Network Step-...Derivation of Convolutional Neural Network from Fully Connected Network Step-...
Derivation of Convolutional Neural Network from Fully Connected Network Step-...
 
activelearning.ppt
activelearning.pptactivelearning.ppt
activelearning.ppt
 
Introduction to Neural Networks with Python
Introduction to Neural Networks with PythonIntroduction to Neural Networks with Python
Introduction to Neural Networks with Python
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
 
Knapsack problem solved by Genetic Algorithms
Knapsack problem solved by Genetic AlgorithmsKnapsack problem solved by Genetic Algorithms
Knapsack problem solved by Genetic Algorithms
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
 

Andere mochten auch

K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnSarah Guido
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnArnaud Joly
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsGilles Louppe
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnKan Ouivirach, Ph.D.
 
Intro to scikit-learn
Intro to scikit-learnIntro to scikit-learn
Intro to scikit-learnAWeber
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...PyData
 
Realtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnRealtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnAWeber
 
Machine learning with scikit-learn
Machine learning with scikit-learnMachine learning with scikit-learn
Machine learning with scikit-learnQingkai Kong
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learnodsc
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Gael Varoquaux
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learnJeff Klukas
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanChetan Khatri
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnAsim Jalis
 
Intro to scikit learn may 2017
Intro to scikit learn may 2017Intro to scikit learn may 2017
Intro to scikit learn may 2017Francesco Mosconi
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPôle Systematic Paris-Region
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learnYoss Cohen
 
Introduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnIntroduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnMatt Hagy
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/CategorizationOswal Abhishek
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectGael Varoquaux
 

Andere mochten auch (20)

K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-Learn
 
Intro to scikit-learn
Intro to scikit-learnIntro to scikit-learn
Intro to scikit-learn
 
Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Realtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnRealtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learn
 
Machine learning with scikit-learn
Machine learning with scikit-learnMachine learning with scikit-learn
Machine learning with scikit-learn
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learn
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learn
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learn
 
Intro to scikit learn may 2017
Intro to scikit learn may 2017Intro to scikit learn may 2017
Intro to scikit learn may 2017
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learn
 
Introduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnIntroduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learn
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/Categorization
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
 

Ähnlich wie Accelerating Random Forests in Scikit-Learn

SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel write Python code, get Fortran ...
SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel  write Python code, get Fortran ...SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel  write Python code, get Fortran ...
SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel write Python code, get Fortran ...South Tyrol Free Software Conference
 
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...Piotr Przymus
 
Tokyo Webmining Talk1
Tokyo Webmining Talk1Tokyo Webmining Talk1
Tokyo Webmining Talk1Kenta Oono
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-optJeff Larkin
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Universitat Politècnica de Catalunya
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 
Common Design of Deep Learning Frameworks
Common Design of Deep Learning FrameworksCommon Design of Deep Learning Frameworks
Common Design of Deep Learning FrameworksKenta Oono
 
MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...
MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...
MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...Masashi Shibata
 
Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Christian Peel
 
Deep Learning, Scala, and Spark
Deep Learning, Scala, and SparkDeep Learning, Scala, and Spark
Deep Learning, Scala, and SparkOswald Campesato
 
Google Big Data Expo
Google Big Data ExpoGoogle Big Data Expo
Google Big Data ExpoBigDataExpo
 
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...Yusuke Izawa
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developersAbdul Muneer
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014PyData
 
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...Raffi Khatchadourian
 

Ähnlich wie Accelerating Random Forests in Scikit-Learn (20)

SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel write Python code, get Fortran ...
SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel  write Python code, get Fortran ...SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel  write Python code, get Fortran ...
SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel write Python code, get Fortran ...
 
python.ppt
python.pptpython.ppt
python.ppt
 
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
Everything You Always Wanted to Know About Memory in Python - But Were Afraid...
 
Tokyo Webmining Talk1
Tokyo Webmining Talk1Tokyo Webmining Talk1
Tokyo Webmining Talk1
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
Common Design of Deep Learning Frameworks
Common Design of Deep Learning FrameworksCommon Design of Deep Learning Frameworks
Common Design of Deep Learning Frameworks
 
MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...
MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...
MLOps Case Studies: Building fast, scalable, and high-accuracy ML systems at ...
 
Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015
 
Deep Learning meetup
Deep Learning meetupDeep Learning meetup
Deep Learning meetup
 
Deep Learning, Scala, and Spark
Deep Learning, Scala, and SparkDeep Learning, Scala, and Spark
Deep Learning, Scala, and Spark
 
Google Big Data Expo
Google Big Data ExpoGoogle Big Data Expo
Google Big Data Expo
 
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
Stack Hybridization: A Mechanism for Bridging Two Compilation Strategies in a...
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developers
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
Matopt
MatoptMatopt
Matopt
 
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
 
python.ppt
python.pptpython.ppt
python.ppt
 
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
 

Accelerating Random Forests in Scikit-Learn

  • 1. Accelerating Random Forests in Scikit-Learn. Gilles Louppe, Université de Liège, Belgium. August 29, 2014. 1 / 26
  • 2. Motivation ... and many more applications ! 2 / 26
  • 3. About
      Scikit-Learn: machine learning library for Python; classical and well-established algorithms; emphasis on code quality and usability.
      Myself: @glouppe; PhD student (Liège, Belgium); core developer on Scikit-Learn since 2011; chief tree hugger.
      3 / 26
  • 4. Outline 1 Basics 2 Scikit-Learn implementation 3 Python improvements 4 / 26
  • 5. Machine Learning 101
      Data comes as a set of samples $\mathcal{L} = \{(x_i, y_i) \mid i = 0, \dots, N-1\}$, with
        feature vector $x \in \mathbb{R}^p$ (= input), and
        response $y \in \mathbb{R}$ (regression) or $y \in \{0, 1\}$ (classification) (= output).
      Goal is to find a function $\hat{y} = \varphi(x)$ such that the error $L(y, \hat{y})$ on new (unseen) $x$ is minimal.
      5 / 26
  • 7. Decision Trees
      [Figure: a binary tree with nodes $t_1, \dots, t_{17}$. Each split node tests $X_t \le v_t$ on the input $x$; each leaf node outputs $p(Y = c \mid X = x)$.]
      $t \in \varphi$: nodes of the tree $\varphi$
      $X_t$: split variable at $t$
      $v_t \in \mathbb{R}$: split threshold at $t$
      $\varphi(x) = \arg\max_{c \in \mathcal{Y}} p(Y = c \mid X = x)$
      6 / 26
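  • 8. Random Forests
      [Figure: the input $x$ is passed to trees $\varphi_1, \dots, \varphi_M$; the per-tree estimates $p_{\varphi_m}(Y = c \mid X = x)$ are summed into $p_\psi(Y = c \mid X = x)$.]
      Ensemble of $M$ randomized decision trees $\varphi_m$
      $\psi(x) = \arg\max_{c \in \mathcal{Y}} \frac{1}{M} \sum_{m=1}^{M} p_{\varphi_m}(Y = c \mid X = x)$
      7 / 26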
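The averaging rule on this slide can be sketched in a few lines of plain Python. This is only an illustration: the three "trees" are hypothetical stubs that return fixed class probabilities, and `forest_predict` is a name invented here, not a Scikit-Learn function.

```python
from collections import defaultdict

def forest_predict(trees, x):
    """Average per-tree class probabilities, then return the arg max class."""
    totals = defaultdict(float)
    for tree in trees:
        for label, prob in tree(x).items():
            totals[label] += prob
    averaged = {label: total / len(trees) for label, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged

# Three hypothetical "trees", each a stub returning fixed class probabilities
trees = [
    lambda x: {"a": 0.9, "b": 0.1},
    lambda x: {"a": 0.4, "b": 0.6},
    lambda x: {"a": 0.3, "b": 0.7},
]
label, proba = forest_predict(trees, x=None)
print(label, proba)
```

In this toy case two of the three stubs favour class "b", yet the averaged probabilities favour "a": averaging probabilities is not the same as majority voting over hard predictions.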
  • 9. Learning from data
      function BuildDecisionTree(L)
          Create node t
          if the stopping criterion is met for t then
              ŷ_t = some constant value
          else
              Find the best partition L = L_L ∪ L_R
              t_L = BuildDecisionTree(L_L)
              t_R = BuildDecisionTree(L_R)
          end if
          return t
      end function
      8 / 26
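A minimal runnable sketch of the recursion above, under assumptions the slide does not state: the stopping criterion is "pure node or depth limit reached", the partition is found by exhaustive search over thresholds with Gini impurity, and nodes are plain dictionaries. All function names are made up for this illustration, and none of this is the Scikit-Learn code.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_partition(samples):
    """Try every (feature, threshold) pair; keep the lowest weighted Gini."""
    best = None
    n_features = len(samples[0][0])
    for j in range(n_features):
        for threshold in sorted({x[j] for x, _ in samples}):
            left = [s for s in samples if s[0][j] <= threshold]
            right = [s for s in samples if s[0][j] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini([y for _, y in left])
                     + len(right) * gini([y for _, y in right])) / len(samples)
            if best is None or score < best[0]:
                best = (score, j, threshold, left, right)
    return best

def build_decision_tree(samples, max_depth=3):
    labels = [y for _, y in samples]
    split = None
    if max_depth > 0 and len(set(labels)) > 1:
        split = best_partition(samples)
    if split is None:
        # Stopping criterion met: the leaf predicts the most frequent label
        return {"value": Counter(labels).most_common(1)[0][0]}
    _, j, threshold, left, right = split
    return {"feature": j, "threshold": threshold,
            "left": build_decision_tree(left, max_depth - 1),
            "right": build_decision_tree(right, max_depth - 1)}

def predict(node, x):
    while "value" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["value"]

data = [((1.0, 5.0), "a"), ((2.0, 4.0), "a"), ((3.0, 1.0), "b"), ((4.0, 2.0), "b")]
tree = build_decision_tree(data)
print(tree)
```

The exhaustive search recomputes the impurity from scratch for every candidate threshold, which is simple to read but wasteful on real data.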
  • 10. Outline 1 Basics 2 Scikit-Learn implementation 3 Python improvements 9 / 26
  • 11. History
      Time for building a Random Forest, relative to version 0.10 (by version): 0.10: 1, 0.11: 0.99, 0.12: 0.98, 0.13: 0.33, 0.14: 0.11, 0.15: 0.04
      0.10 (January 2012): first sketch at sklearn.tree and sklearn.ensemble; Random Forests and Extremely Randomized Trees modules
      10 / 26
  • 12. History. 0.11 (May 2012): Gradient Boosted Regression Trees module; out-of-bag estimates in Random Forests. 10 / 26
  • 13. History. 0.12 (October 2012): multi-output decision trees. 10 / 26
  • 14. History. 0.13 (February 2013): speed improvements; rewriting from Python to Cython; support of sample weights; totally randomized trees embedding. 10 / 26
  • 15. History. 0.14 (August 2013): complete rewrite of sklearn.tree; refactoring; Cython enhancements; AdaBoost module. 10 / 26
  • 16. History. 0.15 (August 2014): further speed and memory improvements; better algorithms; Cython enhancements; better parallelism; Bagging module. 10 / 26
  • 17. Implementation overview
      Modular implementation, designed with a strict separation of concerns:
        Builders: for building and connecting nodes into a tree
        Splitters: for finding a split
        Criteria: for evaluating the goodness of a split
        Tree: dedicated data structure
      Efficient algorithmic formulation [See Louppe, 2014]
        Tips. An efficient algorithm is better than a bad one, even if the implementation of the latter is strongly optimized.
        Dedicated sorting procedure
        Efficient evaluation of consecutive splits
      Close to the metal, carefully coded implementation: 2300+ lines of Python, 3000+ lines of Cython, 1700+ lines of tests

          # But we kept it stupid simple for users!
          from sklearn.ensemble import RandomForestClassifier
          clf = RandomForestClassifier()
          clf.fit(X_train, y_train)
          y_pred = clf.predict(X_test)

      11 / 26
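One way to read "efficient evaluation of consecutive splits" is sketched below: sort the samples once on a feature, then sweep the candidate thresholds from left to right while updating the class counts of both children incrementally. The function names are invented for this sketch, Gini impurity is an assumed criterion, and the real Cython splitters and criteria are considerably more involved.

```python
from collections import Counter

def gini_from_counts(counts, n):
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_threshold(values, labels):
    """Sort once, then sweep: each candidate costs O(#classes), not O(N)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    left, right = Counter(), Counter(labels)
    n = len(values)
    best_score, best_thr = float("inf"), None
    for pos in range(n - 1):
        i = order[pos]
        # Move one sample from the right child to the left child
        left[labels[i]] += 1
        right[labels[i]] -= 1
        nxt = values[order[pos + 1]]
        if values[i] == nxt:
            continue  # no threshold can separate two equal values
        n_left = pos + 1
        score = (n_left * gini_from_counts(left, n_left)
                 + (n - n_left) * gini_from_counts(right, n - n_left)) / n
        if score < best_score:
            best_score, best_thr = score, (values[i] + nxt) / 2.0
    return best_thr, best_score

thr, score = best_threshold([2.0, 1.0, 4.0, 3.0], ["a", "a", "b", "b"])
print(thr, score)
```

Because only one sample changes side between two consecutive thresholds, the impurity of each candidate is derived from the running counts instead of being recomputed over all samples.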
  • 19. Development cycle: user feedback, benchmarks, profiling, algorithmic and code improvements, implementation, peer review. 12 / 26
  • 20. Continuous benchmarks. During code review, changes in the tree codebase are monitored with benchmarks. Ensure performance and code quality. Avoid code complexification if it is not worth it. 13 / 26
  • 22. Outline 1 Basics 2 Scikit-Learn implementation 3 Python improvements 14 / 26
  • 23. Disclaimer. Early optimization is the root of all evil. (This took us several years to get it right.) 15 / 26
  • 24. Profiling
      Use profiling tools for identifying bottlenecks.

          In [1]: clf = DecisionTreeClassifier()

          # Timer
          In [2]: %timeit clf.fit(X, y)
          1000 loops, best of 3: 394 µs per loop

          # memory_profiler
          In [3]: %memit clf.fit(X, y)
          peak memory: 48.98 MiB, increment: 0.00 MiB

          # cProfile
          In [4]: %prun clf.fit(X, y)
          ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
          390/32    0.003    0.000    0.004    0.000  _tree.pyx:1257(introsort)
            4719    0.001    0.000    0.001    0.000  _tree.pyx:1229(swap)
               8    0.001    0.000    0.006    0.001  _tree.pyx:1041(node_split)
             405    0.000    0.000    0.000    0.000  _tree.pyx:123(impurity_improvement)
               1    0.000    0.000    0.007    0.007  tree.py:93(fit)
               2    0.000    0.000    0.000    0.000  {method 'argsort' of 'numpy.ndarray'
             405    0.000    0.000    0.000    0.000  _tree.pyx:294(update)
          ...

      16 / 26
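Outside IPython, a similar workflow can be reproduced with the standard library alone. The sketch below uses `timeit` in place of `%timeit` and `cProfile` with `pstats` in place of `%prun`; the function `slow_sum_of_squares` is a hypothetical stand-in for `clf.fit(X, y)`.

```python
import cProfile
import io
import pstats
import timeit

def slow_sum_of_squares(n):
    """Hypothetical workload standing in for clf.fit(X, y)."""
    total = 0
    for i in range(n):
        total += i * i
    return total

# Timer: rough counterpart of %timeit
elapsed = timeit.timeit(lambda: slow_sum_of_squares(10_000), number=20)

# cProfile: rough counterpart of %prun
profiler = cProfile.Profile()
profiler.enable()
result = slow_sum_of_squares(10_000)
profiler.disable()

# Render the five most expensive entries, sorted by cumulative time
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(f"20 runs took {elapsed:.4f} s")
print(report)
```

The report has the same columns as the `%prun` output above (ncalls, tottime, percall, cumtime, filename:lineno(function)).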
  • 27. Profiling (cont.)

          # line_profiler
          In [5]: %lprun -f DecisionTreeClassifier.fit clf.fit(X, y)
          Line  % Time  Line Contents
          =================================
          ...
          256     4.5   self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_)
          257
          258           # Use BestFirst if max_leaf_nodes given; use DepthFirst otherwise
          259     0.4   if max_leaf_nodes < 0:
          260     0.5       builder = DepthFirstTreeBuilder(splitter, min_samples_split,
          261     0.6                                       self.min_samples_leaf,
          262           else:
          263               builder = BestFirstTreeBuilder(splitter, min_samples_split,
          264                                              self.min_samples_leaf, max_depth,
          265                                              max_leaf_nodes)
          266
          267    22.4   builder.build(self.tree_, X, y, sample_weight)
          ...

      17 / 26
  • 29. Call graph

          python -m cProfile -o profile.prof script.py
          gprof2dot -f pstats profile.prof -o graph.dot

      18 / 26
  • 30. Python is slow :-(
      Python overhead is too large for high-performance code.
      Whenever feasible, use high-level operations (e.g., SciPy or NumPy operations on arrays) to limit Python calls and rely on highly-optimized code.

          def dot_python(a, b):  # Pure Python (2.09 ms)
              s = 0
              for i in range(a.shape[0]):
                  s += a[i] * b[i]
              return s

          np.dot(a, b)  # NumPy (5.97 us)

      Otherwise (and only then!), write compiled C extensions (e.g., using Cython) for critical parts.

          cpdef dot_mv(double[::1] a, double[::1] b):  # Cython (7.06 us)
              cdef double s = 0
              cdef int i
              for i in range(a.shape[0]):
                  s += a[i] * b[i]
              return s

      19 / 26
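The same idea can be tried without NumPy or Cython: below, an explicit Python loop is compared with `sum(map(operator.mul, a, b))`, where the iteration is driven by built-ins implemented in C. This is a standard-library analogue of the slide's comparison, not a reproduction of its timings; the function names and input sizes are arbitrary choices for this sketch.

```python
import operator
import timeit

def dot_loop(a, b):
    """Explicit loop: every index, multiply and add is interpreted bytecode."""
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

def dot_builtin(a, b):
    """Same computation, with the iteration pushed into C-level built-ins."""
    return sum(map(operator.mul, a, b))

a = [0.5 * i for i in range(10_000)]
b = [1.0 / (i + 1) for i in range(10_000)]

t_loop = timeit.timeit(lambda: dot_loop(a, b), number=50)
t_builtin = timeit.timeit(lambda: dot_builtin(a, b), number=50)
print(f"loop: {t_loop:.4f} s, built-ins: {t_builtin:.4f} s")
```

Actual timings depend on the interpreter and machine, so none are quoted here.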
  • 31. Stay close to the metal
      Use the right data type for the right operation.
      Avoid repeated access (if at all) to Python objects. Trees are represented by single arrays.
      Tips. In Cython, check for hidden Python overhead. Limit yellow lines as much as possible!

          cython -a tree.pyx

      20 / 26
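To illustrate what "trees are represented by single arrays" can look like, here is a small sketch using typed arrays from the standard library. The layout (one parallel array per node attribute, with `feature == -1` marking a leaf) is a convention made up for this example and is not claimed to match the internal Scikit-Learn `Tree` structure.

```python
from array import array

# Hypothetical flat layout: node k is described by position k in each array.
# feature == -1 marks a leaf (a convention invented for this sketch).
feature = array("i", [0, -1, 1, -1, -1])
threshold = array("d", [2.5, 0.0, 1.0, 0.0, 0.0])
left = array("i", [1, -1, 3, -1, -1])
right = array("i", [2, -1, 4, -1, -1])
value = array("i", [-1, 0, -1, 1, 2])  # predicted class, meaningful at leaves

def predict(x):
    """Walk from the root (node 0) down to a leaf using only array lookups."""
    node = 0
    while feature[node] != -1:
        if x[feature[node]] <= threshold[node]:
            node = left[node]
        else:
            node = right[node]
    return value[node]

print(predict((1.0, 9.0)), predict((3.0, 0.5)), predict((3.0, 2.0)))
```

Compared with one Python object per node, typed arrays store plain numbers side by side, which is the kind of layout compiled code can traverse without touching Python objects.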
  • 32. Stay close to the metal (cont.)
      Take care of data locality and contiguity.
      Make data contiguous to leverage CPU prefetching and cache mechanisms.
      Access data in the same way it is stored in memory.
      Tips. If accessing values row-wise (resp. column-wise), make sure the array is C-ordered (resp. Fortran-ordered).

          cdef int[::1, :] X = np.asfortranarray(X, dtype=np.intc)
          cdef int i, j = 42
          cdef int s = 0
          for i in range(...):
              s += X[i, j]  # Fast
              s += X[j, i]  # Slow

      If not feasible, use pre-buffering.
      21 / 26
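Why the column walk is fast on a Fortran-ordered array can be seen from the flat memory offsets alone. The helper `flat_index` below is a hypothetical illustration of the two layouts, not a NumPy or Cython function.

```python
def flat_index(i, j, n_rows, n_cols, order):
    """Offset of element (i, j) in a flat buffer holding an n_rows x n_cols array."""
    if order == "C":   # row-major: each row is contiguous
        return i * n_cols + j
    if order == "F":   # column-major: each column is contiguous
        return j * n_rows + i
    raise ValueError("order must be 'C' or 'F'")

n_rows, n_cols = 4, 3
# Walk down column j = 1, as in the loop `s += X[i, j]`
col_walk_f = [flat_index(i, 1, n_rows, n_cols, "F") for i in range(n_rows)]
col_walk_c = [flat_index(i, 1, n_rows, n_cols, "C") for i in range(n_rows)]
print(col_walk_f)  # consecutive offsets
print(col_walk_c)  # offsets n_cols apart
```

With the Fortran layout, successive iterations touch neighbouring memory cells; with the C layout, the same walk jumps by a full row each time, which works against prefetching and caching.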
  • 33. Stay close to the metal (cont.)
      Arrays accessed with bare pointers remain the fastest solution we have found (sadly).
      NumPy arrays or MemoryViews are slightly slower.
      Require some pointer kung-fu.
      Timings from the code comments on the slide: 7.06 us, 6.35 us
      22 / 26
  • 34. Efficient parallelism in Python is possible! 23 / 26
  • 35. Joblib
      Scikit-Learn implementation of Random Forests relies on joblib for building trees in parallel.
      Multi-processing backend
      Multi-threading backend
      Require C extensions to be GIL-free
      Tips. Use nogil declarations whenever possible.
      Avoid memory duplication

          trees = Parallel(n_jobs=self.n_jobs)(
              delayed(_parallel_build_trees)(tree, X, y, ...)
              for i, tree in enumerate(trees))

      24 / 26
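The pattern of building independent trees in parallel can be sketched with the standard library's `concurrent.futures` as a stand-in for joblib. Everything here is an assumption made for illustration: `build_tree` is a hypothetical placeholder that only draws a bootstrap sample and returns its majority label, and a thread pool replaces joblib's backends.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def build_tree(seed, samples):
    """Stand-in for fitting one randomized tree: draw a bootstrap sample
    and return its majority label."""
    rng = random.Random(seed)  # one private generator per task
    bootstrap = [rng.choice(samples) for _ in samples]
    labels = [y for _, y in bootstrap]
    return max(sorted(set(labels)), key=labels.count)

samples = [((0.1,), "a"), ((0.2,), "a"), ((0.3,), "a"), ((0.9,), "b")]
seeds = range(8)

# Each task only reads the shared `samples` list, so nothing is copied
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(lambda s: build_tree(s, samples), seeds))

sequential = [build_tree(s, samples) for s in seeds]
print(parallel == sequential)
```

Giving each task its own seeded generator keeps the result independent of scheduling. With pure-Python work such as this, threads do not run concurrently because of the GIL, which is why the slide asks for C extensions that release it.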
  • 36. A winning strategy
      Scikit-Learn implementation proves to be one of the fastest among all libraries and programming languages.
      [Figure: bar chart of fit time (s) per implementation]
        Scikit-Learn (Python, Cython): Scikit-Learn-RF 203.01, Scikit-Learn-ETs 211.53
        OpenCV (C++): OpenCV-RF 4464.65, OpenCV-ETs 3342.83
        OK3 (C): OK3-RF 1518.14, OK3-ETs 1711.94
        Weka (Java): Weka-RF 1027.91
        randomForest (R, Fortran): R-RF 13427.06
        Orange (Python): Orange-RF 10941.72
      25 / 26
  • 37. Summary. The open source development cycle really empowered the Scikit-Learn implementation of Random Forests. Combine algorithmic improvements with code optimization. Make use of profiling tools to identify bottlenecks. Optimize only critical code! 26 / 26