1. Tech Talk: TPOT
“The Data Science Assistant”
Francis Nguyen
Hoffman Lab
July, 2017
2. TPOT
1. Introduction
2. Usage
3. Limitations
4. Conclusions
TPOT logo from official documentation @ http://rhiever.github.io/tpot/
Introduction: What is TPOT?
Image adapted from official TPOT documentation @ http://rhiever.github.io/tpot/
Introduction: How TPOT works
[Figure labels: Manual Steps; Automated by scikit-learn; Manual Step]
Introduction: Scikit built-ins
Exhaustive Grid Search and Randomized Parameter Optimization
Both methods...
● ...help find optimal hyperparameters for a given model
● ...are very easy to use
● ...are easily parallelizable
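A minimal sketch of the two scikit-learn built-ins side by side. The iris dataset, the SVC model, and the parameter values are assumptions chosen only for illustration; n_jobs is the argument that controls parallelism.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Exhaustive grid search: every combination in the grid is cross-validated.
grid = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["linear", "rbf"], "C": [1, 10, 100]},
    cv=3,
    n_jobs=1,  # raise this to evaluate candidates in parallel
)
grid.fit(X, y)

# Randomized search: only n_iter sampled candidates are cross-validated.
rand = RandomizedSearchCV(
    SVC(),
    param_distributions={"kernel": ["linear", "rbf"], "C": loguniform(1, 100)},
    n_iter=4,
    cv=3,
    n_jobs=1,
    random_state=0,
)
rand.fit(X, y)

print("grid search best:", grid.best_params_)
print("randomized search best:", rand.best_params_)
```

Both objects expose the same fit / best_params_ interface, which is a large part of why they are easy to use.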
Exhaustive Grid Search
Kernel    Error Penalty (C)
linear    1
linear    10
linear    100
rbf       1
rbf       10
Can be very slow!
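To make the cost of an exhaustive search concrete, the sketch below enumerates a grid like the one in the table with scikit-learn's ParameterGrid. The full two-by-three grid and the fold count are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Every kernel is paired with every C value.
grid = ParameterGrid({"kernel": ["linear", "rbf"], "C": [1, 10, 100]})
for params in grid:
    print(params)

n_candidates = len(grid)
cv_folds = 5  # assumed number of cross-validation folds
print(n_candidates, "candidates,", n_candidates * cv_folds, "model fits")

# Adding one more parameter multiplies the number of candidates.
bigger = ParameterGrid(
    {"kernel": ["linear", "rbf"], "C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
)
print(len(bigger), "candidates after adding gamma")
```

The candidate count is the product of the list lengths, so each extra parameter multiplies the work.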
Randomized Parameter Optimization
● Used when an exhaustive grid search is too computationally intensive
● Candidates are sampled at random, so the cost is set by the number of samples drawn; adding more parameters does not, by itself, make the search slower
Screenshots of official scikit-learn documentation @ http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
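A small sketch of the fixed-budget idea behind randomized search, using scikit-learn's ParameterSampler. The search spaces and the budget of five candidates are assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

space = {"kernel": ["linear", "rbf"], "C": loguniform(1, 100)}
# The same space with one extra parameter added.
bigger_space = dict(space, gamma=loguniform(1e-4, 1e-1))

budget = 5  # number of candidates we are willing to evaluate
small = list(ParameterSampler(space, n_iter=budget, random_state=0))
large = list(ParameterSampler(bigger_space, n_iter=budget, random_state=0))

# The number of candidates is fixed by the budget, not by the space size.
print(len(small), len(large))
for params in large:
    print(params)
```

The trade-off is coverage: with a fixed budget, a larger space is sampled more thinly.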
Introduction: TPOT
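Comparison of Exhaustive Grid Search, Randomized Parameter Optimization, and Tree-based Pipeline OpTimization (TPOT)
Speed
● Exhaustive Grid Search: Very slow
● Randomized Parameter Optimization: Scalable to project constraints
● Tree-based Pipeline OpTimization: Scalable to project constraints
Breadth
● Exhaustive Grid Search: Searches all possible solutions
● Randomized Parameter Optimization: Randomly selects solutions
● Tree-based Pipeline OpTimization: Approaches best solution via genetic programming
Steps Required
● Exhaustive Grid Search: Data cleanup; model and hyperparameter choice
● Randomized Parameter Optimization: Data cleanup; model and hyperparameter choice
● Tree-based Pipeline OpTimization: Data cleanup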
Figure from Olson et al., EvoApplications (2016), pp. 123-137
Introduction: How TPOT works
Feature Selection or Construction
Combination
Classification
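The three stages named above (feature selection or construction, combination, classification) map onto ordinary scikit-learn building blocks. The sketch below is a hand-written pipeline of that shape, not one produced by TPOT; the dataset, operators, and settings are all assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    # Feature construction on two copies of the data, then combination
    # of the two resulting feature sets into one matrix.
    ("combine", FeatureUnion([
        ("pca", PCA(n_components=2)),
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ])),
    # Feature selection: keep the five highest-scoring features.
    ("select", SelectKBest(f_classif, k=5)),
    # Classification
    ("classify", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X, y)
print("training accuracy:", pipeline.score(X, y))
```

TPOT's job is to search over which operators go in each position and how they are configured, rather than having them chosen by hand as here.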
Image from official TPOT documentation @ http://rhiever.github.io/tpot/
Introduction: Genetic Programming
Step 1: Create population_size (default 100) random classification algorithms
Step 2: Evaluate their performance on the metric specified by scoring (default: “accuracy”; other metrics such as “f1” and “recall” are also available)
Step 3: Create a new population out of:
● 10% copies of the best-performing algorithm
● 90% winners of “three-way tournaments” among the rest of the population
○ Both accuracy and simplicity are optimized for here
Step 4: Mutate pipelines according to mutation_rate and crossover_rate:
● Similarly to mutations in DNA, pipeline operators may be replaced, inserted, or
deleted according to the mutation_rate parameter
● Crossover, where parts of one pipeline are cut-and-pasted into another pipeline, is controlled via the crossover_rate parameter
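A toy sketch of Step 4, treating a pipeline as a flat list of operator names. This is only an illustration of replace/insert/delete mutation and cut-and-paste crossover; TPOT's own implementation works on tree-shaped pipelines, and the operator names, the helper functions, and the way the two rates are combined here are all assumptions.

```python
import random

# Hypothetical operator names, used only for illustration.
OPERATORS = ["PCA", "PolynomialFeatures", "SelectKBest", "StandardScaler"]


def mutate(pipeline, rng):
    """Return a copy with one operator replaced, inserted, or deleted."""
    child = list(pipeline)
    kind = rng.choice(["replace", "insert", "delete"])
    if kind == "replace":
        child[rng.randrange(len(child))] = rng.choice(OPERATORS)
    elif kind == "insert":
        child.insert(rng.randrange(len(child) + 1), rng.choice(OPERATORS))
    elif len(child) > 1:  # delete, but never empty the pipeline
        del child[rng.randrange(len(child))]
    return child


def crossover(parent_a, parent_b, rng):
    """Paste the tail of parent_b onto the head of parent_a."""
    cut_a = rng.randrange(1, len(parent_a) + 1)
    cut_b = rng.randrange(0, len(parent_b))
    return parent_a[:cut_a] + parent_b[cut_b:]


def vary(population, mutation_rate, crossover_rate, rng):
    """Apply crossover or mutation to each pipeline with the given rates."""
    # Assumption of this sketch: the two rates are exclusive probabilities.
    if mutation_rate + crossover_rate > 1.0:
        raise ValueError("mutation_rate + crossover_rate must not exceed 1.0")
    offspring = []
    for pipeline in population:
        roll = rng.random()
        if roll < crossover_rate:
            offspring.append(crossover(pipeline, rng.choice(population), rng))
        elif roll < crossover_rate + mutation_rate:
            offspring.append(mutate(pipeline, rng))
        else:
            offspring.append(list(pipeline))
    return offspring


rng = random.Random(0)
population = [["PCA", "RandomForest"], ["SelectKBest", "LogisticRegression"]]
print(vary(population, mutation_rate=0.9, crossover_rate=0.1, rng=rng))
```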
Step 5: Repeat steps 2-4 n times (where n is controlled via the generations parameter)
● Subsequent generations will only be offspring_size large
● In total, TPOT evaluates population_size + generations *
offspring_size pipelines
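The total above is easy to work out before committing to a long run. The helper below is a hypothetical convenience function, not part of TPOT; it assumes offspring_size falls back to population_size when it is not given.

```python
def total_pipelines(population_size, generations, offspring_size=None):
    """Number of pipelines evaluated over a whole run."""
    if offspring_size is None:
        # Assumption: offspring_size defaults to population_size when unset.
        offspring_size = population_size
    return population_size + generations * offspring_size


print(total_pipelines(population_size=100, generations=5))    # 600
print(total_pipelines(population_size=100, generations=100))  # 10100
print(total_pipelines(population_size=20, generations=5, offspring_size=10))  # 70
```

Each of those pipelines is also cross-validated, so the number of model fits is a multiple of this figure.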
Usage: installation
Requires:
● numpy, scipy, scikit-learn (via pip or conda)
● deap, update_checker, tqdm (via pip)
● py-xgboost (via pip; optional). Warning: ill-behaved, crashes on download.q
● tpot (via pip)
Installing tpot also installs a command-line utility along with the Python library
Usage: Python example
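The slide's code screenshot did not survive extraction, so here is a minimal sketch in its place. It assumes the classic TPOT API of the 0.x series (TPOTClassifier with generations, population_size, scoring, random_state, and verbosity, plus fit, score, and export); the digits dataset, the small search settings, and the output filename are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Small search settings so a demo finishes quickly; real runs use far more.
TPOT_PARAMS = {
    "generations": 5,
    "population_size": 20,
    "scoring": "accuracy",
    "random_state": 42,
    "verbosity": 2,
}


def run_tpot_demo(export_path="tpot_digits_pipeline.py"):
    """Search for a pipeline on the digits dataset and export the best one."""
    # Imported here so this module still loads where tpot is not installed.
    from tpot import TPOTClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.75, random_state=42
    )
    tpot = TPOTClassifier(**TPOT_PARAMS)
    tpot.fit(X_train, y_train)
    print("held-out accuracy:", tpot.score(X_test, y_test))
    # Write the best pipeline found as standalone scikit-learn code.
    tpot.export(export_path)
    return export_path
```

Calling run_tpot_demo() starts the search; even with these small settings it can take several minutes.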
Usage: Command-line
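A minimal sketch of a command-line invocation, assembled in Python so the pieces are explicit. It uses only the flags covered in this section's notes (-is, -target, -njobs) and assumes the input file is passed as the first positional argument; the file name, column names, and data are hypothetical.

```python
import csv
import os
import shlex
import tempfile

# Hypothetical file and column names, used only for illustration.
TARGET_COLUMN = "class"
input_file = os.path.join(tempfile.mkdtemp(), "data.csv")

# The input file carries a header row; the target column is named in it.
with open(input_file, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["feature_1", "feature_2", TARGET_COLUMN])
    writer.writerows([[0.1, 1.2, 0], [0.4, 0.9, 1], [0.3, 1.1, 0]])

command = [
    "tpot", input_file,
    "-is", ",",                # column separator of the input file
    "-target", TARGET_COLUMN,  # name of the column holding the labels
    "-njobs", "4",             # match the number of cores you reserved
]
# Print the command in a form that can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in command))
```

The sketch only prints the command; run the printed line inside a session that has reserved the matching number of cores.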
Things to note with the command-line interface:
● -is (the input file's column separator) should be specified
● The input file should have column names; -target should be the name of the column holding the class labels
● -njobs is meant to be used within a parallel environment:
○ When using qlogin, qsub, or qrsh, use -pe smp <n> to reserve <n> cores on your target machine
Limitations:
● Only finds solutions that can be built with scikit-learn
● Only works on supervised classification/regression problems
● Long run times: the documentation recommends running it for days or longer for best results
● Strangely difficult (but possible) to install on the cluster: it has many dependencies, which must be installed in order, and one of them runs into memory issues on download.q
● Gives no insight into why a particular model or set of hyperparameters was chosen
Three-way tournament:
Given three random pipelines from the existing population...
...remove the worst-performing one...
...then remove the more complex of the remaining two
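The tournament above can be sketched in a few lines. The Candidate record, the use of operator count as the complexity measure, and the example scores are assumptions for illustration, not TPOT's internal representation.

```python
import random
from collections import namedtuple

# Hypothetical record: a pipeline with its score and its operator count.
Candidate = namedtuple("Candidate", ["name", "score", "n_operators"])


def three_way_tournament(population, rng):
    """Pick three random pipelines, drop the worst, then drop the more complex."""
    contenders = rng.sample(population, 3)
    # Remove the worst-performing of the three.
    contenders.remove(min(contenders, key=lambda c: c.score))
    # Of the remaining two, keep the simpler one.
    return min(contenders, key=lambda c: c.n_operators)


population = [
    Candidate("A", score=0.90, n_operators=5),
    Candidate("B", score=0.85, n_operators=2),
    Candidate("C", score=0.92, n_operators=3),
]
winner = three_way_tournament(population, random.Random(0))
print(winner.name)  # C
```

With only three candidates every tournament sees all of them: B is dropped for its score, and C beats A because it is simpler. B's low operator count never helps it, since performance is checked first.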
Editor's notes
GECCO
Apparently used when you have a computational budget that you can't exceed; also, adding more parameters doesn't reduce performance
EvoApplications 2016: Applications of Evolutionary Computation, pp. 123-137