PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions
- 1. Why Ensembles Win Data Mining Competitions
A Predictive Analytics Center of Excellence (PACE) Tech Talk
November 14, 2012
Dean Abbott
Abbott Analytics, Inc.
Blog: http://abbottanalytics.blogspot.com
URL: http://www.abbottanalytics.com
Twitter: @deanabb
Email: dean@abbottanalytics.com
Copyright © 2000-2012, Abbott Analytics, Inc. All rights reserved.
- 2. Outline
Motivation for Ensembles
How Ensembles are Built
Do Ensembles Violate Occam's Razor?
Why Do Ensembles Win?
- 3. PAKDD Cup 2007 Results: Score Metric Changes Winner
Slide annotation: arrows mark the rows that are ensembles.
| Modeling Technique | Modeling Implementation | Participant Location | Participant Type | AUCROC (Trapezoidal Rule) | AUCROC Rank | Top Decile Response Rate | Top Decile Response Rate Rank |
|---|---|---|---|---|---|---|---|
| TreeNet + Logistic Regression | Salford Systems | Mainland China | Practitioner | 70.01% | 1 | 13.00% | 7 |
| Probit Regression | SAS | USA | Practitioner | 69.99% | 2 | 13.13% | 6 |
| MLP + n-Tuple Classifier | | Brazil | Practitioner | 69.62% | 3 | 13.88% | 1 |
| TreeNet | Salford Systems | USA | Practitioner | 69.61% | 4 | 13.25% | 4 |
| TreeNet | Salford Systems | Mainland China | Practitioner | 69.42% | 5 | 13.50% | 2 |
| Ridge Regression | Rank | Belgium | Practitioner | 69.28% | 6 | 12.88% | 9 |
| 2-Layer Linear Regression | | USA | Practitioner | 69.14% | 7 | 12.88% | 9 |
| Logistic Regression + Decision Stump + AdaBoost + VFI | | Mainland China | Academia | 69.10% | 8 | 13.25% | 4 |
| Logistic Average of Single Decision Functions | | Australia | Practitioner | 68.85% | 9 | 12.13% | 17 |
| Logistic Regression | Weka | Singapore | Academia | 68.69% | 10 | 12.38% | 16 |
| Logistic Regression | | Mainland China | Practitioner | 68.58% | 11 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression | | Singapore | | 68.54% | 12 | 13.00% | 7 |
| Scorecard Linear Additive Model | Xeno | USA | Practitioner | 68.28% | 13 | 11.75% | 20 |
| Random Forest | Weka | USA | | 68.04% | 14 | 12.50% | 14 |
| Expanding Regression Tree + RankBoost + Bagging | Weka | Mainland China | Academia | 68.02% | 15 | 12.50% | 14 |
| Logistic Regression | SAS + Salford Systems | India | Practitioner | 67.58% | 16 | 12.00% | 19 |
| J48 + BayesNet | Weka | Mainland China | Academia | 67.56% | 17 | 11.63% | 21 |
| Neural Network + General Additive Model | Tiberius | USA | Practitioner | 67.54% | 18 | 11.63% | 21 |
| Decision Tree + Neural Network | | Mainland China | Academia | 67.50% | 19 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 66.71% | 20 | 13.50% | 2 |
| Neural Network | SAS | USA | Academia | 66.36% | 21 | 12.13% | 17 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 65.95% | 22 | 11.63% | 21 |
| Neural Network | SAS | USA | Academia | 65.69% | 23 | 9.25% | 32 |
| Multi-dimension Balanced Random Forest | | Mainland China | Academia | 65.42% | 24 | 12.63% | 13 |
| Neural Network | SAS | USA | Academia | 65.28% | 25 | 11.00% | 26 |
| CHAID Decision Tree | SPSS | Argentina | Academia | 64.53% | 26 | 11.25% | 24 |
| Under-Sampling Based on Clustering + CART Decision Tree | | Taiwan | Academia | 64.45% | 27 | 11.13% | 25 |
| Decision Tree + Neural Network + Polynomial Regression | SAS | USA | Academia | 64.26% | 28 | 9.38% | 30 |
- 4. Netflix Prize
2006 Netflix State-of-the-art (Cinematch)
RMSE = 0.9525
Prize: reduce this RMSE by 10% => 0.8572
2007: Korbell team Progress Prize winner
– 107 algorithm ensemble
– Top algorithm: SVD with RMSE = 0.8914
– 2nd algorithm: Restricted Boltzmann Machine with RMSE = 0.8990
– Mini-ensemble (SVD+RBM) has RMSE = 0.88
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
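The arithmetic behind these figures can be checked with a short sketch. The RMSE values are the ones quoted on the slide; the function and variable names are illustrative only.

```python
# RMSE of the 2006 Cinematch baseline, as quoted on the slide.
CINEMATCH_RMSE = 0.9525


def relative_improvement(rmse: float, baseline: float = CINEMATCH_RMSE) -> float:
    """Fractional RMSE reduction relative to the baseline."""
    return (baseline - rmse) / baseline


# A 10% reduction means multiplying the baseline by 0.9.
prize_target = CINEMATCH_RMSE * (1 - 0.10)

# 2007 Progress Prize figures from the slide.
results = {
    "SVD": 0.8914,
    "RBM": 0.8990,
    "SVD+RBM mini-ensemble": 0.88,
}

for name, rmse in results.items():
    print(f"{name}: {relative_improvement(rmse):.1%} better than Cinematch")
print(f"Prize target RMSE: {prize_target:.5f}")
```

The two-model mini-ensemble already improves on either component alone, which is the point of the slide.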
- 5. Common Kinds of Ensembles vs. Single Models
[Figure: list of classifiers, with a brace grouping the ensemble methods apart from the single classifiers]
From Zhuowen Tu, “Ensemble Classification Methods: Bagging,
Boosting, and Random Forests”
- 6. What are Model Ensembles?
Combining outputs from multiple models into a single decision
Models can be created using the same algorithm, or several different algorithms
[Diagram: model outputs -> decision logic -> ensemble prediction]
- 7. Creating Model Ensembles Step 1: Generate Component Models
Can vary data or model parameters:
Case (record) weights: bootstrapping, sampling
Data values: add noise, recode data
Learning parameters: vary learning rates, pruning severity, random seeds
Variable subsets: vary candidate inputs, features
[Diagram: single data set -> multiple models and predictions]
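As a rough illustration of the sources of variation listed above, the sketch below draws one "recipe" per component model: a bootstrap sample of rows, a random subset of candidate inputs, and a random seed. The function names and the variable names (`age`, `income`, and so on) are hypothetical and not from the talk.

```python
import random


def bootstrap_rows(n_rows, rng):
    """Case weights via bootstrapping: sample row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_variable_subset(variables, k, rng):
    """Variable subsets: pick k candidate inputs without replacement."""
    return sorted(rng.sample(variables, k))


def component_recipes(n_models, n_rows, variables, k, seed=0):
    """One recipe per component model: the rows, inputs, and seed it will use."""
    rng = random.Random(seed)
    recipes = []
    for _ in range(n_models):
        recipes.append({
            "rows": bootstrap_rows(n_rows, rng),
            "inputs": random_variable_subset(variables, k, rng),
            "seed": rng.randrange(10**6),  # learning-parameter variation
        })
    return recipes


# Hypothetical input names, for illustration only.
recipes = component_recipes(
    n_models=5, n_rows=100,
    variables=["age", "income", "tenure", "region"], k=2,
)
for r in recipes:
    print(r["inputs"], r["seed"], len(set(r["rows"])), "distinct rows")
```

Each recipe would then be handed to whatever modeling algorithm is in use; the diversity comes from the recipes, not from the algorithm.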
- 8. Creating Model Ensembles Step 2: Combining Models
Combining Methods
– Estimation: Average outputs
– Classification: Average probabilities or vote (best M of N)
Variance Reduction
– Build complex, overfit models
– All models built in same manner
Bias Reduction
– Build simple models
– Subsequent models weight records with errors more (or model actual errors)
[Diagram: multiple models and predictions -> combine -> decision or prediction value]
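The combining methods on this slide are simple enough to state in a few lines. This is a minimal sketch; the function names and the 0.5 default threshold are assumptions for illustration.

```python
from statistics import mean


def average_outputs(predictions):
    """Estimation: average the numeric outputs of the component models."""
    return mean(predictions)


def average_probability_decision(probabilities, threshold=0.5):
    """Classification: average the model probabilities, then threshold."""
    return mean(probabilities) > threshold


def vote_m_of_n(decisions, m):
    """Classification: positive when at least m of the n models vote positive."""
    return sum(decisions) >= m


print(average_outputs([1.0, 2.0, 3.0]))
print(average_probability_decision([0.2, 0.9, 0.7]))
print(vote_m_of_n([True, False, True], m=2))
```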
- 9. How Model Complexity Affects Errors
Giovanni Seni, John Elder, Ensemble Methods in Data Mining:
Improving Accuracy Through Combining Predictions, Morgan and
Claypool Publishers, 2010 (ISBN: 978-1608452842)
- 10. Commonly Used Information-Theoretic Complexity Penalties
BIC: Bayesian Information Criterion
AIC: Akaike Information Criterion
MDL: Minimum Description Length
For a nice summary:
http://en.wikipedia.org/wiki/Regularization_(mathematics)
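For reference, the usual textbook forms of the first two penalties are given below; they are not shown on the slide. Here k is the number of estimated parameters, n the number of observations, and \hat{L} the maximized likelihood. Lower values are preferred, and BIC penalizes each extra parameter more heavily than AIC once \ln n > 2.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```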
- 11. Four Keys to Effective Ensembling
Diversity of opinion
Independence
Decentralization
Aggregation
From The Wisdom of Crowds, James
Surowiecki
- 12. Bagging
Bagging Method
– Create many data sets by
bootstrapping (can also do this
with cross validation)
– Create one decision tree for
each data set
– Combine decision trees by
averaging (or voting) final
decisions
– Primarily reduces model
variance rather than bias
Results
– On average, better than any individual tree
[Diagram: trees combined into a final answer (average)]
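The bagging steps above can be sketched in plain Python. To keep the example short, each "tree" is a one-split regression tree (a stump) on a single input, and the data set is synthetic; both are simplifying assumptions, not the setup used in the talk.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Least-squares regression tree with a single split on one input."""
    best = None
    for t in sorted(set(xs))[1:]:  # candidate thresholds; both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values identical: fall back to the mean
        m = mean(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def bagged_stumps(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(m(x) for m in models)


# Synthetic step function with a little noise.
rng = random.Random(1)
xs = [i / 10 for i in range(40)]
ys = [(1.0 if x >= 2.0 else 0.0) + rng.gauss(0, 0.1) for x in xs]

predict = bagged_stumps(xs, ys)
print(predict(0.5), predict(3.5))
```

Because each stump sees a slightly different sample, the split points differ, and the average is smoother near the step than any single stump.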
- 13. Boosting (AdaBoost)
Boosting Method
– Create a tree using the training data set
– Score each data point, indicating when each incorrect decision is made (errors)
– Retrain, giving rows with incorrect decisions more weight. Repeat
– Final prediction is a weighted average of all models, which acts as model regularization.
– Best to create weak models (simple models, just a few splits for a decision tree) and let the boosting iterations find the complexity.
– Often used with trees or Naïve Bayes
Results
– Usually better than individual tree or Bagging
[Diagram: reweight examples where classification incorrect -> combine models via weighted sum]
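A minimal sketch of the reweighting loop, assuming labels coded as +1/-1, a single input, and decision stumps as the weak models. The toy data (positives in the middle of the range) is chosen because no single stump can fit it; all names are illustrative.

```python
import math


def stump_predict(x, threshold, polarity):
    """Predict `polarity` at or above the threshold, the opposite below it."""
    return polarity if x >= threshold else -polarity


def fit_weighted_stump(xs, ys, w):
    """Find the threshold and polarity with the lowest weighted error."""
    best = (float("inf"), None, None)
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if stump_predict(x, t, polarity) != y)
            if err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost(xs, ys, n_rounds=3):
    n = len(xs)
    w = [1.0 / n] * n          # start with equal row weights
    ensemble = []              # (alpha, threshold, polarity) per round
    for _ in range(n_rounds):
        err, t, pol = fit_weighted_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, pol))
        # Misclassified rows gain weight, correct rows lose weight.
        w = [wi * math.exp(-alpha * y * stump_predict(x, t, pol))
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Final prediction: sign of the alpha-weighted sum of stump votes."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


xs = list(range(10))
ys = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]
ensemble = adaboost(xs, ys, n_rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy data the best single stump misclassifies three of the ten rows, while the weighted sum of three stumps classifies all ten correctly.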
- 14. Random Forest Ensembles
Random Forest (RF) Method
– Exact same methodology as
Bagging, but with a twist
– At each split, rather than using the
entire set of candidate inputs, use
a random subset of candidate
inputs
– Generates diversity of samples and
inputs (splits)
Results
– On average, better than any individual tree, Bagging, or even Boosting
[Diagram: trees combined into a final answer (average)]
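The "twist" is confined to how each split is chosen. The sketch below shows a single split search restricted to a random subset of the inputs, using weighted Gini impurity as the split criterion; the criterion, the function names, and the toy data are assumptions for illustration.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels, candidate_inputs):
    """Best (score, input index, threshold) among the given candidate inputs."""
    best = None
    for j in candidate_inputs:
        for t in sorted({r[j] for r in rows})[1:]:
            left = [y for r, y in zip(rows, labels) if r[j] < t]
            right = [y for r, y in zip(rows, labels) if r[j] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


def random_forest_split(rows, labels, n_try, rng):
    """Search only a random subset of n_try inputs at this split."""
    subset = rng.sample(range(len(rows[0])), n_try)
    return subset, best_split(rows, labels, subset)


# Toy data: three inputs; only the first one separates the classes cleanly.
rows = [(i, (i * 7) % 10, (i * 3) % 10) for i in range(10)]
labels = [1 if i >= 5 else 0 for i in range(10)]

subset, split = random_forest_split(rows, labels, n_try=2, rng=random.Random(0))
print("inputs considered:", subset, "chosen split:", split)
```

When the strongest input is left out of the subset, the tree is forced to split on a weaker one, which is where the extra diversity comes from.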
- 15. Stochastic Gradient Boosting
Implemented in MART (Jerry Friedman), and TreeNet (Salford Systems)
Algorithm
– Begin with a simple model: a constant value for a model
– Build a simple tree (perhaps 6 terminal nodes); now there are 6 possible levels, whereas before there was one level
– Score the model and compute errors. The score is the sum of all previous trees, weighted by a learning rate
– Build a new tree with the errors as the target variable.
Results
– TreeNet has won 2 KDD-Cup competitions and numerous others
– It is less sensitive to outliers and less prone to overfitting than AdaBoost
[Diagram: build tree -> predict errors in ensemble so far -> combine models via weighted sum -> final answer (additive model)]
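The loop described above can be sketched for a regression target with squared error. Two simplifications are assumed here: each tree is a one-split stump rather than a 6-node tree, and the "stochastic" part is modeled as fitting each tree on a random half of the rows. This is an illustration of the idea, not the MART or TreeNet implementation.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Least-squares regression tree with a single split on one input."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values identical
        m = mean(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1, subsample=0.5, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    base = mean(ys)               # begin with a constant model
    scores = [base] * n
    trees = []
    for _ in range(n_trees):
        # Errors of the ensemble so far become the next tree's target.
        errors = [y - s for y, s in zip(ys, scores)]
        rows = rng.sample(range(n), max(2, int(subsample * n)))
        tree = fit_stump([xs[i] for i in rows], [errors[i] for i in rows])
        trees.append(tree)
        scores = [s + learning_rate * tree(x) for s, x in zip(scores, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)


def mse(model, xs, ys):
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))


xs = [i / 2 for i in range(20)]
ys = [x * x for x in xs]

boosted = gradient_boost(xs, ys)
constant = lambda x: mean(ys)
print(mse(constant, xs, ys), mse(boosted, xs, ys))
```

The learning rate shrinks each tree's contribution, so many small corrections accumulate into the final additive model.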
- 16. Ensembles of Trees: Smoothers
Ensembles smooth jagged decision boundaries
Pictures from
T.G. Dietterich. Ensemble methods in machine learning. In
Multiple Classifier Systems, Cagliari, Italy, 2000.
- 17. Heterogeneous Model Ensembles on Glass Data
[Chart: Percent Classification Error (0% to 40%) vs. Number of Models Combined (1 to 6); series: Max Error, Min Error, Average Error]
Model prediction diversity obtained by using different algorithms: tree, NN, RBF, Gaussian, Regression, k-NN
Combining 3-5 models is on average better than the best single model
Combining all 6 models is not best (best is the 3- and 4-model combinations), but is close
This is an example of reducing model variance through ensembles, but not model bias
- 18. Direct Marketing Example: Considerations for I-Miner
From Abbott, D.W., "How to Improve Customer
Acquisition Models with Ensembles", presented at
Predictive Analytics World Conference, Washington,
D.C., October 20, 2009.
Steps:
1. Join by record—all models applied to same data in
same row order
2. Change probability names
3. Average probabilities
1. Decision is avg_prob > threshold
4. Decile Probability Ranks
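The four steps can be sketched outside any particular tool. In the sketch below the model names (`tree`, `logit`), the probabilities, and the 0.5 threshold are made up for illustration; the only requirement carried over from the slide is that every model's scores are in the same row order.

```python
def ensemble_scores(model_probs, threshold=0.5):
    """model_probs maps a model name to its probabilities, all in the same row order."""
    names = sorted(model_probs)
    n = len(model_probs[names[0]])
    # Step 1: rows must align across models.
    assert all(len(model_probs[m]) == n for m in names), "row counts differ"
    # Step 3: average the probabilities record by record.
    avg = [sum(model_probs[m][i] for m in names) / len(names) for i in range(n)]
    decisions = [p > threshold for p in avg]
    # Step 4: decile 1 holds the highest-scoring tenth of the records.
    order = sorted(range(n), key=lambda i: avg[i], reverse=True)
    deciles = [0] * n
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // n + 1
    return avg, decisions, deciles


# Step 2 (distinct probability names) corresponds to the dictionary keys here.
model_probs = {
    "tree":  [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05],
    "logit": [0.7, 0.9, 0.5, 0.7, 0.3, 0.5, 0.1, 0.3, 0.0, 0.1],
}
avg, decisions, deciles = ensemble_scores(model_probs)
for row in zip(avg, decisions, deciles):
    print(row)
```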
- 19. Direct Marketing Example: Variable Inclusion in Model Ensembles
Twenty-five different variables represented in the ten models
Only five were represented in seven or more models
Twelve were represented in one or two models
[Table: # Models with Common Variables; columns: # Models, # Variables]
From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles", presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.
- 20. Fraud Detection Example: Deployment Stream
Model scoring picks up scores from each model, combines in an ensemble, and pushes scores back to database
- 21. Fraud Detection Example: Overall Model Score on Validation Data
[Chart: Total Score (from validation population); y-axis: Normalized Score (1.0 to 10.0); x-axis: Model (ensemble, summary bars, and individual models 1 to 10)]
“Score” weights false alarms and sensitivity
Overall, ensemble is clearly best, and much better than best on testing data
From Abbott, D, and Tom Konchan, “Advanced Fraud Detection Techniques for Vendor Payments”, Predictive Analytics Summit, San Diego, CA, February 24, 2011.
- 22. Are Ensembles Better?
Accuracy? Yes
Interpretability? No
Do Ensembles contradict Occam’s Razor?
– Principle: simpler models generalize better; avoid
overfit!
– They are more complex than single models (RF
may have hundreds of trees in the ensemble)
– Yet these more complex models perform better on
held-out data
– But…are they really more complex?
- 23. Generalized Degrees of Freedom
Linear Regression: a degree of freedom in the
model is simply a parameter
– Does not extrapolate to non-linear methods
– Number of “parameters” in non-linear methods can
produce more complexity or less
Enter…Generalized Degrees of Freedom (GDF)
– GDF (Ye 1998) “randomly perturbs (adds noise to)
the output variable, re-runs the modeling
procedure, and measures the changes to the
estimates” (for same number of parameters)
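The perturbation procedure quoted above can be sketched as a small Monte Carlo estimate: perturb the targets, refit, and for each row regress the change in its fitted value on the noise added to that row; the sum of the slopes is the GDF estimate. The two toy "modeling procedures" below (predict the mean, and memorize the training targets) are assumptions chosen because their answers are known in advance: roughly 1 and exactly n.

```python
import random
from statistics import mean


def gdf(fit_predict, xs, ys, n_reps=200, noise_sd=0.25, seed=0):
    """Monte Carlo GDF estimate for a procedure fit_predict(xs, ys) -> fitted values."""
    rng = random.Random(seed)
    n = len(ys)
    base = fit_predict(xs, ys)
    deltas, shifts = [], []
    for _ in range(n_reps):
        d = [rng.gauss(0, noise_sd) for _ in range(n)]      # perturb the output
        fitted = fit_predict(xs, [y + di for y, di in zip(ys, d)])  # re-run
        deltas.append(d)
        shifts.append([f - b for f, b in zip(fitted, base)])  # change in estimates
    total = 0.0
    for i in range(n):
        di = [d[i] for d in deltas]
        si = [s[i] for s in shifts]
        dbar, sbar = mean(di), mean(si)
        cov = sum((a - dbar) * (b - sbar) for a, b in zip(di, si))
        var = sum((a - dbar) ** 2 for a in di)
        total += cov / var  # sensitivity of row i's estimate to its own target
    return total


def mean_model(xs, ys):
    """One fitted parameter: every row gets the overall mean."""
    m = mean(ys)
    return [m] * len(ys)


def memorizer(xs, ys):
    """Maximally flexible: every row's estimate is its own target."""
    return list(ys)


xs = list(range(20))
ys = [0.5 * x for x in xs]
print("mean model GDF:", gdf(mean_model, xs, ys))
print("memorizer GDF:", gdf(memorizer, xs, ys))
```

The same function could be pointed at an ensemble's fit-and-predict routine to compare its measured complexity with that of a single model.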
- 24. The Math of GDF
From Giovanni Seni, John Elder, Ensemble Methods in Data Mining: Improving
Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010
(ISBN: 978-1608452842)
- 25. The Effect of GDF
From Elder, J.F.E IV, “The Generalization Paradox of Ensembles”, Journal of
Computational and Graphical Statistics, Volume 12, Number 4, Pages 853–864
- 26. Why Ensembles Win
Performance, performance, performance
Single models sometimes provide insufficient accuracy
– Neural networks become stuck in local minima
– Decision trees
Run out of data
Are greedy—can get fooled early
– Single algorithms keep pushing performance using the same
ideas (basis function / algorithm), and are incapable of
thinking outside of their box
Different algorithms, or algorithms built using
resampled data, achieve the same level of accuracy but
on different cases: they identify different ways to get
the same level of accuracy
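A small calculation shows why models that are equally accurate on different cases help. Under the idealized assumption that the models' errors are independent (real models are usually correlated, so actual gains are smaller), the accuracy of a majority vote follows from the binomial distribution. The function name is illustrative.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """P(majority is correct) for n independent models, each correct with probability p."""
    need = n_models // 2 + 1  # votes needed for a strict majority
    return sum(comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
               for k in range(need, n_models + 1))


for n in (1, 3, 5, 21):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

With three models at 70% accuracy each, the vote is correct 0.7^3 + 3(0.7^2)(0.3) = 78.4% of the time, and the figure keeps rising as more independent models are added.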
- 27. Conclusion
Ensembles can achieve significant model
performance improvements
The key to good ensembles is diversity in
sampling and variable selection
Can be applied to single algorithm, or across
multiple algorithms
Just do it!
- 28. References
Giovanni Seni, John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010 (ISBN: 978-1608452842)
Elder, J.F.E IV, “The Generalization Paradox of Ensembles”, Journal of Computational and Graphical Statistics, Volume 12, Number 4, Pages 853–864 DOI: 10.1198/1061860032733
Abbott, D.W., “The Benefits of Creating Ensembles of Classifiers”, Abbott Analytics, Inc., http://www.abbottanalytics.com/white-paper-classifiers.php
Abbott, D.W., “A Comparison of Algorithms at PAKDD2007”, Blog post at http://abbottanalytics.blogspot.com/2007/05/comparison-of-algorithms-at-pakdd2007.html
- 29. References
Tu, Zhuowen, “Ensemble Classification Methods: Bagging, Boosting, and Random Forests”, http://www.loni.ucla.edu/~ztu/courses/2010_CS_spring/cs269_2010_ensemble.pdf
Ye, J. (1998), “On Measuring and Correcting the Effects of Data Mining and Model Selection,” Journal of the American Statistical Association, 93, 120–131.