An Overview of Predictive Modeling
Max Kuhn, Ph.D
Pfizer Global R&D
Groton, CT
max.kuhn@pfizer.com
Outline
“Predictive modeling” definition
Some example applications
A short overview and example
How is this different from what statisticians already do?
What can drive choice of methodology?
Where should we focus our efforts?
“Predictive Modeling”
Define That!
Rather than saying that method X is a predictive model, I would
say:
Predictive Modeling
is the process of creating a model whose primary goal is to
achieve high levels of accuracy.
In other words, a situation where we are concerned with making
the best possible prediction on an individual data instance.
(aka pattern recognition, aka machine learning)
Models
So, in theory, a linear or logistic regression model is a predictive
model?
Yes.
As will be emphasized during this talk:
the quality of the prediction is the focus
model interpretability and inferential capacity are not as
important
Where Does This Fit Within Data Science
From Drew Conway
(http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
Example Applications
Examples
spam detection: we want the most accurate prediction that
minimizes false positives and eliminates spam
plane delays, travel time, etc.
customer volume
sentiment analysis of text
sale price of a property
For example, does anyone care why an email or SMS is labeled as
spam?
Quantitative Structure Activity Relationships
(QSAR)
Pharmaceutical companies screen millions of molecules to see if
they have good properties, such as:
biologically potent
safe
soluble, permeable, drug–like, etc
We synthesize many molecules and run lab tests (“assays”) to
estimate the characteristics listed above.
When a medicinal chemist designs a new molecule, he/she would
like a prediction to help assess whether we should synthesize it.
Medical Diagnosis
A patient might be concerned about having cancer on the basis
of some physiological condition their doctor has observed.
Any result based on imaging or a lab test should, above all, be as
accurate as possible (more on this later).
If the test does indicate cancer, very few people would want to
know if it was due to high levels of:
human chorionic gonadotropin (HCG)
alpha-fetoprotein (AFP) or
lactate dehydrogenase (LDH)
Accuracy is the concern
Managing Customer Churn/Defection
When a customer has the potential to switch vendors (e.g. phone
contract ends), we might want to estimate
the probability that they will churn
the expected monetary loss from churn
Based on these quantities, you may want to offer a customer an
incentive to stay.
Poor predictions will result in loss of revenue, either from losing
the customer or from an inappropriately applied incentive.
The goal is to minimize financial loss at the individual customer
level
An Overview of the Modeling Process
Cell Segmentation
Individual cell results are aggregated so that decisions can be
made about specific compounds
Improperly segmented objects might compromise the quality of
the data so an algorithmic filter is needed.
In this application, we have measurements of the size, intensity,
and shape of several parts of the cell (e.g. nucleus, cell body,
cytoskeleton).
Can these measurements be used to predict poor segmentation,
using a set of manually labeled cells?
The Data
The authors scored 2019 cells into these two bins:
well–segmented (WS) or poorly–segmented (PS).
There are 58 measurements on each cell that can be used as
predictors.
The data are in the caret package.
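Loading them looks roughly like this (a sketch, assuming the segmentationData object documented in recent versions of caret):

library(caret)
data(segmentationData)  # 2,019 cells scored as PS or WS
# 'Cell' is an id, 'Case' stores the authors' train/test split,
# 'Class' is the outcome; the remaining columns are the predictors
str(segmentationData[, 1:5])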
Model Building Steps
Common steps during model building are:
estimating model parameters (i.e. training models)
determining the values of tuning parameters that cannot be
directly calculated from the data
calculating the performance of the final model that will
generalize to new data
Model Building Steps
How do we “spend” the data to find an optimal model? We
typically split data into training and test data sets:
Training Set: these data are used to estimate model
parameters and to pick the values of the complexity
parameter(s) for the model.
Test Set (aka validation set): these data can be used to get
an independent assessment of model efficacy. They should not
be used during model training.
Spending Our Data
The more data we spend, the better estimates we’ll get (provided
the data is accurate). Given a fixed amount of data,
too much spent in training won’t allow us to get a good
assessment of predictive performance. We may find a model
that fits the training data very well, but is not generalizable
(over–fitting)
too much spent in testing won’t allow us to get good
estimates of model parameters
Spending Our Data
Statistically, the best course of action would be to use all the
data for model building and use statistical methods to get good
estimates of error.
From a non–statistical perspective, many consumers of these
models emphasize the need for an untouched set of samples to
evaluate performance.
The authors designated a training set (n = 1009) and a test set
(n = 1010).
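With the caret copy of these data, the authors' split can be recovered from the Case column; a sketch:

training <- subset(segmentationData, Case == "Train", select = -c(Cell, Case))
testing  <- subset(segmentationData, Case == "Test",  select = -c(Cell, Case))
# With no pre-defined split, a stratified random split is one option:
# in_train <- createDataPartition(segmentationData$Class, p = 0.5, list = FALSE)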
Over–Fitting
Over–fitting occurs when a model inappropriately picks up on
trends in the training set that do not generalize to new samples.
When this occurs, assessments of the model based on the
training set can show good performance that does not reproduce
in future samples.
Some models have specific “knobs” to control over-fitting:
neighborhood size in nearest neighbor models is an example
the number of splits in a tree model
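A toy sketch of the neighborhood–size knob (simulated data, not the cell data):

library(class)
set.seed(42)
x <- matrix(rnorm(200), ncol = 2)  # two predictors, 100 samples
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(100, sd = 0.5) > 0, "A", "B"))
# k = 1 can memorize the training data; larger k smooths the boundary
mean(knn(train = x, test = x, cl = y, k = 1) == y)   # apparent accuracy: optimistic
mean(knn(train = x, test = x, cl = y, k = 25) == y)  # less flexible fit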
Over–Fitting
Often, poor choices for these parameters can result in over-fitting.
For example, the next slide shows a data set with two predictors.
We want to be able to produce a line (i.e. decision boundary)
that differentiates two classes of data.
Two new points are to be predicted. A 5–nearest neighbor model
is illustrated.
K–Nearest Neighbors Classification
[Figure: scatter plot of Class 1 and Class 2 over Predictor A and Predictor B; a 5–nearest neighbor prediction is shown for two new points.]
Over–Fitting
On the next slide, two classification boundaries are shown for a
different model type not yet discussed.
The difference in the two panels is solely due to different choices
in tuning parameters.
One over–fits the training data.
Two Model Fits
[Figure: two panels, Model #1 and Model #2, showing classification boundaries over Predictor A and Predictor B.]
Characterizing Over–Fitting Using the Training Set
One obvious way to detect over–fitting is to use a test set.
However, repeated “looks” at the test set can also lead to
over–fitting.
Resampling the training samples allows us to know when we are
making poor choices for the values of these parameters (the test
set is not used).
Examples are cross–validation (in many varieties) and the
bootstrap.
These procedures repeatedly split the training data into subsets
used for modeling and performance evaluation.
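In caret, the resampling scheme is declared up front; a sketch:

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
# a bootstrap alternative:
# ctrl <- trainControl(method = "boot", number = 50)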
The Big Picture
We think that resampling will give us honest estimates of future
performance, but there is still the issue of which sub–model to
select (e.g. 5 or 10 NN).
One algorithm to select sub–models:
Define sets of model parameter values to evaluate;
for each parameter set do
for each resampling iteration do
Hold–out specific samples ;
Fit the model on the remainder;
Predict the hold–out samples;
end
Calculate the average performance across hold–out predictions
end
Determine the optimal parameter value;
Create final model with entire training set and optimal parameter
value;
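caret's train() function implements this algorithm. A minimal sketch for the K–nearest neighbor model (the grid of k values is illustrative):

set.seed(123)
knn_fit <- train(Class ~ ., data = training,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneGrid = data.frame(k = seq(5, 51, by = 2)),
                 trControl = ctrl)
knn_fit  # prints the resampled accuracy for each candidate k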
K–Nearest Neighbors Tuning
[Figure: tuning profile for K–nearest neighbors; accuracy from repeated cross–validation (about 0.80–0.81) plotted against #Neighbors from 10 to 50.]
Next Steps
Normally, we would try different approaches to improving
performance:
different models,
other pre–processing techniques,
model ensembles, and/or
feature selection (i.e. variable selection)
For example, the degree of correlation between the predictors in
these data is fairly high. This may have had a negative impact on
the model.
We can fit another model that uses a sufficient number of
principal components in the K–nearest neighbor model (instead
of the original predictors).
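A sketch of the PCA variant; by default the retained components capture 95% of the predictor variance (the preProcOptions argument of trainControl changes this):

set.seed(123)  # same seed as before, so the resamples match knn_fit
knn_pca_fit <- train(Class ~ ., data = training,
                     method = "knn",
                     preProcess = c("center", "scale", "pca"),
                     tuneGrid = data.frame(k = seq(5, 51, by = 2)),
                     trControl = ctrl)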
Pre-Processing Comparison
[Figure: cross–validation accuracy versus #Neighbors for the basic predictor set and the PCA pre–processed set; the PCA profile reaches roughly 0.82.]
Evaluating the Final Model Candidate
Suppose we evaluated a panel of models and the cross–validation
results indicated that the K–nearest neighbor model using PCA
had the best cross–validation results.
We would use the test set to verify that there were no
methodological errors.
The test set accuracy was 80.7%.
The CV accuracy was 82%.
Pretty Close!
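In caret, this verification takes two lines (a sketch, using the PCA fit from the earlier sketch):

test_pred <- predict(knn_pca_fit, newdata = testing)
confusionMatrix(test_pred, testing$Class)  # accuracy, kappa, sensitivity, etc.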
Comparisons to Traditional Statistical
Analyses
Good References On This Topic
Breiman, L. (1997). No Bayesians in foxholes. IEEE Expert,
12(6), 21–24.
Breiman, L. (2001). Statistical modeling: The two cultures.
Statistical Science, 16(3), 199–215. (plus discussants)
Ayres, I. (2008). Super Crunchers: Why Thinking-By-Numbers is
the New Way To Be Smart, Bantam.
Shmueli, G. (2010). To explain or to predict? Statistical Science,
25(3), 289–310.
Boulesteix, A.-L., & Schmid, M. (2014). Machine learning versus
statistical modeling. Biometrical Journal, 56(4), 588–593. (and
other articles in this issue)
Empirical Validation
Previously, there was an emphasis on empirical assessments of
accuracy.
While accuracy isn’t the best measure, metrics related to errors
of some sort are far more important here than in traditional
statistics.
Statistical criteria (e.g. lack of fit tests, p–values, likelihoods,
etc.) are not directly related to performance.
Friedman (2001) describes an example related to boosted trees
with MLE:
“[...] degrading the likelihood by overfitting actually
improves misclassification error rates. Although perhaps
counterintuitive, this is not a contradiction; likelihood and
error rate measure different aspects of fit quality.”
Empirical Validation
Even some measures that are asymptotically related to error rates
(e.g. AIC, BIC) are still insufficient.
Also, a number of statistical criteria (e.g. adjusted R²) require
degrees of freedom. In many predictive models, these do not
exist or are much larger than the training set size.
Finally, there is often an interest in measuring nonstandard loss
functions and optimizing models on this basis.
For the customer churn example, the loss of revenue related to
false negatives and false positives may be different. The loss
function may not fit nicely into standard decision theory so
evidence–based assessments are important.
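caret accepts such losses through a custom summary function. A sketch with a made-up 5:1 cost ratio (the ratio, and the assumption that the first class level marks the churners, are illustrative):

cost_summary <- function(data, lev = NULL, model = NULL) {
  fn <- sum(data$obs == lev[1] & data$pred != lev[1])  # missed churners
  fp <- sum(data$obs != lev[1] & data$pred == lev[1])  # wasted incentives
  c(Cost = 5 * fn + fp)
}
ctrl_cost <- trainControl(method = "cv", summaryFunction = cost_summary)
# then: train(..., metric = "Cost", maximize = FALSE, trControl = ctrl_cost)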
Statistical Formalism
Traditional statistics are often used for making inferential
statements regarding parameters or contrasts of parameters.
For this reason, there is a fixation on the appropriateness of the
models related to
necessary distributional assumptions required for theory
relatively simple model structure to keep the math tractable
the sanctity of degrees of freedom.
This is critical to make appropriate inferences on model
parameters.
However, letting these go allows the user greater flexibility to
increase accuracy.
Examples
complex or nonlinear pre–processing (e.g. PCA, spatial sign)
ensembles of models (boosting, bagging, random forests)
overparameterized but highly regularized nonlinear models
(SVMs, NNets)
Quinlan (1993) describing pessimistic pruning, notes that it:
“does violence to statistical notions of sampling and
confidence limits, so the reasoning should be taken with a
grain of salt.”
Breiman (1997) notes that there were “No Bayesians in
foxholes.”
Three Sweeping Generalizations About Models
1. In the absence of any knowledge about the prediction problem, no model can be said to be uniformly better than any other (No Free Lunch theorem)
2. There is an inverse relationship between model accuracy and model interpretability
3. Statistical validity of a model is not always connected to model accuracy
Segmentation CV Results for Different Models
Confidence Level: 0.95
[Figure: dot plot of resampled accuracy, with 95% confidence intervals, for CART, LDA, glmnet, PLS, bagging, FDA, KNN_PCA, SVM, C5.0, GBM, and RF; the estimates range from about 0.79 to 0.84.]
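caret can produce this kind of comparison from a list of fitted models, provided they were fit on the same resamples; a sketch using the two models from earlier:

rs <- resamples(list(KNN = knn_fit, KNN_PCA = knn_pca_fit))
summary(rs)
dotplot(rs, metric = "Accuracy")  # intervals like the plot above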
How Does Statistics Contribute to Predictive
Modeling?
Many statistical principles still apply and have greatly improved
the field:
resampling
the variance–bias tradeoff
sampling bias in big data
parsimony principle (all other things being equal, keep the simpler model)
L1 and/or L2 regularization
Statistical improvements to boosting are a great example of this.
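For example, both penalties are available through caret's interface to the glmnet package; a sketch (tuneLength just asks for a default grid of penalty values):

set.seed(123)
glmnet_fit <- train(Class ~ ., data = training,
                    method = "glmnet",  # elastic net: L1 + L2 penalties
                    preProcess = c("center", "scale"),
                    tuneLength = 10,
                    trControl = ctrl)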
What Can Drive Choice of
Methodology?
Cross–Validation Cost Estimates
(HPC Case Study, APM, pg 454)
[Figure: resampled cost estimates for C5.0, Random Forests, Bagging, SVM (Weights), C5.0 (Costs), CART (Costs), Bagging (Costs), SVM, Neural Networks, CART, LDA (Sparse), LDA, FDA, and PLS; the estimates range from about 0.2 to 0.8.]
Important Considerations
It is fairly common to have a number of different models with
equivalent levels of performance.
If a predictive model will be deployed, there are a few practical
considerations that might drive the choice of model.
if the prediction equation is to be numerically optimized,
models with smooth functions might be preferable.
the characteristics of the data (e.g. multicollinearity, class
imbalance, etc.) will often constrain the feasible set of models.
large amounts of unlabeled data can be exploited by some
models
Ease of Use/Model Development
If a large number of models will be created, using a low
maintenance technique is probably a good idea.
QSAR is a good example. Most companies maintain a large
number of models for different endpoints. These models are
constantly updated too.
Trees (and their ensembles) tend to be low maintenance; neural
networks are not.
However, if the model must have the absolute best possible error
rate, you might be willing to put a lot of work into the model
(e.g. deep neural networks)
How will the model be deployed?
If the prediction equation needs to be encoded into software or a
database, concise equations have the advantage.
For example, random forests require a large number of unpruned
trees. Even for a reasonable data set, the total forest can have
thousands of terminal nodes (39,437 if/then statements for the
cell model)
Bagged trees require dozens of trees and might yield the same
level of performance but with a smaller footprint
Some models (K-NN, support vector machines) can require
storage of the original training set, which may be infeasible.
Neural networks, MARS and a few other models can yield
compact nonlinear prediction equations.
Ethical Considerations
There are some situations where using a sub–optimal model can
be unethical.
My experiences in molecular diagnostics gave me some
perspective on the friction point between pure performance and
notions about validity.
Another generalization: doctors and regulators are apprehensive
about black–box models since they cannot make sense of them and
assume that the validation of these models is lacking.
I had an experience as an “end user” of a model (via a lab test)
when I was developing algorithms for a molecular diagnostic
company.
Antimicrobial Susceptibility Testing
Klebsiella pneumoniae
klebsiella-pneumoniae.org/
Where should we focus our efforts?
Diminishing Returns on Models
Honestly, I think that we have plenty of effective models.
Very few problems are solved by inventing yet another predictive
model.
Exceptions:
using unlabeled data
severe class imbalances
feature engineering algorithms (e.g. MARS, autoencoders, etc)
applicability domain techniques and prediction confidence
assessments
It is more important to improve the features that go into the
model.
What About Big Data?
The advantages and issues related to Big Data can be broken
down into two areas:
Big N: more samples or cases
Big P: more variables or attributes or fields
Mostly, I think the catchphrase is associated more with N than P.
Does Big Data solve my problems?
Maybe¹
¹ the basic answer given by every statistician throughout time
Can You Be More Specific?
It depends on:
what are you using it for?
does it solve some unmet need?
does it get in the way?
Basically, it comes down to: Bigger Data = Better Data?
At least, not necessarily.
Big N - Interpolation
One situation where it probably doesn’t help is when samples are
added within the mainstream of the data.
In effect, we are just filling in the predictor space by increasing
the granularity.
After the first 10,000 or so observations, the model will not
change very much.
This does pose an interesting interaction with the
variance–bias trade-off.
Big N goes a long way to reducing the model variance. Given
this, can high variance/low bias models be improved?
High Variance Model with Big N
Maybe.
Many high(ish) variance/low bias models (e.g. trees, neural
networks, etc.) tend to be very complex and computationally
demanding.
Adding more data allows these models to more accurately reflect
the complexity of the data but would require specialized solutions
to be feasible.
At this point, the best approach is supplanted by the available
approaches (not good).
Form should still follow function.
Model Variance and Bias
[Figure: Outcome plotted against Predictor for a simulated data set, illustrating model variance and bias.]
Low Variance Model with Big N
What about low variance/high bias models (e.g. logistic and
linear regression)?
There is some room here for improvement since the abundance of
data allows more opportunities for exploratory data analysis to
tease apart the functional forms to lower the bias (i.e. improved
feature engineering) or to select features.
For example, non–linear terms for logistic regression models can
be parametrically formalized based on the results of spline or
loess smoothers.
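A minimal sketch with simulated data (the nonlinear pattern is invented for illustration):

library(splines)
set.seed(42)
x <- runif(500, 0, 10)
y <- rbinom(500, 1, plogis(sin(x)))  # outcome probability has a curved trend in x
fit_linear <- glm(y ~ x, family = binomial)
fit_spline <- glm(y ~ ns(x, df = 4), family = binomial)  # natural spline basis
AIC(fit_linear, fit_spline)  # the spline term should capture the curvature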
Diminishing Returns on N∗
At some point, adding more (∗of the same) data does not do any good.
Performance stabilizes but computational complexity increases.
The modeler becomes handcuffed to whatever technology is
available to handle large amounts of data.
Determining what data to use is more important than worrying
about the technology required to fit the model.
Global Versus Local Models in QSAR
When developing a compound, once a “hit” is obtained, the
chemists begin to tweak the structure to make improvements.
This leads to a chemical series.
We could build QSAR models on the large number of existing
compounds (a global model) or on the series of interest (a local
model).
Our experience is that local models beat global models the
majority of the time.
Here, fewer (of the most relevant) compounds are better.
Our first inclination is to use all the data because our (overall)
error rate should get better.
Like politics, All Problems Are Local.
Data Quality (QSAR Again)
One “Tier 1” screen is an assay for logP (the partition
coefficient), which we use as a measure of “greasiness”.
Tier 1 means that logP is estimated for most compounds (via
model and/or assay)
There was an existing, high–throughput assay on a large number
of historical compounds. However, the data quality was poor.
Several years ago, a new assay was developed that was lower
throughput and higher quality. This became the default assay for
logP.
The model was re-built on a small (N ≈ 1,000) set of chemically
diverse compounds.
In the end, fewer compounds were assayed but the model
performance was much better and costs were lowered.
Big N and/or P - Reducing Extrapolation
However, Big N might start to sample from rare populations.
For example:
a customer cluster that is less frequent but has high
profitability (via Big N).
a specific mutation or polymorphism (i.e. a rare event) that
helps derive a new drug target (via Big N and Big P)
highly non–linear “activity cliffs” in computational chemistry
can be elucidated (via Big N)
Now, we have the ability to solve some unmet need.
Thanks!

Weitere ähnliche Inhalte

Was ist angesagt?

모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
r-kor
 
An Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social SciencesAn Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social Sciences
fsmart01
 

Was ist angesagt? (20)

Table
TableTable
Table
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
 
Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...
Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...
Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...
 
An Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social SciencesAn Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social Sciences
 
CART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideCART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User Guide
 
Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for Regression
 
Tree-Based Methods (Article 8 - Practical Exercises)
Tree-Based Methods (Article 8 - Practical Exercises)Tree-Based Methods (Article 8 - Practical Exercises)
Tree-Based Methods (Article 8 - Practical Exercises)
 
Ensembling & Boosting 概念介紹
Ensembling & Boosting  概念介紹Ensembling & Boosting  概念介紹
Ensembling & Boosting 概念介紹
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
Workshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RWorkshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with R
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboost
 
Descriptive Analytics: Data Reduction
 Descriptive Analytics: Data Reduction Descriptive Analytics: Data Reduction
Descriptive Analytics: Data Reduction
 
Kaggle Higgs Boson Machine Learning Challenge
Kaggle Higgs Boson Machine Learning ChallengeKaggle Higgs Boson Machine Learning Challenge
Kaggle Higgs Boson Machine Learning Challenge
 
And Then There Are Algorithms - Danilo Poccia - Codemotion Rome 2018
And Then There Are Algorithms - Danilo Poccia - Codemotion Rome 2018And Then There Are Algorithms - Danilo Poccia - Codemotion Rome 2018
And Then There Are Algorithms - Danilo Poccia - Codemotion Rome 2018
 
Xgboost
XgboostXgboost
Xgboost
 
Improve Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsImprove Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForests
 
Machine learning key to your formulation challenges
Machine learning key to your formulation challengesMachine learning key to your formulation challenges
Machine learning key to your formulation challenges
 
Feature Reduction Techniques
Feature Reduction TechniquesFeature Reduction Techniques
Feature Reduction Techniques
 
Ensemble methods in machine learning
Ensemble methods in machine learningEnsemble methods in machine learning
Ensemble methods in machine learning
 
Machine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demoMachine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demo
 

Andere mochten auch

Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
Vivian S. Zhang
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
Vivian S. Zhang
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
Vivian S. Zhang
 

Andere mochten auch (12)

A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data
 
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentation
 
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
 
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 Hack session for NYTimes Dialect Map Visualization( developed by R Shiny) Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 

Ähnlich wie Max Kuhn's talk on R machine learning

Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMS
Ali T. Lotia
 
Analogy Based Defect Prediction Model Elham Paikari Department of ...
Analogy Based Defect Prediction Model Elham Paikari Department of ...Analogy Based Defect Prediction Model Elham Paikari Department of ...
Analogy Based Defect Prediction Model Elham Paikari Department of ...
butest
 
Jacobs Kiefer Bayes Guide 3 10 V1
Jacobs Kiefer Bayes Guide 3 10 V1Jacobs Kiefer Bayes Guide 3 10 V1
Jacobs Kiefer Bayes Guide 3 10 V1
Michael Jacobs, Jr.
 
Codecamp Iasi 7 mai 2011 Monte Carlo Simulation
Codecamp Iasi 7 mai 2011 Monte Carlo SimulationCodecamp Iasi 7 mai 2011 Monte Carlo Simulation
Codecamp Iasi 7 mai 2011 Monte Carlo Simulation
Codecamp Romania
 

Ähnlich wie Max Kuhn's talk on R machine learning (20)

Analyzing Performance Test Data
Analyzing Performance Test DataAnalyzing Performance Test Data
Analyzing Performance Test Data
 
Industrial Examples - Process Capability in Total Quality Management
Industrial Examples - Process Capability in Total Quality ManagementIndustrial Examples - Process Capability in Total Quality Management
Industrial Examples - Process Capability in Total Quality Management
 
ML MODULE 5.pdf
ML MODULE 5.pdfML MODULE 5.pdf
ML MODULE 5.pdf
 
2020 trends in biostatistics what you should know about study design - slid...
2020 trends in biostatistics   what you should know about study design - slid...2020 trends in biostatistics   what you should know about study design - slid...
2020 trends in biostatistics what you should know about study design - slid...
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
 
MUMS: Transition & SPUQ Workshop - Stochastic Simulators: Issues, Methods, Un...
MUMS: Transition & SPUQ Workshop - Stochastic Simulators: Issues, Methods, Un...MUMS: Transition & SPUQ Workshop - Stochastic Simulators: Issues, Methods, Un...
MUMS: Transition & SPUQ Workshop - Stochastic Simulators: Issues, Methods, Un...
 
Financial Risk Mgt - Lec 11 by Dr. Syed Muhammad Ali Tirmizi
Financial Risk Mgt - Lec 11 by Dr. Syed Muhammad Ali TirmiziFinancial Risk Mgt - Lec 11 by Dr. Syed Muhammad Ali Tirmizi
Financial Risk Mgt - Lec 11 by Dr. Syed Muhammad Ali Tirmizi
 
Webinar slides how to reduce sample size ethically and responsibly
Webinar slides   how to reduce sample size ethically and responsiblyWebinar slides   how to reduce sample size ethically and responsibly
Webinar slides how to reduce sample size ethically and responsibly
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMS
 
Analogy Based Defect Prediction Model Elham Paikari Department of ...
Analogy Based Defect Prediction Model Elham Paikari Department of ...Analogy Based Defect Prediction Model Elham Paikari Department of ...
Analogy Based Defect Prediction Model Elham Paikari Department of ...
 
A parsimonious SVM model selection criterion for classification of real-world ...
A parsimonious SVM model selection criterion for classification of real-world ...A parsimonious SVM model selection criterion for classification of real-world ...
A parsimonious SVM model selection criterion for classification of real-world ...
 
Machine learning project
Machine learning project Machine learning project
Machine learning project
 
Sequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_modelsSequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_models
 
Toward a Unified Approach to Fitting Loss Models
Toward a Unified Approach to Fitting Loss ModelsToward a Unified Approach to Fitting Loss Models
Toward a Unified Approach to Fitting Loss Models
 
Jacobs Kiefer Bayes Guide 3 10 V1
Jacobs Kiefer Bayes Guide 3 10 V1Jacobs Kiefer Bayes Guide 3 10 V1
Jacobs Kiefer Bayes Guide 3 10 V1
 
Codecamp Iasi 7 mai 2011 Monte Carlo Simulation
Codecamp Iasi 7 mai 2011 Monte Carlo SimulationCodecamp Iasi 7 mai 2011 Monte Carlo Simulation
Codecamp Iasi 7 mai 2011 Monte Carlo Simulation
 
Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4
 
Six Sigma Confidence Interval Analysis (CIA) Training Module
Six Sigma Confidence Interval Analysis (CIA) Training ModuleSix Sigma Confidence Interval Analysis (CIA) Training Module
Six Sigma Confidence Interval Analysis (CIA) Training Module
 
Supercharge your AB testing with automated causal inference - Community Works...
Supercharge your AB testing with automated causal inference - Community Works...Supercharge your AB testing with automated causal inference - Community Works...
Supercharge your AB testing with automated causal inference - Community Works...
 

Mehr von Vivian S. Zhang

Mehr von Vivian S. Zhang (15)

Why NYC DSA.pdf
Why NYC DSA.pdfWhy NYC DSA.pdf
Why NYC DSA.pdf
 
Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger Ren
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide book
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
 
Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)
 
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
 
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycData Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
 
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
 
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
 
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
 
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
 
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
 

Kürzlich hochgeladen

Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
KarakKing
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 

Kürzlich hochgeladen (20)

Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 

Max Kuhn's talk on R machine learning

  • 1. An Overview of Predictive Modeling Max Kuhn, Ph.D Pfizer Global R&D Groton, CT max.kuhn@pfizer.com
  • 2. Outline “Predictive modeling” definition Some example applications A short overview and example How is this different from what statisticians already do? What can drive choice of methodology? Where should we focus our efforts? Max Kuhn Predictive Modeling 2/61 2/61
  • 4. Define That! Rather than saying that method X is a predictive model, I would say: Predictive Modeling is the process of creating a model whose primary goal is to achieve high levels of accuracy. In other words, a situation where we are concerned with making the best possible prediction on an individual data instance. (aka pattern recognition)(aka machine learning) Max Kuhn Predictive Modeling 4/61 4/61
  • 5. Models So, in theory, a linear or logistic regression model is a predictive model? Yes. As will be emphasized during this talk: the quality of the prediction is the focus model interpretability and inferential capacity are not as important Max Kuhn Predictive Modeling 5/61 5/61
  • 6. Where Does This Fit Within Data Science From Drew Conway (http://drewconway.com/zia/2013/3/26/the- data-science-venn-diagram) Max Kuhn Predictive Modeling 6/61 6/61
  • 8. Examples spam detection: we want the most accurate prediction that minimizes false positives and eliminates spam plane delays, travel time, etc. customer volume sentiment analysis of text sale price of a property For example, does anyone care why an email or SMS is labeled as spam? Max Kuhn Predictive Modeling 8/61 8/61
  • 9. Quantitative Structure Activity Relationships (QSAR) Pharmaceutical companies screen millions of molecules to see if they have good properties, such as: biologically potent safe soluble, permeable, drug–like, etc We synthesize many molecules and run lab tests (“assays”) to estimate the characteristics listed above. When a medicinal chemist designs a new molecule, he/she would like a prediction to help assess whether we should synthesized it. Max Kuhn Predictive Modeling 9/61 9/61
  • 10. Medical Diagnosis A patient might be concerned about having cancer on the basis of some on physiological condition their doctor has observed. Any result based on imaging or a lab test should, above all, be as accurate as possible (more on this later). If the test does indicate cancer, very few people would want to know if it was due to high levels of: human chorionic gonadotropin (HCG) alpha-fetoprotein (AFP) or lactate dehydrogenase (LDH) Accuracy is the concern Max Kuhn Predictive Modeling 10/61 10/61
  • 11. Managing Customer Churn/Defection When a customer has the potential to switch vendors (e.g. phone contract ends), we might want to estimate the probability that they will churn the expected monetary loss from churn Based on these quantities, you may want to offer a customer an incentive to stay. Poor predictions will result in loss of revenue from a customer loss or from a inappropriately applied incentive. The goal is to minimize financial loss at the individual customer level Max Kuhn Predictive Modeling 11/61 11/61
  • 12. An Overview of the Modeling Process
  • 13. Cell Segmentation Max Kuhn Predictive Modeling 13/61 13/61
  • 14. Cell Segmentation Individual cell results are aggregated so that decisions can be made about specific compounds Improperly segmented objects might compromise the quality of the data so an algorithmic filter is needed. In this application, we have measurements on the size, intensity, shape or several parts of the cell (e.g. nucleus, cell body, cytoskeleton). Can these measurements be used to predict poorly segmentation using a set of manually labeled cells? Max Kuhn Predictive Modeling 14/61 14/61
  • 15. The Data The authors scored 2019 cells into these two bins: well–segmented (WS) or poorly–segmented (PS). There are 58 measurements in each cell that can be used as predictors. The data are in the caret package. Max Kuhn Predictive Modeling 15/61 15/61
  • 16. Model Building Steps Common steps during model building are: estimating model parameters (i.e. training models) determining the values of tuning parameters that cannot be directly calculated from the data calculating the performance of the final model that will generalize to new data Max Kuhn Predictive Modeling 16/61 16/61
  • 17. Model Building Steps How do we “spend” the data to find an optimal model? We typically split data into training and test data sets: Training Set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model. Test Set (aka validation set): these data can be used to get an independent assessment of model efficacy. They should not be used during model training. Max Kuhn Predictive Modeling 17/61 17/61
  • 18. Spending Our Data The more data we spend, the better estimates we’ll get (provided the data is accurate). Given a fixed amount of data, too much spent in training won’t allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (over–fitting) too much spent in testing won’t allow us to get good estimates of model parameters Max Kuhn Predictive Modeling 18/61 18/61
  • 19. Spending Our Data Statistically, the best course of action would be to use all the data for model building and use statistical methods to get good estimates of error. From a non–statistical perspective, many consumers of of these models emphasize the need for an untouched set of samples the evaluate performance. The authors designated a training set (n = 1009) and a test set (n = 1010). Max Kuhn Predictive Modeling 19/61 19/61
  • 20. Over–Fitting Over–fitting occurs when a model inappropriately picks up on trends in the training set that do not generalize to new samples. When this occurs, assessments of the model based on the training set can show good performance that does not reproduce in future samples. Some models have specific “knobs” to control over-fitting neighborhood size in nearest neighbor models is an example the number if splits in a tree model Max Kuhn Predictive Modeling 20/61 20/61
  • 21. Over–Fitting Often, poor choices for these parameters can result in over-fitting For example, the next slide shows a data set with two predictors. We want to be able to produce a line (i.e. decision boundary) that differentiates two classes of data. Two new points are to be predicted. A 5–nearest neighbor model is illustrated. Max Kuhn Predictive Modeling 21/61 21/61
  • 22. K–Nearest Neighbors Classification Predictor A PredictorB 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 q Class 1 Class 2 Max Kuhn Predictive Modeling 22/61 22/61
  • 23. Over–Fitting On the next slide, two classification boundaries are shown for the a different model type not yet discussed. The difference in the two panels is solely due to different choices in tuning parameters. One over–fits the training data. Max Kuhn Predictive Modeling 23/61 23/61
  • 24. Two Model Fits Predictor A PredictorB 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 Model #1 0.2 0.4 0.6 0.8 Model #2 Max Kuhn Predictive Modeling 24/61 24/61
  • 25. Characterizing Over–Fitting Using the Training Set One obvious way to detect over–fitting is to use a test set. However, repeated “looks” at the test set can also lead to over–fitting Resampling the training samples allows us to know when we are making poor choices for the values of these parameters (the test set is not used). Examples are cross–validation (in many varieties) and the bootstrap. These procedures repeated split the training data into subsets used for modeling and performance evaluation. Max Kuhn Predictive Modeling 25/61 25/61
  • 26. The Big Picture We think that resampling will give us honest estimates of future performance, but there is still the issue of which sub–model to select (e.g. 5 or 10 NN). One algorithm to select sub–models: Define sets of model parameter values to evaluate; for each parameter set do for each resampling iteration do Hold–out specific samples ; Fit the model on the remainder; Predict the hold–out samples; end Calculate the average performance across hold–out predictions end Determine the optimal parameter value; Create final model with entire training set and optimal parameter value;Max Kuhn Predictive Modeling 26/61 26/61
  • 27. K–Nearest Neighbors Tuning q q q q q q q q q q q q q q q q q q q q q q q q q 0.80 0.81 10 20 30 40 50 #Neighbors Accuracy(RepeatedCross−Validation) Max Kuhn Predictive Modeling 27/61 27/61
  • 28. Next Steps Normally, we would try different approaches to improving performance: different models, other pre–processing techniques, model ensembles, and/or feature selection (i.e. variable selection) For example, the degree of correlation between the predictors in these data is fairly high. This may have had a negative impact on the model. We can fit another model that uses a sufficient number of principal components in the K–nearest neighbor model (instead of the original predictors). Max Kuhn Predictive Modeling 28/61 28/61
  • 29. Pre-Processing Comparison q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.80 0.81 0.82 10 20 30 40 50 #Neighbors Accuracy(Cross−Validation) pre_processing q qBasic PCA Max Kuhn Predictive Modeling 29/61 29/61
  • 30. Evaluating the Final Model Candidate Suppose we evaluated a panel of models and the cross–validation results indicated that the K–nearest neighbor model using PCA had the best cross–validation results. We would use the test set to verify that there were no methodological errors. The test set accuracy was 80.7%. The CV accuracy was 82%. Pretty Close! Max Kuhn Predictive Modeling 30/61 30/61
  • 31. Comparisons to Traditional Statistical Analyses
  • 32. Good References On This Topic Breiman, L. (1997). No Bayesians in foxholes. IEEE Expert, 12(6), 21–24. Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3), 199–215. (plus discussants) Ayres, I. (2008). Super Crunchers: Why Thinking-By-Numbers is the New Way To Be Smart, Bantam. Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310. Boulesteix, A.-L., & Schmid, M. (2014). Machine learning versus statistical modeling. Biometrical Journal, 56(4), 588–593. (and other articles in this issue) Max Kuhn Predictive Modeling 32/61 32/61
  • 33. Empirical Validation Previously there was a emphasis on empirical assessments of accuracy. While accuracy isn’t the best measure, metrics related to errors of some sort are way more important here than in traditional statistics. Statistical criteria (e.g. lack of fit tests, p–values, likelihoods, etc.) are not directly related to performance. Friedman (2001) describes an example related to boosted trees with MLE: “[...] degrading the likelihood by overfitting actually improves misclassification error rates. Although perhaps counterintuitive, this is not a contradiction; likelihood and error rate measure different aspects of fit quality.” Max Kuhn Predictive Modeling 33/61 33/61
  • 34. Empirical Validation Even some measure that are asymptotically related to error rates (e.g. AIC, BIC) are still insufficient. Also, a number of statistical criteria (e.g. adjusted R2 ) require degrees of freedom. In many predictive models, these do not exist or are much larger than the training set size. Finally, there is often an interest in measuring nonstandard loss functions and optimizing models on this basis. For the customer churn example, the loss of revenue related to false negatives and false positives may be different. The loss function may not fit nicely into standard decision theory so evidence–based assessments are important. Max Kuhn Predictive Modeling 34/61 34/61
  • 35. Statistical Formalism Traditional statistics are often used for making inferential statements regarding parameters or contrasts of parameters. For this reason, there is a fixation on the appropriateness of the models related to necessary distributional assumptions required for theory relatively simple model structure to keep the math trackable the sanctity of degrees of freedom. This is critical to make appropriate inferences on model parameters. However, letting these go allows the user greater flexibility to increase accuracy. Max Kuhn Predictive Modeling 35/61 35/61
  • 36. Examples complex or nonlinear pre–processing (e.g. PCA, spatial sign) ensembles of models (boosting, bagging, random forests) overparameterized but highly regularized nonlinear models (SVN, NNets) Quinlan (1993) describing pessimistic pruning, notes that it: “does violence to statistical notions of sampling and confidence limits, so the reasoning should be taken with a grain of salt.” Breiman (1997) notes that there were “No Bayesians in foxholes.” Max Kuhn Predictive Modeling 36/61 36/61
  • 37. Three Sweeping Generalizations About Models 1 In the absence of any knowledge about the prediction problem, no model can be said to be uniformly better than any other (No Free Lunch theorem) 2 There is an inverse relationship between model accuracy and model interpretability 3 Statistical validity of a model is not always connected to model accuracy Max Kuhn Predictive Modeling 37/61 37/61
  • 38. Segmentation CV Results for Different Models Confidence Level: 0.95 Accuracy CART LDA glmnet PLS bagging FDA KNN_PCA SVM C5.0 GBM RF 0.79 0.80 0.81 0.82 0.83 0.84 Max Kuhn Predictive Modeling 38/61 38/61
  • 39. How Does Statistics Contribute to Predictive Modeling? Many statistical principals still apply and have greatly improved the field: resampling the variance–bias tradeoff sampling bias in big data parsimony principal (AOTBE, keep the simpler model) L1 and/or L2 regularization Statistical improvements to boosting are a great example of this. Max Kuhn Predictive Modeling 39/61 39/61
  • 40. What Can Drive Choice of Methodology?
• 41. Cross-Validation Cost Estimates (HPC Case Study, APM, pg. 454)
[Figure: resampled cost estimates, roughly 0.2 to 0.8, for C5.0, random forests, bagging, SVM (weights), C5.0 (costs), CART (costs), bagging (costs), SVM, neural networks, CART, LDA (sparse), LDA, FDA, and PLS.]
• 42. Important Considerations
It is fairly common to have a number of different models with equivalent levels of performance. If a predictive model will be deployed, there are a few practical considerations that might drive the choice of model:
if the prediction equation is to be numerically optimized, models with smooth prediction functions might be preferable
the characteristics of the data (e.g. multicollinearity, class imbalance, etc.) will often constrain the feasible set of models
large amounts of unlabeled data can be exploited by some models
• 43. Ease of Use/Model Development
If a large number of models will be created, using a low-maintenance technique is probably a good idea.
QSAR is a good example: most companies maintain a large number of models for different endpoints, and these models are constantly updated.
Trees (and their ensembles) tend to be low maintenance; neural networks are not.
However, if the model must have the absolute best possible error rate, you might be willing to put a lot of work into it (e.g. deep neural networks).
• 44. How will the model be deployed?
If the prediction equation needs to be encoded into software or a database, concise equations have the advantage.
For example, random forests require a large number of unpruned trees. Even for a reasonably sized data set, the total forest can have thousands of terminal nodes (39,437 if/then statements for the cell model; a sketch of counting these follows below).
Bagged trees require only dozens of trees and might yield the same level of performance with a smaller footprint.
Some models (K-NN, support vector machines) can require storage of the original training set, which may be infeasible.
Neural networks, MARS, and a few other models can yield compact nonlinear prediction equations.
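A hedged sketch of measuring that footprint (Python/scikit-learn, simulated data, so the node count printed is illustrative rather than the 39,437 figure from the cell model):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=2000, n_features=25, random_state=3)
    rf = RandomForestClassifier(n_estimators=500, random_state=3).fit(X, y)

    # Total number of nodes across all unpruned trees: a rough proxy for the
    # number of if/then statements needed to encode the forest
    total_nodes = sum(est.tree_.node_count for est in rf.estimators_)
    print(total_nodes)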
• 45. Ethical Considerations
There are some situations where using a sub-optimal model can be unethical.
My experiences in molecular diagnostics gave me some perspective on the friction point between pure performance and notions of validity.
Another generalization: doctors and regulators are apprehensive about black-box models, since they cannot make sense of them and assume that the validation of these models is lacking.
I had an experience as an "end user" of a model (via a lab test) while I was developing algorithms for a molecular diagnostics company.
• 46. Antimicrobial Susceptibility Testing
  • 49. Where should we focus our efforts?
• 50. Diminishing Returns on Models
Honestly, I think that we have plenty of effective models. Very few problems are solved by inventing yet another predictive model.
Exceptions:
using unlabeled data
severe class imbalances
feature engineering algorithms (e.g. MARS, autoencoders, etc.)
applicability domain techniques and prediction confidence assessments
It is more important to improve the features that go into the model.
• 51. What About Big Data?
The advantages and issues related to Big Data can be broken down into two areas:
Big N: more samples or cases
Big P: more variables, attributes, or fields
Mostly, I think the catchphrase is associated more with N than with P.
Does Big Data solve my problems? Maybe [1]
[1] the basic answer given by every statistician throughout time
• 52. Can You Be More Specific?
It depends on what you are using it for:
does it solve some unmet need?
does it get in the way?
Basically, it comes down to the fact that Bigger Data does not necessarily equal Better Data.
• 53. Big N - Interpolation
One situation where Big N probably doesn't help is when samples are added within the mainstream of the data. In effect, we are just filling in the predictor space by increasing the granularity.
After the first 10,000 or so observations, the model will not change very much (the learning-curve sketch below illustrates this).
This does pose an interesting interaction with the variance-bias tradeoff: Big N goes a long way toward reducing the model variance. Given this, can high variance/low bias models be improved?
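A minimal learning-curve sketch of this plateau (Python/scikit-learn, simulated data; where performance flattens depends entirely on the problem, so the 10,000 figure should not be read as a general rule):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=10000, n_features=20, random_state=4)

    # Held-out accuracy as a function of training set size
    sizes, _, test_scores = learning_curve(
        RandomForestClassifier(n_estimators=100, random_state=4),
        X, y, train_sizes=np.linspace(0.05, 1.0, 8), cv=5)

    for n, score in zip(sizes, test_scores.mean(axis=1)):
        print(n, round(score, 3))  # gains typically flatten as n grows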
• 54. High Variance Model with Big N
Maybe. Many high(ish) variance/low bias models (e.g. trees, neural networks, etc.) tend to be very complex and computationally demanding.
Adding more data allows these models to more accurately reflect the complexity of the data, but would require specialized solutions to be feasible.
At this point, the best approach is supplanted by the available approaches (not good). Form should still follow function.
• 55. Model Variance and Bias
[Figure: simulated data, outcome versus predictor, illustrating the variance-bias tradeoff.]
• 56. Low Variance Model with Big N
What about low variance/high bias models (e.g. logistic and linear regression)?
There is some room here for improvement, since the abundance of data allows more opportunities for exploratory data analysis: teasing apart the functional forms to lower the bias (i.e. improved feature engineering) or selecting features.
For example, nonlinear terms for logistic regression models can be parametrically formalized based on the results of spline or loess smoothers, as in the sketch below.
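A minimal sketch of that idea, assuming scikit-learn 1.0 or later for SplineTransformer (a regression-spline basis standing in for the spline/loess smoothers mentioned above; simulated data, so any accuracy difference is illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import SplineTransformer

    X, y = make_classification(n_samples=1000, n_features=5, random_state=5)

    # Plain logistic regression: linear in the predictors (higher bias)
    linear = LogisticRegression(max_iter=1000)
    # Spline basis expansion first: nonlinear terms, yet still a linear
    # model in the expanded features (lower bias)
    splined = make_pipeline(SplineTransformer(degree=3, n_knots=5),
                            LogisticRegression(max_iter=1000))

    print(cross_val_score(linear, X, y, cv=10).mean())
    print(cross_val_score(splined, X, y, cv=10).mean())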
• 57. Diminishing Returns on N*
At some point, adding more (* of the same) data does not do any good: performance stabilizes, but computational complexity increases.
The modeler becomes handcuffed to whatever technology is available to handle large amounts of data.
Determining what data to use is more important than worrying about the technology required to fit the model.
• 58. Global Versus Local Models in QSAR
When developing a compound, once a "hit" is obtained, the chemists begin to tweak the structure to make improvements. This leads to a chemical series.
We could build QSAR models on the large number of existing compounds (a global model) or on the series of interest (a local model).
Our experience is that local models beat global models the majority of the time. Here, fewer (of the most relevant) compounds are better.
Our first inclination is to use all the data because the (overall) error rate should get better. Like politics, All Problems Are Local.
• 59. Data Quality (QSAR Again)
One "Tier 1" screen is an assay for logP (the partition coefficient), which we use as a measure of "greasiness". Tier 1 means that logP is estimated for most compounds (via model and/or assay).
There was an existing high-throughput assay on a large number of historical compounds. However, the data quality was poor.
Several years ago, a new assay was developed that was lower throughput but higher quality. This became the default assay for logP, and the model was re-built on a small (N ≈ 1,000) set of chemically diverse compounds.
In the end, fewer compounds were assayed, but the model performance was much better and costs were lowered.
• 60. Big N and/or P - Reducing Extrapolation
However, Big N might start to sample from rare populations. For example:
a customer cluster that is less frequent but has high profitability (via Big N)
a specific mutation or polymorphism (i.e. a rare event) that helps derive a new drug target (via Big N and Big P)
highly nonlinear "activity cliffs" in computational chemistry can be elucidated (via Big N)
Now we have the ability to solve some unmet need.