1. Evolution of Regression:
From Classical Least Squares to Regularized
Regression to Machine Learning Ensembles
Covering MARS®, Generalized PathSeeker®, TreeNet®
Gradient Boosting and Random Forests®
A Brief Overview of the 4-Part Webinar
at www.salford-systems.com
May 2013
Dan Steinberg
Mikhail Golovnya
Salford Systems
Salford Systems ©2013 1
2. Full Webinar Outline
Webinar Part 1
• Regression Problem – quick overview
• Classical Least Squares – the starting point
• RIDGE/LASSO/GPS – regularized regression
• MARS – adaptive non-linear regression splines
Webinar Part 2
• CART Regression tree – quick overview
• Random Forest decision tree ensembles
• TreeNet Stochastic Gradient Boosted Trees
• Hybrid TreeNet/GPS (trees and regularized regression)
3. Regression
• Regression analysis is at least 200 years old
o most used predictive modeling technique (including logistic regression)
• American Statistical Association reports 18,900 members
o Bureau of Labor Statistics reports more than 22,000 statisticians in 2008
• Many other professionals involved in the sophisticated analysis
of data not included in these counts
o Statistical specialists in marketing, economics, psychology, bioinformatics
o Machine Learning specialists and “Data Scientists”
o Data Base professionals involved in data analysis
o Web analytics, social media analytics, text analytics
• Few of these other researchers will call themselves statisticians
o but may make extensive use of variations of regression
• One reason for the popularity of regression: it is effective
4. Regression Challenges
• Preparation of data – errors, missing values, etc.
o Largest part of typical data analysis (modelers often report
80% of their time)
o Missing values a huge headache (listwise deletion of rows)
• Determining which predictors to include in model
o Textbook examples typically have 10 predictors available
o Hundreds, thousands, even tens and hundreds of thousands available
• Transformation or coding of predictors
o Conventional approaches: logarithm, power, inverse, etc.
o Required to obtain a good model
• High correlation among predictors
o With increasing numbers of predictors this complication
becomes more serious
5. More Regression Challenges
• Obtaining “sensible” results (correct signs, no wild
outcomes)
• Detecting and modeling important interactions
o Typically not done because it is too difficult
• “Wide” data has more columns than rows
• Lack of external knowledge or theory to guide
modeling as more topics are modeled
6. Boston Housing Data Set
• Concerns the housing values in Boston area
• Harrison, D. and D. Rubinfeld. Hedonic Prices and the
Demand For Clean Air.
o Journal of Environmental Economics and Management, v5, 81-102 , 1978
• Combined information from 10 separate governmental
and educational sources to produce data set
• 506 census tracts in City of Boston for the year 1970
o Goal: study relationship between quality of life variables and property values
o MV median value of owner-occupied homes in tract ($1,000’s)
o CRIM per capita crime rates
o NOX concentration of nitric oxides (parts per 10 million), a proxy for air pollution generally
o AGE percent built before 1940
o DIS weighted distance to centers of employment
o RM average number of rooms per house
o LSTAT % lower status of population (without some high school and male laborers)
o RAD index of accessibility to radial highways
o CHAS borders Charles River (0/1)
o INDUS percent of acreage non-retail business
o TAX property tax rate per $10,000
o PT pupil teacher ratio
o ZN proportion of neighborhood zoned for large lots (>25K sq ft)
7. Ten Data Sources Organized
• US Census (1970)
• FBI (1970)
• MIT Boston Project
• Metropolitan Area Planning Commission (1972)
• Voigt, Ivers, and Associates (1965) (Land Use Survey)
• US Census Tract Maps
• Massachusetts Dept Of Education (1971-1972)
• Massachusetts Tax Payer’s Foundation (1970)
• Transportation and Air Shed Simulation Model, Ingram et al.,
Harvard University Dept of City and Regional Planning (1974)
• A. Schnare: An Empirical Analysis of the dimensions of
neighborhood quality. Ph.D. Thesis. Harvard. (1974)
• An excellent example of creative data blending
• Also an excellent example of careful model construction
• Authors emphasize the quality (completeness) of their data
8. Least Squares Regression
• LS – ordinary least squares regression
o Discovered by Legendre (1805) and Gauss (1809)
o Used to solve problems in astronomy using pen and paper
o Statistical foundation by Fisher in 1920s
o 1950s – use of electro-mechanical calculators
• The model is always of the form
Response = A + B1X1 + B2X2 + B3X3 + …
• The response surface is a hyper-plane!
• A – the intercept term
• B1, B2, B3, … – parameter estimates
• A usually unique combination of values exists which
minimizes the mean squared error of predictions on the
learn sample
• Experimental approach to model building
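The least squares fit described on this slide can be sketched in a few lines. This is a minimal illustration using NumPy's `lstsq` on synthetic, noise-free data (not the Boston data); the helper name `fit_ols` and the example coefficients are assumptions made for the sketch.

```python
import numpy as np

def fit_ols(X, y):
    """Return (A, B): intercept and slopes minimizing squared error on the learn sample."""
    # Prepend a column of ones so the intercept A is estimated with the slopes.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

# Synthetic illustration: Response = 2.0 + 1.5*X1 - 0.5*X2 + 0.25*X3
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 + X @ np.array([1.5, -0.5, 0.25])

A, B = fit_ols(X, y)
print(round(float(A), 6), np.round(B, 6))
```

Because the synthetic response is an exact hyper-plane, the fit recovers the generating intercept and slopes up to rounding.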
9. Transformations In Original Paper
(For Historical Reference)
• RM number of rooms in house: RM²
• NOX raised to power p, experiments on value: NOX^p
• DIS, RAD, LSTAT entered as logarithms of predictor
• Regression in paper is run on ln(MV)
• Considerable experimentation undertaken
• No train/test methodology
• Classical Regression agrees very closely with paper on
reported coefficients and R² = 0.81 (same w/o logging MV)
• Converting predictions back from logs yields MSE=15.77
• Note that this is on the learn sample only; no testing was performed
11. BATTERY PARTITION: Rerun 80/20 Learn/Test 100 times
Note partition sizes are constant
All three partitions change each cycle
Mean MSE=23.80
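The repeated-partition idea can be sketched as follows. This is not the SPM BATTERY PARTITION implementation; it is a rough NumPy illustration on synthetic data, and the function name `repeated_partition_mse` and its parameters are assumptions.

```python
import numpy as np

def repeated_partition_mse(X, y, n_repeats=100, test_fraction=0.2, seed=0):
    """Refit least squares on repeated random learn/test splits; return the test MSEs."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_fraction * n))      # partition sizes stay constant
    design = np.column_stack([np.ones(n), X])
    scores = []
    for _ in range(n_repeats):
        order = rng.permutation(n)              # partition membership changes each cycle
        test, learn = order[:n_test], order[n_test:]
        coef, *_ = np.linalg.lstsq(design[learn], y[learn], rcond=None)
        resid = y[test] - design[test] @ coef
        scores.append(float(np.mean(resid ** 2)))
    return np.array(scores)

# Synthetic data with unit-variance noise (not the Boston data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

scores = repeated_partition_mse(X, y)
print(scores.mean(), scores.std())
```

Reporting the mean over many partitions reduces the dependence of the score on any single lucky or unlucky split.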
12. Least Squares Regression on Raw Boston Data
• 414 records in the learn sample
• 92 records in the test sample
• Good agreement L/T:
o LEARN MSE = 27.455
o TEST MSE = 26.147
• Used MARS in forward stepwise LS mode to generate this model
[Figure: 3-variable solution with coefficients -0.597, +5.247, and -0.858]
13. Motivation for Regularized Regression
1960s and 1970s
• Unsatisfactory results from regression-based modeling of physical processes
o Coefficients changed dramatically with small changes in data
o Some coefficients judged to be too large
o Appearance of coefficients with “wrong sign”
o Especially severe with substantial correlations among predictors
(multicollinearity)
• Solution (1970): Hoerl and Kennard, “Ridge Regression”
• Earlier version (1962) just for stabilization of coefficients
o Initially poorly received by statistics profession
14. Regression Formulas
• X matrix of potential predictors (NxK)
• Y column: the target or dependent variable (Nx1)
• Estimated coefficients: $\hat{\beta} = (X'X)^{-1}(X'y)$ (standard formula)
• Ridge: $\hat{\beta}_r = (X'X + rI)^{-1}(X'y)$
• Simplest version: constant added to diagonal
elements of the X’X matrix
• r=0 yields usual LS
• r=∞ yields degenerate model
• Need to find r that yields best generalization error
• Observe that there is a potentially distinct “solution”
for every value of the penalty term r
• Varying r traces a path of solutions
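The ridge formula above can be evaluated directly for a handful of penalty values to trace a small path of solutions. A minimal sketch on synthetic data, assuming centered predictors and no intercept; the function name `ridge_coefficients` and the grid of r values are illustrative choices.

```python
import numpy as np

def ridge_coefficients(X, y, r):
    """Solve (X'X + rI) b = X'y; r = 0 reduces to ordinary least squares."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + r * np.eye(k), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(size=100)

# One distinct solution per penalty value: a (coarse) path of solutions.
path = {r: ridge_coefficients(X, y, r) for r in (0.0, 1.0, 10.0, 100.0, 1e6)}
for r, b in path.items():
    print(r, np.round(b, 4))
```

As r grows the coefficient vector shrinks toward zero; choosing among these solutions still requires a test sample or cross-validation.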
15. Ridge Regression
• “Shrinkage” of regression coefficients towards zero
• If zero correlation among all predictors then shrinkage
will be uniform over all coefficients (same percentage)
• If predictors are correlated then, while the length of the
coefficient vector decreases, some coefficients might
increase (in absolute value)
• Coefficients are intentionally biased, but this yields both more
satisfactory estimates and superior generalization
o Better performance (test MSE) on previously unseen data
• Coefficients much less variable even if biased
• Coefficients will typically be closer to the “truth”
16. Ridge Regression Features
• Ridge frequently fixes the wrong sign problem
• Suppose you have K predictors which happen to be
exact copies of each other
• RIDGE will give each a coefficient equal to 1/K
times the coefficient that would be given to just one
copy in a model
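The exact-copies property can be checked numerically. The sketch below uses the closed-form ridge solution on synthetic data with a small penalty; all names and values are illustrative assumptions. For K identical columns x, each copy receives x'y / (K·x'x + r), so the 1/K split relative to the single-copy coefficient x'y / (x'x + r) is very close when r is small compared with x'x.

```python
import numpy as np

def ridge_coefficients(X, y, r):
    """Closed-form ridge solution (X'X + rI)^{-1} X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + r * np.eye(k), X.T @ y)

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 4.0 * x + rng.normal(size=200)
r, K = 0.01, 5

single = ridge_coefficients(x[:, None], y, r)[0]          # model with one copy
copies = ridge_coefficients(np.tile(x[:, None], (1, K)), y, r)  # model with K copies

print(single, copies, copies.sum())
```

Ordinary least squares (r = 0) would fail here because X'X is singular; the ridge penalty is what makes the problem solvable and spreads the weight evenly.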
17. Ridge Regression vs OLS
[Figure: coefficient comparison, Ridge Regression versus Classical Regression]
Ridge: Worse on training data but much better on test data
Without test data must use Cross-Validation to determine how much to shrink
RIDGE TEST MSE=21.36
18. Lasso Regularized Regression
• Tibshirani (1996) an alternative to RIDGE regression
• Least Absolute Shrinkage and Selection Operator
• Desire to gain the stability and lower variance of ridge
regression while also performing variable selection
• Especially in the context of many possible predictors
looking for a simple, stable, low predictive variance
model
• Historical note: Lasso inspired by related work (1993) by
Leo Breiman (of CART and RandomForests fame), the
“non-negative garrote”.
• Breiman’s simulation studies showed the potential for
improved prediction via selection and shrinkage
19. Regularized Regression - Concepts
• Any regularized regression approach tries to balance model
performance and model complexity
• λ – regularization parameter, to be estimated
o λ = ∞ Null model zero-coefficients (maximum possible penalty)
o λ = 0 LS solution (no penalty)
[Diagram: LS Regression minimizes Mean Squared Error. Regularized Regression minimizes Mean Squared Error + λ × Model Complexity, where Model Complexity is:
Ridge: sum of squared coefficients
Lasso: sum of absolute coefficients
Compact: number of coefficients]
20. Regularized Regression: Penalized Loss Functions
• RIDGE penalty: squared coefficients, $\sum_j \beta_j^2$
• LASSO penalty: absolute values, $\sum_j |\beta_j|$
• COMPACT penalty: count of nonzero βs, $\sum_j |\beta_j|^0$
• General penalty: $\sum_j |\beta_j|^{\gamma}$
• RIDGE does no selection but Lasso and Compact select
• Power γ on |β| is called the “elasticity” (γ = 0, 1, 2)
• Penalty to be estimated is a constant λ multiplying one of
the above functions of the β vector
• Intermediate elasticities can be created: e.g. we could
have a 50/50 mix of RIDGE and LASSO yielding an
elasticity of 1.5
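The three penalties and a 50/50 mix can be written out as plain functions. This is a small illustrative sketch; the function names, the weight `w`, and the example coefficient vector are assumptions, and the mix follows the slide's description of a weighted average of the LASSO and RIDGE penalties.

```python
def penalty(beta, elasticity):
    """Penalty on a coefficient vector for elasticity 0 (compact), 1 (lasso), or 2 (ridge)."""
    if elasticity == 0:
        # Count of nonzero coefficients (handled separately because 0**0 == 1 in Python).
        return float(sum(1 for b in beta if b != 0))
    return float(sum(abs(b) ** elasticity for b in beta))

def mixed_penalty(beta, w):
    """Weighted average of the lasso and ridge penalties; w = 0.5 is the 50/50 mix."""
    return w * penalty(beta, 1) + (1.0 - w) * penalty(beta, 2)

def penalized_loss(mse, beta, lam, elasticity):
    """Learn-sample MSE plus lam times the chosen penalty."""
    return mse + lam * penalty(beta, elasticity)

beta = [0.5, -2.0, 0.0]
print(penalty(beta, 2), penalty(beta, 1), penalty(beta, 0), mixed_penalty(beta, 0.5))
```

Only the compact and lasso penalties charge a coefficient the moment it leaves zero, which is why they select variables while the ridge penalty only shrinks.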
21. LASSO Features
• With highly correlated predictors the LASSO will tend
to pick just one of them for model inclusion
• Dispersion of βs greater than for RIDGE
• Unlike AIC and BIC model selection methods that
penalize after the model is built, these penalties
influence the βs
• A convenient trick for estimating models with
regularization is a weighted average of any two
of the major elasticities 0, 1, and 2, e.g.:
• $w \sum_j |\beta_j| + (1 - w) \sum_j \beta_j^2$ (the “elastic net”)
22. Computational Challenge
• For a given regularization (e.g. LASSO) find the
optimal penalty λ on the penalty term
• Find the best regularization from the family
• Potentially very many models to fit
23. Computing Regularized Regressions -1
• Earliest versions of regularized regressions required
considerable computation as the penalty
parameter is unknown and must be estimated
• Lasso was originally computed by starting with no
penalty and gradually increasing the penalty
o So start with ALL vars in the model
o Gradually tighten the noose to squeeze predictors out
o Infeasible for problems with thousands of possible predictors
• Need to solve a quadratic programming problem
to optimize the Lasso solution for every penalty
value
24. Computing Regularized Regressions -2
• Work by Friedman and others introduced very fast
forward stepping approaches
• Start with maximum penalty (no predictors)
• Progress forward with stopping rule
o Dealing with millions of predictors possible
• Coordinate gradient descent methods (next slides)
• Will still want test sample or cross-validation for
optimization
• Generalized PathSeeker: full range of regularization
from compact to Ridge (elasticities from 0 through 2)
• Glmnet in R: partial range of regularization from Lasso to
Ridge (elasticities from 1 to 2)
25. GPS Algorithm
• Start with NO predictors in model
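• Seek the path β(λ) of solutions as function of penalty
strength λ
• Define $p_j(\lambda) = \partial P / \partial |\beta_j|$, marginal change in Penalty
• Define $g_j(\lambda) = -\partial R / \partial \beta_j$, marginal change in Loss
• Define $\lambda_j(\lambda) = g_j(\lambda) / p_j(\lambda)$, ratio (benefit/cost)
• Find $\max_j |\lambda_j(\lambda)|$ to identify coefficient to update (j*)
• Update $\beta_{j^*}$ in the direction of sign $\lambda_{j^*}$
• $\partial R / \partial \beta_j$ requires computing inner products of
current residual with available predictors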
o Easily parallelizable
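The stepping rule on this slide can be sketched as a simplified forward-stepping loop. This is not Friedman's GPS code or the SPM implementation: it assumes squared-error loss, a penalty family P = Σ[(e−1)β²/2 + (2−e)|β|] for an elasticity e in [1, 2), a fixed step size, and it omits the refinements in the GPS paper. All names are hypothetical.

```python
import numpy as np

def gps_style_path(X, y, elasticity=1.0, step=0.01, n_steps=500):
    """Simplified forward-stepping path; elasticity must lie in [1, 2)."""
    n, k = X.shape
    beta = np.zeros(k)                       # start with NO predictors in the model
    path = [beta.copy()]
    for _ in range(n_steps):
        residual = y - X @ beta
        g = X.T @ residual / n               # g_j = -dR/dbeta_j for R = mean squared error / 2
        p = (elasticity - 1.0) * np.abs(beta) + (2.0 - elasticity)   # p_j = dP/d|beta_j| > 0
        ratio = g / p                        # benefit / cost for each coefficient
        j = int(np.argmax(np.abs(ratio)))    # coefficient to update (j*)
        beta[j] += step * np.sign(ratio[j])  # small move in the direction of the sign
        path.append(beta.copy())
    return np.array(path)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(size=200)

path = gps_style_path(X, y)
mse = [float(np.mean((y - X @ b) ** 2)) for b in path]
print(np.round(path[-1], 2), round(mse[0], 2), round(mse[-1], 2))
```

Each step needs only the inner products of the current residual with the predictors, which is the part that parallelizes easily.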
26. How to Forward Step
• At any stage of model development choose between
o Adding a new variable to the model
o Updating an existing variable coefficient
• Step sizes are small, initial coefficients for any model are
very small and are updated in very small increments
• This explains why the Ridge elasticity can have solutions
with fewer than all the variables
o Technically ridge does not select variables, it only shrinks
o In practice it can only add one variable per step
27. Regularized Regression – Practical Algorithm
• Start with the zero-coefficient solution
• Look for best first step which moves one coefficient away from zero
o Reduces Learn Sample MSE
o Increases Penalty as the model has become more complex
• Next step: Update one of the coefficients by a small amount
o If the selected coefficient was zero, a new variable effectively enters into the model
o If the selected coefficient was not zero, the model is simply updated
Introducing New Variable
Variable   Current Model   Next Model
X1         0.0             0.0
X2         0.0             0.0
X3         0.2             0.2
X4         0.0             0.1
X5         0.4             0.4
X6         0.5             0.5
X7         0.0             0.0
X8         0.0             0.0
Updating Existing Model
Variable   Current Model   Next Model
X1         0.0             0.0
X2         0.0             0.0
X3         0.2             0.3
X4         0.0             0.1
X5         0.4             0.4
X6         0.5             0.5
X7         0.0             0.0
X8         0.0             0.0
28. Path Building Process
• Elasticity Parameter – controls the variable selection
strategy along the path (using the LEARN sample
only), it can be between 0 and 2, inclusive
o Elasticity = 2 – fast approximation of Ridge Regression, introduces
variables as quickly as possible and then jointly varies the magnitude of
coefficients – lowest degree of compression
o Elasticity = 1 – fast approximation of Lasso Regression, introduces
variables sparingly letting the current active variables develop their
coefficients – good degree of compression versus accuracy
o Elasticity = 0 – fast approximation of Best Subset Regression, introduces
new variables only after the current active variables were fully developed
– excellent degree of compression but may lose accuracy
[Diagram: Variable Selection Strategy. The path runs from λ = ∞ to λ = 0: Zero Coefficient Model → a variable is added → sequence of 1-variable models → a variable is added → sequence of 2-variable models → a variable is added → sequence of 3-variable models → … → Final OLS Solution]
29. Points Versus Steps
• Each path (elasticity) will have a different number of steps
• To facilitate model comparison among different paths,
the Point Selection Strategy extracts a fixed collection of
models into the points grid
o This eliminates some of the original irregularity among individual paths and
facilitates model extraction and comparison
[Diagram: Point Selection Strategy. Paths 1, 2, and 3 each take their own sequence of steps from the Zero Solution to the OLS Solution; every path is then sampled at a common grid of points 1 through 10]
30. LS versus GPS
• GPS (Generalized Path Seeker) introduced by Jerome Friedman in 2008 (Fast
Sparse Regression and Classification)
• Dramatically expands the pool of potential linear models by including different sets
of variables in addition to varying the magnitude of coefficients
• The optimal model of any desirable size can then be selected based on its
performance on the TEST sample
[Diagram: on the Learn Sample (X1, X2, X3, X4, X5, X6, …), OLS Regression yields a sequence of linear models (1-variable model, 2-variable model, 3-variable model, …), while GPS Regression yields a large collection of linear models (paths): 1-variable models with varying coefficients, 2-variable models with varying coefficients, 3-variable models with varying coefficients, …; models are evaluated on the Test Sample (X1, X2, X3, X4, X5, X6, …)]
31. Paths Produced by SPM GPS
• Example of 21 paths with different variable selection
strategies
32. Path Points on Boston Data
• Each path uses a different variable selection
strategy and separate coefficient updates
[Figure: Path Development, shown at Point 30, Point 100, Point 150, and Point 190]
33. GPS on Boston Data
• 414 records in the learn sample
• 92 records in the test sample
• 15% performance improvement
on the test sample
o GPS TEST MSE = 22.669
o LS TEST MSE = 26.147
[Figure: 3-variable solution; the LS coefficients +5.247, -0.858, and -0.597 (LS MSE 26.147) are shown for comparison]
34. Sentinel Solutions Detail
• Along the path followed by GPS for every elasticity we identify the solution
(coefficient vector) best for each performance measure
• No attention is paid to model size here so you might still prefer to select a model
from the graphical display
36. How To Select a Best Model
• Regularized regression was originally invented to
help modelers obtain more intuitively acceptable
models
• Can think of the process as a search engine
generating predictive models
• User can decide based on
o Complexity of model
o Acceptability of coefficients (magnitude, signs, predictors included)
• Clearly can be set to automatic mode
• Criterion could well be performance on test data
37. Key Problems with GPS
• Still a linear regression!
• Response surface is still a global hyper-plane
• Incapable of discovering local structure in the data
• Develop non-linear algorithms that build response
surface locally based on the data itself
o By trying all possible data cuts as local boundaries
o By fitting first-order adaptive splines locally
o By exploiting regression trees and their ensembles
38. From Linear to Non-linear
• Classical regression and regularized regression build
globally linear models
• Further accuracy can be achieved by building locally
linear models connected to each other at boundary
points called knots
• Function is known as a spline
• Each separate region of data represented by a “basis
function” (BF)
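A locally linear model joined at a knot can be written with a pair of hinge-shaped basis functions. The sketch below uses a synthetic piecewise-linear curve with a single known knot at 10 (not the Boston LSTAT/MV data); the helper name `hinge_pair` and the curve itself are assumptions for illustration.

```python
import numpy as np

def hinge_pair(x, knot):
    """Direct and mirror basis functions for one knot: max(0, x - knot), max(0, knot - x)."""
    return np.maximum(0.0, x - knot), np.maximum(0.0, knot - x)

# Synthetic curve: steep decline up to x = 10, shallow decline afterwards.
x = np.linspace(0.0, 40.0, 81)
y = np.where(x < 10.0, 40.0 - 2.5 * x, 15.0 - 0.3 * (x - 10.0))

direct, mirror = hinge_pair(x, 10.0)
design = np.column_stack([np.ones_like(x), direct, mirror])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(coef, 6))   # intercept at the knot, slope right of the knot, slope left of the knot
```

With the knot in the right place, ordinary least squares on the basis functions reproduces the kinked curve exactly; the hard part, which MARS automates, is finding the knots.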
[Figure: two scatter plots of MV versus LSTAT; the second panel (“Localize”) shows a locally linear fit with knots]
39. Finding Knots Automatically
• Stage-wise knot placement process on a flat-top function
[Figure: Y versus X for a flat-top function, showing the data, the true function with its true knots, and the successively placed Knot 1 through Knot 6]
40. MARS Algorithm
• Multivariate Adaptive Regression Splines
• Introduced by Jerome Friedman in 1991
o (Annals of Statistics 19 (1): 1-67) (earlier discussion papers from 1988)
• Forward stage:
o Add pairs of BFs (direct and mirror pair of basis functions represents a single
knot) in a step-wise regression manner
o The process stops once a user specified upper limit is reached
• Backward stage:
o Remove BFs one at a time in a step-wise regression manner
o This creates a sequence of candidate models of declining complexity
• Selection stage:
o Select optimal model based on the TEST performance (modern approach)
o Select optimal model based on GCV criterion (legacy approach)
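The forward stage can be illustrated with a greatly simplified greedy search. This sketch is not the MARS algorithm as implemented by Friedman or Salford Systems: it considers main effects only, re-fits least squares for every candidate knot, and omits interactions, the backward stage, and GCV. The function name `forward_stage` and the synthetic data are assumptions.

```python
import numpy as np

def forward_stage(X, y, max_bfs=6):
    """Greedily add hinge pairs (one knot each) until the basis-function limit is reached."""
    n, k = X.shape
    basis = [np.ones(n)]                 # start from the intercept-only model
    terms = []                           # (variable index, knot) for each added pair
    while (len(basis) - 1) + 2 <= max_bfs:
        best = None
        for j in range(k):
            for knot in np.unique(X[:, j])[1:-1]:     # interior data values as candidate knots
                pair = [np.maximum(0.0, X[:, j] - knot), np.maximum(0.0, knot - X[:, j])]
                design = np.column_stack(basis + pair)
                coef, *_ = np.linalg.lstsq(design, y, rcond=None)
                rss = float(np.sum((y - design @ coef) ** 2))
                if best is None or rss < best[0]:
                    best = (rss, j, knot, pair)
        if best is None:
            break
        _, j, knot, pair = best
        basis += pair
        terms.append((j, float(knot)))
    return terms

rng = np.random.default_rng(4)
x0 = np.tile(np.arange(30.0), 2)
x1 = rng.normal(size=60)
X = np.column_stack([x0, x1])
y = 2.0 * np.maximum(0.0, x0 - 12.0)     # single true knot at x0 = 12

terms = forward_stage(X, y, max_bfs=2)
print(terms)
```

A fuller version would keep adding pairs up to a generous limit and then prune them one at a time, scoring each candidate model on test data.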
41. MARS on Boston Data: TEST MSE=14.66
[Figure: 9-BF (7-variable) solution]
42. Non-linear Response Surface
• MARS automatically determined transition points between
various local regions
• This model provides major insights into the nature of the
relationship
• Observe that in this model NOX enters linearly
43. 200 Replications Learn/Test Partition
• Models were repeated with 200 randomly selected 20% test partitions
• GPS shows marginal performance improvement but a much smaller model
• MARS shows dramatic performance improvement
[Figure: distribution of TEST MSE across runs for Regression, GPS, and MARS]
44. Combining MARS and GPS
• Use MARS as a search engine to break predictors
into ranges reflecting differences in relationship
between target and predictors
• MARS also handles missing values with missing value
indicators and interactions for conditional use of a
predictor (only when not missing)
• Allow the MARS model to be large
• GPS can then select basis functions and shrink
coefficients
• We will see that this combination of the best of both
worlds will also apply to ensembles of decision trees
45. Running Score: Test Sample MSE
Method | 20% random | Parametric Bootstrap | Battery Partition
Regression | 27.069 | 27.97 | 23.80
MARS Regression Splines | 14.663 | 15.91 | 14.12
GPS Lasso/ Regularized | 21.361 | 21.11 | 23.15
46. Regression Tree
Out of the box results, no tuning of controls
9 regions (terminal nodes)
Test MSE = 17.296
47. Regression Tree Representation of a Surface
High Dimensional Step function
Should be at a disadvantage relative to other tools. Can never be smooth.
But always worth checking
48. Regression Tree Partial Dependency Plot
[Figure: partial dependency plots for LSTAT and NOX]
Use model to simulate impact of a change in predictor
Here we simulate separately for every training data record and then average
For CART trees the plot is essentially a step function
May only get one “knot” in graph if variable appears only once in tree
See appendix to learn how to get these plots
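The simulate-then-average procedure can be sketched generically for any model that exposes a prediction function. The function name `partial_dependence`, the stand-in step-function model, and the synthetic records are assumptions for illustration; this is not how SPM produces its plots.

```python
import numpy as np

def partial_dependence(predict, X, column, grid):
    """For each grid value, set the predictor to that value in every record and average the predictions."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, column] = value          # simulate the change separately for every record
        averages.append(float(np.mean(predict(X_mod))))
    return np.array(averages)

# Stand-in "tree-like" model: a single step in column 0 plus a linear effect of column 1.
def predict(X):
    return np.where(X[:, 0] < 10.0, 30.0, 15.0) + 0.5 * X[:, 1]

rng = np.random.default_rng(7)
X = rng.uniform(0.0, 20.0, size=(50, 2))

pd_curve = partial_dependence(predict, X, column=0, grid=[5.0, 15.0])
print(pd_curve)
```

Because the stand-in model has a single split on column 0, the curve has a single jump, matching the remark that a variable used once in a tree yields only one “knot”.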
49. Running Score
Method | 20% random | Parametric Bootstrap | Repeated 100 20% Partitions
Regression | 27.069 | 27.97 | 23.80
MARS | 14.663 | 15.91 | 14.12
GPS Lasso | 21.361 | 21.11 | 23.15
CART | 17.296 | 17.26 | 20.66
50. Bagger Mechanism
• Generate a reasonable number of bootstrap samples
o Breiman started with numbers like 50, 100, 200
• Grow a standard CART tree on each sample
• Use the unpruned tree to make predictions
o Pruned trees yield inferior predictive accuracy for the ensemble
• Simple voting for classification
o Majority rule voting for binary classification
o Plurality rule voting for multi-class classification
o Average predicted target for regression models
• Will result in a much smoother range of predictions
o Single tree gives same prediction for all records in a terminal node
o In bagger records will have different patterns of terminal node results
• Each record likely to have a unique score from ensemble
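The bagger mechanism for regression can be sketched with a small hand-written tree grower. This is a toy stand-in for CART (greedy squared-error splits, a minimum leaf size, no pruning), written only to show bootstrap sampling and averaging; the function names, the leaf-size setting, and the synthetic data are all assumptions.

```python
import numpy as np

def grow_tree(X, y, min_leaf=5):
    """Grow an unpruned regression tree; a leaf is a float, a split is (var, threshold, left, right)."""
    best = None
    if len(y) >= 2 * min_leaf:
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j])[1:]:
                left = X[:, j] < t
                n_left = int(left.sum())
                if n_left < min_leaf or len(y) - n_left < min_leaf:
                    continue
                sse = ((y[left] - y[left].mean()) ** 2).sum() + ((y[~left] - y[~left].mean()) ** 2).sum()
                if best is None or sse < best[0]:
                    best = (sse, j, t, left)
    if best is None:
        return float(y.mean())
    _, j, t, left = best
    return (j, t, grow_tree(X[left], y[left], min_leaf), grow_tree(X[~left], y[~left], min_leaf))

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        j, t, lo, hi = tree
        tree = lo if x[j] < t else hi
    return tree

def bag(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))    # bootstrap sample (rows drawn with replacement)
        trees.append(grow_tree(X[idx], y[idx]))
    return trees

def predict_bag(trees, X):
    # Regression: average the predicted target across all trees.
    return np.array([np.mean([predict_tree(t, x) for t in trees]) for x in X])

rng = np.random.default_rng(5)
X = rng.uniform(0.0, 10.0, size=(100, 2))
y = 3.0 * np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

single_pred = np.array([predict_tree(grow_tree(X, y), x) for x in X])
bag_pred = predict_bag(bag(X, y), X)
print(len(np.unique(single_pred)), len(np.unique(bag_pred)))
```

The single tree can emit only as many distinct predictions as it has terminal nodes, while the averaged ensemble gives most records their own score.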
51. Bagger Partial Dependency Plot
[Figure: bagger partial dependency plots for LSTAT and NOX]
Averaging over many trees allows for a more complex dependency
Opportunity for many splits of a variable (100 large trees)
Jaggedness may reflect existence of interactions
52. Running Score
Method | 20% random | Parametric Bootstrap | Battery Partition
Regression | 27.069 | 27.97 | 23.80
MARS | 14.663 | 15.91 | 14.12
GPS Lasso | 21.361 | 21.11 | 23.15
CART | 17.296 | 17.26 | 20.66
Bagged CART | 9.545 | 12.79
53. RandomForests: Bagger on Steroids
• Leo Breiman was frustrated by the fact that the bagger did
not perform better, and was convinced there was a better way
• Observed that trees generated by bagging across different
bootstrap samples were surprisingly similar
• How to make them more different?
• Bagger induces randomness in how the rows of the data are
used for model construction
• Why not also introduce randomness in how the columns are
used for model construction?
• Pick a random subset of predictors as candidate predictors –
a new random subset for every node
• Breiman was inspired by earlier research that experimented
with variations on these ideas
• Breiman perfected the bagger to make RandomForests
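The column-randomization step can be sketched for a single node. This toy example is not the RandomForests implementation: it shows only how a fresh random subset of predictors restricts the split search at one node, using squared-error splits on synthetic data. The function names and the subset size are assumptions.

```python
import numpy as np

def best_split(X, y, candidates):
    """Best (sse, variable, threshold) among the candidate predictors, by squared error."""
    best = None
    for j in candidates:
        for t in np.unique(X[:, j])[1:]:
            left = X[:, j] < t
            sse = ((y[left] - y[left].mean()) ** 2).sum() + ((y[~left] - y[~left].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (float(sse), int(j), float(t))
    return best

def random_subset_split(X, y, n_candidates, rng):
    """Draw a new random subset of predictors for this node, then search only those."""
    candidates = rng.choice(X.shape[1], size=n_candidates, replace=False)
    return candidates, best_split(X, y, candidates)

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 6))
y = 5.0 * (X[:, 2] > 0)                      # only predictor 2 matters in this toy data

full = best_split(X, y, range(X.shape[1]))   # bagger-style search over all predictors
candidates, restricted = random_subset_split(X, y, n_candidates=3, rng=rng)
print(full, candidates, restricted)
```

When the dominant predictor is left out of a node's subset, the tree is forced to split on something else, which is what makes the trees in the ensemble differ from one another.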
54. Running Score
Method | 20% random | Parametric Bootstrap | Battery Partition
Regression | 27.069 | 27.97 | 23.80
MARS | 14.663 | 15.91 | 14.12
GPS Lasso | 21.361 | 21.11 | 23.15
CART | 17.296 | 17.26 | 20.66
Bagged CART | 9.545 | 12.79
RF Defaults | 8.286 | 12.84
55. Stochastic Gradient Boosting (TreeNet)
• SGB is a revolutionary data mining methodology first
introduced by Jerome H. Friedman in 1999
• Seminal paper defining SGB released in 2001
o Google scholar reports more than 1600 references to this paper and a further
3300 references to a companion paper
• Extended further by Friedman in major papers in 2004
and 2008 (Model compression and rule extraction)
• Ongoing development and refinement by Salford
Systems
o Latest version released 2013 as part of SPM 7.0
• TreeNet/Gradient boosting has emerged as one of the
most used learning machines and has been successfully
applied across many industries
• Friedman’s proprietary code in TreeNet
56. Trees incrementally revise predictions
Tree 1 + Tree 2 + Tree 3
• Tree 1: first tree grown on original target. Intentionally “weak” model
• Tree 2: grown on residuals from first tree. Predictions made to improve first tree
• Tree 3: grown on residuals from model consisting of first two trees
Every tree produces at least one positive and at least one negative node. Red
reflects a relatively large positive node and deep blue reflects a relatively large
negative node. The total “score” for a given record is obtained by finding the relevant
terminal node in every tree in the model and summing across all trees.
57. Gradient Boosting Methodology: Key points
• Trees are usually kept small (2-6 nodes common)
o However, should experiment with larger trees (12, 20, 30 nodes)
o Sometimes larger trees are surprisingly good
• Updates are small (downweighted). Update factors can
be as small as .01, .001, .0001.
o Do not accept the full learning of a tree (small step size, also GPS style)
o Larger trees should be coupled with slower learn rates
• Use random subsets of the training data in each cycle.
Never train on all the training data in any one cycle
o Typical is to use a random half of the learn data to grow each tree
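The key points above (small trees, small updates, a random half of the learn data per cycle) can be sketched with two-node trees and squared-error loss, where each tree is fit to the current residuals. This is an illustrative toy, not TreeNet or Friedman's code; the function names, the learn rate of 0.1, and the synthetic data are assumptions.

```python
import numpy as np

def fit_stump(X, y):
    """Two-node regression tree: (variable, threshold, left mean, right mean)."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[1:]:
            left = X[:, j] < t
            lo, hi = y[left].mean(), y[~left].mean()
            sse = ((y[left] - lo) ** 2).sum() + ((y[~left] - hi) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t, lo, hi)
    return best[1:]

def predict_stump(stump, X):
    j, t, lo, hi = stump
    return np.where(X[:, j] < t, lo, hi)

def boost(X, y, n_trees=100, learn_rate=0.1, subsample=0.5, seed=0):
    rng = np.random.default_rng(seed)
    base = float(y.mean())
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_trees):
        # Each cycle sees only a random half of the learn data.
        idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
        stump = fit_stump(X[idx], (y - pred)[idx])     # fit the current residuals
        pred += learn_rate * predict_stump(stump, X)   # accept only a fraction of the update
        stumps.append(stump)
    return base, learn_rate, stumps

def predict_boost(model, X):
    base, learn_rate, stumps = model
    return base + learn_rate * sum(predict_stump(s, X) for s in stumps)

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
y = 3.0 * (X[:, 0] > 0) + X[:, 1] + rng.normal(scale=0.1, size=200)

model = boost(X, y)
train_mse = float(np.mean((y - predict_boost(model, X)) ** 2))
print(round(float(np.var(y)), 3), round(train_mse, 3))
```

The number of trees is the main tuning knob in such a scheme and would normally be chosen on a test sample or by cross-validation rather than on the learn sample.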
58. Running Score
Method | 20% random | Parametric Bootstrap | Battery Partition
Regression | 27.069 | 27.97 | 23.80
MARS | 14.663 | 15.91 | 14.12
GPS Lasso | 21.361 | 21.11 | 23.15
CART | 17.296 | 17.26 | 20.66
Bagged CART | 9.545 | 12.79
RF Defaults | 8.286 | 12.84
RF PREDS=6 | 8.002 | 12.05
TreeNet Defaults | 7.417 | 8.67 | 11.02
Using cross-validation on learn partition to determine optimal number of trees
and then scoring the test partition with that model: TreeNet MSE=8.523
59. Vary HUBER Threshold: Best MSE=6.71
Vary threshold where we switch from squared errors to absolute errors
Optimum when the 5% largest errors are not squared in loss computation
Yields best MSE on test data. Sometimes LAD yields best test sample MSE.
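The threshold idea can be sketched as a Huber loss whose switch point is set at a quantile of the absolute residuals. This is a generic illustration, not TreeNet's implementation; the function name, the 0.5 scaling of the squared term, and the example residuals are assumptions.

```python
import numpy as np

def huber_loss(residuals, quantile=0.95):
    """Squared error below the threshold, linear (absolute) error above it.

    The threshold is the given quantile of |residual|, so with quantile=0.95
    roughly the 5% largest errors are not squared.
    """
    abs_r = np.abs(np.asarray(residuals, dtype=float))
    delta = float(np.quantile(abs_r, quantile))
    quadratic = 0.5 * abs_r ** 2
    linear = delta * (abs_r - 0.5 * delta)      # continues the quadratic smoothly at delta
    return float(np.mean(np.where(abs_r <= delta, quadratic, linear))), delta

residuals = np.array([1.0] * 19 + [100.0])      # one wild outcome among small errors
loss, delta = huber_loss(residuals)
squared = float(np.mean(0.5 * residuals ** 2))
print(round(loss, 3), round(delta, 3), round(squared, 3))
```

Moving the quantile toward 0 approaches absolute-error (LAD) behaviour, and moving it to 1 recovers ordinary squared error, so the threshold can be treated as one more setting to vary against test MSE.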
61. Running Score
Method | 20% random | Parametric Bootstrap | Battery Partition
Regression | 27.069 | 27.97 | 23.80
MARS | 14.663 | 15.91 | 14.12
GPS Lasso | 21.361 | 21.11 | 23.15
CART | 17.296 | 17.26 | 20.66
Bagged CART | 9.545 | 12.79
RF Defaults | 8.286 | 12.84
RF PREDS=6 | 8.002 | 12.05
TreeNet Defaults | 7.417 | 8.67 | 11.02
TreeNet Huber | 6.682 | 7.86 | 11.46
TN Additive | 9.897 | 10.48
If we had used cross-validation to determine the optimal number of trees and
then used those to score test partition the TreeNet Default model MSE=8.523
62. References MARS
• Friedman, J. H. (1991a). Multivariate adaptive regression
splines (with discussion). Annals of Statistics, 19, 1-141
(March).
• Friedman, J. H. (1991b). Estimating functions of mixed
ordinal and categorical variables using adaptive splines.
Department of Statistics, Stanford University, Tech. Report
LCS108.
• De Veaux, R. D., Psichogios, D. C., and Ungar, L. H. (1993). A
comparison of two nonparametric estimation schemes: MARS
and neural networks. Computers & Chemical Engineering,
Vol. 17, No. 8.
63. References Regularized Regression
• Hoerl, A. E., and Kennard, R. W. (1970). Ridge regression:
Biased estimation for nonorthogonal problems.
Technometrics, 12, 55-67.
• Friedman, J. H. Fast sparse regression and
classification.
http://www-stat.stanford.edu/~jhf/ftp/GPSpaper.pdf
• Friedman, J. H., and Popescu, B. E. (2003). Importance
sampled learning ensembles. Stanford University,
Department of Statistics. Technical Report.
http://www-stat.stanford.edu/~jhf/ftp/isle.pdf
• Tibshirani, R. (1996). Regression shrinkage and selection
via the lasso. J. Royal. Statist. Soc. B. 58, 267-288.
64. References Regression via Trees
• Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984).
Classification and Regression Trees. CRC Press.
• Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
• Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
• Friedman, J. H. (2001). Greedy function approximation: A gradient
boosting machine. Ann. Statist., Volume 29, Number 5, 1189-1232.
http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf
• Friedman, J. H., and Popescu, B. E. (2003). Importance
sampled learning ensembles. Stanford University, Department
of Statistics. Technical Report.
http://www-stat.stanford.edu/~jhf/ftp/isle.pdf
65. What’s Next
• Visit our website for the full 4-hour video series
• https://www.salford-systems.com/videos/tutorials/the-evolution-of-regression-modeling
o 2 hours methodology
o 2 hours hands-on running of examples
o Also other tutorials on CART, TreeNet gradient boosting
• Download no-cost 60-day evaluation
o Just let the Unlock Department know you participated in the
on-demand webinar series
• Contains many capabilities not present in open
source renditions
o Largely the source code of the inventor of today’s most important
data mining methods: Jerome H. Friedman
o We started working with Friedman in 1990 when very few people
were interested in his work
66. Salford Predictive Modeler SPM
• Download a current version from our website
http://www.salford-systems.com
• Version will run without a license key for 10 days
• For more time request a license key from
unlock@salford-systems.com
• Request configuration to meet your needs
o Data handling capacity
o Data mining engines made available