Some of the new features in SPM 7
Salford Systems
New features available in SPM 7.
1. Advances in Boosted Tree Technology: TreeNet Model Compression and Optimal Rule Extraction
Dan Steinberg, Mikhail Golovnya, N. Scott Cardell
May 2012
Salford Systems
http://www.salford-systems.com
2. Beyond TreeNet
• TreeNet has set a high bar for automatic off-the-shelf model performance
  – TreeNet was used to win all four 1st place awards in the Duke/Teradata churn modeling competition of 2002
  – Awards in 2010, 2009, 2008, 2007, 2004 all based on TreeNet
• TreeNet was first developed (MART) in 1999 and essentially perfected in 2000
  – Many improvements since then but the fundamentals are largely those of the 2000 technology
• In subsequent work Friedman has introduced major extensions that go beyond the framework of boosted trees
© Copyright Salford Systems 2012
3. Importance Sampled Learning Ensembles (ISLE)
• Friedman’s work in 2003 is somewhat more complex than what we describe here
  – He presented his paper at our first data mining conference in San Francisco in March of 2004
• We focus on the concept of model compression
• A TreeNet model is grown myopically, one added tree at a time
  – From the current model, attempt to improve it by predicting residuals
  – Each tree represents incremental learning and error correction
  – Slow learning, small steps
  – During model development we do not know where we are going to end up
• Once the TreeNet model is completed, can we review it and “clean it up”?
4. Post-Processing With Regularized Regression
• Friedman’s ISLE takes a TreeNet model as its raw material and considers how we can refine it using regression
• Consider: every tree takes our raw data as input and generates outputs at the terminal nodes
• Each tree can be thought of as a new variable constructed out of the original data
  – No missing values in tree outputs even if there were missing values in the raw data
  – Outliers among such predictors are expected to be rare as each terminal node is doing averaging and the trees are typically small
• Might create many more generated variables than original raw variables
  – Boston data set has 13 predictors, TN might generate 1000 trees
5. Regularized Regression
• Modern regression techniques starting with Ridge regression, and then the Lasso, and finally hybrid models
• Methods have advantages over classical regression
  – Can handle highly correlated variables (Ridge)
  – Can work with data sets with more columns than rows
  – Can do variable selection (Lasso, Ridge-Lasso hybrids)
  – Much more effective and reliable than old-fashioned stepwise
• Regularized regression is still regression and thus suffers from all the primary limitations of classical regression
  – No missing value handling
  – Linear additive model (no interactions)
  – Sensitive to functional form of predictors
6. Regularized Regression Applied to Trees
• Applying regularized regression to trees is not vulnerable to these traditional problems
  – Missing values already handled and transformed to non-missing
  – Interactions incorporated into the tree structure
  – Trees are invariant with respect to typical univariate transformations
  – Any order-preserving transform will not affect the tree
• What will a regularized regression on trees accomplish?
  – Combine all identical trees into one
  – Combine several similar trees into a compromise tree
  – Bypass any meandering while TreeNet searched for the optimum
  – Reweight the trees (in TN all trees have equal weight)
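The effect of a second-stage regression on tree outputs can be illustrated with a minimal pure-Python sketch. This is not the SPM implementation: the `lasso_cd` helper, the penalty value, and the tiny made-up "tree output" columns are all assumptions chosen for illustration. Each column stands for the predictions of one tree; one column is an exact copy of another and one is mostly noise.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_cd(columns, target, penalty, sweeps=200):
    """Lasso by cyclic coordinate descent (no intercept; hypothetical helper).

    columns: list of predictor columns, here one column of outputs per tree.
    Minimizes (1/2n) * sum((y - fit)^2) + penalty * sum(|coef|).
    """
    n = len(target)
    coefs = [0.0] * len(columns)
    residual = list(target)  # residual = y - fit, and the fit starts at 0
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            scale = sum(v * v for v in col) / n
            if scale == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(c * (r + coefs[j] * c) for c, r in zip(col, residual)) / n
            new_coef = soft_threshold(rho, penalty) / scale
            delta = new_coef - coefs[j]
            if delta != 0.0:
                residual = [r - delta * c for c, r in zip(col, residual)]
                coefs[j] = new_coef
    return coefs


# Made-up outputs of four "trees" on six records (already centered).
tree_a = [-2.0, -2.0, -1.0, 1.0, 2.0, 2.0]
tree_b = list(tree_a)                         # exact copy of tree_a
tree_c = [1.0, -1.0, 0.5, -0.5, 1.0, -1.0]    # informative, different pattern
tree_d = [0.3, -0.3, -0.3, 0.3, -0.3, 0.3]    # mostly noise
target = [2 * a + c for a, c in zip(tree_a, tree_c)]

weights = lasso_cd([tree_a, tree_b, tree_c, tree_d], target, penalty=0.05)
kept = [j for j, w in enumerate(weights) if abs(w) > 1e-9]
print(kept, weights)
```

In this toy case the duplicate column and the noise column should end with (numerically) zero weight, while the two informative columns receive unequal weights, which mirrors the "combine identical trees" and "reweight" points above.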
7. Regularized Regression of TreeNet
• In this mode of ISLE we develop the best TreeNet model we can
• Post-process results allowing for different degrees of compression
• By default we run four models on the TreeNet
  – Ridge (no compression, just reweighting)
  – Lasso (compression possible)
  – Ridged Lasso (hybrid of Lasso and Ridge but mostly Lasso)
  – Compact (maximum compression)
• Goal usually is to find a substantial degree of compression while giving up little or nothing on test sample performance
• Could focus only on beating TN performance
8. Model Compression: Early Days
• TreeNet has always offered model truncation
• Instead of using the fully articulated model, stop the process early
• In 2005 this method was being used by a major web portal
  – TreeNet model used to predict likely response to item presented to visitor on a web page (ad, link, photo, story)
  – To implement real-time response, TN model limited to first 30 trees
  – Sacrificed considerable predictive accuracy to have a model that could score fast enough in real time
  – Truncated TreeNet at 30 trees was still better than other alternatives
  – Consider that model might have been rebuilt every hour
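Truncation can be sketched as scoring with only the first few trees of an additive model. The sketch below assumes the usual boosted-tree form (an initial value plus a learning rate times the sum of tree outputs); the slides do not state TreeNet's internal scoring formula, and the `stump` helper, feature names, and numbers are hypothetical.

```python
def stump(feature, threshold, left_value, right_value):
    """Build a one-split tree as a scoring function (illustrative stand-in for a real tree)."""
    def score(record):
        return left_value if record[feature] <= threshold else right_value
    return score


def predict_truncated(initial_value, learn_rate, trees, record, max_trees=None):
    """Score one record using only the first max_trees trees (all trees if None)."""
    used = trees if max_trees is None else trees[:max_trees]
    return initial_value + learn_rate * sum(tree(record) for tree in used)


# Hypothetical three-tree model and one record.
trees = [
    stump("rooms", 6.0, -4.0, 6.0),
    stump("lstat", 10.0, 3.0, -3.0),
    stump("rooms", 7.0, -1.0, 5.0),
]
record = {"rooms": 6.5, "lstat": 12.0}

full = predict_truncated(22.0, 0.1, trees, record)
first_two = predict_truncated(22.0, 0.1, trees, record, max_trees=2)
print(full, first_two)
```

Scoring cost grows with the number of trees used, which is why a 30-tree cutoff could make real-time scoring feasible at some cost in accuracy.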
9. Illustrative Example: Boston Housing Data Set
Set Up Model
10. TreeNet Controls
1000 trees, Least Squares, AUTO Learnrate
11. Post Processor Controls: What Type of Post Processing
12. Post Processor Details: Use all defaults
• Standardizing the “trees” gives all equal weight in regularized regression
• Worth experimenting with unstandardized trees, where larger-variance trees will dominate
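Standardizing here can be read as rescaling each tree's output column to mean 0 and standard deviation 1 before the regularized regression. The sketch below shows that reading only; the exact scaling SPM applies is not stated in the slides, and the `standardize` helper and the two columns are made up.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale one column of tree outputs to mean 0 and standard deviation 1."""
    center = mean(column)
    spread = pstdev(column)
    if spread == 0.0:
        return [0.0 for _ in column]  # a constant tree carries no information
    return [(v - center) / spread for v in column]


small_tree = [0.1, -0.1, 0.1, -0.1]   # low-variance tree output
big_tree = [5.0, -5.0, 5.0, -5.0]     # high-variance tree output

small_std = standardize(small_tree)
big_std = standardize(big_tree)
print(small_std, big_std)
```

After rescaling, both columns are on the same footing, so a penalty on coefficient size treats them alike. Without rescaling, the high-variance column needs a much smaller coefficient to have the same effect and is therefore penalized less.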
13. Two Stage Modeling Process
• First stage here is a TreeNet but in SPM could also be
  – Single CART tree (focus would be on nodes, e.g. from maximal tree)
  – Ensemble of CART trees (bagger)
  – MARS model (basis functions from maximal model)
  – Random Forests
• In ISLE mode we need to operate on a collection of variables created by a learning machine; these can come from any of our tree engines or MARS
• We will get first stage results: a model
• Then get second stage: model refinement
  – Model compression or model selection (e.g. tree pruning)
14. TreeNet Results
Test Set R2=.87875, MSE=7.407
15. TreeNet Results: Residual Stats
One substantial outlier more than 5 IQR outside central data range
16. TreeNet and Compressed TreeNet
Both Models Reported Below
• The dashed lines show evolution of the compressed model
• Because we can choose any of our 1000 trees to start, the compressed model starts off much better than the original TreeNet, and it has a coefficient
17. ISLE Reweighted TreeNet: Test Data Results
18. TreeNet vs ISLE Residuals
ISLE is wider in the center but narrower top to bottom
Panels: TreeNet Residuals; ISLE Compressed TreeNet
19. Comment on the First Tree
• It is interesting to observe that in this example the compressed model with just one tree in it outperforms the TreeNet model with just one tree
• Trees are built without look-ahead, but having a menu of 1000 trees to choose from allows the 2nd stage model to do better
• Worst case scenario is that 2nd stage chooses same first tree
• Coefficient can spread out the predictions
20. TreeNet Model Compression
• TreeNet has set a high bar for predictive accuracy in the data mining field
• We now offer several ways in which a TreeNet can be further improved by post-processing
• Consider that a TreeNet model is built one step at a time without knowledge of where we will end up
  – Some trees are exact or almost exact copies of other trees
  – Some trees may exhibit some “wandering” before the right direction is found
  – Trees are each built on a different random subset of the data and some trees may just be “unlucky”
  – Post-processing can combine multiple copies of essentially the same tree and skip any unnecessary wandering
21. How Much Compression is Possible?
• Our experience derives from working with data from several industries (retail sales, online web advertising, credit risk, direct marketing)
• Compression of 80% is not uncommon for the best model generated by the post-processing
• However, the user is free to truncate the compressed model as it is also built up sequentially (we add one tree at a time to the model)
• The user can thus choose from a possibly broad range of tradeoffs, opting for even greater compression available from a less accurate model
• In the BOSTON example 90% compression also performs quite well (about 40 trees instead of the optimal 91 trees)
22. A Comment on the Theory behind ISLE
• In Friedman’s paper on ISLE he provides a rationale for this approach quite different from ours
• Consider that our goal is to learn a model from data where it is clear that a linear regression is not adequate
• How can we automatically manufacture basis functions that capture more complex structure than raw variables?
  – Imagine offering high order polynomials
  – Some have suggested adding Xi*Xj interactions and also 1/Xi as new predictors plus log(Xi) for all strictly positive regressors
  – Friedman proposes TreeNet as a vehicle for generating such new variables in the search for a more faithful model (to the truth)
  – Think of TreeNet as a search engine for features (constructed predictors)
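For contrast with tree-generated features, the hand-made expansion mentioned above (pairwise products, reciprocals, logs) can be sketched as follows. The `hand_made_features` helper and the column names are hypothetical, and the sketch checks positivity per record, which is a simplification of "strictly positive regressors" (a property of the whole column).

```python
import math
from itertools import combinations


def hand_made_features(row):
    """Expand one record (dict of name -> value) with Xi*Xj, 1/Xi and log(Xi) terms."""
    features = dict(row)
    for (a, va), (b, vb) in combinations(row.items(), 2):
        features[f"{a}*{b}"] = va * vb
    for name, value in row.items():
        if value != 0:
            features[f"1/{name}"] = 1.0 / value
        if value > 0:
            features[f"log({name})"] = math.log(value)
    return features


expanded = hand_made_features({"x1": 2.0, "x2": 4.0})
print(sorted(expanded))
```

With p all-positive raw predictors this produces p + p(p-1)/2 + 2p columns, and every one of them still has to be specified in advance, whereas the trees discover their own splits and interactions.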
23. From Trees to Nodes
• In a second round of work on the idea of post-processing a tree ensemble Friedman suggested working with nodes
• Every node in a decision tree (other than the root) defines a potentially interesting subset of data
• Analysts have long thought about the terminal nodes of a CART tree in this way
  – Each terminal node is a segment or can be thought of as an interesting rule
  – Cardell and Steinberg proposed blending CART and logistic regression in this way (each terminal node is a dummy variable)
• Now we extend this thinking to all nodes below the root
• Tibshirani proposed using all the nodes of a maximal tree in a Lasso model to “prune” the tree
24. Nodes in a Single TreeNet Tree
Tree grown to have T=6 terminal nodes
• Typical TreeNet has T=6 terminal nodes
• One level down has two nodes
• Next level has 4 nodes (3 terminal)
• Next 2 levels have 2 nodes each
• Total is 10 non-root nodes
• Will always be T + (T-1) - 1 = 2(T-1)
• Represent each node as a 0/1 indicator
• Record passes through this node (1) or does not pass through this node (0)
• With 10 node indicators per 6-terminal tree, a 1,000 tree TreeNet will generate 10,000 node indicators
• Now we want to post-process this node representation of the TreeNet
• Methodology can generate an immense number of predictors
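The node-indicator encoding can be sketched on a tree with the same shape as the one described (T=6 terminal nodes, 10 non-root nodes). The nested-dict tree representation, the `leaf` and `split` helpers, and the split variables and thresholds are assumptions made up for this illustration; they are not SPM's internal format.

```python
def leaf(value):
    """Terminal node holding a prediction."""
    return {"value": value}


def split(feature, threshold, left, right):
    """Internal node: records with feature <= threshold go left."""
    return {"feature": feature, "threshold": threshold, "left": left, "right": right}


def non_root_nodes(terminals):
    """T terminal + (T-1) internal nodes, minus the root."""
    return 2 * (terminals - 1)


def node_indicators(tree, record):
    """Return {node_path: 0 or 1} for every non-root node of one tree."""
    indicators = {}

    def walk(node, path, reached):
        if path:  # the root (empty path) is skipped: every record passes through it
            indicators[path] = 1 if reached else 0
        if "value" in node:
            return
        goes_left = record[node["feature"]] <= node["threshold"]
        walk(node["left"], path + "L", reached and goes_left)
        walk(node["right"], path + "R", reached and not goes_left)

    walk(tree, "", True)
    return indicators


# Six terminal nodes: levels below the root hold 2, 4, 2 and 2 nodes.
tree = split("x1", 0.0,
             split("x2", 0.0, leaf(1.0), leaf(2.0)),
             split("x2", 1.0, leaf(3.0),
                   split("x3", 0.0, leaf(4.0),
                         split("x1", 2.0, leaf(5.0), leaf(6.0)))))

record = {"x1": 1.5, "x2": 3.0, "x3": 0.7}
flags = node_indicators(tree, record)
print(flags)
print(1000 * non_root_nodes(6))
```

Each path string such as `RRRL` also spells out the rule (the conjunction of split conditions) that defines the node, which is what makes the indicators readable as rules later on.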
25. Use Regularized Regression to Post Process
• Essential because even if we start with a small data set (rows and columns) we might generate thousands of trees
• The regularized regression is used to
  – SELECT trees (only a subset of the original trees will be used)
  – REWEIGHT trees (originally all had equal weight)
• The new model is still an ensemble of regression trees but now recombined differently
  – Some trees might get a negative weight
• New model could have two advantages
  – Could be MUCH smaller than original model (good for deployment)
  – Could be more accurate on holdout data
• No guarantees but results often attractive
26. Variations on Node Post Processing
• Pure: nodes (only node dummies in 2nd stage model)
• Hybrid: nodes + trees (mix of ISLE and nodes)
• Hybrid: raw predictors + nodes (Friedman’s preferred)
• Hybrid: raw predictors + ISLE variables
• Hybrid: raw predictors + ISLE trees + nodes
• In addition we could add the original TreeNet prediction to any of these sets of predictors
• Ideal interaction detection: include TreeNet prediction from a pure additive model and node indicators as regressors
27. Raw Predictor Problems
• Much of our empirical work involves incomplete data (missing values) and the 2nd stage model requires complete data (listwise deletion)
• While the hybrid models involving raw variables can capture nonlinearity and interactions, the raw predictors act as everyday regressors
  – Issue of functional form
  – Issue of outliers
• Using ISLE variables may be far better for working with data for which careful cleaning and repair is not an option
28. Same Data Post-Processing Nodes
• In this example running only on nodes does not do well
• See the upper dotted performance curves
• Still we will examine the outputs generated
• Which method works best will vary with specifics of the data
29. Pure RuleSeeker
• Each variable in model is a node, or a RULE
• Worthwhile to examine mean target, lift, support and agreement with test data
• All shown above
30. Rule Table: Display is Sortable
• Number of terms in a rule is determined by location of node in tree
• Deep nodes can involve more variables (minimum is one, max is equal to depth of tree)
31. Rule Statistics
More columns from the Rule Table Display
32. Lift Report: High Lifts Represent Interesting Segments
Dot for each rule (here displaying test data results)
33. Parametric Bootstrap For Interaction Statistics
34. Final Details
• We have described RuleSeeker as a way to post-process a TreeNet model and this is a fundamental use of the method
• When our goal from the start is to extract rules, we are advised to modify the TreeNet controls in two ways
  – Allow the sizes of the trees to vary at random
  – Use very small subsets of the data when growing each tree
• Friedman recommends an average tree size of 4 terminal nodes and using a Poisson distribution to generate varying tree sizes (will often yield a few trees with 10-16 nodes)
• Friedman describes experiments in which each tree in the TreeNet is grown on just 5% of the available data
  – TreeNet first stage is inferior to standard TreeNet but 2nd stage could actually outperform the standard TreeNet
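Randomly varying tree sizes can be sketched by drawing each tree's terminal-node count from a shifted Poisson distribution. The shift (a floor of 2 terminal nodes, so that every tree has at least one split) is an assumption of this sketch; the slides only say "Poisson" with an average of 4 terminal nodes. The `poisson` and `sample_tree_sizes` helpers are hypothetical, and the Poisson draw uses Knuth's multiplication method because the Python standard library has no Poisson sampler.

```python
import math
import random


def poisson(rng, mean):
    """Draw one Poisson variate by Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_tree_sizes(n_trees, average_terminals=4, minimum_terminals=2, seed=0):
    """Terminal-node counts: minimum plus a Poisson draw, so the mean is average_terminals."""
    rng = random.Random(seed)
    extra_mean = average_terminals - minimum_terminals
    return [minimum_terminals + poisson(rng, extra_mean) for _ in range(n_trees)]


sizes = sample_tree_sizes(20000)
print(sum(sizes) / len(sizes), min(sizes), max(sizes))
```

Most trees drawn this way are very small (short rules), with an occasional larger tree that can capture a higher-order interaction.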
35. RuleSeeker and Huge Data
• If the RuleSeeker approach can in fact outperform standard TreeNet this suggests a sampling approach to massive data sets
• Extract rather small (possibly stratified) samples from each of many data repositories
• Grow a TreeNet tree
• Repeat random draws to grow subsequent trees
• Friedman’s approach does not grow very many trees (200)
• The 2nd stage regression must be run on a much larger sample but regression is much easier to distribute than trees
36. RuleSeeker Summary
• A RuleSeeker model has several interesting dimensions
  – It is a post-processed version of a TreeNet
  – RuleSeeker model could offer better performance than original TN
  – RuleSeeker model might also be more compact
  – Rules extracted could be seen as important INTERACTIONS
  – Rules could be studied as rules
• Compare train vs test Lift (want good agreement)
• Consider tradeoff of Lift versus Support
  – Rules can guide targeting but only worthwhile if support is sufficient
37. Big Data
• Currently we support 64-bit single server
• Using typical modern servers means 32 cores and 512GB RAM
  – Shortly we expect to see 200 cores and 2TB RAM at modest prices
  – Our training data can reach about 1/3 RAM without disk thrashing
  – 200GB training data (50 million rows by 1000 predictors)
• MapReduce/Hadoop appears to be the emerging standard for massively parallel data stores and computation
• Our approach will be bagging models that extract random samples from each of the data stores
• Each mapper and reducer are expected to have 4GB RAM
• We will require reducers to be equipped with 16GB
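The 200GB figure is consistent with storing each value in 4 bytes, as the short calculation below shows. The 4-byte (single precision) storage and decimal gigabytes are assumptions of this sketch; the slides do not state either, and the helper names are hypothetical.

```python
def training_data_gb(rows, predictors, bytes_per_value=4):
    """Size of a dense rows x predictors table in decimal gigabytes."""
    return rows * predictors * bytes_per_value / 1e9


def ram_budget_gb(ram_gb, usable_fraction=1 / 3):
    """Training data that fits under the 'about 1/3 of RAM' rule of thumb."""
    return ram_gb * usable_fraction


size_gb = training_data_gb(50_000_000, 1000)
print(size_gb)
print(ram_budget_gb(512), ram_budget_gb(2000))
```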