Rakesh Gupta¹, Chris Sneed¹, Vipul Tyagi¹
¹College of Computing and Technology,
Lipscomb University, Nashville, TN, USA
Predicting Online Purchases Using
Conversion Prediction Modeling
Executive Summary
• Homesite Group Inc. sponsored a Kaggle* competition to understand how it could better predict what price will entice its quote seekers to purchase a home insurance policy.
• The outcome of this research is relevant to the field of retail sales, and especially to online sales.
• The benefit of this work for Homesite is more sales from its leads through effective product pricing.
• In this presentation, our team demonstrates the process we followed to build the model and our results in predicting the data.
*https://www.kaggle.com/c/homesite-quote-conversion
[Figure: U.S. quarterly retail e-commerce sales, 1st Quarter 2016]
*U.S. Census Bureau News. Quarterly Retail E-Commerce Sales for 1st Quarter 2016. (May 2016).
Predicting Online Purchases – A Comparison of Machine Learning Approaches
• Sales lead articles and history
– Predictive models
– Sales and lead cycle research
– Sales pricing models
– Dynamic pricing
• Patents
– Sales lead prioritization
– Lead conversion
– Sales lead conversion
• Classification algorithms
– Naïve Bayes
– Neural networks
– Binary logistic regression
– AdaBoost
– Weighted KNN
– Gradient boosting
– Decision trees (CART, C5.0, CHAID)
– Support vector machines
Classification Algorithms
• Decision trees (CART, C5.0, CHAID)
• Naïve Bayes
• Neural networks
• Binary logistic regression
• AdaBoost
• Weighted KNN
• Gradient boosting
• Support vector machines
Data Source Analysis
• Data from Homesite was relatively clean to begin with.
• The dataset had 299 predictor variables and one target variable, "QuoteConversion_Flag".
– The target variable takes the values 0 or 1.
• The data comprised a training set of roughly 260K records and a test set of roughly 173K records.
• During analysis, we removed the variable "QuoteDate" and the following variables, which were near-constant or largely missing:
Summary Statistics
Statistic      GeographicField10A  GeographicField10B  PersonalField84  PropertyField29  PropertyField6
Min            -1                  -1                  1                0                0
1st Quartile   -1                  25                  2                0                0
Median         -1                  25                  2                0                0
Mean           -1                  25                  1.99             0                0
3rd Quartile   -1                  25                  2                0                0
Max            -1                  25                  8                10               0
NAs            –                   –                   207020           334630           –
Data Cleansing & Preparation
• Converted categorical variables to numeric
– 27 variables converted
• 293 predictor variables in the full training set
• Multiple train/test split ratios
– 90/10
– 80/20
– 67/33
• Randomized sampling
• Multiple iterations (a preparation sketch follows below)
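As a rough illustration of the steps above, the sketch below converts categorical columns to numeric codes and draws a randomized 80/20 split with pandas and scikit-learn. The deck does not show the team's actual code, so the file path and exact column names are assumptions rather than Homesite's real schema.

```python
# Minimal preparation sketch (hypothetical path and column names, not the team's actual code)
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")            # Homesite training file (placeholder path)
train = train.drop(columns=["QuoteDate"])   # drop the quote date, as described above

# Convert categorical (string) columns to numeric codes -- the deck reports 27 such columns
cat_cols = train.select_dtypes(include="object").columns
for col in cat_cols:
    train[col] = train[col].astype("category").cat.codes

X = train.drop(columns=["QuoteConversion_Flag"])   # predictor variables
y = train["QuoteConversion_Flag"]                  # binary target: 0 = no purchase, 1 = purchase

# Randomized 80/20 split; the team also tried 90/10 and 67/33
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
```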
Classifications & Platforms
• R – open source statistical tool
– Naïve Bayes
– Logistic Regression
– Boosting
• Python – open source programming platform
– Naïve Bayes
– kNN
– Logistic Regression
Naïve Bayes*
• Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values.
• All naive Bayes classifiers assume that the value of a particular
feature is independent of the value of any other feature, given the
class variable.
• The method of maximum likelihood is applied for parameter
estimation for naive Bayes models.
• Despite the naive design and apparently oversimplified assumptions,
naive Bayes classifiers have worked quite well in many complex real-
world situations.
• An advantage of naive Bayes is that it only requires a small amount of
training data to estimate the parameters necessary for classification.
• Our team used Gaussian Naïve Bayes because it is well suited to continuous data (a sketch follows below)
*Naïve Bayes classifier. (n.d.). In Wikipedia. Retrieved from
https://en.wikipedia.org/wiki/Naive_Bayes_classifier
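A minimal Gaussian Naïve Bayes fit in scikit-learn, mirroring the description above; the deck does not show its actual code, so this is an assumed sketch that reuses the prepared split from earlier.

```python
# Illustrative Gaussian Naive Bayes classifier on the prepared split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

nb = GaussianNB()              # assumes features are conditionally independent and Gaussian per class
nb.fit(X_train, y_train)       # per-class means and variances estimated by maximum likelihood
nb_accuracy = accuracy_score(y_test, nb.predict(X_test))
print(f"Naive Bayes hold-out accuracy: {nb_accuracy:.2%}")
```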
Logistic Regression*
• Binary logistic regression, since our target variable is 0 or 1
• Predicts the probability of the dependent variable (quote conversion); a sketch follows below
*Logistic Regression. (n.d.). In Wikipedia. Retrieved from
https://en.wikipedia.org/wiki/Logistic_regression
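A hedged scikit-learn sketch of the binary logistic regression step; the hyperparameters are illustrative and not taken from the deck.

```python
# Illustrative binary logistic regression: outputs the probability that a quote converts
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(max_iter=1000)            # extra iterations help convergence with ~293 predictors
logit.fit(X_train, y_train)
conversion_prob = logit.predict_proba(X_test)[:, 1]  # P(QuoteConversion_Flag = 1) for each quote
predicted_class = (conversion_prob >= 0.5).astype(int)
```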
kNN*
• An object is classified by a majority vote of its k nearest neighbors and assigned to the most common class among them
• With distance weighting, nearer neighbors contribute more to the vote than distant ones
• Sensitive to the local structure of the data (a sketch follows below)
*k-nearest neighbors algorithm. (n.d.). In Wikipedia. Retrieved from
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
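A minimal k-nearest-neighbors sketch in scikit-learn; the value of k and the distance weighting are assumptions, since the deck does not state them.

```python
# Illustrative kNN classifier; "distance" weighting makes nearer neighbors count more in the vote
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)
knn_accuracy = knn.score(X_test, y_test)   # mean accuracy on the held-out split
print(f"kNN hold-out accuracy: {knn_accuracy:.2%}")
```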
Boosting*
• Boosting is a general method for improving the accuracy of any given learning algorithm
• Works by combining rough, moderately accurate rules of thumb
• Produces a classifier with a low generalization error
• Increases the weights on incorrectly classified examples, forcing the base learner to focus its attention on them (a sketch follows below)
*Schapire, Robert E. and Freund, Yoav. Boosting: Foundations and Algorithms. MIT Press, Cambridge, MA, 2012.
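The team ran boosting in R; the Python sketch below is an analogous illustration using AdaBoost, whose default weak learner is a depth-1 decision stump and which re-weights misclassified examples as described above.

```python
# Analogous Python sketch of boosting (the team's boosting runs were in R)
from sklearn.ensemble import AdaBoostClassifier

boost = AdaBoostClassifier(n_estimators=200, random_state=42)  # weak learners: depth-1 decision stumps
boost.fit(X_train, y_train)       # each round up-weights the examples the previous stumps got wrong
print(f"Boosting hold-out accuracy: {boost.score(X_test, y_test):.2%}")
```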
Trials & Tribulations
• Mahout
– Neural networks?
– CSV vector?
• RapidMiner
– Output of model
– Learning curve
• SVM
– Complicated to fit the model
• Multicollinearity analysis
– VIF functions
– Corrgrams*
*Package ‘corrgram’ Retrieved from https://cran.r-project.org/web/packages/corrgram/corrgram.pdf
Correlation Analysis
[Figure: corrgram of predictor-variable correlations; an analogous Python sketch follows below]
*Package ‘corrgram’. Retrieved from https://cran.r-project.org/web/packages/corrgram/corrgram.pdf
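The corrgram and VIF checks were done in R; below is an analogous, assumed Python sketch of the same multicollinearity analysis using pandas and statsmodels.

```python
# Analogous multicollinearity check in Python (the team used R's VIF functions and the corrgram package)
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_numeric = X_train.select_dtypes(include="number")
vif = pd.DataFrame({
    "variable": X_numeric.columns,
    "VIF": [variance_inflation_factor(X_numeric.values, i) for i in range(X_numeric.shape[1])],
})
print(vif.sort_values("VIF", ascending=False).head(10))  # highest-VIF predictors are the most collinear

corr = X_numeric.corr()   # pairwise correlation matrix -- the values a corrgram visualizes
```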
Results - Accuracy Metrics
"No models are perfect, but some are better than others…"

Technology / Split    Naïve Bayes   KNN       Logistic Regression   HS Test File (0's / 1's)
Python, 90/10         81%           78.34%    81.33%                168,422 / 5,414
Python, 80/20         –             78.47%    81.15%                165,859 / 7,977
Python, 67/33         –             78.64%    81.13%                165,870 / 7,966
R, 80/20              71%           –         –                     –
R, 80/20              –             –         –                     124,544 / 49,292
Conclusion & Discussion
• Boosting helped identify the 6 variables that provided the most value (a feature-importance sketch follows below)
• Given Homesite's data set, we can predict a sale from a lead about 80% of the time
• We reduced the number of predictor variables from 292 to 6!
• This allows Homesite to focus on these data points.
• Following the 80/20 Pareto principle, these 6 predictors deliver roughly 80% of the benefit without spending effort on the factors that carry less weight.
• A simple, fast market strategy that should provide immediate benefits in increased sales and revenue for Homesite
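The deck does not show how the six variables were extracted; the sketch below is an assumed illustration of ranking predictors by importance from the boosted model fit earlier.

```python
# Assumed illustration of pulling the most valuable predictors from the boosted model
import pandas as pd

importances = pd.Series(boost.feature_importances_, index=X_train.columns)
top6 = importances.sort_values(ascending=False).head(6)   # the handful of predictors carrying most of the signal
print(top6)   # the deck does not name the actual six variables, so none are listed here
```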
Future Work
• Continue work on additional data cleaning to
improve accuracy of the model from 81% to 97%
• Investigate the use of the remaining
classification models to see if we achieve better
results
• Design and build a process to provide real-time predictions as new quotes are sent out by Homesite
• Complete ANOVA analysis to determine strength
of logistic regression model
Questions?