Applying machine learning to a Kaggle data set to predict which prospects are most likely to buy a term deposit. The Random Forest column importance graph helps prioritize the best segments to target.
2. For Banks, Deposit Growth Drives Revenue Growth
Banks Make Money by Taking Deposits and Making Loans
Loans (mortgages, student loans, etc.)    4.5% interest received
Deposits (savings accounts, etc.)         1.5% interest paid
Margin for operations and profit          3.0% of dollars loaned
Why is this analysis important?
More Deposits = More Loan Capacity = More Revenues and Profits
Therefore, how can a bank increase Deposits?
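The margin arithmetic on this slide can be written out as a short sketch. The rates come from the slide; the $1,000,000 deposit figure and the simplification that every deposited dollar is lent out are assumptions made only for illustration.

```python
def net_interest_margin(loan_rate: float, deposit_rate: float) -> float:
    """Spread between interest received on loans and interest paid on deposits."""
    return loan_rate - deposit_rate


def annual_margin_dollars(amount_loaned: float, loan_rate: float, deposit_rate: float) -> float:
    """Dollars left for operations and profit on a given amount loaned."""
    return amount_loaned * net_interest_margin(loan_rate, deposit_rate)


# Rates from the slide: 4.5% received on loans, 1.5% paid on deposits.
margin = net_interest_margin(0.045, 0.015)

# Hypothetical: $1,000,000 of new deposits, all of it lent out (simplifying assumption).
dollars = annual_margin_dollars(1_000_000, 0.045, 0.015)

print(f"Margin: {margin:.1%}")
print(f"Annual margin on $1,000,000 loaned: ${dollars:,.0f}")
```

Under these assumptions, each additional dollar of deposits that becomes a loan contributes about three cents a year toward operations and profit.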
3. How will we meet our Deposit Growth Goals?
● Executive responsible: VP Marketing Operations
● Market and sell to the customers most likely to buy a term deposit
○ Which customers have bought term deposits in the past?
○ How can we target more of these types of customers?
● Fish in the ponds with the fish we want to catch
4. Goal: Predict Prospects Who Will Buy a Certificate of Deposit (CD)
Analyze Results of 3 years of phone solicitation campaigns (15)
o 41,188 phone calls
o Portuguese bank
o 11% of prospects bought a CD
o 20 columns of demographic, campaign and economic data
o Target variable: Yes or No, bought a CD
5. Rationale for 3 Machine Learning Models
Rationale for choosing these 3 models
< 100,000 data points
Supervised binary classification: depositors versus non-depositors
Models selected
Random Forest: variable importance helps with feature selection
Logistic Regression
Support Vector Machines
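As a rough sketch of how the three selected models could be fit and compared on recall with scikit-learn: the data here is synthetic (generated with about 11% positives to mimic the class imbalance), and the settings such as `class_weight="balanced"` are assumptions, not the configuration used in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the campaign data: roughly 11% of rows are positives.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.89, 0.11], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumed settings; class_weight="balanced" is one way to offset the imbalance.
models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "Logistic Regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "SVC": SVC(class_weight="balanced"),
}

recalls = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    recalls[name] = recall_score(y_test, model.predict(X_test))
    print(f"{name}: recall = {recalls[name]:.2f}")
```

The numbers printed for synthetic data say nothing about the bank data; the point is only the shape of the comparison loop.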
6. Prediction Pipeline
● Clean Data: fix dtypes, rename, missing values, drop features
● Explore: histograms, correlation matrix, feature selection
● Build & Test Models: Random Forest, SVM, Logistic Regression
● Tune Best Model: random search on Logistic Regression (C, max_iter, dual)
● Pre-Process: standardize numerical values, one-hot-encode categorical values, balance target classes
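The pre-processing stage might look something like the following scikit-learn sketch. The six-row frame is made up, the column names are borrowed from the data dictionary, and `class_weight="balanced"` is only one possible way to balance the target classes; the slides do not say which balancing method was used.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical sample; column names follow the data dictionary.
df = pd.DataFrame({
    "age": [56, 37, 40, 59, 24, 45],
    "campaign": [1, 2, 1, 3, 1, 2],
    "job": ["housemaid", "services", "admin.", "retired", "technician", "admin."],
    "marital": ["married", "married", "single", "divorced", "single", "married"],
    "y": ["no", "no", "no", "yes", "no", "yes"],
})
numeric_cols = ["age", "campaign"]
categorical_cols = ["job", "marital"]

# Standardize numerical values and one-hot-encode categorical values.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Assumption: class weighting stands in for "balance target classes".
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced")),
])

X = df[numeric_cols + categorical_cols]
y = (df["y"] == "yes").astype(int)
model.fit(X, y)

# 2 scaled numeric columns + 5 job categories + 3 marital categories.
encoded = preprocess.fit_transform(X)
print(encoded.shape)
```

Wrapping the transformers and the classifier in one `Pipeline` keeps the scaling and encoding fitted on training data only when the pipeline is later cross-validated.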
7. Judging the Models: Recall over Accuracy
Model                 Accuracy  Precision  Recall  Features
Random Forest         0.35      0.13       0.83    Balanced target classes; numerical and categorical
Logistic Regression   0.80      0.31       0.65    Balanced target classes; numerical and categorical
SVC                   0.84      0.37       0.60    Balanced target classes; numerical and categorical
Simply NO             0.89      0.89       0.00    Baseline model
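To make the "recall over accuracy" point concrete, here is a small sketch that computes the three metrics from confusion-matrix counts. The counts are hypothetical, scaled to 1,000 prospects with 11% buyers; the second set was chosen to land near the Logistic Regression row. Note that for a model that never predicts a buyer, precision for the buyer class is undefined (no predicted buyers), so the 0.89 shown for the baseline presumably refers to the "no" class; that reading is an assumption.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, and recall for the positive (buyer) class."""
    total = tp + fp + fn + tn
    predicted_pos = tp + fp
    actual_pos = tp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Undefined when the model predicts no positives at all.
        "precision": tp / predicted_pos if predicted_pos else None,
        "recall": tp / actual_pos if actual_pos else None,
    }


# "Simply NO": predict that nobody buys (hypothetical 1,000 prospects, 110 buyers).
baseline = classification_metrics(tp=0, fp=0, fn=110, tn=890)

# Hypothetical counts that roughly reproduce the Logistic Regression row.
logistic_like = classification_metrics(tp=72, fp=160, fn=38, tn=730)

print(baseline)
print({k: round(v, 2) for k, v in logistic_like.items()})
```

The baseline wins on accuracy yet finds no buyers at all, which is why recall is the more useful yardstick for a campaign whose goal is to reach buyers.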
8. Tune the Best Model Hyperparameters
Logistic Regression Hyperparameters Optimized
• Use Random search
• Results are the same as with the default settings
Model                        Tuning                                  Accuracy  Precision  Recall
Tuned Logistic Regression    C = 1.0, max_iter = 120, dual = True    0.80      0.31       0.65
Default Logistic Regression  C = 1.0, max_iter = 100, dual = False   0.80      0.31       0.65
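A random search over the same three hyperparameters could be sketched with scikit-learn's `RandomizedSearchCV` as below. The search ranges, the synthetic data, the `liblinear` solver (needed for `dual=True`), and scoring on recall are all assumptions for illustration, not the settings behind the table above.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data with roughly 11% positives.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.89, 0.11], random_state=0,
)

# Assumed search space for the three hyperparameters named on the slide.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "max_iter": [100, 120, 150, 200],
    "dual": [True, False],
}

search = RandomizedSearchCV(
    # dual=True is only supported by the liblinear solver with an L2 penalty.
    LogisticRegression(solver="liblinear", class_weight="balanced"),
    param_distributions=param_distributions,
    n_iter=10,
    scoring="recall",
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 2))
```

If the tuned and default models score the same, as they do in the table, that suggests the default settings were already adequate for this data and further gains are more likely to come from features than from tuning.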
9. Random Forest Column Importance Points the Way for Likely Prospects
Better interest rate
Age 50 +
Already a loan customer
Prestige of Bank
Good Economy
Called previously
Learn from campaigns
Technician job
Married
University Degree
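A column importance ranking like the one behind this slide can be read off a fitted Random Forest, roughly as sketched here. The data is synthetic and the column names are attached purely as labels (borrowed from the data dictionary), so the ranking printed is meaningless for the real campaign data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Labels only: the synthetic columns have no real relationship to these names.
feature_names = ["euribor3m", "age", "housing", "nr.employed", "previous", "campaign"]

X, y = make_classification(
    n_samples=1000, n_features=len(feature_names),
    n_informative=3, n_redundant=0, random_state=0,
)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Pair each column with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:12s} {importance:.3f}")
```

Importances rank how much each column contributes to the forest's splits; they do not indicate direction, so statements such as "Age 50 +" need a follow-up look at how the target varies across that column.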
10. Next 3 Months: Use ML to Power Growth
Marketing: Grow Deposits
Deploy campaigns that target better (and fewer) prospects
Offer higher interest rates
Script engaging conversations
Market during good economic times
Copy successful campaigns (# 2-13)
Target jobs: technician, unemployed
Target ages: 50 and over
Target education: University degree
Data Analysis: Improve Sales Too
Try Naïve-Bayes for real-time results
Thorough feature selection
• Additive, subtractive
• Add calculated fields
Model tuning for more models
Take new campaign results and iterate the model
Apply similar models to lead scoring, ad targeting, prospect prioritizing, etc.
11. Next 12 Months: Use ML to Transform Banking
Embed ML in Sales and Marketing Workflows
Deploy machine learning to automatically prioritize lists for marketing programs
Deploy automated prospect prioritization and pitch guidance for sales reps
Use ML to Transform the Business
Deploy high-outcome pilot projects to demonstrate the impact of embedded ML in sales and marketing workflows
Explore how ML might generate new revenue streams or business models. Can we sell smart cash management services, for example?
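Automated list prioritization could be as simple as ranking prospects by a model's predicted probability and keeping the top slice. In this sketch the `Prospect` type, the `prioritize` helper, the scores, and the 40% cut-off are all hypothetical.

```python
from typing import List, NamedTuple


class Prospect(NamedTuple):
    prospect_id: str
    score: float  # predicted probability of buying, from any fitted classifier


def prioritize(prospects: List[Prospect], top_fraction: float = 0.2) -> List[Prospect]:
    """Return the highest-scoring share of prospects, best first."""
    ranked = sorted(prospects, key=lambda p: p.score, reverse=True)
    keep = max(1, round(len(ranked) * top_fraction))
    return ranked[:keep]


# Hypothetical scored prospects.
prospects = [
    Prospect("A", 0.08),
    Prospect("B", 0.61),
    Prospect("C", 0.33),
    Prospect("D", 0.05),
    Prospect("E", 0.47),
]

call_list = prioritize(prospects, top_fraction=0.4)
print([p.prospect_id for p in call_list])  # ['B', 'E']
```

Calling only the top-scoring share is how a campaign ends up targeting better (and fewer) prospects; the right cut-off depends on call capacity and the recall the business is willing to give up.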
13. Judging the Models: Recall over Accuracy
Model                 Accuracy  Precision  Recall  Features
Random Forest         0.35      0.13       0.83    Balanced target classes; numerical and categorical
Logistic Regression   0.80      0.31       0.65    Balanced target classes; numerical and categorical
SVC                   0.84      0.37       0.60    Balanced target classes; numerical and categorical
Naïve-Bayes           0.72      0.25       0.70    Balanced target classes; numerical and categorical
KNN                   0.89      0.54       0.28    Balanced target classes; numerical and categorical
Simply NO             0.89      0.89       0.00    Baseline model
14. Data Dictionary
1 - age: age of the client (numeric)
2 - job: type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital: marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', …, 'nov', 'dec')
10 - dayofweek: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Not known in advance, therefore drop this.
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
15. Data Dictionary (cont’d)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
Social and economic context attributes:
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Target variable (desired outcome):
21 - y: has the client subscribed to a term deposit? (binary: 'yes','no')
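Two notes in the data dictionary translate directly into cleaning steps: `duration` is dropped because it is not known before the call, and the `pdays` value 999 is a sentinel rather than a real count of days. A minimal pandas sketch follows; the three rows and the `previously_contacted` column name are hypothetical.

```python
import pandas as pd

# Three made-up rows using column names from the data dictionary.
df = pd.DataFrame({
    "age": [56, 41, 29],
    "duration": [261, 149, 226],
    "pdays": [999, 6, 999],
    "y": ["no", "yes", "no"],
})

# duration is only known after the call ends, so it cannot help choose whom to call.
df = df.drop(columns=["duration"])

# 999 means "not previously contacted": keep that as a flag (hypothetical column name)
# and blank out the sentinel so it is not treated as 999 days.
df["previously_contacted"] = (df["pdays"] != 999).astype(int)
df["pdays"] = df["pdays"].where(df["pdays"] != 999)

# Encode the target as 1 (bought) / 0 (did not buy).
df["y"] = df["y"].map({"yes": 1, "no": 0})

print(df)
```

Leaving 999 in place would make never-contacted clients look like clients contacted almost three years ago, which distorts any model that treats `pdays` as a number.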
Acknowledgements:
We thank the UCI Machine Learning Repository for providing this dataset.
16. Model Pros and Cons
Approach             Pros                                          Cons
Logistic Regression  Well-understood binary classification method  Prone to over-fitting
Random Forest        Decorrelates trees, reduced variance
Naïve-Bayes          Fast, can use real-time                       Must have independent features
SVM                  Missing values OK                             Computationally intensive, not for real-time
18. Assessing Model Performance
AUC
Area under the receiver operating characteristic (ROC) curve
How much better does the model pick out CD buyers than their share of the population would?
Confusion Matrix
Recall: What % of actual CD buyers were predicted?
Precision: What % of predicted CD buyers are actual buyers?
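The two ideas on this slide can be sketched in a few lines of plain Python. Reading "above the presence in the population" as lift (precision divided by the base rate) is an assumption; the 0.31 precision and 11% base rate come from earlier slides, while the five scores and labels used for the AUC example are made up.

```python
from typing import List


def lift(precision: float, base_rate: float) -> float:
    """How many times more buyers the model's picks contain than a random pick."""
    return precision / base_rate


def auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the chance that a random buyer is scored above a random non-buyer."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Logistic Regression precision (0.31) against the 11% share of buyers.
print(round(lift(0.31, 0.11), 1))  # 2.8

# Hypothetical scores: buyers are labelled 1, non-buyers 0.
scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [1, 0, 1, 0, 0]
print(round(auc(scores, labels), 3))  # 0.833
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so it summarizes model quality across all possible cut-offs rather than at a single threshold.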