- Covers all standard and mandatory steps, in detail, for any supervised classification data science application
- Dataset used here: https://www.kaggle.com/buntyshah/auto-insurance-claims-data
- Detailed medium article: https://medium.com/@srijitpanja/step-by-step-data-science-execution-car-insurance-fraud-detection-task-example-9855d306a4c9
3. Data Preparation
Data Gathering ✔️ Data quality checks ✔️ Handling extreme values ✔️ Handling missing data ✔️ Feature selection ✔️ Encoding ✔️
Columns with outliers:
● Policy annual premium
● Umbrella limit
● Capital loss
● Property claim
Solved with: Median imputation
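A minimal sketch of median imputation for extreme values, using only the Python standard library. The slides do not say how outliers were detected, so the 1.5 × IQR rule, the function name `impute_outliers_with_median` and the sample premiums are all assumptions made for illustration.

```python
import statistics

def impute_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the column median.

    The IQR rule is an assumed detection method, not taken from the slides.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    median = statistics.median(values)
    return [median if (v < low or v > high) else v for v in values]

# Illustrative values only, not taken from the dataset.
premiums = [1100.0, 1250.0, 1300.0, 1180.0, 1220.0, 1270.0, 1210.0, 9800.0]
cleaned = impute_outliers_with_median(premiums)
print(cleaned)
```

Only the extreme value 9800.0 is replaced; the other entries are left untouched, which keeps the row count unchanged.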
● Initial data provided
● Intuitive cross-check
● Ideation for derived columns
2 derived columns: ‘Months between incident date and policy bind date’ and ‘Incident within customership’
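The first derived column can be sketched as a whole-month difference between two dates. The helper name `months_between` and the sample dates are hypothetical, and counting only completed months is an assumption. The second derived column is not sketched because its exact definition is not given here.

```python
from datetime import date

def months_between(start: date, end: date) -> int:
    """Completed calendar months from start to end (assumes end >= start)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month has not fully elapsed yet
    return months

# Hypothetical policy bind date and incident date.
policy_bind_date = date(2014, 10, 17)
incident_date = date(2015, 1, 25)
months_to_incident = months_between(policy_bind_date, incident_date)
print(months_to_incident)
```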
Columns with missing data:
● Collision type
● Property damage
● Police report available
Solved with: Mode imputation
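Mode imputation for a categorical column might look like the sketch below. It assumes missing entries are encoded as `"?"` or `None`; that encoding, the helper name `impute_mode` and the sample values are assumptions made for illustration.

```python
from collections import Counter

def impute_mode(values, missing="?"):
    """Fill missing entries with the most frequent observed value."""
    observed = [v for v in values if v is not None and v != missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if (v is None or v == missing) else v for v in values]

# Illustrative values only.
collision_type = ["Rear Collision", "?", "Side Collision",
                  "Rear Collision", "?", "Front Collision"]
filled = impute_mode(collision_type)
print(filled)
```

The mode is computed from observed values only, so the missing marker itself can never become the fill value.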
Figure: 10 most important features
Figure: 10 least important features
Figure: Feature - Feature Correlation Heatmap
Initial: 1000 rows, 40 columns
● Total claim is the sum of Property claim, Vehicle claim and Injury claim
● Values in numeric columns > 0
1 row containing umbrella limit < 0 removed
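The two quality checks above can be sketched as a row filter. The snake_case column names, the helper `passes_quality_checks` and the sample rows are assumptions; the umbrella limit is tested with `>= 0` so that only negative limits are rejected, matching the single removed row.

```python
def passes_quality_checks(row):
    """Return True when a claim record satisfies both consistency checks."""
    parts = row["property_claim"] + row["vehicle_claim"] + row["injury_claim"]
    total_matches = row["total_claim_amount"] == parts
    limit_valid = row["umbrella_limit"] >= 0  # negative limits are invalid
    return total_matches and limit_valid

# Illustrative records only; the second one has a negative umbrella limit.
rows = [
    {"total_claim_amount": 5000, "property_claim": 1000,
     "vehicle_claim": 3500, "injury_claim": 500, "umbrella_limit": 0},
    {"total_claim_amount": 7000, "property_claim": 1500,
     "vehicle_claim": 5000, "injury_claim": 500, "umbrella_limit": -1000000},
]
clean_rows = [r for r in rows if passes_quality_checks(r)]
print(len(rows), "->", len(clean_rows))
```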
4. Initial: 1000 rows, 40 columns
Columns removed due to non-relevance: Policy number, _c39
Columns removed due to correlation > 95% with other column: Vehicle claim
Columns removed due to contribution transferred to a derived column: Incident date, Policy bind date
Columns removed due to feature importance score < 0.02: Collision type, Property damage, Incident within customership, Insured sex, Umbrella limit, Number of vehicles involved, Police report available, Incident type
Columns in final Analytical Dataset: Months as customer, Age, Policy state, Policy csl, Policy deductible, Policy annual premium, Insured zip, Insured education level, Insured occupation, Insured hobbies, Insured relationship, Capital gains, Capital loss, Incident severity, Authorities contacted, Incident state, Incident city, Incident hour of the day, Bodily injuries, Witnesses, Total claim amount, Injury claim, Property claim, Auto make, Auto model, Auto year, Months between incident date and bind date
Final: 999 rows, 27 columns
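The correlation and importance filters can be sketched as follows, using the 0.95 and 0.02 cut-offs stated above. The helper names, the sample claim values and the importance scores are illustrative assumptions; in particular, the scores are made up and are not the ones behind the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_highly_correlated(columns, threshold=0.95):
    """Keep a column only if |r| <= threshold against every column kept so far."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

def drop_low_importance(importances, min_score=0.02):
    """Keep features whose importance score reaches the minimum."""
    return [name for name, score in importances.items() if score >= min_score]

# Illustrative values: vehicle_claim is almost proportional to the total.
columns = {
    "total_claim_amount": [5000, 7000, 6500, 9000, 4000],
    "vehicle_claim": [3600, 5000, 4700, 6500, 2900],
    "injury_claim": [400, 1200, 600, 700, 900],
}
kept_columns = drop_highly_correlated(columns)

# Made-up importance scores, for illustration only.
importances = {"incident_severity": 0.31, "insured_hobbies": 0.08, "insured_sex": 0.01}
kept_features = drop_low_importance(importances)
print(kept_columns, kept_features)
```

Which member of a correlated pair is dropped depends on column order in this sketch; here the later column (`vehicle_claim`) goes.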
5. Handling imbalanced data✔️
Distribution of target labels: Fraud 25%, Non-Fraud 75%
For Train Dataset: Initial imbalanced dataset → Train - Test Split → Imbalanced Training dataset → SMOTE (Synthetic Minority Oversampling TEchnique) → Balanced Training dataset
For Test Dataset: Initial imbalanced dataset → Train - Test Split → Imbalanced Test dataset
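The idea behind SMOTE can be sketched in a few lines: each synthetic minority sample is an interpolation between a real minority point and one of its nearest minority neighbours. This is a simplified stand-in written for illustration (the function `smote`, its parameters and the sample points are assumptions); a library implementation such as the one in imbalanced-learn would normally be used. It is applied to the training split only, so the test split keeps its original imbalance.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Illustrative training split: 3 fraud rows against 6 non-fraud rows.
train_minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2)]
train_majority_count = 6
new_points = smote(train_minority, n_new=train_majority_count - len(train_minority))
balanced_minority = train_minority + new_points
print(len(balanced_minority), "minority rows after oversampling")
```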
6. Data Analysis and Visualization
Distribution of Target column values along Categorical columns✔️ Distribution of Target column values along Non-Categorical columns✔️
Bar Charts - Feature Column (X) vs Target Column (Y)
Density Plots - Feature Column (X) vs Target Column (Y)
7. Explanatory Model Building
ML Model performances✔️ Main and Interaction effects on Model Outputs✔️
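| Model | Accuracy | Precision | Recall | F1 Score |
|-------|----------|-----------|--------|----------|
| LR    | 0.76     | 0         | 0      | 0        |
| KNN   | 0.74     | 0.38      | 0.12   | 0.19     |
| NB    | 0.735    | 0.35      | 0.12   | 0.18     |
| DT    | 0.74     | 0.47      | 0.60   | 0.53     |
| RF    | 0.77     | 0.53      | 0.44   | 0.48     |
| XGB   | 0.775    | 0.53      | 0.58   | 0.55     |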
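The four metrics in the table can be computed from confusion-matrix counts, as in the sketch below. The helper name and the toy labels are assumptions. The example also shows one way a row like LR can arise: a classifier that never predicts fraud still scores an accuracy equal to the non-fraud share while precision, recall and F1 are all 0. That reading of the LR row is an inference from the numbers, not something the slides state.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 25% fraud, and a model that always predicts non-fraud.
y_true = [0, 0, 0, 1]
always_non_fraud = [0, 0, 0, 0]
metrics = classification_metrics(y_true, always_non_fraud)
print(metrics)
```

This is why accuracy alone is a weak guide on a 25/75 split, and why recall and F1 matter when comparing the models.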
Figure: Heatmap for Main and Interaction effects
Figure: Therm plot for main effects
Best performing models are Tree-based models
Selected model: XGBoost
8. Predictive Model Building
Current Model performance✔️ Improvements✔️
| Accuracy | Precision | Recall | F1 Score |
|----------|-----------|--------|----------|
| 0.775    | 0.53      | 0.58   | 0.55     |
🚀 Hyperparameter Tuning by GridSearchCV
Best Parameter values: 'colsample_bytree': 1, 'learning_rate': 0.01, 'max_depth': 10, 'n_estimators': 100, 'subsample': 0.7
| Accuracy | Precision | Recall | F1 Score |
|----------|-----------|--------|----------|
| 0.82     | 0.60      | 0.77   | 0.67     |
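Grid search evaluates every combination in a parameter grid and keeps the best-scoring one. The sketch below shows only that exhaustive loop: the candidate values are assumptions (the slides report just the winning combination), and `toy_score` is an arbitrary placeholder so the code runs without a model. With GridSearchCV the score would instead be a cross-validated metric of a model trained with each combination.

```python
from itertools import product

def grid_search(param_grid, score):
    """Return the best-scoring parameter combination and its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

# Assumed candidate values; only the parameter names come from the slides.
param_grid = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [5, 10],
    "subsample": [0.7, 1.0],
}

def toy_score(params):
    # Arbitrary placeholder objective, NOT a real model evaluation.
    return params["max_depth"] * params["subsample"] - params["learning_rate"]

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

The number of evaluations is the product of the list lengths (8 here), which is why choosing the candidate values is the costly design decision.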
🚀 Tuning threshold from ROC by maximising (TPR - FPR)
Threshold value = 0.68
| Accuracy | Precision | Recall | F1 Score |
|----------|-----------|--------|----------|
| 0.83     | 0.60      | 0.90   | 0.72     |
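Choosing the threshold that maximises TPR - FPR (Youden's J statistic) can be sketched as a scan over candidate thresholds. The function name, labels and scores below are illustrative assumptions and are unrelated to the 0.68 reported above.

```python
def best_threshold(y_true, scores):
    """Return the score threshold maximising TPR - FPR, and that maximum."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        # Predict fraud when the score is at or above the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Toy predicted probabilities for 4 fraud and 5 non-fraud cases.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
threshold, j_statistic = best_threshold(y_true, scores)
print(threshold, j_statistic)
```

Because the criterion rewards true positives and penalises false positives equally, it can move the threshold in whichever direction trades a few false alarms for a large gain in recall.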
9. Challenges
● Intuitive cross-check and deriving features.
● Improving the performance: determining the set of candidate values for parameters in hyperparameter tuning.
● Improving the performance further: determining the correct criterion to optimise when selecting the threshold from the ROC curve. Finalized at: (TPR - FPR)
Insights
Highest contributing columns (i.e. columns whose values should be verified as correct)
Figure: Examples of their contributions