Default of Credit Card Clients
Author:
Alexandre Pinto
Faculty of Sciences and Technology
Department of Informatics Engineering
University of Coimbra
Summary
Contents
1 Introduction
2 Feature Assessment/Visualization
3 Preprocessing
4 Feature Selection/Reduction
5 Classification
6 Evaluation Metrics
7 Demo - Short Experiment
8 Conclusions
Default of Credit Card Clients Alexandre Pinto 2
Introduction
Problem Definition
Default Credit Card:
• Happens when clients fail to adhere to the credit card
agreement, by not paying the monthly bill
Main Goal:
• Development of a system capable of detecting clients that
will not be able to pay the next month
Introduction
Dataset Description
• 23 features: X1 - X23
• One predictive binary label (Default: Yes = 1, No = 0)
• X1: Amount of the given credit
X2: Gender
X3: Education
X4: Marital status
X5: Age
X6-X11: History of past payment (4/2005 to 9/2005)
X12-X17: Amount of bill statement (4/2005 to 9/2005)
X18–X23: Amount of previous payment (4/2005 to 9/2005)
Feature Assessment/Visualization
• Useful to visualize how the features are distributed, and
assess their discriminative capability
Feature Assessment/Visualization
Tasks
Normalized Histogram
Distributions:
• Bins represent unique values
(Grouped by class)
• Heights represent relative
frequencies (Normalized)
Figure: Normalized Histogram
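A minimal sketch of how such a class-grouped, normalized histogram could be computed for one discrete feature. The function name and the toy data are hypothetical, not taken from the slides.

```python
from collections import Counter

def normalized_histogram(values, labels):
    """Return {class: {value: relative frequency}} for one feature."""
    by_class = {}
    for v, y in zip(values, labels):
        by_class.setdefault(y, []).append(v)
    hist = {}
    for y, vals in by_class.items():
        counts = Counter(vals)
        # Dividing by the class size makes the bar heights sum to 1 per class.
        hist[y] = {v: c / len(vals) for v, c in counts.items()}
    return hist

# Toy data (hypothetical): education level and default label for six clients.
education = [1, 2, 2, 1, 3, 2]
default = [0, 0, 1, 1, 0, 1]
hist = normalized_histogram(education, default)
```

Because each class is normalized separately, the two classes can be compared even though one is much larger than the other.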
Feature Assessment/Visualization
Tasks
Box Plots:
• Show distribution of features
for each class
• Representation of min,
median, max and outlier
values
Figure: Box Plot
Feature Assessment/Visualization
Tasks
Pairwise Relationships:
• 2 × 2 grid
• Histograms in diagonal cells,
Scatter plots otherwise
• Useful to see how features
relate to each other
• Useful to see how two
features separate the pattern
classes
Figure: Pairwise Relationships
Feature Assessment/Visualization
Tasks
Empirical Cumulative
Distribution Function :
• Check if feature distributions
are drawn from a normal
distribution
Figure: Cumulative Function vs Std
Normal Curve
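A rough sketch of the comparison, assuming the feature is standardized first and the gap is measured only at the sample points. The helper names are hypothetical.

```python
import math

def ecdf(sample):
    """Empirical CDF as a list of (x, fraction of values <= x)."""
    xs = sorted(sample)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def std_normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def max_gap_to_normal(sample):
    """Largest |ECDF - normal CDF| at the sample points, after standardizing."""
    n = len(sample)
    mean = sum(sample) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    z = [(x - mean) / std for x in sample]
    return max(abs(p - std_normal_cdf(x)) for x, p in ecdf(z))
```

A small gap suggests the feature is close to normal; a large gap suggests it is not.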
Feature Assessment/Visualization
Tasks
Pearson Correlations:
• Check highly correlated
features
$$\rho = \frac{\mathrm{covariance}(x_i, x_j)}{\sigma_i \times \sigma_j}$$
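The coefficient can be sketched in plain Python as follows, assuming population (divide-by-n) covariance and standard deviations. The function name is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)
```

Values near +1 or -1 flag feature pairs that carry almost the same information.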
Feature Assessment/Visualization
Tasks
Two-Dimensional PCA:
• First 2 component axes with
highest variance
Figure: 2D PCA
Feature Assessment/Visualization
Tasks
Two-Dimensional LDA:
• First 2 component axes with
class-separation
Figure: 2D LDA
Preprocessing
• Useful to improve the overall training data quality
• Makes the data more suitable for the classification models,
improving the final accuracies
Preprocessing
Tasks
Standardization:
• Center the data by removing
the mean
• Scale the data by dividing by
the standard deviation
• Obtain data with zero mean and
unit standard deviation
($\mu_{x_{std}} = 0$, $\sigma_{x_{std}} = 1$)
$$x_{std} = \frac{x - \mu_X}{\sigma_X}$$
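A minimal sketch of standardizing one feature column, assuming the population standard deviation and a non-constant feature. The function name is hypothetical.

```python
import math

def standardize(xs):
    """Center on the mean and divide by the standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / std for x in xs]

z = standardize([1.0, 2.0, 3.0, 4.0])
```

In practice the mean and standard deviation would be estimated on the training data only and reused on the test data.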
Preprocessing
Tasks
Scaling features to a range:
• Scale features to lie between
a min and max value
$$x_{[0,1]} = \frac{x - \min_x}{\max_x - \min_x}$$
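The range scaling can be sketched as below. The optional `lo`/`hi` target range is an assumption added for illustration; the default reproduces the [0, 1] case.

```python
def min_max_scale(xs, lo=0.0, hi=1.0):
    """Map the smallest value to lo and the largest to hi, linearly."""
    mn, mx = min(xs), max(xs)
    return [lo + (x - mn) * (hi - lo) / (mx - mn) for x in xs]

scaled = min_max_scale([10, 20, 30])
```

The sketch assumes the feature is not constant, otherwise the denominator is zero.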
Preprocessing
Tasks
Normalization:
• Scale samples/features
vectors to have unit norm
• Divide each element by the
euclidean norm of the vector
$$x_{norm} = \frac{x}{\sqrt{\sum_i x_i^2}}$$
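A short sketch of the unit-norm scaling for one vector, assuming the vector is not all zeros. The function name is hypothetical.

```python
import math

def l2_normalize(v):
    """Divide each element by the Euclidean norm of the vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

unit = l2_normalize([3.0, 4.0])
```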
Preprocessing
Dataset Balancing
• Classes are not equally distributed:
Majority Class (’0’): 23364 (∼ 78%)
Minority Class (’1’): 6636 (∼ 22%)
• Classifiers trained on imbalanced data tend to be biased toward the majority class
Preprocessing
Dataset Balancing
Two main methods:
• Oversampling the minority class
• Undersampling the majority class
Preprocessing
Undersampling
Random Majority Undersampling:
• Start with a set with samples from minority class
• Randomly add samples from the majority class until the
dataset is balanced
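The two steps above can be sketched as follows, assuming a binary label with `1` as the minority class and a fixed seed for reproducibility. Names and defaults are hypothetical.

```python
import random

def random_undersample(samples, labels, minority=1, seed=0):
    """Keep all minority samples plus an equally sized random majority subset."""
    rng = random.Random(seed)
    min_idx = [i for i, y in enumerate(labels) if y == minority]
    maj_idx = [i for i, y in enumerate(labels) if y != minority]
    # Draw without replacement as many majority samples as there are minority ones.
    keep = sorted(min_idx + rng.sample(maj_idx, len(min_idx)))
    return [samples[i] for i in keep], [labels[i] for i in keep]

X = list(range(10))
y = [0] * 8 + [1] * 2
Xb, yb = random_undersample(X, y)
```

Random oversampling is the mirror image: keep every majority sample and draw minority samples with replacement until the counts match.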
Preprocessing
Undersampling
NearMiss-1:
• Selects the majority-class samples with the smallest average distance to
their k nearest minority-class samples
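A brute-force sketch of the NearMiss-1 selection rule, assuming Euclidean distance and samples given as lists of numbers. The function name and the parameters `n_keep` and `k` are hypothetical.

```python
import math

def nearmiss1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority samples closest, on average, to their
    k nearest minority samples."""
    def avg_dist(p):
        nearest = sorted(math.dist(p, q) for q in minority)[:k]
        return sum(nearest) / len(nearest)
    return sorted(majority, key=avg_dist)[:n_keep]

majority = [[0, 0], [5, 5], [1, 1], [9, 9]]
minority = [[0, 1], [1, 0]]
kept = nearmiss1(majority, minority, n_keep=2, k=2)
```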
Preprocessing
Undersampling
NearMiss-3:
• Selects samples from majority class which are the farthest
from the nearest minority samples
Preprocessing
Undersampling
Neighbor Cleaning Rule:
• Removes majority samples that are misclassified by their three
nearest neighbors
• Removes the majority-class neighbors of minority samples that are
misclassified by their three nearest neighbors
Preprocessing
Oversampling
Random Minority Oversampling:
• Start with a set with samples from majority class
• Randomly add samples from the minority class until the
dataset is balanced
Preprocessing
Oversampling
SMOTE:
• Start with a set with
samples from majority class
• Add synthetic samples
created from the minority
class (interpolation) until
the dataset is balanced
$$S_{synthetic} = S_i + (S_k - S_i)\,\delta, \quad \delta \in [0, 1]$$
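A simplified sketch of generating one synthetic sample by interpolation, assuming Euclidean distance, a brute-force neighbor search, and Python 3.8+ for `math.dist`. The function name and the default `k` are hypothetical.

```python
import math
import random

def smote_sample(minority, k=3, rng=None):
    """Create one synthetic point between a minority sample and one of its
    k nearest minority neighbors."""
    rng = rng or random.Random(0)
    s_i = rng.choice(minority)
    # k nearest minority neighbors of s_i, excluding s_i itself.
    neighbours = sorted(
        (p for p in minority if p is not s_i),
        key=lambda p: math.dist(s_i, p),
    )[:k]
    s_k = rng.choice(neighbours)
    delta = rng.random()  # interpolation factor in [0, 1)
    return [a + (b - a) * delta for a, b in zip(s_i, s_k)]

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_point = smote_sample(points, k=2, rng=random.Random(1))
```

Repeating the call until both classes have the same size gives the balanced dataset.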
Feature Selection/Reduction
Useful to:
• Remove irrelevant and redundant features
• Improve the prediction performance
• Reduce dimensionality and complexity
Approaches:
• Filter Methods: A subset of features is selected, without
considering the predictive model
• Wrapper Methods: The best subset of features is selected,
using the predictive model to rank the subset
Feature Selection/Reduction
Filter Methods
Information Gain:
• Rank each attribute
by its ability to
discriminate the
pattern classes
(decrease in entropy)
$$IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)$$
Feature Selection/Reduction
Filter Methods
Information Gain Ratio:
• Information Gain
normalized with the
entropy of the attribute
• It takes into account the
number and size of branches
of a split
• Reduces bias (attributes
with high number of unique
values)
$$IGR(S, A) = \frac{IG(S, A)}{H(A)}$$
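Information gain and its normalized ratio can be sketched for a discrete attribute as follows, assuming base-2 entropy. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr, labels):
    """H(S) minus the size-weighted entropy of each attribute-value subset."""
    n = len(labels)
    groups = {}
    for a, y in zip(attr, labels):
        groups.setdefault(a, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(attr, labels):
    """Information gain divided by the entropy of the attribute itself."""
    h_a = entropy(attr)
    return information_gain(attr, labels) / h_a if h_a > 0 else 0.0
```

An attribute with many unique values gets a large H(A), which is what shrinks its ratio.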
Feature Selection/Reduction
Filter Methods
Kruskal-Wallis Test:
• Test whether class
groups are drawn
from the same
population (rank-based:
compares medians, not means)
• Assess the
class-separability of
the feature
$$H = \frac{12}{N(N + 1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N + 1)$$
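A sketch of the H statistic, where R_i is the rank sum of group i, n_i its size, and N the total number of samples. Tied values receive their average rank, and no tie correction is applied, which is a simplification. The function name is hypothetical.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of groups of numbers."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        # Positions i+1 .. j share this value; assign their mean rank.
        rank[pooled[i]] = (i + 1 + j) / 2
        i = j
    total = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)
```

For feature selection, one group is built per class from the values of a single feature; a larger H indicates better class separability.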
Feature Selection/Reduction
Filter Methods
Fisher Score:
• Select features with high
class-separability and low
class-variability
$$F(X_i) = \frac{|m_1 - m_2|^2}{s_1^2 + s_2^2}$$
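For the two-class case the score can be sketched as below, assuming labels 0 and 1, population variances, and at least one class with non-zero variance. The function name is hypothetical.

```python
def fisher_score(values, labels):
    """Squared distance between class means over the sum of class variances."""
    groups = {0: [], 1: []}
    for v, y in zip(values, labels):
        groups[y].append(v)

    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    m1, m2 = mean(groups[0]), mean(groups[1])
    return (m1 - m2) ** 2 / (var(groups[0]) + var(groups[1]))

score = fisher_score([1, 3, 7, 9], [0, 0, 1, 1])
```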
Feature Selection/Reduction
Filter Methods
Pearson Correlations:
• Select features highly
correlated with the target
class
• Select features with low
correlation between them
$$\rho = \frac{\mathrm{covariance}(x_i, x_j)}{\sigma_i \times \sigma_j}$$
Feature Selection/Reduction
Filter Methods
mRMR:
• Select features with high
relevance (mutual
information) with the target
class
• Select features with low
redundancy between them
$$\max D, \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i, c)$$
$$\min R, \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i, x_j)$$
$$\max \phi(D, R), \quad \phi = D - R$$
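A sketch of a greedy, incremental mRMR search for discrete features. Adding one feature at a time by relevance minus mean redundancy is a common approximation of the subset objective and is an assumption here, not necessarily the variant used in the experiment. All names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        c / n * math.log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )

def mrmr(features, target, k):
    """features: dict mapping feature name to a list of discrete values."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def score(f):
            relevance = mutual_information(features[f], target)
            if not selected:
                return relevance
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            ) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): 'b' duplicates 'a', 'c' adds new information.
toy = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 0, 0, 0, 1, 1, 1, 1],
    "c": [0, 0, 1, 1, 0, 0, 1, 1],
}
toy_target = [0, 0, 1, 1, 2, 2, 3, 3]
chosen = mrmr(toy, toy_target, 2)
```

In the toy data the duplicate feature is skipped in favor of the complementary one, which is the behavior the redundancy term is meant to produce.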
Feature Selection/Reduction
Filter Methods
Area Under the Curve:
• Rank each feature by the AUC it achieves when used alone as a classification score, and keep the top-ranked features
Feature Selection/Reduction
Wrapper Methods
Sequential Forward/Backward Selection:
• Start with an empty/full feature set
• At each step, add the feature that gives the largest increase (forward), or
remove the feature that gives the smallest decrease (backward), in feature importance
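The forward variant can be sketched as a greedy loop. The `score` argument is a caller-supplied function that evaluates a candidate subset (for example a cross-validated metric of the chosen classifier); it and the toy scoring below are hypothetical.

```python
def sequential_forward_selection(features, score, k):
    """Greedily grow a feature subset, adding the best-scoring feature each step."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy criterion (hypothetical): each feature contributes a fixed utility.
utility = {"x1": 0.5, "x2": 0.3, "x3": 0.1}
picked = sequential_forward_selection(
    utility, lambda subset: sum(utility[f] for f in subset), k=2
)
```

Backward selection follows the same pattern, starting from the full set and dropping the feature whose removal hurts the score least.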
Feature Selection/Reduction
Wrapper Methods
Recursive Feature Elimination:
• Start with a full feature set and compute feature importances
• Recursively recompute the importances and remove the features with the lowest importance
Feature Selection/Reduction
Reduction methods
Linear Transformation Techniques:
• PCA: Find the directions (principal components) that
maximize the variance in the dataset
• LDA: Find the directions (linear discriminants)
that maximize class-separability and minimize the
variance within each class
Classification
Predictive Models:
• Distance Based: Min. Distance Classifier and k-NN
• Probabilistic: Naive Bayes
• Search: Decision Tree
• Optimization: Support Vector Machines (SVM)
• Ensemble: Random Forest
Evaluation Metrics
• Useful to assess the classifier performance
Evaluation Metrics
Accuracy:
• Proportion of correctly
identified and rejected
instances
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
Evaluation Metrics
Precision
Precision:
• Proportion of correct
answers from the positive
predictions
$$P = \frac{TP}{TP + FP}$$
Evaluation Metrics
Recall
Recall:
• Proportion of correct
answers from the whole
positive part of a dataset
$$R = \frac{TP}{TP + FN}$$
Evaluation Metrics
F1
F1:
• Harmonic mean of precision
and recall
$$F_1 = 2 \times \frac{P \cdot R}{P + R}$$
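Accuracy, precision, recall and F1 can all be derived from the four confusion-matrix counts. A minimal sketch, assuming binary labels with `1` as the positive (default) class and at least one positive prediction and one positive sample. The function name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(
    [1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0]
)
```

On imbalanced data, accuracy alone can look good while recall on the minority class stays poor, which is why the other three metrics are reported.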
Evaluation Metrics
Stratified K-fold Cross Validation
Stratified K-fold Cross Validation:
• Each fold preserves approximately the class proportions of the whole dataset.
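A sketch of how stratified folds could be built: the indices of each class are shuffled and dealt round-robin to the folds, so every fold receives roughly the same class mix. The function name and the seed are hypothetical.

```python
import random

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's samples to the folds one at a time.
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds

toy_labels = [0] * 8 + [1] * 4
folds = stratified_folds(toy_labels, k=4)
```

Each fold then serves once as the test set while the remaining folds are used for training.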
Evaluation Metrics
ROC Curves
ROC Curve:
• Show trade-offs between
sensitivity (TPR) and
1 − specificity (FPR)
Figure: ROC curves
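A sketch of how the points of a ROC curve and its area could be computed from classifier scores, assuming binary labels, distinct score values (ties are not handled), and at least one sample of each class. The function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(zip(scores, y_true), key=lambda t: -t[0]):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

area = auc(roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

An area of 1.0 means every positive is scored above every negative; 0.5 corresponds to random ranking.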
Evaluation Metrics
Precision-Recall Curves
Precision-Recall Curve:
• Show trade-offs between
precision and recall
Figure: PR curves
Demo
Short Experiment
• Features Standardized
• Data balanced with SMOTE
• Keep top 15 best features using mRMR
Classifier      Precision     Recall        F1            AUC           Avg Precision
Min. Distance   0.69 ± 0.02   0.64 ± 0.00   0.66 ± 0.01   0.65 ± 0.01   0.76 ± 0.01
kNN             0.76 ± 0.01   0.91 ± 0.04   0.82 ± 0.02   0.79 ± 0.02   0.86 ± 0.01
Naive Bayes     0.57 ± 0.01   0.93 ± 0.01   0.71 ± 0.01   0.56 ± 0.01   0.77 ± 0.00
Linear SVM      0.70 ± 0.02   0.65 ± 0.01   0.67 ± 0.01   0.67 ± 0.02   0.77 ± 0.01
Decision Tree   0.79 ± 0.02   0.82 ± 0.17   0.79 ± 0.11   0.79 ± 0.07   0.85 ± 0.05
Random Forest   0.88 ± 0.01   0.82 ± 0.16   0.84 ± 0.10   0.85 ± 0.07   0.90 ± 0.04
Table: Results
Conclusions
• Feature Engineering (Data Transformation, Feature Selection)
is probably the most important step
• Exploratory Data Analysis is important to get a better sense
of the distribution of the data
• Feature selection helps reduce training times and keep only
the most relevant and non-redundant features
• Pattern Recognition (PR) is an iterative process
Default of Credit Card Clients Alexandre Pinto 48
References I
[RF] A Gentle Introduction to Random Forests, Ensembles, and
Performance Metrics in a Commercial System.
https://citizennet.com/blog/2012/11/10/
random-forests-ensembles-and-performance-metrics/.
Accessed: 2016-06-06.
[Kag] General Tips for participating Kaggle Competitions.
http://www.slideshare.net/markpeng/
general-tips-for-participating-kaggle-competitions.
Accessed: 2016-06-06.
[knn] KNN classification.
https://www.researchgate.net/figure/260397165_fig7_
Pseudocode-for-KNN-classification. Accessed:
2016-06-06.
Default of Credit Card Clients Alexandre Pinto 49
References II
[PCA] Linear Discriminant Analysis - Bit by Bit. http://
sebastianraschka.com/Articles/2014_python_lda.html.
Accessed: 2016-06-06.
[NB] Naive Bayes. http://www.slideshare.net/
chowdhury343/naive-bayes-presentation. Accessed:
2016-06-06.
[6] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer,
W. P. (2002). SMOTE: Synthetic Minority Over-Sampling
Technique. Journal of artificial intelligence research, pages
321–357.
[7] Laurikkala, J. (2001). Improving Identification of Difficult
Small Classes by Balancing Class Distribution. Springer.
Default of Credit Card Clients Alexandre Pinto 50
References III
[8] Mani, I. and Zhang, I. (2003). kNN approach to unbalanced
data distributions: a case study involving information extraction.
In Proceedings of workshop on learning from imbalanced
datasets.
[9] Peng, H., Long, F., and Ding, C. (2005). Feature selection
based on mutual information criteria of max-dependency,
max-relevance, and min-redundancy. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 27(8):1226–1238.
[10] Prati, R. C., Batista, G. E., and Monard, M. C. (2009). Data
Mining with Imbalanced Class Distributions: Concepts and
Methods. In IICAI, pages 359–376.
[11] Yeh, I.-C. and Lien, C.-h. (2009). The comparisons of data
mining techniques for the predictive accuracy of probability of
default of credit card clients. Expert Systems with Applications,
36(2):2473–2480.
Default of Credit Card Clients Alexandre Pinto 51
A Journey Into the Emotions of Software Developers
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 

Default Credit Card Prediction

  • 1. Default of Credit Card Clients
      Author: Alexandre Pinto
      Faculty of Sciences and Technology, Department of Informatics Engineering, University of Coimbra
  • 2. Summary
      Contents:
      1 Introduction
      2 Feature Assessment/Visualization
      3 Preprocessing
      4 Feature Selection/Reduction
      5 Classification
      6 Evaluation Metrics
      7 Demo - Short Experiment
      8 Conclusions
  • 3. Introduction: Problem Definition
      Default Credit Card:
      • Happens when clients fail to adhere to the credit card agreement, by not paying the monthly bill
      Main Goal:
      • Development of a system capable of detecting clients that will not be able to pay the next month
  • 5. Introduction: Dataset Description
      • 23 features: X1 - X23
      • One predictive binary label (Default: Yes = 1, No = 0)
      • X1: Amount of the given credit
        X2: Gender
        X3: Education
        X4: Marital status
        X5: Age
        X6-X11: History of past payment (4/2005 to 9/2005)
        X12-X17: Amount of bill statement (4/2005 to 9/2005)
        X18-X23: Amount of previous payment (4/2005 to 9/2005)
  • 8. Feature Assessment/Visualization
      • Useful to visualize how the features are distributed, and assess their discriminative capability
  • 9. Feature Assessment/Visualization: Tasks
      Normalized Histogram Distributions:
      • Bins represent unique values (grouped by class)
      • Heights represent relative frequencies (normalized)
      Figure: Normalized Histogram
  • 10. Feature Assessment/Visualization: Tasks
      Box Plots:
      • Show distribution of features for each class
      • Representation of min, median, max and outlier values
      Figure: Box Plot
  • 11. Feature Assessment/Visualization: Tasks
      Pairwise Relationships:
      • 2 × 2 grid
      • Histograms in diagonal cells, scatter plots otherwise
      • Useful to see how features relate to each other
      • Useful to see how two features separate the pattern classes
      Figure: Pairwise Relationships
  • 12. Feature Assessment/Visualization: Tasks
      Empirical Cumulative Distribution Function:
      • Check whether each feature follows a normal distribution
      Figure: Cumulative Function vs Std Normal Curve
  • 13. Feature Assessment/Visualization: Tasks
      Pearson Correlations:
      • Check highly correlated features
      $\rho = \frac{\operatorname{covariance}(x_i, x_j)}{\sigma_i \times \sigma_j}$
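The Pearson coefficient on this slide can be computed directly from its definition. The following is a minimal sketch in plain Python, not the code used for the deck; the function name `pearson` and the toy values `credit_limit` and `bill_amount` are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance and standard deviations (the 1/n factors cancel out)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)

# Hypothetical toy features: the bill is always half of the credit limit,
# so the two features are perfectly (linearly) correlated.
credit_limit = [20, 50, 90, 120]
bill_amount = [10, 25, 45, 60]
rho = pearson(credit_limit, bill_amount)
```

A pair of features with a coefficient close to 1 or -1 carries largely the same information, which is why such pairs are candidates for removal.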
  • 14. Feature Assessment/Visualization: Tasks
      Two-Dimensional PCA:
      • First 2 component axes with the highest variance
      Figure: 2D PCA
  • 15. Feature Assessment/Visualization: Tasks
      Two-Dimensional LDA:
      • First 2 component axes with the highest class separation
      Figure: 2D LDA
  • 16. Preprocessing
      • Useful to improve the overall training data quality
      • Makes the data more suitable for the classification models, improving the final accuracies
  • 17. Preprocessing: Tasks
      Standardization:
      • Center the data by removing the mean
      • Scale the data by dividing by the standard deviation
      • Obtain data with zero mean and unit standard deviation ($\mu_{x_{std}} = 0$, $\sigma_{x_{std}} = 1$)
      $x_{std} = \frac{x - \mu_X}{\sigma_X}$
  • 18. Preprocessing: Tasks
      Scaling features to a range:
      • Scale features to lie between a min and max value
      $x_{[0,1]} = \frac{x - \min_x}{\max_x - \min_x}$
  • 19. Preprocessing: Tasks
      Normalization:
      • Scale sample/feature vectors to have unit norm
      • Divide each element by the Euclidean norm of the vector
      $x_{norm} = \frac{x}{\sqrt{\sum_i x_i^2}}$
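The three scaling transforms from the preprocessing slides (standardization, scaling to a range, and unit-norm normalization) can be sketched in a few lines of plain Python. This is an illustrative sketch only; the function names and the toy `ages` values are assumptions, and the standard deviation used is the population one.

```python
import math

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """Map the values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def unit_norm(vector):
    """Divide each element by the Euclidean norm of the vector."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

ages = [20, 30, 40, 50]  # hypothetical values of a single feature
ages_std = standardize(ages)
ages_01 = min_max_scale(ages)
direction = unit_norm([3.0, 4.0])
```

Standardization and range scaling act on one feature across all samples, whereas unit-norm normalization acts on one vector at a time.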
  • 20. Preprocessing: Dataset Balancing
      • Classes are not equally distributed:
        Majority Class ('0'): 23364 (∼ 78%)
        Minority Class ('1'): 6636 (∼ 22%)
      • Classifiers trained on imbalanced data tend to be biased toward the majority class
  • 21. Preprocessing: Dataset Balancing
      Two main methods:
      • Oversampling the minority class
      • Undersampling the majority class
  • 22. Preprocessing: Undersampling
      Random Majority Undersampling:
      • Start with a set containing the samples from the minority class
      • Randomly add samples from the majority class until the dataset is balanced
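Random majority undersampling, as described on this slide, can be sketched as follows. The function `random_undersample`, its parameters, and the toy data are assumptions for illustration; the sketch also assumes that the minority class is not larger than the majority class.

```python
import random

def random_undersample(samples, labels, minority=1, seed=0):
    """Keep every minority sample plus an equally sized random draw from the majority."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Draw without replacement as many majority samples as there are minority samples
    kept_majority = rng.sample(majority_idx, len(minority_idx))
    keep = sorted(minority_idx + kept_majority)
    return [samples[i] for i in keep], [labels[i] for i in keep]

# Hypothetical toy dataset: 8 non-default clients (0) and 2 default clients (1)
samples = list(range(10))
labels = [0] * 8 + [1] * 2
balanced_x, balanced_y = random_undersample(samples, labels)
```

Random minority oversampling is the mirror image: the majority set is kept whole and minority samples are drawn with replacement until both classes have the same size.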
  • 23. Preprocessing: Undersampling
      NearMiss-1:
      • Selects samples from the majority class which are close to some of the minority samples
  • 24. Preprocessing: Undersampling
      NearMiss-3:
      • Selects samples from the majority class which are the farthest from the nearest minority samples
  • 25. Preprocessing: Undersampling
      Neighbor Cleaning Rule:
      • Removes majority samples that are misclassified by their three nearest neighbors
      • Removes neighboring majority samples that cause minority samples to be misclassified
  • 26. Preprocessing: Oversampling
      Random Minority Oversampling:
      • Start with a set containing the samples from the majority class
      • Randomly add samples from the minority class until the dataset is balanced
  • 27. Preprocessing: Oversampling
      SMOTE:
      • Start with a set containing the samples from the majority class
      • Add synthetic samples created from the minority class (interpolation) until the dataset is balanced
      $S_{synthetic} = S_i + (S_k - S_i)\delta$
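The SMOTE interpolation step can be sketched as below. This is a simplified illustration, not the reference implementation: it assumes that $S_k$ is drawn from the `k` nearest minority neighbors of $S_i$ and that $\delta$ is uniform in [0, 1), as in Chawla et al. (2002). The function name `smote_sample` and the toy points are hypothetical.

```python
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic minority sample: S_i + (S_k - S_i) * delta."""
    if rng is None:
        rng = random.Random(0)
    s_i = rng.choice(minority)
    # Nearest minority neighbors of s_i (itself excluded), by squared Euclidean distance
    others = [s for s in minority if s is not s_i]
    others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(s, s_i)))
    s_k = rng.choice(others[:k])
    delta = rng.random()  # interpolation factor in [0, 1)
    return [a + (b - a) * delta for a, b in zip(s_i, s_k)]

# Hypothetical minority samples lying on the diagonal of a 2D feature space
minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
synthetic = smote_sample(minority, k=2, rng=random.Random(42))
```

Because the new point is a convex combination of two minority samples, it always falls on the segment between them rather than duplicating an existing sample.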
  • 28. Feature Selection/Reduction
      Useful to:
      • Remove irrelevant and redundant features
      • Improve the prediction performance
      • Reduce dimensionality and complexity
      Approaches:
      • Filter Methods: a subset of features is selected without considering the predictive model
      • Wrapper Methods: the best subset of features is selected, using the predictive model to rank the subsets
  • 30. Feature Selection/Reduction: Filter Methods
      Information Gain:
      • Rank each attribute by its ability to discriminate the pattern classes (decrease in entropy)
      $IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)$
  • 31. Feature Selection/Reduction: Filter Methods
      Information Gain Ratio:
      • Information Gain normalized by the entropy of the attribute
      • It takes into account the number and size of the branches of a split
      • Reduces the bias toward attributes with a high number of unique values
      $IGR(S, A) = \frac{IG(S, A)}{H(A)}$
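Both criteria follow directly from the entropy definition. The sketch below assumes discrete attribute values and uses base-2 logarithms; the function names and the toy `education` and `default` lists are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(attribute, labels):
    """IG(S, A) = H(S) - sum over v of |S_v| / |S| * H(S_v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(attribute):
        subset = [y for a, y in zip(attribute, labels) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(attribute, labels):
    """IGR(S, A) = IG(S, A) / H(A)."""
    return information_gain(attribute, labels) / entropy(attribute)

# Hypothetical toy data: the attribute separates the two classes perfectly
education = ["grad", "grad", "high", "high"]
default = [0, 0, 1, 1]
ig = information_gain(education, default)
igr = gain_ratio(education, default)
```

An attribute that splits the classes perfectly removes all of the label entropy, while an attribute that is independent of the label has a gain of zero.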
  • 32. Feature Selection/Reduction: Filter Methods
      Kruskal-Wallis Test:
      • Test whether the class groups are drawn from the same population, by comparing their rank sums
      • Assess the class-separability of the feature
      $H = \frac{12}{N(N + 1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N + 1)$
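The H statistic can be computed from the rank sums $R_i$ of each class group. The sketch below follows the formula on the slide without a tie correction; tied values receive their average rank, which is an assumption of this sketch, as are the function name and the toy groups.

```python
def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), without tie correction."""
    # Pool all observations while remembering the group each one came from
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold tied values with ranks i+1..j; use their average
        avg_rank = (i + 1 + j) / 2
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        i = j
    total = sum(r * r / len(group) for r, group in zip(rank_sums, groups))
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

# Hypothetical feature values, grouped by class
h_separated = kruskal_wallis_h([[1, 2, 3], [4, 5, 6]])
h_mixed = kruskal_wallis_h([[1, 4], [2, 3]])
```

A larger H means that the ranks of the feature differ more between the classes, so the feature is a better candidate for selection.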
  • 33. Feature Selection/Reduction: Filter Methods
      Fisher Score:
      • Select features with high between-class separability and low within-class variability
      $F(X_i) = \frac{|m_1 - m_2|^2}{s_1^2 + s_2^2}$
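For a binary problem, the Fisher score of a single feature reduces to a few lines. In this sketch, $m_1$ and $m_2$ are the class means and $s_1^2$ and $s_2^2$ are taken to be the population variances of each class, which is an assumption; the function name and toy values are hypothetical.

```python
from statistics import mean, pvariance

def fisher_score(feature, labels):
    """F = |m1 - m2|^2 / (s1^2 + s2^2) for a two-class problem with labels 0 and 1."""
    group0 = [x for x, y in zip(feature, labels) if y == 0]
    group1 = [x for x, y in zip(feature, labels) if y == 1]
    between = (mean(group1) - mean(group0)) ** 2
    within = pvariance(group0) + pvariance(group1)
    return between / within

# Hypothetical feature whose values differ clearly between the two classes
feature = [1, 2, 3, 7, 8, 9]
labels = [0, 0, 0, 1, 1, 1]
score = fisher_score(feature, labels)
```

The score grows when the class means move apart or when the values inside each class become more concentrated.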
  • 34. Feature Selection/Reduction: Filter Methods
      Pearson Correlations:
      • Select features highly correlated with the target class
      • Select features with low correlation between them
      $\rho = \frac{\operatorname{covariance}(x_i, x_j)}{\sigma_i \times \sigma_j}$
  • 35. Feature Selection/Reduction: Filter Methods
      mRMR:
      • Select features with high relevance (mutual information) with the target class
      • Select features with low redundancy between them
      $\max D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i, c)$
      $\min R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i, x_j)$
      $\max \phi(D, R), \quad \phi = D - R$
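In practice the criterion $\phi = D - R$ is usually optimized greedily, one feature at a time, as in the incremental scheme of Peng et al. (2005). The sketch below assumes discrete features and estimates mutual information from counts; the function names and the toy features `pay_status`, `pay_status_copy` and `bill_flag` are hypothetical.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """I(x, y) in bits, estimated from the joint counts of two discrete lists."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())

def mrmr(features, target, k):
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    selected = []
    remaining = list(features)

    def score(name):
        relevance = mutual_information(features[name], target)
        if not selected:
            return relevance
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical discrete features: one exact duplicate and one weaker, independent feature
target = [0, 0, 0, 1, 1, 1, 1, 1]
features = {
    "pay_status": [0, 0, 0, 0, 1, 1, 1, 1],
    "pay_status_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "bill_flag": [0, 1, 0, 1, 0, 1, 0, 1],
}
chosen = mrmr(features, target, k=2)
```

The duplicate feature is as relevant as the original, but it is fully redundant once the original is selected, so the weaker but independent feature is picked second.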
  • 36. Feature Selection/Reduction: Filter Methods
      Area Under the Curve:
      • Select features that have good classification performance
  • 37. Feature Selection/Reduction: Wrapper Methods
      Sequential Forward/Backward Selection:
      • Start with an empty/full feature set
      • At each step, add the feature that gives the largest increase (forward) or remove the feature that gives the smallest decrease (backward) in feature importances
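The forward variant can be sketched generically. Here the selection criterion is abstracted as a callable `evaluate(subset)` that returns a score to maximize (for instance a cross-validated metric of the wrapped model); this callable, the function name, and the toy `utility` table are assumptions of the sketch and not part of the deck.

```python
def sequential_forward_selection(candidates, evaluate, k):
    """Start from an empty set and repeatedly add the feature that scores best."""
    selected = []
    while len(selected) < k:
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            break
        # Try each remaining feature together with the ones already selected
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
    return selected

# Hypothetical stand-in for a model-based score: a fixed utility per feature
utility = {"age": 0.1, "pay_status": 0.5, "credit_limit": 0.3}
picked = sequential_forward_selection(
    list(utility), lambda subset: sum(utility[f] for f in subset), k=2
)
```

Backward selection works the same way in reverse: it starts from the full set and drops, at each step, the feature whose removal hurts the score the least.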
  • 38. Feature Selection/Reduction: Wrapper Methods
      Recursive Feature Elimination:
      • Start with a full feature set and compute feature importances
      • Recursively remove the features with the lowest importances and recompute the importances on the remaining set
  • 39. Feature Selection/Reduction: Reduction Methods
      Linear Transformation Techniques:
      • PCA: finds the directions (principal components) that maximize the variance in the dataset
      • LDA: finds the directions (linear discriminants) that maximize class-separability and minimize the variance within each class
  • 40. Classification
      Predictive Models:
      • Distance Based: Min. Distance Classifier and k-NN
      • Probabilistic: Naive Bayes
      • Search: Decision Tree
      • Optimization: Support Vector Machines (SVM)
      • Ensemble: Random Forest
  • 41. Evaluation Metrics
      • Useful to assess the classifier performance
  • 42. Evaluation Metrics
      Accuracy:
      • Proportion of correctly identified and rejected instances
      $ACC = \frac{TP + TN}{TP + TN + FP + FN}$
  • 43. Evaluation Metrics
      Precision:
      • Proportion of correct answers from the positive predictions
      $P = \frac{TP}{TP + FP}$
  • 44. Evaluation Metrics
      Recall:
      • Proportion of correct answers from the whole positive part of a dataset
      $R = \frac{TP}{TP + FN}$
  • 45. Evaluation Metrics
      F1:
      • Harmonic mean of precision and recall
      $F_1 = 2 \times \frac{P \cdot R}{P + R}$
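The four metrics above all derive from the confusion-matrix counts. The following sketch computes them for a binary problem where 1 marks the positive (default) class; the function name and the toy predictions are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical predictions: TP = 2, FN = 1, FP = 1, TN = 4
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
```

On an imbalanced dataset such as this one, accuracy alone can look good for a classifier that mostly predicts the majority class, which is why precision, recall and F1 are reported alongside it.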
  • 46. Evaluation Metrics
      Stratified K-fold Cross Validation:
      • Each fold preserves the class proportions of the whole dataset, so it is a good representative of the whole
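Stratification can be sketched by dealing the indices of each class into the folds in turn. This minimal version does not shuffle the data, which a real experiment normally would; the function name and the toy labels are assumptions.

```python
def stratified_folds(labels, k):
    """Split sample indices into k folds that keep the class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    # Deal the indices of each class round-robin over the folds
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical labels with a 2:1 class ratio
labels = [0] * 8 + [1] * 4
folds = stratified_folds(labels, k=4)
fold_counts = [(sum(1 for i in f if labels[i] == 0), sum(1 for i in f if labels[i] == 1)) for f in folds]
```

Each fold then serves once as the test set while the remaining folds are used for training.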
  • 47. Evaluation Metrics
      ROC Curve:
      • Shows the trade-off between sensitivity (TPR) and the false positive rate (FPR = 1 − specificity)
      Figure: ROC curves
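A ROC curve is obtained by sweeping a decision threshold over the classifier scores and recording one (FPR, TPR) point per threshold. The sketch below builds those points and integrates them with the trapezoidal rule to get the AUC; the function names and the toy scores are hypothetical.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) points obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier scores (higher means more likely to default)
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
curve = roc_points(y_true, scores)
auc = area_under_curve(curve)
```

A random classifier lies on the diagonal with an AUC of 0.5, while a perfect ranking of positives above negatives gives an AUC of 1.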
  • 48. Evaluation Metrics
      Precision-Recall Curve:
      • Shows the trade-off between precision and recall
      Figure: PR curves
  • 49. Demo: Short Experiment
      • Features standardized
      • Data balanced with SMOTE
      • Keep the top 15 features using mRMR

      | Classifier    | Precision   | Recall      | F1          | AUC         | Avg Precision |
      |---------------|-------------|-------------|-------------|-------------|---------------|
      | Min. Distance | 0.69 ± 0.02 | 0.64 ± 0.00 | 0.66 ± 0.01 | 0.65 ± 0.01 | 0.76 ± 0.01   |
      | kNN           | 0.76 ± 0.01 | 0.91 ± 0.04 | 0.82 ± 0.02 | 0.79 ± 0.02 | 0.86 ± 0.01   |
      | Naive Bayes   | 0.57 ± 0.01 | 0.93 ± 0.01 | 0.71 ± 0.01 | 0.56 ± 0.01 | 0.77 ± 0.00   |
      | Linear SVM    | 0.70 ± 0.02 | 0.65 ± 0.01 | 0.67 ± 0.01 | 0.67 ± 0.02 | 0.77 ± 0.01   |
      | Decision Tree | 0.79 ± 0.02 | 0.82 ± 0.17 | 0.79 ± 0.11 | 0.79 ± 0.07 | 0.85 ± 0.05   |
      | Random Forest | 0.88 ± 0.01 | 0.82 ± 0.16 | 0.84 ± 0.10 | 0.85 ± 0.07 | 0.90 ± 0.04   |

      Table: Results
  • 52. Conclusions
      • Feature Engineering (Data Transformation, Feature Selection) is probably the most important step
      • Exploratory Data Analysis is important to get a better sense of the distribution of the data
      • Feature selection helps reduce training times and keeps only the most relevant and non-redundant features
      • Pattern Recognition (PR) is an iterative process
  • 53. References I
      [RF] A Gentle Introduction to Random Forests, Ensembles, and Performance Metrics in a Commercial System. https://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/. Accessed: 2016-06-06.
      [Kag] General Tips for participating Kaggle Competitions. http://www.slideshare.net/markpeng/general-tips-for-participating-kaggle-competitions. Accessed: 2016-06-06.
      [knn] KNN classification. https://www.researchgate.net/figure/260397165_fig7_Pseudocode-for-KNN-classification. Accessed: 2016-06-06.
  • 54. References II
      [PCA] Linear Discriminant Analysis - Bit by Bit. http://sebastianraschka.com/Articles/2014_python_lda.html. Accessed: 2016-06-06.
      [NB] Naive Bayes. http://www.slideshare.net/chowdhury343/naive-bayes-presentation. Accessed: 2016-06-06.
      [6] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-Sampling Technique. Journal of Artificial Intelligence Research, pages 321–357.
      [7] Laurikkala, J. (2001). Improving Identification of Difficult Small Classes by Balancing Class Distribution. Springer.
  • 55. References III
      [8] Mani, I. and Zhang, I. (2003). kNN approach to unbalanced data distributions: a case study involving information extraction. In Proceedings of Workshop on Learning from Imbalanced Datasets.
      [9] Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238.
      [10] Prati, R. C., Batista, G. E., and Monard, M. C. (2009). Data Mining with Imbalanced Class Distributions: Concepts and Methods. In IICAI, pages 359–376.
      [11] Yeh, I.-C. and Lien, C.-h. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2):2473–2480.