Acquisition Credit Scoring
Model
Project Final Report
April 2015
Great Lakes Institute of Management, Gurgaon
Subhasis Mishra
– Research Supervisor –
Mr. Manu Chandra (FNMathlogic)
Table of Contents
1 Introduction
2 Scope and Objectives
3 Data Sources
4 Analytical Approach
4.1 Data Collection
4.2 Data Preparation
4.3 Variable Reduction
4.4 Data Sampling
4.5 Model Development
4.6 Intercept not significant in the development sample
4.7 Assessment of sign of variables
4.8 Multicollinearity Test
4.9 Probability Prediction
4.10 Model Prediction (Goodness-of-fit)
4.11 Calculating top 3 variables affecting credit score function
4.12 Reject Inference
4.13 Model Performance
5 Decision Tree Approach
6 Tools and Techniques
7 Way Forward
8 Recommendations and Applications
9 References and Bibliography
10 Project Code
1 Introduction
Credit scoring is one of the classic applications of predictive modeling: predicting whether
credit extended to an applicant is likely to result in profit or loss for the lending
institution. There are many variations and complexities regarding
how exactly credit is extended to individuals, businesses, and other organizations for
various purposes (purchasing equipment, real estate, consumer items, and so on), and
using various methods of credit (credit card, loan, delayed payment plan). But in all
cases, a lender provides money to an individual or institution, and expects to be paid
back in time with interest commensurate with the risk of default.
Credit scoring is the set of decision models and their underlying techniques that aid
lenders in the granting of consumer credit. These techniques determine who will get
credit, how much credit they should get, and what operational strategies will enhance
the profitability of the borrowers to the lenders. Further, they help to assess the risk in
lending. Credit scoring is a dependable assessment of a person’s credit worthiness since
it is based on actual data.
A lender commonly makes two types of decisions: first, whether to grant credit to a
new applicant, and second, how to deal with existing applicants, including whether to
increase their credit limits. In both cases, whatever the techniques used, it is critical
that there is a large sample of previous customers with their application details,
behavioral patterns, and subsequent credit history available. Most of the techniques
use this sample to identify the connection between the characteristics of the
consumers (annual income, age, number of years in employment with their current
employer, etc.) and their subsequent history.
Typical application areas in the consumer market include: credit cards, auto loans,
home mortgages, home equity loans, mail catalog orders, and a wide variety of
personal loan products.
2 Scope and Objectives
To evaluate the scope of applying Logistic Regression and various data mining
techniques to credit scoring, facilitating better decision making and reducing risk.
The objective of this project is to build a credit scoring model that reduces the potential
risk involved and on which credit lending decisions can be based.
3 Data Sources
Applicant data involving US customers that has been used in this project was provided
by our mentor.
4 Analytical Approach
The steps followed for this project are described below:
4.1 Data Collection
Customer credit history, along with all the related information, was provided by our mentor.
The data provided is cross-sectional and belongs to a single point in time.
4.2 Data Preparation
Listed below are some of the techniques used for data preparation :-
1. Initial variable selection based on judgement –
First, we identified some key variables out of all the variables given in our data
set, purely based on judgement. Initial screening of variables is important
and requires a deeper understanding of the particular domain as well as
experience. So, out of the 46 variables, we took around 20 into
consideration for modelling. Below is the list of initial variables :-
Loan_amnt
Term
Annual_inc
Fico_range_high
Fico_range_low
Last_fico_range_high
Last_fico_range_low
Purpose
Home_ownership
Grade
Dti
Mths_since_last_delinq
Mths_since_last_record
Last_pymnt_amnt
Total_pymnt
Total_pymnt_inv
Pub_rec
Total_rec_int
Revol_bal
2. Missing Value Treatment -
Normally, two techniques are used for missing value treatment: either removing the
entire variable (if more than 80-90% of its observations are missing), or imputing a very
high sentinel value such as 9999999 (if the variable is significant from a modeling
perspective and hence cannot be dropped). In our case we went ahead with the latter, as
variables like mths_since_last_delinq and mths_since_last_record carry a lot of
importance for modeling, even though more than 90% of their observations were missing
in the original data set. For the other, non-significant variables, we simply imputed zero.
Below is the piece of R code showing what we did towards missing value treatment –
new_data <- as.data.frame(data, stringsAsFactors=FALSE)
new_data$mths_since_last_delinq[is.na(new_data$mths_since_last_delinq)] <- 9999999
new_data$mths_since_last_record[is.na(new_data$mths_since_last_record)] <- 9999999
new_data[is.na(new_data)] <- 0
str(new_data)
3. Identifying the dependent variable –
In any business problem, it is critical to first understand the problem statement and then
approach it with an appropriate solution. Hence, in our case, the first priority was to
identify the dependent variable correctly; getting this wrong would have sent the
modelling in a completely wrong direction. Our problem is concerned with finding
whether a person is going to default or not, by calculating the probability of default
among the consumers.
From our data set, we picked LOAN_STATUS as the dependent variable, which contains
values like Current, Default, Charged Off, etc.
4. Data Transformation –
Data Transformation is one of the most significant steps before doing the actual
modeling and carries a lot of importance in making the final model robust. It includes
various steps like the introduction of dummy variables and the conversion of continuous
variables into categorical variables.
• Introduction of dummy variables –
We introduced dummy variables for the character variables. As part of the dummy
variable inclusion, we did the following :-
# Introduction of Dummy variables
# For Loan status
def <- ifelse(loan_status=="Default", 1, 0)
# purpose dummies
purpose_debt <- ifelse(purpose=='debt_consolidation', 1, 0)
purpose_car <- ifelse(purpose=='car', 1, 0)
purpose_credit <- ifelse(purpose=="credit_card", 1, 0)
purpose_home_imp <- ifelse(purpose=="home_improvement", 1, 0)
purpose_maj_purchase <- ifelse(purpose=="major_purchase", 1, 0)
purpose_ren_energy <- ifelse(purpose=="renewable_energy", 1, 0)
purpose_education <- ifelse(purpose=="educational", 1, 0)
purpose_house <- ifelse(purpose=="house", 1, 0)
purpose_medical <- ifelse(purpose=="medical", 1, 0)
purpose_moving <- ifelse(purpose=="moving", 1, 0)
purpose_small_business <- ifelse(purpose=="small_business", 1,
0)
purpose_vacation <- ifelse(purpose=="vacation", 1, 0)
purpose_wedding <- ifelse(purpose=="wedding", 1, 0)
# home ownership dummies
home_ownership_mort <- ifelse(home_ownership=="MORTGAGE", 1, 0)
home_ownership_own <- ifelse(home_ownership=="OWN", 1, 0)
home_ownership_rent <- ifelse(home_ownership=="RENT", 1, 0)
home_ownership_other <- ifelse(home_ownership=="OTHER", 1, 0)
# Grade
grade_a <- ifelse(grade=="A",1 ,0)
grade_b <- ifelse(grade=="B",1 ,0)
grade_c <- ifelse(grade=="C",1 ,0)
grade_d <- ifelse(grade=="D",1 ,0)
grade_e <- ifelse(grade=="E",1 ,0)
grade_f <- ifelse(grade=="F",1 ,0)
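For comparison, the same encodings could also be generated automatically from factor columns. Below is a minimal sketch, assuming purpose, home_ownership and grade are character columns of new_data; this is not the approach used in the project, only an alternative:
# Hypothetical alternative: let R build the dummy columns from factors
# (shown only for comparison; not the approach used in this project)
dummies <- model.matrix(~ factor(purpose) + factor(home_ownership) + factor(grade),
                        data = new_data)
head(dummies)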
• Conversion of continuous variables to categorical variables (fine classing) –
As a better modelling practice, it is recommended that continuous variables be
converted into categorical variables by introducing bins, which is also called fine
classing. It generally yields better results during the Information Value (IV) calculation
for the predictor variables.
We followed the approach below for fine classing :-
gloan_amnt <- cut(new_data$loan_amnt, br=c(0, 5000, 15000, 35000),
labels=c("Low","Medium","High"))
gloan_amnt_data <- data.frame(new_data, gloan_amnt)
summary(gloan_amnt_data$gloan_amnt)
gloan_amnt_data_final<-
gloan_amnt_data[complete.cases(gloan_amnt_data),]
summary(gloan_amnt_data_final$gloan_amnt)
glast_fico_range_high <- cut(new_data$last_fico_range_high, br=c(0,
649, 699, 744, 850), labels=c("Low","Medium","High","Very High"))
last_fico_high_data <- data.frame(new_data, glast_fico_range_high)
summary(last_fico_high_data$glast_fico_range_high)
last_fico_high_data_final<-
last_fico_high_data[complete.cases(last_fico_high_data),]
summary(last_fico_high_data_final$glast_fico_range_high)
glast_fico_range_low <- cut(new_data$last_fico_range_low, br=c(0,
645, 695,740, 845), labels=c("Low","Medium","High","Very High"))
last_fico_low_data <- data.frame(new_data, glast_fico_range_low)
summary(last_fico_low_data$glast_fico_range_low)
last_fico_low_data_final<-
last_fico_low_data[complete.cases(last_fico_low_data),]
summary(last_fico_low_data_final$glast_fico_range_low)
gfico_range_high <- cut(new_data$fico_range_high, br=c(0, 650, 700,
745, 850), labels=c("Low","Medium","High","Very High"))
fico_high_data <- data.frame(new_data, gfico_range_high)
summary(fico_high_data$gfico_range_high)
fico_high_data_final<-
fico_high_data[complete.cases(fico_high_data),]
summary(fico_high_data_final$gfico_range_high)
gfico_range_low <- cut(new_data$fico_range_low, br=c(0, 650, 700,
745, 850), labels=c("Low","Medium","High","Very High"))
fico_low_data <- data.frame(new_data, gfico_range_low)
summary(fico_low_data$gfico_range_low)
fico_low_data_final <- fico_low_data[complete.cases(fico_low_data),]
summary(fico_low_data_final$gfico_range_low)
gdti <- cut(new_data$dti, br=c(0, 8, 13, 19, 30),
labels=c("Low","Medium","High","Very High"))
dti_data <- data.frame(new_data, gdti)
summary(dti_data$gdti)
dti_data_final <- dti_data[complete.cases(dti_data),]
summary(dti_data_final$gdti)
gmths_since_last_delinq<-cut(new_data$mths_since_last_delinq,
br=c(0, 48, 6466000, 10000000), labels=c("Low","Medium","High"))
mths_since_last_delinq_data<-data.frame(new_data,
gmths_since_last_delinq)
summary(mths_since_last_delinq_data$gmths_since_last_delinq)
mths_since_last_delinq_data_final <-
mths_since_last_delinq_data[complete.cases(mths_since_last_delinq_data),]
summary(mths_since_last_delinq_data_final$gmths_since_last_delinq)
gmths_since_last_record <- cut(new_data$mths_since_last_record,
br=c(0, 9299000, 10000000), labels=c("Low","High"))
mths_since_last_record_data <- data.frame(new_data,
gmths_since_last_record)
summary(mths_since_last_record_data$gmths_since_last_record)
mths_since_last_record_data_final <-
mths_since_last_record_data[complete.cases(mths_since_last_record_data),]
summary(mths_since_last_record_data_final$gmths_since_last_record)
gannual_inc <- cut(new_data$annual_inc, br=c(4000, 40500, 59000,
82340, 6000000), labels=c("Low","Medium","High","Very High"))
annual_inc_data <- data.frame(new_data, gannual_inc)
summary(annual_inc_data$gannual_inc)
annual_inc_data_final<-
annual_inc_data[complete.cases(annual_inc_data),]
summary(annual_inc_data_final$gannual_inc)
glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392,
1816, 36120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final<-
last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)
gtotal_pymnt <- cut(new_data$total_pymnt, br=c(0, 4707, 8089, 13090,
49490), labels=c("Low","Medium","High","Very High"))
total_pymnt_data <- data.frame(new_data, gtotal_pymnt)
summary(total_pymnt_data$gtotal_pymnt)
total_pymnt_data_final<-
total_pymnt_data[complete.cases(total_pymnt_data),]
summary(total_pymnt_data_final$gtotal_pymnt)
gtotal_pymnt_inv <- cut(new_data$total_pymnt_inv, br=c(0, 4312,
7590, 12390, 49100), labels=c("Low","Medium","High","Very High"))
total_pymnt_inv_data <- data.frame(new_data, gtotal_pymnt_inv)
summary(total_pymnt_inv_data$gtotal_pymnt_inv)
total_pymnt_inv_data_final<-
total_pymnt_inv_data[complete.cases(total_pymnt_inv_data),]
summary(total_pymnt_inv_data_final$gtotal_pymnt_inv)
dim(total_pymnt_inv_data_final)
glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392,
1816, 16120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <-
last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)
dim(last_pymnt_amnt_data_final)
gtotal_rec_int <- cut(new_data$total_rec_int, br=c(0, 640, 1290,
2618, 15300), labels=c("Low","Medium","High","Very High"))
total_rec_int_data <- data.frame(new_data, gtotal_rec_int)
summary(total_rec_int_data$gtotal_rec_int)
total_rec_int_data_final<-
total_rec_int_data[complete.cases(total_rec_int_data),]
summary(total_rec_int_data_final$gtotal_rec_int)
dim(total_rec_int_data_final)
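The same cut / data.frame / summary pattern is repeated for every variable above; a small helper function could reduce the repetition. The sketch below is for illustration only and was not used in the project code:
# Hypothetical helper wrapping the repeated fine-classing pattern
fine_class <- function(x, breaks, labels) {
  binned <- cut(x, br = breaks, labels = labels)   # bin the continuous variable
  print(summary(binned))                           # show the bin counts
  binned
}
# e.g. gdti <- fine_class(new_data$dti, c(0, 8, 13, 19, 30),
#                         c("Low", "Medium", "High", "Very High"))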
4.3 Variable Reduction
Once the strongest characteristics are grouped and ranked, variable selection is
done. At the end of this step, the scorecard developer will have a set of strong,
grouped characteristics, preferably representing independent information
types, for use in the regression step. The strength of a characteristic is gauged
using four main criteria:
• Predictive power of each attribute. The weight of evidence (WOE)
measure is used for this purpose.
• The range and trend of weight of evidence across grouped attributes
within a characteristic.
• Predictive power of the characteristic. The Information Value (IV)
measure is used for this purpose.
• Operational and business considerations
In our case, we have used the Information Value measure for variable reduction.
Some analysts run other variable selection algorithms (e.g., those that rank
predictive power using Chi-Square or R-Square) prior to grouping characteristics.
This gives them an indication of characteristic strength using independent means,
and also alerts them in cases where the Information Value figure is high or low
compared to other measures.
The initial characteristic analysis process can be interactive, and involvement from
business users and operations staff should be encouraged. In particular, they
may provide further insights into any unexpected or illogical behavior
patterns and enhance the grouping of all variables.
The first step in performing this analysis is to perform initial grouping of the
variables and rank order them by IV or some other strength measure. This can be
done using a number of binning techniques.
If using other applications, a good way to start is to bin nominal variables into
50 or so equal groups, and to calculate the WOE and IV for the grouped
attributes and characteristics. One can then use any spreadsheet software to
fine-tune the groupings for the stronger characteristics based on principles to
be outlined in the next section. Similarly, for categorical characteristics, the
WOE for each unique attribute and the IV of each characteristic can be
calculated. One can then spend time fine-tuning the grouping for those
characteristics that surpass a minimum acceptable strength. Decision trees are
also often used for grouping variables. Most users, however, use them to
generate initial ideas, and then use alternate software applications to
interactively fine-tune the groupings.
Information Value (IV) :-
Information Value provides a measure of how well a variable X is able to
distinguish between a binary response (e.g., "good" vs "bad") in some target
variable Y. The idea is that if a variable X has a low Information Value, it may not
do a sufficient job of classifying the target variable, and hence is removed as an
explanatory variable.
To see how this works, let X be grouped into n bins. Each x ∈ X corresponds to a
y ∈ Y that may take one of two values, say 0 or 1. Then, for bins X_i, 1 ≤ i ≤ n,
IV = Σ (g_i − b_i) * ln(g_i / b_i), summed over i = 1, ..., n
where b_i = the proportion of 0's in bin i versus all bins,
g_i = the proportion of 1's in bin i versus all bins,
and ln(g_i / b_i) is known as the weight of evidence (for bin X_i). The cut-off value
may vary, but in our case we have considered an IV cut-off of 0.1.
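To make the formula concrete, here is a rough sketch of how the WOE and IV of a single binned variable could be computed. It is for illustration only; the project itself used the iv.mult() helper, assumed to come from an external Information Value package loaded beforehand:
# Illustrative WOE/IV calculation for one binned variable (sketch only)
iv_single <- function(bins, y) {        # bins: factor of groups, y: 0/1 target
  tab <- table(bins, y)
  g   <- tab[, "1"] / sum(tab[, "1"])   # proportion of 1's falling in each bin
  b   <- tab[, "0"] / sum(tab[, "0"])   # proportion of 0's falling in each bin
  woe <- log(g / b)                     # weight of evidence per bin
  sum((g - b) * woe)                    # information value
}
# e.g. iv_single(gloan_amnt, def)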
Below is the R code sample, output and plot of IV calculation :-
# IV calculation
iv.mult(final_data1,"def",TRUE)
iv_data <- data.frame(def, annual_inc_data$gannual_inc,
last_fico_high_data$glast_fico_range_high,
mths_since_last_delinq_data$gmths_since_last_delinq,
total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv,
term, last_pymnt_amnt_data$glast_pymnt_amnt,
total_rec_int_data$gtotal_rec_int)
iv.plot.summary(iv.mult(final_data1,"def",TRUE))
#taking IV cut off as 0.1
iv_data_final <- data.frame(def, last_fico_high_data$glast_fico_range_high,
total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv,
term, last_pymnt_amnt_data$glast_pymnt_amnt,
total_rec_int_data$gtotal_rec_int)
iv.mult(iv_data_final,"def",TRUE)
iv.plot.summary(iv.mult(iv_data_final,"def",TRUE))
4.4 Data Sampling
Sampling has been done based on the 70-30 rule: the entire data set is split in a
70:30 ratio to be used as development and validation data
respectively.
Below is the R code :-
# Dividing data into train and test
totalrecords <- nrow(fin_data)
trainfraction = 0.7
trainrecords = as.integer(totalrecords * trainfraction)
allrows <- 1:totalrecords
trainrows <- sample(totalrecords,trainrecords)
testrows <- allrows[-trainrows]
train<-data.frame(fin_data[trainrows,])
test<-data.frame(fin_data[testrows,])
dim(train)
dim(test)
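Because sample() is random, the 70/30 split changes on every run; a seed can be set beforehand if a reproducible partition is needed (the seed value below is just an example):
# Optional: fix the random seed so the train/test split is reproducible
set.seed(123)                     # 123 is an arbitrary example seed
trainrows <- sample(totalrecords, trainrecords)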
4.5 Model Development
When the training data set on which the modeling is based contains a binary
indicator variable of "Paid back" vs. "Default", or "Good Credit" vs. "Bad Credit",
logistic regression models are well suited for subsequent predictive modeling.
Logistic regression yields prediction probabilities for whether or not a particular
outcome (e.g., Bad Credit) will occur. Furthermore, logistic regression models are
linear models, in that the logit-transformed prediction probability is a linear
function of the predictor variable values. Thus, a scorecard model derived in this
manner has the desirable quality that the final credit score (credit risk) is a linear
function of the predictors and, with some additional transformations applied to the
model parameters, a simple linear function of scores that can be associated with
each predictor class value after coarse coding. The final credit score is then a
simple sum of the individual score values taken from the scorecard.
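For illustration only, the log-odds produced by such a model are usually rescaled to scorecard points with a formula of the form score = offset + factor * ln(odds). Below is a minimal sketch with arbitrary example parameters (600 points at 20:1 odds, 20 points to double the odds); these are not values used in this project:
# Illustrative log-odds-to-points scaling (hypothetical parameters, not part of this project)
pdo       <- 20                          # points to double the odds
scale_fct <- pdo / log(2)
offset    <- 600 - scale_fct * log(20)   # anchor: 600 points at 20:1 good/bad odds
score_from_p <- function(p_default) {
  odds_good <- (1 - p_default) / p_default
  offset + scale_fct * log(odds_good)
}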
In our model, we have used the train data set, which contains 70% of the original
data. Below is the R code for logistic regression :-
fit_train <- glm(def~., data=train, family="binomial")
summary(fit_train)
4.6 Intercept not significant in the development sample
An intercept is almost always part of the model and is almost always significantly
different from zero. The test of the intercept in the procedure output tests whether
this parameter is equal to zero. If the intercept is zero (equivalent to having no
intercept in the model), the resulting model implies that the response function must
be exactly zero when all the predictors are set to zero. For a logistic model it means
that the logit (or log odds) is zero, which implies that the event probability is 0.5. This
is a very strong assumption that is sometimes reasonable, but more often is not. So,
a highly significant intercept in your model is generally not a problem.
By the same token, if the intercept is not significant you usually would not want to
remove it from the model because by doing this you are creating a model that says
that the response function must be zero when the predictors are all zero. If the
nature of what you are modeling is such that you want to assume this, then you
might want to remove the intercept. In our case, we are getting an intercept value of
around 0.58 for the validation sample.
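If one actually wanted to force the response function through zero, the intercept could be dropped from the fit. The sketch below shows how, though this was not done in this project:
# Hypothetical no-intercept fit; only appropriate if the log-odds should be 0
# when all predictors are 0 (not assumed here)
fit_noint <- glm(def ~ . - 1, data = train, family = "binomial")
summary(fit_noint)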
4.7 Assessment of sign of variables
a. Last_fico_range_high- The negative sign of the coefficient signifies that it
exhibits an inverse relationship with default, which is actually true in
this context: the higher the FICO value, the lower the chance of default.
b. Total_pymnt- It also signifies an inverse relationship with default,
which is also true in this context.
c. Last_pymnt_amnt- Negative sign of the coefficient shows that it
exhibits an inverse relationship with default.
d. Total_rec_int- Positive sign of the coefficient signifies that it has a
direct relationship with default.
4.8 Multicollinearity Test
It occurs when there are high correlations among predictor variables, leading to
unreliable and unstable estimates of regression coefficients. After model building, multi
collinearity check is normally performed to ensure that independent variables are not
highly correlated, using the VIF (Variance Inflation Factor) function. Normally the cut-off for
VIF is considered as 5. Variables with VIF more than 5 should be considered to have
collinearity. However, for factor/ categorical variables, GVIF value is considered as the
baseline. Variables which require more than 1 coefficient and thus more than 1 degree
of freedom are typically evaluated using the GVIF. For one-coefficient terms VIF is
equal to GVIF.
There are four options for producing the VIF value :-
• the corvif command from the AED package
• the vif command from the car package
• the vif command from the rms package
• the vif command from the DAAG package
Out of these, "car" and "AED" produce the GVIF value and the other two produce the VIF value.
# Multicollinearity check
library(car)
vif(fit_train)
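Since car::vif() reports GVIF and Df for terms with more than one coefficient, the value usually compared against the cut-off is GVIF^(1/(2*Df)). Below is a sketch of that check, using the rule-of-thumb threshold of 5 stated above:
# Checking GVIF-adjusted values against the usual VIF cut-off of 5 (sketch)
v <- vif(fit_train)
if (is.matrix(v)) {                           # factors present: GVIF / Df columns
  adj <- v[, "GVIF"]^(1 / (2 * v[, "Df"]))    # comparable to sqrt(VIF)
  print(adj[adj > sqrt(5)])                   # terms exceeding the threshold
} else {
  print(v[v > 5])                             # plain VIF for numeric-only models
}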
Four diagnostic plots are generated at a time when the logistic model output is plotted.
4.9 Probability Prediction
From the above logistic output, we have predicted the probabilities for individual customers.
Please find the R code below :-
# Probabilities prediction
prob_glm <- predict.glm(fit_train, type="response", se.fit=FALSE)
write.csv(prob_glm, "probability.csv")
plot(prob_glm)
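To score a hold-out or new applicant pool rather than the development sample, the newdata argument can be used. The sketch below applies an illustrative 0.5 cut-off; no classification cut-off was fixed in this project:
# Scoring the validation sample with an illustrative 0.5 cut-off (sketch)
prob_test  <- predict(fit_train, newdata = test, type = "response")
pred_class <- ifelse(prob_test > 0.5, "Default", "Non-default")
table(pred_class, test$def)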
4.10 Model Prediction (Goodness-of-fit)
Goodness-of-fit attempts to get at how well a model fits the data. It is usually applied
after a final model has been selected. If we have multiple models, then goodness-of-
fit is performed to choose among all the models. Concordance, discordance, ROC
curve and KS-statistics are used for this purpose.
• Concordance :- In OLS regression, the R-squared and its more refined measure,
adjusted R-squared, would be the 'one-stop' metric that immediately tells us
whether the model is a good fit. Since this is a value between 0 and 1, it can easily
be expressed as a percentage and presented as 'model accuracy' to beginners and
not-so-math-oriented business users. Unfortunately, looking at adjusted R-squared
is not appropriate for logistic regression, because we model the log odds ratio and
it becomes very difficult to explain. This is where concordance helps. Concordance
tells us the association between the actual values and the values fitted by the
model, in percentage terms. It is defined as the ratio of the number of 1-0 pairs in
which the 1 received a higher model score than the 0, to the total number of 1-0
pairs possible. A higher value of concordance (60-70%) means a better fitted
model. However, a very large value (85-95%) could also suggest that the model is
over-fitted and needs to be re-aligned to explain the entire population.
We have used a custom R function, OptimisedConc(), to obtain the concordance value.
OptimisedConc=function(model)
{
Data = cbind(model$y, model$fitted.values)
ones = Data[Data[,1] == 1,]
zeros = Data[Data[,1] == 0,]
conc=matrix(0, dim(zeros)[1], dim(ones)[1])
disc=matrix(0, dim(zeros)[1], dim(ones)[1])
ties=matrix(0, dim(zeros)[1], dim(ones)[1])
for (j in 1:dim(zeros)[1])
{
for (i in 1:dim(ones)[1])
{
if (ones[i,2]>zeros[j,2])
{conc[j,i]=1}
else if (ones[i,2]<zeros[j,2])
{disc[j,i]=1}
else if (ones[i,2]==zeros[j,2])
{ties[j,i]=1}
}
}
Pairs=dim(zeros)[1]*dim(ones)[1]
PercentConcordance=(sum(conc)/Pairs)*100
PercentDiscordance=(sum(disc)/Pairs)*100
PercentTied=(sum(ties)/Pairs)*100
return(list("Percent Concordance"=PercentConcordance,"Percent
Discordance"=PercentDiscordance,"Percent
Tied"=PercentTied,"Pairs"=Pairs))
}
OptimisedConc(fit_train)
In our model, we are getting 91 percent concordance, which may indicate that the model
is over-fitted.
• Lorenz Curve and AUC :-
For train data :
-----------------
#Calculating ROC curve for model
library(ROCR)
#score train data set
train$score<-predict(fit_train,type='response',train)
pred_train<-prediction(train$score,train$def)
perf_train <- performance(pred_train,"tpr","fpr")
plot(perf_train)
# calculating AUC
auc_train <- performance(pred_train,"auc")
auc_train <- unlist(slot(auc_train, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc<-min(round(auc_train, digits = 2))
maxauc<-max(round(auc_train, digits = 2))
minauct <- paste(c("min(AUC) = "),minauc,sep="")
maxauct <- paste(c("max(AUC) = "),maxauc,sep="")
legend(0.3,0.6,c(minauct,maxauct,"n"),border="white",cex=1.7,
box.col = "white")
So for the train data, we are getting an AUC of 0.91.
For Test Data :
--------------------
#score test data set
test$score<-predict(fit_train,type='response',test)
pred_test<-prediction(test$score,test$def)
perf_test <- performance(pred_test,"tpr","fpr")
plot(perf_test)
# calculating AUC
auc_test <- performance(pred_test,"auc")
# now converting S4 class to vector
auc_test <- unlist(slot(auc_test, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc<-min(round(auc_test, digits = 2))
maxauc<-max(round(auc_test, digits = 2))
minauct <- paste(c("min(AUC) = "),minauc,sep="")
maxauct <- paste(c("max(AUC) = "),maxauc,sep="")
legend(0.3,0.6,c(minauct,maxauct,"n"),border="white",cex=1.7,
box.col = "white")
For the test data, we are getting an AUC value of 0.84.
• KS Statistics (for train data) :-
# Calculating KS statistic for train data
max(attr(perf_train,'y.values')[[1]]-
attr(perf_train,'x.values')[[1]])
We are getting a KS statistic value of 0.706 for the train data.
• KS Statistics (for test data) :-
# Calculating KS statistic
max(attr(perf_test,'y.values')[[1]]-
attr(perf_test,'x.values')[[1]])
4.11 Calculating top 3 variables affecting credit score function
#Calculating top 3 variables affecting Credit Score Function
g<-predict(fit_train,type='terms',test)
#function to pick top 3 reasons
#works by sorting coefficient terms in equation and selecting
#top 3 in sort for each loan scored
ftopk<- function(x,top=3){
res=names(x)[order(x, decreasing = TRUE)][1:top]
paste(res,collapse=";",sep="")
}
# Apply the function to each scored row, keeping the top 3 terms
topk=apply(g,1,ftopk,top=3)
#add reason list to scored test sample
test<-cbind(test, topk)
summary(test)
4.12 Reject Inference
The term Reject Inference describes the issue of how to deal with the inherent
bias when modeling is based on a training dataset consisting only of those
previous applicants for whom the actual performance (Good Credit vs. Bad
Credit) has been observed; however, there is likely another significant number of
previous applicants who were rejected and for whom the final "credit performance"
was never observed. The question is how to include those previous applicants in
the modeling, in order to make the predictive model more accurate and robust
(and less biased), and applicable to those individuals as well.
This is of particular importance when the criteria for the decision whether or
not to extend credit need to be loosened, in order to attract and extend credit
to more applicants. This can for example happen during a severe economic
downturn, affecting many people and placing their overall financial well being
into a condition that would not qualify them as acceptable credit risk using
older criteria. In short, if nobody were to qualify for credit any more, then the
institutions extending credit would be out of business. So it is often critically
important to make predictions about observations with specific predictor
values that were essentially outside the range of what would previously have
been considered, and which consequently are unavailable and have not been
observed in the training data where the actual outcomes are recorded.
There are a number of approaches that have been suggested on how to include
previously rejected applicants for credit in the model building step, in order to
make the model more broadly applicable (to those applicants as well). In short,
these methods come down to systematically extrapolating from the actual
observed data, often by deliberately introducing biases and assumptions about
the expected loan outcome, had the (in actuality not observed) applicant been
accepted for credit.
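A minimal sketch of one such extrapolation is shown below, in the style of "fuzzy augmentation"; it was not implemented in this project, and rejects is a hypothetical data frame of rejected applications assumed to have the same predictor columns as train:
# Hypothetical reject-inference sketch (fuzzy-augmentation style)
# 'rejects' is an assumed data frame of rejected applicants with the same
# predictor columns as 'train'; it does not exist in this project's data
p_rej    <- predict(fit_train, newdata = rejects, type = "response")
rej_bad  <- transform(rejects, def = 1, w = p_rej)      # weighted "bad" copy
rej_good <- transform(rejects, def = 0, w = 1 - p_rej)  # weighted "good" copy
accepted <- transform(train, w = 1)                     # observed outcomes keep full weight
augmented <- rbind(accepted, rej_bad, rej_good)
fit_aug  <- glm(def ~ . - w, data = augmented, family = "binomial", weights = w)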
4.13 Model Performance
A specific performance window would be taken into consideration to assess the
accuracy of the model's predictive power.
5 Decision Tree Approach
This is one of the classic data mining techniques for better decision making, as the
tree structure gives a better understanding of the data and the important variables,
which in turn could help minimize the potential risk involved. However, individual
customer probabilities cannot be obtained as directly using this method; hence,
logistic regression is generally preferred over it in industry.
# Decision Tree Implementation
#load tree package
library(rpart)
#build model using 90% 10% priors
#with smaller complexity parameter to allow more complex trees
model_dt <-
rpart(def~.,data=train,parms=list(prior=c(.9,.1)),cp=.0002)
plot(model_dt)
text(model_dt)
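With such a small complexity parameter the tree can grow very large; the cp table can be inspected and the tree pruned back. The sketch below shows one way to do this and is not a step performed in the report:
# Optional pruning step: inspect the complexity table and prune (sketch)
printcp(model_dt)                                  # cross-validated error by cp
best_cp <- model_dt$cptable[which.min(model_dt$cptable[, "xerror"]), "CP"]
model_pruned <- prune(model_dt, cp = best_cp)
plot(model_pruned)
text(model_pruned)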
6 Tools and Techniques
Following tools and techniques have been used :-
• R Studio
• Predictive Modeling, Logistic Regression, Data Mining Techniques
7 Way Forward
Reject inference has not been implemented, so there is scope to implement it.
Data mining techniques like Random Forest and Neural Networks can also be
implemented, depending on the scope and ease of the project; we have only
implemented a Decision Tree.
8 Recommendations and Applications
Typical application areas in the consumer market include: credit cards, auto loans, home mortgages,
home equity loans, mail catalog orders, and a wide variety of personal loan products.
9 References and Bibliography
The below documents have been referred to for this project :-
• Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, by Naeem Siddiqi
• Sharma - Credit Scoring
• Machine Learning with R - Brett Lantz
10 Project Code
Below is the complete R code that has been used for the project :-
rm(list=ls())
data <- read.csv(file.choose(), header=T, stringsAsFactors=FALSE)
str(data)
# Replacing NA values
new_data <- as.data.frame(data, stringsAsFactors=FALSE)
new_data$mths_since_last_delinq[is.na(new_data$mths_since_last_delinq)] <- 9999999
new_data$mths_since_last_record[is.na(new_data$mths_since_last_record)] <- 9999999
new_data[is.na(new_data)] <- 0
str(new_data)
attach(new_data)
# Introduction of Dummy variables
# For Loan status
def <- ifelse(loan_status=="Default", 1, 0)
# purpose dummies
purpose_debt <- ifelse(purpose=='debt_consolidation', 1, 0)
purpose_car <- ifelse(purpose=='car', 1, 0)
purpose_credit <- ifelse(purpose=="credit_card", 1, 0)
purpose_home_imp <- ifelse(purpose=="home_improvement", 1, 0)
purpose_maj_purchase <- ifelse(purpose=="major_purchase", 1, 0)
purpose_ren_energy <- ifelse(purpose=="renewable_energy", 1, 0)
purpose_education <- ifelse(purpose=="educational", 1, 0)
purpose_house <- ifelse(purpose=="house", 1, 0)
purpose_medical <- ifelse(purpose=="medical", 1, 0)
purpose_moving <- ifelse(purpose=="moving", 1, 0)
purpose_small_business <- ifelse(purpose=="small_business", 1, 0)
purpose_vacation <- ifelse(purpose=="vacation", 1, 0)
purpose_wedding <- ifelse(purpose=="wedding", 1, 0)
# home ownership dummies
home_ownership_mort <- ifelse(home_ownership=="MORTGAGE", 1, 0)
home_ownership_own <- ifelse(home_ownership=="OWN", 1, 0)
home_ownership_rent <- ifelse(home_ownership=="RENT", 1, 0)
home_ownership_other <- ifelse(home_ownership=="OTHER", 1, 0)
# Grade
grade_a <- ifelse(grade=="A",1 ,0)
grade_b <- ifelse(grade=="B",1 ,0)
grade_c <- ifelse(grade=="C",1 ,0)
grade_d <- ifelse(grade=="D",1 ,0)
grade_e <- ifelse(grade=="E",1 ,0)
grade_f <- ifelse(grade=="F",1 ,0)
# Fine classing
gloan_amnt <- cut(new_data$loan_amnt, br=c(0, 5000, 15000, 35000),
labels=c("Low","Medium","High"))
gloan_amnt_data <- data.frame(new_data, gloan_amnt)
summary(gloan_amnt_data$gloan_amnt)
gloan_amnt_data_final <-
gloan_amnt_data[complete.cases(gloan_amnt_data),]
summary(gloan_amnt_data_final$gloan_amnt)
glast_fico_range_high <- cut(new_data$last_fico_range_high, br=c(0,
649, 699, 744, 850), labels=c("Low","Medium","High","Very High"))
last_fico_high_data <- data.frame(new_data, glast_fico_range_high)
summary(last_fico_high_data$glast_fico_range_high)
last_fico_high_data_final <-
last_fico_high_data[complete.cases(last_fico_high_data),]
summary(last_fico_high_data_final$glast_fico_range_high)
glast_fico_range_low <- cut(new_data$last_fico_range_low, br=c(0,
645, 695,740, 845), labels=c("Low","Medium","High","Very High"))
last_fico_low_data <- data.frame(new_data, glast_fico_range_low)
summary(last_fico_low_data$glast_fico_range_low)
last_fico_low_data_final <-
last_fico_low_data[complete.cases(last_fico_low_data),]
summary(last_fico_low_data_final$glast_fico_range_low)
gfico_range_high <- cut(new_data$fico_range_high, br=c(0, 650, 700,
745, 850), labels=c("Low","Medium","High","Very High"))
fico_high_data <- data.frame(new_data, gfico_range_high)
summary(fico_high_data$gfico_range_high)
fico_high_data_final <-
fico_high_data[complete.cases(fico_high_data),]
summary(fico_high_data_final$gfico_range_high)
gfico_range_low <- cut(new_data$fico_range_low, br=c(0, 650, 700,
745, 850), labels=c("Low","Medium","High","Very High"))
fico_low_data <- data.frame(new_data, gfico_range_low)
summary(fico_low_data$gfico_range_low)
fico_low_data_final <- fico_low_data[complete.cases(fico_low_data),]
summary(fico_low_data_final$gfico_range_low)
gdti <- cut(new_data$dti, br=c(0, 8, 13, 19, 30),
labels=c("Low","Medium","High","Very High"))
dti_data <- data.frame(new_data, gdti)
summary(dti_data$gdti)
dti_data_final <- dti_data[complete.cases(dti_data),]
summary(dti_data_final$gdti)
gmths_since_last_delinq <- cut(new_data$mths_since_last_delinq,
br=c(0, 48, 6466000, 10000000), labels=c("Low","Medium","High"))
mths_since_last_delinq_data <- data.frame(new_data,
gmths_since_last_delinq)
summary(mths_since_last_delinq_data$gmths_since_last_delinq)
mths_since_last_delinq_data_final <-
mths_since_last_delinq_data[complete.cases(mths_since_last_delinq_data),]
summary(mths_since_last_delinq_data_final$gmths_since_last_delinq)
gmths_since_last_record <- cut(new_data$mths_since_last_record,
br=c(0, 9299000, 10000000), labels=c("Low","High"))
mths_since_last_record_data <- data.frame(new_data,
gmths_since_last_record)
summary(mths_since_last_record_data$gmths_since_last_record)
mths_since_last_record_data_final <-
mths_since_last_record_data[complete.cases(mths_since_last_record_data),]
summary(mths_since_last_record_data_final$gmths_since_last_record)
gannual_inc <- cut(new_data$annual_inc, br=c(4000, 40500, 59000,
82340, 6000000), labels=c("Low","Medium","High","Very High"))
annual_inc_data <- data.frame(new_data, gannual_inc)
summary(annual_inc_data$gannual_inc)
annual_inc_data_final <-
annual_inc_data[complete.cases(annual_inc_data),]
summary(annual_inc_data_final$gannual_inc)
glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392,
1816, 36120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <-
last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)
gtotal_pymnt <- cut(new_data$total_pymnt, br=c(0, 4707, 8089, 13090,
49490), labels=c("Low","Medium","High","Very High"))
total_pymnt_data <- data.frame(new_data, gtotal_pymnt)
summary(total_pymnt_data$gtotal_pymnt)
total_pymnt_data_final <-
total_pymnt_data[complete.cases(total_pymnt_data),]
summary(total_pymnt_data_final$gtotal_pymnt)
gtotal_pymnt_inv <- cut(new_data$total_pymnt_inv, br=c(0, 4312,
7590, 12390, 49100), labels=c("Low","Medium","High","Very High"))
total_pymnt_inv_data <- data.frame(new_data, gtotal_pymnt_inv)
summary(total_pymnt_inv_data$gtotal_pymnt_inv)
total_pymnt_inv_data_final <-
total_pymnt_inv_data[complete.cases(total_pymnt_inv_data),]
summary(total_pymnt_inv_data_final$gtotal_pymnt_inv)
dim(total_pymnt_inv_data_final)
glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392,
1816, 16120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <-
last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)
dim(last_pymnt_amnt_data_final)
gtotal_rec_int <- cut(new_data$total_rec_int, br=c(0, 640, 1290,
2618, 15300), labels=c("Low","Medium","High","Very High"))
total_rec_int_data <- data.frame(new_data, gtotal_rec_int)
summary(total_rec_int_data$gtotal_rec_int)
total_rec_int_data_final <-
total_rec_int_data[complete.cases(total_rec_int_data),]
summary(total_rec_int_data_final$gtotal_rec_int)
dim(total_rec_int_data_final)
# define global
purpose <- data.frame(purpose_debt ,purpose_car ,purpose_credit
,purpose_home_imp ,purpose_maj_purchase,purpose_ren_energy
,purpose_education ,purpose_house ,purpose_medical ,purpose_moving
,purpose_small_business ,purpose_vacation ,purpose_wedding)
home_ownership <- data.frame(home_ownership_mort,
home_ownership_own, home_ownership_rent, home_ownership_other)
grade <- data.frame(grade_a,grade_b,grade_c,grade_d,grade_e,grade_f)
final_data <- data.frame(def, annual_inc_data$gannual_inc,
last_fico_high_data$glast_fico_range_high,
last_fico_low_data$glast_fico_range_low,
fico_low_data$gfico_range_low, fico_high_data$gfico_range_high,
mths_since_last_delinq_data$gmths_since_last_delinq,
total_pymnt_data$gtotal_pymnt,
total_pymnt_inv_data$gtotal_pymnt_inv, term,
last_pymnt_amnt_data$glast_pymnt_amnt,
total_rec_int_data$gtotal_rec_int, purpose, grade,home_ownership)
fit1 <- glm(def~.,data=final_data, family="binomial")
summary(fit1)
final_data1<- data.frame(def, annual_inc_data$gannual_inc,
last_fico_high_data$glast_fico_range_high,
mths_since_last_delinq_data$gmths_since_last_delinq,
total_pymnt_data$gtotal_pymnt,
total_pymnt_inv_data$gtotal_pymnt_inv, term,
last_pymnt_amnt_data$glast_pymnt_amnt,
total_rec_int_data$gtotal_rec_int)
fit2 <- glm(def~.,data=final_data1, family="binomial")
summary(fit2)
#IV calculation
iv.mult(final_data1,"def",TRUE)
iv_data <- data.frame(def, annual_inc_data$gannual_inc,
last_fico_high_data$glast_fico_range_high,
mths_since_last_delinq_data$gmths_since_last_delinq,
total_pymnt_data$gtotal_pymnt,
total_pymnt_inv_data$gtotal_pymnt_inv, term,
last_pymnt_amnt_data$glast_pymnt_amnt,
total_rec_int_data$gtotal_rec_int)
iv.plot.summary(iv.mult(final_data1,"def",TRUE))
#taking IV cut off as 0.1
iv_data_final <- data.frame(def,
last_fico_high_data$glast_fico_range_high,
total_pymnt_data$gtotal_pymnt,
total_pymnt_inv_data$gtotal_pymnt_inv, term,
last_pymnt_amnt_data$glast_pymnt_amnt,
total_rec_int_data$gtotal_rec_int)
iv.mult(iv_data_final,"def",TRUE)
iv.plot.summary(iv.mult(iv_data_final,"def",TRUE))
fin_data <- data.frame(def, last_fico_range_high, last_pymnt_amnt,
total_pymnt, total_rec_int)
fit <- glm(def~., data=fin_data, family="binomial")
summary(fit)
vif(fit)
# Dividing data into train and test
totalrecords <- nrow(fin_data)
trainfraction = 0.7
trainrecords = as.integer(totalrecords * trainfraction)
allrows <- 1:totalrecords
trainrows <- sample(totalrecords,trainrecords)
testrows <- allrows[-trainrows]
train<-data.frame(fin_data[trainrows,])
test<-data.frame(fin_data[testrows,])
dim(train)
dim(test)
fit_train <- glm(def~., data=train, family="binomial")
summary(fit_train)
# multicollinearity
vif(fit_train)
#Calculating ROC curve for model
library(ROCR)
#score train data set
train$score<-predict(fit_train,type='response',train)
pred_train<-prediction(train$score,train$def)
perf_train <- performance(pred_train,"tpr","fpr")
plot(perf_train)
# calculating AUC
auc_train <- performance(pred_train,"auc")
auc_train <- unlist(slot(auc_train, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc<-min(round(auc_train, digits = 2))
maxauc<-max(round(auc_train, digits = 2))
minauct <- paste(c("min(AUC) = "),minauc,sep="")
maxauct <- paste(c("max(AUC) = "),maxauc,sep="")
legend(0.3,0.6,c(minauct,maxauct,"n"),border="white",cex=1.7,
box.col = "white")
# Calculating KS statistic for train data
max(attr(perf_train,'y.values')[[1]]-
attr(perf_train,'x.values')[[1]])
# Concordance
OptimisedConc=function(model)
{
Data = cbind(model$y, model$fitted.values)
ones = Data[Data[,1] == 1,]
zeros = Data[Data[,1] == 0,]
conc=matrix(0, dim(zeros)[1], dim(ones)[1])
disc=matrix(0, dim(zeros)[1], dim(ones)[1])
ties=matrix(0, dim(zeros)[1], dim(ones)[1])
for (j in 1:dim(zeros)[1])
{
for (i in 1:dim(ones)[1])
{
if (ones[i,2]>zeros[j,2])
{conc[j,i]=1}
else if (ones[i,2]<zeros[j,2])
{disc[j,i]=1}
else if (ones[i,2]==zeros[j,2])
{ties[j,i]=1}
}
}
Pairs=dim(zeros)[1]*dim(ones)[1]
PercentConcordance=(sum(conc)/Pairs)*100
PercentDiscordance=(sum(disc)/Pairs)*100
PercentTied=(sum(ties)/Pairs)*100
return(list("Percent Concordance"=PercentConcordance,"Percent
Discordance"=PercentDiscordance,"Percent
Tied"=PercentTied,"Pairs"=Pairs))
}
# concordance of train data
OptimisedConc(fit_train)
#score test data set
test$score<-predict(fit_train,type='response',test)
pred_test<-prediction(test$score,test$def)
perf_test <- performance(pred_test,"tpr","fpr")
plot(perf_test)
# calculating AUC
auc_test <- performance(pred_test,"auc")
auc_test <- unlist(slot(auc_test, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc<-min(round(auc_test, digits = 2))
maxauc<-max(round(auc_test, digits = 2))
minauct <- paste(c("min(AUC) = "),minauc,sep="")
maxauct <- paste(c("max(AUC) = "),maxauc,sep="")
legend(0.3,0.6,c(minauct,maxauct,"n"),border="white",cex=1.7,
box.col = "white")
# Calculating KS statistic
max(attr(perf_test,'y.values')[[1]]-attr(perf_test,'x.values')[[1]])
# Probabilities prediction
prob_glm <- predict.glm(fit_train, type="response",se.fit=FALSE)
write.csv(prob_glm, "probability.csv")
plot(prob_glm)
#Calculating top 3 variables affecting Credit Score Function
g<-predict(fit_train,type='terms',test)
#function to pick top 3 reasons
#works by sorting coefficient terms in equation and selecting top 3
#in sort for each loan scored
ftopk<- function(x,top=3){
res=names(x)[order(x, decreasing = TRUE)][1:top]
paste(res,collapse=";",sep="")
}
# Apply the function to each scored row, keeping the top 3 terms
topk=apply(g,1,ftopk,top=3)
#add reason list to scored test sample
test<-cbind(test, topk)
summary(test)
# Decision Tree Implementation
#load tree package
library(rpart)
#build model using 90% 10% priors
#with smaller complexity parameter to allow more complex trees
model_dt <-
rpart(def~.,data=train,parms=list(prior=c(.9,.1)),cp=.0002)
plot(model_dt)
text(model_dt)
#score test data
test$tscore1<-predict(model_dt,type='prob',test)
pred5<-prediction(test$tscore1[,2],test$def)
perf5 <- performance(pred5,"tpr","fpr")
# Random Forest
library(randomForest)
# presumably intended to drop rows with a missing target before fitting the forest
train <- train[complete.cases(train$def), ]
arf <- randomForest(def~., data=train, importance=TRUE, proximity=TRUE,
ntree=500, keep.forest=TRUE)
#plot variable importance
varImpPlot(arf)
train$p <- predict(model_dt,train,type="prob")
train$p
summary(train$p)
# Note: 'model_rf' and 'testdata' are not created earlier in this script;
# presumably the random forest fit (arf) and the test data frame were intended
test$p_rf <- predict(arf, test)
summary(test$p_rf)
----------------------- End Of Report --------------------------
Predicting Delinquency-Give me some credit
 
IRJET- Prediction of Credit Risks in Lending Bank Loans
IRJET- Prediction of Credit Risks in Lending Bank LoansIRJET- Prediction of Credit Risks in Lending Bank Loans
IRJET- Prediction of Credit Risks in Lending Bank Loans
 
Data mining on Financial Data
Data mining on Financial DataData mining on Financial Data
Data mining on Financial Data
 
Applications of Data Science in Banking and Financial sector.pptx
Applications of Data Science in Banking and Financial sector.pptxApplications of Data Science in Banking and Financial sector.pptx
Applications of Data Science in Banking and Financial sector.pptx
 
Loan Analysis Predicting Defaulters
Loan Analysis Predicting DefaultersLoan Analysis Predicting Defaulters
Loan Analysis Predicting Defaulters
 
Chapter8 - Beyond Classification
Chapter8 - Beyond ClassificationChapter8 - Beyond Classification
Chapter8 - Beyond Classification
 
Credit decision-indices-a-flexible-tool-for-both-credit-consumers-and-providers
Credit decision-indices-a-flexible-tool-for-both-credit-consumers-and-providersCredit decision-indices-a-flexible-tool-for-both-credit-consumers-and-providers
Credit decision-indices-a-flexible-tool-for-both-credit-consumers-and-providers
 
Data Mining in Life Insurance Business
Data Mining in Life Insurance BusinessData Mining in Life Insurance Business
Data Mining in Life Insurance Business
 
Risk dk
Risk dkRisk dk
Risk dk
 
The-Report-V7
The-Report-V7The-Report-V7
The-Report-V7
 
A data mining approach to predict
A data mining approach to predictA data mining approach to predict
A data mining approach to predict
 
2018-Case-Study.pdf
2018-Case-Study.pdf2018-Case-Study.pdf
2018-Case-Study.pdf
 
A Guide for Credit Providers Moving to Participate in CCR by David Grafton
A Guide for Credit Providers Moving to Participate in CCR by David GraftonA Guide for Credit Providers Moving to Participate in CCR by David Grafton
A Guide for Credit Providers Moving to Participate in CCR by David Grafton
 
PredictiveMetrics' Predictive Scoring for Collections Capabilities
PredictiveMetrics' Predictive Scoring for Collections CapabilitiesPredictiveMetrics' Predictive Scoring for Collections Capabilities
PredictiveMetrics' Predictive Scoring for Collections Capabilities
 
Manuscript dss
Manuscript dssManuscript dss
Manuscript dss
 
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING
 
Credit process
Credit processCredit process
Credit process
 
Sample Risk Assessment Report- QuantumBanking.pdf
Sample Risk Assessment Report- QuantumBanking.pdfSample Risk Assessment Report- QuantumBanking.pdf
Sample Risk Assessment Report- QuantumBanking.pdf
 
Single View of Customer in Banking
Single View of Customer in BankingSingle View of Customer in Banking
Single View of Customer in Banking
 
Projecting Impact of Non-Traditional Data and Advanced Analytics on Delivery ...
Projecting Impact of Non-Traditional Data and Advanced Analytics on Delivery ...Projecting Impact of Non-Traditional Data and Advanced Analytics on Delivery ...
Projecting Impact of Non-Traditional Data and Advanced Analytics on Delivery ...
 

Project Report - Acquisition Credit Scoring Model

losses for the lending institution. There are many variations and complexities in how exactly credit is extended to individuals, businesses, and other organizations for various purposes (purchasing equipment, real estate, consumer items, and so on), and through various methods of credit (credit card, loan, delayed payment plan). But in all cases, a lender provides money to an individual or institution and expects to be paid back in time, with interest commensurate with the risk of default.

Credit scoring is the set of decision models, and their underlying techniques, that aid lenders in granting consumer credit. These techniques determine who will get credit, how much credit they should get, and which operational strategies will enhance the profitability of the borrowers to the lenders. They also help to assess the risk in lending. Credit scoring is a dependable assessment of a person's creditworthiness since it is based on actual data.

A lender commonly makes two types of decisions: first, whether to grant credit to a new applicant, and second, how to deal with existing applicants, including whether to increase their credit limits. In both cases, whatever the techniques used, it is critical that there is a large sample of previous customers with their application details, behavioral patterns, and subsequent credit history available. Most of the techniques use this sample to identify the connection between the characteristics of the consumers (annual income, age, number of years in employment with their current employer, etc.) and their subsequent credit history. Typical application areas in the consumer market include: credit cards, auto loans, home mortgages, home equity loans, mail catalog orders, and a wide variety of personal loan products.

2 Scope and Objectives

The scope of this project is to evaluate the application of Logistic Regression and various Data Mining techniques to credit scoring, so as to support better lending decisions and reduce risk.
The objective of this project is to build a credit scoring model that quantifies the potential risk involved, on the basis of which credit lending decisions can be made.

3 Data Sources

The applicant data used in this project, involving US customers, was provided by our mentor.

4 Analytical Approach

The steps followed in this project are described below.

4.1 Data Collection

Customer credit history, along with all supporting information, was provided by our mentor. The data is cross-sectional and belongs to a single point in time.

4.2 Data Preparation

Listed below are some of the techniques used for data preparation.

1. Initial variable selection based on judgement – First we identified some of the key variables out of all the variables given in our data set, purely based on judgement. Initial screening of variables is very important and requires a deeper understanding of the domain as well as experience. Out of the 46 variables, we took some 20-odd variables into consideration for modeling. Below is the list of initial variables:

Loan_amnt, Term, Annual_inc, Fico_range_high, Fico_range_low, Last_fico_range_high, Last_fico_range_low, Purpose, Home_ownership, Grade, Dti, Mths_since_last_delinq, Mths_since_last_record, Last_pymnt_amnt, Total_pymnt, Total_pymnt_inv, Pub_rec, Total_rec_int, Revol_bal

2. Missing value treatment – There are normally two techniques used for missing value treatment: either removing the entire variable, if more than 80–90% of its observations are missing, or imputing a very high value such as 9999999, if the variable is significant from a modeling perspective and hence cannot be dropped. In our case we went with the latter, as variables like Mths_since_last_delinq and Mths_since_last_record carry a lot of importance for modeling even though more than 90% of their observations were missing in the original data set. For the other, non-significant variables, we simply imputed zero. Below is the R code used for missing value treatment:

new_data <- as.data.frame(data, stringsAsFactors=FALSE)
new_data$mths_since_last_delinq[is.na(new_data$mths_since_last_delinq)] <- 9999999
new_data$mths_since_last_record[is.na(new_data$mths_since_last_record)] <- 9999999
new_data[is.na(new_data)] <- 0
str(new_data)
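Before deciding between dropping and imputing a variable, the share of missing observations per column can be checked directly. The one-liner below is an illustrative check only; it uses the raw data frame read in at the start of the project code.

# Proportion of missing values per variable, sorted in descending order (illustrative check)
sort(colMeans(is.na(data)), decreasing = TRUE)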
3. Identifying the dependent variable – In any business problem, it is critical to first understand the problem statement and then approach it with an appropriate solution. Hence, in our case, the first priority was to identify the dependent variable correctly; getting this wrong would have caused the modeling to go completely off track. Our problem is concerned with finding out whether a person is going to default or not, by calculating the probability of default among the consumers. From our data set, we picked LOAN_STATUS as the dependent variable, which contains values like Current, Default, Charged Off, etc.

4. Data transformation – Data transformation is one of the most significant steps before the actual modeling and contributes a lot to making the final model robust. It includes various steps such as the introduction of dummy variables and the conversion of continuous variables to categorical variables.

• Introduction of dummy variables – We introduced dummy variables for the character variables as follows:

# Introduction of dummy variables
# For loan status
def <- ifelse(loan_status=="Default", 1, 0)
# purpose dummies
purpose_debt <- ifelse(purpose=='debt_consolidation', 1, 0)
purpose_car <- ifelse(purpose=='car', 1, 0)
purpose_credit <- ifelse(purpose=="credit_card", 1, 0)
purpose_home_imp <- ifelse(purpose=="home_improvement", 1, 0)
purpose_maj_purchase <- ifelse(purpose=="major_purchase", 1, 0)
purpose_ren_energy <- ifelse(purpose=="renewable_energy", 1, 0)
purpose_education <- ifelse(purpose=="educational", 1, 0)
purpose_house <- ifelse(purpose=="house", 1, 0)
purpose_medical <- ifelse(purpose=="medical", 1, 0)
purpose_moving <- ifelse(purpose=="moving", 1, 0)
purpose_small_business <- ifelse(purpose=="small_business", 1, 0)
purpose_vacation <- ifelse(purpose=="vacation", 1, 0)
purpose_wedding <- ifelse(purpose=="wedding", 1, 0)
# home ownership dummies
home_ownership_mort <- ifelse(home_ownership=="MORTGAGE", 1, 0)
home_ownership_own <- ifelse(home_ownership=="OWN", 1, 0)
home_ownership_rent <- ifelse(home_ownership=="RENT", 1, 0)
home_ownership_other <- ifelse(home_ownership=="OTHER", 1, 0)
# grade dummies
grade_a <- ifelse(grade=="A", 1, 0)
grade_b <- ifelse(grade=="B", 1, 0)
grade_c <- ifelse(grade=="C", 1, 0)
grade_d <- ifelse(grade=="D", 1, 0)
grade_e <- ifelse(grade=="E", 1, 0)
grade_f <- ifelse(grade=="F", 1, 0)

• Conversion of continuous variables to categorical variables (fine classing) – As a good modeling practice, it is recommended that continuous variables be converted into categorical variables by introducing bins; this is also called fine classing. It generally yields better results during the Information Value (IV) calculation for the predictor variables. We followed the approach below:

gloan_amnt <- cut(new_data$loan_amnt, br=c(0, 5000, 15000, 35000), labels=c("Low","Medium","High"))
gloan_amnt_data <- data.frame(new_data, gloan_amnt)
summary(gloan_amnt_data$gloan_amnt)
gloan_amnt_data_final <- gloan_amnt_data[complete.cases(gloan_amnt_data),]
summary(gloan_amnt_data_final$gloan_amnt)

glast_fico_range_high <- cut(new_data$last_fico_range_high, br=c(0, 649, 699, 744, 850), labels=c("Low","Medium","High","Very High"))
last_fico_high_data <- data.frame(new_data, glast_fico_range_high)
summary(last_fico_high_data$glast_fico_range_high)
last_fico_high_data_final <- last_fico_high_data[complete.cases(last_fico_high_data),]
summary(last_fico_high_data_final$glast_fico_range_high)

glast_fico_range_low <- cut(new_data$last_fico_range_low, br=c(0, 645, 695, 740, 845), labels=c("Low","Medium","High","Very High"))
last_fico_low_data <- data.frame(new_data, glast_fico_range_low)
summary(last_fico_low_data$glast_fico_range_low)
last_fico_low_data_final <- last_fico_low_data[complete.cases(last_fico_low_data),]
summary(last_fico_low_data_final$glast_fico_range_low)

gfico_range_high <- cut(new_data$fico_range_high, br=c(0, 650, 700, 745, 850), labels=c("Low","Medium","High","Very High"))
fico_high_data <- data.frame(new_data, gfico_range_high)
summary(fico_high_data$gfico_range_high)
fico_high_data_final <- fico_high_data[complete.cases(fico_high_data),]
summary(fico_high_data_final$gfico_range_high)

gfico_range_low <- cut(new_data$fico_range_low, br=c(0, 650, 700, 745, 850), labels=c("Low","Medium","High","Very High"))
fico_low_data <- data.frame(new_data, gfico_range_low)
summary(fico_low_data$gfico_range_low)
fico_low_data_final <- fico_low_data[complete.cases(fico_low_data),]
summary(fico_low_data_final$gfico_range_low)

gdti <- cut(new_data$dti, br=c(0, 8, 13, 19, 30), labels=c("Low","Medium","High","Very High"))
dti_data <- data.frame(new_data, gdti)
summary(dti_data$gdti)
dti_data_final <- dti_data[complete.cases(dti_data),]
summary(dti_data_final$gdti)

gmths_since_last_delinq <- cut(new_data$mths_since_last_delinq, br=c(0, 48, 6466000, 10000000), labels=c("Low","Medium","High"))
mths_since_last_delinq_data <- data.frame(new_data, gmths_since_last_delinq)
summary(mths_since_last_delinq_data$gmths_since_last_delinq)
mths_since_last_delinq_data_final <- mths_since_last_delinq_data[complete.cases(mths_since_last_delinq_data),]
summary(mths_since_last_delinq_data_final$gmths_since_last_delinq)

gmths_since_last_record <- cut(new_data$mths_since_last_record, br=c(0, 9299000, 10000000), labels=c("Low","High"))
mths_since_last_record_data <- data.frame(new_data, gmths_since_last_record)
summary(mths_since_last_record_data$gmths_since_last_record)
mths_since_last_record_data_final <- mths_since_last_record_data[complete.cases(mths_since_last_record_data),]
summary(mths_since_last_record_data_final$gmths_since_last_record)

gannual_inc <- cut(new_data$annual_inc, br=c(4000, 40500, 59000, 82340, 6000000), labels=c("Low","Medium","High","Very High"))
annual_inc_data <- data.frame(new_data, gannual_inc)
summary(annual_inc_data$gannual_inc)
annual_inc_data_final <- annual_inc_data[complete.cases(annual_inc_data),]
summary(annual_inc_data_final$gannual_inc)

glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392, 1816, 36120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <- last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)

gtotal_pymnt <- cut(new_data$total_pymnt, br=c(0, 4707, 8089, 13090, 49490), labels=c("Low","Medium","High","Very High"))
total_pymnt_data <- data.frame(new_data, gtotal_pymnt)
summary(total_pymnt_data$gtotal_pymnt)
total_pymnt_data_final <- total_pymnt_data[complete.cases(total_pymnt_data),]
summary(total_pymnt_data_final$gtotal_pymnt)

gtotal_pymnt_inv <- cut(new_data$total_pymnt_inv, br=c(0, 4312, 7590, 12390, 49100), labels=c("Low","Medium","High","Very High"))
total_pymnt_inv_data <- data.frame(new_data, gtotal_pymnt_inv)
summary(total_pymnt_inv_data$gtotal_pymnt_inv)
total_pymnt_inv_data_final <- total_pymnt_inv_data[complete.cases(total_pymnt_inv_data),]
summary(total_pymnt_inv_data_final$gtotal_pymnt_inv)
dim(total_pymnt_inv_data_final)

# last_pymnt_amnt re-binned with a tighter upper break
glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392, 1816, 16120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <- last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)
dim(last_pymnt_amnt_data_final)

gtotal_rec_int <- cut(new_data$total_rec_int, br=c(0, 640, 1290, 2618, 15300), labels=c("Low","Medium","High","Very High"))
total_rec_int_data <- data.frame(new_data, gtotal_rec_int)
summary(total_rec_int_data$gtotal_rec_int)
total_rec_int_data_final <- total_rec_int_data[complete.cases(total_rec_int_data),]
summary(total_rec_int_data_final$gtotal_rec_int)
dim(total_rec_int_data_final)

4.3 Variable Reduction

Once the strongest characteristics are grouped and ranked, variable selection is done. At the end of this step, the scorecard developer will have a set of strong, grouped characteristics, preferably representing independent information types, for use in the regression step. The strength of a characteristic is gauged using four main criteria:
• Predictive power of each attribute. The weight of evidence (WOE) measure is used for this purpose.
• The range and trend of weight of evidence across grouped attributes within a characteristic.
• Predictive power of the characteristic. The Information Value (IV) measure is used for this purpose.
• Operational and business considerations.

In our case, we have used the Information Value measure for variable reduction.

Some analysts run other variable selection algorithms (e.g., those that rank predictive power using Chi-Square or R-Square) prior to grouping characteristics. This gives them an indication of characteristic strength using independent means, and also alerts them to cases where the Information Value figure is high or low compared to other measures.

The initial characteristic analysis process can be interactive, and involvement from business users and operations staff should be encouraged. In particular, they may provide further insights into any unexpected or illogical behavior patterns and enhance the grouping of all variables.

The first step in performing this analysis is to perform an initial grouping of the variables and rank-order them by IV or some other strength measure. This can be done using a number of binning techniques. A good way to start is to bin the continuous (interval) variables into 50 or so equal-sized groups, and then to calculate the WOE and IV for the grouped attributes and characteristics; a minimal sketch of this equal-frequency starting point is given below. One can then use any spreadsheet software to fine-tune the groupings for the stronger characteristics based on principles outlined in the next section. Similarly, for categorical characteristics, the WOE of each unique attribute and the IV of each characteristic can be calculated, and the groupings fine-tuned for those characteristics that surpass a minimum acceptable strength. Decision trees are also often used for grouping variables. Most users, however, use them to generate initial ideas, and then use alternate software applications to interactively fine-tune the groupings.
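The snippet below is a minimal sketch of that equal-frequency fine-classing starting point. The choice of 20 bins and the use of annual_inc are illustrative assumptions only, not part of the project's final grouping.

# Equal-frequency fine classing of one continuous variable (illustrative; 20 bins assumed)
n_bins <- 20
brks <- unique(quantile(new_data$annual_inc, probs = seq(0, 1, length.out = n_bins + 1), na.rm = TRUE))
fine_annual_inc <- cut(new_data$annual_inc, breaks = brks, include.lowest = TRUE)
table(fine_annual_inc)   # roughly equal-sized bins from which the WOE/IV analysis can start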
• Information Value (IV): Information Value provides a measure of how well a variable X is able to distinguish between a binary response (e.g. "good" vs. "bad") in some target variable Y. The idea is that if a variable X has a low Information Value, it may not do a sufficient job of classifying the target variable, and hence it is removed as an explanatory variable.

To see how this works, let X be grouped into n bins. Each x ∈ X corresponds to a y ∈ Y that may take one of two values, say 0 or 1. Then, summing over the bins Xi, 1 ≤ i ≤ n,

IV = Σ (gi − bi) × ln(gi / bi)

where bi is the proportion of 0's in bin i relative to all bins, gi is the proportion of 1's in bin i relative to all bins, and ln(gi / bi) is known as the weight of evidence (WOE) of bin Xi.

The cut-off value may vary, but in our case we have taken an IV cut-off of 0.1. Below is the R code, output and plot of the IV calculation:

# IV calculation
# iv.mult() and iv.plot.summary() are provided by an external IV/WoE helper package (assumed loaded)
iv.mult(final_data1,"def",TRUE)
iv_data <- data.frame(def, annual_inc_data$gannual_inc, last_fico_high_data$glast_fico_range_high, mths_since_last_delinq_data$gmths_since_last_delinq, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)
iv.plot.summary(iv.mult(final_data1,"def",TRUE))
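As a cross-check on the packaged IV routine, the WOE and IV of a single grouped characteristic can also be computed directly from the formula above. The function below is an illustrative sketch only: it follows the document's definition of gi and bi, reuses the grouped variables built earlier, and assumes a small smoothing constant to avoid division by zero in empty bins.

# Illustrative manual WOE/IV calculation for one grouped characteristic
woe_iv <- function(bins, target, eps = 0.5) {
  tab <- table(bins, target)                            # rows = bins, cols = target (0/1)
  b_i <- (tab[, "0"] + eps) / sum(tab[, "0"] + eps)     # proportion of 0's in bin i
  g_i <- (tab[, "1"] + eps) / sum(tab[, "1"] + eps)     # proportion of 1's in bin i
  woe <- log(g_i / b_i)                                 # weight of evidence per bin
  iv  <- sum((g_i - b_i) * woe)                         # information value of the characteristic
  list(woe = woe, iv = iv)
}

# Example: IV of the grouped last FICO range (high), using objects defined above
woe_iv(last_fico_high_data$glast_fico_range_high, def)$iv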
# taking the IV cut-off as 0.1
iv_data_final <- data.frame(def, last_fico_high_data$glast_fico_range_high, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)
iv.mult(iv_data_final,"def",TRUE)
iv.plot.summary(iv.mult(iv_data_final,"def",TRUE))

4.4 Data Sampling

Sampling has been done using the 70–30% rule: the entire data set is split in a 70:30 ratio to be used as the development (training) and validation (test) samples respectively. (For reproducibility, a fixed random seed can be set before sampling.) Below is the R code:
# Dividing data into train and test
totalrecords <- nrow(fin_data)
trainfraction <- 0.7
trainrecords <- as.integer(totalrecords * trainfraction)
allrows <- 1:totalrecords
trainrows <- sample(totalrecords, trainrecords)
testrows <- allrows[-trainrows]
train <- data.frame(fin_data[trainrows,])
test <- data.frame(fin_data[testrows,])
dim(train)
dim(test)

4.5 Model Development

When the training data set on which the modeling is based contains a binary indicator variable such as "Paid back" vs. "Default", or "Good Credit" vs. "Bad Credit", Logistic Regression models are well suited for the subsequent predictive modeling. Logistic regression yields prediction probabilities for whether or not a particular outcome (e.g., Bad Credit) will occur. Furthermore, logistic regression models are linear models, in that the logit-transformed prediction probability is a linear function of the predictor variable values. Thus, a scorecard derived in this manner has the desirable quality that the final credit score (credit risk) is a linear function of the predictors and, with some additional transformations applied to the model parameters, a simple linear function of scores that can be associated with each predictor class value after coarse classing. The final credit score is then a simple sum of the individual score values taken from the scorecard (see the scaling sketch below).

In our model, we have used the train data set, which contains 70% of the original data. Below is the R code for the logistic regression:

fit_train <- glm(def~., data=train, family="binomial")
summary(fit_train)
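The "additional transformations" mentioned above usually amount to rescaling the model's log-odds onto a points scale. The sketch below illustrates the standard points-to-double-the-odds (PDO) scaling; the anchor values (600 points at 50:1 good:bad odds, 20 points to double the odds) are assumptions for the sketch, not values prescribed by this project.

# Illustrative rescaling of the logistic model's log-odds to scorecard points (assumed anchors)
pdo <- 20
base_points <- 600
base_odds <- 50

scaling_factor <- pdo / log(2)
scaling_offset <- base_points - scaling_factor * log(base_odds)

# log-odds of being "good" = minus the log-odds of default predicted by the fitted model
log_odds_good <- -predict(fit_train, newdata = train, type = "link")
train$credit_score <- scaling_offset + scaling_factor * log_odds_good
summary(train$credit_score)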
4.6 Intercept not significant in the development sample

An intercept is almost always part of the model and is almost always significantly different from zero. The test of the intercept in the procedure output tests whether this parameter is equal to zero. If the intercept is zero (equivalent to having no intercept in the model), the resulting model implies that the response function must be exactly zero when all the predictors are set to zero. For a logistic model this means that the logit (or log-odds) is zero, which implies that the event probability is 0.5. This is a very strong assumption that is sometimes reasonable, but more often is not.

So, a highly significant intercept in the model is generally not a problem. By the same token, if the intercept is not significant, you usually would not want to remove it from the model, because doing so creates a model that forces the response function to be zero when all the predictors are zero. If the nature of what you are modeling is such that you want to assume this, then you might want to remove the intercept. In our case, we obtained an intercept value of around 0.58 in the validation sample.
4.7 Assessment of sign of variables

a. Last_fico_range_high – The negative sign of the coefficient signifies an inverse relationship with default, which is expected in this context: the higher the FICO value, the lower the chance of default.
b. Total_pymnt – This also shows an inverse relationship with default, which is again expected.
c. Last_pymnt_amnt – The negative sign of the coefficient shows an inverse relationship with default.
d. Total_rec_int – The positive sign of the coefficient signifies a direct relationship with default.

4.8 Multicollinearity Test

Multicollinearity occurs when there are high correlations among predictor variables, leading to unreliable and unstable estimates of the regression coefficients. After model building, a multicollinearity check is normally performed, using the VIF (Variance Inflation Factor), to ensure that the independent variables are not highly correlated. Normally the cut-off for VIF is taken as 5; variables with a VIF above 5 should be considered collinear. However, for factor/categorical variables, the GVIF value is used as the baseline: terms which require more than one coefficient, and thus more than one degree of freedom, are typically evaluated using the GVIF. For one-coefficient terms the VIF is equal to the GVIF. There are four options for producing the VIF value:

• corvif command from the AED package
• vif command from the car package
• vif command from the rms package
• vif command from the DAAG package

Of these, "car" and "AED" produce the GVIF value and the other two produce the VIF value.

# Multicollinearity check
library(car)
vif(fit_train)
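For models that contain factor terms, car::vif() reports the GVIF together with GVIF^(1/(2·Df)). A common rule of thumb, sketched below as an illustration rather than a step performed in this project, is to square that adjusted value and compare it with the usual VIF cut-off of 5; for the final model here, which has only numeric predictors, vif() simply returns plain VIFs.

# Illustrative interpretation of vif()/GVIF output (sketch only)
v <- vif(fit_train)
if (is.matrix(v)) {
  adj_vif <- v[, "GVIF^(1/(2*Df))"]^2   # GVIF adjusted for degrees of freedom, squared
  print(adj_vif[adj_vif > 5])           # factor terms exceeding the threshold, if any
} else {
  print(v[v > 5])                       # plain VIFs for a model with only numeric terms
}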
(Screenshots of the VIF output and the logistic regression diagnostic plots appeared here.)
As can be seen, four diagnostic plots are generated when plotting the logistic model output.

4.9 Probability Prediction

From the above logistic output we predicted the default probabilities of the individual customers. The R code is below (the fitted model object is fit_train):

# Probability prediction
prob_glm <- predict.glm(fit_train, type="response", se.fit=FALSE)
write.csv(prob_glm, "probability.csv")
plot(prob_glm)
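The predicted probabilities can be turned into accept/reject style classifications by applying a cut-off. The sketch below uses an assumed cut-off of 0.5 purely for illustration; in practice the cut-off would be chosen from the score distribution and business considerations.

# Illustrative classification of the training sample at an assumed cut-off of 0.5
cutoff <- 0.5
pred_def <- ifelse(prob_glm > cutoff, 1, 0)

# Confusion matrix of actual vs predicted default, and overall accuracy at this cut-off
table(actual = train$def, predicted = pred_def)
mean(pred_def == train$def)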
4.10 Model Prediction (Goodness-of-fit)

Goodness-of-fit attempts to measure how well a model fits the data. It is usually applied after a final model has been selected; if we have multiple models, goodness-of-fit is used to choose among them. Concordance, discordance, the ROC curve and the KS statistic are used for this purpose.

• Concordance: In OLS regression, the R-squared, and its more refined measure the adjusted R-squared, is the 'one-stop' metric that immediately tells us whether the model is a good fit. Since it is a value between 0 and 1, it can easily be expressed as a percentage and passed off as 'model accuracy' for beginners and not-so-math-oriented businesses. Unfortunately, the adjusted R-squared is largely irrelevant in the case of logistic regression, because we model the log-odds ratio and it becomes very difficult to explain. This is where concordance helps. Concordance tells us, in percentage terms, the association between the actual values and the values fitted by the model. It is defined as the ratio of the number of (1, 0) pairs in which the 1 had a higher model score than the 0, to the total number of (1, 0) pairs possible. A higher value of concordance (60–70%) means a better-fitted model. However, a very large value (85–95%) could also suggest that the model is over-fitted and needs to be re-examined so that it generalizes to the entire population.

We have used a user-defined R helper function, OptimisedConc(), to obtain the concordance value:

OptimisedConc = function(model) {
  Data = cbind(model$y, model$fitted.values)
  ones = Data[Data[,1] == 1,]
  zeros = Data[Data[,1] == 0,]
  conc = matrix(0, dim(zeros)[1], dim(ones)[1])
  disc = matrix(0, dim(zeros)[1], dim(ones)[1])
  ties = matrix(0, dim(zeros)[1], dim(ones)[1])
  for (j in 1:dim(zeros)[1]) {
    for (i in 1:dim(ones)[1]) {
      if (ones[i,2] > zeros[j,2]) {conc[j,i] = 1}
      else if (ones[i,2] < zeros[j,2]) {disc[j,i] = 1}
      else if (ones[i,2] == zeros[j,2]) {ties[j,i] = 1}
    }
  }
  Pairs = dim(zeros)[1] * dim(ones)[1]
  PercentConcordance = (sum(conc)/Pairs) * 100
  PercentDiscordance = (sum(disc)/Pairs) * 100
  PercentTied = (sum(ties)/Pairs) * 100
  return(list("Percent Concordance" = PercentConcordance, "Percent Discordance" = PercentDiscordance, "Percent Tied" = PercentTied, "Pairs" = Pairs))
}

OptimisedConc(fit_train)

In our model, we obtain a concordance of about 91 percent which, per the guideline above, suggests the model may be over-fitted.

• ROC Curve and AUC:

For the train data:

# Calculating the ROC curve for the model
library(ROCR)
# score the train data set
train$score <- predict(fit_train, type='response', train)
pred_train <- prediction(train$score, train$def)
perf_train <- performance(pred_train, "tpr", "fpr")
plot(perf_train)
# calculating AUC
auc_train <- performance(pred_train, "auc")
auc_train <- unlist(slot(auc_train, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc <- min(round(auc_train, digits = 2))
maxauc <- max(round(auc_train, digits = 2))
minauct <- paste(c("min(AUC) = "), minauc, sep="")
maxauct <- paste(c("max(AUC) = "), maxauc, sep="")
legend(0.3, 0.6, c(minauct, maxauct, "n"), border="white", cex=1.7, box.col = "white")

So for the train data, we obtain an AUC of 0.91.

For the test data:

# score the test data set
test$score <- predict(fit_train, type='response', test)
pred_test <- prediction(test$score, test$def)
perf_test <- performance(pred_test, "tpr", "fpr")
plot(perf_test)

# calculating AUC
auc_test <- performance(pred_test, "auc")
# converting the S4 class slot to a vector
auc_test <- unlist(slot(auc_test, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc <- min(round(auc_test, digits = 2))
maxauc <- max(round(auc_test, digits = 2))
minauct <- paste(c("min(AUC) = "), minauc, sep="")
maxauct <- paste(c("max(AUC) = "), maxauc, sep="")
legend(0.3, 0.6, c(minauct, maxauct, "n"), border="white", cex=1.7, box.col = "white")

For the test data, we obtain an AUC value of 0.84.

• KS statistic (for train data):

# Calculating the KS statistic for the train data
max(attr(perf_train,'y.values')[[1]] - attr(perf_train,'x.values')[[1]])

The KS statistic for the train data is 0.706.

• KS statistic (for test data):

# Calculating the KS statistic for the test data
max(attr(perf_test,'y.values')[[1]] - attr(perf_test,'x.values')[[1]])

4.11 Calculating top 3 variables affecting credit score function

# Calculating the top 3 variables affecting the credit score function
g <- predict(fit_train, type='terms', test)

# function to pick the top 3 reasons
# works by sorting the coefficient terms of the equation and selecting the top 3 for each loan scored
ftopk <- function(x, top=3){
  res = names(x)[order(x, decreasing = TRUE)][1:top]
  paste(res, collapse=";", sep="")
}
# apply the function row-wise to pick the top 3 terms per scored loan
topk = apply(g, 1, ftopk, top=3)
# add the reason list to the scored test sample
test <- cbind(test, topk)
summary(test)
4.12 Reject Inference

The term reject inference describes the issue of how to deal with the inherent bias that arises when modeling is based on a training data set consisting only of those previous applicants for whom the actual performance (Good Credit vs. Bad Credit) has been observed; there are typically a significant number of other previous applicants who were rejected and for whom the final credit performance was never observed. The question is how to include those previous applicants in the modeling, in order to make the predictive model more accurate, robust and less biased, and applicable to those individuals as well.

This is of particular importance when the criteria for deciding whether or not to extend credit need to be loosened, in order to attract and extend credit to more applicants. This can, for example, happen during a severe economic downturn that affects many people and places their overall financial well-being into a condition that would not qualify them as acceptable credit risks under older criteria. In short, if nobody were to qualify for credit any more, the institutions extending credit would be out of business. So it is often critically important to make predictions about observations with specific predictor values that were essentially outside the range of what would previously have been considered, and which consequently are unavailable and unobserved in the training data where the actual outcomes are recorded.

There are a number of approaches that have been suggested for including previously rejected applicants in the model building step, in order to make the model more broadly applicable (to those applicants as well). In short, these methods come down to systematically extrapolating from the actual observed data, often by deliberately introducing biases and assumptions about the expected loan outcome had the (in actuality not observed) applicant been accepted for credit.
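Reject inference was not implemented in this project (see Way Forward), but a minimal sketch of one common extrapolation approach, simple augmentation, is shown below purely for illustration. It assumes a hypothetical data frame named rejects holding the rejected applicants with the same predictor columns as fin_data; no such data set exists in this project, and the 0.5 cut-off is an arbitrary assumption.

# Minimal sketch of simple augmentation (illustrative only; `rejects` is hypothetical)
rejects$p_def <- predict(fit_train, newdata = rejects, type = "response")
# assign an inferred performance to each rejected applicant (assumed cut-off)
rejects$def <- ifelse(rejects$p_def > 0.5, 1, 0)
# refit the model on the combined accepted + inferred-reject population
combined <- rbind(train[, names(fin_data)], rejects[, names(fin_data)])
fit_aug <- glm(def ~ ., data = combined, family = "binomial")
summary(fit_aug)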
4.13 Model Performance

A specific performance window would be taken into consideration to assess the accuracy and predictive power of the model over time.

5 Decision Tree Approach

This is one of the classic data mining techniques available for decision making, as the tree structure gives a better understanding of the data and of the important variables, which in turn can help in minimizing the potential risk involved. However, a decision tree produces piecewise-constant probability estimates and class rules rather than a smooth, additive score, which is why Logistic Regression is generally preferred for scorecard building in industry.

# Decision Tree Implementation
# load the tree package
library(rpart)
# build the model using 90%/10% priors
# with a smaller complexity parameter to allow more complex trees
model_dt <- rpart(def~., data=train, parms=list(prior=c(.9,.1)), cp=.0002)
plot(model_dt)
text(model_dt)
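The complexity parameter above (cp = .0002) was chosen to allow a deep tree. A common follow-up, sketched below as an illustration rather than a step performed in the project, is to inspect the cross-validated error reported by rpart and prune back to the sub-tree with the lowest xerror.

# Illustrative pruning of the deep tree using rpart's built-in cross-validation results
printcp(model_dt)    # cp table with cross-validated error
plotcp(model_dt)     # visual guide for choosing cp
best_cp <- model_dt$cptable[which.min(model_dt$cptable[, "xerror"]), "CP"]
model_pr <- prune(model_dt, cp = best_cp)
plot(model_pr); text(model_pr)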
(The decision tree plot produced by plot(model_dt) and text(model_dt) appeared here.)
6 Tools and Techniques

The following tools and techniques have been used:

• R Studio
• Predictive modeling, logistic regression and data mining techniques

7 Way Forward

Reject inference has not been implemented, so there is scope to implement it. Data mining techniques such as Random Forest and Neural Networks can also be implemented, depending on the scope of the project; we have only implemented a Decision Tree (apart from an exploratory Random Forest at the end of the project code).

8 Recommendations and Applications

Typical application areas in the consumer market include: credit cards, auto loans, home mortgages, home equity loans, mail catalog orders, and a wide variety of personal loan products.

9 References and Bibliography

The following documents have been referred to for this project:

• Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring – Naeem Siddiqi
• Sharma – Credit Scoring
• Machine Learning with R – Brett Lantz

10 Project Code

Below is the complete R code that has been used for the project:

rm(list=ls())
data <- read.csv(file.choose(), header=T, stringsAsFactors=FALSE)
str(data)
# Replacing NA values
new_data <- as.data.frame(data, stringsAsFactors=FALSE)
new_data$mths_since_last_delinq[is.na(new_data$mths_since_last_delinq)] <- 9999999
new_data$mths_since_last_record[is.na(new_data$mths_since_last_record)] <- 9999999
new_data[is.na(new_data)] <- 0
str(new_data)
attach(new_data)

# Introduction of dummy variables
# For loan status
def <- ifelse(loan_status=="Default", 1, 0)
# purpose dummies
purpose_debt <- ifelse(purpose=='debt_consolidation', 1, 0)
purpose_car <- ifelse(purpose=='car', 1, 0)
purpose_credit <- ifelse(purpose=="credit_card", 1, 0)
purpose_home_imp <- ifelse(purpose=="home_improvement", 1, 0)
purpose_maj_purchase <- ifelse(purpose=="major_purchase", 1, 0)
purpose_ren_energy <- ifelse(purpose=="renewable_energy", 1, 0)
purpose_education <- ifelse(purpose=="educational", 1, 0)
purpose_house <- ifelse(purpose=="house", 1, 0)
purpose_medical <- ifelse(purpose=="medical", 1, 0)
purpose_moving <- ifelse(purpose=="moving", 1, 0)
purpose_small_business <- ifelse(purpose=="small_business", 1, 0)
purpose_vacation <- ifelse(purpose=="vacation", 1, 0)
purpose_wedding <- ifelse(purpose=="wedding", 1, 0)
# home ownership dummies
home_ownership_mort <- ifelse(home_ownership=="MORTGAGE", 1, 0)
home_ownership_own <- ifelse(home_ownership=="OWN", 1, 0)
home_ownership_rent <- ifelse(home_ownership=="RENT", 1, 0)
home_ownership_other <- ifelse(home_ownership=="OTHER", 1, 0)
# grade dummies
grade_a <- ifelse(grade=="A", 1, 0)
grade_b <- ifelse(grade=="B", 1, 0)
grade_c <- ifelse(grade=="C", 1, 0)
grade_d <- ifelse(grade=="D", 1, 0)
grade_e <- ifelse(grade=="E", 1, 0)
grade_f <- ifelse(grade=="F", 1, 0)

# Fine classing
gloan_amnt <- cut(new_data$loan_amnt, br=c(0, 5000, 15000, 35000), labels=c("Low","Medium","High"))
gloan_amnt_data <- data.frame(new_data, gloan_amnt)
summary(gloan_amnt_data$gloan_amnt)
gloan_amnt_data_final <- gloan_amnt_data[complete.cases(gloan_amnt_data),]
summary(gloan_amnt_data_final$gloan_amnt)

glast_fico_range_high <- cut(new_data$last_fico_range_high, br=c(0, 649, 699, 744, 850), labels=c("Low","Medium","High","Very High"))
last_fico_high_data <- data.frame(new_data, glast_fico_range_high)
summary(last_fico_high_data$glast_fico_range_high)
last_fico_high_data_final <- last_fico_high_data[complete.cases(last_fico_high_data),]
summary(last_fico_high_data_final$glast_fico_range_high)

glast_fico_range_low <- cut(new_data$last_fico_range_low, br=c(0, 645, 695, 740, 845), labels=c("Low","Medium","High","Very High"))
last_fico_low_data <- data.frame(new_data, glast_fico_range_low)
summary(last_fico_low_data$glast_fico_range_low)
last_fico_low_data_final <- last_fico_low_data[complete.cases(last_fico_low_data),]
summary(last_fico_low_data_final$glast_fico_range_low)

gfico_range_high <- cut(new_data$fico_range_high, br=c(0, 650, 700, 745, 850), labels=c("Low","Medium","High","Very High"))
fico_high_data <- data.frame(new_data, gfico_range_high)
summary(fico_high_data$gfico_range_high)
fico_high_data_final <- fico_high_data[complete.cases(fico_high_data),]
summary(fico_high_data_final$gfico_range_high)

gfico_range_low <- cut(new_data$fico_range_low, br=c(0, 650, 700, 745, 850), labels=c("Low","Medium","High","Very High"))
fico_low_data <- data.frame(new_data, gfico_range_low)
summary(fico_low_data$gfico_range_low)
fico_low_data_final <- fico_low_data[complete.cases(fico_low_data),]
summary(fico_low_data_final$gfico_range_low)

gdti <- cut(new_data$dti, br=c(0, 8, 13, 19, 30), labels=c("Low","Medium","High","Very High"))
dti_data <- data.frame(new_data, gdti)
summary(dti_data$gdti)
dti_data_final <- dti_data[complete.cases(dti_data),]
summary(dti_data_final$gdti)

gmths_since_last_delinq <- cut(new_data$mths_since_last_delinq, br=c(0, 48, 6466000, 10000000), labels=c("Low","Medium","High"))
mths_since_last_delinq_data <- data.frame(new_data, gmths_since_last_delinq)
summary(mths_since_last_delinq_data$gmths_since_last_delinq)
mths_since_last_delinq_data_final <- mths_since_last_delinq_data[complete.cases(mths_since_last_delinq_data),]
summary(mths_since_last_delinq_data_final$gmths_since_last_delinq)

gmths_since_last_record <- cut(new_data$mths_since_last_record, br=c(0, 9299000, 10000000), labels=c("Low","High"))
mths_since_last_record_data <- data.frame(new_data, gmths_since_last_record)
summary(mths_since_last_record_data$gmths_since_last_record)
mths_since_last_record_data_final <- mths_since_last_record_data[complete.cases(mths_since_last_record_data),]
summary(mths_since_last_record_data_final$gmths_since_last_record)

gannual_inc <- cut(new_data$annual_inc, br=c(4000, 40500, 59000, 82340, 6000000), labels=c("Low","Medium","High","Very High"))
annual_inc_data <- data.frame(new_data, gannual_inc)
summary(annual_inc_data$gannual_inc)
annual_inc_data_final <- annual_inc_data[complete.cases(annual_inc_data),]
summary(annual_inc_data_final$gannual_inc)

glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392, 1816, 36120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <- last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)

gtotal_pymnt <- cut(new_data$total_pymnt, br=c(0, 4707, 8089, 13090, 49490), labels=c("Low","Medium","High","Very High"))
total_pymnt_data <- data.frame(new_data, gtotal_pymnt)
summary(total_pymnt_data$gtotal_pymnt)
total_pymnt_data_final <- total_pymnt_data[complete.cases(total_pymnt_data),]
summary(total_pymnt_data_final$gtotal_pymnt)

gtotal_pymnt_inv <- cut(new_data$total_pymnt_inv, br=c(0, 4312, 7590, 12390, 49100), labels=c("Low","Medium","High","Very High"))
total_pymnt_inv_data <- data.frame(new_data, gtotal_pymnt_inv)
summary(total_pymnt_inv_data$gtotal_pymnt_inv)
total_pymnt_inv_data_final <- total_pymnt_inv_data[complete.cases(total_pymnt_inv_data),]
summary(total_pymnt_inv_data_final$gtotal_pymnt_inv)
dim(total_pymnt_inv_data_final)

# last_pymnt_amnt re-binned with a tighter upper break
glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392, 1816, 16120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <- last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)
dim(last_pymnt_amnt_data_final)

gtotal_rec_int <- cut(new_data$total_rec_int, br=c(0, 640, 1290, 2618, 15300), labels=c("Low","Medium","High","Very High"))
total_rec_int_data <- data.frame(new_data, gtotal_rec_int)
summary(total_rec_int_data$gtotal_rec_int)
total_rec_int_data_final <- total_rec_int_data[complete.cases(total_rec_int_data),]
summary(total_rec_int_data_final$gtotal_rec_int)
dim(total_rec_int_data_final)

# define global grouped data frames for the categorical dummies
purpose <- data.frame(purpose_debt, purpose_car, purpose_credit, purpose_home_imp, purpose_maj_purchase, purpose_ren_energy, purpose_education, purpose_house, purpose_medical, purpose_moving, purpose_small_business, purpose_vacation, purpose_wedding)
home_ownership <- data.frame(home_ownership_mort, home_ownership_own, home_ownership_rent, home_ownership_other)
grade <- data.frame(grade_a, grade_b, grade_c, grade_d, grade_e, grade_f)

final_data <- data.frame(def, annual_inc_data$gannual_inc, last_fico_high_data$glast_fico_range_high, last_fico_low_data$glast_fico_range_low, fico_low_data$gfico_range_low, fico_high_data$gfico_range_high, mths_since_last_delinq_data$gmths_since_last_delinq, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int, purpose, grade, home_ownership)
fit1 <- glm(def~., data=final_data, family="binomial")
summary(fit1)

final_data1 <- data.frame(def, annual_inc_data$gannual_inc, last_fico_high_data$glast_fico_range_high, mths_since_last_delinq_data$gmths_since_last_delinq, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)
fit2 <- glm(def~., data=final_data1, family="binomial")
summary(fit2)

# IV calculation
# iv.mult() and iv.plot.summary() are provided by an external IV/WoE helper package (assumed loaded)
iv.mult(final_data1,"def",TRUE)
iv_data <- data.frame(def, annual_inc_data$gannual_inc, last_fico_high_data$glast_fico_range_high, mths_since_last_delinq_data$gmths_since_last_delinq, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)
iv.plot.summary(iv.mult(final_data1,"def",TRUE))

# taking the IV cut-off as 0.1
iv_data_final <- data.frame(def, last_fico_high_data$glast_fico_range_high, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)
iv.mult(iv_data_final,"def",TRUE)
iv.plot.summary(iv.mult(iv_data_final,"def",TRUE))

fin_data <- data.frame(def, last_fico_range_high, last_pymnt_amnt, total_pymnt, total_rec_int)
fit <- glm(def~., data=fin_data, family="binomial")
summary(fit)
library(car)   # for vif()
vif(fit)

# Dividing data into train and test
totalrecords <- nrow(fin_data)
trainfraction <- 0.7
trainrecords <- as.integer(totalrecords * trainfraction)
allrows <- 1:totalrecords
trainrows <- sample(totalrecords, trainrecords)
testrows <- allrows[-trainrows]
train <- data.frame(fin_data[trainrows,])
test <- data.frame(fin_data[testrows,])
dim(train)
dim(test)

fit_train <- glm(def~., data=train, family="binomial")
summary(fit_train)

# multicollinearity
vif(fit_train)

# Calculating ROC curve for the model
library(ROCR)
# score the train data set
train$score <- predict(fit_train, type='response', train)
pred_train <- prediction(train$score, train$def)
perf_train <- performance(pred_train, "tpr", "fpr")
plot(perf_train)

# calculating AUC
auc_train <- performance(pred_train, "auc")
auc_train <- unlist(slot(auc_train, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc <- min(round(auc_train, digits = 2))
maxauc <- max(round(auc_train, digits = 2))
minauct <- paste(c("min(AUC) = "), minauc, sep="")
maxauct <- paste(c("max(AUC) = "), maxauc, sep="")
legend(0.3, 0.6, c(minauct, maxauct, "n"), border="white", cex=1.7, box.col = "white")

# Calculating KS statistic for train data
max(attr(perf_train,'y.values')[[1]] - attr(perf_train,'x.values')[[1]])

# Concordance
OptimisedConc = function(model) {
  Data = cbind(model$y, model$fitted.values)
  ones = Data[Data[,1] == 1,]
  zeros = Data[Data[,1] == 0,]
  conc = matrix(0, dim(zeros)[1], dim(ones)[1])
  disc = matrix(0, dim(zeros)[1], dim(ones)[1])
  ties = matrix(0, dim(zeros)[1], dim(ones)[1])
  for (j in 1:dim(zeros)[1]) {
    for (i in 1:dim(ones)[1]) {
      if (ones[i,2] > zeros[j,2]) {conc[j,i] = 1}
      else if (ones[i,2] < zeros[j,2]) {disc[j,i] = 1}
      else if (ones[i,2] == zeros[j,2]) {ties[j,i] = 1}
    }
  }
  Pairs = dim(zeros)[1] * dim(ones)[1]
  PercentConcordance = (sum(conc)/Pairs) * 100
  PercentDiscordance = (sum(disc)/Pairs) * 100
  PercentTied = (sum(ties)/Pairs) * 100
  return(list("Percent Concordance" = PercentConcordance, "Percent Discordance" = PercentDiscordance, "Percent Tied" = PercentTied, "Pairs" = Pairs))
}

# concordance of train data
OptimisedConc(fit_train)

# score the test data set
test$score <- predict(fit_train, type='response', test)
pred_test <- prediction(test$score, test$def)
perf_test <- performance(pred_test, "tpr", "fpr")
plot(perf_test)

# calculating AUC
auc_test <- performance(pred_test, "auc")
auc_test <- unlist(slot(auc_test, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc <- min(round(auc_test, digits = 2))
maxauc <- max(round(auc_test, digits = 2))
minauct <- paste(c("min(AUC) = "), minauc, sep="")
maxauct <- paste(c("max(AUC) = "), maxauc, sep="")
legend(0.3, 0.6, c(minauct, maxauct, "n"), border="white", cex=1.7, box.col = "white")

# Calculating KS statistic for test data
max(attr(perf_test,'y.values')[[1]] - attr(perf_test,'x.values')[[1]])

# Probability prediction
prob_glm <- predict.glm(fit_train, type="response", se.fit=FALSE)
write.csv(prob_glm, "probability.csv")
plot(prob_glm)

# Calculating top 3 variables affecting the credit score function
g <- predict(fit_train, type='terms', test)

# function to pick the top 3 reasons
# works by sorting the coefficient terms of the equation and selecting the top 3 for each loan scored
ftopk <- function(x, top=3){
  res = names(x)[order(x, decreasing = TRUE)][1:top]
  paste(res, collapse=";", sep="")
}
# apply the function row-wise to pick the top 3 terms per scored loan
topk = apply(g, 1, ftopk, top=3)
# add the reason list to the scored test sample
test <- cbind(test, topk)
summary(test)

# Decision Tree Implementation
# load the tree package
library(rpart)
# build the model using 90%/10% priors
# with a smaller complexity parameter to allow more complex trees
model_dt <- rpart(def~., data=train, parms=list(prior=c(.9,.1)), cp=.0002)
plot(model_dt)
text(model_dt)

# score the test data with the tree
test$tscore1 <- predict(model_dt, type='prob', test)
pred5 <- prediction(test$tscore1[,2], test$def)
perf5 <- performance(pred5, "tpr", "fpr")
# Random Forest (exploratory; the original snippet was incomplete and is lightly repaired here)
library(randomForest)

# def is converted to a factor so that randomForest fits a classification model;
# columns added after scoring (e.g. train$score) are excluded from the predictors
rf_data <- train[, c("def", "last_fico_range_high", "last_pymnt_amnt", "total_pymnt", "total_rec_int")]
rf_data$def <- as.factor(rf_data$def)
arf <- randomForest(def~., data=rf_data, importance=TRUE, proximity=TRUE, ntree=500, keep.forest=TRUE)

# plot variable importance
varImpPlot(arf)

# class-probability predictions from the decision tree and the random forest
train$p <- predict(model_dt, train, type="prob")
summary(train$p)
test$p_rf <- predict(arf, test, type="prob")
summary(test$p_rf)

----------------------- End Of Report --------------------------