Acquisition Credit Scoring
Model
Project Final Report
April 2015
Great Lakes Institute of Management, Gurgaon
Subhasis Mishra
– Research Supervisor –
Mr. Manu Chandra (FNMathlogic)
Table of Contents
1 Introduction
2 Scope and Objectives
3 Data Sources
4 Analytical Approach
4.1 Data Collection
4.2 Data Preparation
4.3 Variable Reduction
4.4 Data Sampling
4.5 Model Development
4.6 Intercept not significant in the development sample
4.7 Assessment of sign of variables
4.8 Multicollinearity Test
4.9 Probability Prediction
4.10 Model Prediction (Goodness-of-fit)
4.11 Calculating top 3 variables affecting credit score function
4.12 Reject Inference
4.13 Model Performance
5 Decision Tree Approach
6 Tools and Techniques
7 Way Forward
8 Recommendations and Applications
9 References and Bibliography
10 Project Code
1 Introduction
Credit scoring is one of the classic applications of predictive modeling: predicting whether
credit extended to an applicant is likely to result in profit or loss for the lending
institution. There are many variations and complexities regarding
how exactly credit is extended to individuals, businesses, and other organizations for
various purposes (purchasing equipment, real estate, consumer items, and so on), and
using various methods of credit (credit card, loan, delayed payment plan). But in all
cases, a lender provides money to an individual or institution, and expects to be paid
back in time with interest commensurate with the risk of default.
Credit scoring is the set of decision models and their underlying techniques that aid
lenders in the granting of consumer credit. These techniques determine who will get
credit, how much credit they should get, and what operational strategies will enhance
the profitability of the borrowers to the lenders. Further, they help to assess the risk in
lending. Credit scoring is a dependable assessment of a person’s credit worthiness since
it is based on actual data.
A lender commonly makes two types of decisions: first, whether to grant credit to a
new applicant, and second, how to deal with existing applicants, including whether to
increase their credit limits. In both cases, whatever the techniques used, it is critical
that there is a large sample of previous customers with their application details,
behavioral patterns, and subsequent credit history available. Most of the techniques
use this sample to identify the connection between the characteristics of the
consumers (annual income, age, number of years in employment with their current
employer, etc.) and their subsequent history.
Typical application areas in the consumer market include: credit cards, auto loans,
home mortgages, home equity loans, mail catalog orders, and a wide variety of
personal loan products.
2 Scope and Objectives
To evaluate the scope of applying Logistic Regression and various data mining
techniques to credit scoring, facilitating better decision making and reducing risk.
The objective of this project is to build a credit scoring model that reduces the potential
risk involved and on which credit lending decisions can be based.
3 Data Sources
Applicant data involving US customers that has been used in this project was provided
by our mentor.
4 Analytical Approach
The steps followed for this project are described below:
4.1 Data Collection
Customer credit history, along with all the related information, was provided by our mentor.
The data provided is cross-sectional and belongs to a single point in time.
4.2 Data Preparation
Listed below are some of the techniques used for data preparation :-
1. Initial variable selection based on judgement –
First, we identified some key variables out of all the variables given in our data
set, purely based on judgement. Initial screening of variables is important
and requires a deeper understanding of the particular domain as well as
experience. So, out of the 46 variables, we took around 20 into
consideration for modelling. Below is the list of initial variables :-
Loan_amnt
Term
Annual_inc
Fico_range_high
Fico_range_low
Last_fico_range_high
Last_fico_range_low
Purpose
Home_ownership
Grade
Dti
Mths_since_last_delinq
Mths_since_last_record
Last_pymnt_amnt
Total_pymnt
Total_pymnt_inv
Pub_rec
Total_rec_int
Revol_bal
2. Missing Value Treatment -
Normally, two techniques are used for missing value treatment: either removing the
entire variable (if more than 80-90% of its observations are missing), or imputing a very
high sentinel value such as 9999999 (if the variable is significant from a modeling
perspective and hence cannot be dropped). In our case we went ahead with the latter, as
variables like mths_since_last_delinq and mths_since_last_record carry a lot of
importance for modeling, even though more than 90% of their observations were missing
in the original data set. For the other, non-significant variables, we simply imputed zero.
Below is the piece of R code showing what we did towards missing value treatment –
new_data <- as.data.frame(data, stringsAsFactors=FALSE)
new_data$mths_since_last_delinq[is.na(new_data$mths_since_last_delinq)] <- 9999999
new_data$mths_since_last_record[is.na(new_data$mths_since_last_record)] <- 9999999
new_data[is.na(new_data)] <- 0
str(new_data)
3. Identifying the dependent variable –
In any business problem, it is critical to first understand the problem statement and then
approach it with an appropriate solution. Hence, in our case, the first priority was to
identify the dependent variable correctly; getting this wrong would have sent the
modelling in a completely wrong direction. Our problem is concerned with finding
whether a person is going to default or not, by calculating the probability of default
among the consumers.
From our data set, we picked LOAN_STATUS as the dependent variable, which contains
values like Current, Default, Charged Off, etc.
4. Data Transformation –
Data Transformation is one of the most significant steps before doing the actual
modeling and carries a lot of importance in making the final model robust. It includes
various steps like the introduction of dummy variables and the conversion of continuous
variables into categorical variables.
• Introduction of dummy variables –
We introduced dummy variables for the character variables. As part of the dummy
variable inclusion, we did the following :-
# Introduction of Dummy variables
# For Loan status
def <- ifelse(loan_status=="Default", 1, 0)
# purpose dummies
purpose_debt <- ifelse(purpose=='debt_consolidation', 1, 0)
purpose_car <- ifelse(purpose=='car', 1, 0)
purpose_credit <- ifelse(purpose=="credit_card", 1, 0)
purpose_home_imp <- ifelse(purpose=="home_improvement", 1, 0)
purpose_maj_purchase <- ifelse(purpose=="major_purchase", 1, 0)
purpose_ren_energy <- ifelse(purpose=="renewable_energy", 1, 0)
purpose_education <- ifelse(purpose=="educational", 1, 0)
purpose_house <- ifelse(purpose=="house", 1, 0)
purpose_medical <- ifelse(purpose=="medical", 1, 0)
purpose_moving <- ifelse(purpose=="moving", 1, 0)
purpose_small_business <- ifelse(purpose=="small_business", 1,
0)
purpose_vacation <- ifelse(purpose=="vacation", 1, 0)
purpose_wedding <- ifelse(purpose=="wedding", 1, 0)
# home ownership dummies
home_ownership_mort <- ifelse(home_ownership=="MORTGAGE", 1, 0)
home_ownership_own <- ifelse(home_ownership=="OWN", 1, 0)
home_ownership_rent <- ifelse(home_ownership=="RENT", 1, 0)
home_ownership_other <- ifelse(home_ownership=="OTHER", 1, 0)
# Grade
grade_a <- ifelse(grade=="A",1 ,0)
grade_b <- ifelse(grade=="B",1 ,0)
grade_c <- ifelse(grade=="C",1 ,0)
grade_d <- ifelse(grade=="D",1 ,0)
grade_e <- ifelse(grade=="E",1 ,0)
grade_f <- ifelse(grade=="F",1 ,0)
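For comparison, the same encodings could also be generated automatically from factor columns. Below is a minimal sketch, assuming purpose, home_ownership and grade are character columns of new_data; this is not the approach used in the project, only an alternative:
# Hypothetical alternative: let R build the dummy columns from factors
# (shown only for comparison; not the approach used in this project)
dummies <- model.matrix(~ factor(purpose) + factor(home_ownership) + factor(grade),
                        data = new_data)
head(dummies)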
• Conversion of continuous variables to categorical variables (fine classing) –
As a better modelling practice, it is recommended that continuous variables be
converted into categorical variables by introducing bins, which is also called fine
classing. It generally yields better results during the Information Value (IV) calculation
for the predictor variables.
We followed the approach below for fine classing :-
gloan_amnt <- cut(new_data$loan_amnt, br=c(0, 5000, 15000, 35000),
labels=c("Low","Medium","High"))
gloan_amnt_data <- data.frame(new_data, gloan_amnt)
summary(gloan_amnt_data$gloan_amnt)
gloan_amnt_data_final<-
gloan_amnt_data[complete.cases(gloan_amnt_data),]
summary(gloan_amnt_data_final$gloan_amnt)
glast_fico_range_high <- cut(new_data$last_fico_range_high, br=c(0,
649, 699, 744, 850), labels=c("Low","Medium","High","Very High"))
last_fico_high_data <- data.frame(new_data, glast_fico_range_high)
summary(last_fico_high_data$glast_fico_range_high)
last_fico_high_data_final<-
last_fico_high_data[complete.cases(last_fico_high_data),]
summary(last_fico_high_data_final$glast_fico_range_high)
glast_fico_range_low <- cut(new_data$last_fico_range_low, br=c(0,
645, 695,740, 845), labels=c("Low","Medium","High","Very High"))
last_fico_low_data <- data.frame(new_data, glast_fico_range_low)
summary(last_fico_low_data$glast_fico_range_low)
last_fico_low_data_final<-
last_fico_low_data[complete.cases(last_fico_low_data),]
summary(last_fico_low_data_final$glast_fico_range_low)
gfico_range_high <- cut(new_data$fico_range_high, br=c(0, 650, 700,
745, 850), labels=c("Low","Medium","High","Very High"))
fico_high_data <- data.frame(new_data, gfico_range_high)
summary(fico_high_data$gfico_range_high)
fico_high_data_final<-
fico_high_data[complete.cases(fico_high_data),]
summary(fico_high_data_final$gfico_range_high)
gfico_range_low <- cut(new_data$fico_range_low, br=c(0, 650, 700,
745, 850), labels=c("Low","Medium","High","Very High"))
fico_low_data <- data.frame(new_data, gfico_range_low)
summary(fico_low_data$gfico_range_low)
fico_low_data_final <- fico_low_data[complete.cases(fico_low_data),]
summary(fico_low_data_final$gfico_range_low)
gdti <- cut(new_data$dti, br=c(0, 8, 13, 19, 30),
labels=c("Low","Medium","High","Very High"))
dti_data <- data.frame(new_data, gdti)
summary(dti_data$gdti)
dti_data_final <- dti_data[complete.cases(dti_data),]
summary(dti_data_final$gdti)
gmths_since_last_delinq<-cut(new_data$mths_since_last_delinq,
br=c(0, 48, 6466000, 10000000), labels=c("Low","Medium","High"))
mths_since_last_delinq_data<-data.frame(new_data,
gmths_since_last_delinq)
summary(mths_since_last_delinq_data$gmths_since_last_delinq)
mths_since_last_delinq_data_final <-
mths_since_last_delinq_data[complete.cases(mths_since_last_delinq_data),]
summary(mths_since_last_delinq_data_final$gmths_since_last_delinq)
gmths_since_last_record <- cut(new_data$mths_since_last_record,
br=c(0, 9299000, 10000000), labels=c("Low","High"))
mths_since_last_record_data <- data.frame(new_data,
gmths_since_last_record)
summary(mths_since_last_record_data$gmths_since_last_record)
mths_since_last_record_data_final <-
mths_since_last_record_data[complete.cases(mths_since_last_record_data),]
summary(mths_since_last_record_data_final$gmths_since_last_record)
gannual_inc <- cut(new_data$annual_inc, br=c(4000, 40500, 59000,
82340, 6000000), labels=c("Low","Medium","High","Very High"))
annual_inc_data <- data.frame(new_data, gannual_inc)
summary(annual_inc_data$gannual_inc)
annual_inc_data_final<-
annual_inc_data[complete.cases(annual_inc_data),]
summary(annual_inc_data_final$gannual_inc)
glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392,
1816, 36120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final<-
last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)
gtotal_pymnt <- cut(new_data$total_pymnt, br=c(0, 4707, 8089, 13090,
49490), labels=c("Low","Medium","High","Very High"))
total_pymnt_data <- data.frame(new_data, gtotal_pymnt)
summary(total_pymnt_data$gtotal_pymnt)
total_pymnt_data_final<-
total_pymnt_data[complete.cases(total_pymnt_data),]
summary(total_pymnt_data_final$gtotal_pymnt)
gtotal_pymnt_inv <- cut(new_data$total_pymnt_inv, br=c(0, 4312,
7590, 12390, 49100), labels=c("Low","Medium","High","Very High"))
total_pymnt_inv_data <- data.frame(new_data, gtotal_pymnt_inv)
summary(total_pymnt_inv_data$gtotal_pymnt_inv)
total_pymnt_inv_data_final<-
total_pymnt_inv_data[complete.cases(total_pymnt_inv_data),]
summary(total_pymnt_inv_data_final$gtotal_pymnt_inv)
dim(total_pymnt_inv_data_final)
glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392,
1816, 16120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <-
last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)
dim(last_pymnt_amnt_data_final)
gtotal_rec_int <- cut(new_data$total_rec_int, br=c(0, 640, 1290,
2618, 15300), labels=c("Low","Medium","High","Very High"))
total_rec_int_data <- data.frame(new_data, gtotal_rec_int)
summary(total_rec_int_data$gtotal_rec_int)
total_rec_int_data_final<-
total_rec_int_data[complete.cases(total_rec_int_data),]
summary(total_rec_int_data_final$gtotal_rec_int)
dim(total_rec_int_data_final)
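The same cut / data.frame / summary pattern is repeated for every variable above; a small helper function could reduce the repetition. The sketch below is for illustration only and was not used in the project code:
# Hypothetical helper wrapping the repeated fine-classing pattern
fine_class <- function(x, breaks, labels) {
  binned <- cut(x, br = breaks, labels = labels)   # bin the continuous variable
  print(summary(binned))                           # show the bin counts
  binned
}
# e.g. gdti <- fine_class(new_data$dti, c(0, 8, 13, 19, 30),
#                         c("Low", "Medium", "High", "Very High"))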
4.3 Variable Reduction
Once the strongest characteristics are grouped and ranked, variable selection is
done. At the end of this step, the scorecard developer will have a set of strong,
grouped characteristics, preferably representing independent information
types, for use in the regression step. The strength of a characteristic is gauged
using four main criteria:
• Predictive power of each attribute. The weight of evidence (WOE)
measure is used for this purpose.
• The range and trend of weight of evidence across grouped attributes
within a characteristic.
• Predictive power of the characteristic. The Information Value (IV)
measure is used for this purpose.
• Operational and business considerations
In our case, we have used the Information Value measure for variable reduction.
Some analysts run other variable selection algorithms (e.g., those that rank
predictive power using Chi-Square or R-Square) prior to grouping characteristics.
This gives them an indication of characteristic strength using independent means,
and also alerts them in cases where the Information Value figure is high or low
compared to other measures.
The initial characteristic analysis process can be interactive, and involvement from
business users and operations staff should be encouraged. In particular, they
may provide further insights into any unexpected or illogical behavior
patterns and enhance the grouping of all variables.
The first step in performing this analysis is to perform initial grouping of the
variables and rank order them by IV or some other strength measure. This can be
done using a number of binning techniques.
If using other applications, a good way to start is to bin nominal variables into
50 or so equal groups, and to calculate the WOE and IV for the grouped
attributes and characteristics. One can then use any spreadsheet software to
fine-tune the groupings for the stronger characteristics based on principles to
be outlined in the next section. Similarly, for categorical characteristics, the
WOE for each unique attribute and the IV of each characteristic can be
calculated. One can then spend time fine-tuning the grouping for those
characteristics that surpass a minimum acceptable strength. Decision trees are
also often used for grouping variables. Most users, however, use them to
generate initial ideas, and then use alternate software applications to
interactively fine-tune the groupings.
Information Value (IV) :-
Information Value provides a measure of how well a variable X is able to
distinguish between a binary response (e.g., "good" vs "bad") in some target
variable Y. The idea is that if a variable X has a low Information Value, it may not
do a sufficient job of classifying the target variable, and hence is removed as an
explanatory variable.
To see how this works, let X be grouped into n bins. Each x ∈ X corresponds to a
y ∈ Y that may take one of two values, say 0 or 1. Then, for bins X_i, 1 ≤ i ≤ n,
IV = Σ (g_i − b_i) * ln(g_i / b_i), summed over i = 1, ..., n
where b_i = the proportion of 0's in bin i versus all bins,
g_i = the proportion of 1's in bin i versus all bins,
and ln(g_i / b_i) is known as the weight of evidence (for bin X_i). The cut-off value
may vary, but in our case we have considered an IV cut-off of 0.1.
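To make the formula concrete, here is a rough sketch of how the WOE and IV of a single binned variable could be computed. It is for illustration only; the project itself used the iv.mult() helper, assumed to come from an external Information Value package loaded beforehand:
# Illustrative WOE/IV calculation for one binned variable (sketch only)
iv_single <- function(bins, y) {        # bins: factor of groups, y: 0/1 target
  tab <- table(bins, y)
  g   <- tab[, "1"] / sum(tab[, "1"])   # proportion of 1's falling in each bin
  b   <- tab[, "0"] / sum(tab[, "0"])   # proportion of 0's falling in each bin
  woe <- log(g / b)                     # weight of evidence per bin
  sum((g - b) * woe)                    # information value
}
# e.g. iv_single(gloan_amnt, def)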
Below is the R code sample, output and plot of IV calculation :-
# IV calculation
iv.mult(final_data1,"def",TRUE)
iv_data <- data.frame(def, annual_inc_data$gannual_inc,
last_fico_high_data$glast_fico_range_high,
mths_since_last_delinq_data$gmths_since_last_delinq,
total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv,
term, last_pymnt_amnt_data$glast_pymnt_amnt,
total_rec_int_data$gtotal_rec_int)
iv.plot.summary(iv.mult(final_data1,"def",TRUE))
#taking IV cut off as 0.1
iv_data_final <- data.frame(def, last_fico_high_data$glast_fico_range_high,
total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv,
term, last_pymnt_amnt_data$glast_pymnt_amnt,
total_rec_int_data$gtotal_rec_int)
iv.mult(iv_data_final,"def",TRUE)
iv.plot.summary(iv.mult(iv_data_final,"def",TRUE))
4.4 Data Sampling
Sampling has been done based on the 70-30 rule: the entire data set is split in a
70:30 ratio to be used as development and validation data
respectively.
Below is the R code :-
# Dividing data into train and test
totalrecords <- nrow(fin_data)
trainfraction = 0.7
trainrecords = as.integer(totalrecords * trainfraction)
allrows <- 1:totalrecords
trainrows <- sample(totalrecords,trainrecords)
testrows <- allrows[-trainrows]
train<-data.frame(fin_data[trainrows,])
test<-data.frame(fin_data[testrows,])
dim(train)
dim(test)
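Because sample() is random, the 70/30 split changes on every run; a seed can be set beforehand if a reproducible partition is needed (the seed value below is just an example):
# Optional: fix the random seed so the train/test split is reproducible
set.seed(123)                     # 123 is an arbitrary example seed
trainrows <- sample(totalrecords, trainrecords)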
4.5 Model Development
When the training data set on which the modeling is based contains a binary
indicator variable of "Paid back" vs. "Default", or "Good Credit" vs. "Bad Credit",
logistic regression models are well suited for subsequent predictive modeling.
Logistic regression yields prediction probabilities for whether or not a particular
outcome (e.g., Bad Credit) will occur. Furthermore, logistic regression models are
linear models, in that the logit-transformed prediction probability is a linear
function of the predictor variable values. Thus, a scorecard model derived in this
manner has the desirable quality that the final credit score (credit risk) is a linear
function of the predictors and, with some additional transformations applied to the
model parameters, a simple linear function of scores that can be associated with
each predictor class value after coarse coding. The final credit score is then a
simple sum of the individual score values taken from the scorecard.
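For illustration only, the log-odds produced by such a model are usually rescaled to scorecard points with a formula of the form score = offset + factor * ln(odds). Below is a minimal sketch with arbitrary example parameters (600 points at 20:1 odds, 20 points to double the odds); these are not values used in this project:
# Illustrative log-odds-to-points scaling (hypothetical parameters, not part of this project)
pdo       <- 20                          # points to double the odds
scale_fct <- pdo / log(2)
offset    <- 600 - scale_fct * log(20)   # anchor: 600 points at 20:1 good/bad odds
score_from_p <- function(p_default) {
  odds_good <- (1 - p_default) / p_default
  offset + scale_fct * log(odds_good)
}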
In our model, we have used the train data set, which contains 70% of the original
data. Below is the R code for logistic regression :-
fit_train <- glm(def~., data=train, family="binomial")
summary(fit_train)
4.6 Intercept not significant in the development sample
An intercept is almost always part of the model and is almost always significantly
different from zero. The test of the intercept in the procedure output tests whether
this parameter is equal to zero. If the intercept is zero (equivalent to having no
intercept in the model), the resulting model implies that the response function must
be exactly zero when all the predictors are set to zero. For a logistic model it means
that the logit (or log odds) is zero, which implies that the event probability is 0.5. This
is a very strong assumption that is sometimes reasonable, but more often is not. So,
a highly significant intercept in your model is generally not a problem.
By the same token, if the intercept is not significant you usually would not want to
remove it from the model because by doing this you are creating a model that says
that the response function must be zero when the predictors are all zero. If the
nature of what you are modeling is such that you want to assume this, then you
might want to remove the intercept. In our case, we are getting an intercept value of
around 0.58 for the validation sample.
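If one actually wanted to force the response function through zero, the intercept could be dropped from the fit. The sketch below shows how, though this was not done in this project:
# Hypothetical no-intercept fit; only appropriate if the log-odds should be 0
# when all predictors are 0 (not assumed here)
fit_noint <- glm(def ~ . - 1, data = train, family = "binomial")
summary(fit_noint)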
4.7 Assessment of sign of variables
a. Last_fico_range_high- The negative sign of the coefficient signifies that it
exhibits an inverse relationship with default, which is actually true in
this context: the higher the FICO value, the lower the chance of default.
b. Total_pymnt- It also signifies an inverse relationship with default,
which is also true in this context.
c. Last_pymnt_amnt- Negative sign of the coefficient shows that it
exhibits an inverse relationship with default.
d. Total_rec_int- Positive sign of the coefficient signifies that it has a
direct relationship with default.
4.8 Multicollinearity Test
It occurs when there are high correlations among predictor variables, leading to
unreliable and unstable estimates of regression coefficients. After model building, multi
collinearity check is normally performed to ensure that independent variables are not
highly correlated, using the VIF (Variance Inflation Factor) function. Normally the cut-off for
VIF is considered as 5. Variables with VIF more than 5 should be considered to have
collinearity. However, for factor/ categorical variables, GVIF value is considered as the
baseline. Variables which require more than 1 coefficient and thus more than 1 degree
of freedom are typically evaluated using the GVIF. For one-coefficient terms VIF is
equal to GVIF.
There are four options for producing the VIF value :-
• the corvif command from the AED package
• the vif command from the car package
• the vif command from the rms package
• the vif command from the DAAG package
Out of these, "car" and "AED" produce the GVIF value and the other two produce the VIF value.
# Multicollinearity check
library(car)
vif(fit_train)
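Since car::vif() reports GVIF and Df for terms with more than one coefficient, the value usually compared against the cut-off is GVIF^(1/(2*Df)). Below is a sketch of that check, using the rule-of-thumb threshold of 5 stated above:
# Checking GVIF-adjusted values against the usual VIF cut-off of 5 (sketch)
v <- vif(fit_train)
if (is.matrix(v)) {                           # factors present: GVIF / Df columns
  adj <- v[, "GVIF"]^(1 / (2 * v[, "Df"]))    # comparable to sqrt(VIF)
  print(adj[adj > sqrt(5)])                   # terms exceeding the threshold
} else {
  print(v[v > 5])                             # plain VIF for numeric-only models
}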
Four diagnostic plots are generated at a time when the logistic model output is plotted.
4.9 Probability Prediction
From the above logistic output, we have predicted the probabilities for individual customers.
Please find the R code below :-
# Probabilities prediction
prob_glm <- predict.glm(fit_train, type="response", se.fit=FALSE)
write.csv(prob_glm, "probability.csv")
plot(prob_glm)
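To score a hold-out or new applicant pool rather than the development sample, the newdata argument can be used. The sketch below applies an illustrative 0.5 cut-off; no classification cut-off was fixed in this project:
# Scoring the validation sample with an illustrative 0.5 cut-off (sketch)
prob_test  <- predict(fit_train, newdata = test, type = "response")
pred_class <- ifelse(prob_test > 0.5, "Default", "Non-default")
table(pred_class, test$def)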
4.10 Model Prediction (Goodness-of-fit)
Goodness-of-fit attempts to get at how well a model fits the data. It is usually applied
after a final model has been selected. If we have multiple models, then goodness-of-
fit is performed to choose among all the models. Concordance, discordance, ROC
curve and KS-statistics are used for this purpose.
• Concordance :- In OLS regression, the R-squared and its more refined measure,
adjusted R-squared, would be the 'one-stop' metric that immediately tells us
whether the model is a good fit. Since this is a value between 0 and 1, it can easily
be expressed as a percentage and presented as 'model accuracy' to beginners and
not-so-math-oriented business users. Unfortunately, looking at adjusted R-squared
is not appropriate for logistic regression, because we model the log odds ratio and
it becomes very difficult to explain. This is where concordance helps. Concordance
tells us the association between the actual values and the values fitted by the
model, in percentage terms. It is defined as the ratio of the number of 1-0 pairs in
which the 1 received a higher model score than the 0, to the total number of 1-0
pairs possible. A higher value of concordance (60-70%) means a better fitted
model. However, a very large value (85-95%) could also suggest that the model is
over-fitted and needs to be re-aligned to explain the entire population.
We have used a custom R function, OptimisedConc(), to obtain the concordance value.
OptimisedConc=function(model)
{
Data = cbind(model$y, model$fitted.values)
ones = Data[Data[,1] == 1,]
zeros = Data[Data[,1] == 0,]
conc=matrix(0, dim(zeros)[1], dim(ones)[1])
disc=matrix(0, dim(zeros)[1], dim(ones)[1])
ties=matrix(0, dim(zeros)[1], dim(ones)[1])
for (j in 1:dim(zeros)[1])
{
for (i in 1:dim(ones)[1])
{
if (ones[i,2]>zeros[j,2])
{conc[j,i]=1}
else if (ones[i,2]<zeros[j,2])
{disc[j,i]=1}
else if (ones[i,2]==zeros[j,2])
{ties[j,i]=1}
}
}
Pairs=dim(zeros)[1]*dim(ones)[1]
PercentConcordance=(sum(conc)/Pairs)*100
PercentDiscordance=(sum(disc)/Pairs)*100
PercentTied=(sum(ties)/Pairs)*100
return(list("Percent Concordance"=PercentConcordance,"Percent
Discordance"=PercentDiscordance,"Percent
Tied"=PercentTied,"Pairs"=Pairs))
}
OptimisedConc(fit_train)
In our model, we are getting 91 percent concordance, which may indicate that the model
is over-fitted.
• Lorenz Curve and AUC :-
For train data :
-----------------
#Calculating ROC curve for model
library(ROCR)
#score train data set
train$score<-predict(fit_train,type='response',train)
pred_train<-prediction(train$score,train$def)
perf_train <- performance(pred_train,"tpr","fpr")
plot(perf_train)
# calculating AUC
auc_train <- performance(pred_train,"auc")
auc_train <- unlist(slot(auc_train, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc<-min(round(auc_train, digits = 2))
maxauc<-max(round(auc_train, digits = 2))
minauct <- paste(c("min(AUC) = "),minauc,sep="")
maxauct <- paste(c("max(AUC) = "),maxauc,sep="")
legend(0.3,0.6,c(minauct,maxauct,"n"),border="white",cex=1.7,
box.col = "white")
So for the train data, we are getting an AUC of 0.91.
For Test Data :
--------------------
#score test data set
test$score<-predict(fit_train,type='response',test)
pred_test<-prediction(test$score,test$def)
perf_test <- performance(pred_test,"tpr","fpr")
plot(perf_test)
# calculating AUC
auc_test <- performance(pred_test,"auc")
# now converting S4 class to vector
auc_test <- unlist(slot(auc_test, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc<-min(round(auc_test, digits = 2))
maxauc<-max(round(auc_test, digits = 2))
minauct <- paste(c("min(AUC) = "),minauc,sep="")
maxauct <- paste(c("max(AUC) = "),maxauc,sep="")
legend(0.3,0.6,c(minauct,maxauct,"n"),border="white",cex=1.7,
box.col = "white")
For the test data, we are getting an AUC value of 0.84.
• KS Statistics (for train data) :-
# Calculating KS statistic for train data
max(attr(perf_train,'y.values')[[1]]-
attr(perf_train,'x.values')[[1]])
We are getting a KS statistic value of 0.706 for the train data.
• KS Statistics (for test data) :-
# Calculating KS statistic
max(attr(perf_test,'y.values')[[1]]-
attr(perf_test,'x.values')[[1]])
4.11 Calculating top 3 variables affecting credit score function
#Calculating top 3 variables affecting Credit Score Function
g<-predict(fit_train,type='terms',test)
#function to pick top 3 reasons
#works by sorting coefficient terms in equation and selecting
#top 3 in sort for each loan scored
ftopk<- function(x,top=3){
res=names(x)[order(x, decreasing = TRUE)][1:top]
paste(res,collapse=";",sep="")
}
# Apply the function to each scored row, keeping the top 3 terms
topk=apply(g,1,ftopk,top=3)
#add reason list to scored test sample
test<-cbind(test, topk)
summary(test)
4.12 Reject Inference
The term Reject Inference describes the issue of how to deal with the inherent
bias when modeling is based on a training dataset consisting only of those
previous applicants for whom the actual performance (Good Credit vs. Bad
Credit) has been observed; however, there is likely another significant number of
previous applicants who were rejected and for whom the final "credit performance"
was never observed. The question is how to include those previous applicants in
the modeling, in order to make the predictive model more accurate and robust
(and less biased), and applicable to those individuals as well.
This is of particular importance when the criteria for the decision whether or
not to extend credit need to be loosened, in order to attract and extend credit
to more applicants. This can for example happen during a severe economic
downturn, affecting many people and placing their overall financial well being
into a condition that would not qualify them as acceptable credit risk using
older criteria. In short, if nobody were to qualify for credit any more, then the
institutions extending credit would be out of business. So it is often critically
important to make predictions about observations with specific predictor
values that were essentially outside the range of what would previously have
been considered, and which consequently are unavailable and have not been
observed in the training data where the actual outcomes are recorded.
There are a number of approaches that have been suggested on how to include
previously rejected applicants for credit in the model building step, in order to
make the model more broadly applicable (to those applicants as well). In short,
these methods come down to systematically extrapolating from the actual
observed data, often by deliberately introducing biases and assumptions about
the expected loan outcome, had the (in actuality not observed) applicant been
accepted for credit.
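A minimal sketch of one such extrapolation is shown below, in the style of "fuzzy augmentation"; it was not implemented in this project, and rejects is a hypothetical data frame of rejected applications assumed to have the same predictor columns as train:
# Hypothetical reject-inference sketch (fuzzy-augmentation style)
# 'rejects' is an assumed data frame of rejected applicants with the same
# predictor columns as 'train'; it does not exist in this project's data
p_rej    <- predict(fit_train, newdata = rejects, type = "response")
rej_bad  <- transform(rejects, def = 1, w = p_rej)      # weighted "bad" copy
rej_good <- transform(rejects, def = 0, w = 1 - p_rej)  # weighted "good" copy
accepted <- transform(train, w = 1)                     # observed outcomes keep full weight
augmented <- rbind(accepted, rej_bad, rej_good)
fit_aug  <- glm(def ~ . - w, data = augmented, family = "binomial", weights = w)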
4.13 Model Performance
A specific performance window would be taken into consideration to assess the
accuracy of the model's predictive power.
5 Decision Tree Approach
This is one of the classic data mining techniques for better decision making, as the
tree structure gives a better understanding of the data and the important variables,
which in turn could help minimize the potential risk involved. However, individual
customer probabilities cannot be obtained as directly using this method; hence,
logistic regression is generally preferred over it in industry.
# Decision Tree Implementation
#load tree package
library(rpart)
#build model using 90% 10% priors
#with smaller complexity parameter to allow more complex trees
model_dt <-
rpart(def~.,data=train,parms=list(prior=c(.9,.1)),cp=.0002)
plot(model_dt)
text(model_dt)
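With such a small complexity parameter the tree can grow very large; the cp table can be inspected and the tree pruned back. The sketch below shows one way to do this and is not a step performed in the report:
# Optional pruning step: inspect the complexity table and prune (sketch)
printcp(model_dt)                                  # cross-validated error by cp
best_cp <- model_dt$cptable[which.min(model_dt$cptable[, "xerror"]), "CP"]
model_pruned <- prune(model_dt, cp = best_cp)
plot(model_pruned)
text(model_pruned)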
6 Tools and Techniques
Following tools and techniques have been used :-
• R Studio
• Predictive Modeling, Logistic Regression, Data Mining Techniques
7 Way Forward
Reject inference has not been implemented, so there is scope to implement it.
Data mining techniques like Random Forest and Neural Networks can also be
implemented, depending on the scope and ease of the project; we have only
implemented a Decision Tree.
8 Recommendations and Applications
Typical application areas in the consumer market include: credit cards, auto loans, home mortgages,
home equity loans, mail catalog orders, and a wide variety of personal loan products.
9 References and Bibliography
The below documents have been referred to for this project :-
• Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, by Naeem Siddiqi
• Sharma - Credit Scoring
• Machine Learning with R - Brett Lantz
10 Project Code
Below is the complete R code that has been used for the project :-
rm(list=ls())
data <- read.csv(file.choose(), header=T, stringsAsFactors=FALSE)
str(data)
# Replacing NA values
new_data <- as.data.frame(data, stringsAsFactors=FALSE)
new_data$mths_since_last_delinq[is.na(new_data$mths_since_last_delinq)] <- 9999999
new_data$mths_since_last_record[is.na(new_data$mths_since_last_record)] <- 9999999
new_data[is.na(new_data)] <- 0
str(new_data)
attach(new_data)
# Introduction of Dummy variables
# For Loan status
def <- ifelse(loan_status=="Default", 1, 0)
# purpose dummies
purpose_debt <- ifelse(purpose=='debt_consolidation', 1, 0)
purpose_car <- ifelse(purpose=='car', 1, 0)
purpose_credit <- ifelse(purpose=="credit_card", 1, 0)
purpose_home_imp <- ifelse(purpose=="home_improvement", 1, 0)
purpose_maj_purchase <- ifelse(purpose=="major_purchase", 1, 0)
purpose_ren_energy <- ifelse(purpose=="renewable_energy", 1, 0)
purpose_education <- ifelse(purpose=="educational", 1, 0)
purpose_house <- ifelse(purpose=="house", 1, 0)
purpose_medical <- ifelse(purpose=="medical", 1, 0)
purpose_moving <- ifelse(purpose=="moving", 1, 0)
purpose_small_business <- ifelse(purpose=="small_business", 1, 0)
purpose_vacation <- ifelse(purpose=="vacation", 1, 0)
purpose_wedding <- ifelse(purpose=="wedding", 1, 0)
# home ownership dummies
home_ownership_mort <- ifelse(home_ownership=="MORTGAGE", 1, 0)
home_ownership_own <- ifelse(home_ownership=="OWN", 1, 0)
home_ownership_rent <- ifelse(home_ownership=="RENT", 1, 0)
home_ownership_other <- ifelse(home_ownership=="OTHER", 1, 0)
# Grade
grade_a <- ifelse(grade=="A",1 ,0)
grade_b <- ifelse(grade=="B",1 ,0)
grade_c <- ifelse(grade=="C",1 ,0)
grade_d <- ifelse(grade=="D",1 ,0)
grade_e <- ifelse(grade=="E",1 ,0)
grade_f <- ifelse(grade=="F",1 ,0)
# Fine classing
gloan_amnt <- cut(new_data$loan_amnt, br=c(0, 5000, 15000, 35000),
labels=c("Low","Medium","High"))
gloan_amnt_data <- data.frame(new_data, gloan_amnt)
summary(gloan_amnt_data$gloan_amnt)
gloan_amnt_data_final <-
gloan_amnt_data[complete.cases(gloan_amnt_data),]
summary(gloan_amnt_data_final$gloan_amnt)
glast_fico_range_high <- cut(new_data$last_fico_range_high, br=c(0,
649, 699, 744, 850), labels=c("Low","Medium","High","Very High"))
last_fico_high_data <- data.frame(new_data, glast_fico_range_high)
summary(last_fico_high_data$glast_fico_range_high)
last_fico_high_data_final <-
last_fico_high_data[complete.cases(last_fico_high_data),]
summary(last_fico_high_data_final$glast_fico_range_high)
glast_fico_range_low <- cut(new_data$last_fico_range_low, br=c(0,
645, 695,740, 845), labels=c("Low","Medium","High","Very High"))
last_fico_low_data <- data.frame(new_data, glast_fico_range_low)
summary(last_fico_low_data$glast_fico_range_low)
last_fico_low_data_final <-
last_fico_low_data[complete.cases(last_fico_low_data),]
summary(last_fico_low_data_final$glast_fico_range_low)
gfico_range_high <- cut(new_data$fico_range_high, br=c(0, 650, 700,
745, 850), labels=c("Low","Medium","High","Very High"))
fico_high_data <- data.frame(new_data, gfico_range_high)
summary(fico_high_data$gfico_range_high)
fico_high_data_final <-
fico_high_data[complete.cases(fico_high_data),]
summary(fico_high_data_final$gfico_range_high)
gfico_range_low <- cut(new_data$fico_range_low, br=c(0, 650, 700,
745, 850), labels=c("Low","Medium","High","Very High"))
fico_low_data <- data.frame(new_data, gfico_range_low)
summary(fico_low_data$gfico_range_low)
fico_low_data_final <- fico_low_data[complete.cases(fico_low_data),]
summary(fico_low_data_final$gfico_range_low)
gdti <- cut(new_data$dti, br=c(0, 8, 13, 19, 30),
labels=c("Low","Medium","High","Very High"))
dti_data <- data.frame(new_data, gdti)
summary(dti_data$gdti)
dti_data_final <- dti_data[complete.cases(dti_data),]
summary(dti_data_final$gdti)
gmths_since_last_delinq <- cut(new_data$mths_since_last_delinq,
br=c(0, 48, 6466000, 10000000), labels=c("Low","Medium","High"))
mths_since_last_delinq_data <- data.frame(new_data,
gmths_since_last_delinq)
summary(mths_since_last_delinq_data$gmths_since_last_delinq)
mths_since_last_delinq_data_final <-
mths_since_last_delinq_data[complete.cases(mths_since_last_delinq_data),]
summary(mths_since_last_delinq_data_final$gmths_since_last_delinq)
gmths_since_last_record <- cut(new_data$mths_since_last_record,
br=c(0, 9299000, 10000000), labels=c("Low","High"))
mths_since_last_record_data <- data.frame(new_data,
gmths_since_last_record)
summary(mths_since_last_record_data$gmths_since_last_record)
mths_since_last_record_data_final <-
mths_since_last_record_data[complete.cases(mths_since_last_record_data),]
summary(mths_since_last_record_data_final$gmths_since_last_record)
gannual_inc <- cut(new_data$annual_inc, br=c(4000, 40500, 59000,
82340, 6000000), labels=c("Low","Medium","High","Very High"))
annual_inc_data <- data.frame(new_data, gannual_inc)
summary(annual_inc_data$gannual_inc)
annual_inc_data_final <-
annual_inc_data[complete.cases(annual_inc_data),]
summary(annual_inc_data_final$gannual_inc)
glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392,
1816, 36120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <-
last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)
gtotal_pymnt <- cut(new_data$total_pymnt, br=c(0, 4707, 8089, 13090,
49490), labels=c("Low","Medium","High","Very High"))
total_pymnt_data <- data.frame(new_data, gtotal_pymnt)
summary(total_pymnt_data$gtotal_pymnt)
total_pymnt_data_final <-
total_pymnt_data[complete.cases(total_pymnt_data),]
summary(total_pymnt_data_final$gtotal_pymnt)
gtotal_pymnt_inv <- cut(new_data$total_pymnt_inv, br=c(0, 4312,
7590, 12390, 49100), labels=c("Low","Medium","High","Very High"))
total_pymnt_inv_data <- data.frame(new_data, gtotal_pymnt_inv)
summary(total_pymnt_inv_data$gtotal_pymnt_inv)
total_pymnt_inv_data_final <-
total_pymnt_inv_data[complete.cases(total_pymnt_inv_data),]
summary(total_pymnt_inv_data_final$gtotal_pymnt_inv)
dim(total_pymnt_inv_data_final)
glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392,
1816, 16120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <-
last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)
dim(last_pymnt_amnt_data_final)
gtotal_rec_int <- cut(new_data$total_rec_int, br=c(0, 640, 1290,
2618, 15300), labels=c("Low","Medium","High","Very High"))
total_rec_int_data <- data.frame(new_data, gtotal_rec_int)
summary(total_rec_int_data$gtotal_rec_int)
total_rec_int_data_final <-
total_rec_int_data[complete.cases(total_rec_int_data),]
summary(total_rec_int_data_final$gtotal_rec_int)
dim(total_rec_int_data_final)
# define global
purpose <- data.frame(purpose_debt ,purpose_car ,purpose_credit
,purpose_home_imp ,purpose_maj_purchase,purpose_ren_energy
,purpose_education ,purpose_house ,purpose_medical ,purpose_moving
,purpose_small_business ,purpose_vacation ,purpose_wedding)
home_ownership <- data.frame(home_ownership_mort,
home_ownership_own, home_ownership_rent, home_ownership_other)
grade <- data.frame(grade_a,grade_b,grade_c,grade_d,grade_e,grade_f)
final_data <- data.frame(def, annual_inc_data$gannual_inc,
last_fico_high_data$glast_fico_range_high,
last_fico_low_data$glast_fico_range_low,
fico_low_data$gfico_range_low, fico_high_data$gfico_range_high,
mths_since_last_delinq_data$gmths_since_last_delinq,
total_pymnt_data$gtotal_pymnt,
total_pymnt_inv_data$gtotal_pymnt_inv, term,
last_pymnt_amnt_data$glast_pymnt_amnt,
total_rec_int_data$gtotal_rec_int, purpose, grade,home_ownership)
fit1 <- glm(def~.,data=final_data, family="binomial")
summary(fit1)
final_data1<- data.frame(def, annual_inc_data$gannual_inc,
last_fico_high_data$glast_fico_range_high,
mths_since_last_delinq_data$gmths_since_last_delinq,
total_pymnt_data$gtotal_pymnt,
total_pymnt_inv_data$gtotal_pymnt_inv, term,
last_pymnt_amnt_data$glast_pymnt_amnt,
total_rec_int_data$gtotal_rec_int)
fit2 <- glm(def~.,data=final_data1, family="binomial")
summary(fit2)
#IV calculation
iv.mult(final_data1,"def",TRUE)
iv_data <- data.frame(def, annual_inc_data$gannual_inc,
last_fico_high_data$glast_fico_range_high,
mths_since_last_delinq_data$gmths_since_last_delinq,
total_pymnt_data$gtotal_pymnt,
total_pymnt_inv_data$gtotal_pymnt_inv, term,
last_pymnt_amnt_data$glast_pymnt_amnt,
total_rec_int_data$gtotal_rec_int)
iv.plot.summary(iv.mult(final_data1,"def",TRUE))
#taking IV cut off as 0.1
iv_data_final <- data.frame(def,
last_fico_high_data$glast_fico_range_high,
total_pymnt_data$gtotal_pymnt,
total_pymnt_inv_data$gtotal_pymnt_inv, term,
last_pymnt_amnt_data$glast_pymnt_amnt,
total_rec_int_data$gtotal_rec_int)
iv.mult(iv_data_final,"def",TRUE)
iv.plot.summary(iv.mult(iv_data_final,"def",TRUE))
fin_data <- data.frame(def, last_fico_range_high, last_pymnt_amnt,
total_pymnt, total_rec_int)
fit <- glm(def~., data=fin_data, family="binomial")
summary(fit)
vif(fit)
# Dividing data into train and test
totalrecords <- nrow(fin_data)
trainfraction = 0.7
trainrecords = as.integer(totalrecords * trainfraction)
allrows <- 1:totalrecords
trainrows <- sample(totalrecords,trainrecords)
testrows <- allrows[-trainrows]
train<-data.frame(fin_data[trainrows,])
test<-data.frame(fin_data[testrows,])
dim(train)
dim(test)
fit_train <- glm(def~., data=train, family="binomial")
summary(fit_train)
# multicollinearity
vif(fit_train)
#Calculating ROC curve for model
library(ROCR)
#score train data set
train$score<-predict(fit_train,type='response',train)
pred_train<-prediction(train$score,train$def)
perf_train <- performance(pred_train,"tpr","fpr")
plot(perf_train)
# calculating AUC
auc_train <- performance(pred_train,"auc")
auc_train <- unlist(slot(auc_train, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc<-min(round(auc_train, digits = 2))
maxauc<-max(round(auc_train, digits = 2))
minauct <- paste(c("min(AUC) = "),minauc,sep="")
maxauct <- paste(c("max(AUC) = "),maxauc,sep="")
legend(0.3,0.6,c(minauct,maxauct,"n"),border="white",cex=1.7,
box.col = "white")
# Calculating KS statistic for train data
max(attr(perf_train,'y.values')[[1]]-
attr(perf_train,'x.values')[[1]])
# Concordance
OptimisedConc=function(model)
{
Data = cbind(model$y, model$fitted.values)
ones = Data[Data[,1] == 1,]
zeros = Data[Data[,1] == 0,]
conc=matrix(0, dim(zeros)[1], dim(ones)[1])
disc=matrix(0, dim(zeros)[1], dim(ones)[1])
ties=matrix(0, dim(zeros)[1], dim(ones)[1])
for (j in 1:dim(zeros)[1])
{
for (i in 1:dim(ones)[1])
{
if (ones[i,2]>zeros[j,2])
{conc[j,i]=1}
else if (ones[i,2]<zeros[j,2])
{disc[j,i]=1}
else if (ones[i,2]==zeros[j,2])
{ties[j,i]=1}
}
}
Pairs=dim(zeros)[1]*dim(ones)[1]
PercentConcordance=(sum(conc)/Pairs)*100
PercentDiscordance=(sum(disc)/Pairs)*100
PercentTied=(sum(ties)/Pairs)*100
return(list("Percent Concordance"=PercentConcordance,"Percent
Discordance"=PercentDiscordance,"Percent
Tied"=PercentTied,"Pairs"=Pairs))
}
# concordance of train data
OptimisedConc(fit_train)
#score test data set
test$score<-predict(fit_train,type='response',test)
pred_test<-prediction(test$score,test$def)
perf_test <- performance(pred_test,"tpr","fpr")
plot(perf_test)
# calculating AUC
auc_test <- performance(pred_test,"auc")
auc_test <- unlist(slot(auc_test, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc<-min(round(auc_test, digits = 2))
maxauc<-max(round(auc_test, digits = 2))
minauct <- paste(c("min(AUC) = "),minauc,sep="")
maxauct <- paste(c("max(AUC) = "),maxauc,sep="")
legend(0.3,0.6,c(minauct,maxauct,"n"),border="white",cex=1.7,
box.col = "white")
# Calculating KS statistic
max(attr(perf_test,'y.values')[[1]]-attr(perf_test,'x.values')[[1]])
# Probabilities prediction
prob_glm <- predict.glm(fit_train, type="response",se.fit=FALSE)
write.csv(prob_glm, "probability.csv")
plot(prob_glm)
#Calculating top 3 variables affecting Credit Score Function
g<-predict(fit_train,type='terms',test)
#function to pick top 3 reasons
#works by sorting coefficient terms in equation and selecting top 3
#in sort for each loan scored
ftopk<- function(x,top=3){
res=names(x)[order(x, decreasing = TRUE)][1:top]
paste(res,collapse=";",sep="")
}
# Apply the function to each scored row, keeping the top 3 terms
topk=apply(g,1,ftopk,top=3)
#add reason list to scored test sample
test<-cbind(test, topk)
summary(test)
# Decision Tree Implementation
#load tree package
library(rpart)
#build model using 90% 10% priors
#with smaller complexity parameter to allow more complex trees
model_dt <-
rpart(def~.,data=train,parms=list(prior=c(.9,.1)),cp=.0002)
plot(model_dt)
text(model_dt)
#score test data
test$tscore1<-predict(model_dt,type='prob',test)
pred5<-prediction(test$tscore1[,2],test$def)
perf5 <- performance(pred5,"tpr","fpr")
# Random Forest
library(randomForest)
# presumably intended to drop rows with a missing target before fitting the forest
train <- train[complete.cases(train$def), ]
arf <- randomForest(def~., data=train, importance=TRUE, proximity=TRUE,
ntree=500, keep.forest=TRUE)
#plot variable importance
varImpPlot(arf)
train$p <- predict(model_dt,train,type="prob")
train$p
summary(train$p)
# Note: 'model_rf' and 'testdata' are not created earlier in this script;
# presumably the random forest fit (arf) and the test data frame were intended
test$p_rf <- predict(arf, test)
summary(test$p_rf)
----------------------- End Of Report --------------------------
Predicting Delinquency-Give me some credit
 
IRJET- Prediction of Credit Risks in Lending Bank Loans
IRJET- Prediction of Credit Risks in Lending Bank LoansIRJET- Prediction of Credit Risks in Lending Bank Loans
IRJET- Prediction of Credit Risks in Lending Bank Loans
 
Data mining on Financial Data
Data mining on Financial DataData mining on Financial Data
Data mining on Financial Data
 
Applications of Data Science in Banking and Financial sector.pptx
Applications of Data Science in Banking and Financial sector.pptxApplications of Data Science in Banking and Financial sector.pptx
Applications of Data Science in Banking and Financial sector.pptx
 
Loan Analysis Predicting Defaulters
Loan Analysis Predicting DefaultersLoan Analysis Predicting Defaulters
Loan Analysis Predicting Defaulters
 
Chapter8 - Beyond Classification
Chapter8 - Beyond ClassificationChapter8 - Beyond Classification
Chapter8 - Beyond Classification
 
Credit decision-indices-a-flexible-tool-for-both-credit-consumers-and-providers
Credit decision-indices-a-flexible-tool-for-both-credit-consumers-and-providersCredit decision-indices-a-flexible-tool-for-both-credit-consumers-and-providers
Credit decision-indices-a-flexible-tool-for-both-credit-consumers-and-providers
 
Data Mining in Life Insurance Business
Data Mining in Life Insurance BusinessData Mining in Life Insurance Business
Data Mining in Life Insurance Business
 
Risk dk
Risk dkRisk dk
Risk dk
 
The-Report-V7
The-Report-V7The-Report-V7
The-Report-V7
 
A data mining approach to predict
A data mining approach to predictA data mining approach to predict
A data mining approach to predict
 
2018-Case-Study.pdf
2018-Case-Study.pdf2018-Case-Study.pdf
2018-Case-Study.pdf
 
A Guide for Credit Providers Moving to Participate in CCR by David Grafton
A Guide for Credit Providers Moving to Participate in CCR by David GraftonA Guide for Credit Providers Moving to Participate in CCR by David Grafton
A Guide for Credit Providers Moving to Participate in CCR by David Grafton
 
PredictiveMetrics' Predictive Scoring for Collections Capabilities
PredictiveMetrics' Predictive Scoring for Collections CapabilitiesPredictiveMetrics' Predictive Scoring for Collections Capabilities
PredictiveMetrics' Predictive Scoring for Collections Capabilities
 
Manuscript dss
Manuscript dssManuscript dss
Manuscript dss
 
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING
 
Credit process
Credit processCredit process
Credit process
 
Sample Risk Assessment Report- QuantumBanking.pdf
Sample Risk Assessment Report- QuantumBanking.pdfSample Risk Assessment Report- QuantumBanking.pdf
Sample Risk Assessment Report- QuantumBanking.pdf
 
Single View of Customer in Banking
Single View of Customer in BankingSingle View of Customer in Banking
Single View of Customer in Banking
 
Projecting Impact of Non-Traditional Data and Advanced Analytics on Delivery ...
Projecting Impact of Non-Traditional Data and Advanced Analytics on Delivery ...Projecting Impact of Non-Traditional Data and Advanced Analytics on Delivery ...
Projecting Impact of Non-Traditional Data and Advanced Analytics on Delivery ...
 

Project Report - Acquisition Credit Scoring Model

losses for the lending institution. There are many variations and complexities in how exactly credit is extended to individuals, businesses, and other organizations for various purposes (purchasing equipment, real estate, consumer items, and so on), and through various methods of credit (credit card, loan, delayed payment plan). But in all cases, a lender provides money to an individual or institution and expects to be paid back in time, with interest commensurate with the risk of default.

Credit scoring is the set of decision models, and their underlying techniques, that aid lenders in granting consumer credit. These techniques determine who will get credit, how much credit they should get, and which operational strategies will enhance the profitability of the borrowers to the lenders. They also help to assess the risk in lending. Credit scoring is a dependable assessment of a person's creditworthiness since it is based on actual data.

A lender commonly makes two types of decisions: first, whether to grant credit to a new applicant, and second, how to deal with existing applicants, including whether to increase their credit limits. In both cases, whatever the techniques used, it is critical that there is a large sample of previous customers with their application details, behavioral patterns, and subsequent credit history available. Most of the techniques use this sample to identify the connection between the characteristics of the consumers (annual income, age, number of years in employment with their current employer, etc.) and their subsequent credit history. Typical application areas in the consumer market include: credit cards, auto loans, home mortgages, home equity loans, mail catalog orders, and a wide variety of personal loan products.

2 Scope and Objectives

The scope of this project is to evaluate the application of Logistic Regression and various Data Mining techniques to credit scoring, so as to support better lending decisions and reduce risk.
The objective of this project is to build a credit scoring model that quantifies the potential risk involved, on the basis of which credit lending decisions can be made.

3 Data Sources

The applicant data used in this project, involving US customers, was provided by our mentor.

4 Analytical Approach

The steps followed in this project are described below.

4.1 Data Collection

Customer credit history, along with all supporting information, was provided by our mentor. The data is cross-sectional and belongs to a single point in time.

4.2 Data Preparation

Listed below are some of the techniques used for data preparation.

1. Initial variable selection based on judgement – First we identified some of the key variables out of all the variables given in our data set, purely based on judgement. Initial screening of variables is very important and requires a deeper understanding of the domain as well as experience. Out of the 46 variables, we took some 20-odd variables into consideration for modeling. Below is the list of initial variables:

Loan_amnt, Term, Annual_inc, Fico_range_high, Fico_range_low, Last_fico_range_high, Last_fico_range_low, Purpose, Home_ownership, Grade, Dti, Mths_since_last_delinq, Mths_since_last_record, Last_pymnt_amnt, Total_pymnt, Total_pymnt_inv, Pub_rec, Total_rec_int, Revol_bal

2. Missing value treatment – There are normally two techniques used for missing value treatment: either removing the entire variable, if more than 80–90% of its observations are missing, or imputing a very high value such as 9999999, if the variable is significant from a modeling perspective and hence cannot be dropped. In our case we went with the latter, as variables like Mths_since_last_delinq and Mths_since_last_record carry a lot of importance for modeling even though more than 90% of their observations were missing in the original data set. For the other, non-significant variables, we simply imputed zero. Below is the R code used for missing value treatment:

new_data <- as.data.frame(data, stringsAsFactors=FALSE)
new_data$mths_since_last_delinq[is.na(new_data$mths_since_last_delinq)] <- 9999999
new_data$mths_since_last_record[is.na(new_data$mths_since_last_record)] <- 9999999
new_data[is.na(new_data)] <- 0
str(new_data)
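Before deciding between dropping and imputing a variable, the share of missing observations per column can be checked directly. The one-liner below is an illustrative check only; it uses the raw data frame read in at the start of the project code.

# Proportion of missing values per variable, sorted in descending order (illustrative check)
sort(colMeans(is.na(data)), decreasing = TRUE)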
3. Identifying the dependent variable – In any business problem, it is critical to first understand the problem statement and then approach it with an appropriate solution. Hence, in our case, the first priority was to identify the dependent variable correctly; getting this wrong would have caused the modeling to go completely off track. Our problem is concerned with finding out whether a person is going to default or not, by calculating the probability of default among the consumers. From our data set, we picked LOAN_STATUS as the dependent variable, which contains values like Current, Default, Charged Off, etc.

4. Data transformation – Data transformation is one of the most significant steps before the actual modeling and contributes a lot to making the final model robust. It includes various steps such as the introduction of dummy variables and the conversion of continuous variables to categorical variables.

• Introduction of dummy variables – We introduced dummy variables for the character variables as follows:

# Introduction of dummy variables
# For loan status
def <- ifelse(loan_status=="Default", 1, 0)
# purpose dummies
purpose_debt <- ifelse(purpose=='debt_consolidation', 1, 0)
purpose_car <- ifelse(purpose=='car', 1, 0)
purpose_credit <- ifelse(purpose=="credit_card", 1, 0)
purpose_home_imp <- ifelse(purpose=="home_improvement", 1, 0)
purpose_maj_purchase <- ifelse(purpose=="major_purchase", 1, 0)
purpose_ren_energy <- ifelse(purpose=="renewable_energy", 1, 0)
purpose_education <- ifelse(purpose=="educational", 1, 0)
purpose_house <- ifelse(purpose=="house", 1, 0)
purpose_medical <- ifelse(purpose=="medical", 1, 0)
purpose_moving <- ifelse(purpose=="moving", 1, 0)
purpose_small_business <- ifelse(purpose=="small_business", 1, 0)
purpose_vacation <- ifelse(purpose=="vacation", 1, 0)
purpose_wedding <- ifelse(purpose=="wedding", 1, 0)
# home ownership dummies
home_ownership_mort <- ifelse(home_ownership=="MORTGAGE", 1, 0)
home_ownership_own <- ifelse(home_ownership=="OWN", 1, 0)
home_ownership_rent <- ifelse(home_ownership=="RENT", 1, 0)
home_ownership_other <- ifelse(home_ownership=="OTHER", 1, 0)
# grade dummies
grade_a <- ifelse(grade=="A", 1, 0)
grade_b <- ifelse(grade=="B", 1, 0)
grade_c <- ifelse(grade=="C", 1, 0)
grade_d <- ifelse(grade=="D", 1, 0)
grade_e <- ifelse(grade=="E", 1, 0)
grade_f <- ifelse(grade=="F", 1, 0)

• Conversion of continuous variables to categorical variables (fine classing) – As a good modeling practice, it is recommended that continuous variables be converted into categorical variables by introducing bins; this is also called fine classing. It generally yields better results during the Information Value (IV) calculation for the predictor variables. We followed the approach below:

gloan_amnt <- cut(new_data$loan_amnt, br=c(0, 5000, 15000, 35000), labels=c("Low","Medium","High"))
gloan_amnt_data <- data.frame(new_data, gloan_amnt)
summary(gloan_amnt_data$gloan_amnt)
gloan_amnt_data_final <- gloan_amnt_data[complete.cases(gloan_amnt_data),]
summary(gloan_amnt_data_final$gloan_amnt)

glast_fico_range_high <- cut(new_data$last_fico_range_high, br=c(0, 649, 699, 744, 850), labels=c("Low","Medium","High","Very High"))
last_fico_high_data <- data.frame(new_data, glast_fico_range_high)
summary(last_fico_high_data$glast_fico_range_high)
last_fico_high_data_final <- last_fico_high_data[complete.cases(last_fico_high_data),]
summary(last_fico_high_data_final$glast_fico_range_high)

glast_fico_range_low <- cut(new_data$last_fico_range_low, br=c(0, 645, 695, 740, 845), labels=c("Low","Medium","High","Very High"))
last_fico_low_data <- data.frame(new_data, glast_fico_range_low)
summary(last_fico_low_data$glast_fico_range_low)
last_fico_low_data_final <- last_fico_low_data[complete.cases(last_fico_low_data),]
summary(last_fico_low_data_final$glast_fico_range_low)

gfico_range_high <- cut(new_data$fico_range_high, br=c(0, 650, 700, 745, 850), labels=c("Low","Medium","High","Very High"))
fico_high_data <- data.frame(new_data, gfico_range_high)
summary(fico_high_data$gfico_range_high)
fico_high_data_final <- fico_high_data[complete.cases(fico_high_data),]
summary(fico_high_data_final$gfico_range_high)

gfico_range_low <- cut(new_data$fico_range_low, br=c(0, 650, 700, 745, 850), labels=c("Low","Medium","High","Very High"))
fico_low_data <- data.frame(new_data, gfico_range_low)
summary(fico_low_data$gfico_range_low)
fico_low_data_final <- fico_low_data[complete.cases(fico_low_data),]
summary(fico_low_data_final$gfico_range_low)

gdti <- cut(new_data$dti, br=c(0, 8, 13, 19, 30), labels=c("Low","Medium","High","Very High"))
dti_data <- data.frame(new_data, gdti)
summary(dti_data$gdti)
dti_data_final <- dti_data[complete.cases(dti_data),]
summary(dti_data_final$gdti)

gmths_since_last_delinq <- cut(new_data$mths_since_last_delinq, br=c(0, 48, 6466000, 10000000), labels=c("Low","Medium","High"))
mths_since_last_delinq_data <- data.frame(new_data, gmths_since_last_delinq)
summary(mths_since_last_delinq_data$gmths_since_last_delinq)
mths_since_last_delinq_data_final <- mths_since_last_delinq_data[complete.cases(mths_since_last_delinq_data),]
summary(mths_since_last_delinq_data_final$gmths_since_last_delinq)

gmths_since_last_record <- cut(new_data$mths_since_last_record, br=c(0, 9299000, 10000000), labels=c("Low","High"))
mths_since_last_record_data <- data.frame(new_data, gmths_since_last_record)
summary(mths_since_last_record_data$gmths_since_last_record)
mths_since_last_record_data_final <- mths_since_last_record_data[complete.cases(mths_since_last_record_data),]
summary(mths_since_last_record_data_final$gmths_since_last_record)

gannual_inc <- cut(new_data$annual_inc, br=c(4000, 40500, 59000, 82340, 6000000), labels=c("Low","Medium","High","Very High"))
annual_inc_data <- data.frame(new_data, gannual_inc)
summary(annual_inc_data$gannual_inc)
annual_inc_data_final <- annual_inc_data[complete.cases(annual_inc_data),]
summary(annual_inc_data_final$gannual_inc)

glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392, 1816, 36120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <- last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)

gtotal_pymnt <- cut(new_data$total_pymnt, br=c(0, 4707, 8089, 13090, 49490), labels=c("Low","Medium","High","Very High"))
total_pymnt_data <- data.frame(new_data, gtotal_pymnt)
summary(total_pymnt_data$gtotal_pymnt)
total_pymnt_data_final <- total_pymnt_data[complete.cases(total_pymnt_data),]
summary(total_pymnt_data_final$gtotal_pymnt)

gtotal_pymnt_inv <- cut(new_data$total_pymnt_inv, br=c(0, 4312, 7590, 12390, 49100), labels=c("Low","Medium","High","Very High"))
total_pymnt_inv_data <- data.frame(new_data, gtotal_pymnt_inv)
summary(total_pymnt_inv_data$gtotal_pymnt_inv)
total_pymnt_inv_data_final <- total_pymnt_inv_data[complete.cases(total_pymnt_inv_data),]
summary(total_pymnt_inv_data_final$gtotal_pymnt_inv)
dim(total_pymnt_inv_data_final)

# last_pymnt_amnt re-binned with a tighter upper break
glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392, 1816, 16120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <- last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)
dim(last_pymnt_amnt_data_final)

gtotal_rec_int <- cut(new_data$total_rec_int, br=c(0, 640, 1290, 2618, 15300), labels=c("Low","Medium","High","Very High"))
total_rec_int_data <- data.frame(new_data, gtotal_rec_int)
summary(total_rec_int_data$gtotal_rec_int)
total_rec_int_data_final <- total_rec_int_data[complete.cases(total_rec_int_data),]
summary(total_rec_int_data_final$gtotal_rec_int)
dim(total_rec_int_data_final)

4.3 Variable Reduction

Once the strongest characteristics are grouped and ranked, variable selection is done. At the end of this step, the scorecard developer will have a set of strong, grouped characteristics, preferably representing independent information types, for use in the regression step. The strength of a characteristic is gauged using four main criteria:
• Predictive power of each attribute. The weight of evidence (WOE) measure is used for this purpose.
• The range and trend of weight of evidence across grouped attributes within a characteristic.
• Predictive power of the characteristic. The Information Value (IV) measure is used for this purpose.
• Operational and business considerations.

In our case, we have used the Information Value measure for variable reduction.

Some analysts run other variable selection algorithms (e.g., those that rank predictive power using Chi-Square or R-Square) prior to grouping characteristics. This gives them an indication of characteristic strength using independent means, and also alerts them to cases where the Information Value figure is high or low compared to other measures.

The initial characteristic analysis process can be interactive, and involvement from business users and operations staff should be encouraged. In particular, they may provide further insights into any unexpected or illogical behavior patterns and enhance the grouping of all variables.

The first step in performing this analysis is to perform an initial grouping of the variables and rank-order them by IV or some other strength measure. This can be done using a number of binning techniques. A good way to start is to bin the continuous (interval) variables into 50 or so equal-sized groups, and then to calculate the WOE and IV for the grouped attributes and characteristics; a minimal sketch of this equal-frequency starting point is given below. One can then use any spreadsheet software to fine-tune the groupings for the stronger characteristics based on principles outlined in the next section. Similarly, for categorical characteristics, the WOE of each unique attribute and the IV of each characteristic can be calculated, and the groupings fine-tuned for those characteristics that surpass a minimum acceptable strength. Decision trees are also often used for grouping variables. Most users, however, use them to generate initial ideas, and then use alternate software applications to interactively fine-tune the groupings.
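The snippet below is a minimal sketch of that equal-frequency fine-classing starting point. The choice of 20 bins and the use of annual_inc are illustrative assumptions only, not part of the project's final grouping.

# Equal-frequency fine classing of one continuous variable (illustrative; 20 bins assumed)
n_bins <- 20
brks <- unique(quantile(new_data$annual_inc, probs = seq(0, 1, length.out = n_bins + 1), na.rm = TRUE))
fine_annual_inc <- cut(new_data$annual_inc, breaks = brks, include.lowest = TRUE)
table(fine_annual_inc)   # roughly equal-sized bins from which the WOE/IV analysis can start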
• Information Value (IV): Information Value provides a measure of how well a variable X is able to distinguish between a binary response (e.g. "good" vs. "bad") in some target variable Y. The idea is that if a variable X has a low Information Value, it may not do a sufficient job of classifying the target variable, and hence it is removed as an explanatory variable.

To see how this works, let X be grouped into n bins. Each x ∈ X corresponds to a y ∈ Y that may take one of two values, say 0 or 1. Then, summing over the bins Xi, 1 ≤ i ≤ n,

IV = Σ (gi − bi) × ln(gi / bi)

where bi is the proportion of 0's in bin i relative to all bins, gi is the proportion of 1's in bin i relative to all bins, and ln(gi / bi) is known as the weight of evidence (WOE) of bin Xi.

The cut-off value may vary, but in our case we have taken an IV cut-off of 0.1. Below is the R code, output and plot of the IV calculation:

# IV calculation
# iv.mult() and iv.plot.summary() are provided by an external IV/WoE helper package (assumed loaded)
iv.mult(final_data1,"def",TRUE)
iv_data <- data.frame(def, annual_inc_data$gannual_inc, last_fico_high_data$glast_fico_range_high, mths_since_last_delinq_data$gmths_since_last_delinq, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)
iv.plot.summary(iv.mult(final_data1,"def",TRUE))
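As a cross-check on the packaged IV routine, the WOE and IV of a single grouped characteristic can also be computed directly from the formula above. The function below is an illustrative sketch only: it follows the document's definition of gi and bi, reuses the grouped variables built earlier, and assumes a small smoothing constant to avoid division by zero in empty bins.

# Illustrative manual WOE/IV calculation for one grouped characteristic
woe_iv <- function(bins, target, eps = 0.5) {
  tab <- table(bins, target)                            # rows = bins, cols = target (0/1)
  b_i <- (tab[, "0"] + eps) / sum(tab[, "0"] + eps)     # proportion of 0's in bin i
  g_i <- (tab[, "1"] + eps) / sum(tab[, "1"] + eps)     # proportion of 1's in bin i
  woe <- log(g_i / b_i)                                 # weight of evidence per bin
  iv  <- sum((g_i - b_i) * woe)                         # information value of the characteristic
  list(woe = woe, iv = iv)
}

# Example: IV of the grouped last FICO range (high), using objects defined above
woe_iv(last_fico_high_data$glast_fico_range_high, def)$iv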
# taking the IV cut-off as 0.1
iv_data_final <- data.frame(def, last_fico_high_data$glast_fico_range_high, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)
iv.mult(iv_data_final,"def",TRUE)
iv.plot.summary(iv.mult(iv_data_final,"def",TRUE))

4.4 Data Sampling

Sampling has been done using the 70–30% rule: the entire data set is split in a 70:30 ratio to be used as the development (training) and validation (test) samples respectively. (For reproducibility, a fixed random seed can be set before sampling.) Below is the R code:
# Dividing data into train and test
totalrecords <- nrow(fin_data)
trainfraction <- 0.7
trainrecords <- as.integer(totalrecords * trainfraction)
allrows <- 1:totalrecords
trainrows <- sample(totalrecords, trainrecords)
testrows <- allrows[-trainrows]
train <- data.frame(fin_data[trainrows,])
test <- data.frame(fin_data[testrows,])
dim(train)
dim(test)

4.5 Model Development

When the training data set on which the modeling is based contains a binary indicator variable such as "Paid back" vs. "Default", or "Good Credit" vs. "Bad Credit", Logistic Regression models are well suited for the subsequent predictive modeling. Logistic regression yields prediction probabilities for whether or not a particular outcome (e.g., Bad Credit) will occur. Furthermore, logistic regression models are linear models, in that the logit-transformed prediction probability is a linear function of the predictor variable values. Thus, a scorecard derived in this manner has the desirable quality that the final credit score (credit risk) is a linear function of the predictors and, with some additional transformations applied to the model parameters, a simple linear function of scores that can be associated with each predictor class value after coarse classing. The final credit score is then a simple sum of the individual score values taken from the scorecard (see the scaling sketch below).

In our model, we have used the train data set, which contains 70% of the original data. Below is the R code for the logistic regression:

fit_train <- glm(def~., data=train, family="binomial")
summary(fit_train)
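The "additional transformations" mentioned above usually amount to rescaling the model's log-odds onto a points scale. The sketch below illustrates the standard points-to-double-the-odds (PDO) scaling; the anchor values (600 points at 50:1 good:bad odds, 20 points to double the odds) are assumptions for the sketch, not values prescribed by this project.

# Illustrative rescaling of the logistic model's log-odds to scorecard points (assumed anchors)
pdo <- 20
base_points <- 600
base_odds <- 50

scaling_factor <- pdo / log(2)
scaling_offset <- base_points - scaling_factor * log(base_odds)

# log-odds of being "good" = minus the log-odds of default predicted by the fitted model
log_odds_good <- -predict(fit_train, newdata = train, type = "link")
train$credit_score <- scaling_offset + scaling_factor * log_odds_good
summary(train$credit_score)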
4.6 Intercept not significant in the development sample

An intercept is almost always part of the model and is almost always significantly different from zero. The test of the intercept in the procedure output tests whether this parameter is equal to zero. If the intercept is zero (equivalent to having no intercept in the model), the resulting model implies that the response function must be exactly zero when all the predictors are set to zero. For a logistic model this means that the logit (or log-odds) is zero, which implies that the event probability is 0.5. This is a very strong assumption that is sometimes reasonable, but more often is not.

So, a highly significant intercept in the model is generally not a problem. By the same token, if the intercept is not significant, you usually would not want to remove it from the model, because doing so creates a model that forces the response function to be zero when all the predictors are zero. If the nature of what you are modeling is such that you want to assume this, then you might want to remove the intercept. In our case, we obtained an intercept value of around 0.58 in the validation sample.
4.7 Assessment of sign of variables

a. Last_fico_range_high – The negative sign of the coefficient signifies an inverse relationship with default, which is expected in this context: the higher the FICO value, the lower the chance of default.
b. Total_pymnt – This also shows an inverse relationship with default, which is again expected.
c. Last_pymnt_amnt – The negative sign of the coefficient shows an inverse relationship with default.
d. Total_rec_int – The positive sign of the coefficient signifies a direct relationship with default.

4.8 Multicollinearity Test

Multicollinearity occurs when there are high correlations among predictor variables, leading to unreliable and unstable estimates of the regression coefficients. After model building, a multicollinearity check is normally performed, using the VIF (Variance Inflation Factor), to ensure that the independent variables are not highly correlated. Normally the cut-off for VIF is taken as 5; variables with a VIF above 5 should be considered collinear. However, for factor/categorical variables, the GVIF value is used as the baseline: terms which require more than one coefficient, and thus more than one degree of freedom, are typically evaluated using the GVIF. For one-coefficient terms the VIF is equal to the GVIF. There are four options for producing the VIF value:

• corvif command from the AED package
• vif command from the car package
• vif command from the rms package
• vif command from the DAAG package

Of these, "car" and "AED" produce the GVIF value and the other two produce the VIF value.

# Multicollinearity check
library(car)
vif(fit_train)
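For models that contain factor terms, car::vif() reports the GVIF together with GVIF^(1/(2·Df)). A common rule of thumb, sketched below as an illustration rather than a step performed in this project, is to square that adjusted value and compare it with the usual VIF cut-off of 5; for the final model here, which has only numeric predictors, vif() simply returns plain VIFs.

# Illustrative interpretation of vif()/GVIF output (sketch only)
v <- vif(fit_train)
if (is.matrix(v)) {
  adj_vif <- v[, "GVIF^(1/(2*Df))"]^2   # GVIF adjusted for degrees of freedom, squared
  print(adj_vif[adj_vif > 5])           # factor terms exceeding the threshold, if any
} else {
  print(v[v > 5])                       # plain VIFs for a model with only numeric terms
}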
(Screenshots of the VIF output and the logistic regression diagnostic plots appeared here.)
As can be seen, four diagnostic plots are generated when plotting the logistic model output.

4.9 Probability Prediction

From the above logistic output we predicted the default probabilities of the individual customers. The R code is below (the fitted model object is fit_train):

# Probability prediction
prob_glm <- predict.glm(fit_train, type="response", se.fit=FALSE)
write.csv(prob_glm, "probability.csv")
plot(prob_glm)
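The predicted probabilities can be turned into accept/reject style classifications by applying a cut-off. The sketch below uses an assumed cut-off of 0.5 purely for illustration; in practice the cut-off would be chosen from the score distribution and business considerations.

# Illustrative classification of the training sample at an assumed cut-off of 0.5
cutoff <- 0.5
pred_def <- ifelse(prob_glm > cutoff, 1, 0)

# Confusion matrix of actual vs predicted default, and overall accuracy at this cut-off
table(actual = train$def, predicted = pred_def)
mean(pred_def == train$def)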
4.10 Model Prediction (Goodness-of-fit)

Goodness-of-fit attempts to measure how well a model fits the data. It is usually applied after a final model has been selected; if we have multiple models, goodness-of-fit is used to choose among them. Concordance, discordance, the ROC curve and the KS statistic are used for this purpose.

• Concordance: In OLS regression, the R-squared, and its more refined measure the adjusted R-squared, is the 'one-stop' metric that immediately tells us whether the model is a good fit. Since it is a value between 0 and 1, it can easily be expressed as a percentage and passed off as 'model accuracy' for beginners and not-so-math-oriented businesses. Unfortunately, the adjusted R-squared is largely irrelevant in the case of logistic regression, because we model the log-odds ratio and it becomes very difficult to explain. This is where concordance helps. Concordance tells us, in percentage terms, the association between the actual values and the values fitted by the model. It is defined as the ratio of the number of (1, 0) pairs in which the 1 had a higher model score than the 0, to the total number of (1, 0) pairs possible. A higher value of concordance (60–70%) means a better-fitted model. However, a very large value (85–95%) could also suggest that the model is over-fitted and needs to be re-examined so that it generalizes to the entire population.

We have used a user-defined R helper function, OptimisedConc(), to obtain the concordance value:

OptimisedConc = function(model) {
  Data = cbind(model$y, model$fitted.values)
  ones = Data[Data[,1] == 1,]
  zeros = Data[Data[,1] == 0,]
  conc = matrix(0, dim(zeros)[1], dim(ones)[1])
  disc = matrix(0, dim(zeros)[1], dim(ones)[1])
  ties = matrix(0, dim(zeros)[1], dim(ones)[1])
  for (j in 1:dim(zeros)[1]) {
    for (i in 1:dim(ones)[1]) {
      if (ones[i,2] > zeros[j,2]) {conc[j,i] = 1}
      else if (ones[i,2] < zeros[j,2]) {disc[j,i] = 1}
      else if (ones[i,2] == zeros[j,2]) {ties[j,i] = 1}
    }
  }
  Pairs = dim(zeros)[1] * dim(ones)[1]
  PercentConcordance = (sum(conc)/Pairs) * 100
  PercentDiscordance = (sum(disc)/Pairs) * 100
  PercentTied = (sum(ties)/Pairs) * 100
  return(list("Percent Concordance" = PercentConcordance, "Percent Discordance" = PercentDiscordance, "Percent Tied" = PercentTied, "Pairs" = Pairs))
}

OptimisedConc(fit_train)

In our model, we obtain a concordance of about 91 percent which, per the guideline above, suggests the model may be over-fitted.

• ROC Curve and AUC:

For the train data:

# Calculating the ROC curve for the model
library(ROCR)
# score the train data set
train$score <- predict(fit_train, type='response', train)
pred_train <- prediction(train$score, train$def)
perf_train <- performance(pred_train, "tpr", "fpr")
plot(perf_train)
# calculating AUC
auc_train <- performance(pred_train, "auc")
auc_train <- unlist(slot(auc_train, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc <- min(round(auc_train, digits = 2))
maxauc <- max(round(auc_train, digits = 2))
minauct <- paste(c("min(AUC) = "), minauc, sep="")
maxauct <- paste(c("max(AUC) = "), maxauc, sep="")
legend(0.3, 0.6, c(minauct, maxauct, "n"), border="white", cex=1.7, box.col = "white")

So for the train data, we obtain an AUC of 0.91.

For the test data:

# score the test data set
test$score <- predict(fit_train, type='response', test)
pred_test <- prediction(test$score, test$def)
perf_test <- performance(pred_test, "tpr", "fpr")
plot(perf_test)

# calculating AUC
auc_test <- performance(pred_test, "auc")
# converting the S4 class slot to a vector
auc_test <- unlist(slot(auc_test, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc <- min(round(auc_test, digits = 2))
maxauc <- max(round(auc_test, digits = 2))
minauct <- paste(c("min(AUC) = "), minauc, sep="")
maxauct <- paste(c("max(AUC) = "), maxauc, sep="")
legend(0.3, 0.6, c(minauct, maxauct, "n"), border="white", cex=1.7, box.col = "white")

For the test data, we obtain an AUC value of 0.84.

• KS statistic (for train data):

# Calculating the KS statistic for the train data
max(attr(perf_train,'y.values')[[1]] - attr(perf_train,'x.values')[[1]])

The KS statistic for the train data is 0.706.

• KS statistic (for test data):

# Calculating the KS statistic for the test data
max(attr(perf_test,'y.values')[[1]] - attr(perf_test,'x.values')[[1]])

4.11 Calculating top 3 variables affecting credit score function

# Calculating the top 3 variables affecting the credit score function
g <- predict(fit_train, type='terms', test)

# function to pick the top 3 reasons
# works by sorting the coefficient terms of the equation and selecting the top 3 for each loan scored
ftopk <- function(x, top=3){
  res = names(x)[order(x, decreasing = TRUE)][1:top]
  paste(res, collapse=";", sep="")
}
# apply the function row-wise to pick the top 3 terms per scored loan
topk = apply(g, 1, ftopk, top=3)
# add the reason list to the scored test sample
test <- cbind(test, topk)
summary(test)
4.12 Reject Inference

The term reject inference describes the issue of how to deal with the inherent bias that arises when modeling is based on a training data set consisting only of those previous applicants for whom the actual performance (Good Credit vs. Bad Credit) has been observed; there are typically a significant number of other previous applicants who were rejected and for whom the final credit performance was never observed. The question is how to include those previous applicants in the modeling, in order to make the predictive model more accurate, robust and less biased, and applicable to those individuals as well.

This is of particular importance when the criteria for deciding whether or not to extend credit need to be loosened, in order to attract and extend credit to more applicants. This can, for example, happen during a severe economic downturn that affects many people and places their overall financial well-being into a condition that would not qualify them as acceptable credit risks under older criteria. In short, if nobody were to qualify for credit any more, the institutions extending credit would be out of business. So it is often critically important to make predictions about observations with specific predictor values that were essentially outside the range of what would previously have been considered, and which consequently are unavailable and unobserved in the training data where the actual outcomes are recorded.

There are a number of approaches that have been suggested for including previously rejected applicants in the model building step, in order to make the model more broadly applicable (to those applicants as well). In short, these methods come down to systematically extrapolating from the actual observed data, often by deliberately introducing biases and assumptions about the expected loan outcome had the (in actuality not observed) applicant been accepted for credit.
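Reject inference was not implemented in this project (see Way Forward), but a minimal sketch of one common extrapolation approach, simple augmentation, is shown below purely for illustration. It assumes a hypothetical data frame named rejects holding the rejected applicants with the same predictor columns as fin_data; no such data set exists in this project, and the 0.5 cut-off is an arbitrary assumption.

# Minimal sketch of simple augmentation (illustrative only; `rejects` is hypothetical)
rejects$p_def <- predict(fit_train, newdata = rejects, type = "response")
# assign an inferred performance to each rejected applicant (assumed cut-off)
rejects$def <- ifelse(rejects$p_def > 0.5, 1, 0)
# refit the model on the combined accepted + inferred-reject population
combined <- rbind(train[, names(fin_data)], rejects[, names(fin_data)])
fit_aug <- glm(def ~ ., data = combined, family = "binomial")
summary(fit_aug)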
4.13 Model Performance

A specific performance window would be taken into consideration to assess the accuracy and predictive power of the model over time.

5 Decision Tree Approach

This is one of the classic data mining techniques available for decision making, as the tree structure gives a better understanding of the data and of the important variables, which in turn can help in minimizing the potential risk involved. However, a decision tree produces piecewise-constant probability estimates and class rules rather than a smooth, additive score, which is why Logistic Regression is generally preferred for scorecard building in industry.

# Decision Tree Implementation
# load the tree package
library(rpart)
# build the model using 90%/10% priors
# with a smaller complexity parameter to allow more complex trees
model_dt <- rpart(def~., data=train, parms=list(prior=c(.9,.1)), cp=.0002)
plot(model_dt)
text(model_dt)
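The complexity parameter above (cp = .0002) was chosen to allow a deep tree. A common follow-up, sketched below as an illustration rather than a step performed in the project, is to inspect the cross-validated error reported by rpart and prune back to the sub-tree with the lowest xerror.

# Illustrative pruning of the deep tree using rpart's built-in cross-validation results
printcp(model_dt)    # cp table with cross-validated error
plotcp(model_dt)     # visual guide for choosing cp
best_cp <- model_dt$cptable[which.min(model_dt$cptable[, "xerror"]), "CP"]
model_pr <- prune(model_dt, cp = best_cp)
plot(model_pr); text(model_pr)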
(The decision tree plot produced by plot(model_dt) and text(model_dt) appeared here.)
6 Tools and Techniques

The following tools and techniques have been used:

• R Studio
• Predictive modeling, logistic regression and data mining techniques

7 Way Forward

Reject inference has not been implemented, so there is scope to implement it. Data mining techniques such as Random Forest and Neural Networks can also be implemented, depending on the scope of the project; we have only implemented a Decision Tree (apart from an exploratory Random Forest at the end of the project code).

8 Recommendations and Applications

Typical application areas in the consumer market include: credit cards, auto loans, home mortgages, home equity loans, mail catalog orders, and a wide variety of personal loan products.

9 References and Bibliography

The following documents have been referred to for this project:

• Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring – Naeem Siddiqi
• Sharma – Credit Scoring
• Machine Learning with R – Brett Lantz

10 Project Code

Below is the complete R code that has been used for the project:

rm(list=ls())
data <- read.csv(file.choose(), header=T, stringsAsFactors=FALSE)
str(data)
# Replacing NA values
new_data <- as.data.frame(data, stringsAsFactors=FALSE)
new_data$mths_since_last_delinq[is.na(new_data$mths_since_last_delinq)] <- 9999999
new_data$mths_since_last_record[is.na(new_data$mths_since_last_record)] <- 9999999
new_data[is.na(new_data)] <- 0
str(new_data)
attach(new_data)

# Introduction of dummy variables
# For loan status
def <- ifelse(loan_status=="Default", 1, 0)
# purpose dummies
purpose_debt <- ifelse(purpose=='debt_consolidation', 1, 0)
purpose_car <- ifelse(purpose=='car', 1, 0)
purpose_credit <- ifelse(purpose=="credit_card", 1, 0)
purpose_home_imp <- ifelse(purpose=="home_improvement", 1, 0)
purpose_maj_purchase <- ifelse(purpose=="major_purchase", 1, 0)
purpose_ren_energy <- ifelse(purpose=="renewable_energy", 1, 0)
purpose_education <- ifelse(purpose=="educational", 1, 0)
purpose_house <- ifelse(purpose=="house", 1, 0)
purpose_medical <- ifelse(purpose=="medical", 1, 0)
purpose_moving <- ifelse(purpose=="moving", 1, 0)
purpose_small_business <- ifelse(purpose=="small_business", 1, 0)
purpose_vacation <- ifelse(purpose=="vacation", 1, 0)
purpose_wedding <- ifelse(purpose=="wedding", 1, 0)
# home ownership dummies
home_ownership_mort <- ifelse(home_ownership=="MORTGAGE", 1, 0)
home_ownership_own <- ifelse(home_ownership=="OWN", 1, 0)
home_ownership_rent <- ifelse(home_ownership=="RENT", 1, 0)
home_ownership_other <- ifelse(home_ownership=="OTHER", 1, 0)
# grade dummies
grade_a <- ifelse(grade=="A", 1, 0)
grade_b <- ifelse(grade=="B", 1, 0)
grade_c <- ifelse(grade=="C", 1, 0)
grade_d <- ifelse(grade=="D", 1, 0)
grade_e <- ifelse(grade=="E", 1, 0)
grade_f <- ifelse(grade=="F", 1, 0)

# Fine classing
gloan_amnt <- cut(new_data$loan_amnt, br=c(0, 5000, 15000, 35000), labels=c("Low","Medium","High"))
gloan_amnt_data <- data.frame(new_data, gloan_amnt)
summary(gloan_amnt_data$gloan_amnt)
gloan_amnt_data_final <- gloan_amnt_data[complete.cases(gloan_amnt_data),]
summary(gloan_amnt_data_final$gloan_amnt)

glast_fico_range_high <- cut(new_data$last_fico_range_high, br=c(0, 649, 699, 744, 850), labels=c("Low","Medium","High","Very High"))
last_fico_high_data <- data.frame(new_data, glast_fico_range_high)
summary(last_fico_high_data$glast_fico_range_high)
last_fico_high_data_final <- last_fico_high_data[complete.cases(last_fico_high_data),]
summary(last_fico_high_data_final$glast_fico_range_high)

glast_fico_range_low <- cut(new_data$last_fico_range_low, br=c(0, 645, 695, 740, 845), labels=c("Low","Medium","High","Very High"))
last_fico_low_data <- data.frame(new_data, glast_fico_range_low)
summary(last_fico_low_data$glast_fico_range_low)
last_fico_low_data_final <- last_fico_low_data[complete.cases(last_fico_low_data),]
summary(last_fico_low_data_final$glast_fico_range_low)

gfico_range_high <- cut(new_data$fico_range_high, br=c(0, 650, 700, 745, 850), labels=c("Low","Medium","High","Very High"))
fico_high_data <- data.frame(new_data, gfico_range_high)
summary(fico_high_data$gfico_range_high)
fico_high_data_final <- fico_high_data[complete.cases(fico_high_data),]
summary(fico_high_data_final$gfico_range_high)

gfico_range_low <- cut(new_data$fico_range_low, br=c(0, 650, 700, 745, 850), labels=c("Low","Medium","High","Very High"))
fico_low_data <- data.frame(new_data, gfico_range_low)
summary(fico_low_data$gfico_range_low)
fico_low_data_final <- fico_low_data[complete.cases(fico_low_data),]
summary(fico_low_data_final$gfico_range_low)

gdti <- cut(new_data$dti, br=c(0, 8, 13, 19, 30), labels=c("Low","Medium","High","Very High"))
dti_data <- data.frame(new_data, gdti)
summary(dti_data$gdti)
dti_data_final <- dti_data[complete.cases(dti_data),]
summary(dti_data_final$gdti)

gmths_since_last_delinq <- cut(new_data$mths_since_last_delinq, br=c(0, 48, 6466000, 10000000), labels=c("Low","Medium","High"))
mths_since_last_delinq_data <- data.frame(new_data, gmths_since_last_delinq)
summary(mths_since_last_delinq_data$gmths_since_last_delinq)
mths_since_last_delinq_data_final <- mths_since_last_delinq_data[complete.cases(mths_since_last_delinq_data),]
summary(mths_since_last_delinq_data_final$gmths_since_last_delinq)

gmths_since_last_record <- cut(new_data$mths_since_last_record, br=c(0, 9299000, 10000000), labels=c("Low","High"))
mths_since_last_record_data <- data.frame(new_data, gmths_since_last_record)
summary(mths_since_last_record_data$gmths_since_last_record)
mths_since_last_record_data_final <- mths_since_last_record_data[complete.cases(mths_since_last_record_data),]
summary(mths_since_last_record_data_final$gmths_since_last_record)

gannual_inc <- cut(new_data$annual_inc, br=c(4000, 40500, 59000, 82340, 6000000), labels=c("Low","Medium","High","Very High"))
annual_inc_data <- data.frame(new_data, gannual_inc)
summary(annual_inc_data$gannual_inc)
annual_inc_data_final <- annual_inc_data[complete.cases(annual_inc_data),]
summary(annual_inc_data_final$gannual_inc)

glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392, 1816, 36120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <- last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)

gtotal_pymnt <- cut(new_data$total_pymnt, br=c(0, 4707, 8089, 13090, 49490), labels=c("Low","Medium","High","Very High"))
total_pymnt_data <- data.frame(new_data, gtotal_pymnt)
summary(total_pymnt_data$gtotal_pymnt)
total_pymnt_data_final <- total_pymnt_data[complete.cases(total_pymnt_data),]
summary(total_pymnt_data_final$gtotal_pymnt)

gtotal_pymnt_inv <- cut(new_data$total_pymnt_inv, br=c(0, 4312, 7590, 12390, 49100), labels=c("Low","Medium","High","Very High"))
total_pymnt_inv_data <- data.frame(new_data, gtotal_pymnt_inv)
summary(total_pymnt_inv_data$gtotal_pymnt_inv)
total_pymnt_inv_data_final <- total_pymnt_inv_data[complete.cases(total_pymnt_inv_data),]
summary(total_pymnt_inv_data_final$gtotal_pymnt_inv)
dim(total_pymnt_inv_data_final)

# last_pymnt_amnt re-binned with a tighter upper break
glast_pymnt_amnt <- cut(new_data$last_pymnt_amnt, br=c(0, 195, 392, 1816, 16120), labels=c("Low","Medium","High","Very High"))
last_pymnt_amnt_data <- data.frame(new_data, glast_pymnt_amnt)
summary(last_pymnt_amnt_data$glast_pymnt_amnt)
last_pymnt_amnt_data_final <- last_pymnt_amnt_data[complete.cases(last_pymnt_amnt_data),]
summary(last_pymnt_amnt_data_final$glast_pymnt_amnt)
dim(last_pymnt_amnt_data_final)

gtotal_rec_int <- cut(new_data$total_rec_int, br=c(0, 640, 1290, 2618, 15300), labels=c("Low","Medium","High","Very High"))
total_rec_int_data <- data.frame(new_data, gtotal_rec_int)
summary(total_rec_int_data$gtotal_rec_int)
total_rec_int_data_final <- total_rec_int_data[complete.cases(total_rec_int_data),]
summary(total_rec_int_data_final$gtotal_rec_int)
dim(total_rec_int_data_final)

# define global grouped data frames for the categorical dummies
purpose <- data.frame(purpose_debt, purpose_car, purpose_credit, purpose_home_imp, purpose_maj_purchase, purpose_ren_energy, purpose_education, purpose_house, purpose_medical, purpose_moving, purpose_small_business, purpose_vacation, purpose_wedding)
home_ownership <- data.frame(home_ownership_mort, home_ownership_own, home_ownership_rent, home_ownership_other)
grade <- data.frame(grade_a, grade_b, grade_c, grade_d, grade_e, grade_f)

final_data <- data.frame(def, annual_inc_data$gannual_inc, last_fico_high_data$glast_fico_range_high, last_fico_low_data$glast_fico_range_low, fico_low_data$gfico_range_low, fico_high_data$gfico_range_high, mths_since_last_delinq_data$gmths_since_last_delinq, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int, purpose, grade, home_ownership)
fit1 <- glm(def~., data=final_data, family="binomial")
summary(fit1)

final_data1 <- data.frame(def, annual_inc_data$gannual_inc, last_fico_high_data$glast_fico_range_high, mths_since_last_delinq_data$gmths_since_last_delinq, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)
fit2 <- glm(def~., data=final_data1, family="binomial")
summary(fit2)

# IV calculation
# iv.mult() and iv.plot.summary() are provided by an external IV/WoE helper package (assumed loaded)
iv.mult(final_data1,"def",TRUE)
iv_data <- data.frame(def, annual_inc_data$gannual_inc, last_fico_high_data$glast_fico_range_high, mths_since_last_delinq_data$gmths_since_last_delinq, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)
iv.plot.summary(iv.mult(final_data1,"def",TRUE))

# taking the IV cut-off as 0.1
iv_data_final <- data.frame(def, last_fico_high_data$glast_fico_range_high, total_pymnt_data$gtotal_pymnt, total_pymnt_inv_data$gtotal_pymnt_inv, term, last_pymnt_amnt_data$glast_pymnt_amnt, total_rec_int_data$gtotal_rec_int)
iv.mult(iv_data_final,"def",TRUE)
iv.plot.summary(iv.mult(iv_data_final,"def",TRUE))

fin_data <- data.frame(def, last_fico_range_high, last_pymnt_amnt, total_pymnt, total_rec_int)
fit <- glm(def~., data=fin_data, family="binomial")
summary(fit)
library(car)   # for vif()
vif(fit)

# Dividing data into train and test
totalrecords <- nrow(fin_data)
trainfraction <- 0.7
trainrecords <- as.integer(totalrecords * trainfraction)
allrows <- 1:totalrecords
trainrows <- sample(totalrecords, trainrecords)
testrows <- allrows[-trainrows]
train <- data.frame(fin_data[trainrows,])
test <- data.frame(fin_data[testrows,])
dim(train)
dim(test)

fit_train <- glm(def~., data=train, family="binomial")
summary(fit_train)

# multicollinearity
vif(fit_train)

# Calculating ROC curve for the model
library(ROCR)
# score the train data set
train$score <- predict(fit_train, type='response', train)
pred_train <- prediction(train$score, train$def)
perf_train <- performance(pred_train, "tpr", "fpr")
plot(perf_train)

# calculating AUC
auc_train <- performance(pred_train, "auc")
auc_train <- unlist(slot(auc_train, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc <- min(round(auc_train, digits = 2))
maxauc <- max(round(auc_train, digits = 2))
minauct <- paste(c("min(AUC) = "), minauc, sep="")
maxauct <- paste(c("max(AUC) = "), maxauc, sep="")
legend(0.3, 0.6, c(minauct, maxauct, "n"), border="white", cex=1.7, box.col = "white")

# Calculating KS statistic for train data
max(attr(perf_train,'y.values')[[1]] - attr(perf_train,'x.values')[[1]])

# Concordance
OptimisedConc = function(model) {
  Data = cbind(model$y, model$fitted.values)
  ones = Data[Data[,1] == 1,]
  zeros = Data[Data[,1] == 0,]
  conc = matrix(0, dim(zeros)[1], dim(ones)[1])
  disc = matrix(0, dim(zeros)[1], dim(ones)[1])
  ties = matrix(0, dim(zeros)[1], dim(ones)[1])
  for (j in 1:dim(zeros)[1]) {
    for (i in 1:dim(ones)[1]) {
      if (ones[i,2] > zeros[j,2]) {conc[j,i] = 1}
      else if (ones[i,2] < zeros[j,2]) {disc[j,i] = 1}
      else if (ones[i,2] == zeros[j,2]) {ties[j,i] = 1}
    }
  }
  Pairs = dim(zeros)[1] * dim(ones)[1]
  PercentConcordance = (sum(conc)/Pairs) * 100
  PercentDiscordance = (sum(disc)/Pairs) * 100
  PercentTied = (sum(ties)/Pairs) * 100
  return(list("Percent Concordance" = PercentConcordance, "Percent Discordance" = PercentDiscordance, "Percent Tied" = PercentTied, "Pairs" = Pairs))
}

# concordance of train data
OptimisedConc(fit_train)

# score the test data set
test$score <- predict(fit_train, type='response', test)
pred_test <- prediction(test$score, test$def)
perf_test <- performance(pred_test, "tpr", "fpr")
plot(perf_test)

# calculating AUC
auc_test <- performance(pred_test, "auc")
auc_test <- unlist(slot(auc_test, "y.values"))
# adding min and max ROC AUC to the center of the plot
minauc <- min(round(auc_test, digits = 2))
maxauc <- max(round(auc_test, digits = 2))
minauct <- paste(c("min(AUC) = "), minauc, sep="")
maxauct <- paste(c("max(AUC) = "), maxauc, sep="")
legend(0.3, 0.6, c(minauct, maxauct, "n"), border="white", cex=1.7, box.col = "white")

# Calculating KS statistic for test data
max(attr(perf_test,'y.values')[[1]] - attr(perf_test,'x.values')[[1]])

# Probability prediction
prob_glm <- predict.glm(fit_train, type="response", se.fit=FALSE)
write.csv(prob_glm, "probability.csv")
plot(prob_glm)

# Calculating top 3 variables affecting the credit score function
g <- predict(fit_train, type='terms', test)

# function to pick the top 3 reasons
# works by sorting the coefficient terms of the equation and selecting the top 3 for each loan scored
ftopk <- function(x, top=3){
  res = names(x)[order(x, decreasing = TRUE)][1:top]
  paste(res, collapse=";", sep="")
}
# apply the function row-wise to pick the top 3 terms per scored loan
topk = apply(g, 1, ftopk, top=3)
# add the reason list to the scored test sample
test <- cbind(test, topk)
summary(test)

# Decision Tree Implementation
# load the tree package
library(rpart)
# build the model using 90%/10% priors
# with a smaller complexity parameter to allow more complex trees
model_dt <- rpart(def~., data=train, parms=list(prior=c(.9,.1)), cp=.0002)
plot(model_dt)
text(model_dt)

# score the test data with the tree
test$tscore1 <- predict(model_dt, type='prob', test)
pred5 <- prediction(test$tscore1[,2], test$def)
perf5 <- performance(pred5, "tpr", "fpr")
# Random Forest (exploratory; the original snippet was incomplete and is lightly repaired here)
library(randomForest)

# def is converted to a factor so that randomForest fits a classification model;
# columns added after scoring (e.g. train$score) are excluded from the predictors
rf_data <- train[, c("def", "last_fico_range_high", "last_pymnt_amnt", "total_pymnt", "total_rec_int")]
rf_data$def <- as.factor(rf_data$def)
arf <- randomForest(def~., data=rf_data, importance=TRUE, proximity=TRUE, ntree=500, keep.forest=TRUE)

# plot variable importance
varImpPlot(arf)

# class-probability predictions from the decision tree and the random forest
train$p <- predict(model_dt, train, type="prob")
summary(train$p)
test$p_rf <- predict(arf, test, type="prob")
summary(test$p_rf)

----------------------- End Of Report --------------------------