SlideShare ist ein Scribd-Unternehmen logo
1 von 5
Downloaden Sie, um offline zu lesen
IT
[1]@gsantosgo
Information Tecnology
Information Tecnology
Data Analysis
Title: What are the variables associated with the interest rate of a loan?
Identification and quantification associated variables.
Introduction:
Who didn't apply for a loan from a bank? How much data did you have to provide to the bank about you? ,..
Imagine foramomentthat you wantto be abankeroryou wantto lendmoney to people. The firstquestion could
be: How will I determine the interest rate of aloan foran applicant?, the secondquestion couldbe: Whatare the
variables associated with the interest of a loan?, and the third question could be: How can we identify and
quantifythevariablesassociatedwiththeinterestofaloan?
In this analysis, the interest rate is our response variable, and the result of this variable depends on one or more
independent variables. In our case, we have several explanatory variables about each applicant such as the
amountof money requestedinthe loanapplication, the amountof money loanedto the individual, the termof the
loan (time), the number of open lines of credit the applicant had at the time of application, the duration of
employed time at current job, the monthly income of the applicant, the measure of the creditworthiness of the
applicant,…
The interest rate of a loan depends on the level of the applicant´s risk, that is if an applicant has good monthly
income, agoodmeasure of creditworthiness,… thenthisapplicantwill have lessriskof non-paymentitsloanand
thereforetheinterestratewill belowerthananapplicanthasmoreriskofnon-payment.
So, in this analysis we performed a study to understand the associations or the relationships of the various
available variablesinthe datasetthatcanhelpusto determine the interestof loan. We usedexploratory analysis
to understand better the data of data set and get intimately familiar with them. We performed basic data
summaries and visualization (by plotting), and we could detect trends, patterns, anomalies and outliers. Using
multiple regression methods we obtained a regression model that could us there are significant relationship
betweenthe interestrate of loanandseveral associatedvariablessuchasthe amount(indollars)requestedinthe
loan application, the amount(indollars) loanedto the individual, the percentage of consumer’sgross income that
goestowardpaying debts [1], thenumberof openlinesofthecreditof the applicant, thetotal amountoutstanding
all lines of credit [2], the number of authorizedqueries about applicant creditworthiness in the term of 6 months
[3], thetermofloan(inmonths) andtheFICO score[4](themeasureofthecreditworthinessoftheapplicant).
IT
[2]@gsantosgo
Information Tecnology
Information Tecnology
Methods:
DataCollection
Forthisanalysiswe usedadatasetonpeer-to-peerloansissuedthroughthe platformLending Club[5]. Thisdata
set contains information of people applying for a loan such as the duration of employed time at current job, the
measure of their creditworthiness (FICO Score), their monthly incomes,… which are used to determine the
associated interest rate to a loan. The data were downloaded from coursera.org [6]in Data Analysis Course on
February13, 2013 using theR programming language.
ExploratoryAnalysis
Exploratory analysis was performed by examining tables and plots of the observed data. We identified
transformations to perform on the raw data on the basis of plots and knowledge of the scale of measured
variables. Exploratoryanalysiswasusedto(1)identify missing values, (2)verify thequality ofthedata, (3)perform
transformationsof the skeweddistributionsand(4) determine the termsusedinthe regressionmodel relating the
interestrateandotherassociatedvariables.
Statistical Modeling
To identify and quantify the association between the interest rate of a loan and other variables, we performed a
standard multivariate linear regression model [7]. Model selection was performed on the basis of our exploratory
analysisandpriorknowledgeoftherelationshipbetweentheamount(indollars)requestedintheloanapplication,
the amount (in dollars) loaned to the individual, the percentage of consumer’s gross income that goes toward
paying debts, the numberof openlinesof the creditthe applicant, the total amountoutstanding all linesof credit,
the number of authorized queries about applicant creditworthiness in the term of 6 months, the term of loan (in
months) and the FICO score (the measure of the creditworthiness of the applicant). Coefficients were estimated
withordinaryleastsquaresandstandarderrorswerecalculatedusing standardasymptotic approximations[8][9].
Reproducibility
All analysesperformedinthismanuscriptarereproducedintheR markdownfileloansLendingClubFinal.Rmd[10].
Note. Due to security concerns with the exchange of R code, we don’t submit code to reproduce analysis, in this
dataanalysis.
Results:
Thesampleofloansdatausedinthisanalysisisatotal size2500 observationswith14variablesthatcontainsthe
following information, theamountofmoney (indollars)requestedintheloanapplication(Amount.Requested),
theamountofmoney (indollars)loanedtotheindividual (Amount.Funded.By.Investors), thelending interestrate
IT
[3]@gsantosgo
Information Tecnology
Information Tecnology
(Interest.Rate), thelengthoftime(inmonths)oftheloan(Loan.Length), thepurposeoftheload(Loan.Purpose),
thepercentageofconsumer’sgrossincomethatgoestowardpaying debts[1](Debt.To.Income.Ratio), the
abbreviationfortheU.S. stateofresidenceoftheloanapplicant[11](State), avariableindicating whetherthe
applicantowns, rents, orhasamortgageontheirhome(Home.Ownership), themonthly income(indollars)ofthe
applicant, arangeindicating theapplicantsFICO score[4](FICO.Range), thenumberofopenlinesofcreditthe
applicanthadatthetimeofapplication(Open.CREDIT.Lines), thetotal amountoutstanding all linesofcredit[2]
(Revolving.CREDIT.Balance), thenumberofinquiresaboutthecreditworthinessoftheapplicantinthetermof6
monthsbeforetheloanwasissued [3](Inquiries.in.the.Last.6.Months)andthedurationofemployedtimeat
currentjob(Employment.Length). Weeliminatedmissing valuesinthedataset, specifically therewereseven
missing values(NA)thataffectedonlytworows. Therewerealsoincorrectdatainobservationsrelatedtothe
amount(indollars)loaned(Amount.Funded.By.Investors)withvaluesof0.0 and-1.0, wecorrectedthesevalues
withthesamevalueastheamount(indollars)requested(Amount.Requested). Inthedurationofemployedtime
(Employment.Length), wefoundrowsinthedatasetwith"n/a" valuesandherewedidn'tdoanything onthem.
Too, duetodetectedoutliersforthemonthlyincomeandthetotal amountoutstanding ofall linecreditvariables,
weremovedinvolvedrows.
Thedistributionsoftheamountrequested, theamountloaned, themonthly income, thenumberofopenlinesof
creditandthetotal amountoutstanding all linesofcreditwereright-skewed, weperformedtransformations
(log10)onthesevariablestoimprovetheirnormality. Inthecaseofthenumberofauthorizedqueriesintheterm
of6monthsweappliedonatransformationofsquaredroot(sqrt)type, becauseitisacountvariable.
Afterweperformedalotregressionmodelsforthisdataset, ourfinal regressionmodel was:
Term Description
b0 Intercept
b1 The change in the interest rate of loan with a change of 1 unit in log base 10 dollars for the amount requested in the
loan
b2 The change in the interest rate of loan with a change of 1 unit in log base 10 dollars for the amount loaned to the
individual
b3 Thechangeintheinterestrateof loanwithachangeof1 unitinlog base 10 percentforthepercentageofconsumer’s
grossincome
b4 Thechangeintheinterestrateofloanwithachangeof 1unitinlog base 10 unitsforthe numberopenlines
b5 The change in the interest rate of loan with a change of 1 unit in log base dollars for the total amount outstanding all
linesofcredit
b6 The change in the interest rate of loan with a change of 1 unit in squared root units for the number of authorized
queries
f(Loan.Length) Representfactormodels(thelengthoftimeoftheloan) with2 levels(36monthsand60 months)
IT
[4]@gsantosgo
Information Tecnology
Information Tecnology
g(FICO.Range) Represent factor models (a range indicating the applicants FICO Score) with 36 levels (640-644, 645-649, …….
820-824and830-834)
e Termerror. Representsall sourcesofunmeasuredandunmodeledrandomvariationininterestrateofloan
Theresultofourmultiplelinearregressionmodel offersus, thelinearrelationshipbetweentheinterestrateand
all previoustermsthatarehighlystatisticallysignificant(p-value= 2.2e-16)andhowwell theregressionmodel
fitstheobserveddatais0.7942. Weobservedastrong associationbetweentheinterestrateandFICO score.
Conclusions:
Clearly, in this analysis, there is a strong association between the interest rate of a loan and the FICO Score
Range. To fit still better our regression model we included other variables such as the amount of money
requested, the amount of money loaned, the DTI (Debt To Income Ratio), the number of open lines of credit the
applicant, the total amountoutstanding all linesofcredit, thenumberof authorizedqueriesandthe lengthoftime
employed. We included these last variables in the our regression model, to get a high R-squared percentage
(around 80%) and highly statistically significant, in spite of what some included variables have association with
theinterestratelessthantheFICO ScoreRange.
References
[1]Debt-to-Incomeratio
http://en.wikipedia.org/wiki/Debt-to-income_ratio. Accessed 02/17/2013
[2]Revolving CreditBalance
http://www.ehow.com/about_7550001_revolving-credit-balance.html. Accessed 02/17/2013
[3]Inquiriesinthelast6months
http://www.myfico.com/crediteducation/creditinquiries.aspx. Accessed 02/17/2013
[4]FICO Score
http://en.wikipedia.org/wiki/Credit_score_in_the_United_States. Accessed 02/17/2013
[5]Lending Club
https://www.lendingclub.com/. Accessed 02/17/2013
[6]Datasetofloans
https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv. Accessed02/17/2013
[7]Jank, Wolfgang. BusinessAnalyticsforManagers. Springer, 2011
[8]Ferguson, ThomasS. ACourseinLargeSampleTheory: TextsinStatistical Science. Vol.l 38. Chapman&
Hall/CRC, 1996.
[9]Shahbaba, Babak. BiostatisticswithR. AnIntroductiontoStatisticsThroughBiological Data. Springer, 2012
[10]R MarkdownPage.
http://www.rstudio.com/ide/docs/authoring/using_markdown. Accessed02/13/2013
[11]TheabbreviationforU.S. state
IT
[5]@gsantosgo
Information Tecnology
Information Tecnology
http://www.50states.com/abbreviations.htm#.USHpnWd1z2R. Accessed 02/17/2013
Figure 1 (a) Plot of fitted regression model relating log base Interest Rate and 10 Amount Request (Included
confidence bands). (b) Plot fitted regression model relating Interest Rate and FICO Score Range (Included
confidence bands). (c) Plot of fitted regression model relating Interest Rate and Debt To Income (Included
confidencebands). (d) A lotoftheresidualslinearregressionmodel relating fittedvalues.

Weitere ähnliche Inhalte

Was ist angesagt?

The Influence of Solvency Ratio Decision on Rural Bank Dinar Pusaka In The Di...
The Influence of Solvency Ratio Decision on Rural Bank Dinar Pusaka In The Di...The Influence of Solvency Ratio Decision on Rural Bank Dinar Pusaka In The Di...
The Influence of Solvency Ratio Decision on Rural Bank Dinar Pusaka In The Di...inventionjournals
 
CECL Project Overview
CECL Project OverviewCECL Project Overview
CECL Project OverviewRohit Khurana
 
Benchmark the Actual Bond Prices
Benchmark the Actual Bond PricesBenchmark the Actual Bond Prices
Benchmark the Actual Bond PricesRan Zhang
 
The ability of previous quarterly earnings, net interest margin, and average ...
The ability of previous quarterly earnings, net interest margin, and average ...The ability of previous quarterly earnings, net interest margin, and average ...
The ability of previous quarterly earnings, net interest margin, and average ...RyanMHolcomb
 
Data Engineering - Data Mining Assignment
Data Engineering - Data Mining AssignmentData Engineering - Data Mining Assignment
Data Engineering - Data Mining AssignmentDarran Mottershead
 
Game Theory Approach for Identity Crime Detection
Game Theory Approach for Identity Crime DetectionGame Theory Approach for Identity Crime Detection
Game Theory Approach for Identity Crime DetectionIOSR Journals
 
Business research report proposal customer delight in banking
Business research report proposal customer delight in bankingBusiness research report proposal customer delight in banking
Business research report proposal customer delight in bankingGagan Dharwal
 
A Holistic Approach to Property Valuations
A Holistic Approach to Property ValuationsA Holistic Approach to Property Valuations
A Holistic Approach to Property ValuationsCognizant
 
IRJET- Fraud Detection Algorithms for a Credit Card
IRJET- Fraud Detection Algorithms for a Credit CardIRJET- Fraud Detection Algorithms for a Credit Card
IRJET- Fraud Detection Algorithms for a Credit CardIRJET Journal
 
Financial incentives and loan officer behavior: multitasking and allocation o...
Financial incentives and loan officer behavior: multitasking and allocation o...Financial incentives and loan officer behavior: multitasking and allocation o...
Financial incentives and loan officer behavior: multitasking and allocation o...FGV Brazil
 
Best Keyword Cover Search
Best Keyword Cover SearchBest Keyword Cover Search
Best Keyword Cover Search1crore projects
 
IRJET- Credit Card Fraud Detection using Random Forest
IRJET-  	  Credit Card Fraud Detection using Random ForestIRJET-  	  Credit Card Fraud Detection using Random Forest
IRJET- Credit Card Fraud Detection using Random ForestIRJET Journal
 

Was ist angesagt? (14)

Credit defaulter analysis
Credit defaulter analysisCredit defaulter analysis
Credit defaulter analysis
 
The Influence of Solvency Ratio Decision on Rural Bank Dinar Pusaka In The Di...
The Influence of Solvency Ratio Decision on Rural Bank Dinar Pusaka In The Di...The Influence of Solvency Ratio Decision on Rural Bank Dinar Pusaka In The Di...
The Influence of Solvency Ratio Decision on Rural Bank Dinar Pusaka In The Di...
 
CECL Project Overview
CECL Project OverviewCECL Project Overview
CECL Project Overview
 
Benchmark the Actual Bond Prices
Benchmark the Actual Bond PricesBenchmark the Actual Bond Prices
Benchmark the Actual Bond Prices
 
The ability of previous quarterly earnings, net interest margin, and average ...
The ability of previous quarterly earnings, net interest margin, and average ...The ability of previous quarterly earnings, net interest margin, and average ...
The ability of previous quarterly earnings, net interest margin, and average ...
 
Data Engineering - Data Mining Assignment
Data Engineering - Data Mining AssignmentData Engineering - Data Mining Assignment
Data Engineering - Data Mining Assignment
 
Game Theory Approach for Identity Crime Detection
Game Theory Approach for Identity Crime DetectionGame Theory Approach for Identity Crime Detection
Game Theory Approach for Identity Crime Detection
 
Business research report proposal customer delight in banking
Business research report proposal customer delight in bankingBusiness research report proposal customer delight in banking
Business research report proposal customer delight in banking
 
A Holistic Approach to Property Valuations
A Holistic Approach to Property ValuationsA Holistic Approach to Property Valuations
A Holistic Approach to Property Valuations
 
IRJET- Fraud Detection Algorithms for a Credit Card
IRJET- Fraud Detection Algorithms for a Credit CardIRJET- Fraud Detection Algorithms for a Credit Card
IRJET- Fraud Detection Algorithms for a Credit Card
 
Financial incentives and loan officer behavior: multitasking and allocation o...
Financial incentives and loan officer behavior: multitasking and allocation o...Financial incentives and loan officer behavior: multitasking and allocation o...
Financial incentives and loan officer behavior: multitasking and allocation o...
 
Best Keyword Cover Search
Best Keyword Cover SearchBest Keyword Cover Search
Best Keyword Cover Search
 
IRJET- Credit Card Fraud Detection using Random Forest
IRJET-  	  Credit Card Fraud Detection using Random ForestIRJET-  	  Credit Card Fraud Detection using Random Forest
IRJET- Credit Card Fraud Detection using Random Forest
 
B05840510
B05840510B05840510
B05840510
 

Ähnlich wie Data Analysis. Regression. LendingClub Loans

IRJET- Prediction of Credit Risks in Lending Bank Loans
IRJET- Prediction of Credit Risks in Lending Bank LoansIRJET- Prediction of Credit Risks in Lending Bank Loans
IRJET- Prediction of Credit Risks in Lending Bank LoansIRJET Journal
 
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING mlaij
 
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101Heather Lamoureux
 
Mba lecture notes about all the stuffs and learn
Mba lecture notes about all the stuffs and learnMba lecture notes about all the stuffs and learn
Mba lecture notes about all the stuffs and learnNIBMBusinessnetworkP
 
fast publication journals
fast publication journalsfast publication journals
fast publication journalsrikaseorika
 
Barclays - Case Study Competition | ISB | National Finalist
Barclays - Case Study Competition | ISB | National FinalistBarclays - Case Study Competition | ISB | National Finalist
Barclays - Case Study Competition | ISB | National FinalistNaveen Kumar
 
Using a Survival Model for Credit Risk Scoring and Loan Pricing Instead of XG...
Using a Survival Model for Credit Risk Scoring and Loan Pricing Instead of XG...Using a Survival Model for Credit Risk Scoring and Loan Pricing Instead of XG...
Using a Survival Model for Credit Risk Scoring and Loan Pricing Instead of XG...CFO Pro+Analytics
 
Modelling Credit Risk of Portfolios of Consumer Loans
Modelling Credit Risk of Portfolios of Consumer LoansModelling Credit Risk of Portfolios of Consumer Loans
Modelling Credit Risk of Portfolios of Consumer LoansMadhur Malik
 
Consumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random ForestConsumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random ForestHirak Sen Roy
 
PROBABILISTIC CREDIT SCORING FOR COHORTS OF BORROWERS
PROBABILISTIC CREDIT SCORING FOR COHORTS OF BORROWERSPROBABILISTIC CREDIT SCORING FOR COHORTS OF BORROWERS
PROBABILISTIC CREDIT SCORING FOR COHORTS OF BORROWERSAndresz26
 
Credit risk assessment with imbalanced data sets using SVMs
Credit risk assessment with imbalanced data sets using SVMsCredit risk assessment with imbalanced data sets using SVMs
Credit risk assessment with imbalanced data sets using SVMsIRJET Journal
 
credit scoring paper published in eswa
credit scoring paper published in eswacredit scoring paper published in eswa
credit scoring paper published in eswaAkhil Bandhu Hens, FRM
 

Ähnlich wie Data Analysis. Regression. LendingClub Loans (20)

Programming for big data
Programming for big dataProgramming for big data
Programming for big data
 
IRJET- Prediction of Credit Risks in Lending Bank Loans
IRJET- Prediction of Credit Risks in Lending Bank LoansIRJET- Prediction of Credit Risks in Lending Bank Loans
IRJET- Prediction of Credit Risks in Lending Bank Loans
 
Credit iconip
Credit iconipCredit iconip
Credit iconip
 
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING
 
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
Report-7-B-Searching-for-Harm-in-Storefront-Payday-Lending-nonPrime101
 
Mba lecture notes about all the stuffs and learn
Mba lecture notes about all the stuffs and learnMba lecture notes about all the stuffs and learn
Mba lecture notes about all the stuffs and learn
 
Loan Characteristics as Predictors of Default in Commercial Mortgage Portfolios
Loan Characteristics as Predictors of Default in Commercial Mortgage PortfoliosLoan Characteristics as Predictors of Default in Commercial Mortgage Portfolios
Loan Characteristics as Predictors of Default in Commercial Mortgage Portfolios
 
B05840510
B05840510B05840510
B05840510
 
Data mining on Financial Data
Data mining on Financial DataData mining on Financial Data
Data mining on Financial Data
 
fast publication journals
fast publication journalsfast publication journals
fast publication journals
 
Barclays - Case Study Competition | ISB | National Finalist
Barclays - Case Study Competition | ISB | National FinalistBarclays - Case Study Competition | ISB | National Finalist
Barclays - Case Study Competition | ISB | National Finalist
 
Using a Survival Model for Credit Risk Scoring and Loan Pricing Instead of XG...
Using a Survival Model for Credit Risk Scoring and Loan Pricing Instead of XG...Using a Survival Model for Credit Risk Scoring and Loan Pricing Instead of XG...
Using a Survival Model for Credit Risk Scoring and Loan Pricing Instead of XG...
 
Modelling Credit Risk of Portfolios of Consumer Loans
Modelling Credit Risk of Portfolios of Consumer LoansModelling Credit Risk of Portfolios of Consumer Loans
Modelling Credit Risk of Portfolios of Consumer Loans
 
Consumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random ForestConsumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random Forest
 
Credit iconip
Credit iconipCredit iconip
Credit iconip
 
Credit iconip
Credit iconipCredit iconip
Credit iconip
 
PROBABILISTIC CREDIT SCORING FOR COHORTS OF BORROWERS
PROBABILISTIC CREDIT SCORING FOR COHORTS OF BORROWERSPROBABILISTIC CREDIT SCORING FOR COHORTS OF BORROWERS
PROBABILISTIC CREDIT SCORING FOR COHORTS OF BORROWERS
 
Credit risk assessment with imbalanced data sets using SVMs
Credit risk assessment with imbalanced data sets using SVMsCredit risk assessment with imbalanced data sets using SVMs
Credit risk assessment with imbalanced data sets using SVMs
 
credit scoring paper published in eswa
credit scoring paper published in eswacredit scoring paper published in eswa
credit scoring paper published in eswa
 
Fairness definitions explained
Fairness definitions explainedFairness definitions explained
Fairness definitions explained
 

Mehr von Guillermo Santos

Handwritten Digit recognition with R. Classification Problem
Handwritten Digit recognition with R. Classification ProblemHandwritten Digit recognition with R. Classification Problem
Handwritten Digit recognition with R. Classification ProblemGuillermo Santos
 
MadridJUG Mineria de Datos-Data Mining.09.may.2013
MadridJUG Mineria de Datos-Data Mining.09.may.2013MadridJUG Mineria de Datos-Data Mining.09.may.2013
MadridJUG Mineria de Datos-Data Mining.09.may.2013Guillermo Santos
 
Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...
Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...
Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...Guillermo Santos
 
Instalación R y RStudio en Windows
Instalación R y RStudio en WindowsInstalación R y RStudio en Windows
Instalación R y RStudio en WindowsGuillermo Santos
 
Presentación Geolocalización Noticias (geo news).2012
Presentación Geolocalización Noticias (geo news).2012Presentación Geolocalización Noticias (geo news).2012
Presentación Geolocalización Noticias (geo news).2012Guillermo Santos
 
Algoritmos Aprendizaje Automático.2012
Algoritmos Aprendizaje Automático.2012Algoritmos Aprendizaje Automático.2012
Algoritmos Aprendizaje Automático.2012Guillermo Santos
 
Kettle. Recuperación y Procesado de datos.2012
Kettle. Recuperación y Procesado de datos.2012Kettle. Recuperación y Procesado de datos.2012
Kettle. Recuperación y Procesado de datos.2012Guillermo Santos
 

Mehr von Guillermo Santos (7)

Handwritten Digit recognition with R. Classification Problem
Handwritten Digit recognition with R. Classification ProblemHandwritten Digit recognition with R. Classification Problem
Handwritten Digit recognition with R. Classification Problem
 
MadridJUG Mineria de Datos-Data Mining.09.may.2013
MadridJUG Mineria de Datos-Data Mining.09.may.2013MadridJUG Mineria de Datos-Data Mining.09.may.2013
MadridJUG Mineria de Datos-Data Mining.09.may.2013
 
Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...
Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...
Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...
 
Instalación R y RStudio en Windows
Instalación R y RStudio en WindowsInstalación R y RStudio en Windows
Instalación R y RStudio en Windows
 
Presentación Geolocalización Noticias (geo news).2012
Presentación Geolocalización Noticias (geo news).2012Presentación Geolocalización Noticias (geo news).2012
Presentación Geolocalización Noticias (geo news).2012
 
Algoritmos Aprendizaje Automático.2012
Algoritmos Aprendizaje Automático.2012Algoritmos Aprendizaje Automático.2012
Algoritmos Aprendizaje Automático.2012
 
Kettle. Recuperación y Procesado de datos.2012
Kettle. Recuperación y Procesado de datos.2012Kettle. Recuperación y Procesado de datos.2012
Kettle. Recuperación y Procesado de datos.2012
 

Kürzlich hochgeladen

Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 

Kürzlich hochgeladen (20)

Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 

Data Analysis. Regression. LendingClub Loans

  • 1. IT [1]@gsantosgo Information Tecnology Information Tecnology Data Analysis Title: What are the variables associated with the interest rate of a loan? Identification and quantification associated variables. Introduction: Who didn't apply for a loan from a bank? How much data did you have to provide to the bank about you? ,.. Imagine foramomentthat you wantto be abankeroryou wantto lendmoney to people. The firstquestion could be: How will I determine the interest rate of aloan foran applicant?, the secondquestion couldbe: Whatare the variables associated with the interest of a loan?, and the third question could be: How can we identify and quantifythevariablesassociatedwiththeinterestofaloan? In this analysis, the interest rate is our response variable, and the result of this variable depends on one or more independent variables. In our case, we have several explanatory variables about each applicant such as the amountof money requestedinthe loanapplication, the amountof money loanedto the individual, the termof the loan (time), the number of open lines of credit the applicant had at the time of application, the duration of employed time at current job, the monthly income of the applicant, the measure of the creditworthiness of the applicant,… The interest rate of a loan depends on the level of the applicant´s risk, that is if an applicant has good monthly income, agoodmeasure of creditworthiness,… thenthisapplicantwill have lessriskof non-paymentitsloanand thereforetheinterestratewill belowerthananapplicanthasmoreriskofnon-payment. So, in this analysis we performed a study to understand the associations or the relationships of the various available variablesinthe datasetthatcanhelpusto determine the interestof loan. We usedexploratory analysis to understand better the data of data set and get intimately familiar with them. We performed basic data summaries and visualization (by plotting), and we could detect trends, patterns, anomalies and outliers. Using multiple regression methods we obtained a regression model that could us there are significant relationship betweenthe interestrate of loanandseveral associatedvariablessuchasthe amount(indollars)requestedinthe loan application, the amount(indollars) loanedto the individual, the percentage of consumer’sgross income that goestowardpaying debts [1], thenumberof openlinesofthecreditof the applicant, thetotal amountoutstanding all lines of credit [2], the number of authorizedqueries about applicant creditworthiness in the term of 6 months [3], thetermofloan(inmonths) andtheFICO score[4](themeasureofthecreditworthinessoftheapplicant).
  • 2. IT [2]@gsantosgo Information Tecnology Information Tecnology Methods: DataCollection Forthisanalysiswe usedadatasetonpeer-to-peerloansissuedthroughthe platformLending Club[5]. Thisdata set contains information of people applying for a loan such as the duration of employed time at current job, the measure of their creditworthiness (FICO Score), their monthly incomes,… which are used to determine the associated interest rate to a loan. The data were downloaded from coursera.org [6]in Data Analysis Course on February13, 2013 using theR programming language. ExploratoryAnalysis Exploratory analysis was performed by examining tables and plots of the observed data. We identified transformations to perform on the raw data on the basis of plots and knowledge of the scale of measured variables. Exploratoryanalysiswasusedto(1)identify missing values, (2)verify thequality ofthedata, (3)perform transformationsof the skeweddistributionsand(4) determine the termsusedinthe regressionmodel relating the interestrateandotherassociatedvariables. Statistical Modeling To identify and quantify the association between the interest rate of a loan and other variables, we performed a standard multivariate linear regression model [7]. Model selection was performed on the basis of our exploratory analysisandpriorknowledgeoftherelationshipbetweentheamount(indollars)requestedintheloanapplication, the amount (in dollars) loaned to the individual, the percentage of consumer’s gross income that goes toward paying debts, the numberof openlinesof the creditthe applicant, the total amountoutstanding all linesof credit, the number of authorized queries about applicant creditworthiness in the term of 6 months, the term of loan (in months) and the FICO score (the measure of the creditworthiness of the applicant). Coefficients were estimated withordinaryleastsquaresandstandarderrorswerecalculatedusing standardasymptotic approximations[8][9]. Reproducibility All analysesperformedinthismanuscriptarereproducedintheR markdownfileloansLendingClubFinal.Rmd[10]. Note. Due to security concerns with the exchange of R code, we don’t submit code to reproduce analysis, in this dataanalysis. Results: Thesampleofloansdatausedinthisanalysisisatotal size2500 observationswith14variablesthatcontainsthe following information, theamountofmoney (indollars)requestedintheloanapplication(Amount.Requested), theamountofmoney (indollars)loanedtotheindividual (Amount.Funded.By.Investors), thelending interestrate
  • 3. IT [3]@gsantosgo Information Tecnology Information Tecnology (Interest.Rate), thelengthoftime(inmonths)oftheloan(Loan.Length), thepurposeoftheload(Loan.Purpose), thepercentageofconsumer’sgrossincomethatgoestowardpaying debts[1](Debt.To.Income.Ratio), the abbreviationfortheU.S. stateofresidenceoftheloanapplicant[11](State), avariableindicating whetherthe applicantowns, rents, orhasamortgageontheirhome(Home.Ownership), themonthly income(indollars)ofthe applicant, arangeindicating theapplicantsFICO score[4](FICO.Range), thenumberofopenlinesofcreditthe applicanthadatthetimeofapplication(Open.CREDIT.Lines), thetotal amountoutstanding all linesofcredit[2] (Revolving.CREDIT.Balance), thenumberofinquiresaboutthecreditworthinessoftheapplicantinthetermof6 monthsbeforetheloanwasissued [3](Inquiries.in.the.Last.6.Months)andthedurationofemployedtimeat currentjob(Employment.Length). Weeliminatedmissing valuesinthedataset, specifically therewereseven missing values(NA)thataffectedonlytworows. Therewerealsoincorrectdatainobservationsrelatedtothe amount(indollars)loaned(Amount.Funded.By.Investors)withvaluesof0.0 and-1.0, wecorrectedthesevalues withthesamevalueastheamount(indollars)requested(Amount.Requested). Inthedurationofemployedtime (Employment.Length), wefoundrowsinthedatasetwith"n/a" valuesandherewedidn'tdoanything onthem. Too, duetodetectedoutliersforthemonthlyincomeandthetotal amountoutstanding ofall linecreditvariables, weremovedinvolvedrows. Thedistributionsoftheamountrequested, theamountloaned, themonthly income, thenumberofopenlinesof creditandthetotal amountoutstanding all linesofcreditwereright-skewed, weperformedtransformations (log10)onthesevariablestoimprovetheirnormality. Inthecaseofthenumberofauthorizedqueriesintheterm of6monthsweappliedonatransformationofsquaredroot(sqrt)type, becauseitisacountvariable. Afterweperformedalotregressionmodelsforthisdataset, ourfinal regressionmodel was: Term Description b0 Intercept b1 The change in the interest rate of loan with a change of 1 unit in log base 10 dollars for the amount requested in the loan b2 The change in the interest rate of loan with a change of 1 unit in log base 10 dollars for the amount loaned to the individual b3 Thechangeintheinterestrateof loanwithachangeof1 unitinlog base 10 percentforthepercentageofconsumer’s grossincome b4 Thechangeintheinterestrateofloanwithachangeof 1unitinlog base 10 unitsforthe numberopenlines b5 The change in the interest rate of loan with a change of 1 unit in log base dollars for the total amount outstanding all linesofcredit b6 The change in the interest rate of loan with a change of 1 unit in squared root units for the number of authorized queries f(Loan.Length) Representfactormodels(thelengthoftimeoftheloan) with2 levels(36monthsand60 months)
  • 4. IT [4]@gsantosgo Information Tecnology Information Tecnology g(FICO.Range) Represent factor models (a range indicating the applicants FICO Score) with 36 levels (640-644, 645-649, ……. 820-824and830-834) e Termerror. Representsall sourcesofunmeasuredandunmodeledrandomvariationininterestrateofloan Theresultofourmultiplelinearregressionmodel offersus, thelinearrelationshipbetweentheinterestrateand all previoustermsthatarehighlystatisticallysignificant(p-value= 2.2e-16)andhowwell theregressionmodel fitstheobserveddatais0.7942. Weobservedastrong associationbetweentheinterestrateandFICO score. Conclusions: Clearly, in this analysis, there is a strong association between the interest rate of a loan and the FICO Score Range. To fit still better our regression model we included other variables such as the amount of money requested, the amount of money loaned, the DTI (Debt To Income Ratio), the number of open lines of credit the applicant, the total amountoutstanding all linesofcredit, thenumberof authorizedqueriesandthe lengthoftime employed. We included these last variables in the our regression model, to get a high R-squared percentage (around 80%) and highly statistically significant, in spite of what some included variables have association with theinterestratelessthantheFICO ScoreRange. References [1]Debt-to-Incomeratio http://en.wikipedia.org/wiki/Debt-to-income_ratio. Accessed 02/17/2013 [2]Revolving CreditBalance http://www.ehow.com/about_7550001_revolving-credit-balance.html. Accessed 02/17/2013 [3]Inquiriesinthelast6months http://www.myfico.com/crediteducation/creditinquiries.aspx. Accessed 02/17/2013 [4]FICO Score http://en.wikipedia.org/wiki/Credit_score_in_the_United_States. Accessed 02/17/2013 [5]Lending Club https://www.lendingclub.com/. Accessed 02/17/2013 [6]Datasetofloans https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv. Accessed02/17/2013 [7]Jank, Wolfgang. BusinessAnalyticsforManagers. Springer, 2011 [8]Ferguson, ThomasS. ACourseinLargeSampleTheory: TextsinStatistical Science. Vol.l 38. Chapman& Hall/CRC, 1996. [9]Shahbaba, Babak. BiostatisticswithR. AnIntroductiontoStatisticsThroughBiological Data. Springer, 2012 [10]R MarkdownPage. http://www.rstudio.com/ide/docs/authoring/using_markdown. Accessed02/13/2013 [11]TheabbreviationforU.S. state
  • 5. IT [5]@gsantosgo Information Tecnology Information Tecnology http://www.50states.com/abbreviations.htm#.USHpnWd1z2R. Accessed 02/17/2013 Figure 1 (a) Plot of fitted regression model relating log base Interest Rate and 10 Amount Request (Included confidence bands). (b) Plot fitted regression model relating Interest Rate and FICO Score Range (Included confidence bands). (c) Plot of fitted regression model relating Interest Rate and Debt To Income (Included confidencebands). (d) A lotoftheresidualslinearregressionmodel relating fittedvalues.