Text mining is the process of extracting high-quality, meaningful information and patterns from text.
Text analysis involves information retrieval, analysis of word frequency distributions, pattern recognition, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics.
A research study applying Text Mining and Machine Learning tools.
The authors find that loan applicants' choice of words reveals insights into their intentions, circumstances, and personality.
This information is powerful in predicting loan repayment, going beyond typical financial and demographic factors.
Potential borrowers submit a loan request for a specific amount, stating the maximum interest rate they are willing to pay.
The loan amount they wish to borrow must fall within a set range (between $1,000 and $25,000 in the data).
Prosper verifies all financial information, including the potential borrower’s credit score.
Textual variables:
The number of characters in the title and the text box.
The percentage of words with six or more letters.
SMOG: This measures writing quality by mapping it to the number of years of formal education needed to easily understand the text on first reading.
Count of spelling mistakes.
Bigrams: Two-word combinations (help to understand the context and the pattern).
Financial variables:
Loan amount, borrower’s credit grade, debt-to-income ratio.
Demographic variables:
Gender, age, location, race.
Aim:
To evaluate whether the text used by borrowers in their loan application predicts their loan default.
Machine Learning Methods:
Ensemble stacking approach
Train each model on the calibration data (two logistic regressions and three tree-based methods).
Build a weighting model to combine the models calibrated in the first step.
2. TABLE OF CONTENTS
1. What is Text Mining?
2. Introduction
3. Main Models Used
4. Key Contributions
5. Marketing and Non-marketing Applications
6. Limitations
7. Avenues for future research
8. Key Takeaways
5. WHAT IS TEXT MINING?
Text mining is the process of extracting high-quality,
meaningful information and patterns from text.
Text analysis involves information retrieval, analysis
of word frequency distributions, pattern
recognition, information extraction, data mining
techniques including link and association analysis,
visualization, and predictive analytics.
7. INTRODUCTION
● A research study applying Text Mining and
Machine Learning tools.
● The authors find that loan applicants' choice
of words reveals insights into their intentions,
circumstances, and personality.
● This information is powerful in predicting
loan repayment, going beyond typical
financial and demographic factors.
8. Setting and Data
1. Potential borrowers submit a loan request for a specific amount,
stating the maximum interest rate they are willing to pay.
2. The loan amount they wish to borrow must fall within a set range
(between $1,000 and $25,000 in the data).
3. Prosper verifies all financial information, including the potential
borrower’s credit score.
9. Textual, Financial, and Demographic Variables
1. Textual variables:
a. The number of characters in the title and the text box.
b. The percentage of words with six or more letters.
c. SMOG: This measures writing quality by mapping it to the number of years of formal
education needed to easily understand the text on first reading.
d. Count of spelling mistakes.
e. Bigrams: Two-word combinations (help to understand the context and the pattern).
2. Financial variables:
a. Loan amount, borrower’s credit grade, debt-to-income ratio.
3. Demographic variables:
a. Gender, age, location, race.
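A minimal sketch of how several of the textual variables above could be computed, using only the Python standard library. The function names, the tokenizer, and the vowel-group syllable counter are illustrative assumptions, not the authors' code; the SMOG line uses the standard published formula, and the sample text is made up. Spelling-mistake counts are left out because they need an external dictionary.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphabetic tokens; apostrophes are kept inside words.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def count_syllables(word):
    # Heuristic (assumption): one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def smog_grade(text):
    # Standard SMOG formula; splitting on . ! ? only approximates sentences.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    polysyllables = sum(1 for w in tokenize(text) if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


def text_features(title, body):
    # Assumes the body contains at least one word and one sentence.
    words = tokenize(body)
    return {
        "n_chars": len(title) + len(body),
        "pct_long_words": 100 * sum(len(w) >= 6 for w in words) / len(words),
        "smog": smog_grade(body),
        "bigrams": Counter(zip(words, words[1:])),
    }


title = "Roof repair"
body = "I need to borrow money. I pay all bills on time."
features = text_features(title, body)
```

Here `features["bigrams"]` is a counter keyed by word pairs such as `("pay", "all")`. In this sketch bigrams are formed across sentence boundaries, which a more careful implementation might avoid.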
10. PROCESS OF TEXT MINING
Process 01
The tm package in R was used to select distinct words in each loan application.
Process 02
Porter’s stemming algorithm was used to collapse variations of words into one,
e.g., “borrower,” “borrowed,” “borrowing,” and “borrowers” become “borrow”
(3.5M words → 30,920 unique words and 1,052 bigrams).
Process 03
The PyEnchant 1.6.6 package in Python was used to count spelling mistakes in the
loan applications. Spelling mistakes may serve as a proxy for characteristics
correlated with lower income.
Process 04
The authors used term frequency-inverse document frequency (tf-idf) to weigh how
often a word is used in a loan request against how often it is used across all
loan requests, taking the length of the request into account.
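The tf-idf weighting can be illustrated with a short sketch. It uses one common variant (term count divided by document length, times the log of the inverse document frequency); the slides do not state the exact formula the authors used, so treat the weighting, the function name `tf_idf`, and the toy stemmed requests as assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Length-normalized term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up, already stemmed loan requests (not the paper's data).
requests = [
    ["need", "loan", "fix", "roof"],
    ["need", "loan", "pay", "bill"],
    ["promis", "pay", "loan", "back"],
]
weights = tf_idf(requests)
```

A word that appears in every request, such as "loan" here, gets a weight of zero, while a word unique to one request, such as "roof", gets the highest weight in that request.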
12. MODEL 1 - Predictive model
Aim:
To evaluate whether the text used by borrowers in their loan application predicts
their loan default.
Machine Learning Methods:
Ensemble stacking approach
1. Train each model on the calibration data (two logistic regressions and three
tree-based methods).
2. Build a weighting model to combine the models calibrated in the first step.
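The two-step stacking idea can be sketched as follows. This is not the authors' implementation: the base-model predictions are made-up numbers standing in for the output of step 1, only two hypothetical base models are shown instead of five, and the weighting model is assumed here to be a plain logistic regression fitted by gradient descent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    # weights = [bias, w1, w2, ...]
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Batch gradient-descent logistic regression; returns [bias, w1, ...]."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict(weights, x) - y
            grad[0] += error
            for j, v in enumerate(x):
                grad[j + 1] += error * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Step 1 (assumed already done): each base model outputs a default
# probability for held-out loans. Columns: hypothetical model A, model B.
base_predictions = [
    [0.9, 0.6], [0.8, 0.7], [0.7, 0.2],
    [0.3, 0.4], [0.2, 0.5], [0.1, 0.3],
]
defaulted = [1, 1, 1, 0, 0, 0]

# Step 2: the weighting model learns how much to trust each base model.
stack = fit_logistic(base_predictions, defaulted)
combined = [predict(stack, row) for row in base_predictions]
```

In practice the weighting model would be fitted on predictions for loans that the base models did not see during training, so that it does not simply reward overfitted base models.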
13. Result
Source: Netzer, O., Lemaire, A., & Herzenstein, M. (2019). When words sweat: Identifying signals for loan default in the text of loan applications. Journal of Marketing Research,
56(6), 960-980.
14. Result
15. MODEL 2 - Words and writing styles of defaulted loan requests
Aim:
Learn which words, writing styles, and general ideas conveyed by the text are more
likely to be associated with defaulted loan requests.
Machine Learning Methods:
1) Machine learning tools
Naive Bayes
L1-regularized binary logistic model
2) Standard econometric tools
Logistic regression on topics extracted from
a latent Dirichlet allocation (LDA) analysis
and on the sub-dictionaries of the Linguistic
Inquiry and Word Count (LIWC) dictionary.
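As an illustration of how a naive Bayes model can surface words associated with default, the sketch below computes a smoothed per-word log-odds score. The function name `word_log_odds`, the add-one smoothing, and the toy documents and labels are assumptions made for the example; they are not the paper's data or results.

```python
import math
from collections import Counter


def word_log_odds(documents, labels, smoothing=1.0):
    """Per-word log ratio of P(word | default) to P(word | repaid)."""
    counts = {0: Counter(), 1: Counter()}
    for doc, label in zip(documents, labels):
        counts[label].update(doc)
    vocab = set(counts[0]) | set(counts[1])
    # Smoothed total word count per class.
    totals = {c: sum(counts[c].values()) + smoothing * len(vocab) for c in (0, 1)}
    return {
        word: math.log((counts[1][word] + smoothing) / totals[1])
        - math.log((counts[0][word] + smoothing) / totals[0])
        for word in vocab
    }


# Made-up stemmed requests; 1 = defaulted, 0 = repaid.
docs = [
    ["god", "bless", "promis", "pay"],
    ["promis", "help", "pay"],
    ["pay", "bill", "time"],
    ["lower", "rate", "bill"],
]
labels = [1, 1, 0, 0]
log_odds = word_log_odds(docs, labels)
ranked = sorted(log_odds, key=log_odds.get, reverse=True)
```

Words at the start of `ranked` are relatively more frequent in the toy "defaulted" requests, and words at the end are relatively more frequent in the toy "repaid" requests.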
16. Result
17. MODEL 3 - Potential Borrower’s Personality
Aim:
Further exploration of potential traits and states of borrowers.
Machine Learning Methods:
Applying the LIWC dictionaries.
Results:
Defaulting loan requests are written in a manner consistent
with the writing styles of extroverts and liars.
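Dictionary-based scoring of the kind LIWC performs can be sketched as the share of words in a text that belong to each category. The real LIWC word lists are proprietary, so the categories and words below are a hypothetical mini-dictionary, and `category_shares` is an illustrative name rather than part of any LIWC tool.

```python
def category_shares(tokens, dictionary):
    """Percentage of tokens that fall in each dictionary category."""
    return {
        category: 100 * sum(token in words for token in tokens) / len(tokens)
        for category, words in dictionary.items()
    }


# Hypothetical mini-dictionary (assumption); not the actual LIWC lists.
toy_dictionary = {
    "family": {"married", "boys", "wife", "son"},
    "money": {"loan", "pay", "bills", "borrow"},
}
tokens = "i am married and have two boys i will pay the loan".split()
shares = category_shares(tokens, toy_dictionary)
```

Each request then gets one score per category, and those scores can be used as predictors of default in a regression.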
19. Analyzing applications
Borrower 1: “I am a hard working person, married for 25 years, and have
two wonderful boys. Please let me explain why I need help. I would use
the $2,000 loan to fix our roof. Thank you, God bless you, and I promise to
pay you back.”
Borrower 2: “While the past year in our new place has been more than
great, the roof is now leaking and I need to borrow $2,000 to cover the
cost of the repair. I pay all bills (e.g., car loans, cable, utilities) on time.”
Which borrower is more likely to default?
20. KEY CONTRIBUTIONS
Textual information
in the loan application
significantly helps
predict loan default.
21. KEY CONTRIBUTIONS
Words indicative of
loan repayment.
30. LIMITATIONS
1. Text data may not be available for all loan
applications, as some borrowers may not
provide any text or may provide incomplete
or inaccurate information.
2. Text data may be subject to
interpretation and bias, as different lenders
may interpret the same text differently
based on their own biases and assumptions.
3. The use of text data to predict loan
default raises ethical and legal concerns.
32. FURTHER RESEARCH
● Extending the predictive ability of text
analysis regarding future behavior
to other behaviors and industries.
● Extending the results to other types of
communication, e.g., phone calls
and online chats.
● Studying how word usage can change
over time.
33. FURTHER RESEARCH
● Exploring the role of emotions and
mental states in financial behaviors.
● Investigating the impact of different
writing styles on loan default.
● Applying the findings to other
loan types and platforms.
● Developing more accurate and
efficient text-mining and machine
learning tools for analyzing loan
applications.
35. KEY TAKEAWAYS
● Text mining and machine learning tools can be
employed to predict psychographics, including
the likelihood of future loan defaults.
36. KEY TAKEAWAYS
● The LIWC dictionaries associated with
extroversion and deception are significantly
correlated with default.
37. KEY TAKEAWAYS
● There may be variables that are affected by
both the observable text and unobservable
personality traits.