Text Mining - Advanced Customer Analytics

TEXT MINING
Team 4
Syed Aqib Ali
Syeda Ramsha Habib Gilani
Lateefah Omoyosola Yusuf
Rochelle Star Velasquez

TABLE OF CONTENTS
1. What is Text Mining?
2. Introduction
3. Main Models Used
4. Key Contributions
5. Marketing and Non-marketing Applications
6. Limitations
7. Avenues for future research
8. Key Takeaways

WHAT IS TEXT MINING?
Text mining is the process of extracting high-quality,
meaningful information and patterns from text.
Text analysis involves information retrieval, lexical
analysis to study word frequency distributions, pattern
recognition, information extraction, data mining
techniques including link and association analysis,
visualization, and predictive analytics.

INTRODUCTION
● A research study applying Text Mining and
Machine Learning tools.
● The authors find that loan applicants' choice
of words reveals insights into their intentions,
circumstances, and personality.
● This information is powerful in predicting
loan repayment, going beyond typical
financial and demographic factors.

Setting and Data
1. Potential borrowers submit a loan request for a specific amount,
stating the maximum interest rate they are willing to pay.
2. The requested loan amount must fall within a set range (between
$1,000 and $25,000 in the data).
3. Prosper verifies all financial information, including the potential
borrower’s credit score.

Textual, Financial, and Demographic Variables
1. Textual variables:
a. The number of characters in the title and the text box.
b. The percentage of words with six or more letters.
c. SMOG: Measures writing quality by mapping the text to the number of years of formal
education needed to easily understand it on first reading.
d. Count of spelling mistakes.
e. Bigrams : Two-word combinations (help to understand the context and the pattern).
2. Financial variables:
a. Loan amount, borrower’s credit grade, debt-to-income ratio.
3. Demographic variables:
a. Gender, age, location, race.
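
As an illustration of the textual variables listed above, the following is a minimal Python sketch, not the authors' code. It computes the character count, the percentage of words with six or more letters, a SMOG grade, and the bigrams for one piece of text. The function names are hypothetical, the syllable counter is a crude vowel-group heuristic, and the spelling-mistake count is left out because it needs a dictionary.

```python
import math
import re


def count_syllables(word):
    # Assumption: approximate syllables by counting groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_features(text):
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(1, len(words))
    n_sentences = max(1, len(sentences))

    # SMOG grade: based on the number of words with three or more syllables,
    # normalized to a 30-sentence sample.
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    smog = 1.0430 * math.sqrt(polysyllables * 30 / n_sentences) + 3.1291

    lowered = [w.lower() for w in words]
    return {
        "n_chars": len(text),
        "pct_long_words": 100 * sum(len(w) >= 6 for w in words) / n_words,
        "smog": smog,
        "bigrams": list(zip(lowered, lowered[1:])),
    }


features = text_features("I need a loan to fix the roof. I pay all bills on time.")
```

The example sentence is invented for illustration; a real pipeline would run this over the title and text box of every loan request.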

PROCESS OF TEXT MINING
Process 01
The tm package in R was used to select the
distinct words in each loan application.
Process 02
Porter’s stemming algorithm was used to collapse
variations of a word into one stem, e.g., “borrower,”
“borrowed,” “borrowing,” and “borrowers”
become “borrow” (3.5M words → 30,920 unique
words and 1,052 bigrams).
Process 03
The PyEnchant 1.6.6 package in Python was
used to count spelling mistakes in the
loan applications. The count of misspelled
words can potentially serve as a proxy for
characteristics correlated with lower
income.
Process 04
The authors used term frequency-inverse
document frequency (tf-idf) to weight each word
by how often it is used in a loan request relative
to how often it is used across all loan requests,
taking the length of the request into account.
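
The tf-idf weighting described above can be sketched in a few lines of Python. This is an illustrative version under stated assumptions: term frequency is the word count divided by the request length, and inverse document frequency is the natural log of the number of requests over the number of requests containing the word. The exact variant used in the study is not given in these slides, and the stemmed tokens below are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Term frequency is normalized by document length.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical stemmed loan requests.
requests = [
    ["borrow", "fix", "roof"],
    ["borrow", "pay", "bill"],
    ["borrow", "promis", "pay"],
]
weights = tf_idf(requests)
```

A word that appears in every request, such as "borrow" here, gets a weight of zero, while a word unique to one request gets the highest weight in that request.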

MODEL 1 - Predictive model
Aim:
To evaluate whether the text used by borrowers in their loan application predicts
their loan default.
Machine Learning Methods:
Ensemble stacking approach
1. Train each model on the calibration data (two logistic regressions and three
tree-based methods).
2. Build a weighting model to combine the models calibrated in the first step.
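
The second step can be illustrated with a deliberately simplified Python sketch. It assumes the base models have already produced default probabilities on held-out data, and it combines only two of them with a single weight chosen by grid search on log loss. The study stacks five models and its weighting model is not described in these slides, so the function names, the two-model setup, and the numbers below are all assumptions for illustration.

```python
import math


def log_loss(y_true, probs):
    eps = 1e-12  # guard against log(0)
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(y_true, probs)
    ) / len(y_true)


def fit_blend_weight(y_true, probs_a, probs_b, steps=100):
    """Grid-search the weight w in [0, 1] for w * probs_a + (1 - w) * probs_b."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blended = [w * a + (1 - w) * b for a, b in zip(probs_a, probs_b)]
        loss = log_loss(y_true, blended)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


# Invented held-out labels (1 = default) and base-model predictions.
y_holdout = [1, 0, 1, 0]
probs_logit = [0.9, 0.2, 0.6, 0.4]
probs_tree = [0.6, 0.1, 0.8, 0.3]
weight, blended_loss = fit_blend_weight(y_holdout, probs_logit, probs_tree)
```

Because the grid includes w = 0 and w = 1, the blended loss on the held-out data can never be worse than that of either base model alone.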

Result
Source: Netzer, O., Lemaire, A., & Herzenstein, M. (2019). When words sweat: Identifying signals for loan default in the text of loan applications. Journal of Marketing Research,
56(6), 960-980.

MODEL 2 - Words and writing styles of defaulted loan requests
Aim:
Learn which words, writing styles, and general ideas conveyed by the text are more
likely to be associated with defaulted loan requests.
1) Machine learning tools
Naive Bayes
L1-regularized binary logistic model
2) Standard econometric tools
Logistic regression on the topics extracted from
a latent Dirichlet allocation (LDA) analysis
and on the sub-dictionaries of the Linguistic
Inquiry and Word Count (LIWC) dictionary.
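
One way to see how a naive Bayes model singles out words is to compare smoothed word probabilities between the two groups of requests. The Python sketch below is an illustration under assumptions, not the authors' implementation: it uses Laplace smoothing with a hypothetical alpha parameter, and the tokens are placeholders rather than words from the study.

```python
import math
from collections import Counter


def word_log_odds(defaulted_docs, repaid_docs, alpha=1.0):
    """Laplace-smoothed log ratio of P(word | defaulted) to P(word | repaid)."""
    default_counts = Counter(w for doc in defaulted_docs for w in doc)
    repaid_counts = Counter(w for doc in repaid_docs for w in doc)
    vocab = set(default_counts) | set(repaid_counts)
    default_total = sum(default_counts.values()) + alpha * len(vocab)
    repaid_total = sum(repaid_counts.values()) + alpha * len(vocab)
    return {
        w: math.log((default_counts[w] + alpha) / default_total)
        - math.log((repaid_counts[w] + alpha) / repaid_total)
        for w in vocab
    }


# Placeholder tokens, invented for illustration only.
defaulted = [["term_a", "term_b"], ["term_a", "term_c"]]
repaid = [["term_b", "term_c"], ["term_c", "term_d"]]
scores = word_log_odds(defaulted, repaid)
```

Positive scores mark words that are relatively more common in defaulted requests, negative scores mark words more common in repaid ones, and a score near zero means the word does not separate the two groups.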

MODEL 3 - Potential Borrower’s Personality
Aim:
Further explore the potential traits and states of borrowers
by applying the LIWC dictionaries.
Results:
Defaulting loan requests are written in a manner consistent
with the writing styles of extroverts and liars.
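
Dictionary-based scoring of this kind can be sketched as the share of words in a text that fall into each category. The Python example below is a minimal illustration: the real LIWC word lists are proprietary, so the two categories and their words here are hypothetical stand-ins, and the function name is an assumption.

```python
import re

# Hypothetical mini-dictionary; not the actual LIWC categories or word lists.
CATEGORIES = {
    "first_person": {"i", "me", "my", "we", "our"},
    "time": {"year", "now", "time", "past"},
}


def category_shares(text, categories=CATEGORIES):
    tokens = re.findall(r"[a-z']+", text.lower())
    total = max(1, len(tokens))  # avoid division by zero on empty text
    return {
        name: sum(t in words for t in tokens) / total
        for name, words in categories.items()
    }


shares = category_shares("I pay all my bills on time")
```

The resulting shares per category can then be used as predictors in a regression on loan default, which is how the link between writing style and default would be tested.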

Analyzing applications
Borrower 1: “I am a hard working person, married for 25 years, and have
two wonderful boys. Please let me explain why I need help. I would use
the $2,000 loan to fix our roof. Thank you, God bless you, and I promise to
pay you back.”
Borrower 2: “While the past year in our new place has been more than
great, the roof is now leaking and I need to borrow $2,000 to cover the
cost of the repair. I pay all bills (e.g., car loans, cable, utilities) on time.”
Which borrower is more likely to default?

KEY CONTRIBUTIONS
● Textual information in the loan request significantly helps
predict loan default.
● Identifies words indicative of loan repayment.
● Defaulting loan requests mimic the writing styles of
extroverts and liars.
● Evidence that people with different educational backgrounds
and economic situations use words differently.
● Evidence that text can supplement traditional measures
and replace some aspects of them.
● Helps lenders avoid defaulting borrowers and helps
borrowers better express themselves when requesting a loan.

MARKETING AND NON-MARKETING APPLICATIONS

MARKETING APPLICATIONS
• Sentiment analysis
• Brand monitoring
• Customer feedback analysis
• Churn prediction
• Predictive analysis
• Market research
• Personalized marketing
• Social media analytics

NON-MARKETING APPLICATIONS
• Psychological profiling
• Fraud detection
• Credit risk assessment
• Customer service

LIMITATIONS
1. Text data may not be available for all loan
applications, as some borrowers may not
provide any text or may provide incomplete
or inaccurate information.
2. Text data may be subject to
interpretation and bias, as different lenders
may interpret the same text differently
based on their own biases and assumptions.
3. The use of text data to predict loan
default raises ethical and legal concerns.

FURTHER RESEARCH
● Whether the predictive ability of text analysis
regarding future behavior extends
to other behaviors and industries.
● Extension of the results to other types of
communication, e.g., phone calls
and online chats.
● How word usage can change
over time.
● Exploring the role of emotions and
mental states in financial behaviors.
● Investigating the impact of different
writing styles on loan default.
● Application of the findings to other
loan types and platforms.
● Developing more accurate and
efficient text-mining and machine
learning tools for analyzing loan
applications.

KEY TAKEAWAYS
● Text mining and machine learning tools can be
employed to predict psychographics and behavior,
including the likelihood of future loan default.
● The LIWC dictionaries associated with
extroversion and deception are significantly
correlated with default.
● There may be variables that are affected by
both the observable text and unobservable
personality traits.
