Mykola Herasymovych: Optimizing Acceptance Threshold in Credit Scoring using Reinforcement Learning
1. Optimizing Acceptance Threshold in Credit Scoring using Reinforcement Learning
Student: Mykola Herasymovych
Supervisors: Oliver Lukason (PhD), Karl Märka (MSc)
2. Credit Scoring Problem
(Crook et al., 2007; Lessmann et al., 2015; Thomas et al., 2017)
• Predict the probability of a loan application being bad:

  \Pr\{Bad \mid characteristics\} = p(y = 1 \mid X = x)

• Transform it into a credit score reflecting the application's creditworthiness level:

  s_{CS} = f_{CS}(\hat{y}, z),

  where s_{CS} is the credit score, \hat{y} the estimated probability and z other factors (e.g. policy rules). One common industry choice for f_{CS} is sketched below.
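As an illustration (not part of the slides), a log-odds scorecard scaling is one common form of f_{CS}; a minimal sketch with hypothetical offset and PDO ("points to double the odds") values:

import math

def credit_score(p_bad: float, offset: float = 600.0, pdo: float = 20.0) -> float:
    """Map an estimated bad-probability to a credit score via log-odds scaling.

    `offset` and `pdo` are illustrative; real lenders calibrate them to their
    own scorecard and may add policy-rule adjustments (the z factors).
    """
    odds_good = (1.0 - p_bad) / p_bad      # good:bad odds
    factor = pdo / math.log(2.0)           # points needed to double the odds
    return offset + factor * math.log(odds_good)

print(round(credit_score(0.05)))  # low estimated risk -> high score (~685)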
3. Credit Business Process
(Creditstar Group)
[Flow diagram: a loan application arrives and its credit score is estimated. If the credit score is high (e.g. 50% of cases): give the loan; if it is low: reject the application. Accepted clients either repay (money gain) or do not repay (money loss); realized profits feed back into the credit score estimation.]
4. Acceptance Threshold Optimization
• Acceptance Threshold (Viaene and Dedene, 2005; Verbraken et al., 2014; Skarestad, 2017)
• Selection Bias (Banasik et al., 2003; Wu and Hand, 2007; Dey, 2010)
• Population Drift (Sousa et al., 2013; Bellotti and Crook, 2013; Nikolaidis, 2017)
5. Credit Scoring Literature 1
(Number of published articles with the "credit scoring" keyword)
[Figure: articles published per year (y-axis 0-300); series: "Articles by year" and "General trend".]
Note: adapted from Louzada et al. (2016) and updated by the author based on literature review.
6. Credit Scoring Literature 2
(Percentage of papers published on the topic in 1992-2015)
Note: adapted from Louzada et al. (2016) and updated by the author based on literature review.
[Figure: share of papers by topic (x-axis 0-60%): new method to propose rating; comparison of traditional techniques; conceptual discussion; variable selection; literature review; performance measures; other issues; acceptance threshold optimization.]
7. Shortcomings of the Traditional Approach
• It is static and backward-looking;
• ignores the credit scoring model's performance uncertainty (Thomas et al., 2017);
• ignores selection bias (Hand, 2006; Dey, 2010);
• ignores population drift (Sousa et al., 2013; Nikolaidis, 2017);
• oversimplifies the lender's utility function (Finlay, 2010; Skarestad, 2017).
8. Solution
A Reinforcement Learning (RL) agent:
• a dynamic, forward-looking system
• that adapts to live data feedback
• and adjusts the acceptance threshold
• to maximize an accurately specified lender's utility function.
9. RL Achievements
• Forex, stocks and securities trading (Neuneier, 1996);
• resource allocation (Tesauro et al., 2006);
• tax and debt collection optimization (Abe et al., 2010);
• dynamic pricing (Kim et al., 2016);
• behavioral marketing (Sato, 2016);
• bank portfolio optimization (Strydom, 2017).
• To the best of our knowledge, RL has not yet been applied to credit scoring.
11. Credit Business Environment and RL Agent
[Architecture diagram: the RL Agent interacts with the Credit Business Environment; the process repeats at a weekly frequency, with t denoting the week number. The agent alternates between prediction (forward propagation) and learning (backward propagation).]

State (Acceptance Rate): the share of applications accepted during the previous week,
S(A) = \frac{1}{n_t} \sum_{i=1}^{n_t} \mathbb{1}\{ s_{CS,i} \ge t_{AT,t-1} \}.
Note: the State object summarizes characteristics of the loan portfolio.

Action (Acceptance Threshold): A_t = \pi(Q(w, x(S))).
Note: the Action object is mapped to one out of 20 discrete values of the acceptance threshold.

Reward (Profit): R(S, A) = \sum_{i=0}^{n_t} profits_i.

RBF features: x(S) = \sqrt{2/D} \cos(w_{RBF}\, S(A) + b_{RBF}), with w_{RBF} \sim N(0, 2\gamma_{RBF}) and b_{RBF} \sim U(0, 2\pi).

Q-values: Q_a(w_a, x) = w_a x (one linear model per action).
Note: the Q-value function is approximated with Stochastic Gradient Descent (SGD) models.

Q-Value Function: Q_\pi(S, A) = \mathbb{E}_\pi\left[\sum_{i=0}^{\infty} \gamma^i R_{t+i} \mid S_t = S, A_t = A\right].

Q-Value Function Update Rule:
w_a \leftarrow w_a + \alpha_t \left[R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right] \frac{\partial Q(s,a)}{\partial w_a}, \quad \alpha_t = \frac{\alpha_0}{t^{power\_t}}, \quad \frac{\partial Q(s,a)}{\partial w_a} = x_t.

Policy: \pi(A \mid S) = \Pr[A_t = a \mid S_t = s]
• Exploitative: \pi_{Greedy}(S) = \arg\max_a Q(S, A)
• Explorative: \pi_{Boltzmann}(A \mid S) = \frac{e^{Q(S,A)/\tau}}{\sum_{A' \in \mathcal{A}} e^{Q(S,A')/\tau}}
Notes: the policy is explorative during training episodes and exploitative during test ones; lower \tau leads to a greedier policy, while higher \tau to a more random one. The RL agent is less responsive during training and more responsive during test episodes.

Variables and parameters:
• RL parameters: \alpha – learning rate; \gamma – discount rate; power\_t – inverse scaling parameter of the learning rate.
• Credit scoring variables: n – number of loan applications; s_{CS} – credit score; t_{AT} – acceptance threshold.
• RBF parameters: w_{RBF} – RBF weights; b_{RBF} – RBF offset values; \gamma_{RBF} – variance parameter of a normal distribution; D – number of RBF components.
• Policy parameters: \tau – temperature parameter of the Boltzmann distribution.

A loop sketch tying these components together follows below.
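To make the weekly cycle concrete, here is a minimal sketch of the agent-environment loop implied by the diagram (select_action and q_update are defined in the sketches on slides 29-30; environment_step is a hypothetical stand-in for Creditstar's simulated or live pipeline):

import numpy as np

N_ACTIONS = 20                                   # 20 discrete acceptance thresholds
thresholds = np.linspace(0.05, 1.0, N_ACTIONS)   # hypothetical threshold grid

state = 0.5                                      # last week's acceptance rate
for week in range(1, 6001):                      # e.g. 6000 simulated training weeks
    action = select_action(state, explore=True)                 # Boltzmann policy while training
    reward, next_state = environment_step(thresholds[action])   # weekly profit, new acceptance rate
    q_update(state, action, reward, next_state)                  # SGD Q-learning step
    state = next_state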
12. Reinforcement Learning (RL)
(Sutton and Barto, 2017)
[Diagram: the Credit Business Environment and the RL Agent, illustrated as a dog being trained. Each week the agent observes the State (acceptance rate), takes an Action (acceptance threshold) and receives a Reward (higher profit, or a loss), after which the environment returns the next State (next week's acceptance rate).]
13. Learned Value Function shape
(after 6000 simulated weeks of training)
Notes: state denotes the application acceptance rate during the previous week; action denotes the acceptance threshold for the following week; value is the prediction of the Value Function model for a particular state-action pair; optimum marks the state-action pair with the highest value in the state-action space.
15. Test Simulation Results 1
(shift in score distribution)
Notes: figures show 100 simulation runs and their average. In each scenario, the distribution of total profit differences is significantly greater than zero according to a one-tailed t-test. Profit is measured in thousands of euros.
16. Test Simulation Results 2
(shift in default rates)
Notes: figures show 100 simulation runs and their average. In each scenario, the distribution of total profit differences is significantly greater than zero according to a one-tailed t-test. Profit is measured in thousands of euros.
17. Performance on the Real Data 1
(acceptance threshold policy)
Notes: the figure shows the difference between the action variables and the baseline action. Baseline denotes the acceptance threshold optimized using the traditional approach; RL chosen denotes the threshold used by the RL agent; Value Function-optimal denotes the threshold that is optimal according to the Value Function model.
18. Performance on the Real Data 2
(profits received)
Note: the figure shows the difference between the reward variables and the baseline reward. Baseline denotes the profits received with the acceptance threshold optimized using the traditional approach; RL received weekly and total denote the profits received by the RL agent. Profit is measured in thousands of euros.
19. Implications
• The work improves on the traditional acceptance threshold optimization approach in credit scoring of Verbraken et al. (2014) and Skarestad (2017);
• solves the problem of optimization in a dynamic, partially observed credit business environment outlined in Thomas et al. (2017) and Nikolaidis (2017);
• provides further evidence on the superiority of RL-based systems over traditional methodology, in line with Strydom (2017) and Sutton and Barto (2017);
• produces practical benefit to Creditstar Group as a decision support system.
20. Conclusions
• The credit scoring literature usually omits the problem of acceptance threshold optimization, despite its significant impact on credit business efficiency;
• the traditional approach fails to optimize the acceptance threshold due to issues such as population drift and selection bias;
• the developed RL algorithm manages to correct for flawed prior knowledge and successfully adapts to the real environment, significantly outperforming the traditional approach;
• being a proof of concept, our work leaves considerable room for further research on and improvement of acceptance threshold optimization.
Q&A
23. Traditional Approach 1
(Viaene and Dedene, 2005; Hand, 2009; Lessmann et al., 2015)
• Construct the misclassification cost function:

  MC(t_{AT}; c_B, c_G) = c_B\, \pi_B^{pr} (1 - F_B(t_{AT})) + c_G\, \pi_G^{pr} F_G(t_{AT})

• Minimize using the first-order condition (FOC) w.r.t. the acceptance threshold:

  \frac{f_B(\tau_{AT})}{f_G(\tau_{AT})} = \frac{\pi_G^{pr}}{\pi_B^{pr}} \cdot \frac{c_G}{c_B}

where t_{AT} – acceptance threshold; \tau_{AT} – optimal acceptance threshold; c_B and c_G – average cost per misclassified bad (Type I error) and good (Type II error) application respectively; \pi_G^{pr} and \pi_B^{pr} – prior probabilities of being a good and a bad application respectively; f_G(t_{AT}) and f_B(t_{AT}) – probability densities of the scores at cut-off point t_{AT} for good and bad applications respectively. A numerical sketch of this minimization follows below.
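To make the mechanics concrete (a minimal sketch, assuming hypothetical Beta-distributed score densities and illustrative costs and priors; not the thesis implementation), the cost function can be minimized on a grid:

import numpy as np
from scipy import stats

f_B = stats.beta(2, 5)    # hypothetical score density of bad applications
f_G = stats.beta(5, 2)    # hypothetical score density of good applications
c_B, c_G = 10.0, 1.0      # illustrative average misclassification costs
pi_B, pi_G = 0.2, 0.8     # illustrative prior class probabilities

def misclassification_cost(t: float) -> float:
    # MC(t) = c_B * pi_B * (1 - F_B(t)) + c_G * pi_G * F_G(t)
    return c_B * pi_B * (1.0 - f_B.cdf(t)) + c_G * pi_G * f_G.cdf(t)

grid = np.linspace(0.0, 1.0, 1001)
t_opt = grid[np.argmin([misclassification_cost(t) for t in grid])]
print(f"optimal acceptance threshold: {t_opt:.3f}")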
24. Traditional Approach 2
(Viaene and Dedene, 2005; Hand, 2009; Lessmann et al., 2015)
Note: based on Crook et al. (2007), Hand (2009) and Verbraken et al. (2014). s_{CS}(x) – application's credit score estimated from the application data x; f_G(s_{CS}) and f_B(s_{CS}) – credit score probability density functions of actually good and bad applications respectively; t_{AT} – acceptance threshold for the credit score; F_B(t_{AT}) – correctly classified bad applications; 1 - F_G(t_{AT}) – correctly classified good applications; 1 - F_B(t_{AT}) – bad applications misclassified as good ones; F_G(t_{AT}) – good applications misclassified as bad ones; the blue line is the estimated potential profit (in thousands of euros, for illustration purposes); grey dotted lines show alternative acceptance thresholds t^a_{AT} and the corresponding levels of potential profit; the vertical red dotted line is the estimated optimal acceptance threshold \tau_{AT}, while the horizontal red dotted lines show the corresponding potential profit and the shares of correctly classified and misclassified good and bad applications.
25. RL Benefits
• Solves optimization problems with little or no prior information about the environment (Kim et al., 2016);
• learns directly from real-time data without any simplifying assumptions (Rana and Oliveira, 2015);
• dynamically adjusts the policy over the learning period, adapting to environmental changes (Abe et al., 2010);
• avoids potentially costly poor performance by training in a simulated environment or learning off-policy (Aihe and Gonzalez, 2015);
• satisfies conflicting performance goals (Varela et al., 2016);
• was found effective in portfolio optimization problems (mainly stock and forex trading) (Strydom, 2017).
27. Value Function
The action value function (also called the Q-value function) gives the expected discounted reward of taking action a in state s and following a policy \pi thereafter:

Q_\pi(s, a) = \mathbb{E}_\pi[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid S_t = s, A_t = a],

where \gamma is the discount rate.
Usually, the value function is approximated by a model. In our case, we use a Gaussian Radial Basis Function (RBF) approximator and a set of Stochastic Gradient Descent (SGD) models.
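For intuition (an illustrative calculation, not from the thesis): with a constant reward R received every week and \gamma < 1, the discounted sum collapses to a closed form:

Q_\pi(s, a) = \sum_{i=0}^{\infty} \gamma^i R = \frac{R}{1 - \gamma}, \qquad \text{e.g. } R = 10,\ \gamma = 0.9 \;\Rightarrow\; Q_\pi(s, a) = 100.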
29. Forward Propagation (Prediction)
Gaussian Radial Basis Function (RBF) transformation:

x = \sqrt{2/D} \cos(w_{RBF}\, s + b_{RBF}), \quad w_{RBF} \sim N(0, 2\gamma_{RBF}), \; b_{RBF} \sim U(0, 2\pi),

where x is the resulting transformed feature vector, s is the input state variable, D is the number of Monte Carlo samples per original feature, w_{RBF} is a D-element vector of randomly generated RBF weights, b_{RBF} is a D-element vector of randomly generated RBF offset values and \gamma_{RBF} is the variance parameter of the normal distribution.

Stochastic Gradient Descent (SGD) model for each action:

Q(w_a, s) = w_a\, \phi_{RBF}(s) = w_a\, x,

where w_a is a vector of regression weights for action a, s is the state variable, \phi_{RBF} is the RBF transformation function, x is the resulting feature vector and Q is the value of action a in state s corresponding to the feature vector x.

Choose an action according to the current policy:

a = \pi_{Greedy}(s) = \arg\max_a Q(s, a)

a \sim \pi_{Boltzmann}(a \mid s) = \Pr[A_t = a \mid S_t = s] = \frac{e^{Q(s,a)/\tau}}{\sum_{a' \in \mathcal{A}} e^{Q(s,a')/\tau}},

where \mathcal{A} is the set of all actions, a' indexes the actions in the normalizing sum and \tau is the temperature parameter of the Boltzmann distribution. A code sketch of this forward pass follows below.
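A minimal sketch of the forward pass under these definitions (random cosine/RBF features plus one linear weight vector per action; all constants are illustrative), reusable by the loop sketch on slide 11:

import numpy as np

rng = np.random.default_rng(0)
D, N_ACTIONS = 100, 20     # RBF components; 20 discrete thresholds
gamma_rbf = 1.0            # illustrative variance parameter

# Randomly generated once, then frozen
w_rbf = rng.normal(0.0, np.sqrt(2.0 * gamma_rbf), size=D)   # RBF weights
b_rbf = rng.uniform(0.0, 2.0 * np.pi, size=D)               # RBF offsets

W = np.zeros((N_ACTIONS, D))   # one SGD-trained weight vector per action

def rbf_features(s: float) -> np.ndarray:
    """Map a scalar state (acceptance rate) to D random cosine features."""
    return np.sqrt(2.0 / D) * np.cos(w_rbf * s + b_rbf)

def q_values(s: float) -> np.ndarray:
    """Q(s, a) = w_a . x for every action a."""
    return W @ rbf_features(s)

def select_action(s: float, explore: bool = True, tau: float = 1.0) -> int:
    """Boltzmann-sampled while training (explorative), greedy at test time."""
    q = q_values(s)
    if not explore:
        return int(np.argmax(q))             # pi_Greedy
    p = np.exp((q - q.max()) / tau)          # pi_Boltzmann (max-shift for stability)
    return int(rng.choice(N_ACTIONS, p=p / p.sum()))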
30. Backward Propagation (Learning)
The approximation error is:

R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t).

To adjust the SGD model weights in the direction of the steepest error descent, we use the following update rule:

w_a \leftarrow w_a + \alpha \left[R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right] \frac{\partial Q(s, a)}{\partial w_a},

which, under the assumption that \gamma \max_a Q(S_{t+1}, a) does not depend on w_a, simplifies to the general SGD update rule:

w_a \leftarrow w_a + \alpha \left[R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right] x_t,

where Q(S_t, A_t) can be thought of as the current model prediction, R_t + \gamma \max_a Q(S_{t+1}, a) as the target and x_t as the gradient with respect to the weights.
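Continuing the sketch from the previous slide (W, rbf_features and q_values as defined there; the learning and discount rates are illustrative), the update becomes:

ALPHA, GAMMA = 0.01, 0.9   # illustrative learning rate and discount rate

def q_update(state: float, action: int, reward: float, next_state: float) -> None:
    """One Q-learning SGD step on the chosen action's weight vector."""
    x = rbf_features(state)                        # dQ/dw_a for a linear model
    target = reward + GAMMA * q_values(next_state).max()
    td_error = target - W[action] @ x              # approximation (TD) error
    W[action] += ALPHA * td_error * x              # steepest-descent adjustment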
32. Value Function Model Convergence
(1st episode vs. whole run)
Note: state denotes the application acceptance rate during the previous iteration; action denotes the acceptance threshold for the following iteration; value is the prediction of the Value Function model for a particular state-action pair; optimum marks the state-action pair with the highest value in the state-action space.
33. Results of the t-test for Various Distortion Scenarios

Scenario                                               t-statistic   p-value
Scenario 1: downwards shift in score distribution      29.56631      1.55e-51
Scenario 2: upwards shift in score distribution        42.72066      2.45e-66
Scenario 3: downwards shift in default rates           5.172688      5.95e-07
Scenario 4: upwards shift in default rates             4.600158      6.20e-06

Note: the t-test null hypothesis is that the mean difference between the episode reward received by the RL agent and the episode reward received using the traditional approach throughout 100 episodes is equal to or lower than zero. A sketch of the test follows below.
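For reference (an illustrative reproduction, assuming the 100 per-episode reward differences are available as an array; the data below is a placeholder, not the thesis results):

import numpy as np
from scipy import stats

# Hypothetical per-episode reward differences (RL minus traditional approach)
diff = np.random.default_rng(1).normal(5.0, 10.0, size=100)   # placeholder data only

# One-tailed one-sample t-test, H0: mean difference <= 0 (requires SciPy >= 1.6)
t_stat, p_value = stats.ttest_1samp(diff, popmean=0.0, alternative='greater')
print(f"t = {t_stat:.2f}, one-tailed p = {p_value:.2e}")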
34. Credit Scoring Literature
• Thomas, Lyn C., David B. Edelman, and Jonathan N. Crook. Credit Scoring and Its Applications. SIAM, Philadelphia (2017);
• Crook, Jonathan N., David B. Edelman, and Lyn C. Thomas. "Recent developments in consumer credit risk assessment." European Journal of Operational Research 183.3 (2007): 1447-1465;
• Hand, David J. "Measuring classifier performance: a coherent alternative to the area under the ROC curve." Machine Learning 77.1 (2009): 103-123;
• Verbraken, Thomas, et al. "Development and application of consumer credit scoring models using profit-based classification performance measures." European Journal of Operational Research 238.2 (2014): 505-513;
• Viaene, Stijn, and Guido Dedene. "Cost-sensitive learning and decision making revisited." European Journal of Operational Research 166.1 (2005): 212-220;
• Lessmann, Stefan, et al. "Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research." European Journal of Operational Research 247.1 (2015): 124-136;
• Oliver, R. M., and L. C. Thomas. "Optimal score cutoffs and pricing in regulatory capital in retail credit portfolios." (2009);
• Bellotti, Tony, and Jonathan Crook. "Forecasting and stress testing credit card default using dynamic models." International Journal of Forecasting 29.4 (2013): 563-574.
35. RL Literature
• Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2017;
• Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529;
• Neuneier, Ralph. "Optimal asset allocation using adaptive dynamic programming." Advances in Neural Information Processing Systems. 1996;
• Tesauro, Gerald, et al. "A hybrid reinforcement learning approach to autonomic resource allocation." Autonomic Computing, 2006. ICAC'06. IEEE International Conference on. IEEE, 2006;
• Abe, Naoki, et al. "Optimizing debt collections using constrained reinforcement learning." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010;
• Kim, Byung-Gook, et al. "Dynamic pricing and energy consumption scheduling with reinforcement learning." IEEE Transactions on Smart Grid 7.5 (2016): 2187-2198;
• Sato, Masamichi. "Quantitative Realization of Behavioral Economic Heuristics by Cognitive Category: Consumer Behavior Marketing with Reinforcement Learning." (2016);
• Strydom, Petrus. "Funding optimization for a bank integrating credit and liquidity risk." Journal of Applied Finance and Banking 7.2 (2017): 1;
• Aihe, David O., and Avelino J. Gonzalez. "Correcting flawed expert knowledge through reinforcement learning." Expert Systems with Applications 42.17-18 (2015): 6457-6471;
• Rana, Rupal, and Fernando S. Oliveira. "Dynamic pricing policies for interdependent perishable products or services using reinforcement learning." Expert Systems with Applications 42.1 (2015): 426-436;
• Varela, Martín, Omar Viera, and Franco Robledo. "A Q-learning approach for investment decisions." Trends in Mathematical Economics. Springer, Cham, 2016. 347-368.