Mykola Herasymovych: Optimizing Acceptance Threshold in Credit Scoring using Reinforcement Learning
1. Optimizing Acceptance Threshold in Credit Scoring using Reinforcement Learning
Student: Mykola Herasymovych
Supervisors: Oliver Lukason (PhD), Karl Märka (MSc)
2. Credit Scoring Problem
(Crook et al., 2007; Lessmann et al., 2015; Thomas et al., 2017)
• Predict the probability of a loan application being bad:

  \Pr\{Bad \mid characteristics\} = p(y = 1 \mid X = x)

• Transform it into a credit score reflecting the application's creditworthiness level:

  s_{CS} = f_{CS}(\hat{y}, z),

  where s_{CS} is the credit score, \hat{y} the estimated probability and z other factors (e.g. policy rules). One common industry choice for f_{CS} is sketched below.
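As an illustration (not part of the slides), a log-odds scorecard scaling is one common form of f_{CS}; a minimal sketch with hypothetical offset and PDO ("points to double the odds") values:

import math

def credit_score(p_bad: float, offset: float = 600.0, pdo: float = 20.0) -> float:
    """Map an estimated bad-probability to a credit score via log-odds scaling.

    `offset` and `pdo` are illustrative; real lenders calibrate them to their
    own scorecard and may add policy-rule adjustments (the z factors).
    """
    odds_good = (1.0 - p_bad) / p_bad      # good:bad odds
    factor = pdo / math.log(2.0)           # points needed to double the odds
    return offset + factor * math.log(odds_good)

print(round(credit_score(0.05)))  # low estimated risk -> high score (~685)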
3. Credit Business Process
(Creditstar Group)
[Flow diagram: a loan application arrives and its credit score is estimated. If the credit score is high (e.g. 50% of cases): give the loan; if it is low: reject the application. Accepted clients either repay (money gain) or do not repay (money loss); realized profits feed back into the credit score estimation.]
4. Acceptance Threshold Optimization
• Acceptance Threshold (Viaene and Dedene, 2005; Verbraken et al., 2014; Skarestad, 2017)
• Selection Bias (Banasik et al., 2003; Wu and Hand, 2007; Dey, 2010)
• Population Drift (Sousa et al., 2013; Bellotti and Crook, 2013; Nikolaidis, 2017)
5. Credit Scoring Literature 1
(Number of published articles with the "credit scoring" keyword)
[Figure: articles published per year (y-axis 0-300); series: "Articles by year" and "General trend".]
Note: adapted from Louzada et al. (2016) and updated by the author based on literature review.
6. Credit Scoring Literature 2
(Percentage of papers published on the topic in 1992-2015)
Note: adapted from Louzada et al. (2016) and updated by the author based on literature review.
[Figure: share of papers by topic (x-axis 0-60%): new method to propose rating; comparison of traditional techniques; conceptual discussion; variable selection; literature review; performance measures; other issues; acceptance threshold optimization.]
7. Shortcomings of the Traditional Approach
• It is static and backward-looking;
• ignores the credit scoring model's performance uncertainty (Thomas et al., 2017);
• ignores selection bias (Hand, 2006; Dey, 2010);
• ignores population drift (Sousa et al., 2013; Nikolaidis, 2017);
• oversimplifies the lender's utility function (Finlay, 2010; Skarestad, 2017).
8. Solution
A Reinforcement Learning (RL) agent:
• a dynamic, forward-looking system
• that adapts to live data feedback
• and adjusts the acceptance threshold
• to maximize an accurately specified lender's utility function.
9. RL Achievements
• Forex, stocks and securities trading (Neuneier, 1996);
• resource allocation (Tesauro et al., 2006);
• tax and debt collection optimization (Abe et al., 2010);
• dynamic pricing (Kim et al., 2016);
• behavioral marketing (Sato, 2016);
• bank portfolio optimization (Strydom, 2017).
• To the best of our knowledge, RL has not yet been applied to credit scoring.
11. Credit Business Environment and RL Agent
[Architecture diagram: the RL Agent interacts with the Credit Business Environment; the process repeats at a weekly frequency, with t denoting the week number. The agent alternates between prediction (forward propagation) and learning (backward propagation).]

State (Acceptance Rate): the share of applications accepted during the previous week,
S(A) = \frac{1}{n_t} \sum_{i=1}^{n_t} \mathbb{1}\{ s_{CS,i} \ge t_{AT,t-1} \}.
Note: the State object summarizes characteristics of the loan portfolio.

Action (Acceptance Threshold): A_t = \pi(Q(w, x(S))).
Note: the Action object is mapped to one out of 20 discrete values of the acceptance threshold.

Reward (Profit): R(S, A) = \sum_{i=0}^{n_t} profits_i.

RBF features: x(S) = \sqrt{2/D} \cos(w_{RBF}\, S(A) + b_{RBF}), with w_{RBF} \sim N(0, 2\gamma_{RBF}) and b_{RBF} \sim U(0, 2\pi).

Q-values: Q_a(w_a, x) = w_a x (one linear model per action).
Note: the Q-value function is approximated with Stochastic Gradient Descent (SGD) models.

Q-Value Function: Q_\pi(S, A) = \mathbb{E}_\pi\left[\sum_{i=0}^{\infty} \gamma^i R_{t+i} \mid S_t = S, A_t = A\right].

Q-Value Function Update Rule:
w_a \leftarrow w_a + \alpha_t \left[R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right] \frac{\partial Q(s,a)}{\partial w_a}, \quad \alpha_t = \frac{\alpha_0}{t^{power\_t}}, \quad \frac{\partial Q(s,a)}{\partial w_a} = x_t.

Policy: \pi(A \mid S) = \Pr[A_t = a \mid S_t = s]
• Exploitative: \pi_{Greedy}(S) = \arg\max_a Q(S, A)
• Explorative: \pi_{Boltzmann}(A \mid S) = \frac{e^{Q(S,A)/\tau}}{\sum_{A' \in \mathcal{A}} e^{Q(S,A')/\tau}}
Notes: the policy is explorative during training episodes and exploitative during test ones; lower \tau leads to a greedier policy, while higher \tau to a more random one. The RL agent is less responsive during training and more responsive during test episodes.

Variables and parameters:
• RL parameters: \alpha – learning rate; \gamma – discount rate; power\_t – inverse scaling parameter of the learning rate.
• Credit scoring variables: n – number of loan applications; s_{CS} – credit score; t_{AT} – acceptance threshold.
• RBF parameters: w_{RBF} – RBF weights; b_{RBF} – RBF offset values; \gamma_{RBF} – variance parameter of a normal distribution; D – number of RBF components.
• Policy parameters: \tau – temperature parameter of the Boltzmann distribution.

A loop sketch tying these components together follows below.
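To make the weekly cycle concrete, here is a minimal sketch of the agent-environment loop implied by the diagram (select_action and q_update are defined in the sketches on slides 29-30; environment_step is a hypothetical stand-in for Creditstar's simulated or live pipeline):

import numpy as np

N_ACTIONS = 20                                   # 20 discrete acceptance thresholds
thresholds = np.linspace(0.05, 1.0, N_ACTIONS)   # hypothetical threshold grid

state = 0.5                                      # last week's acceptance rate
for week in range(1, 6001):                      # e.g. 6000 simulated training weeks
    action = select_action(state, explore=True)                 # Boltzmann policy while training
    reward, next_state = environment_step(thresholds[action])   # weekly profit, new acceptance rate
    q_update(state, action, reward, next_state)                  # SGD Q-learning step
    state = next_state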
12. Reinforcement Learning (RL)
(Sutton and Barto, 2017)
[Diagram: the Credit Business Environment and the RL Agent, illustrated as a dog being trained. Each week the agent observes the State (acceptance rate), takes an Action (acceptance threshold) and receives a Reward (higher profit, or a loss), after which the environment returns the next State (next week's acceptance rate).]
13. Learned Value Function shape
(after 6000 simulated weeks of training)
Notes: state denotes the application acceptance rate during the previous week; action denotes the acceptance threshold for the following week; value is the prediction of the Value Function model for a particular state-action pair; optimum marks the state-action pair with the highest value in the state-action space.
15. Test Simulation Results 1
(shift in score distribution)
Notes: figures show 100 simulation runs and their average. In each scenario, the distribution of total profit differences is significantly greater than zero according to a one-tailed t-test. Profit is measured in thousands of euros.
16. Test Simulation Results 2
(shift in default rates)
Notes: figures show 100 simulation runs and their average. In each scenario, the distribution of total profit differences is significantly greater than zero according to a one-tailed t-test. Profit is measured in thousands of euros.
17. Performance on the Real Data 1
(acceptance threshold policy)
Notes: the figure shows the difference between the action variables and the baseline action. Baseline denotes the acceptance threshold optimized using the traditional approach; RL chosen denotes the threshold used by the RL agent; Value Function-optimal denotes the threshold that is optimal according to the Value Function model.
18. Performance on the Real Data 2
(profits received)
Note: the figure shows the difference between the reward variables and the baseline reward. Baseline denotes the profits received with the acceptance threshold optimized using the traditional approach; RL received weekly and total denote the profits received by the RL agent. Profit is measured in thousands of euros.
19. Implications
• The work improves on the traditional acceptance threshold optimization approach in credit scoring of Verbraken et al. (2014) and Skarestad (2017);
• solves the problem of optimization in a dynamic, partially observed credit business environment outlined in Thomas et al. (2017) and Nikolaidis (2017);
• provides further evidence on the superiority of RL-based systems over traditional methodology, in line with Strydom (2017) and Sutton and Barto (2017);
• produces practical benefit to Creditstar Group as a decision support system.
20. Conclusions
• The credit scoring literature usually omits the problem of acceptance threshold optimization, despite its significant impact on credit business efficiency;
• the traditional approach fails to optimize the acceptance threshold due to issues such as population drift and selection bias;
• the developed RL algorithm manages to correct for flawed prior knowledge and successfully adapts to the real environment, significantly outperforming the traditional approach;
• being a proof of concept, our work leaves considerable room for further research on and improvement of acceptance threshold optimization.
Q&A
23. Traditional Approach 1
(Viaene and Dedene, 2005; Hand, 2009; Lessmann et al., 2015)
• Construct the misclassification cost function:

  MC(t_{AT}; c_B, c_G) = c_B\, \pi_B^{pr} (1 - F_B(t_{AT})) + c_G\, \pi_G^{pr} F_G(t_{AT})

• Minimize using the first-order condition (FOC) w.r.t. the acceptance threshold:

  \frac{f_B(\tau_{AT})}{f_G(\tau_{AT})} = \frac{\pi_G^{pr}}{\pi_B^{pr}} \cdot \frac{c_G}{c_B}

where t_{AT} – acceptance threshold; \tau_{AT} – optimal acceptance threshold; c_B and c_G – average cost per misclassified bad (Type I error) and good (Type II error) application respectively; \pi_G^{pr} and \pi_B^{pr} – prior probabilities of being a good and a bad application respectively; f_G(t_{AT}) and f_B(t_{AT}) – probability densities of the scores at cut-off point t_{AT} for good and bad applications respectively. A numerical sketch of this minimization follows below.
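To make the mechanics concrete (a minimal sketch, assuming hypothetical Beta-distributed score densities and illustrative costs and priors; not the thesis implementation), the cost function can be minimized on a grid:

import numpy as np
from scipy import stats

f_B = stats.beta(2, 5)    # hypothetical score density of bad applications
f_G = stats.beta(5, 2)    # hypothetical score density of good applications
c_B, c_G = 10.0, 1.0      # illustrative average misclassification costs
pi_B, pi_G = 0.2, 0.8     # illustrative prior class probabilities

def misclassification_cost(t: float) -> float:
    # MC(t) = c_B * pi_B * (1 - F_B(t)) + c_G * pi_G * F_G(t)
    return c_B * pi_B * (1.0 - f_B.cdf(t)) + c_G * pi_G * f_G.cdf(t)

grid = np.linspace(0.0, 1.0, 1001)
t_opt = grid[np.argmin([misclassification_cost(t) for t in grid])]
print(f"optimal acceptance threshold: {t_opt:.3f}")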
24. Traditional Approach 2
(Viaene and Dedene, 2005; Hand, 2009; Lessmann et al., 2015)
Note: based on Crook et al. (2007), Hand (2009) and Verbraken et al. (2014). s_{CS}(x) – application's credit score estimated from the application data x; f_G(s_{CS}) and f_B(s_{CS}) – credit score probability density functions of actually good and bad applications respectively; t_{AT} – acceptance threshold for the credit score; F_B(t_{AT}) – correctly classified bad applications; 1 - F_G(t_{AT}) – correctly classified good applications; 1 - F_B(t_{AT}) – bad applications misclassified as good ones; F_G(t_{AT}) – good applications misclassified as bad ones; the blue line is the estimated potential profit (in thousands of euros, for illustration purposes); grey dotted lines show alternative acceptance thresholds t^a_{AT} and the corresponding levels of potential profit; the vertical red dotted line is the estimated optimal acceptance threshold \tau_{AT}, while the horizontal red dotted lines show the corresponding potential profit and the shares of correctly classified and misclassified good and bad applications.
25. RL Benefits
• Solves optimization problems with little or no prior information about the environment (Kim et al., 2016);
• learns directly from real-time data without any simplifying assumptions (Rana and Oliveira, 2015);
• dynamically adjusts the policy over the learning period, adapting to environmental changes (Abe et al., 2010);
• avoids potentially costly poor performance by training in a simulated environment or learning off-policy (Aihe and Gonzalez, 2015);
• satisfies conflicting performance goals (Varela et al., 2016);
• was found effective in portfolio optimization problems (mainly stock and forex trading) (Strydom, 2017).
27. Value Function
The action value function (also called the Q-value function) gives the expected discounted reward of taking action a in state s and following a policy \pi thereafter:

Q_\pi(s, a) = \mathbb{E}_\pi[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid S_t = s, A_t = a],

where \gamma is the discount rate.
Usually, the value function is approximated by a model. In our case, we use a Gaussian Radial Basis Function (RBF) approximator and a set of Stochastic Gradient Descent (SGD) models.
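For intuition (an illustrative calculation, not from the thesis): with a constant reward R received every week and \gamma < 1, the discounted sum collapses to a closed form:

Q_\pi(s, a) = \sum_{i=0}^{\infty} \gamma^i R = \frac{R}{1 - \gamma}, \qquad \text{e.g. } R = 10,\ \gamma = 0.9 \;\Rightarrow\; Q_\pi(s, a) = 100.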
29. Forward Propagation (Prediction)
Gaussian Radial Basis Function (RBF) transformation:

x = \sqrt{2/D} \cos(w_{RBF}\, s + b_{RBF}), \quad w_{RBF} \sim N(0, 2\gamma_{RBF}), \; b_{RBF} \sim U(0, 2\pi),

where x is the resulting transformed feature vector, s is the input state variable, D is the number of Monte Carlo samples per original feature, w_{RBF} is a D-element vector of randomly generated RBF weights, b_{RBF} is a D-element vector of randomly generated RBF offset values and \gamma_{RBF} is the variance parameter of the normal distribution.

Stochastic Gradient Descent (SGD) model for each action:

Q(w_a, s) = w_a\, \phi_{RBF}(s) = w_a\, x,

where w_a is a vector of regression weights for action a, s is the state variable, \phi_{RBF} is the RBF transformation function, x is the resulting feature vector and Q is the value of action a in state s corresponding to the feature vector x.

Choose an action according to the current policy:

a = \pi_{Greedy}(s) = \arg\max_a Q(s, a)

a \sim \pi_{Boltzmann}(a \mid s) = \Pr[A_t = a \mid S_t = s] = \frac{e^{Q(s,a)/\tau}}{\sum_{a' \in \mathcal{A}} e^{Q(s,a')/\tau}},

where \mathcal{A} is the set of all actions, a' indexes the actions in the normalizing sum and \tau is the temperature parameter of the Boltzmann distribution. A code sketch of this forward pass follows below.
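A minimal sketch of the forward pass under these definitions (random cosine/RBF features plus one linear weight vector per action; all constants are illustrative), reusable by the loop sketch on slide 11:

import numpy as np

rng = np.random.default_rng(0)
D, N_ACTIONS = 100, 20     # RBF components; 20 discrete thresholds
gamma_rbf = 1.0            # illustrative variance parameter

# Randomly generated once, then frozen
w_rbf = rng.normal(0.0, np.sqrt(2.0 * gamma_rbf), size=D)   # RBF weights
b_rbf = rng.uniform(0.0, 2.0 * np.pi, size=D)               # RBF offsets

W = np.zeros((N_ACTIONS, D))   # one SGD-trained weight vector per action

def rbf_features(s: float) -> np.ndarray:
    """Map a scalar state (acceptance rate) to D random cosine features."""
    return np.sqrt(2.0 / D) * np.cos(w_rbf * s + b_rbf)

def q_values(s: float) -> np.ndarray:
    """Q(s, a) = w_a . x for every action a."""
    return W @ rbf_features(s)

def select_action(s: float, explore: bool = True, tau: float = 1.0) -> int:
    """Boltzmann-sampled while training (explorative), greedy at test time."""
    q = q_values(s)
    if not explore:
        return int(np.argmax(q))             # pi_Greedy
    p = np.exp((q - q.max()) / tau)          # pi_Boltzmann (max-shift for stability)
    return int(rng.choice(N_ACTIONS, p=p / p.sum()))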
30. Backward Propagation (Learning)
The approximation error is:

R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t).

To adjust the SGD model weights in the direction of the steepest error descent, we use the following update rule:

w_a \leftarrow w_a + \alpha \left[R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right] \frac{\partial Q(s, a)}{\partial w_a},

which, under the assumption that \gamma \max_a Q(S_{t+1}, a) does not depend on w_a, simplifies to the general SGD update rule:

w_a \leftarrow w_a + \alpha \left[R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right] x_t,

where Q(S_t, A_t) can be thought of as the current model prediction, R_t + \gamma \max_a Q(S_{t+1}, a) as the target and x_t as the gradient with respect to the weights.
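Continuing the sketch from the previous slide (W, rbf_features and q_values as defined there; the learning and discount rates are illustrative), the update becomes:

ALPHA, GAMMA = 0.01, 0.9   # illustrative learning rate and discount rate

def q_update(state: float, action: int, reward: float, next_state: float) -> None:
    """One Q-learning SGD step on the chosen action's weight vector."""
    x = rbf_features(state)                        # dQ/dw_a for a linear model
    target = reward + GAMMA * q_values(next_state).max()
    td_error = target - W[action] @ x              # approximation (TD) error
    W[action] += ALPHA * td_error * x              # steepest-descent adjustment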
32. Value Function Model Convergence
(1st episode vs. whole run)
Note: state denotes the application acceptance rate during the previous iteration; action denotes the acceptance threshold for the following iteration; value is the prediction of the Value Function model for a particular state-action pair; optimum marks the state-action pair with the highest value in the state-action space.
33. Results of the t-test for Various Distortion Scenarios

Scenario                                               t-statistic   p-value
Scenario 1: downwards shift in score distribution      29.56631      1.55e-51
Scenario 2: upwards shift in score distribution        42.72066      2.45e-66
Scenario 3: downwards shift in default rates           5.172688      5.95e-07
Scenario 4: upwards shift in default rates             4.600158      6.20e-06

Note: the t-test null hypothesis is that the mean difference between the episode reward received by the RL agent and the episode reward received using the traditional approach throughout 100 episodes is equal to or lower than zero. A sketch of the test follows below.
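For reference (an illustrative reproduction, assuming the 100 per-episode reward differences are available as an array; the data below is a placeholder, not the thesis results):

import numpy as np
from scipy import stats

# Hypothetical per-episode reward differences (RL minus traditional approach)
diff = np.random.default_rng(1).normal(5.0, 10.0, size=100)   # placeholder data only

# One-tailed one-sample t-test, H0: mean difference <= 0 (requires SciPy >= 1.6)
t_stat, p_value = stats.ttest_1samp(diff, popmean=0.0, alternative='greater')
print(f"t = {t_stat:.2f}, one-tailed p = {p_value:.2e}")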
34. Credit Scoring Literature
• Thomas, Lyn C., David B. Edelman, and Jonathan N. Crook. Credit Scoring and Its Applications. SIAM, Philadelphia (2017);
• Crook, Jonathan N., David B. Edelman, and Lyn C. Thomas. "Recent developments in consumer credit risk assessment." European Journal of Operational Research 183.3 (2007): 1447-1465;
• Hand, David J. "Measuring classifier performance: a coherent alternative to the area under the ROC curve." Machine Learning 77.1 (2009): 103-123;
• Verbraken, Thomas, et al. "Development and application of consumer credit scoring models using profit-based classification performance measures." European Journal of Operational Research 238.2 (2014): 505-513;
• Viaene, Stijn, and Guido Dedene. "Cost-sensitive learning and decision making revisited." European Journal of Operational Research 166.1 (2005): 212-220;
• Lessmann, Stefan, et al. "Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research." European Journal of Operational Research 247.1 (2015): 124-136;
• Oliver, R. M., and L. C. Thomas. "Optimal score cutoffs and pricing in regulatory capital in retail credit portfolios." (2009);
• Bellotti, Tony, and Jonathan Crook. "Forecasting and stress testing credit card default using dynamic models." International Journal of Forecasting 29.4 (2013): 563-574.
35. RL Literature
• Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2017;
• Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529;
• Neuneier, Ralph. "Optimal asset allocation using adaptive dynamic programming." Advances in Neural Information Processing Systems. 1996;
• Tesauro, Gerald, et al. "A hybrid reinforcement learning approach to autonomic resource allocation." Autonomic Computing, 2006. ICAC'06. IEEE International Conference on. IEEE, 2006;
• Abe, Naoki, et al. "Optimizing debt collections using constrained reinforcement learning." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010;
• Kim, Byung-Gook, et al. "Dynamic pricing and energy consumption scheduling with reinforcement learning." IEEE Transactions on Smart Grid 7.5 (2016): 2187-2198;
• Sato, Masamichi. "Quantitative Realization of Behavioral Economic Heuristics by Cognitive Category: Consumer Behavior Marketing with Reinforcement Learning." (2016);
• Strydom, Petrus. "Funding optimization for a bank integrating credit and liquidity risk." Journal of Applied Finance and Banking 7.2 (2017): 1;
• Aihe, David O., and Avelino J. Gonzalez. "Correcting flawed expert knowledge through reinforcement learning." Expert Systems with Applications 42.17-18 (2015): 6457-6471;
• Rana, Rupal, and Fernando S. Oliveira. "Dynamic pricing policies for interdependent perishable products or services using reinforcement learning." Expert Systems with Applications 42.1 (2015): 426-436;
• Varela, Martín, Omar Viera, and Franco Robledo. "A Q-learning approach for investment decisions." Trends in Mathematical Economics. Springer, Cham, 2016. 347-368.