In bankruptcy prediction, the proportion of events is very low, so the event class is often oversampled to mitigate this imbalance. In this paper, we study the influence of the event rate on the discrimination abilities of bankruptcy prediction models. First, the statistical association and significance of public-records and firmographics indicators with bankruptcy were explored. Then the event rate was oversampled from 0.12% to 10%, 20%, 30%, 40%, and 50%, respectively. Seven models were developed: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine, Bayesian Network, and Neural Network. Under each event rate, the models were comprehensively evaluated and compared on the hold-out dataset at their best probability cut-offs, using the Kolmogorov-Smirnov statistic, accuracy, F1 score, Type I error, Type II error, and ROC curve. Results show that the Bayesian Network is the least sensitive to the event rate, while the Support Vector Machine is the most sensitive.
Influence of the Event Rate on Discrimination Abilities of Bankruptcy Prediction Models
Lili Zhang1, Jennifer Priestley2, and Xuelei Ni3
1Program in Analytics and Data Science, Kennesaw State University, Georgia, USA
2Analytics and Data Science Institute, Kennesaw State University, Georgia, USA
3Department of Statistics, Kennesaw State University, Georgia, USA
Introduction
• Bankruptcy prediction has been studied since the 1960s to improve decision making.
• To improve its discrimination ability, efforts have been made with respect to significant predictors, algorithms, and sampling.
Related Work
• Altman: financial ratios (Working Capital/Total Assets, etc.) + multiple discriminant analysis.
• Odom: first application of neural networks.
• Shin: support vector machines outperformed neural networks on small training datasets.
• Zibanezhad: decision trees + identification of the most important predictors.
• Sun: Bayesian networks + influence of variable selection and discretization.
• Kim: geometric-mean-based boosting algorithm to address class imbalance.
• Zhou: effects of sampling methods on five models.
• etc.
• Random Forest: trains trees independently, each on its own bootstrap dataset.
• Gradient Boosting: each new tree uses the information of previously trained trees.
• Support Vector Machine: maximizes the margin between the classes.
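The bootstrap resampling that distinguishes Random Forest from Gradient Boosting can be sketched in a few lines. This is an illustrative stand-in, not the models' actual implementation: each tree in a forest would be fit on one such resampled dataset.

```python
import random

def bootstrap_sample(data, rng):
    """Draw one bootstrap dataset: n rows sampled with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(42)
rows = list(range(10))  # toy stand-in for training rows
datasets = [bootstrap_sample(rows, rng) for _ in range(3)]
# Each tree in the forest would be trained on one of these datasets;
# on average a bootstrap sample omits roughly 37% of the original rows.
```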
• Neural Network: transforms inputs through activation functions between layers, trained to minimize a loss function.
• Bayesian Network: represents probabilistic relationships and conditional dependencies via a directed acyclic graph.
Data
Information on 11,787,287 U.S. companies in the 4th quarters of 2012 and 2013:
• Market Participant Identifier.
• Public Records: judgments and liens.
• Firmographics: industry, location, size, and structure.
• Bankruptcy Indicator.
• Create the dependent variable.
• Drop an interval variable if its percentage of coded or missing values is greater than 30%.

BrtInd 2012   BrtInd 2013   BrtIndChg
0             1             1
0             0             0
Table 1. Creation of the Dependent Variable BrtIndChg.
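Table 1's rule can be sketched as a small function; the name `brt_ind_chg` is illustrative, and the sketch assumes (as Table 1 implies) that only companies not already bankrupt in 2012 Q4 are scored:

```python
def brt_ind_chg(brt_ind_2012, brt_ind_2013):
    """Table 1 rule: the event (1) is a company that was not bankrupt
    in 2012 Q4 but is bankrupt in 2013 Q4; otherwise the value is 0."""
    return 1 if (brt_ind_2012 == 0 and brt_ind_2013 == 1) else 0

assert brt_ind_chg(0, 1) == 1  # first row of Table 1
assert brt_ind_chg(0, 0) == 0  # second row of Table 1
```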
• Drop observations with abnormal/incorrect values in an interval or categorical variable when such values make up less than 5% of that variable.
• Discretize continuous variables.
• Among several correlated variables, retain the one with the best predictive ability.
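The 5% screening rule for abnormal observations can be sketched as follows. This is a simplified illustration under the assumption that when abnormal values are rare (under 5%) they are simply dropped, while more heavily affected variables are left for separate treatment; the function name and threshold handling are illustrative.

```python
def drop_abnormal_observations(column, valid, max_share=0.05):
    """Drop observations whose value is outside `valid`, but only when
    such abnormal values make up less than max_share of the variable;
    otherwise leave the column untouched for separate treatment."""
    abnormal = sum(1 for v in column if v not in valid)
    if 0 < abnormal / len(column) < max_share:
        return [v for v in column if v in valid]
    return column

col = ["Y"] * 30 + ["N"] * 18 + ["?", "?"]  # 2 abnormal values out of 50 (4%)
cleaned = drop_abnormal_observations(col, valid={"Y", "N"})
# 4% < 5%, so the two "?" observations are dropped
```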
Variable              Type      Description
MPID                  Nominal   Market Participant Identifier
BrtIndChg             Binary    Bankruptcy Indicator Change
curLiensJudInd        Nominal   Current Liens/Judgment Indicator
histLiensJudInd       Nominal   Historical Liens/Judgment Indicator
Industry              Nominal   Industry
LargeBusinessInd      Nominal   Large Business Indicator
Region                Nominal   Geographical Region
PublicCompanyFlag     Nominal   Public Company Flag
SubsidiaryInd         Nominal   Subsidiary Indicator
MonLstRptDatePlcRec   Interval  Number of Months Since Last Report Date on Public Records
Table 2. Variables for Analysis and Modelling.
Exploratory Analysis
Effect                    Odds Ratio   95% Confidence Interval   Chi-Square p-value
curLiensJudInd 0 vs 1     0.529        [0.447, 0.627]            <.0001
histLiensJudInd 0 vs 1    0.680        [0.601, 0.768]            <.0001
LargeBusinessInd N vs Y   0.542        [0.474, 0.620]            <.0001
LargeBusinessInd U vs Y   0.202        [0.165, 0.249]
PublicCompanyFlag N vs Y  0.295        [0.104, 0.838]            0.065
PublicCompanyFlag U vs Y  0.370        [0.138, 0.989]
SubsidiaryInd N vs Y      1.745        [0.997, 3.053]            <.0001
SubsidiaryInd U vs Y      0.411        [0.261, 0.648]
Industry 1 vs 8           1.538        [0.947, 2.496]            <.0001
Industry 2 vs 8           3.085        [1.118, 8.514]
Industry 3 vs 8           2.079        [1.545, 2.797]
Industry 4 vs 8           1.971        [1.365, 2.847]
Industry 5 vs 8           1.648        [1.136, 2.392]
Industry 6 vs 8           2.421        [1.704, 3.439]
Industry 7 vs 8           1.386        [1.033, 1.859]
Industry 9 vs 8           1.348        [1.012, 1.795]
Industry 10 vs 8          0.885        [0.216, 3.629]
Industry U vs 8           0.473        [0.343, 0.651]
Region 1 vs 9             0.699        [0.479, 1.019]            <.0001
Region 2 vs 9             0.443        [0.358, 0.549]
Region 3 vs 9             0.627        [0.505, 0.779]
Region 4 vs 9             0.913        [0.686, 1.215]
Region 5 vs 9             0.636        [0.525, 0.772]
Region 6 vs 9             1.203        [0.928, 1.558]
Region 7 vs 9             1.084        [0.875, 1.343]
Region 8 vs 9             1.194        [0.920, 1.549]
MonLstRptDatePlcRec       0.971        [0.969, 0.973]            <.0001
Table 4. Univariate Odds Ratio and Chi-Square p-value (the p-value applies to the effect as a whole).
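The univariate odds ratios in Table 4 come from 2x2 cross-tabulations of each predictor level against the bankruptcy indicator. A minimal sketch of how an odds ratio and its 95% Wald interval are computed (the cell counts below are hypothetical for illustration; the real counts come from the dataset and are not shown in the slides):

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio for a 2x2 table with its 95% Wald confidence interval.
    a, b: events / non-events at the compared level;
    c, d: events / non-events at the reference level."""
    or_ = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log odds ratio
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper

# Hypothetical counts: 53 vs 100 events per 10,000 non-events.
or_, lo, hi = odds_ratio_with_ci(53, 10_000, 100, 10_000)
# or_ is 0.53: the compared level has about half the odds of bankruptcy
```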
Methodology
Sampling:
• Oversample the event rate from 0.12% to 10%, 20%, 30%, 40%, and 50%, respectively, undersampling the proportion of non-events from 99.88% to 90%, 80%, 70%, 60%, and 50% correspondingly.
• Train/validation split: 70% vs. 30%.
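The resampling step above (keep every event, undersample the non-events until the events reach the target rate) can be sketched as follows; the counts are toy stand-ins for the real dataset, and the function name is illustrative:

```python
import random

def resample_to_event_rate(events, nonevents, target_rate, rng):
    """Keep every event and undersample the non-events so that events
    make up target_rate of the returned dataset."""
    n_nonevents = round(len(events) * (1 - target_rate) / target_rate)
    return events + rng.sample(nonevents, n_nonevents)

rng = random.Random(0)
events = ["e"] * 120         # toy rare class: 0.12% of 100,000
nonevents = ["n"] * 99_880   # toy majority class
resampled = {r: resample_to_event_rate(events, nonevents, r, rng)
             for r in (0.10, 0.20, 0.30, 0.40, 0.50)}
# each resampled dataset contains all 120 events at the requested rate
```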
Model Development and Evaluation:
• Platform: SAS Enterprise Miner 14.1.
• Tune the hyperparameters of every model to their best performance.
• Logistic Regression: backward selection + significance level 0.05.
• Decision Tree: entropy.
• Random Forest and Gradient Boosting: Gini index.
• Support Vector Machine: linear kernel.
• Neural Network: tanh in the hidden layer + sigmoid in the output layer + 3 hidden units.
• Bayesian Network: G-square test + significance level 0.2.
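The models are then compared at their best probability cut-offs using, among other measures, the Kolmogorov-Smirnov statistic. A minimal sketch of computing the KS statistic from model scores, with toy score values (not the paper's results):

```python
def ks_statistic(event_scores, nonevent_scores):
    """Kolmogorov-Smirnov statistic: the maximum gap between the
    empirical score CDFs of events and non-events, together with the
    score threshold at which that gap occurs."""
    best_gap, best_cut = 0.0, None
    for t in sorted(set(event_scores) | set(nonevent_scores)):
        cdf_event = sum(s <= t for s in event_scores) / len(event_scores)
        cdf_nonevent = sum(s <= t for s in nonevent_scores) / len(nonevent_scores)
        gap = abs(cdf_nonevent - cdf_event)
        if gap > best_gap:
            best_gap, best_cut = gap, t
    return best_gap, best_cut

# Toy scores: the model tends to score events higher than non-events.
ks, cut = ks_statistic([0.8, 0.9, 0.7, 0.6], [0.2, 0.3, 0.4, 0.7])
# ks is 0.75, reached at the score threshold 0.4
```

The threshold at which the gap peaks is one common choice for the "best probability cut-off" at which the other measures (accuracy, F1, Type I/II errors) are then read.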
Effect                    Odds Ratio   Chi-Square p-value
curLiensJudInd 0 vs 1     0.573        0.0046
histLiensJudInd 0 vs 1    0.508        <.0001
LargeBusinessInd N vs Y   0.796        <.0001
LargeBusinessInd U vs Y   0.332
Region 1 vs 9             1.067        0.0002
Region 2 vs 9             0.411
Region 3 vs 9             0.583
Region 4 vs 9             0.839
Region 5 vs 9             0.558
Region 6 vs 9             0.858
Region 7 vs 9             0.881
Region 8 vs 9             1.261
MonLstRptDatePlcRec       0.976        <.0001
Table 9. Multivariate Odds Ratio and Chi-Square p-value (the p-value applies to the effect as a whole).
Discussions and Conclusions
• The statistical association and significance of public-records and firmographics predictors are shown.
• The event rate influences the performance of different classification models in different ways: the Bayesian Network is the least sensitive to the event rate, while the Support Vector Machine is the most sensitive.
• Practitioners may examine the performance measures comprehensively at different probability cut-offs and handle the trade-offs among them, as well as model interpretability, based on their expectations.
Future Work
• Try different sampling techniques, such as SMOTE, to adjust the proportions of events and non-events and further examine their influence.
• Include financial-ratio predictors to improve model performance, and test the models over a wider time span.
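Unlike the undersampling used in this study, SMOTE synthesizes new minority-class (event) points rather than discarding majority-class rows. A sketch of its core interpolation step only; full SMOTE also selects the neighbor from the k nearest minority-class samples, which is omitted here:

```python
import random

def smote_point(x, neighbor, rng):
    """SMOTE's interpolation step: synthesize a new minority-class
    point at a random position on the segment between a minority
    sample and one of its minority-class nearest neighbors."""
    lam = rng.random()  # uniform in [0, 1)
    return tuple(xi + lam * (ni - xi) for xi, ni in zip(x, neighbor))

rng = random.Random(1)
a, b = (1.0, 2.0), (3.0, 4.0)   # two toy minority-class points
synthetic = smote_point(a, b, rng)
# each coordinate of the synthetic point lies between those of a and b
```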