1. Linear Probability Models and Big Data:
Prediction, Inference and Selection Bias
Suneel Chatla
Galit Shmueli
Institute of Service Science
National Tsing Hua University
Taiwan
2. Outline
Introduction to binary outcome models
Motivation: rare use of LPM
Study goals
o Estimation and inference
o Classification
o Selection bias
Simulation study
eBay data (in paper)
Conclusions
5. The purpose of binary-outcome regression models?
• Inference and estimation
• Selection bias
• Prediction (classification)
6. Summary of IS literature (MISQ, JAIS, ISR, and MS: 2000–2016)
• Inference and estimation: 60
• Selection bias: 31
• Classification and prediction: 5
Only 8 of these papers used LPM; 3 are from this year alone.

"Implementing a campaign fixed effects model with multinomial logit is challenging due to the incidental parameter problem, so we opt to employ LPM …" – Burtch et al. (2016)

"The LPM is simple for both estimation and inference. LPM is fast and it allows for a reasonably accurate approximation of true preferences." – Schlereth & Skiera (2016)
8. Criticisms
Comparison of the three models in terms of their theoretical properties
(✔ = criticism does not apply, ✖ = criticism applies, ✔✖ = depends on specification):

          Non-normal   Non-constant     Unbounded     Functional
          error        error variance   predictions   form
Logit     ✔            ✔                ✔             ✔✖
Probit    ✔            ✔                ✔             ✔✖
LPM       ✖            ✖                ✖             ✖
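The "unbounded predictions" criticism is easy to see in simulation. Below is a minimal, hypothetical sketch (not from the paper): binary outcomes are generated from a steep logistic model, an LPM is fit by one-predictor OLS, and some fitted "probabilities" fall outside [0, 1].

```python
import math
import random

random.seed(0)

# Simulate binary outcomes from a steep logistic model.
n = 2000
x = [random.uniform(-5, 5) for _ in range(n)]
p = [1 / (1 + math.exp(-2 * xi)) for xi in x]        # true P(z = 1 | x)
z = [1 if random.random() < pi else 0 for pi in p]   # observed binary outcome

# LPM = plain OLS of z on x: slope = cov(x, z) / var(x).
mx, mz = sum(x) / n, sum(z) / n
sxx = sum((xi - mx) ** 2 for xi in x)
slope = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z)) / sxx
intercept = mz - slope * mx

fitted = [intercept + slope * xi for xi in x]
print(min(fitted), max(fitted))  # some fitted "probabilities" escape [0, 1]
```

Trimming fitted values to [0, 1] (as recommended later in the deck) removes the symptom but not the linear functional form.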
12. Latent Framework
Y(n×1) = X(n×(p+1)) β((p+1)×1) + ε(n×1), where Y is a latent continuous variable (not observable).

Z = 1 if Y > 0, and Z = 0 otherwise.

The assumed distribution of the latent error ε determines the model:
• Logistic(0,1) → logit model
• N(0,1) → probit model
• U(0,1) → linear probability model
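The latent framework can be sketched in a few lines of code. This is an illustrative toy (not the authors' code); note the uniform error is centered at 0 here so that P(Z = 1) is linear in the index over its middle range, which is one common way to write the LPM's latent error.

```python
import math
import random

random.seed(1)

def draw_error(dist):
    """Draw one latent error; the assumed distribution picks the model."""
    u = random.random()
    if dist == "logistic":        # Logistic(0,1)   -> logit model
        return math.log(u / (1 - u))
    if dist == "normal":          # N(0,1)          -> probit model
        return random.gauss(0, 1)
    return u - 0.5                # centered uniform -> linear probability model

def simulate(beta0, beta1, n, dist):
    """Latent y = beta0 + beta1*x + eps (unobservable); observed z = 1{y > 0}."""
    data = []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = beta0 + beta1 * x + draw_error(dist)
        data.append((x, 1 if y > 0 else 0))
    return data

data = simulate(0.0, 1.0, 20000, "logistic")
print(sum(z for _, z in data) / len(data))  # ≈ 0.5 when beta0 = 0 (symmetry)
```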
13. The MLEs of both logit and probit are consistent: β̂ → β in probability.
LPM estimates are proportionally and directionally consistent (Billinger, 2012): β̂_lpm → kβ in probability as n grows, for some constant k.
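Proportional consistency can be checked numerically. In the hypothetical setup below (logit data, a single x ~ N(0,1) predictor), Stein's lemma implies the proportionality constant is k = E[p(x)(1 − p(x))], so the LPM slope should land near kβ rather than β:

```python
import math
import random

random.seed(2)

# Generate data from a logit model, then fit it by OLS (the LPM).
n = 50000
beta = 1.0
x = [random.gauss(0, 1) for _ in range(n)]
p = [1 / (1 + math.exp(-beta * xi)) for xi in x]
z = [1 if random.random() < pi else 0 for pi in p]

mx, mz = sum(x) / n, sum(z) / n
lpm_slope = (sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
             / sum((xi - mx) ** 2 for xi in x))

k = sum(pi * (1 - pi) for pi in p) / n   # Monte Carlo estimate of k
print(lpm_slope, k * beta)               # close to each other; same sign as beta
```

The slope is attenuated (k < 1) but carries the right sign, which is the "directionally consistent" part of the claim.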
14. Marginal effects for interpreting effect size
For LPM (easy interpretation):
  ME for x_ik = ∂E[z_i]/∂x_k = β_k
For the logit model (no direct interpretation of the coefficient):
  ME for x_ik = ∂E[z_i]/∂x_k = [e^(x_i β) / (1 + e^(x_i β))²] β_k
For the probit model (no direct interpretation of the coefficient):
  ME for x_ik = ∂E[z_i]/∂x_k = φ(x_i β) β_k
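The three formulas translate directly into code. A small sketch (all numeric values are made-up examples, not from the paper):

```python
import math

def me_lpm(beta_k):
    # LPM: the ME is just the coefficient, constant across observations.
    return beta_k

def me_logit(xb, beta_k):
    # Logit: ME = e^{x_i b} / (1 + e^{x_i b})^2 * beta_k
    e = math.exp(xb)
    return e / (1 + e) ** 2 * beta_k

def me_probit(xb, beta_k):
    # Probit: ME = phi(x_i b) * beta_k, with phi the standard normal density.
    return math.exp(-xb ** 2 / 2) / math.sqrt(2 * math.pi) * beta_k

# At x_i b = 0 the logit ME peaks at 0.25*beta_k, the probit at ~0.399*beta_k,
# while the LPM gives the same beta_k everywhere.
print(me_lpm(0.3), me_logit(0.0, 0.3), me_probit(0.0, 0.3))
```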
15. Simulation study
• Sample sizes: {50, 500, 50,000}
• Error distributions: {Logistic, Normal, Uniform}
• 100 bootstrap samples
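A toy version of this design can be sketched as follows (this is not the authors' simulation code: only two sample sizes are run to keep it fast, and the model is OLS on a single predictor):

```python
import math
import random

random.seed(3)

def ols_slope(pairs):
    """Slope of z on x: cov(x, z) / var(x)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    mz = sum(z for _, z in pairs) / n
    num = sum((x - mx) * (z - mz) for x, z in pairs)
    den = sum((x - mx) ** 2 for x, _ in pairs)
    return num / den

def make_data(n, dist):
    """Binary data from the latent model y = x + eps, z = 1{y > 0}."""
    out = []
    for _ in range(n):
        x = random.gauss(0, 1)
        if dist == "logistic":
            u = random.random()
            eps = math.log(u / (1 - u))
        elif dist == "normal":
            eps = random.gauss(0, 1)
        else:                                 # centered uniform
            eps = random.uniform(-0.5, 0.5)
        out.append((x, 1 if x + eps > 0 else 0))
    return out

# Design grid: sample sizes x error distributions, with 100 bootstrap
# resamples per cell to estimate the sampling SD of the LPM slope.
results = {}
for n in (50, 500):                           # 50,000 omitted for speed
    for dist in ("logistic", "normal", "uniform"):
        data = make_data(n, dist)
        slopes = []
        for _ in range(100):                  # 100 bootstrap samples
            boot = [random.choice(data) for _ in range(n)]
            slopes.append(ols_slope(boot))
        m = sum(slopes) / 100
        results[(n, dist)] = math.sqrt(sum((s - m) ** 2 for s in slopes) / 99)
```

As expected, the bootstrap standard errors shrink as the sample size grows.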
23. Quasi-experiments
Like randomized experimental designs, quasi-experiments test causal hypotheses, but they lack random assignment.
Treatment assignment:
● Assigned by experimenter
● Self-selection
25. Selection Bias
Outcome model coefficients (bootstrap): both Heckman's and Olsen's methods perform similarly to the MLE.
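For orientation, Heckman's two-step correction appends the inverse Mills ratio, λ(w) = φ(w)/Φ(w) evaluated at the fitted selection index, as an extra regressor in the outcome equation; Olsen's variant instead builds the correction from an LPM selection equation. The sketch below shows only the correction term itself, not the full two-step estimator:

```python
import math

def inverse_mills(w):
    """lambda(w) = phi(w) / Phi(w): the correction term appended to the
    outcome equation in the second step of Heckman's procedure."""
    phi = math.exp(-w ** 2 / 2) / math.sqrt(2 * math.pi)   # std normal pdf
    Phi = 0.5 * (1 + math.erf(w / math.sqrt(2)))           # std normal cdf
    return phi / Phi

# The ratio is largest for observations least likely to be selected
# (small w), which is where the selection correction matters most.
print(inverse_mills(0.0), inverse_mills(1.0))
```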
26. Bottom line
Inference and estimation:
• Use LPM with a large sample; otherwise logit/probit is preferable.
• With small-sample LPM, use robust standard errors.
Classification:
• Use LPM if the goal is classification or ranking.
• Trim predicted probabilities to [0, 1].
• If probabilities are needed, then logit/probit is preferable.
Selection bias:
• Use LPM if the sample is large.
• If both the selection and outcome models have the same predictors, LPM suffers from multicollinearity.
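The robust-standard-error advice follows from the LPM's built-in heteroskedasticity (the error variance is p(1 − p), which varies with x). A minimal sketch comparing the classical OLS standard error with the HC0 (White) robust version for a one-predictor LPM; the data-generating values are illustrative only:

```python
import math
import random

random.seed(4)

# Binary data whose true probability follows a logistic curve.
n = 400
x = [random.gauss(0, 1) for _ in range(n)]
p = [1 / (1 + math.exp(-xi)) for xi in x]
z = [1 if random.random() < pi else 0 for pi in p]

# One-predictor LPM via OLS.
mx, mz = sum(x) / n, sum(z) / n
sxx = sum((xi - mx) ** 2 for xi in x)
slope = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z)) / sxx
intercept = mz - slope * mx
resid = [zi - (intercept + slope * xi) for xi, zi in zip(x, z)]

# Classical SE assumes a constant error variance.
s2 = sum(e ** 2 for e in resid) / (n - 2)
se_classical = math.sqrt(s2 / sxx)

# HC0 robust SE weights each squared residual by its own regressor deviation,
# so it stays valid when the variance changes with x.
se_robust = math.sqrt(sum(((xi - mx) * ei) ** 2
                          for xi, ei in zip(x, resid)) / sxx ** 2)

print(se_classical, se_robust)
```

In practice one would use a library implementation (e.g. a heteroskedasticity-consistent covariance option in a regression package) rather than hand-rolling the formula.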
27. Thank you!
Chatla, S., & Shmueli, G. (2016). An Extensive Examination of Linear Regression Models with a Binary Outcome Variable. Journal of the Association for Information Systems (accepted).
Editor's notes
Here is the outline of my presentation. First, I'm going to provide a brief introduction to the primary binary response models, including LPM, and talk about the motivation for our study. Then I'll move on to examine the usage of LPM under different study goals, namely estimation and inference, classification, and selection bias. Finally, I'd like to discuss the simulation study and the results. I'll then conclude with guidelines about when the usage of LPM is appropriate and when it is not. I will be very happy to answer questions at any time during the presentation.
It actually tells us two things: 1. LPM is definitely not very popular. 2. People are still using it, probably because it has some advantages over the other competitive models.
Change Y(n×1) to the β1, … βk notation.
Do we really need k? We need it if we want to retrieve the original coefficients.