Presentation at special event "To Explain or To Predict?" at Tel Aviv University, July 9, 2012. Event co-organized by the Israel Statistical Association and Tel Aviv University's Department of Statistics and OR.
6. Statistical modeling in social science research
Purpose: test causal theory (“explain”)
Association-based statistical models
Prediction nearly absent
7. Explanatory modeling à-la social sciences
Start with a causal theory
Generate causal hypotheses on constructs
Operationalize constructs → Measurable variables
Fit statistical model
Statistical inference → Causal conclusions
8. In the social sciences, data analysis is mainly used for testing causal theory.
“If it explains, it predicts”
9. “Empirical prediction alone is un-scientific”
Some statisticians share this view:
“The two goals in analyzing data... I prefer to describe as ‘management’ and ‘science’. Management seeks profit... Science seeks truth.”
- Parzen, Statistical Science 2001
11. Why Predict? for Scientific Research
new theory
develop measures
compare theories
improve theory
assess relevance
predictability
Shmueli & Koppius, “Predictive Analytics in IS Research” (MISQ, 2011)
12. “A good explanatory model will also predict well”
“You must understand the underlying causes in order to predict”
13. Philosophy of Science
“Explanation and prediction have the same logical structure”
Hempel & Oppenheim, 1948
“It becomes pertinent to investigate the possibilities of predictive procedures autonomous of those used for explanation”
Helmer & Rescher, 1959
“Theories of social and human behavior address themselves to two distinct goals of science: (1) prediction and (2) understanding”
Dubin, Theory Building, 1969
20. Four aspects Y=F(X)
Y=f(X)
1. Theory – Data
2. Causation – Association
3. Retrospective – Prospective
4. Bias – Variance
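Aspect 4 can be made concrete with the standard decomposition of expected prediction error. This is a sketch under assumptions not stated on the slide: squared-error loss, Y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and \hat{f} denoting the model estimated from data (a symbol introduced here for illustration).

```latex
\mathrm{EPE}(x)
  = E\big[(Y - \hat{f}(x))^2\big]
  = \underbrace{\sigma^2}_{\text{irreducible noise}}
  + \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\big(\hat{f}(x)\big)}_{\text{estimation variance}}
```

Read this way, explanatory modeling tends to focus on keeping the bias term small, while predictive modeling targets the sum of bias² and variance, so a somewhat biased model with lower variance can predict better.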
21. Predict ≠ Explain
“we tried to benefit from an extensive set of attributes describing each of the movies in the dataset. Those attributes certainly carry a significant signal and can explain some of the user behavior. However… they could not help at all for improving the [predictive] accuracy.”
Bell et al., 2008
22. Predict ≠ Explain
The FDA considers two products bioequivalent if the 90% CI of the relative mean of the generic to brand formulation is within 80%-125%
“We are planning to… develop predictive models for bioavailability and bioequivalence”
Lester M. Crawford, 2005
Acting Commissioner of Food & Drugs
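The 80%-125% rule above is an inference rule on means, not a prediction about individual patients. A minimal sketch of how such a check might look, under loud assumptions: paired measurements, a log-scale analysis, and a normal approximation for the interval (a real bioequivalence analysis would use a crossover design and t-based intervals). The function names and the data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def ratio_ci_90(generic, brand):
    """Approximate 90% CI for the generic/brand geometric mean ratio.

    Assumes paired measurements and a normal approximation (illustrative only).
    """
    diffs = [math.log(g) - math.log(b) for g, b in zip(generic, brand)]
    center = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))
    z = NormalDist().inv_cdf(0.95)  # two-sided 90% interval
    return math.exp(center - z * se), math.exp(center + z * se)

def is_bioequivalent(ci, lower=0.80, upper=1.25):
    """True if the whole interval lies inside the 80%-125% limits."""
    return lower <= ci[0] and ci[1] <= upper

# Hypothetical exposure measurements for the same eight subjects
brand = [100.0, 120.0, 95.0, 110.0, 105.0, 98.0, 115.0, 102.0]
generic = [98.0, 123.0, 97.0, 108.0, 101.0, 100.0, 112.0, 104.0]

ci = ratio_ci_90(generic, brand)
print("90% CI for ratio:", ci, "bioequivalent:", is_bioequivalent(ci))
```

The decision depends only on an interval for the mean ratio; nothing here is evaluated on new, unseen subjects.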
23. Goal Definition → Design & Data Collection → Data Preparation → EDA → Variables? → Methods? → Evaluation, Validation & Model Selection → Model Use & Reporting
24. Study design & data collection
Observational or experiment?
Primary or secondary data?
Instrument (reliability + validity vs. measurement accuracy)
How much data?
How to sample?
Hierarchical data
27. Which Variables?
endogeneity
ex-post availability
causation vs. associations
Multicollinearity? A, B, A*B?
28. Methods / Models
Blackbox / interpretable
Mapping to theory
variance vs. bias
Shrinkage models
ensembles
29. Model fit ≠ Validation
Evaluation, Validation & Model Selection
Explanatory power: Theoretical model, Empirical model, Data
Predictive power: Empirical model, Training data, Holdout data, Over-fitting analysis
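A small simulation can illustrate why fit on the training data is not evidence of predictive power. This is a sketch on synthetic data, not from the talk: a straight-line fit is compared with a 1-nearest-neighbour predictor, which reproduces the training data perfectly. All names and parameters are illustrative assumptions.

```python
import math
import random

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def fit_line(xs, ys):
    """Ordinary least squares for a single predictor; returns a predict function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_1nn(xs, ys):
    """1-nearest-neighbour: predicts the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Synthetic data: linear signal plus noise (assumed for illustration)
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1.0) for x in xs]
train_x, train_y = xs[:100], ys[:100]
hold_x, hold_y = xs[100:], ys[100:]

results = {}
for name, fit in [("line", fit_line), ("1-NN", fit_1nn)]:
    model = fit(train_x, train_y)
    results[name] = (
        rmse(train_y, [model(x) for x in train_x]),  # fit on training data
        rmse(hold_y, [model(x) for x in hold_x]),    # error on holdout data
    )

for name, (train_err, hold_err) in results.items():
    print(f"{name}: training RMSE={train_err:.3f}, holdout RMSE={hold_err:.3f}")
```

The 1-NN predictor has zero training error yet does worse on the holdout set, which is the over-fitting gap the holdout comparison is meant to expose.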
30. Model Use
Explanatory: test causal theory → Inference, Null hypothesis
Predictive: new theory, develop measures, compare theories, improve theory, assess relevance, predictability → Predictive performance, Naïve/baseline, Over-fitting analysis
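Predictive performance is usually judged against a naïve baseline rather than a null hypothesis. A minimal sketch, assuming synthetic data, a one-predictor least-squares model, and the training mean as the baseline (all of these are illustrative choices, not taken from the talk):

```python
import math
import random

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Synthetic data (assumed): y depends linearly on x, plus noise
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(300)]
ys = [1.0 + 0.8 * x + rng.gauss(0, 1.0) for x in xs]
train_x, train_y = xs[:200], ys[:200]
hold_x, hold_y = xs[200:], ys[200:]

# Least-squares line estimated on the training data only
n = len(train_x)
mx, my = sum(train_x) / n, sum(train_y) / n
slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / sum(
    (x - mx) ** 2 for x in train_x
)
intercept = my - slope * mx

# Holdout error of the model vs. the naive "always predict the training mean" rule
rmse_model = rmse(hold_y, [intercept + slope * x for x in hold_x])
rmse_naive = rmse(hold_y, [my] * len(hold_y))

# Relative improvement over the baseline: 0 means no better than naive
skill = 1.0 - rmse_model / rmse_naive
print(f"model RMSE={rmse_model:.3f}, naive RMSE={rmse_naive:.3f}, skill={skill:.2f}")
```

A model with a statistically significant coefficient can still show little or no improvement over the baseline, which is why the two kinds of assessment are reported separately.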
34. The predictive power of an explanatory model has important scientific value
Relevance, reality check, predictability
35. In “explanatory” fields
Prediction underappreciated
Distinction blurred
Unfamiliar with predictive modeling/assessment
“While the value of scientific prediction… is beyond question… the inexact sciences [do not] have… the use of predictive expertise well in hand.”
Helmer & Rescher, 1959