Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatment Effect Heterogeneity: Model Parameterization, Prior Choice, and Posterior Summarization - Jared Murray, December 9, 2019
We describe different approaches for specifying models and prior distributions for estimating heterogeneous treatment effects using Bayesian nonparametric models. We make an affirmative case for direct, informative (or partially informative) prior distributions on heterogeneous treatment effects, especially when treatment effect size and treatment effect variation are small relative to other sources of variability. We also consider how to provide scientifically meaningful summaries of complicated, high-dimensional posterior distributions over heterogeneous treatment effects with appropriate measures of uncertainty.
1. Bayesian Nonparametric Models for Treatment
Effect Heterogeneity: Parameterization, Prior
Choice, and Posterior Summarization
Jared S. Murray – The University of Texas at Austin.
Joint work with Carlos Carvalho, P. Richard Hahn, David Yeager, et al.
December 2019
3. National Study of Learning Mindsets
• National Study of Learning Mindsets (Yeager et al., 2019):
Randomized controlled trial of a low-cost mindset intervention
• Probability sample; 65 schools in this analysis (> 11,000 9th-grade students)
• Specifically designed to assess treatment effect heterogeneity
by factors including baseline levels of growth mindset beliefs
and overall school quality/achievement
4. National Study of Learning Mindsets
Desiderata for our analysis:
• Avoid model selection/specification search
• School-level effect heterogeneity explained and unexplained by
covariates
• Individual-level effect heterogeneity explained by covariates
• Interpretable model summaries for communicating results
• Uncertainty!
We can get all of these by modifying Bayesian causal forests (Hahn, Murray, and Carvalho, 2017)
5. Our (generic) assumptions
Strong ignorability:
Yi(0), Yi(1) ⊥⊥ Zi | Xi = xi,
Positivity:
0 < Pr(Zi = 1 | Xi = xi) < 1
for all i. Then
P(Y(z) | x) = P(Y | Z = z, x)  (by ignorability and consistency),
and the conditional average treatment effect (CATE) is
τ(xi) := E(Yi(1) − Yi(0) | xi) = E(Yi | xi, Zi = 1) − E(Yi | xi, Zi = 0).
6. Parameterizing Nonparametric Models of Causal Effects
Let’s forget covariates for a second and pretend we have a simple
RCT.
A simple model:
(Yi | Zi = 0) iid∼ N(µ0, σ²)
(Yi | Zi = 1) iid∼ N(µ1, σ²)
where the estimand of interest is τ ≡ µ1 − µ0.
If µj ∼ N(ϕj, δj) independently (with δj the prior variance), then τ ∼ N(ϕ1 − ϕ0, δ0 + δ1).
Often we have stronger prior information about τ than about µ1 or µ0 – in particular, we expect it to be small.
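To see the variance arithmetic concretely, here is a minimal numpy sketch (prior scales are hypothetical) checking that independent priors on µ0 and µ1 induce a wider prior on τ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical prior scales: both arm means are a priori N(0, 1).
phi0, phi1 = 0.0, 0.0
delta0 = delta1 = 1.0  # prior variances on mu_0, mu_1

mu0 = rng.normal(phi0, np.sqrt(delta0), size=100_000)
mu1 = rng.normal(phi1, np.sqrt(delta1), size=100_000)
tau = mu1 - mu0

# ~2.0 = delta0 + delta1: the induced prior on tau is *wider* than on either mean
print(np.var(tau))
```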
7. Parameterizing Nonparametric Models of Causal Effects
A more natural parameterization:
(Yi | Zi = 0) iid∼ N(µ, σ²)
(Yi | Zi = 1) iid∼ N(µ + τ, σ²)
where the estimand of interest is still τ.
Now we can express prior beliefs on τ directly.
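As an illustration (not from the talk), here is a minimal sketch of this reparameterization in PyMC, with hypothetical prior scales; the informative N(0, 0.2²) prior on τ encodes "the effect is probably small" while leaving the baseline mean vague:

```python
import numpy as np
import pymc as pm

# Simulated two-arm data (hypothetical sizes and effect).
rng = np.random.default_rng(0)
z = rng.binomial(1, 0.5, size=200)
y = rng.normal(0.5 + 0.1 * z, 1.0)

with pm.Model():
    mu = pm.Normal("mu", 0.0, 10.0)     # vague prior on the control-arm mean
    tau = pm.Normal("tau", 0.0, 0.2)    # informative prior: effect expected small
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("y_obs", mu + tau * z, sigma, observed=y)
    idata = pm.sample()                 # posterior for tau is regularized directly
```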
8. Parameterizing Nonparametric Models of Causal Effects
How does this relate to models for heterogeneous treatment effects?
Consider (mostly) separate models for treatment arms:
yi = f_{zi}(xi) + ϵi,   ϵi ∼ P
Independent priors on f0, f1 → prior on τ(x) ≡ f1(x) − f0(x) has larger
variance than prior on f0 or f1
No direct prior control – simple forms for f0, f1 can compose to
complex τ.
Every variable in x is necessarily a potential effect modifier.
9. Parameterizing Nonparametric Models of Causal Effects
What about the “just another covariate” parameterization?
yi = f(xi, zi) + ϵi,   ϵi ∼ P
Then the heterogeneous treatment effects are given by
τ(x) ≡ f(x, 1) − f(x, 0),
and every variable in x is still a potential effect modifier, and we
have no direct prior control for most priors on f...
10. Parameterizing Nonparametric Models of Causal Effects
Set f(xi, zi) = µ(xi) + τ(wi)zi, where w is a subset of x:
yi = µ(xi) + τ(wi)zi + ϵi, ϵi ∼ P
The heterogeneous treatment effects are given by τ(w) directly, and
can be easily regularized.
In Hahn et al. (2017) we use independent Bayesian additive regression tree (BART) priors on µ and τ (“Bayesian causal forests”)
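A simulation sketch of this decomposition (surfaces and sizes hypothetical), showing why separating µ and τ gives direct prior control:

```python
import numpy as np

# Sketch of the BCF decomposition: the prognostic part mu(x) may be complex,
# while tau(w), defined on a subset w of x, is kept simple and small.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=(n, 3))
w = x[:, :1]                                  # chosen subset of effect modifiers
z = rng.binomial(1, 0.5, size=n)              # randomized treatment

mu = np.sin(2 * x[:, 0]) + x[:, 1] ** 2       # complicated prognostic surface
tau = 0.10 + 0.05 * w[:, 0]                   # small, simple effect surface
y = mu + tau * z + rng.normal(0, 1, size=n)
# In BCF, mu and tau each get a BART prior; tau's prior is shrunk harder
# (smaller trees, more stumps), expressing "effects are probably small/simple".
```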
11. Building Functions with Regression Trees
Represent each function µ and τ as the sum of many small
regression trees:
f(x) = ∑h g(x; Th, Mh),
where Th is the h-th tree’s structure and Mh its leaf parameters.
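A minimal Python sketch of evaluating such a sum of trees (the tree representation here is hypothetical; real BART samples the Th and Mh from their posterior):

```python
import numpy as np

# Internal nodes split on (feature index, threshold); leaves hold parameters M_h.
def eval_tree(node, x):
    """Route a single observation x through one regression tree g(x; T, M)."""
    while "leaf" not in node:
        j, c = node["split"]
        node = node["left"] if x[j] <= c else node["right"]
    return node["leaf"]

def sum_of_trees(trees, x):
    """f(x) = sum_h g(x; T_h, M_h): many small trees, each a weak learner."""
    return sum(eval_tree(t, x) for t in trees)

# Example: two shallow trees whose contributions add up.
trees = [
    {"split": (0, 0.5), "left": {"leaf": -0.1}, "right": {"leaf": 0.1}},
    {"split": (1, 0.0), "left": {"leaf": 0.05}, "right": {"leaf": -0.05}},
]
print(sum_of_trees(trees, np.array([0.3, 1.2])))  # -0.1 + -0.05 = -0.15
```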
12. Tweaking priors for causal effects
Several adjustments to the BART prior on τ:
• Higher probability on smaller τ trees (than BART defaults)
• Higher probability on “stumps”, trees that never split (all stumps
= homogeneous effects)
• An N⁺(0, v) (half-normal) hyperprior on the scale of the leaf parameters in τ
13. What about observational data?
What changes with (measured) confounding?
Not much, but we should include an estimated propensity score as a
covariate:
yi = µ(xi, π̂(xi)) + τ(wi)zi + ϵi,   ϵi ∼ P
This mitigates regularization-induced confounding (RIC); see Hahn et al. (2017) for details.
There is substantial independent empirical evidence that this matters (cf. Dorie et al. (2019); Wendling et al. (2019)).
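A sketch of the two-step recipe on simulated data; a logistic propensity model is used purely for illustration (any reasonable estimator can be swapped in):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: X (covariates) and z (treatment).
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
z = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # confounded assignment

# Step 1: estimate the propensity score pi_hat(x) = Pr(Z = 1 | x).
pi_hat = LogisticRegression().fit(X, z).predict_proba(X)[:, 1]

# Step 2: append pi_hat as an extra input to the prognostic model,
# so mu is fit on (x, pi_hat(x)) as in the equation above.
X_mu = np.column_stack([X, pi_hat])
```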
15. Analysis of National Study of Learning Mindsets
• A new analysis of effects on math GPA in the overall population
of students
• Interesting moderators are baseline level of mindset norms,
school achievement, and minority composition
• Many, many other controls
The usual approach is the multilevel linear model...
16. Multilevel BCF
Replaces the linear pieces of a multilevel model with functions that
have BART priors:
yij = αj + µ(xij) + [ϕj + τ(wij)] zij + ϵij,   ϵij ∼ N(0, σ²)
αj, ϕj have standard random effect priors (normal with half-t
hyperpriors on scale)
w = school achievement, baseline mindset norms, minority
composition
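A simulation sketch of this multilevel structure (all values hypothetical; simple linear stand-ins replace the BART fits µ and τ):

```python
import numpy as np

rng = np.random.default_rng(3)
J, n_per = 65, 170                         # schools, students per school
school = np.repeat(np.arange(J), n_per)

alpha = rng.normal(0.0, 0.20, size=J)      # school intercepts alpha_j
phi = rng.normal(0.0, 0.05, size=J)        # school-level effect shifts phi_j
z = rng.binomial(1, 0.5, size=J * n_per)
x = rng.normal(size=J * n_per)

mu = 0.3 * x                               # stand-in for the BART fit mu(x_ij)
tau = 0.05 + 0.02 * x                      # stand-in for the BART fit tau(w_ij)
y = alpha[school] + mu + (phi[school] + tau) * z + rng.normal(0, 1, size=J * n_per)
```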
17. We fit this model, now what?
How do we present our results? Average treatment effect (ATE),
school ATE, plots of (school) ATE by xj....
Two questions from our collaborators:
• For which groups was the treatment most/least effective?
• What is the impact of posited treatment effect modifiers,
holding other modifiers constant?
16
19. Subgroup finding as a decision problem
The action γ is choosing subgroups, here represented by a recursive
partition (tree)
• Minimize the posterior expected loss
γ̂ = argmin_{γ̃ ∈ Γ} Eτ[d(τ, γ̃) + p(γ̃) | Y, x],
where p(·) is a complexity penalty and d(·) is squared error averaged over a distribution for x.
• With ˆγ in hand we can look at the joint posterior distribution of
subgroup ATEs
Relaxing complexity penalties in the loss function → growing a
deeper tree
(Hahn et al. (2017); Sivaganesan et al. (2017))
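A minimal sketch of this decision-theoretic subgroup search: with squared-error d(·) averaged over the sample, the posterior expected loss is minimized by fitting the tree to the posterior mean of τ(xi), with a depth limit standing in for the complexity penalty (data and shapes hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical inputs: tau_draws is an (S, n) array of posterior draws of
# tau(x_i); X holds candidate effect modifiers for the same n units.
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
tau_draws = 0.05 + 0.03 * (X[:, 0] > 0) + rng.normal(0, 0.01, size=(200, 1000))

# Squared-error d() averaged over the sample => fit the tree to the posterior
# mean of tau(x_i); the complexity penalty corresponds to depth limits/pruning.
tau_bar = tau_draws.mean(axis=0)
tree = DecisionTreeRegressor(max_depth=2).fit(X, tau_bar)

# Subgroups = leaves; inspect the joint posterior of subgroup ATEs per leaf.
leaf = tree.apply(X)
for g in np.unique(leaf):
    print(g, tau_draws[:, leaf == g].mean())
```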
20. [Figure: subgroup tree. Split on school achievement (≤ / > 0.67), then baseline norms (≤ / > 0.53): Lower Achieving/Low Norm, CATE = 0.032, n = 3253; Lower Achieving/High Norm, CATE = 0.073, n = 3265; High Achieving, CATE = 0.016, n = 5023.]
21. [Figure: the same subgroup tree, now with the posterior density of a difference in subgroup ATEs; Pr(diff > 0) = 0.93.]
22. [Figure: comparison of Lower Achieving/High Norm (CATE = 0.073, n = 3265) and High Achieving (CATE = 0.016, n = 5023), with Pr(diff > 0) = 0.81. Growing a deeper tree further splits the low-norm group on achievement (≤ / > 0.38) into Low Achieving/Low Norm (CATE = 0.010, n = 1208) and Mid Achieving/Low Norm (CATE = 0.045, n = 2045).]
23. We fit this model, now what?
How do we present our results? ATE, school ATE, plots of school ATE
by variables....
Two questions we got from our collaborators:
• For which groups was the treatment most/least effective?
• What was the impact of posited treatment effect modifiers,
holding other modifiers constant?
We could generate posterior distributions of individual-level partial effects manually and try to summarize these, or try to approximate τ by a proxy with simple partial effects (i.e., an additive function).
24. Counterfactual treatment effect predictions
• How do estimated treatment effects change in lower achieving/low norm schools if norms increase, holding school minority composition and achievement constant?
• This is not a formal causal mediation/moderation analysis (roughly, we would need “no unmeasured moderators correlated with norms”)
[Figure: the subgroup tree from slide 20, shown again for reference.]
25. [Figure: posterior densities of the Lower Achieving/Low Norm subgroup CATE under counterfactual increases in baseline norms, holding the other modifiers fixed.]

Counterfactual norm increase → subgroup CATE (posterior interval):
• Original: 0.032 (−0.011, 0.076)
• +10%: 0.050 (0.005, 0.097)
• +Half IQR: 0.051 (0.005, 0.099)
• +Full IQR: 0.059 (0.009, 0.114)
1 IQR = 0.6 extra problems on the worksheet task.
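A sketch of how such counterfactual effect predictions can be computed: shift the norms column, hold the other modifiers fixed, and re-evaluate the posterior draws of τ (the helper tau_fn and the column layout are hypothetical):

```python
import numpy as np

# Hypothetical: tau_fn(W) returns an (S, n) array of posterior draws of tau
# evaluated at modifier matrix W (columns: achievement, norms, minority comp).
def counterfactual_cate(tau_fn, W, group, norms_col=1, shift=0.0):
    """Subgroup ATE draws after shifting norms, holding other modifiers fixed."""
    W_cf = W.copy()
    W_cf[:, norms_col] += shift                   # raise norms only
    draws = tau_fn(W_cf)[:, group].mean(axis=1)   # subgroup average per draw
    return np.quantile(draws, [0.025, 0.5, 0.975])
```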
26. An Alternative: Interpretable Summaries
The goal: Examine the “best” (in a user-defined sense) simple
approximation to the “true” τ(x)
Given samples of τ,
1. Consider a class of simple/interpretable approximations Γ to τ
2. Make inference on
γ = argmin_{γ̃ ∈ Γ} d(τ, γ̃) + p(γ̃)
for an appropriate distance function d and complexity penalty
p(γ)
Get draws of γ by solving the optimization for each draw of τ
Also get discrepancy metrics, like a pseudo-R²: Cor²[γ(xi), τ(xi)]
(See Woody, Carvalho, and Murray (2019) for this approach in predictive models.)
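A minimal sketch of the per-draw projection, using a linear class Γ and no penalty for simplicity (the slides use an additive, smoothness-penalized class; data and names are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical inputs: tau_draws (S, n) posterior draws of tau(x_i) and the
# modifiers W (n, p) we want an interpretable summary over.
rng = np.random.default_rng(5)
W = rng.normal(size=(1000, 3))
tau_draws = (0.05 + 0.03 * W[:, 0] + 0.01 * np.sin(3 * W[:, 1])
             + rng.normal(0, 0.01, size=(200, 1000)))

coefs, r2 = [], []
for tau_s in tau_draws:                    # solve the projection per draw
    gamma = LinearRegression().fit(W, tau_s)
    fit = gamma.predict(W)
    coefs.append(gamma.coef_)
    r2.append(np.corrcoef(fit, tau_s)[0, 1] ** 2)   # pseudo-R^2 per draw

print(np.mean(coefs, axis=0), np.quantile(r2, [0.05, 0.95]))
```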
27. Approximate partial effects of treatment effect modifiers
Here we use:
• An additive approximation γ(x)
• d(τ, γ) = ∑ᵢ₌₁ⁿ (τ(xi) − γ(xi))²
• A smoothness penalty p(γ)
Don’t average draws to get a point estimate! Treat
d(τ, γ̃) + p(γ̃) = ∑ᵢ₌₁ⁿ (τ(xi) − γ̃(xi))² + p(γ̃)
as a loss function, and minimize the posterior expected loss:
γ̂ = argmin_{γ̃ ∈ Γ} Eτ[d(τ, γ̃) + p(γ̃) | Y, x]
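For squared-error d(·), the expectation over τ decomposes into a fit-to-posterior-mean term plus a constant, so γ̂ can be computed by fitting the penalized summary to τ̄(xi). A short sketch, with ridge regression standing in (hypothetically) for the smoothness-penalized additive fit:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
W = rng.normal(size=(1000, 3))                     # hypothetical modifiers
tau_draws = 0.05 + 0.03 * W[:, 0] + rng.normal(0, 0.01, size=(200, 1000))

# E_tau[ sum_i (tau(x_i) - g(x_i))^2 ] = sum_i (tau_bar_i - g(x_i))^2 + const,
# so minimizing posterior expected loss = fitting the penalized summary to the
# posterior mean. Ridge's penalty stands in for the smoothness penalty p().
tau_bar = tau_draws.mean(axis=0)
gamma_hat = Ridge(alpha=1.0).fit(W, tau_bar)
print(gamma_hat.coef_)
```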
28. [Figure: approximate additive partial effects of school achievement and baseline mindset norms on the treatment effect, with the posterior density of the approximation R². Partial effect of minority composition not shown.]
29. [Figure: the same approximate additive partial effects and approximation R² density for a second summary fit. Partial effect of minority composition not shown.]
30. Thank you!
• Yeager, David S., Paul Hanselman, Gregory M. Walton, Jared S. Murray, Robert Crosnoe, Chandra Muller, Elizabeth Tipton, et al. “A national experiment reveals where a growth mindset improves achievement.” Nature 573, no. 7774 (2019): 364-369.
• Hahn, P. Richard, Jared S. Murray, and Carlos M. Carvalho. “Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects.” 2017; arXiv:1706.09523.
• Woody, Spencer, Carlos M. Carvalho, and Jared S. Murray. “Model interpretation through lower-dimensional posterior summarization.” 2019; arXiv:1905.07103.