Slides from Dec 3, 2021 talk at University of Minnesota School of Public Health, Epidemiology department.
Lecture topic: how do we ask good causal questions & once we've got our questions framed, how do we answer them?
Lecture recording will be posted to YouTube - date tbd.
Causal inference for complex exposures: asking questions that matter, getting answers that help.
1. @EpiEllie
Causal inference for complex data: Asking questions that matter, getting answers that help
Eleanor Murray, ScD, MPH
Department of Epidemiology, University of Minnesota
December 3, 2021
2. Epidemiology is:
Learning who has which health problems where, and
Figuring out what to do to change that
We can’t achieve these goals unless we ASK the right questions, and ESTIMATE useful answers
Changing the public’s health requires understanding and estimating causal effects!
3. So how do we estimate causal effects?
Miguel Hernán’s two-step causal algorithm:
1. Ask good questions
2. Answer them with appropriate methods
4. How do we ask good questions?
Start with a clear causal question:
What is the exact exposure(s) of interest?
What is the exact comparison group(s) of interest?
What is the exact outcome?
Who do we want to learn about?
7. …and measuring exposure can be even harder
Measurement error for depression
Measurement error for PTSD
Measurement error for suicide
Jiang et al 2020
8. Complex exposures give us complicated answers, even if we can intervene…
Do we always need to give all parts of the intervention?
Does the timing of the intervention or its pieces matter?
How would the intervention work in groups with other types of usual care/comparators?
9. What makes an exposure complex?
Multiple components that make it hard to define or measure
e.g. race, socio-economic position, cognitive behavioral therapy
Interference between individuals
e.g. infections, behaviors & habits, education
Exposures that vary over space
e.g. air pollution, access to goods & services
Exposures that vary over time
e.g. medication usage, unhealthy habits
Simple exposures that could occur at any time
13. …so what about when we can’t* intervene?
We need to be even more careful about the questions we ask!
We need well-defined causal questions!
* whether because of ethical, logistical, financial, or even time constraints
14. Why are well-defined causal questions important for complex exposures?
When there are multiple possible ‘interventions’ and we don’t specify one, our answer is a weighted average of all ‘interventions’, but we don’t know the weights
We call this the “WACE”: weighted average causal effect
Useful for estimating effects of biomarkers in a defined population
Murray, 2016. Agent-based models for causal inference. Harvard University.
Murray, et al. 2019. Medical Decision Making
15. Why are well-defined causal questions important for complex exposures?
Worse, if the ‘intervention’ is ill-defined, the confounding is probably also ill-defined!
Murray, 2016. Agent-based models for causal inference. Harvard University.
Murray, et al. 2019. Medical Decision Making
16. Asking questions that matter for complex exposures: target trial framework
What is the intervention you would do if you could do an experiment?
How would that experiment help you make treatment, policy, or other decisions?
Are we asking a specific enough question to get an answer we can understand and act upon?
17. The best questions lead to action
Does gender cause poverty for tipped workers?
What would the difference in total income be for all tipped workers:
◦ if all tipped workers had been women, versus
◦ if all those same tipped workers had been men?
How do we act upon the answer to that question?
18. The best questions lead to action
If women who worked for tipped wages received the same amount of tips as their male colleagues, would their poverty risk decrease?
What would the difference in total income be for women who work for tips:
◦ if all women received the amount of tips they typically receive, versus
◦ if those same women received the amount of tips that men typically receive?
This question could help us plan policy interventions
Example inspired by research by Dr Sarah Andrea & colleagues
19. The best questions lead to action
Does race cause death from COVID-19?
What would the death rates be from COVID-19:
◦ if all individuals who became infected had been people of color, versus
◦ if all those same individuals had been white people?
How do we act upon the answer to that question?
20. The best questions lead to action
If people of color in America had experienced the same rate of death from COVID-19 as white Americans, how many more people would be alive?
What would the difference in deaths from COVID-19 have been:
◦ if all people of color who were infected had the probability of dying observed in 2020–21, versus
◦ if those same people of color had instead had the same probability of dying as white people who were infected?
This question could help us plan policy interventions
Example inspired by research by Dr Justin Feldman & colleagues
21. What do we want to know?
How would things have changed if the world had been slightly different?
Treat now vs. treat later
22. What do we want to know?
If we can’t have a time machine, we’d like to have a randomized trial.
Treat now vs. treat later
23. Many decisions need to be made NOW
A randomized trial would, in principle, answer these questions…
…but we don’t always have randomized trials, for many reasons
Deferring a decision is not an option
No decision is a decision: “keep the status quo”
Worse: in reality, even perfect randomized trials are hard!
25. What do we want to know?
If we can’t have a randomized trial, we’d like to emulate what would have happened if we could have done one.
Treat now vs. treat later
26. Emulate with randomized trial data
The target trial framework helps us avoid biases caused by:
Informative censoring
Non-random non-adherence
Competing events
Poorly defined causal questions
Generalizability & transportability problems
Incorrect interpretations of trial results
28. Target trial framework also helps us identify our causal estimand (i.e. the target parameter)
1. Intention-to-treat effect: effect of randomization to treatment
(available in randomized trials only)
2. Per-protocol effects: effect of treatment
Effect of initiating treatment
Effect of adhering to treatment protocol
Effect of receiving a point intervention, among the ‘compliers’ (not necessarily all adherers!)
(available in randomized trials *and* observational studies)
Hernán & Robins, 2016. N
29. Why does the estimand matter?
The target trial framework clarifies that these studies aren’t asking the same question, because the causal estimands are different!
30. Emulate with observational data
But if we don’t have a trial, we can also emulate the target trial with observational data
Problem: observational data is hard to analyze
Who should we include?
When did baseline start?
Why did exposure happen (or not)?
31. Observational studies have potential for some (relatively) unique biases
Baseline confounding
Time zero (immortal time bias)
Structural positivity violations
Ill-defined causal questions
32. Target trial framework addresses all these biases & more
Explicit definition of baseline
Explicit description of target population & inclusion/exclusion criteria
Explicit description of well-defined interventions
Clarifies the causal question
What do we really want to know?
33. The target trial helps us ask better questions
Are we asking a specific enough question to get an answer we can understand?
Will our question guide decision making?
34. Emulate with observational data
Problem: observational data is hard to analyze
Who should we include?
When did baseline start?
Why did exposure happen (or not)?
Solution: Target trial concept with g-methods
Parametric g-formula
Inverse probability weighting
G-estimation
35. A quick handshake intro to g-methods
1. Inverse probability weighting (aka IPW) of marginal structural models
2. (Parametric) g-formula
3. Doubly-robust estimation (aka targeted maximum likelihood estimation, or TMLE, if estimated using machine learning)
4. G-estimation of structural nested models
37. G-methods are roughly similar to…
Inverse probability weighting ≈ propensity scores
(Parametric) g-formula ≈ standardization
G-estimation ≈ instrumental variables
Use g-methods when you have treatment-confounder feedback, and you would normally use the methods on the right.
38. Why use inverse probability weights?
Inverse probability weighting is a way of correcting for missing information.
We can correct for:
Loss to follow-up: missing outcome under assigned treatment
Non-adherence: missing counterfactual outcome, had they received assigned treatment
Other missingness: e.g. when visits are missed but we wanted to look at info collected at those visits.
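The weighting step can be sketched in a few lines. This is a minimal, hypothetical illustration (made-up records, one binary confounder), not the analysis from the talk: each person is weighted by the inverse of the probability of the treatment they actually received given the confounder, creating a pseudo-population in which treatment is independent of the confounder.

```python
# Minimal sketch of inverse probability of treatment weighting (IPW).
# Records and probabilities are hypothetical, for illustration only.

# Each record is (L, A, Y): confounder, treatment, outcome
data = [
    (0, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 1),
    (1, 1, 1), (1, 0, 1), (1, 0, 1), (1, 0, 0),
]

def pr_treat(l):
    """Empirical propensity score Pr(A=1 | L=l)."""
    rows = [r for r in data if r[0] == l]
    return sum(r[1] for r in rows) / len(rows)

def weight(l, a):
    """Inverse probability of the treatment actually received."""
    p = pr_treat(l)
    return 1 / p if a == 1 else 1 / (1 - p)

def weighted_risk(arm):
    """Outcome risk in one arm of the weighted pseudo-population."""
    wy = [(weight(l, a), y) for (l, a, y) in data if a == arm]
    return sum(w * y for w, y in wy) / sum(w for w, _ in wy)

# IPW-standardized risk difference (confounding by L removed)
print(weighted_risk(1) - weighted_risk(0))
```

A handy sanity check: within each arm the weights sum to the total sample size, because each arm's pseudo-population represents the whole study population.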
39. What is the parametric g-formula?
A generalization (g) of standardization to time-varying settings
An equation (formula) that relates the observational data to the counterfactual data
Solved using Monte-Carlo simulation, which relies on (parametric) modeling assumptions
40. The general formula for the parametric g-formula
For a single time point of exposure:
Pr(Y^a = 1) = Σ_l Pr(Y = 1 | A = a, L = l) · Pr(L = l)
The probability of the counterfactual outcome (Y^a) if everyone received exposure level A = a is
the average of the stratum-specific observed outcome probabilities among people who received exposure a, Pr(Y = 1 | A = a, L = l),
weighted by the probability of being in each stratum, Pr(L = l)
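For a toy example this sum can be computed directly. All numbers below are hypothetical, chosen only to make the arithmetic easy to follow:

```python
# Worked toy example of the single-time-point g-formula (standardization).
# All probabilities are hypothetical illustration values.

# Stratum-specific risks Pr(Y=1 | A=a, L=l) estimated from "observed" data
risk = {
    (1, 0): 0.10,  # treated,   L=0
    (1, 1): 0.30,  # treated,   L=1
    (0, 0): 0.20,  # untreated, L=0
    (0, 1): 0.50,  # untreated, L=1
}
pr_L = {0: 0.6, 1: 0.4}  # marginal confounder distribution Pr(L=l)

def g_formula(a):
    """Standardized risk: sum over l of Pr(Y=1|A=a,L=l) * Pr(L=l)."""
    return sum(risk[(a, l)] * pr_L[l] for l in pr_L)

# Counterfactual risks and causal risk difference
print(g_formula(1))                 # 0.10*0.6 + 0.30*0.4 = 0.18
print(g_formula(0))                 # 0.20*0.6 + 0.50*0.4 = 0.32
print(g_formula(1) - g_formula(0))  # risk difference of -0.14
```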
41. The general formula for the parametric g-formula
For a single time point:
Pr(Y^a = 1) = Σ_l Pr(Y = 1 | A = a, L = l) · Pr(L = l)
For multiple time points, the sum runs over covariate histories and the covariate distribution factors sequentially:
Pr(Y^ā = 1) = Σ_l̄ Pr(Y = 1 | Ā = ā, L̄ = l̄) · Π_k Pr(L_k = l_k | Ā_{k−1} = ā_{k−1}, L̄_{k−1} = l̄_{k−1})
where Ā and L̄ denote the treatment and covariate histories
42. Ways of solving the g-formula
For a (very) small number of covariates & time points, we can calculate by hand
In practice, we almost always have too many variables and/or time points.
Instead, we can use Monte Carlo simulation
Packages exist in SAS and R (https://github.com/CausalInference)
We can also use an iterative approach
Wen et al. 2020. Biometrics.
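A minimal sketch of the Monte Carlo approach, using the same hypothetical single-time-point setup as the standardization formula above (here the outcome model is simply assumed known; in practice it would be estimated from data):

```python
import random

# Monte Carlo evaluation of the single-time-point g-formula:
# draw the confounder from its distribution, set A = a, draw the outcome.
# All probabilities are hypothetical illustration values.

pr_L1 = 0.4  # Pr(L = 1)
pr_Y = {(1, 0): 0.10, (1, 1): 0.30,   # Pr(Y=1 | A, L) when treated
        (0, 0): 0.20, (0, 1): 0.50}   # ... and when untreated

def mc_g_formula(a, n=200_000, seed=42):
    rng = random.Random(seed)
    events = 0
    for _ in range(n):
        l = 1 if rng.random() < pr_L1 else 0   # simulate the confounder
        if rng.random() < pr_Y[(a, l)]:        # intervene: force A = a
            events += 1
    return events / n

# Converges to the closed-form standardized risks (0.18 and 0.32 here)
print(mc_g_formula(1), mc_g_formula(0))
```

With many time-varying covariates, this simulate-then-intervene loop is exactly what makes the approach tractable when the sum over all covariate histories is not.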
43. But often we don’t have observational data either
Not enough data available
Existing treatments in new populations
Novel indications for existing treatments
Hypothetical exposures or treatments
What else can we do?
44. Emulate with simulation modeling
Simulation-based approaches give us faster decisions and require less data
A tool for combining our subject matter knowledge, available data, and best guesses into a quantitative prediction
45. What types of simulation modeling are available?
Group-level:
Compartmental models
Markov cohorts
SIR models
Unit-level:
Individual-level simulation models
Agent-based models
Microsimulations
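As a flavor of the group-level approach, here is a minimal deterministic SIR compartmental model. The transmission and recovery parameters are hypothetical, chosen only for illustration:

```python
# Minimal deterministic SIR compartmental model (a group-level simulation).
# beta (transmission) and gamma (recovery) are hypothetical values.

def sir(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, days=200):
    """Discrete-time SIR; returns (S, I, R) population fractions."""
    s, i, r = s0, i0, 0.0
    for _ in range(days):
        new_inf = beta * s * i  # new infections this step
        new_rec = gamma * i     # new recoveries this step
        s -= new_inf
        i += new_inf - new_rec
        r += new_rec
    return s, i, r

s, i, r = sir()
print(f"after 200 days: S={s:.3f} I={i:.3f} R={r:.3f}")
```

Group-level models like this track only compartment totals; individual-level models instead track each person (and who contacts whom), at the cost of many more assumptions.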
48. Group-level models make assumptions about effect modification
What types of people are sufficiently the same from the perspective of your causal question?
What information is necessary to track, and which information can be discarded?
Which strata of effect modifiers are important?
49. Individual-level models make assumptions about “risk factors”
What types of people are in your population?
What determines who comes in contact with whom?
What determines who gets sick?
What determines how long until someone recovers or dies?
50. Individual-level simulation models can be built like layer cakes
Individual life history layer
Environmental exposures layer
Contact pattern layer
51. The more layers, the more assumptions
Each layer has its own assumptions
The choice of layers is an assumption about what is important
The flow of information between layers has assumptions
Together, these are the “structure” assumptions
52. But that’s not all!
If we want our model to tell us about decisions we can take in the real world, then we need our model to replicate the real world.
Let’s consider the simplest model: individuals in a single layer who
Cannot interact
Have no agency
Have no environment or neighborhood
53. Assumption–data trade-off
Goal: causal inference in one population → more data required, fewer assumptions (observational emulation)
Goal: causal inference in many populations → less data required, more assumptions (agent-based models)
Murray, 2016. Agent-based models for causal inference. Harvard University.
54. All causal inference requires assumptions
Simulation models can be thought of as a way to emulate what we would have learned if we had conducted an observational or randomized study
So, when making causal inference, we also need to make all the assumptions we would have made in those studies!
55. What assumptions do we need?
No unmeasured confounding: all common causes of the treatment and outcome are known and measured in the data
No open colliders: all common effects of the treatment and outcome are known and not conditioned on in the data or analysis
56. What assumptions do we need?
Positivity: there is a non-zero probability of all levels of treatment for all types of individuals in our population
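One simple empirical check for positivity is to cross-tabulate treatment against covariate strata and flag strata where some treatment level never occurs. A hypothetical sketch (made-up strata and records):

```python
from collections import defaultdict

# Flag covariate strata in which some treatment level is never observed:
# an empirical hint of a positivity violation. Data are hypothetical.

records = [  # (covariate stratum, treatment received)
    ("young", 1), ("young", 0), ("young", 1),
    ("old", 0), ("old", 0), ("old", 0),  # no treated "old" people
]

def positivity_violations(rows, levels=(0, 1)):
    """Return the set of strata missing at least one treatment level."""
    seen = defaultdict(set)
    for stratum, a in rows:
        seen[stratum].add(a)
    return {s for s, treats in seen.items() if set(levels) - treats}

print(positivity_violations(records))  # {'old'}
```

Note this only detects empty cells among strata present in the data; structural violations (people who could never receive a treatment level) require subject-matter knowledge, not just a crosstab.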
57. What assumptions do we need?
Consistency: our treatment levels are clearly specified, aka:
Well-defined interventions
Well-defined causal questions
58. Why is target trial emulation using simulation models harder than observational studies or trials?
We need these 3 assumptions for all trial emulation methods,
but for simulation models we need these assumptions to hold for every pair of variables in our model!
This starts to get extra tricky!
Murray, 2016. Agent-based models for causal inference. Harvard University.
Murray et al, Am J Epidemiol 2017; 186(2)
59. Special challenge: well-defined mediators
To answer our question with observational data, we need a well-defined question about treatment
To answer our question with a simulation model, we also need well-defined questions about CD4 cell count.
Murray et al, Med Dec Making 2020; 40(1)
60. Parameterizing mediators
We need a value for: the effect of antiretroviral therapy initiation time on mortality when CD4 count is held fixed at some value
How do we “fix” it?
What value do we choose? Does it matter?
Murray et al, Med Dec Making 2020; 40(1)
61. Parameters must be externally valid
If we use more than one data source for our model parameters, we need every parameter to be externally valid.
This requires more assumptions!
62. External validity requires:
No unmeasured outcome causes: all causes of the outcome are known and either modeled or identically distributed between populations
(for every variable in the model that looks like an ‘outcome’)
63. So, does it work?
Can we change our assumptions into knowledge & make causal inference using simulation-based approaches?
Yes, but only if all our assumptions are correct!
64. But:
Even in the simplest case, it is hard to get the assumptions right!
Even if we get them right, we never know for sure that they are right!
When we add in disease transmission, things get even harder!
65. What about transmission?
Reminder: we started with models where individuals
Cannot interact
Have no agency
Have no environment or neighborhood
The target trial framework can help us understand how we could relax our assumptions and still estimate causal effects that we can define and understand
66. Agent-based model results can be hard to interpret
What do we really want to know?
[Figure: control vs. intervention networks, with connections indicating shared HIV risk; index individuals are the shaded nodes, and their nearest neighbors are the outlined nodes]
67. How does the target trial help?
Problem: causal inference typically requires the assumption of no interference, but many topics violate this assumption:
infectious diseases
behavioral interventions
educational interventions
Solution: design our simulations to emulate cluster-randomized, two-stage, ring-vaccine, or other interference-friendly trial designs
68. Interference makes decision-making hard…
…but we have frameworks for understanding effects under interference in randomized trials
Adapted from: Halloran and Struchiner; Buchanan, et al. AJE 2021
69. These frameworks can also help us understand results of agent-based models and network analyses
Buchanan, et al. AJE 2021
Murray, et al. AJE 2021
70. Summary 1: Asking good questions is hard
But the clearer we are about what we are asking, the easier it is to make use of the answer
71. Summary 2: Answering our questions well is hard too
Study designs sit on a spectrum from unbiased to intractably biased:
ideal randomized controlled trial → explanatory randomized controlled trials → pragmatic randomized trials → observational studies → simulation studies → relying on “gut feeling”
Moving along the spectrum introduces biases: loss to follow-up, non-adherence, baseline confounding, lack of generalizability, and ill-defined uncertainty
72. How do we estimate causal effects?
Miguel Hernán’s two-step causal algorithm:
1. Ask good questions
2. Answer them with appropriate methods
The solution is well-defined interventions – i.e. well-defined causal questions
Note with these definitions, we don’t need to distinguish between per-protocol versus as-treated
“we may be interested in”
Both use Monte-Carlo simulation to estimate counterfactual outcome distributions
Microsimulations require knowledge about mechanisms
G-formula requires data about individuals
No unmeasured confounding for any pair of variables
Positivity for every variable in our model, or rules that dictate who can & can’t have it
Consistency (well-defined intervention) for every variable in our model
Indirect effect == disseminated effect
Total effect == composite effect