The document discusses challenges and methods for causal inference from observational data. It begins with two use cases: estimating the gas savings from installing heat pumps and the profit uplift from placing beer coolers in stores. Both experiments fail standard randomization assumptions because the test and control groups are statistically different. The document then covers methods for estimating average treatment effects, such as propensity score matching and regression adjustment, and for estimating individual treatment effects, using techniques such as virtual twins, honest forests, and counterfactual regression, which learns balanced representations of the data. The goal is to remove the bias caused by differences between treated and untreated groups so that valid causal effects can be inferred.
5. Experiments you thought were good can still be invalid
Experiments you thought were bad can still be valid
6. Randomized testing: the set-up
• A random subsample of the POPULATION is chosen
• The sample is randomly split into two groups: INTERVENTION and CONTROL (= no change)
• The outcome in both groups is measured, the same way for all participants
• The difference in outcomes between the groups is the AVERAGE TREATMENT EFFECT
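As a minimal sketch of this set-up (with made-up gas-usage numbers and an assumed true saving of 300 m³), the following simulation shows why randomization matters: when assignment is random, a plain difference in group means recovers the average treatment effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: baseline yearly gas usage (m3) per household.
n = 10_000
baseline = rng.normal(1500, 200, size=n)

# Random split: the intervention group receives an assumed true saving of 300 m3.
treated = rng.random(n) < 0.5
usage = baseline - 300 * treated + rng.normal(0, 50, size=n)

# Because assignment is random, the groups are statistically identical,
# so a simple difference in group means estimates the average treatment effect.
ate = usage[~treated].mean() - usage[treated].mean()
```

With randomization no further correction is needed; the rest of the deck deals with what happens when that assumption breaks.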
8. Measurement data: daily gas usage ~ outside temperature
Fig. Daily gas usage (m³) plotted against average outside temperature (°C)
9. The experiment in the randomized test framework
• Sample is based on “friendly users”: Eneco employees, early adopters and energy enthusiasts
• Rental homes are excluded from the study
• Participation is initiated by the customer
• Outcome: average yearly gas savings, measured the same way for all participants
• Placements span many months
• Changes were made to the intervention halfway through the study
11. Fixing group imbalance: match test and control
Available covariates:
• House size (m2)
• Building type (terraced, apartment, detached, semi-detached)
• Construction period (<1946, 1946-1965, …, > 2010)
• Number of inhabitants (1, 2, 3, 4, 5+)
Number of possibilities: 10 x 4 x 6 x 5 = 1200
Our sample population is only 2,500, so exact matches are infeasible → partial matching:
Propensity Score Matching
12. Propensity score matching – concept
1. Calculate the chance of receiving treatment given X (house type, etc.); e.g. test subject A has a propensity of 38%
2. Match the test subject to k control subjects on this probability; controls at 39% and 41% are close matches, controls at 12%, 22% or 83% are not
3. Calculate the effect for test and (matched) control; e.g. the test subject saves 500 m³ and the matched controls save 20 m³ on average, giving an effect of 480 m³
4. Repeat for all participants → average effect over the test group
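A minimal sketch of this procedure in Python. The synthetic data, the single confounder (house size), the logistic propensity model and 1-nearest-neighbour matching are all illustrative choices, not the production setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)

# Synthetic sample: house size (m2) drives both treatment and outcome.
n = 2500
house_size = rng.normal(110, 30, size=n)
X = house_size.reshape(-1, 1)

# Bigger houses are more likely to get a heat pump -> imbalanced groups.
p_treat = 1 / (1 + np.exp(-0.02 * (house_size - 110)))
w = rng.random(n) < p_treat

# Change in gas usage (negative = saving); the true treatment effect is -100 m3.
y = -2.0 * house_size - 100 * w + rng.normal(0, 20, size=n)

# 1) Estimate the propensity score P(w=1 | X).
ps = LogisticRegression().fit(X, w).predict_proba(X)[:, 1]

# 2) Match each treated unit to its nearest control on that score.
nn = NearestNeighbors(n_neighbors=1).fit(ps[~w].reshape(-1, 1))
_, idx = nn.kneighbors(ps[w].reshape(-1, 1))

# 3) Effect per treated unit, then average over the test group (ATT).
att = (y[w] - y[~w][idx[:, 0]]).mean()
naive = y[w].mean() - y[~w].mean()  # biased: mixes in the house-size effect
```

The naive difference in means over-states the saving because large houses are over-represented in the test group; matching on the propensity score removes most of that bias.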
13. Recap heat pump use case
• Experiment fails (almost) all standard assumptions
• Each of the “faults” can be corrected:
• Measured months, need a year → extrapolate with a model
• Bias in test group → match with an equally biased control group using propensity scores
• Outcome: average effect over the test group, not the whole population
• We cannot say anything about rental households without making additional assumptions
15. USE CASE: effect of cooler placement @ HEINEKEN
POPULATION
• 13K off-trade* outlets
• Selling HEINEKEN beer brands
• May receive a cooler
* Small to medium shops, e.g. mom-and-pop shops, groceries and kiosks; not retail
• Pool for the ’experiment’ is all outlets; the sample is the population
• Observational approach: coolers are already placed
• Gold outlets have a higher probability of getting a cooler than others
• Need the effect on individual outlets, to prioritize future placements
• Outcome: yearly profit** uplift, the same for all participants
• Placements over many years; movements not tracked → sales before/after unknown
** Profit is measured as FGP/hl, a company-wide calculation of profit per hl of sales
17. Problem 1: test and control group are statistically different
Distribution of relevant characteristics* is different between test and control
Fig. Histograms showing the distribution of total profit per outlet, broken down by ranking and cooler setup
* A relevant characteristic is one that influences the probability of being selected for treatment
18. Problem 1: test and control group are statistically different
Distribution of relevant characteristics* is different between test and control
* A relevant characteristic is one that influences the probability of being selected for treatment
• Outlet ranking (gold, silver, bronze)
• Outlet sub-channel (kiosk, grocery, convenience, etc.)
• Outlet area type (city, urban, village)
• Area (name of neighborhood)
• Seasonality (is the outlet only open in summer)
• Sales rep visits per month
• Volume of competitor vs. HEINEKEN sales
• Number of assortment deals with HEINEKEN
• Amount of investment by HEINEKEN
• Number of HEINEKEN branding materials
• Census demographics per km² (population, age, gender)
• Google Maps metrics per 500 m² (average venue rating, # venues with photo, # of unique venue types, average venue opening times)
20. The need for effect correction – staging an experiment
Definition: conditional mean
The mean of Y for given values of X, i.e. the average of one variable as a function of some other variables:
E[Y | X] = Xβ
Effect = mean treated − mean untreated:
E[Y | w = 1] − E[Y | w = 0] = 27.70 − 21.66 = 6.04 ??
21. The need for effect correction – staging an experiment
Only gold outlets, effect = mean treated − mean untreated:
ATE_ins = E[Y | X = 1, w = 1] − E[Y | X = 1, w = 0] = 30.07 − 24.90 = 5.17
Only non-gold outlets, effect = mean treated − mean untreated:
ATE_nonins = E[Y | X = 0, w = 1] − E[Y | X = 0, w = 0] = 22.96 − 20.00 = 2.96
22. The need for effect correction – staging an experiment
What would the effect be if all the imbalance in treatment caused by gold ranking were removed?
50% of outlets are gold; if the probability of placement were equal for all of them, the effect would be:
ATE = E[Y | X, w = 1] − E[Y | X, w = 0] = 4.06
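The corrected effect is just a weighted average of the stratum effects, taking the slide's 50/50 gold split as given:

```python
# Conditional effects from the staged example (profit uplift in FGP/hl):
ate_gold = 30.07 - 24.90        # effect among gold outlets (5.17)
ate_nongold = 2.96              # effect among non-gold outlets

# With 50% gold outlets and an equal placement probability for all,
# weight each stratum effect by the share of outlets in that stratum:
ate = 0.5 * ate_gold + 0.5 * ate_nongold   # ~4.06
```

The naive 6.04 mixes this treatment effect with the composition difference between the groups; the stratified average removes it.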
23. The need for effect correction – staging an experiment
Procedure
With X̄ the sample mean of the covariates, fit the regression
Y on 1, w, X, w(X − X̄)
and the coefficient on w will be the average treatment effect.
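A sketch of that regression-adjustment procedure with plain numpy least squares. The data are synthetic, with gold ranking as the single confounder and a true ATE constructed to be 4:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic outlets: X = 1 for gold ranking (50% of outlets).
n = 5000
X = (rng.random(n) < 0.5).astype(float)

# Gold outlets are more likely to receive a cooler -> imbalanced treatment.
w = (rng.random(n) < np.where(X == 1, 0.8, 0.3)).astype(float)

# Profit uplift: +5 for gold, +3 for non-gold -> true ATE = 4.
y = 20 + 3 * X + (5 * X + 3 * (1 - X)) * w + rng.normal(0, 1, size=n)

# Regress Y on [1, w, X, w*(X - Xbar)]; centring the interaction at the
# sample mean Xbar makes the coefficient on w the average treatment effect.
design = np.column_stack([np.ones(n), w, X, w * (X - X.mean())])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
ate = beta[1]
```

Without the centred interaction term, the coefficient on w would estimate the effect at X = 0 (non-gold outlets only) rather than the average over the sample.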
25. Estimating the ATE with regression – assumptions
Conditional mean independence
Mean dependence between treatment assignment w and treatment-specific outcomes Yi can be removed by conditioning on some variables X, provided that they are observable (AKA weak ignorability):
E[Yi | X, w] = E[Yi | X]  for i ∈ {0, 1}
26. Individual treatment effect estimation – assumptions
Many approaches exist, but most of your bias will be due to not observing enough confounders X!
Conditional independence
Any dependence between treatment assignment w and treatment-specific outcomes Yi can be removed by conditioning on some variables X, provided that they are observable (AKA strong ignorability):
(Y0, Y1) ⫫ w | X
27. Estimating ITE with Virtual Twins*
(Illustrative tree: predict Sales, splitting on Rating = Bronze/Silver vs. Gold and on Cooler = 0 vs. Cooler = 1; e.g. €2000 without a cooler vs. €3000 with one.)
Procedure
Fit a tree ensemble with target Y and features X, w, and interactions** between X and w
Predict all units with w = 1, then predict all units with w = 0
Subtract to get
τ_ite,i = m1(Xi) − m0(Xi)
Early stopping and OOB predictions reduce overfitting; a quantile objective can help to trim outliers
* Foster, J. C., Taylor, J. M., & Ruberg, S. J. (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24), 2867–2880.
** Scaling like we did with the linear ATE estimator is generally not needed with tree-based estimators
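A sketch of the virtual-twins procedure, using scikit-learn's RandomForestRegressor as a stand-in for the tree ensemble and synthetic outlet data with a known heterogeneous effect (+5 for gold outlets, +2 for the rest):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Synthetic outlets: gold ranking (0/1) and sales-rep visits per month.
n = 4000
gold = (rng.random(n) < 0.5).astype(float)
visits = rng.integers(0, 5, size=n).astype(float)
X = np.column_stack([gold, visits])
w = (rng.random(n) < np.where(gold == 1, 0.7, 0.3)).astype(float)

# True individual effect: +5 for gold outlets, +2 for the rest.
tau = np.where(gold == 1, 5.0, 2.0)
y = 20 + 2 * visits + tau * w + rng.normal(0, 1, size=n)

def features(w_col):
    # Features X, w and the X*w interactions, as in the procedure above.
    return np.column_stack([X, w_col, X * w_col[:, None]])

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(features(w), y)

# Virtual twins: predict every unit both with and without a cooler,
# then subtract to get the individual treatment effect.
ite = model.predict(features(np.ones(n))) - model.predict(features(np.zeros(n)))
```

Each unit's two predictions are its "virtual twins": the same outlet, once with and once without the treatment.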
28. USE CASE: effect of cooler placement @ HEINEKEN – Overview
Fig. Model predicted profit versus actual profit, by cooler type (all outlets)
29. USE CASE: effect of cooler placement @ HEINEKEN – Coolers to consider
Fig. Model predicted profit versus actual profit, by cooler type (outlets within the 90% confidence interval)
30. USE CASE: effect of cooler placement @ HEINEKEN – Coolers to upgrade
Fig. Model predicted profit versus actual profit, by cooler type (outlets to upgrade / install)
31. USE CASE: effect of cooler placement @ HEINEKEN – Coolers to upgrade
Fig. Model predicted profit versus actual profit, by cooler type (outlets to upgrade / install)
32. USE CASE: effect of cooler placement @ HEINEKEN – Coolers to upgrade
Fig. Model predicted profit versus actual profit, by cooler type (outlets to upgrade / install)
33. • Your perfect experiment is likely ruined by harsh reality
• But you may be able to fix it:
• Propensity score matching
• Average and individual treatment effect estimation
• Make sure you collect enough data:
• When is the treatment done?
• Measure Y before and after the experiment
• What covariates X influence both treatment w and outcome Y?
34. Looking for:
• Senior Data Scientist
• Senior Data Engineer
Contact: ciaran.jetten@heineken.com
35. Estimating ITE with Honest RF*
(Illustrative tree: classify Cooler 1/0 by splitting on Rating = Bronze/Silver vs. Gold; within a leaf, E[Y | w = 1] − E[Y | w = 0] = €3000 − €2000 = €1000.)
Procedure
Fit a tree ensemble with target w and features X, with the constraint of a minimum of k units per class in each DT leaf
Per leaf K in each DT, calculate the mean difference in Y between treatment and control units to get
τ_ite,i = N⁻¹ Σ_{j=1}^{N} [Y_j1 − Y_j0]  for i ∈ K and j ∈ K
* Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353–7360.
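A simplified sketch of the per-leaf estimate, using a single scikit-learn DecisionTreeClassifier instead of an ensemble, and a plain min_samples_leaf as a stand-in for the "minimum k units per class per leaf" constraint:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)

# Synthetic outlets, as before: gold ranking drives cooler placement.
n = 4000
gold = (rng.random(n) < 0.5).astype(float)
visits = rng.integers(0, 5, size=n).astype(float)
X = np.column_stack([gold, visits])
w = (rng.random(n) < np.where(gold == 1, 0.7, 0.3)).astype(int)
tau = np.where(gold == 1, 5.0, 2.0)              # true per-unit effect
y = 20 + 2 * visits + tau * w + rng.normal(0, 1, size=n)

# Classify the treatment w from X; large leaves keep units of both
# classes together on this synthetic data.
tree = DecisionTreeClassifier(min_samples_leaf=200, random_state=0).fit(X, w)
leaf = tree.apply(X)

# Units in the same leaf have (approximately) the same propensity score,
# so the treated-minus-control mean difference per leaf estimates the ITE.
ite = np.empty(n)
for l in np.unique(leaf):
    m = leaf == l
    ite[m] = y[m & (w == 1)].mean() - y[m & (w == 0)].mean()
```

This is why the procedure resembles propensity score matching: the classification tree groups units by their treatment probability, and the effect is read off locally within each group.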
36. Estimating ITE using Counterfactual Regression*
Procedure
Learn a representation Φ of X → split samples according to w → regress Y0 and Y1 on the representation separately
Regularize Φ using an IPM, which is the distance between the distribution of X in w = 1 and the distribution of X in w = 0
This gives a joint objective: minimize predictive error while guaranteeing a balanced representation of X
* Shalit, U., Johansson, F., & Sontag, D. (2016). Estimating individual treatment effect: generalization bounds and algorithms. arXiv preprint arXiv:1606.03976.
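To make the IPM idea concrete, here is a numpy sketch of a linear-kernel MMD (one of the IPMs used in the paper) on a fixed linear "representation" Φ(X) = XW. The data, the weights and the linear Φ are illustrative stand-ins; in CFR, Φ is a learned neural network layer:

```python
import numpy as np

rng = np.random.default_rng(5)

def ipm_linear_mmd(phi, w):
    """Linear-kernel MMD: distance between the mean representation
    of the treated group and that of the control group."""
    return np.linalg.norm(phi[w == 1].mean(axis=0) - phi[w == 0].mean(axis=0))

# Synthetic covariates; only the first feature drives treatment assignment.
n, d = 2000, 4
X = rng.normal(size=(n, d))
w = (rng.random(n) < 1 / (1 + np.exp(-2 * X[:, 0]))).astype(int)

# Arbitrary fixed weights standing in for a learned representation layer.
W = np.array([[1.0, 1.0],
              [0.5, -0.5],
              [-0.3, 0.2],
              [0.1, 0.4]])
penalty = ipm_linear_mmd(X @ W, w)

# A representation that ignores the imbalanced feature is more balanced,
# i.e. it gets a smaller IPM penalty. CFR adds this penalty to the
# factual prediction loss, pushing the network toward representations
# under which treated and control units look alike.
W_balanced = W.copy()
W_balanced[0] = 0.0
penalty_balanced = ipm_linear_mmd(X @ W_balanced, w)
```

Minimizing prediction error alone would keep the confounded feature; the IPM term is what trades a little accuracy for balance.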
Editor's notes
”HeatWinner” hybrid heat pump, installed alongside the boiler
Takes over part of the heating demand from the boiler → save on gas
Goal of the pilot / experiment: calculate the average gas savings
Conditional mean: the average of one variable as a function of some other variables. More formally, the mean of y conditional on x is the mean of y for given values of x; in other words, it is E(y|x).
Conditional-independence assumption: requires that the common variables that affect treatment assignment and treatment-specific outcomes be observable. The dependence between treatment assignment and treatment-specific outcomes can be removed by conditioning on these observable variables.
Conditional independence (strong ignorability): the distribution of the potential outcomes (y0, y1) is the same across levels of the treatment variable T, once we condition on confounding covariates X.
Conditional mean independence (weak ignorability): the mean of the potential outcomes (y0, y1) is the same across levels of the treatment variable T, once we condition on confounding covariates X.
Experiment fails (almost) all standard assumptions
Each of the “faults” can be corrected
Measured months, need a year → extrapolate with a model
Bias in test group → match with an equally biased control using propensity scores
Outcome: average effect over the test group
If you want the effect over the entire population, more corrections are needed
Since all rental houses are dropped from the experiment, we cannot say anything about rental households without making additional assumptions
Decision trees consecutively slice the feature space into leaves with minimal target variance
Tree ensembles (Random Forest, Gradient Boosting) improve generalization to new data
Suitable for making predictions on individual units
By estimating classification trees on the treatment, Honest RF effectively matches units on the propensity score
When a minimum of k units of each class per leaf is enforced, E(Y1 − Y0) can be calculated locally per leaf
Custom neural network architectures can constrain how X is distributed over treatment and control
Experimental results are very strong, especially on the IHDP synthetic dataset