1. Optimal Personalized Treatment Learning Models
with Insurance Applications
PhD Student: Leo Guelman
Advisor: Dr. Montserrat Guillén
PhD in Economics
University of Barcelona
March 2, 2015
1 / 22
2. Outline
1 Motivation
2 Problem formulation
3 Objectives
4 Challenges
5 Contributions
6 Limitations and future work
3. Motivation
Predictive learning has been established as a core strategic
capability in many scientific disciplines and industries.
Common goal: Predict the value of a response variable using
a collection of covariates or “predictors” under static
conditions.
[Figure: Predictive modeling maps covariates X to a response Y; causal modeling maps covariates X to potential responses Y(1) and Y(2) under alternative actions A = 0 and A = 1.]
Causal learning goes one step further: the interest is in
estimating/predicting the response under changing conditions
– e.g., induced by alternative actions or “treatments”.
4. Motivation
In the context of causal learning, the main interest has been in
identifying the Average Treatment Effect (ATE):
ATE = E[Y |A = 1] − E[Y |A = 0].
In many important settings, subjects can show significant
heterogeneity in response to actions/treatments – i.e., what
works for one subject may not work for another. Here the ATE is
less relevant.
The objective becomes to select the optimal action or
“treatment” for each subject based on individual characteristics.
[Figure: Personalized medicine as the canonical example — "providing meaningful improved health outcomes for patients by delivering the right drug at the right dose at the right time" by learning individualized treatment rules from patient characteristics (symptoms, demographics, disease history, biomarkers, imaging, bioinformatics, pharmacogenomics), in single- or multi-decision setups.]
Optimal is defined as the treatment that maximizes the probability
of a desirable outcome.
We call the task of learning the optimal personalized treatment
personalized treatment learning (PTL).
5. Problem Formulation
Assume a randomized experiment – i.e., subjects are randomly
assigned to two treatments, denoted by A ∈ {0, 1}.
Let Y(a) ∈ {0, 1} denote the binary potential response of a
subject if assigned to treatment A = a, a ∈ {0, 1}.
So the observed response is Y = AY (1) + (1 − A)Y (0).
Subjects are characterized by a p-dimensional vector of
baseline predictors X = (X1, . . . , Xp).
Data consist of L i.i.d. realizations of
(Y, A, X): {(Yℓ, Aℓ, Xℓ), ℓ = 1, . . . , L}.
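This data-generating setup can be simulated directly; a minimal Python sketch (the response model below is hypothetical, chosen only to illustrate the potential-outcome notation):

```python
import numpy as np

rng = np.random.default_rng(0)
L, p = 10_000, 3

X = rng.normal(size=(L, p))                 # baseline predictors X1..Xp
A = rng.integers(0, 2, size=L)              # randomized treatment assignment
# Hypothetical outcome model: the treatment helps only when X1 > 0
p0 = 1 / (1 + np.exp(-0.5 * X[:, 0]))
p1 = 1 / (1 + np.exp(-(0.5 * X[:, 0] + 0.8 * (X[:, 0] > 0))))
Y0 = rng.binomial(1, p0)                    # potential response under A = 0
Y1 = rng.binomial(1, p1)                    # potential response under A = 1
Y = A * Y1 + (1 - A) * Y0                   # observed response
```

Note that only one of (Y0, Y1) is ever visible for each subject through Y, which is exactly the fundamental identification problem discussed next.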
6. Problem Formulation
At the most granular level, the personalized treatment
effect (PTE) is a comparison between Y (1) and Y (0) on the
same subject. Ideally, we would like to know
Yℓ(1) − Yℓ(0), ∀ ℓ ∈ {1, . . . , L}.
But this quantity is unobserved...
In practice, the best we can do is to estimate the PTE by
conditioning on subjects with profile X = x.
Thus, we define the PTE by
τ(x) = E[Y (1) − Y (0)|X = x]
= E[Y |X = x, A = 1] − E[Y |X = x, A = 0].
7. Problem Formulation
A personalized treatment rule H is a map from the space of
baseline covariates X to the space of treatments A,
H(X) : R^p → {0, 1}.
An optimal treatment rule is one that maximizes the expected
outcome, E[Y (H(X))], if the personalized treatment rule is
implemented for the whole population.
Since Y is binary, this expectation has a probabilistic interpretation:
E[Y(H(X))] = P(Y(H(X)) = 1), and thus τ(x) ∈ [−1, 1].
Assuming all treatments cost the same, the optimal personalized
treatment rule H∗ = argmax_H E[Y(H(X))] for a subject with
covariates X = x is given by
H∗(x) = 1 if τ(x) > 0, and 0 otherwise.
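The threshold form of H∗ follows from a one-line pointwise argument:

```latex
\begin{aligned}
E\bigl[Y(H(X))\bigr]
  &= E_X\bigl[H(X)\,E[Y(1)\mid X] + (1-H(X))\,E[Y(0)\mid X]\bigr]\\
  &= E_X\bigl[E[Y(0)\mid X] + H(X)\,\tau(X)\bigr],
\end{aligned}
```

so the first term does not depend on H, and the second is maximized pointwise by taking H(x) = 1 exactly when τ(x) > 0.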
8. The Simplest Approach to PTE Estimation
1 Estimate E[Y |X, A = 1] using the treated subjects only.
2 Estimate E[Y |X, A = 0] using the control subjects only.
3 An estimate of the PTE for a subject with predictors X = x is
τ̂(x) = Ê[Y | X = x, A = 1] − Ê[Y | X = x, A = 0].
Pros:
Any conventional statistical or algorithmic binary classification
method may serve to fit the models.
Cons:
Method aims to predict the wrong target: it emphasizes the
prediction accuracy on the response under each treatment, not
the accuracy in estimating the change in the response
caused by the treatment.
Relevant predictors for Y under each treatment are usually
different from relevant PTE predictors.
As a result, it tends to perform poorly in practice.
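The three steps above can be sketched with any off-the-shelf classifier; a minimal illustration in Python using scikit-learn logistic regression on simulated data (both the classifier and the data-generating model are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=(n, 4))
A = rng.integers(0, 2, size=n)                       # randomized treatment
# Hypothetical response model with a treatment-covariate interaction on X2
logit = -1 + 0.5 * X[:, 0] + A * (0.3 + 0.6 * X[:, 1])
Y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Steps 1-2: fit one model per treatment arm
m1 = LogisticRegression().fit(X[A == 1], Y[A == 1])  # treated subjects only
m0 = LogisticRegression().fit(X[A == 0], Y[A == 0])  # control subjects only

# Step 3: PTE estimate and the implied treatment rule
tau_hat = m1.predict_proba(X)[:, 1] - m0.predict_proba(X)[:, 1]
H_star = (tau_hat > 0).astype(int)
```

Because each arm's model is fit to predict Y, the main-effect variable (X1 here) dominates both fits even though only the interaction with X2 matters for τ(x) — which is precisely the weakness listed under Cons.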
9. Objectives
Formalize personalized treatment learning (PTL) as a new
branch of statistical learning.
Create the first comprehensive systematic review of the
existing PTL methods (Tian and Tibshirani, 2014; Radcliffe and
Surry, 2011; Jaśkowski and Jaroszewicz, 2012; Su et al., 2009; Qian and
Murphy, 2011; Zhao et al., 2012; Rubin and Waterman, 2006;
Larsen, 2009; Imai and Ratkovic, 2013, and others).
Build improved statistical/algorithmic methods for
estimating, selecting and assessing PTL models.
Introduce PTL models to insurance applications.
Build open source software implementing our proposed
methods for fitting PTL models, as well as the existing ones,
and make it freely available for academia/industry.
10. Key Challenges
The fundamental problem of PTL models: The quantity we are
trying to predict (i.e., the optimal personalized treatment) is
unknown on a given training data set.
Size of main effects relative to treatment heterogeneity
effects: The magnitude of the variability in the response due to the
treatment heterogeneity effects is usually much smaller than the
variability due to the main effects.
Overfitting: The risk of overfitting increases markedly in PTL
models compared to conventional predictive learning problems.
Model selection and assessment: Methods for variable selection
and model selection/assessment used in conventional predictive
learning problems need to be redefined in the context of PTL
models.
11. Contributions
Introduced and formalized the concept of personalized treatment
learning (PTL) within a causal inference framework, and described
its relevance to a wide variety of fields ranging from economics to
medicine (Ch. 1-2).
Provided the first comprehensive description of the existing PTL
methods (Ch. 3) and proposed two novel methods – namely, uplift
random forests (Ch. 4) and causal conditional inference trees
(Ch. 5). Our proposal outperforms the existing methods in an
extensive numerical simulation study (Ch. 6).
PTL models require not only developing new estimation methods,
but also new methods for assessing model performance. We
formalized the concept of the Qini curve and the Qini coefficient,
and discussed general useful methods for model assessment and
selection for PTL models (Ch. 7).
12. Contributions
We described the relevance of PTL models to insurance
marketing, and illustrated two applications to optimize client
retention and cross-selling using experimental data from a large
Canadian insurer (Ch. 8).
We presented a novel approach to measuring price-elasticity and
economic price optimization in non-life insurance based on PTL
modeling principles in the context of observational data (Ch. 9).
Selecting the optimal personalized treatment in insurance also
requires consideration of the expected losses under treatment
alternatives. We described an unprecedented application of
gradient boosting models to estimate loss cost in non-life
insurance, with key advantages over the conventional generalized
linear model approach (Ch. 10).
We implemented most of the statistical methods and algorithms
described in this thesis in a package named uplift (Guelman, 2014),
which is now freely available from the CRAN (Comprehensive R
Archive Network) repository under the R statistical computing
environment (Ch. 11).
13. PTL with Experimental Data
An Application to Insurance Cross-Sell Optimization
Cross-sell rates by group

                              Treatment   Control
Purchased home policy = N        30,184     3,322
Purchased home policy = Y           789        75
Cross-sell rate                   2.55%     2.21%
The average treatment effect (ATE) is 0.34% (2.55% − 2.21%),
which is NOT statistically significant (P-value = 0.23).
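The ATE and its significance can be verified directly from the table counts; a quick Python check using a two-proportion z-test (the test form is an assumption — the slide does not state which test produced P = 0.23):

```python
from math import sqrt, erf

# Cell counts from the cross-sell table
n1, y1 = 30_184 + 789, 789          # treatment: group size, purchasers
n0, y0 = 3_322 + 75, 75             # control: group size, purchasers

p1, p0 = y1 / n1, y0 / n0           # 2.55% and 2.21% cross-sell rates
ate = p1 - p0                       # average treatment effect

pooled = (y1 + y0) / (n1 + n0)      # pooled purchase rate under H0
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
z = ate / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided

print(f"ATE = {ate:.2%}, P-value = {p_value:.2f}")     # ATE = 0.34%, P-value = 0.23
```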
[Figure: Cross-sell rate (%) by PTL model decile, treatment vs. control; the model separates profitable targets (top deciles) from unprofitable ones (bottom deciles).]
[Figure: Prototype causal conditional inference tree. The root PTE of 0.34% splits on prodCnt, adjBranch, age and xdate into leaves with estimated PTEs ranging from −1.6% to 2.52%.]
The tree-based procedure identifies
a subgroup of clients with a significant
positive impact from the cross-sell
activity (PTE = 2.52%).
14. Causal Conditional Inference Tree - Pseudocode
Algorithm 1 Causal conditional inference tree
1: for each terminal node do
2:   Test the global null hypothesis H0 of no interaction effect between
     the treatment A and any of the p predictors at a level of significance
     α based on a permutation test (Strasser and Weber, 1999)
3:   if the null hypothesis H0 cannot be rejected then
4:     Stop
5:   else
6:     Select the predictor Xj∗ with the strongest interaction effect
       (i.e., the one with the smallest adjusted P-value)
7:     Choose a partition Ω∗ of the covariate Xj∗ into two disjoint sets
       M ⊂ Xj∗ and Xj∗ \ M based on the G²(Ω) split criterion
8:   end if
9: end for
G²(Ω) = (L − 4) { (Ȳ_nL(1) − Ȳ_nL(0)) − (Ȳ_nR(1) − Ȳ_nR(0)) }²
        / ( σ̂² { 1/L_nL(1) + 1/L_nL(0) + 1/L_nR(1) + 1/L_nR(0) } ),

where the first bracketed difference is the treatment effect in the left node and the second that in the right node.
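The split criterion can be computed directly from the responses in the four treatment-by-node cells; a small Python sketch (taking σ̂² as the pooled sample variance of Y over the four cells, which is an assumption of this illustration):

```python
import numpy as np

def g2_statistic(yL1, yL0, yR1, yR0):
    """G^2(Omega) for a candidate binary split.

    yL1/yL0: binary responses of treated/control subjects in the left
    node; yR1/yR0: the same for the right node."""
    cells = [np.asarray(c, dtype=float) for c in (yL1, yL0, yR1, yR0)]
    L = sum(len(c) for c in cells)                 # total sample size
    # Contrast between the left- and right-node treatment effects
    diff = (cells[0].mean() - cells[1].mean()) - (cells[2].mean() - cells[3].mean())
    sigma2 = np.concatenate(cells).var(ddof=1)     # pooled variance estimate
    inv_n = sum(1.0 / len(c) for c in cells)
    return (L - 4) * diff ** 2 / (sigma2 * inv_n)
```

Splits that induce a large contrast between the two child nodes' treatment effects score high, which is exactly the interaction signal the tree searches for.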
17. PTL with Observational Data
An application to Auto Insurance Economic Price Optimization
Background
Objective: Determine the
policyholder-level premium
(playing the role of the
treatment) that maximizes the
expected profitability of an
existing insurance portfolio,
subject to a fixed overall retention
rate.
Requires estimating the expected
client retention outcome under
alternative insurance rates (price
elasticity), and loss cost.
Clients were historically exposed
to rating actions based on a
non-random assignment
mechanism, which requires
designing an observational study.
Maximize an expected profit function

  max_Z Σ_ℓ Σ_a Z_ℓa · P_ℓ (1 + RC_a)(1 − L̂R_ℓa)(1 − r̂_ℓa)

subject to a retention constraint

  Σ_a Z_ℓa = 1 ∀ ℓ,   Z_ℓa ∈ {0, 1},
  Σ_ℓ Σ_a Z_ℓa · r̂_ℓa / L ≤ α,

where Z_ℓa indicates assigning rate action a to client ℓ, P_ℓ is the current premium, RC_a the rate change under action a, L̂R_ℓa the estimated loss ratio and r̂_ℓa the estimated lapse probability.
[Figure: Efficient frontier of expected profit (%) against retention rate (1 − α); the current state lies below the frontier, with candidate operating points A, B and C on it.]
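One simple way to approximate this assignment problem is Lagrangian relaxation: penalize expected lapse by a multiplier λ, let each client take the action maximizing penalized profit, and bisect on λ until the retention constraint binds. A Python sketch on synthetic inputs (the profit and lapse matrices are random placeholders, not the thesis data, and the bisection is a heuristic rather than the thesis's method):

```python
import numpy as np

def assign_actions(profit, lapse, alpha, iters=60):
    """One action per client maximizing total expected profit subject to a
    mean lapse rate <= alpha, via bisection on the Lagrange multiplier.
    Assumes the min-lapse assignment (lambda -> infinity) is feasible."""
    rows = np.arange(profit.shape[0])
    lo, hi = 0.0, 1e6                 # hi large => near min-lapse assignment
    for _ in range(iters):
        lam = (lo + hi) / 2
        choice = np.argmax(profit - lam * lapse, axis=1)
        if lapse[rows, choice].mean() > alpha:
            lo = lam                  # constraint violated: penalize lapse more
        else:
            hi = lam                  # feasible: try a smaller penalty
    return np.argmax(profit - hi * lapse, axis=1)

rng = np.random.default_rng(2)
L, n_actions = 1_000, 5
lapse = rng.uniform(0.01, 0.20, size=(L, n_actions))   # estimated lapse prob. per action
profit = rng.uniform(50, 150, size=(L, n_actions))     # expected profit per action

choice = assign_actions(profit, lapse, alpha=0.08)
```

Sweeping α traces out exactly the kind of profit-vs-retention efficient frontier shown on this slide.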
18. uplift Package Highlights
First R package implementing PTL models
Exploratory Data Analysis (EDA) tools customized for
PTL models
Check balance of covariates (checkBalance)
Univariate uplift analysis (explore)
Preliminary variable screening (niv)
Estimating personalized treatment effects
Causal conditional inference forests (ccif)
Uplift random forests (upliftRF)
Modified covariate method (tian_transf)
Modified outcome method (rvtu)
Uplift k-nearest neighbor (upliftKNN)
Performance assessment for PTL models
Uplift by decile (performance)
Qini curve and Qini-coefficient (qini)
Other functionality
Profiling PTL models (modelProfile)
Monte-Carlo simulations (sim_pte)
19. Fitting a CCIF using uplift
ccif implements recursive partitioning in a causal conditional inference
framework.
fit <- ccif(formula = Y ~ trt(A) + X1 + X2 + X3,
            data = mydata,
            ntree = 1000,
            split_method = "Int",
            distribution = approximate(B = 999),
            pvalue = 0.05,
            verbose = TRUE)
Table: Some ccif options
ccif argument Description
mtry Number of variables to be tested in each node
ntree Number of trees in the forest
split_method Split criteria: "KL", "ED", "Int" or "L1"
interaction.depth The maximum depth of variable interactions
pvalue Maximum acceptable p-value required to make a split
bonferroni Apply Bonferroni adjustment to pvalue
minsplit Minimum number of obs. for a split to be attempted
... Additional args. passed to coin::independence_test.
20. Manuscripts Linked to Thesis
[1] Guelman, L. (2014). uplift: Uplift modeling. R package version 0.3.5.
[2] Guelman, L. and Guillén, M. (2014). A causal inference approach to measure price
elasticity in automobile insurance. Expert Systems with Applications, 41(2):387–396.
[3] Guelman, L., Guillén, M. and Pérez-Marín, A. M. (2014). A survey of personalized
treatment models for pricing strategies in insurance. Insurance: Mathematics and
Economics, 58:68–76.
[4] Guelman, L., Guillén, M. and Pérez-Marín, A. M. (2014). Uplift random forests.
Cybernetics & Systems. Accepted.
[5] Guelman, L., Guillén, M. and Pérez-Marín, A. M. (2014). A decision support
framework to implement optimal personalized marketing interventions. Decision
Support Systems, 72:24–32.
[6] Guelman, L. (2012). Gradient boosting trees for auto insurance loss cost modeling
and prediction. Expert Systems with Applications, 39(3):3659–3667.
[7] Guelman, L., Guillén, M. and Pérez-Marín, A. M. (2012). Random forests for uplift
modeling: An insurance customer retention case. In Engemann, K. J., Lafuente, A. M.
G., and Merigó, J. M., editors, Modeling and Simulation in Engineering, Economics
and Management, pages 123–133. Springer Berlin Heidelberg, New York, NY.
21. Talks Linked to Thesis
[1] Guillén, M. and Guelman, L. (2014). "New trends in predictive modelling - the uplift models success story".
R in Insurance Conference, London, UK (July 14, 2014).
[2] Guelman, L. and Guillén, M. (2014). "Actionable predictive learning for insurance profit maximization".
Casualty Actuarial Society, Ratemaking and Product Management Seminar, Washington, D.C., USA (April 1,
2014).
[3] Guelman, L. (2013). "An introduction to causal learning with applications to price elasticity modeling in
Casualty insurance". University of Barcelona, UB Economics Seminars, Barcelona, Spain (November 28, 2013).
[4] Guelman, L., Guillén, M. and Pérez-Marín, A.M. (2013). "Evaluating customer loyalty with advanced uplift
models". APRIA, Annual Conference, New York, USA (July 28-31, 2013).
[5] Guillén, M., Guelman, L. and Pérez-Marín, A.M. (2013). "Customer retention and price elasticity. Are
motor insurance policies homogeneous with respect to loyalty?". 2013 ASTIN Colloquium, The Hague, Netherlands
(May 21-24, 2013).
[6] Guelman, L. and Lee, S. (2013). "Balancing robust statistics - gradient boosting". Casualty Actuarial Society,
Ratemaking and Product Management Seminar, Los Angeles, USA (March 12-13, 2013).
[7] Guelman, L., Guillén, M. and Pérez-Marín, A.M. (2012). "Random forests for uplift modeling: an insurance
customer retention case". Association of Modeling and Simulation in Enterprise (AMSE) - International Conference
on Modeling and Simulation, New York, USA (May 30-June 1, 2012). – Outstanding Scholarly Research
Contribution Award, AMSE.
[8] Guelman, L. and Lee, S. (2012). "Balancing robust statistics and data mining in ratemaking: gradient boosting
modeling". Casualty Actuarial Society, Ratemaking and Product Management Seminar, Philadelphia, USA (March
20-21, 2012).
22. Limitations and Future Work
1 Extensions to multi-category and continuous treatment settings.
2 Extensions to continuous uncensored and survival responses.
3 Extensions to dynamic treatment regimes: treatment type may be
repeatedly adjusted according to an ongoing individual response.
4 Absolute vs. relative treatment effects – in some settings, defining
the treatment effect in terms of the ratio of the expected responses
under alternative treatment conditions, instead of the difference
between the expected responses may be more appropriate.