1) The document discusses bias amplification that can occur when using instrumental variable calibration estimators with missing survey data. It presents models where a variable of interest (y) and instrumental variables (z) are related, and response propensity depends on the instrumental variables.
2) When an imperfect proxy for the instrumental variables (x) is used in calibration instead of the true variables, it can lead to bias amplification if the proxy is also related to response propensity. This violates the assumption that the proxy is independent of response given the instrumental variables.
3) A simulation study is presented to illustrate how using an imperfect proxy in calibration can amplify bias compared to the naive estimator that ignores nonresponse. The degree of bias
1. On the problem of bias amplification
of the instrumental calibration estimator
with missing survey data
Éric LESAGE
Laboratoire de statistique d’enquête
CREST-ENSAI
Joint work with David HAZIZA (Université de Montréal and CREST-ENSAI)
January 31, 2013
Séminaire de Statistique ENSAE-ENSAI
CREST, Paris
Éric LESAGE (CREST-ENSAI) CREST(ENSAE-ENSAI) 31 janvier 2013 1 / 37
2. Outline
1 Introduction
2 Underlying models
3 Bias amplification of the instrumental calibration estimators
4 Simulation study
3. Introduction
Context
Nonresponse is a major problem in surveys.
In the presence of nonresponse, the usual complete data estimators may
be biased when respondents and nonrespondents differ with
respect to the survey variables.
A weighting approach that has received a lot of attention recently is
the so-called single-step approach which uses calibration.
See Deville (1998, 2002), Sautory (2003), Särndal and Lundström
(2005), Kott (2006, 2009, 2012), among others.
Issue
We examine the properties of instrument vector calibration estimators,
where the instrumental variables (related to the response propensity)
are available for the responding units only.
More specifically, the problem of bias amplification is illustrated.
4. Introduction
Consider a finite population U of size N.
The objective is to estimate the population total $t_y = \sum_{k \in U} y_k$ of a
variable of interest y (e.g., income).
A sample, s, of size n, is selected from U according to a given
sampling design p(s).
A complete data estimator of $t_y$ is the expansion estimator
$$\hat{t}_\pi = \sum_{k \in s} d_k y_k,$$
where
$d_k = 1/\pi_k$ denotes the design weight attached to unit k
and $\pi_k = P(k \in s)$ denotes its first-order probability of inclusion in the
sample.
In the presence of unit nonresponse, only a subset $s_r$ of s is observed,
which makes it impossible to compute $\hat{t}_\pi$.
5. Introduction
To define a nonresponse adjusted estimator of $t_y$, we assume that
a vector of auxiliary variables x is available for $k \in s_r$;
the vector of population totals $t_x = \sum_{k \in U} x_k$ is known.
In practice, the x-vector is often defined by survey managers, who wish
to ensure consistency between survey weighted estimates and known
population totals for some important variables (e.g., age and sex).
In addition, we assume that
a vector of instrumental variables z, of the same dimension as x, is available for $k \in s_r$.
The z-vector needs only to be available for the respondents.
The instrumental variables are believed to be associated with the
propensity of units to respond to the survey.
Let Rk be a response indicator attached to unit k such that
Rk = 1 if unit k is a respondent
Rk = 0 otherwise.
6. Introduction
Instrumental calibration estimator $\hat{t}_C$
We consider an instrumental calibration estimator (Deville (1998,
2002)) of the form
$$\hat{t}_C = \sum_{k \in s} w_k R_k y_k,$$
where
$$w_k = d_k F(\lambda_r^\top z_k),$$
F(.) is a function which is monotonic and twice differentiable.
$F(\lambda_r^\top z_k)$: weighting adjustment factor, which is essentially an
estimate of the inverse of the response probability for unit k.
The weights $w_k$ are constructed so that the calibration constraints
$$\sum_{k \in s_r} w_k x_k = \sum_{k \in U} x_k$$
are satisfied.
7. Introduction
Remarks:
Linear weighting is a special case for which the weights $w_k$ are given by
$$w_k = d_k (1 + \lambda_r^\top z_k).$$
When x is used in the calibration instead of z, we obtain the usual
calibration estimator, with $w_k = d_k F(\lambda_r^\top x_k)$.
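As an illustration of linear weighting, the sketch below solves the calibration constraints for λ and checks that the resulting weights reproduce the known totals. The function name `linear_calibration_weights`, the toy population and the response rate are assumptions made for this example only; they are not taken from the slides.

```python
import numpy as np

def linear_calibration_weights(d, x, z, t_x):
    """Respondent weights w_k = d_k (1 + lam^T z_k) satisfying sum_k w_k x_k = t_x.

    d: design weights of the respondents, shape (n_r,)
    x, z: calibration and instrument vectors of the respondents, shape (n_r, p)
    t_x: known population totals of x, shape (p,)
    """
    # Constraints: sum_k d_k (1 + lam^T z_k) x_k = t_x
    # <=> (sum_k d_k x_k z_k^T) lam = t_x - sum_k d_k x_k
    A = (x * d[:, None]).T @ z
    lam = np.linalg.solve(A, t_x - d @ x)
    return d * (1.0 + z @ lam), lam

# Toy population (hypothetical values), census case so that d_k = 1.
rng = np.random.default_rng(0)
N = 500
z_pop = rng.uniform(-1.0, 1.0, N)
x_pop = 0.6 * z_pop + rng.normal(0.0, 0.5, N)
X = np.column_stack([np.ones(N), x_pop])   # x_k = (1, x_k)
Z = np.column_stack([np.ones(N), z_pop])   # z_k = (1, z_k)
resp = rng.random(N) < 0.6                 # response indicators R_k

d = np.ones(resp.sum())
w, lam = linear_calibration_weights(d, X[resp], Z[resp], X.sum(axis=0))
calibrated_totals = w @ X[resp]            # should equal (N, t_x)
print(calibrated_totals, X.sum(axis=0))
```

Note that z enters only through the respondents' rows, which matches the requirement that the instruments be observed for responding units only.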
8. Introduction
Error decomposition
The total error of $\hat{t}_C$ can be expressed as
$$\hat{t}_C - t_y = \underbrace{(\hat{t}_\pi - t_y)}_{\text{sampling error}} + \underbrace{(\hat{t}_C - \hat{t}_\pi)}_{\text{nonresponse error}}.$$
Since the sampling error does not depend on nonresponse, we focus on
the nonresponse error in the sequel.
Without loss of generality, we consider the case of a census, s = U, so
that the sampling error, $\hat{t}_\pi - t_y$, is equal to zero.
9. Introduction
First approach: good specification of the y model
Regardless of the choice of F(.), the instrument vector calibration estimator, $\hat{t}_C$,
perfectly estimates $t_y$ if the variable of interest y is perfectly explained
by the x-vector, i.e.,
$$y_k = x_k^\top \beta$$
for some vector β.
Hence, we expect $\hat{t}_C$ to exhibit a small bias if the y-variable and the
x-vector are linearly related and the relationship is strong.
However, in multipurpose surveys, the number of variables of interest
is typically large (possibly a few hundred) and therefore it is unrealistic
to presume that the x-vector is linearly related to all y-variables, in
which case some estimates could suffer from bias.
10. Introduction
Second approach: estimation of the response propensity
For linear weighting, Särndal and Lundström (2005, Chapter 9)
showed that $\hat{t}_C$ is asymptotically unbiased for $t_y$ for every y-variable
provided that the response probability of unit k, $p_k$, is such that
$$p_k^{-1} = 1 + \lambda^\top z_k \quad \text{for all } k \in U, \qquad (1)$$
for a vector of unknown constants λ;
see also Kott and Liao (2012) for a discussion of nonlinear weighting.
However, in practice, it is not clear how to validate the form of the
relationship in (1) since the z-vector is available for the respondents
only.
11. Introduction
The purpose of this presentation is to examine the so-called problem
of bias amplification in the context of instrument vector calibration.
In the context of epidemiological studies, it has been found that
including instrumental variables in the set of conditioning variables
can increase unmeasured confounding bias; see Bhattacharya and
Vogt (2007), Wooldridge (2009), Pearl (2010) and Myers et al.
(2011).
We argue that the same is true in the context of instrument vector
calibration.
Some preliminary studies in this direction can be found in Lesage
(2012) and Osier (2012).
12. Underlying models
Superpopulation model
Let $(y_k, z_k)^\top$ be a realisation of the vector of random variables
$(Y_k, Z_k)^\top$, $k \in U$.
Without loss of generality, we assume that
$E(Z_k) = 0$ and $V(Z_k) = 1$.
Further, we assume that the relationship between Y and Z can be
modeled using
$$Y_k = \beta_0 + \beta_1 Z_k + \varepsilon_k^y$$
such that
$$E(\varepsilon_k^y \mid Z_k) = 0.$$
This model is often called a prediction model or outcome regression
model.
13. Underlying models
Nonresponse model
We also assume the following nonresponse model:
$$R_k = \gamma_0 + \gamma_1 Z_k + \varepsilon_k^R, \qquad E(\varepsilon_k^R \mid Z_k) = 0.$$
We assume that y is not a direct explanatory variable of nonresponse:
$$\operatorname{cov}(Y_k, R_k \mid Z_k) = 0.$$
Remarks
The nonresponse model states that the response indicators Rk are
linearly related to Z. Although this relationship may seem awkward, it
will be useful to study the problem of bias amplification.
A more realistic nonresponse model, namely the logistic model, is
considered in the empirical study.
14. Underlying models
Figure: Graph of the variables y, z and R (edge Z–Y labelled β1, edge Z–R labelled γ1)
15. Bias amplification of the instrumental calibration estimators
Naive estimator
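We consider the naive estimator
$$\hat{t}_{naive} = N \times \frac{\sum_{k \in U} y_k R_k}{\sum_{k \in U} R_k}.$$
We have:
$$\frac{\hat{t}_{naive}}{N} - \frac{t_y}{N} = \frac{\operatorname{cov}(Y_k, R_k)}{\gamma_0} + o_P(1) = \frac{\beta_1 \gamma_1}{\gamma_0} + o_P(1).$$
Example: with $\gamma_0 = 0.5$, $\gamma_1 = \sqrt{3}/10$, $\beta_0 = 10$ and $\beta_1 = 2$, the bias relative to $\beta_0$ is
$$\frac{\gamma_1}{\gamma_0} \times \frac{\beta_1}{\beta_0} = 6.9\%.$$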
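The arithmetic of this example can be checked in a few lines. The variable names below are chosen for this sketch; the parameter values are the ones quoted in the example.

```python
import math

# Parameter values quoted in the example.
gamma0, gamma1 = 0.5, math.sqrt(3) / 10
beta0, beta1 = 10.0, 2.0

# Asymptotic bias of the naive estimator per unit: beta1 * gamma1 / gamma0.
abs_bias = beta1 * gamma1 / gamma0
# Expressed relative to the population mean beta0, in percent.
rel_bias_pct = 100 * abs_bias / beta0

print(f"absolute bias per unit: {abs_bias:.3f}")
print(f"relative bias: {rel_bias_pct:.1f}%")
```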
16. Bias amplification of the instrumental calibration estimators
Instrument vector calibration estimators
We suppose that a proxy variable of z, denoted x, is available.
Definition
A proxy variable of z, in the nonresponse context, is a variable x such that:
1 x is an auxiliary variable whose population total $t_x$ is known;
2 $\operatorname{cor}(X_k, Z_k) \neq 0$;
3 $\operatorname{cov}(X_k, Z_k \mid R_k = 1) \neq 0$.
We assume that the relationship between X and Z can be modeled
using
$$X_k = \alpha_0 + \alpha_1 Z_k + \varepsilon_k^x, \qquad E(\varepsilon_k^x \mid Z_k) = 0, \qquad V(\varepsilon_k^x) = \sigma_x^2 = 1 - \alpha_1^2.$$
Remarks: $V(X_k) = 1$ and $\operatorname{cor}(X_k, Z_k) = \alpha_1$.
17. Bias amplification of the instrumental calibration estimators
Instrument vector calibration estimators
We also assume that x is not a direct explanatory variable of
nonresponse:
$$\operatorname{cov}(X_k, R_k \mid Z_k) = 0.$$
18. Bias amplification of the instrumental calibration estimators
Figure: Graph of the variables y, z, x and R (edges shown at this step: Z–Y labelled β1, Z–R labelled γ1)
19. Bias amplification of the instrumental calibration estimators
Figure: Graph of the variables y, z, x and R (edges Z–Y labelled β1, Z–X labelled α1, Z–R labelled γ1)
20. Bias amplification of the instrumental calibration estimators
Instrument vector calibration estimators
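We consider the instrument vector calibration estimator with linear
weighting
$$\hat{t}_C = \mathbf{t}_x^\top \Big( \sum_{k \in s_r} \mathbf{z}_k \mathbf{x}_k^\top \Big)^{-1} \sum_{k \in s_r} \mathbf{z}_k y_k,$$
where
$\mathbf{x}_k = (1, x_k)^\top$;
$\mathbf{z}_k = (1, z_k)^\top$;
$\mathbf{t}_x = (N, t_x)^\top$.
Since $\operatorname{cov}(X_k, R_k \mid Z_k) = 0$, we have:
$$\frac{\hat{t}_C}{N} - \frac{t_y}{N} = o_P(1).$$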
21. Bias amplification of the instrumental calibration estimators
What if...
It is not trivial to verify whether $\operatorname{cov}(X_k, R_k \mid Z_k) = 0$, since the variable z
is available only for the respondents.
What if $\operatorname{cov}(X_k, R_k \mid Z_k) \neq 0$?
22. Bias amplification of the instrumental calibration estimators
Figure: Graph of the variables y, z, x, u and R (edges shown at this step: Z–Y labelled β1, Z–X labelled α1, Z–R labelled γ1, plus the node U)
23. Bias amplification of the instrumental calibration estimators
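Figure: Graph of the variables y, z, x, u and R (edges Z–Y labelled β1, Z–X labelled α1, U–X labelled α2, Z–R labelled γ1, U–R labelled γ2)
$$\operatorname{cov}(R_k, X_k \mid Z_k) = \alpha_2 \gamma_2$$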
24. Bias amplification of the instrumental calibration estimators
We now assume that there exists an unobserved variable u, independent of
z and y, that is an explanatory variable in the nonresponse model.
Without loss of generality, we assume that
$E(U_k \mid Z_k, Y_k) = 0$ and $V(U_k \mid Z_k, Y_k) = 1$.
The nonresponse model is rewritten as
$$R_k = \gamma_0 + \gamma_1 Z_k + \gamma_2 U_k + \varepsilon_k^R, \qquad E(\varepsilon_k^R \mid Z_k, U_k) = 0.$$
We still assume that
$$\operatorname{cov}(Y_k, R_k \mid Z_k) = 0.$$
25. Bias amplification of the instrumental calibration estimators
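Moreover, we assume that the variable x is linked to the variable u:
$$X_k = \alpha_0 + \alpha_1 Z_k + \alpha_2 U_k + \varepsilon_k^x,$$
$$E(\varepsilon_k^x \mid Z_k, U_k) = 0, \qquad V(\varepsilon_k^x) = \sigma_x^2 = 1 - \alpha_1^2 - \alpha_2^2, \qquad E(\varepsilon_k^R \varepsilon_k^x \mid Z_k, U_k) = 0.$$
Then we have $\operatorname{cov}(R_k, X_k \mid Z_k) = \alpha_2 \gamma_2$.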
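The identity can be checked directly from the two models. The short derivation below is a sketch; it uses $E(U_k \mid Z_k) = 0$ and $V(U_k \mid Z_k) = 1$, which follow from the independence and normalisation assumptions made on u, and the fact that both error terms have zero mean given $(Z_k, U_k)$.

```latex
\begin{align*}
R_k - E(R_k \mid Z_k) &= \gamma_2 U_k + \varepsilon_k^R, \\
X_k - E(X_k \mid Z_k) &= \alpha_2 U_k + \varepsilon_k^x, \\
\operatorname{cov}(R_k, X_k \mid Z_k)
  &= \alpha_2 \gamma_2 \, V(U_k \mid Z_k)
   + \gamma_2 \, E(U_k \varepsilon_k^x \mid Z_k)
   + \alpha_2 \, E(U_k \varepsilon_k^R \mid Z_k)
   + E(\varepsilon_k^R \varepsilon_k^x \mid Z_k) \\
  &= \alpha_2 \gamma_2 \cdot 1 + 0 + 0 + 0 = \alpha_2 \gamma_2 .
\end{align*}
```

The three cross terms vanish by conditioning first on $(Z_k, U_k)$ and then averaging over $U_k$.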
26. Bias amplification of the instrumental calibration estimators
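We have:
$$\frac{\hat{t}_C}{N} - \frac{t_y}{N} = -\frac{\beta_1}{\alpha_1} \, \frac{1}{E(R_k)} \, \frac{E(Z_k X_k \mid R_k = 1)}{\operatorname{cov}(Z_k, X_k \mid R_k = 1)} \, \operatorname{cov}(R_k, X_k \mid Z_k) + o_P(1)$$
$$= -\frac{\beta_1}{\gamma_0} \, \frac{\alpha_2 \gamma_2}{\alpha_1} \, \frac{\alpha_1 + \alpha_0 \frac{\gamma_1}{\gamma_0}}{\alpha_1 - \frac{\gamma_1}{\gamma_0} \left( \alpha_1 \frac{\gamma_1}{\gamma_0} + \alpha_2 \frac{\gamma_2}{\gamma_0} \right)} + o_P(1).$$
If $\alpha_2 \gamma_2 \neq 0$, then the instrument vector calibration estimator is “biased”;
The “bias” is amplified if $\alpha_1$ is small (i.e., a weak proxy).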
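To see how quickly the bias grows as α1 shrinks, the closed-form expression can be evaluated on a grid. This is a sketch under stated assumptions: α0 = 0, the values of β0, β1, γ0 and γ1 from the naive-estimator example, and γ2 = √3/15, which is not given on the slides and is assumed here. The function name is hypothetical.

```python
import math

def iv_relative_bias_pct(alpha1, alpha2, gamma2, alpha0=0.0,
                         beta0=10.0, beta1=2.0,
                         gamma0=0.5, gamma1=math.sqrt(3) / 10):
    """Asymptotic bias of the instrument vector calibration estimator,
    divided by beta0 and expressed in percent."""
    g1, g2 = gamma1 / gamma0, gamma2 / gamma0
    numerator = alpha1 + alpha0 * g1
    denominator = alpha1 - g1 * (alpha1 * g1 + alpha2 * g2)
    bias = -(beta1 / gamma0) * (alpha2 * gamma2 / alpha1) * numerator / denominator
    return 100 * bias / beta0

GAMMA2 = math.sqrt(3) / 15  # assumed value, not stated in the source

for a1 in (0.7, 0.3, 0.1):
    row = [iv_relative_bias_pct(a1, a2, GAMMA2) for a2 in (0.0, 0.1, 0.3, 0.7)]
    print(a1, [round(v, 1) for v in row])
```

The bias is zero whenever α2 = 0 and, for fixed α2 > 0, its magnitude increases sharply as α1 decreases, because the denominator approaches zero.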
27. Bias amplification of the instrumental calibration estimators
Bias amplification for weak proxy with the instrument vector
calibration estimator
α1    α2 = 0   α2 = 0.1   α2 = 0.3   α2 = 0.7
0.7   0        -0.8       -2.3       -5.8
0.3   0        -1.8       -5.8       -15.5
0.1   0        -5.8       -21.7      -101
Figure: Graph of the variables y, z, x, u and R
28. Bias amplification of the instrumental calibration estimators
Usual calibration estimators
We have seen that instrument vector calibration could lead to
estimators with large biases.
Would a simple calibration protect against such a risk of bias amplification?
$$\hat{t}_C = \mathbf{t}_x^\top \Big( \sum_{k \in s_r} \mathbf{x}_k \mathbf{x}_k^\top \Big)^{-1} \sum_{k \in s_r} \mathbf{x}_k y_k. \qquad (2)$$
29. Bias amplification of the instrumental calibration estimators
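The simple calibration estimator is asymptotically biased. Relative to $\beta_0$,
$$\frac{1}{\beta_0} \left( \frac{\hat{t}_C}{N} - \frac{t_y}{N} \right) = \frac{\beta_1}{\gamma_0 \beta_0} \, \frac{\gamma_1 \sigma_x^2 - \alpha_2 (\alpha_1 \gamma_2 - \alpha_2 \gamma_1) + B}{1 - \left( \alpha_1 \frac{\gamma_1}{\gamma_0} + \alpha_2 \frac{\gamma_2}{\gamma_0} \right)^2} + o_P(1),$$
where $B = \alpha_0 \left\{ \alpha_2 \gamma_2 (\alpha_1 \gamma_1 + \alpha_2 \gamma_2) - \gamma_1 \left[ 1 - (\alpha_1^2 + \alpha_2^2) \right] \right\}$ vanishes when $\alpha_0 = 0$,
but it offers protection against bias amplification.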
30. Bias amplification of the instrumental calibration estimators
No bias amplification for weak proxy with the simple
calibration estimator
The usual calibration has a bias similar to the bias of the naive
estimator.
This bias is not amplified with the decrease of the correlation, α1 ,
between x and z.
α1 α2 = 0 α2 = 0.1 α2 = 0.3 α2 = 0.7
0.7 3.8 3.5 2.8 1.5
0.3 6.4 6.3 6.1 5.7
0.1 6.9 6.8 6.8 6.8
Table: Asymptotic relative bias (in %) of the simple calibration for different
values of the parameters α1 and α2
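The asymptotic relative bias of the simple calibration estimator can be evaluated on the same grid. The sketch below assumes α0 = 0 (so that B = 0), the values of β0, β1, γ0 and γ1 from the naive-estimator example, and γ2 = √3/15, which is not stated on the slides and is an assumption of this example. The function name is hypothetical.

```python
import math

def simple_calibration_relative_bias_pct(alpha1, alpha2, gamma2,
                                         beta0=10.0, beta1=2.0,
                                         gamma0=0.5, gamma1=math.sqrt(3) / 10):
    """Asymptotic relative bias (in %) of the simple calibration estimator, alpha0 = 0."""
    sigma_x2 = 1 - alpha1**2 - alpha2**2
    numerator = gamma1 * sigma_x2 - alpha2 * (alpha1 * gamma2 - alpha2 * gamma1)
    m = (alpha1 * gamma1 + alpha2 * gamma2) / gamma0
    return 100 * beta1 / (gamma0 * beta0) * numerator / (1 - m**2)

GAMMA2 = math.sqrt(3) / 15  # assumed value, not stated in the source
naive_pct = 100 * (math.sqrt(3) / 10 / 0.5) * (2.0 / 10.0)  # naive estimator, about 6.9%

for a1 in (0.7, 0.3, 0.1):
    row = [simple_calibration_relative_bias_pct(a1, a2, GAMMA2)
           for a2 in (0.0, 0.1, 0.3, 0.7)]
    print(a1, [round(v, 1) for v in row])
print("naive:", round(naive_pct, 1))
```

Over this grid the values stay between 0 and the naive relative bias: a weak proxy removes little of the nonresponse bias, but does not inflate it.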
31. Simulation study
Simulation study
We generated a population U of size N = 1 000 consisting of
a variable of interest Y,
several proxy variables denoted $X^{(\alpha_1, \alpha_2)}$, where
$\alpha_1 \in \{0.2, 0.3, 0.5, 0.7\}$ and $\alpha_2 \in \{0, 0.1, 0.3, 0.5\}$,
an instrumental variable Z
and an unobserved variable U.
First, the variables Z and U were generated from a uniform
distribution on $(-\sqrt{3}, \sqrt{3})$, which led to a mean equal to zero and a variance
equal to 1.
Then, given the z-values, the y-values were generated according to the
linear regression model
$$Y_k = 10 + 2 z_k + \varepsilon_k^y,$$
where $\varepsilon_k^y$ is normally distributed with mean 0 and variance 1.
The resulting coefficient of determination was equal to 79.2%.
32. Simulation study
Finally, the values of the proxy variables $x^{(\alpha_1, \alpha_2)}$ were generated according to
the linear regression models
$$X_k^{(\alpha_1, \alpha_2)} = \alpha_1 z_k + \alpha_2 u_k + \sigma_{(\alpha_1, \alpha_2)} \, \varepsilon_k^{(\alpha_1, \alpha_2)},$$
where $\sigma_{(\alpha_1, \alpha_2)}^2 = 1 - \alpha_1^2 - \alpha_2^2$
and $\varepsilon_k^{(\alpha_1, \alpha_2)}$ was normally distributed with mean 0 and variance 1.
In order to focus on the nonresponse error, we considered the census
case; i.e., n = N = 1 000.
Each unit was assigned a response probability by
$$\operatorname{logit}(p_k) = 1.5 z_k + u_k.$$
Then, the response indicators $R_k$ for $k \in U$ were generated
independently from a Bernoulli distribution with parameter $p_k$.
This whole process was repeated K = 10 000 times, leading to
K = 10 000 sets of respondents.
33. Simulation study
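For each simulation, we computed the instrument vector calibration estimators,
denoted $\hat{t}_C(\alpha_1, \alpha_2)$, where $\alpha_1 \in \{0.2, 0.3, 0.5, 0.7\}$ and
$\alpha_2 \in \{0, 0.1, 0.3, 0.5\}$:
$$\hat{t}_C(\alpha_1, \alpha_2) = \big( N, \; t_{x^{(\alpha_1, \alpha_2)}} \big) \Big( \sum_{k \in s_r} \mathbf{z}_k \mathbf{x}_k^{(\alpha_1, \alpha_2)\top} \Big)^{-1} \sum_{k \in s_r} \mathbf{z}_k y_k.$$
We computed:
the Monte Carlo percent relative bias, $RB_{MC}\big(\hat{t}_C(\alpha_1, \alpha_2)\big)$;
the Monte Carlo percent coefficient of variation (CV), $CV_{MC}\big(\hat{t}_C(\alpha_1, \alpha_2)\big)$.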
34. Simulation study
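Monte Carlo relative bias
$$RB_{MC}\big(\hat{t}_C(\alpha_1, \alpha_2)\big) = E_{MC}\left( \frac{\hat{t}_C(\alpha_1, \alpha_2) - t_y}{t_y} \right) \times 100.$$
Monte Carlo CV
$$CV_{MC}\big(\hat{t}_C(\alpha_1, \alpha_2)\big) = \frac{\sqrt{V_{MC}\big(\hat{t}_C(\alpha_1, \alpha_2) - t_y\big)}}{E_{MC}(t_y)} \times 100.$$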
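The simulation design and the two Monte Carlo measures can be sketched as follows. This is an illustrative reimplementation, not the authors' code: it assumes the population is generated once and only the response indicators are redrawn, it uses K = 500 replications instead of 10 000 to keep the run short, and the seed and the function names `iv_total` and `monte_carlo` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2013)
N, K = 1000, 500                       # K reduced from 10 000 for speed (assumption)
s3 = np.sqrt(3.0)

# Population: Z and U uniform on (-sqrt(3), sqrt(3)), Y linear in Z.
z = rng.uniform(-s3, s3, N)
u = rng.uniform(-s3, s3, N)
y = 10 + 2 * z + rng.normal(size=N)
p = 1.0 / (1.0 + np.exp(-(1.5 * z + u)))   # logit(p_k) = 1.5 z_k + u_k
t_y = y.sum()

def iv_total(x, resp):
    """Instrument vector calibration estimator with linear weighting (census case)."""
    n_r = int(resp.sum())
    xr = np.column_stack([np.ones(n_r), x[resp]])
    zr = np.column_stack([np.ones(n_r), z[resp]])
    b = np.linalg.solve(zr.T @ xr, zr.T @ y[resp])
    return np.array([N, x.sum()]) @ b

def monte_carlo(alpha1, alpha2):
    """Monte Carlo percent relative bias and percent CV for one proxy X^(alpha1, alpha2)."""
    sigma = np.sqrt(1 - alpha1**2 - alpha2**2)
    x = alpha1 * z + alpha2 * u + sigma * rng.normal(size=N)
    est = np.array([iv_total(x, rng.random(N) < p) for _ in range(K)])
    rb = 100 * np.mean((est - t_y) / t_y)
    cv = 100 * np.sqrt(np.var(est - t_y)) / t_y
    return rb, cv

results = {(a1, a2): monte_carlo(a1, a2)
           for a1 in (0.2, 0.3, 0.5, 0.7)
           for a2 in (0.0, 0.1, 0.3, 0.5)}

for (a1, a2), (rb, cv) in results.items():
    print(f"alpha1={a1:.1f} alpha2={a2:.1f}  RB={rb:8.1f}%  CV={cv:8.1f}%")
```

For small α1 combined with large α2 the matrix being inverted is close to singular in some replications, so individual estimates can be extreme; with a small K the reported RB and CV for those cells are correspondingly unstable.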
36. Simulation study
Conclusion
Instrument vector calibration is a good technique to adjust for
nonresponse under certain conditions, such as
$\operatorname{cov}(X_k, R_k \mid Z_k) = 0$
or at least α1 large.
Otherwise, one can get bias and variance amplification.
Figure: Graph of the variables y, z, x and R, with α1 large (edges Z–Y labelled β1, Z–X labelled α1, Z–R labelled γ1)
37. Simulation study
Thank you for your attention.