Lesage

On the problem of bias amplification
of the instrumental calibration estimator
with missing survey data

Éric LESAGE

Laboratoire de statistique d’enquête
CREST-ENSAI
Joint work with David HAZIZA (Université de Montréal and CREST-ENSAI)

31 January 2013

Séminaire de Statistique ENSAE-ENSAI
CREST, Paris


Outline

1 Introduction

2 Underlying models

3 Bias amplification of the instrumental calibration estimators

4 Simulation study


Introduction

Context
Nonresponse is a major problem in surveys.
In presence of nonresponse, the usual complete data estimators may
be biased when respondents and nonrespondents are different with
respect to the survey variables.
A weighting approach that has received a lot of attention recently is
the so-called single-step approach which uses calibration.
See Deville (1998, 2002), Sautory (2003), Särndal and Lundström
(2005), Kott (2006, 2009, 2012), among others.

Issue
We examine the properties of instrument vector calibration estimators,
where the instrumental variables (related to the response propensity)
are available for the responding units only.
More specifically, the problem of bias amplification is illustrated.


Introduction

Consider a finite population $U$ of size $N$.
The objective is to estimate the population total $t_y = \sum_{k \in U} y_k$ of a
variable of interest $y$ (e.g., income).
A sample, s, of size n, is selected from U according to a given
sampling design p(s).
A complete data estimator of $t_y$ is the expansion estimator

$$\hat{t}_\pi = \sum_{k \in s} d_k y_k,$$

where
$d_k = 1/\pi_k$ denotes the design weight attached to unit $k$
and $\pi_k = P(k \in s)$ denotes its first-order probability of inclusion in the
sample.
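
As a minimal sketch of the expansion estimator defined above, the following Python snippet computes $\hat{t}_\pi$ from sampled values and their inclusion probabilities. The function name and the toy numbers are illustrative assumptions and do not come from the presentation.

```python
def expansion_estimator(y_sample, pi_sample):
    """Expansion (Horvitz-Thompson) estimator: sum of d_k * y_k with d_k = 1 / pi_k."""
    return sum(y / pi for y, pi in zip(y_sample, pi_sample))


# Toy example (hypothetical values): 3 sampled units, each with inclusion probability 0.1,
# so each sampled unit represents 10 population units.
y_sample = [12.0, 9.5, 11.0]
pi_sample = [0.1, 0.1, 0.1]
t_pi_hat = expansion_estimator(y_sample, pi_sample)
```

With nonresponse, some of the `y_sample` values would be missing, which is why this estimator cannot be computed directly.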

In the presence of unit nonresponse, only a subset $s_r$ of $s$ is observed,
which makes it impossible to compute $\hat{t}_\pi$.

Introduction

To define a nonresponse adjusted estimator of $t_y$, we assume that
a vector of auxiliary variables $x$ is available for $k \in s_r$;
the vector of population totals $t_x = \sum_{k \in U} x_k$ is known.
In practice, the x-vector is often defined by survey managers, who wish
to ensure consistency between survey weighted estimates and known
population totals for some important variables (e.g., age and sex).

In addition, we assume that
a vector of instrumental variables $z$ is available for $k \in s_r$;
$z$ has the same dimension as $x$.
The z-vector needs only to be available for the respondents.
The instrumental variables are believed to be associated with the
propensity of units to respond to the survey.

Let Rk be a response indicator attached to unit k such that

Rk = 1 if unit k is a respondent
Rk = 0 otherwise.


Introduction

Instrumental calibration estimator $\hat{t}_C$
We consider an instrumental calibration estimator (Deville (1998,
2002)) of the form
$$\hat{t}_C = \sum_{k \in s} w_k R_k y_k,$$
where
$$w_k = d_k F(\lambda_r^\top z_k).$$

$F(\cdot)$ is a function which is monotonic and twice differentiable.
$F(\lambda_r^\top z_k)$ is a weighting adjustment factor, which is essentially an
estimate of the inverse of the response probability for unit $k$.
The weights $w_k$ are constructed so that the calibration constraints
$$\sum_{k \in s_r} w_k x_k = \sum_{k \in U} x_k$$
are satisfied.

Introduction

Remarks:

Linear weighting is a special case for which the weights $w_k$ are given by

$$w_k = d_k (1 + \lambda_r^\top z_k).$$

When $x$ is used in the calibration instead of $z$, we obtain the usual
calibration estimator, with $w_k = d_k F(\lambda_r^\top x_k)$.
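
To illustrate how the linear calibration weights can be obtained, here is a minimal Python sketch for the case of one auxiliary variable and one instrument, each augmented with an intercept. Substituting the linear weights into the calibration constraints gives a 2x2 linear system in $\lambda_r$. The function names and the respondent data are hypothetical, chosen only for illustration.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ lam = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - b[0] * a[1][0]) / det]


def linear_calibration_weights(d, x, z, t_x):
    """Weights w_k = d_k * (1 + lam' z_k) such that sum_k w_k x_k = t_x over respondents.

    x and z are lists of 2-vectors (1, value); t_x is the 2-vector (N, total of x).
    """
    # Calibration equations: (sum_k d_k x_k z_k') lam = t_x - sum_k d_k x_k
    a = [[sum(dk * xk[i] * zk[j] for dk, xk, zk in zip(d, x, z)) for j in range(2)]
         for i in range(2)]
    b = [t_x[i] - sum(dk * xk[i] for dk, xk in zip(d, x)) for i in range(2)]
    lam = solve_2x2(a, b)
    return [dk * (1.0 + lam[0] * zk[0] + lam[1] * zk[1]) for dk, zk in zip(d, z)]


# Hypothetical data: 4 respondents, known totals N = 10 and t_x = 27.
d = [1.0, 1.0, 1.0, 1.0]
x = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0)]
z = [(1.0, 0.5), (1.0, 1.0), (1.0, 2.5), (1.0, 3.0)]
t_x = (10.0, 27.0)
w = linear_calibration_weights(d, x, z, t_x)
# The weighted respondent totals of x reproduce the known population totals.
calibrated = [sum(wk * xk[i] for wk, xk in zip(w, x)) for i in range(2)]
```

Note that only the totals of `x` need to be known at the population level; the instrument `z` enters through the respondents alone.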


Introduction

Error decomposition

The total error of $\hat{t}_C$ can be expressed as

$$\hat{t}_C - t_y = \underbrace{(\hat{t}_\pi - t_y)}_{\text{sampling error}} + \underbrace{(\hat{t}_C - \hat{t}_\pi)}_{\text{nonresponse error}}.$$

Since the sampling error does not depend on nonresponse, we focus on
the nonresponse error in the sequel.

Without loss of generality, we consider the case of a census, $s = U$, so
that the sampling error, $\hat{t}_\pi - t_y$, is equal to zero.


Introduction

First approach: good specification of the y model

Regardless of the choice of $F(\cdot)$, the instrument vector calibration estimator, $\hat{t}_C$,
perfectly estimates $t_y$ if the variable of interest $y$ is perfectly explained
by the x-vector, i.e.,
$$y_k = x_k^\top \beta$$
for some vector $\beta$.
Hence, we expect $\hat{t}_C$ to exhibit a small bias if the y-variable and the
x-vector are linearly related and the relationship is strong.
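
The exactness claim follows directly from the calibration constraints. As a short worked derivation (written for the census case $s = U$, so that the sum over $s$ with $R_k$ reduces to a sum over $s_r$):

```latex
\hat{t}_C
  = \sum_{k \in s_r} w_k y_k
  = \sum_{k \in s_r} w_k x_k^\top \beta
  = \Big( \sum_{k \in s_r} w_k x_k \Big)^{\top} \beta
  = \Big( \sum_{k \in U} x_k \Big)^{\top} \beta
  = \sum_{k \in U} y_k
  = t_y .
```

The fourth equality is the calibration constraint itself, which is why the result does not depend on $F(\cdot)$ or on the instruments $z$.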

However, in multipurpose surveys, the number of variables of interest
is typically large (possibly a few hundred) and therefore it is unrealistic
to presume that the x-vector is linearly related to all y-variables, in
which case some estimates could suffer from bias.


Introduction

Second approach: estimation of the propensity of response

For linear weighting, Särndal and Lundström (2005, Chapter 9)
showed that $\hat{t}_C$ is asymptotically unbiased for $t_y$ for every y-variable
provided that the response probability of unit $k$, $p_k$, is such that
$$p_k^{-1} = 1 + \lambda^\top z_k \quad \text{for all } k \in U, \qquad (1)$$
for a vector of unknown constants $\lambda$;
see also Kott and Liao (2012) for a discussion of nonlinear weighting.
However, in practice, it is not clear how to validate the form of the
relationship in (1) since the z-vector is available for the respondents
only.


Introduction

The purpose of this presentation is to examine the so-called problem
of bias amplification in the context of instrument vector calibration.

In the context of epidemiological studies, it has been found that
including instrumental variables in the set of conditioning variables
can increase unmeasured confounding bias; see Bhattacharya and
Vogt (2007), Wooldridge (2009), Pearl (2010) and Myers et al.
(2011).

We argue that the same is true in the context of instrument vector
calibration.
Some preliminary studies in this direction can be found in Lesage
(2012) and Osier (2012).


Underlying models

Superpopulation model

Let $(y_k, z_k)^\top$ be a realisation of the vector of random variables
$(Y_k, Z_k)^\top$, $k \in U$.
Without loss of generality, we assume that
$E(Z_k) = 0$
and $V(Z_k) = 1$.

Further, we assume that the relationship between $Y$ and $Z$ can be
modeled using
$$Y_k = \beta_0 + \beta_1 Z_k + \varepsilon_k^y$$
such that
$$E(\varepsilon_k^y \mid Z_k) = 0.$$

This model is often called a prediction model or outcome regression
model.

Underlying models

Nonresponse model

We also assume the following nonresponse model:

$$R_k = \gamma_0 + \gamma_1 Z_k + \varepsilon_k^R, \qquad E(\varepsilon_k^R \mid Z_k) = 0.$$

We assume that $y$ is not a direct explanatory variable of nonresponse:

$$\mathrm{cov}(Y_k, R_k \mid Z_k) = 0.$$

Remarks
The nonresponse model states that the response indicators Rk are
linearly related to Z. Although this relationship may seem awkward, it
will be useful to study the problem of bias amplification.
A more realistic nonresponse model, namely the logistic model, is
considered in the empirical study.


Underlying models

[Figure: Graph of the variables y, z and R. Arrows: Z to Y (coefficient β1) and Z to R (coefficient γ1).]


Bias amplification of the instrumental calibration estimators

Naive estimator

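We consider the naive estimator
$$\hat{t}_{naive} = N \times \frac{\sum_{k \in U} y_k R_k}{\sum_{k \in U} R_k}.$$

We have:
$$\frac{\hat{t}_{naive}}{N} - \frac{t_y}{N} = \frac{\mathrm{cov}(Y_k, R_k)}{\gamma_0} + o_P(1) = \frac{\beta_1 \gamma_1}{\gamma_0} + o_P(1).$$

Example: with $\gamma_0 = 0.5$, $\gamma_1 = \sqrt{3}/10$, $\beta_0 = 10$ and $\beta_1 = 2$, the relative bias is
$$\frac{\gamma_1}{\gamma_0} \times \frac{\beta_1}{\beta_0} = 6.9\%.$$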
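
The arithmetic of the example can be checked with a few lines of Python. This is only a sketch of the formula above; the function name is an illustrative assumption.

```python
import math


def naive_relative_bias(beta0, beta1, gamma0, gamma1):
    """Asymptotic relative bias of the naive estimator: (gamma1 / gamma0) * (beta1 / beta0)."""
    return (gamma1 / gamma0) * (beta1 / beta0)


rb = naive_relative_bias(beta0=10.0, beta1=2.0, gamma0=0.5, gamma1=math.sqrt(3) / 10)
print(f"{100 * rb:.1f}%")  # 6.9%
```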


Instrument vector calibration estimators
We suppose that a proxy variable of $z$, denoted $x$, is available.

Definition
A proxy variable of $z$, in the nonresponse context, is a variable $x$ such that:
1 $x$ is an auxiliary variable whose population total $t_x$ is known;
2 $\mathrm{cor}(X_k, Z_k) \neq 0$;
3 $\mathrm{cov}(X_k, Z_k \mid [R_k = 1]) \neq 0$.

We assume that the relationship between $X$ and $Z$ can be modeled
using
$$X_k = \alpha_0 + \alpha_1 Z_k + \varepsilon_k^x,$$
$$E(\varepsilon_k^x \mid Z_k) = 0,$$
$$V(\varepsilon_k^x) = \sigma_x^2 = 1 - \alpha_1^2.$$
Remarks: $V(X_k) = 1$ and $\mathrm{cor}(X_k, Z_k) = \alpha_1$.



We also assume that $x$ is not a direct explanatory variable of
nonresponse:
$$\mathrm{cov}(X_k, R_k \mid Z_k) = 0.$$

[Figure: Graph of the variables y, z, x and R. Arrows: Z to Y (coefficient β1), Z to X (coefficient α1) and Z to R (coefficient γ1).]




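We consider the instrument vector calibration estimator with linear
weighting
$$\hat{t}_C = \mathbf{t}_x^\top \Big( \sum_{k \in s_r} \mathbf{z}_k \mathbf{x}_k^\top \Big)^{-1} \sum_{k \in s_r} \mathbf{z}_k y_k,$$
where
$\mathbf{x}_k = (1, x_k)^\top$;
$\mathbf{z}_k = (1, z_k)^\top$;
$\mathbf{t}_x = (N, t_x)^\top$.

Since $\mathrm{cov}(X_k, R_k \mid Z_k) = 0$, we have:
$$\frac{\hat{t}_C}{N} - \frac{t_y}{N} = o_P(1).$$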
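
The consistency result can be checked numerically. The following Python sketch implements the closed-form estimator for scalar $x$ and $z$ (census case, $d_k = 1$) and compares it with the naive respondent mean on one simulated population. The population size, the seed, the value $\alpha_1 = 0.5$ and the distributions of the error terms are assumptions made for this illustration; the response model is the linear model with $\gamma_0 = 0.5$ and $\gamma_1 = \sqrt{3}/10$ used in the earlier example.

```python
import math
import random


def iv_calibration_mean(x, z, y, r):
    """Instrument vector calibration estimate of the mean of y (census case, d_k = 1).

    Computes t_x' (sum_r z_k x_k')^{-1} (sum_r z_k y_k) / N with x_k = (1, x_k), z_k = (1, z_k).
    """
    n_pop = len(x)
    x_bar = sum(x) / n_pop  # known population mean of x
    resp = [k for k in range(n_pop) if r[k] == 1]
    # Entries of A = sum_r z_k x_k' and c = sum_r z_k y_k
    a00 = float(len(resp))
    a01 = sum(x[k] for k in resp)
    a10 = sum(z[k] for k in resp)
    a11 = sum(z[k] * x[k] for k in resp)
    c0 = sum(y[k] for k in resp)
    c1 = sum(z[k] * y[k] for k in resp)
    det = a00 * a11 - a01 * a10
    # b = A^{-1} c: instrumental-variable regression coefficients of y on (1, x)
    b0 = (c0 * a11 - a01 * c1) / det
    b1 = (a00 * c1 - a10 * c0) / det
    return b0 + b1 * x_bar


random.seed(1)
N = 50_000
s3 = math.sqrt(3)
alpha1 = 0.5  # assumed proxy strength
z = [random.uniform(-s3, s3) for _ in range(N)]  # mean 0, variance 1
x = [alpha1 * zk + math.sqrt(1 - alpha1**2) * random.gauss(0, 1) for zk in z]
y = [10 + 2 * zk + random.gauss(0, 1) for zk in z]
# Linear response model: P(R_k = 1 | z_k) = 0.5 + (sqrt(3) / 10) z_k, which lies in [0.2, 0.8]
r = [1 if random.random() < 0.5 + (s3 / 10) * zk else 0 for zk in z]

est = iv_calibration_mean(x, z, y, r)
y_bar = sum(y) / N
naive = sum(yk for yk, rk in zip(y, r) if rk) / sum(r)
print(f"true mean {y_bar:.3f}, calibration {est:.3f}, naive {naive:.3f}")
```

In this setting the naive mean should overshoot by roughly the 6.9% derived earlier, while the calibration estimate should stay close to the true mean.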


What if...

It is not trivial to verify whether $\mathrm{cov}(X_k, R_k \mid Z_k) = 0$, since the variable $z$
is available only for the respondents.

What if $\mathrm{cov}(X_k, R_k \mid Z_k) \neq 0$?



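[Figure: Graph of the variables y, z, x, u and R. Arrows: Z to Y (coefficient β1), Z to X (coefficient α1), U to X (coefficient α2), Z to R (coefficient γ1) and U to R (coefficient γ2).]

$$\mathrm{cov}(R_k, X_k \mid Z_k) = \alpha_2 \gamma_2$$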



We now assume that there exists an unobserved variable $u$, independent of
$z$ and $y$, that is an explanatory variable in the nonresponse model.
Without loss of generality, we assume that
$E(U_k \mid Z_k, Y_k) = 0$
and $V(U_k \mid Z_k, Y_k) = 1$.

The nonresponse model is rewritten as

$$R_k = \gamma_0 + \gamma_1 Z_k + \gamma_2 U_k + \varepsilon_k^R, \qquad E(\varepsilon_k^R \mid Z_k, U_k) = 0.$$

We still assume that

$$\mathrm{cov}(Y_k, R_k \mid Z_k) = 0.$$



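Moreover, we assume that the variable $x$ is linked to the variable $u$:

$$X_k = \alpha_0 + \alpha_1 Z_k + \alpha_2 U_k + \varepsilon_k^x,$$
$$E(\varepsilon_k^x \mid Z_k, U_k) = 0,$$
$$V(\varepsilon_k^x) = \sigma_x^2 = 1 - \alpha_1^2 - \alpha_2^2,$$
$$E(\varepsilon_k^R \varepsilon_k^x \mid Z_k, U_k) = 0.$$

Then we have $\mathrm{cov}(R_k, X_k \mid Z_k) = \alpha_2 \gamma_2$.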
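
The conditional covariance can be derived in a few lines. The sketch below uses $E(U_k \mid Z_k) = 0$ and $V(U_k \mid Z_k) = 1$, which follow from the stated conditions given $(Z_k, Y_k)$ by the laws of total expectation and total variance; given $Z_k$, the centred variables are $\gamma_2 U_k + \varepsilon_k^R$ and $\alpha_2 U_k + \varepsilon_k^x$.

```latex
\begin{aligned}
\mathrm{cov}(R_k, X_k \mid Z_k)
  &= E\big[ (\gamma_2 U_k + \varepsilon_k^R)(\alpha_2 U_k + \varepsilon_k^x) \mid Z_k \big] \\
  &= \alpha_2 \gamma_2 \, E(U_k^2 \mid Z_k)
     + \gamma_2 \, E(U_k \varepsilon_k^x \mid Z_k)
     + \alpha_2 \, E(U_k \varepsilon_k^R \mid Z_k)
     + E(\varepsilon_k^R \varepsilon_k^x \mid Z_k) \\
  &= \alpha_2 \gamma_2 \cdot 1 + 0 + 0 + 0
   = \alpha_2 \gamma_2 .
\end{aligned}
```

The two middle terms vanish because the error terms have zero mean given $(Z_k, U_k)$, and the last term vanishes by the assumption on $E(\varepsilon_k^R \varepsilon_k^x \mid Z_k, U_k)$.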



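We have:
$$\frac{\hat{t}_C}{N} - \frac{t_y}{N} = -\frac{\beta_1}{\alpha_1} \, \frac{1}{E(R_k)} \, \mathrm{cov}(R_k, X_k \mid Z_k) \, \frac{E(Z_k X_k \mid [R_k = 1])}{\mathrm{cov}(Z_k, X_k \mid [R_k = 1])} + o_P(1)$$
$$= -\frac{\beta_1 \alpha_2 \gamma_2}{\gamma_0 \alpha_1} \, \frac{\alpha_1 + \alpha_0 \frac{\gamma_1}{\gamma_0}}{\alpha_1 - \frac{\gamma_1}{\gamma_0} \left( \alpha_1 \frac{\gamma_1}{\gamma_0} + \alpha_2 \frac{\gamma_2}{\gamma_0} \right)} + o_P(1).$$

If $\alpha_2 \gamma_2 \neq 0$, then the instrument vector calibration estimator is “biased”;
the “bias” is amplified if $\alpha_1$ is small (i.e., weak proxy).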



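Bias amplification for weak proxy with the instrument vector
calibration estimator

α1     α2 = 0   α2 = 0.1   α2 = 0.3   α2 = 0.7
0.7    0        −0.8       −2.3       −5.8
0.3    0        −1.8       −5.8       −15.5
0.1    0        −5.8       −21.7      −101
Table: Asymptotic relative bias (in %) of the instrument vector calibration
estimator for different values of the parameters α1 and α2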

[Figure: Graph of the variables y, z, x, u and R.]
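
To make the amplification concrete, the asymptotic bias formula can be evaluated numerically. The sketch below divides the bias by $\beta_0$ to obtain a relative bias and uses the parameter values of the naive-estimator example ($\gamma_0 = 0.5$, $\gamma_1 = \sqrt{3}/10$, $\beta_0 = 10$, $\beta_1 = 2$) with $\alpha_0 = 0$. The value of $\gamma_2$ is not stated on the slides: $\gamma_2 = \sqrt{3}/15$ is an assumption, chosen here because it appears to reproduce the tabulated values to the printed precision. The function name is also illustrative.

```python
import math


def iv_calibration_relative_bias(alpha1, alpha2, beta0=10.0, beta1=2.0,
                                 gamma0=0.5, gamma1=math.sqrt(3) / 10,
                                 gamma2=math.sqrt(3) / 15,  # assumed, not given in the source
                                 alpha0=0.0):
    """Asymptotic bias of the instrument vector calibration estimator, as a fraction of beta0."""
    g1, g2 = gamma1 / gamma0, gamma2 / gamma0
    e_zx_resp = alpha1 + alpha0 * g1                          # E(Z X | R = 1)
    cov_zx_resp = alpha1 - g1 * (alpha1 * g1 + alpha2 * g2)   # cov(Z, X | R = 1)
    bias = -(beta1 * alpha2 * gamma2) / (gamma0 * alpha1) * e_zx_resp / cov_zx_resp
    return bias / beta0


for a1 in (0.7, 0.3, 0.1):
    row = [100 * iv_calibration_relative_bias(a1, a2) for a2 in (0.0, 0.1, 0.3, 0.7)]
    print(a1, [round(v, 1) for v in row])
```

The denominator `cov_zx_resp` shrinks with $\alpha_1$, which is the mechanism behind the amplification: a weak proxy divides a fixed numerator by a small covariance.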


Usual calibration estimators

We have seen that instrument vector calibration could lead to
estimators with large biases.

Would a simple calibration protect against such a risk of bias amplification?

$$\hat{t}_C = \mathbf{t}_x^\top \Big( \sum_{k \in s_r} \mathbf{x}_k \mathbf{x}_k^\top \Big)^{-1} \sum_{k \in s_r} \mathbf{x}_k y_k. \qquad (2)$$



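The simple calibration estimator is asymptotically biased, but it offers
protection against bias amplification. Relative to $E(Y_k) = \beta_0$, its asymptotic bias is

$$\frac{1}{\beta_0} \left( \frac{\hat{t}_C}{N} - \frac{t_y}{N} \right) = \frac{\beta_1}{\gamma_0 \beta_0} \, \frac{\gamma_1 \sigma_x^2 - \alpha_2 (\alpha_1 \gamma_2 - \alpha_2 \gamma_1) + B}{1 - \left( \alpha_1 \frac{\gamma_1}{\gamma_0} + \alpha_2 \frac{\gamma_2}{\gamma_0} \right)^2} + o_P(1),$$

where $B = \alpha_0 \left[ \alpha_2 \gamma_2 (\alpha_1 \gamma_1 + \alpha_2 \gamma_2) - \gamma_1 \left( 1 - (\alpha_1^2 + \alpha_2^2) \right) \right]$ is a null
term when $\alpha_0 = 0$.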



No bias amplification for weak proxy with the simple
calibration estimator

The usual calibration has a bias similar to the bias of the naive
estimator.
This bias is not amplified with the decrease of the correlation, α1 ,
between x and z.

α1 α2 = 0 α2 = 0.1 α2 = 0.3 α2 = 0.7
0.7 3.8 3.5 2.8 1.5
0.3 6.4 6.3 6.1 5.7
0.1 6.9 6.8 6.8 6.8
Table: Asymptotic relative bias (in %) of the simple calibration for different
values of the parameters α1 and α2
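
For comparison, the asymptotic relative bias of the simple calibration estimator can be evaluated in the same way, here for $\alpha_0 = 0$ so that the term $B$ vanishes. As before, this is only a sketch: the parameter values come from the naive-estimator example, while $\gamma_2 = \sqrt{3}/15$ is an assumed value that is not stated on the slides, and the function name is illustrative.

```python
import math


def simple_calibration_relative_bias(alpha1, alpha2, beta0=10.0, beta1=2.0,
                                     gamma0=0.5, gamma1=math.sqrt(3) / 10,
                                     gamma2=math.sqrt(3) / 15):  # gamma2 is assumed
    """Asymptotic relative bias of the simple calibration estimator when alpha0 = 0."""
    sigma2_x = 1.0 - alpha1**2 - alpha2**2
    num = gamma1 * sigma2_x - alpha2 * (alpha1 * gamma2 - alpha2 * gamma1)
    den = 1.0 - (alpha1 * gamma1 / gamma0 + alpha2 * gamma2 / gamma0) ** 2
    return beta1 / (gamma0 * beta0) * num / den


for a1 in (0.7, 0.3, 0.1):
    row = [100 * simple_calibration_relative_bias(a1, a2) for a2 in (0.0, 0.1, 0.3, 0.7)]
    print(a1, [round(v, 1) for v in row])
```

Unlike the instrument vector case, nothing here is divided by a quantity that goes to zero with $\alpha_1$; as $\alpha_1$ and $\alpha_2$ shrink, the value simply approaches the naive relative bias.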


Simulation study

We generated a population U of size N = 1 000 consisting of
a variable of interest Y ,
several proxy variables denoted X (α1 ,α2 ) where
α1 ∈ {0.2, 0.3, 0.5, 0.7} and α2 ∈ {0, 0.1, 0.3, 0.5},
an instrumental variable Z
and an unobserved variable U.
First, the variables $Z$ and $U$ were generated from a uniform
distribution on $(-\sqrt{3}, \sqrt{3})$, which led to a mean equal to zero and a variance
equal to 1.
Then, given the z-values, the y-values were generated according to the
linear regression model
$$Y_k = 10 + 2 z_k + \varepsilon_k^y,$$
where $\varepsilon_k^y$ is normally distributed with mean 0 and variance 1.
The resulting coefficient of determination was equal to 79.2%.


Simulation study

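Finally, the values of the proxy variables $x^{(\alpha_1,\alpha_2)}$ were generated according to
the linear regression models
$$X_k^{(\alpha_1,\alpha_2)} = \alpha_1 z_k + \alpha_2 u_k + \sigma_{(\alpha_1,\alpha_2)} \varepsilon_k^{(\alpha_1,\alpha_2)},$$
where $\sigma^2_{(\alpha_1,\alpha_2)} = 1 - \alpha_1^2 - \alpha_2^2$
and $\varepsilon_k^{(\alpha_1,\alpha_2)}$ was normally distributed with mean 0 and variance 1.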

In order to focus on the nonresponse error, we considered the census
case; i.e., n = N = 1 000.

Each unit was assigned a response probability by

logit(pk ) = 1.5zk + uk

Then, the response indicators Rk for k ∈ U were generated
independently from a Bernoulli distribution with parameter pk .

This whole process was repeated K = 10 000 times leading to
K = 10 000 sets of respondents.
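
The data-generating process described above can be sketched in Python as follows. This is a simplified illustration, not the authors' code: it generates a single proxy variable per call rather than the full grid of $(\alpha_1, \alpha_2)$ values, and the function name, seed and chosen $(\alpha_1, \alpha_2)$ pair are assumptions.

```python
import math
import random


def generate_replicate(alpha1, alpha2, n_pop=1000, rng=random):
    """One population and one response pattern following the simulation design."""
    s3 = math.sqrt(3.0)
    sigma = math.sqrt(1.0 - alpha1**2 - alpha2**2)
    z = [rng.uniform(-s3, s3) for _ in range(n_pop)]  # mean 0, variance 1
    u = [rng.uniform(-s3, s3) for _ in range(n_pop)]  # unobserved variable
    y = [10.0 + 2.0 * zk + rng.gauss(0.0, 1.0) for zk in z]
    x = [alpha1 * zk + alpha2 * uk + sigma * rng.gauss(0.0, 1.0)
         for zk, uk in zip(z, u)]
    # Logistic response model: logit(p_k) = 1.5 z_k + u_k
    p = [1.0 / (1.0 + math.exp(-(1.5 * zk + uk))) for zk, uk in zip(z, u)]
    r = [1 if rng.random() < pk else 0 for pk in p]
    return z, u, y, x, r


random.seed(2013)
z, u, y, x, r = generate_replicate(alpha1=0.5, alpha2=0.3)
response_rate = sum(r) / len(r)
```

Because $z$ and $u$ are symmetric around zero, the expected response rate under this model is 50%.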


Simulation study

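For each simulation, we computed the instrument vector calibration estimators
denoted $\hat{t}_C(\alpha_1, \alpha_2)$, where $\alpha_1 \in \{0.2, 0.3, 0.5, 0.7\}$ and
$\alpha_2 \in \{0, 0.1, 0.3, 0.5\}$:

$$\hat{t}_C(\alpha_1, \alpha_2) = \begin{pmatrix} N \\ t_{x^{(\alpha_1,\alpha_2)}} \end{pmatrix}^{\top} \Big( \sum_{k \in s_r} \mathbf{z}_k \mathbf{x}_k^{(\alpha_1,\alpha_2)\top} \Big)^{-1} \sum_{k \in s_r} \mathbf{z}_k y_k.$$

We computed:
the Monte Carlo percent relative bias, $RB_{MC}(\hat{t}_C(\alpha_1, \alpha_2))$;
the Monte Carlo percent coefficient of variation (CV),
$CV_{MC}(\hat{t}_C(\alpha_1, \alpha_2))$.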


Simulation study

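Monte Carlo relative bias
$$RB_{MC}(\hat{t}_C(\alpha_1, \alpha_2)) = E_{MC}\left( \frac{\hat{t}_C(\alpha_1, \alpha_2) - t_y}{t_y} \right) \times 100.$$
Monte Carlo CV
$$CV_{MC}(\hat{t}_C(\alpha_1, \alpha_2)) = \frac{\sqrt{V_{MC}(\hat{t}_C(\alpha_1, \alpha_2) - t_y)}}{E_{MC}(t_y)} \times 100.$$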
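
The two Monte Carlo measures can be computed from the replicate-level estimates and true totals as in the sketch below. The function names and the four toy replicates are hypothetical; the square root in the CV follows the usual definition of a coefficient of variation and is an assumption about the formula above.

```python
import math


def mc_relative_bias(estimates, totals):
    """Monte Carlo percent relative bias: mean over replicates of (t_hat - t_y) / t_y, times 100."""
    errors = [(t_hat - t_y) / t_y for t_hat, t_y in zip(estimates, totals)]
    return 100.0 * sum(errors) / len(errors)


def mc_cv(estimates, totals):
    """Monte Carlo percent CV: sqrt of the variance of (t_hat - t_y) over the mean of t_y, times 100."""
    diffs = [t_hat - t_y for t_hat, t_y in zip(estimates, totals)]
    mean_diff = sum(diffs) / len(diffs)
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / len(diffs)
    return 100.0 * math.sqrt(var_diff) / (sum(totals) / len(totals))


# Hypothetical estimates and true totals from 4 replicates.
estimates = [9900.0, 10100.0, 9800.0, 10200.0]
totals = [10000.0, 10000.0, 10000.0, 10000.0]
rb = mc_relative_bias(estimates, totals)
cv = mc_cv(estimates, totals)
```

Both functions take the true total per replicate because the population is regenerated in each repetition, so $t_y$ varies across replicates.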


Simulation study

$RB_{MC}(\hat{t}_C)$ (in %), with $CV_{MC}(\hat{t}_C)$ (in %) in parentheses

α1     α2 = 0    α2 = 0.1   α2 = 0.3   α2 = 0.5
0.7    0.02      −0.9       −2.8       −4.9
       (0.9)     (0.9)      (1.0)      (1.1)
0.5    −0.1      −1.3       −4.1       −7.2
       (1.4)     (1.5)      (1.7)      (2.1)
0.3    −0.2      −2.4       −7.5       −14.0
       (2.6)     (3.0)      (4.1)      (5.9)
0.2    −0.6      −4.5       −13.8      −27.4
       (4.6)     (15.6)     (61.9)     (65.6)


Simulation study

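Conclusion
Instrument vector calibration is a good technique to adjust for
nonresponse under certain conditions, such as
$\mathrm{cov}(X_k, R_k \mid Z_k) = 0$
or at least $\alpha_1$ large.
Otherwise, one can get bias and variance amplification.

[Figure: Graph of the variables y, z, x and R with α1 large. Arrows: Z to Y (coefficient β1), Z to X (coefficient α1) and Z to R (coefficient γ1).]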

Simulation study

Thank you for your attention.

