This document proposes an approximate Bayesian inference method for estimating propensity scores under nonresponse. The method treats the estimating equations as random variables and assigns a prior distribution to a transformation of the parameters. Samples are drawn from the posterior distribution of the parameters given the observed data to make inferences. The method is shown to be asymptotically consistent, and confidence regions can be constructed from the posterior samples. Extensions are discussed that incorporate auxiliary variables and perform Bayesian model selection by assigning a spike-and-slab prior to the model parameters.
1. An approximate Bayesian inference on propensity score estimation under nonresponse
Jae Kwang Kim
Iowa State University
Albert Einstein College of Medicine
4/06/2017
Joint work with Hejian Sang
3. Motivation
Bayesian approach
Population model: f(y; θ)
Prior distribution: θ ∼ π(θ).
Posterior distribution:

P(θ | Y) ∝ L(θ; Y) × π(θ).

That is, the posterior is proportional to the likelihood times the prior.
In some situations, the likelihood function cannot be computed directly.
Kim BPS 3 / 50
4. Motivation
Example 1: Complex Sampling
The finite population is a realization from f(y; θ).
A sample of size n is selected by probability sampling.
Let Ii = 1 if unit i is in the sample and Ii = 0 otherwise.
In probability sampling, the inclusion probabilities πi = P(Ii = 1) are known.
The selection probability πi can be correlated with yi. In this case, the sample distribution is not necessarily the same as the population distribution.
In general,

f(y | I = 1) = P(I = 1 | y) f(y; θ) / ∫ P(I = 1 | y) f(y; θ) dy.

Unless we know P(I = 1 | y), we cannot compute f(y | I = 1) from the sample. Thus, we cannot find the likelihood function.
5. Motivation
Example 2: Estimating equation
We are interested in estimating θ defined through E{U(θ; Y )} = 0.
No distributional assumptions on Y are made.
A consistent estimator of θ can be obtained by solving

Σ_{i=1}^n U(θ; yi) = 0

for θ.
For the over-identified situation (more equations than parameters), we can use the Generalized Method of Moments (GMM):

θ̂_W = arg min_θ [ Σ_{i=1}^n U(θ; yi) ]^T W(θ) [ Σ_{i=1}^n U(θ; yi) ].

Since we do not know a parametric model for Y, we cannot directly apply the Bayesian method.
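As a minimal illustration (not part of the talk), the estimating-equation approach can be sketched in Python for the simplest case U(θ; y) = y − θ, whose solution is just the sample mean:

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical example: U(theta; y) = y - theta, so solving
# sum_i U(theta; y_i) = 0 recovers the sample mean.
rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=500)

def U_sum(theta):
    # Sum of the estimating functions over the sample
    return np.sum(y - theta)

theta_hat = brentq(U_sum, -100.0, 100.0)  # root of the estimating equation
```

For a vector-valued U one would use a multivariate root finder (or a GMM quadratic-form minimizer) instead of `brentq`.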
6. Motivation
Approximate Bayesian approach
Instead of using the likelihood function, let's use the sampling distribution of θ̂, a consistent estimator of θ.
Under some regularity conditions, the sampling distribution of θ̂ is approximately normal.
Posterior distribution:

P(θ | θ̂) ∝ g(θ̂ | θ) × π(θ).

Use of g(θ̂ | θ) instead of the full likelihood simplifies the computation (Approximate Bayesian Computation).
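A small sketch of this idea, assuming θ is a population mean so that θ̂ is the sample mean with estimated sampling variance V̂ (the data and numbers are illustrative, not from the talk):

```python
import numpy as np

# Sketch: with a flat prior, the approximate posterior of theta is
# N(theta_hat, Var_hat), i.e., the sampling distribution of the estimator.
rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=1000)

theta_hat = y.mean()
var_hat = y.var(ddof=1) / len(y)  # estimated sampling variance of theta_hat

posterior = rng.normal(theta_hat, np.sqrt(var_hat), size=5000)
ci = np.quantile(posterior, [0.025, 0.975])  # approximate 95% credible interval
```

With a flat prior this credible interval essentially reproduces the Wald confidence interval, which is the calibration property discussed on the next slide.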
7. Motivation
Motivation
The approximate Bayesian method is calibrated to frequentist inference in the sense that, asymptotically, the credible region obtained from the posterior distribution with a flat prior matches the frequentist confidence interval obtained from the Taylor linearization method.
Other advantages of Bayesian tools (such as informative priors) can be easily applied.
9. Setup
Setup
Suppose that X is always observed and Y is subject to missingness.
Suppose that we are interested in estimating θ which is defined
through E {U (θ; X, Y )} = 0 for some estimating function
U(θ; X, Y ).
Let (xi, yi), i = 1, · · · , n, be independent and identically distributed (IID) realizations of the random variables (X, Y).
10. Setup
Setup (Con’t)
We can define the response indicator δi as δi = 1 if yi is observed and δi = 0 otherwise.
We assume that the δi are independently generated from a Bernoulli distribution with

Pr(δi = 1 | xi, yi) = π(φ; xi)   (1)

for some parameter vector φ.
11. Setup
Setup (Con’t)
We can obtain the propensity score (PS) estimator of θ in the following two steps:
[Step 1] Compute the maximum likelihood (ML) estimator φ̂ of φ.
[Step 2] Compute the PS estimator of θ by solving

(1/n) Σ_{i=1}^n { δi / π(φ̂; xi) } U(θ; xi, yi) = 0   (2)

for θ.
The computation of the ML estimator of φ can be greatly simplified if the response mechanism is Missing At Random (MAR) in the sense that

Pr(δ = 1 | x, y) = Pr(δ = 1 | x)   (3)

holds.
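Step 1 can be sketched for a logistic response model π(φ; x) = expit(φ0 + φ1 x); the model form and the parameter values below are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of Step 1 under MAR with an assumed logistic response model.
rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
true_phi = np.array([0.5, 1.0])                         # illustrative truth
p = 1.0 / (1.0 + np.exp(-(true_phi[0] + true_phi[1] * x)))
delta = rng.binomial(1, p)                              # response indicators

def neg_loglik(phi):
    eta = phi[0] + phi[1] * x
    # Bernoulli log-likelihood of the response indicators (MAR: depends on x only)
    return -np.sum(delta * eta - np.log1p(np.exp(eta)))

phi_hat = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x
```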
13. Proposed method
Estimating equations
Under the MAR assumption, we can derive the score equation for φ as

U1(φ) = (1/n) Σ_{i=1}^n { δi/π(φ; xi) − (1 − δi)/(1 − π(φ; xi)) } ∂π(φ; xi)/∂φ =: (1/n) Σ_{i=1}^n s(φ; xi, δi).   (4)

The weighted estimating equation for θ is

U2(φ, θ) = (1/n) Σ_{i=1}^n { δi / π(φ; xi) } U(θ; xi, yi).   (5)
14. Proposed method
Bayesian technique
To introduce the proposed Bayesian inference corresponding to θ̂_PS, instead of solving the estimating equations (4) and (5) directly, we treat these estimating equations as random variables.
Rather than assigning a prior to (θ, φ), we first make a transformation of the parameters, denoted by η1 and η2, and put a prior distribution on η = (η1, η2), where

E{U1(φ) | φ} = η1(φ)
E{U2(φ, θ) | φ, θ} = η2(φ, θ).
15. Proposed method
Asymptotic distribution
The transformation

(φ, θ) −→ (η1, η2)   (6)

is one-to-one.
Under some regularity conditions, we can establish that

√n { (U1(φ), U2(φ, θ)) − (η1, η2) } →d N(0, Σ(η1, η2)).   (7)

Denote Un = (U1^T(φ), U2^T(φ, θ))^T, ζ = (φ^T, θ)^T, and η = (η1^T, η2)^T.
From the above asymptotic distribution, we obtain

[Un | η] ∼ N(η, Σ/n).   (8)
16. Proposed method
Key idea
The sampling distribution [Un|η] serves the role of the likelihood
function in the Bayesian inference.
Writing π(η) as a prior distribution of η, the posterior distribution of η given Un can be expressed as

[η | Un] ∝ [Un | η] π(η).   (9)

If we assign a flat prior for η, then the posterior distribution can be explicitly expressed as [η | Un = 0] ∼ N(0, Σ/n).
17. Proposed method
Algorithm
1. Generate η* = (η*1, η*2) from the posterior normal distribution N(0, Σ̂/n).
2. Solve U1(φ) = η*1 to obtain φ*.
3. Solve U2(φ*, θ) = η*2 to obtain θ*.
Steps 1–3 can be repeated independently to generate independent samples from the posterior distribution. The samples can be used to obtain the posterior distribution of the induced parameters.
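A runnable sketch of Steps 1–3 for a toy setting (the data-generating model, parameter values, and function names are illustrative assumptions, not from the talk): a logistic response model with U1 the logistic score and U(θ; y) = y − θ. Since U1 does not involve θ, solving the two equations jointly is equivalent to solving them sequentially as in Steps 2–3.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(2.0, 1.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)          # E(y) = 5 in this toy model
p = 1.0 / (1.0 + np.exp(-(-1.0 + 0.8 * x)))
delta = rng.binomial(1, p)
y_obs = np.where(delta == 1, y, 0.0)            # unobserved y's are never used below

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def scores(phi, theta):
    """Per-unit contributions to (U1, U2); U1 is the logistic score."""
    pi = expit(phi[0] + phi[1] * x)
    s0 = delta - pi                              # score w.r.t. intercept
    s1 = (delta - pi) * x                        # score w.r.t. slope
    u2 = delta / pi * (y_obs - theta)            # weighted estimating function for theta
    return np.column_stack([s0, s1, u2])

def U_bar(zeta):
    return scores(zeta[:2], zeta[2]).mean(axis=0)

# Point estimate and estimated variance of U_n
zeta_hat = root(U_bar, x0=np.array([0.0, 0.0, y_obs.mean()])).x
Sigma_hat = np.cov(scores(zeta_hat[:2], zeta_hat[2]).T) / n

# Steps 1-3, repeated to build up the posterior sample
draws = []
for _ in range(200):
    eta_star = rng.multivariate_normal(np.zeros(3), Sigma_hat)
    sol = root(lambda z: U_bar(z) - eta_star, x0=zeta_hat)
    if sol.success:
        draws.append(sol.x)
draws = np.array(draws)                          # columns: phi0*, phi1*, theta*
```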
18. Proposed method
Outline procedure
The procedure in outline: the asymptotic distribution gives p(Un | η); assigning the prior π(η) yields the posterior p(η | Un); draws η* from this posterior are mapped to ζ = (θ, φ) through the one-to-one relation (solving Un = η*), giving p(ζ | Un), which is used for inference.
20. Asymptotic properties
Notation
Let Un(ζ) = (U1(φ)^T, U2(φ, θ))^T and ζ = (φ, θ).
Using the asymptotic distribution of Un with a flat prior on η(ζ), the proposed method can be described as the following two-step method:
[Step 1] Generate η* from the posterior distribution

p(η | Un = 0) ∼ N(0, V̂_U),   (10)

where V̂_U is a consistent estimator of Var(Un).
[Step 2] Solve Un(ζ) = η* for ζ to obtain ζ*.
21. Asymptotic properties
Theorem
Let ζ̂ be the solution to Un(ζ) = 0. Under regularity conditions, the posterior distribution p(ζ | ζ̂), generated by the two-step method above, satisfies

p(ζ | ζ̂) −→ φ_{ζ̂, Var(ζ̂)}(ζ)   (11)

plim_{n→∞} ∫_{Nn(ζ0)} φ_{ζ̂, Var(ζ̂)}(ζ) dζ = 1,   (12)

where φ_{ζ̂, Var(ζ̂)}(·) is the density of the normal distribution with mean ζ̂ and variance Var(ζ̂), and Nn(ζ0) is a neighborhood of ζ0.
Result (11) is the convergence of the posterior distribution to normality, and result (12) is the posterior consistency.
22. Asymptotic properties
Construct a level-α confidence region
To construct a level-α Bayesian highest posterior density (HPD) confidence region from Monte Carlo samples ζ*1, ζ*2, · · · , ζ*M of size M from p(ζ | ζ̂), we first compute the sample variance as a consistent estimator of Var(ζ̂).
Let dm = φ_{ζ̂, V̂ar(ζ̂)}(ζ*m) for m = 1, 2, · · · , M. Then we can approximate k*(α) by

k*M(α) = inf{ k : M^{-1} Σ_{m=1}^M I(dm ≥ k) ≥ 1 − α }.

We can construct the HPD confidence region as

Ĉ*(α) = { ζ : φ_{ζ̂, V̂ar(ζ̂)}(ζ) ≥ k*M(α) }.   (13)
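The HPD construction can be sketched as follows; here ζ̂ and V̂ are illustrative placeholder values, whereas in practice they would come from the posterior sampler:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
zeta_hat = np.array([1.0, 0.5])                          # illustrative point estimate
V_hat = np.array([[0.04, 0.01], [0.01, 0.09]])           # illustrative Var(zeta_hat)

M = 5000
alpha = 0.05
draws = rng.multivariate_normal(zeta_hat, V_hat, size=M) # Monte Carlo samples zeta*_m

dens = multivariate_normal(zeta_hat, V_hat).pdf(draws)   # d_m values
# k*_M(alpha) = inf{k : mean(I(d_m >= k)) >= 1 - alpha}, i.e. the alpha-quantile of d_m
k_star = np.quantile(dens, alpha)

def in_hpd(zeta):
    # Membership test for the HPD region C*(alpha) in (13)
    return multivariate_normal(zeta_hat, V_hat).pdf(zeta) >= k_star
```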
24. Extension 1: Incorporate auxiliary variables
Motivation
We now consider an extension of the proposed method to the problem of incorporating additional information from the full sample.
Note that the PS estimator of µx = E(X) can be computed as the solution to

Σ_{i=1}^n { δi / π(φ̂; xi) } (xi − µx) = 0,

which is not necessarily equal to µ̂x,n = n^{-1} Σ_{i=1}^n xi.
Including this extra information in propensity score estimation will improve the efficiency of the resulting PS estimator.
25. Extension 1: Incorporate auxiliary variables
Proposed method
To include such extra information, we may add

U3(φ, µx) = (1/n) Σ_{i=1}^n { δi / π(φ; xi) } (xi − µx)   (14)
U4(µx) = (1/n) Σ_{i=1}^n (xi − µx)   (15)

to the original estimating equations.
Note that we cannot directly apply the proposed two-step method in this case, since there are more equations than parameters.
26. Extension 1: Incorporate auxiliary variables
Proposed method (Con’t)
To resolve this problem, instead of using the two-step method, which involves generating η* first from (10), we consider a direct method that generates ψ* = (µ*x, φ*, θ*) directly from the posterior distribution of ψ = (µx, φ, θ) given the observed data.
Define

Un(ψ) = (U1^T(φ), U2(φ, θ), U3^T(φ, µx), U4^T(µx))^T.

Under some regularity conditions,

[Un | ψ] ∼ N(0, Σ(ψ)/n),   (16)

where

Var{ √n Un(ψ) | ψ } →P Σ(ψ).   (17)
27. Extension 1: Incorporate auxiliary variables
Proposed method (Con’t)
Then, letting π(ψ) be a prior distribution of ψ, the posterior distribution of ψ given Un can be represented as

[ψ | Un] ∝ [Un | ψ] π(ψ).   (18)

Note that we can still use the approximate normality of Un to play the role of the likelihood function in the Bayesian analysis. Note also that even if the prior distribution is normal, the posterior distribution

p(ψ | Un(ψ)) = g(Un | ψ) π(ψ) / ∫ g(Un | ψ) π(ψ) dψ   (19)

is no longer a normal distribution.
A version of the Metropolis–Hastings algorithm is developed to generate samples from (19).
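A generic random-walk Metropolis–Hastings sampler can be sketched as below; `log_post` stands for the log of the unnormalized posterior in (19), and the toy target used for the check is a standard normal (an assumption for illustration, not the talk's actual posterior):

```python
import numpy as np

def metropolis_hastings(log_post, psi0, n_iter=5000, step=2.4, seed=0):
    """Random-walk MH: symmetric Gaussian proposal, standard accept/reject."""
    rng = np.random.default_rng(seed)
    psi = np.asarray(psi0, dtype=float)
    lp = log_post(psi)
    out = np.empty((n_iter, psi.size))
    for t in range(n_iter):
        prop = psi + step * rng.normal(size=psi.size)   # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:        # MH acceptance ratio
            psi, lp = prop, lp_prop
        out[t] = psi
    return out

# Toy check: target N(0, 1)
samples = metropolis_hastings(lambda p: -0.5 * np.sum(p**2), psi0=[0.0])
```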
29. Extension 2: Bayesian Model selection
Bayesian model selection
Idea: If the dimension of X is large and the true response model is sparse, then such information can be incorporated into the prior model.
Spike-and-slab prior
Let X be p-dimensional.
The response model parameter φ = (φ1, · · · , φp) is then also p-dimensional.
For each φk, k = 1, · · · , p, we can use a mixture distribution as the prior distribution of φk:

π(φk) = wk I(φk = 0) + (1 − wk) π1(φk),

where π1(·) is a flat prior and wk is a hyperparameter of the prior distribution.
30. Extension 2: Bayesian Model selection
Bayesian model selection (Cont'd)
The spike-and-slab Gaussian mixture prior can be expressed as

φk | zk ∼ N(0, v0(1 − zk) + v1 zk)
zk ∼ Ber(wk),

where v0 ≈ 0 and v1 is a huge number, such as v1 = 10^7.
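Drawing from this prior is a two-stage simulation (the values of p, w, v0, v1 below are illustrative):

```python
import numpy as np

# Draws from the spike-and-slab Gaussian mixture prior:
# z_k ~ Ber(w_k), then phi_k | z_k ~ N(0, v0 (1 - z_k) + v1 z_k).
rng = np.random.default_rng(5)
p, w = 10, 0.5
v0, v1 = 1e-7, 1e7            # spike variance ~ 0, huge slab variance

z = rng.binomial(1, w, size=p)
sd = np.sqrt(np.where(z == 1, v1, v0))
phi = rng.normal(0.0, sd)     # spike components are essentially zero
```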
32. Extension 2: Bayesian Model selection
Remark
The I-step can be simplified as

z*k ∼ Ber( wk ψ(φ*k | 0, v1) / [ wk ψ(φ*k | 0, v1) + (1 − wk) ψ(φ*k | 0, v0) ] ),

where ψ(· | µ, v) is the density function of the normal distribution with mean µ and variance v.
The P-step can also be simplified if we use ABC. That is, if g(φ̂ | φ) is used instead of L(φ | x, δ), then the posterior distribution in (20) is also normal, which is easy to generate samples from.
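The simplified I-step amounts to a vectorized Bernoulli draw; the current values φ*k below are illustrative:

```python
import numpy as np
from scipy.stats import norm

# Simplified I-step: posterior inclusion draw for each phi_k*.
rng = np.random.default_rng(6)
v0, v1, w = 1e-7, 1e7, 0.5
phi_star = np.array([0.0001, 2.0, -1.5, 0.00005])   # illustrative current draws

slab = w * norm.pdf(phi_star, 0.0, np.sqrt(v1))      # w_k psi(phi_k* | 0, v1)
spike = (1 - w) * norm.pdf(phi_star, 0.0, np.sqrt(v0))
prob = slab / (slab + spike)                          # inclusion probability
z_star = rng.binomial(1, prob)
```

Values of φ*k far from zero get inclusion probability near 1 (the spike density underflows), while values near zero are almost surely set to z*k = 0.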
34. Simulation study
Simulation Study One
To validate the proposed methods and demonstrate the efficiency of the FBPS method, we devise a simulation study.
Factors in the experiment:
1 Outcome regression model Y | X: linear, quadratic, and exponential mean functions.
2 Response mechanism: logistic regression model, probit model.
35. Simulation study
Setup for outcome model
We generate x = (x1, x2)T from N(µ, Σx ), with µx = (2, 8)T
and
Σx =
4 1
1 8
.
For response variable y, we have three mean functions.
Function 1: m1(x) = 2x1 + 3x2 − 20
Function 2: m2(x) = 0.5(x1 − 2)2 + x2 − 2
Function 3: m3(x) = 0.1 exp(0.1x1 − 0.2) + 3x2 − 16
Let m(x) be one of the three mean functions. The first mean function
is linear function and other two are quadratic function and
exponential function. We generate the continuous response y from
y = m(x) + e, where e ∼ N(0, |x1| + 1).
Kim BPS 35 / 50
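The outcome-model setup can be sketched for mean function m1 (assuming, as the notation suggests, that |x1| + 1 is the error variance):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
mu_x = np.array([2.0, 8.0])
Sigma_x = np.array([[4.0, 1.0], [1.0, 8.0]])

x = rng.multivariate_normal(mu_x, Sigma_x, size=n)
m1 = 2 * x[:, 0] + 3 * x[:, 1] - 20                  # linear mean function
e = rng.normal(0.0, np.sqrt(np.abs(x[:, 0]) + 1.0))  # e ~ N(0, |x1| + 1), variance scale
y = m1 + e
```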
36. Simulation study
Response model
For true response mechanism, we use logistic regression model and
probit model. For the logistic model, the response indicator functions
δi are independently generated from Bernoulli distribution with
probability
pi (φ0, φ1) =
exp(φ0 + φ1xi1)
1 + exp(φ0 + φ1xi1)
. (21)
For the probit model, the response indicator functions are
independently generated from Bernoulli distribution with probability
pi (φ0, φ1) = Φ(φ0 + φ1xi1) (22)
where Φ(·) is cumulative distribution function of standard normal
distribution.
Set (φ0, φ1) to make the response rate be approximately equal to
70% .
Kim BPS 36 / 50
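Both mechanisms can be sketched as below; the (φ0, φ1) values are illustrative guesses tuned to give roughly a 70% response rate when x1 ∼ N(2, 4), not the values used in the talk:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
x1 = rng.normal(2.0, 2.0, size=100000)

# Logistic response model, cf. (21)
p_logit = 1.0 / (1.0 + np.exp(-(0.1 + 0.4 * x1)))
delta_logit = rng.binomial(1, p_logit)

# Probit response model, cf. (22)
p_probit = norm.cdf(0.05 + 0.25 * x1)
delta_probit = rng.binomial(1, p_probit)
```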
37. Simulation study
Assumed response model
For the sample realizations, we independently generate M = 2000 samples of size n = 500.
For each sample, we apply the true response mechanism to generate the incomplete sample.
We use the "working" response model

Pr(δi = 1 | xi, yi) = exp(φ0 + φ1 xi1 + φ2 xi2) / {1 + exp(φ0 + φ1 xi1 + φ2 xi2)} =: π(φ; xi).

For each Monte Carlo sample, we are interested in estimating the population mean of the response, θ = E(y), and making inference.
38. Simulation study
Methods
1. PS: Assume the logistic response model π(φ; xi) and apply the Taylor linearization technique to make inference.
2. BPS: Apply the proposed Bayesian method to the joint estimating equations

U1(φ) = (1/n) Σ_{i=1}^n {δi − π(φ; xi)} (1, xi^T)^T
U2(φ, θ) = (1/n) Σ_{i=1}^n {δi / π(φ; xi)} (yi − θ)
39. Simulation study
Methods (Con’t)
3. FBPS: Apply the proposed Bayesian method to the joint estimating equations

U1(φ) = (1/n) Σ_{i=1}^n {δi − π(φ; xi)} (1, xi^T)^T
U2(φ, θ) = (1/n) Σ_{i=1}^n {δi / π(φ; xi)} (yi − θ)
U3(φ, µx) = (1/n) Σ_{i=1}^n {δi / π(φ; xi)} (xi − µx)
U4(µx) = (1/n) Σ_{i=1}^n (xi − µx)
40. Simulation study
Simulation result with sample size n = 500.
Table: Simulation result with sample size n = 500. “m” denotes mean function.
“res” represents response model. “c p” is the coverage probability for the
corresponding confidence interval. “CI length” is the average length of the
confidence intervals.
res       m    method   c p     CI length
Logistic  m1   PS       0.942   1.840
Logistic  m1   BPS      0.945   1.845
Logistic  m1   FBPS     0.948   1.780
Logistic  m2   PS       0.940   0.887
Logistic  m2   BPS      0.940   0.886
Logistic  m2   FBPS     0.940   0.798
Logistic  m3   PS       0.950   1.566
Logistic  m3   BPS      0.947   1.567
Logistic  m3   FBPS     0.944   1.524
Probit    m1   PS       0.940   1.863
Probit    m1   BPS      0.946   1.874
Probit    m1   FBPS     0.946   1.783
Probit    m2   PS       0.944   0.899
Probit    m2   BPS      0.946   0.897
Probit    m2   FBPS     0.941   0.796
Probit    m3   PS       0.951   1.584
Probit    m3   BPS      0.947   1.580
Probit    m3   FBPS     0.947   1.524
41. Simulation study
Summary for Simulation One
Both the BPS and FBPS methods work well, providing valid interval estimators.
Comparing the lengths of the confidence intervals, we can see that the length of PS is very close to that of BPS. This confirms the asymptotic equivalence of BPS and PS.
Overall, the FBPS method is the most efficient, since the average length of its confidence intervals is the shortest.
42. Simulation study
Simulation Study Two
We now investigate the performance of the proposed sparse propensity score estimation.
Generate a random sample of size n = 500, {(xi, yi) : i = 1, 2, . . . , n}, from one of the following two models:

M1: yi = 2xi1 + 2xi2 + ei;   (23)
M2: yi ∼ Binomial{20, p(xi)}, independently;   (24)

where p(xi) = exp(xi3)/{1 + exp(xi3)}, xi = (xi1, xi2, . . . , xip)^T with xi1 = 1 and xi2, xi3, . . . , xip iid N(0, 1), and the errors ei are generated independently from χ²_3.
43. Simulation study
Setup (Con’t)
For i = 1, 2, . . . , n, generate the response indicator of yi from the following response mechanism:

δi ∼ Ber( exp(xi1 + xi2) / {1 + exp(xi1 + xi2)} ), independently.

Here, the average response rate is about 0.7.
44. Simulation study
Setup (Con’t)
We perform 2,000 Monte Carlo replications for each of p = 50 and p = 100.
Note that in our setup p controls the amount of sparsity in the propensity score model: as p increases, the propensity score model becomes more sparse.
We are interested in estimating θ = E(Y), which is the solution of E{U(θ; X, Y)} = E(Y − θ) = 0.
To assess the variable selection performance, we compute the true positive rate (TPR) and true negative rate (TNR), where TPR is the proportion of the regression coefficients that are correctly identified as nonzero and TNR is the proportion that are correctly identified as zero.
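The TPR/TNR computation can be sketched as below; `selected` stands for the model-selection output (inclusion indicators) and `truth` for the true sparsity pattern, both illustrative names:

```python
import numpy as np

def tpr_tnr(selected, truth):
    """TPR: fraction of true nonzeros selected; TNR: fraction of true zeros dropped."""
    selected = np.asarray(selected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tpr = np.mean(selected[truth])      # correctly identified nonzero coefficients
    tnr = np.mean(~selected[~truth])    # correctly identified zero coefficients
    return tpr, tnr

tpr, tnr = tpr_tnr([1, 1, 0, 0, 1], [1, 1, 0, 0, 0])
```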
45. Simulation study
Methods
For each realized sample, we apply the following methods:
1. PS: the traditional PS estimator with the full set of covariates; the asymptotic normal distribution is used to make inference.
2. TPS: the traditional PS estimator with the true model; we also use the asymptotic distribution for inference.
3. BSPS: we set w1 = · · · = wp = 0.5, ν0 = 10^{-7}, ν1 = 10^7, and σ²_θ = 10^7 to induce noninformative priors.
46. Simulation study
Simulation results
Table: Simulation results for Case 1 (M1): “Bias” is the bias of the point
estimator for θ, “S.E.” represents the standard error of the point estimator,
“E[S.E.]” is the estimated standard error, “CP” represents the coverage
probability of the 95% confidence interval estimate.
p Method Bias S.E. E[S.E.] CP TPR TNR
50 PS 0.012 0.189 0.161 0.923
50 TPS 0.004 0.171 0.168 0.955
50 BSPS -0.003 0.173 0.168 0.953 1.000 0.995
100 PS 0.023 0.235 0.147 0.828
100 TPS 0.007 0.172 0.167 0.947
100 BSPS 0.002 0.174 0.168 0.944 0.998 0.996
47. Simulation study
Simulation results
Table: Simulation results for Case 2 (M2): “Bias” is the bias of the point
estimator for θ, “S.E.” represents the standard error of the point estimator,
“E[S.E.]” is the estimated standard error, “CP” represents the coverage
probability of the 95% confidence interval estimate.
p Method Bias S.E. E[S.E.] CP TPR TNR
50 PS 0.009 0.249 0.213 0.914
50 TPS 0.008 0.268 0.260 0.945
50 BSPS 0.008 0.267 0.260 0.946 1.000 0.995
100 PS -0.004 0.285 0.194 0.834
100 TPS -0.005 0.264 0.260 0.949
100 BSPS -0.003 0.264 0.260 0.948 0.998 0.996
48. Simulation study
Summary for simulation Two
Including unnecessary covariates in the response model will increase the variance of the resulting PS estimator.
However, the usual variance estimator ignores this extra variability and underestimates the true variance.
The proposed method correctly chooses the true model with high probability and provides valid coverage rates.
50. Conclusion
Conclusion and future work
A Bayesian approach is developed for propensity score estimation.
Prior information can be incorporated naturally.
Use of the spike-and-slab prior for model selection seems to work very well in the simulation study.
The proposed method is developed under the missing-at-random (MAR) assumption. We plan to extend it to handle the non-MAR case.