Robustness under Independent Contamination Model

Robustness under Independent Contamination

Mike Danilov

November 21, 2009


Traditional robustness
  Definition of contamination
  Simple examples
  Weighted representation

Independent Contamination
  The Idea
  Why traditional robust estimates don't work
  Naive approaches
  Cell-weighting approach


The Problem (aka Disclaimer) and Terminology

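Estimation of the mean vector $\mu$ and covariance matrix $\Sigma$ of a
supposedly i.i.d. multivariate sample $x_1, \ldots, x_n \in \mathbb{R}^p$.
Data matrix:

$$
X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
  = \begin{pmatrix}
      x_{11} & x_{12} & \cdots & x_{1p} \\
      x_{21} & x_{22} & \cdots & x_{2p} \\
      \vdots & \vdots & \ddots & \vdots \\
      x_{n1} & x_{n2} & \cdots & x_{np}
    \end{pmatrix}
$$

Vectors $x_i \in \mathbb{R}^p$: data cases
Values $x_{ij} \in \mathbb{R}$: data values or cells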


Types of error in Statistics
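1. Usual statistical error.
Every observation is moderately affected:

$$X_{\mathrm{obs}} = X_{\mathrm{mean}} + e, \quad e \sim N(0, \sigma^2),$$

where the variance of $e$ defines the quality of the data.

2. Contamination.
Some observations are ruined:

$$
X_{\mathrm{obs}} =
\begin{cases}
  X_{\mathrm{good}}, & \text{usually} \\
  X_{\mathrm{horrible}}, & \text{sometimes.}
\end{cases}
$$

Typically comes on top of the usual error:

$$X_{\mathrm{good}} = X_{\mathrm{mean}} + e.$$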

Mixture contamination model
Observed data come from the mixture distribution
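
$$F = (1 - \varepsilon) F_0(\theta) + \varepsilon H,$$

where $F_0(\theta)$ is the distribution of interest and
$H$ is an arbitrary unknown nuisance distribution.
Equivalently,

$$X = (1 - B) X_{\mathrm{good}} + B X_{\mathrm{horrible}},$$

where $B$ is a Bernoulli($\varepsilon$) indicator.
Estimate $T(F)$: feed data from $F$, obtain estimates for $\theta$.
Breakdown point:

$$\varepsilon_{BP}(T) = \sup\{\varepsilon : \sup_H \lVert T(F(\theta, \varepsilon, H)) \rVert < \infty\},$$

that is, the largest $\varepsilon$ at which $T$ can still isolate $F_0$ from $H$.
The maximum achievable (and desirable) value is 0.5:
$\varepsilon_{BP}(T) \le 0.5$.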
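
The mixture model can be tried out numerically. The sketch below is only an illustration: the choices $F_0 = N(0, 1)$, $H$ = point mass at 1000, and the helper name `sample_mixture` are assumptions, not part of the slides. It draws the Bernoulli indicator $B$ and compares the sample mean with the median.

```python
import numpy as np

def sample_mixture(n, eps, rng):
    """Draw n values from (1 - eps) * F0 + eps * H.

    Assumed for illustration: F0 = N(0, 1), H = point mass at 1000.
    """
    x_good = rng.normal(0.0, 1.0, size=n)
    x_horrible = np.full(n, 1000.0)
    b = rng.random(n) < eps              # Bernoulli(eps) indicators B
    x = np.where(b, x_horrible, x_good)  # X = (1 - B) X_good + B X_horrible
    return x, b

rng = np.random.default_rng(0)
x, b = sample_mixture(n=1000, eps=0.1, rng=rng)

sample_mean = x.mean()        # dragged towards H
sample_median = np.median(x)  # stays near the centre of F0
print(f"mean = {sample_mean:.2f}, median = {sample_median:.2f}")
```

With 10% contamination the mean is pulled far from zero while the median moves only slightly, which is the behaviour the breakdown point formalises.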

Examples: simple robust estimates

Location
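Median: $x_{(n/2)}$
Trimmed mean: $\frac{1}{n(1-\delta)} \sum_{i=n\delta/2}^{n(1-\delta/2)} x_{(i)}$, with $\delta \in (0, 1)$.

Scale
MAD: $\operatorname{Median}_i \, |x_i - \operatorname{Median}_j \, x_j|$
IQR: $x_{(3n/4)} - x_{(n/4)}$
Regression
LMS: $\arg\min_{\beta} \operatorname{Median}_i \, (y_i - \beta^\top x_i)^2$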
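
The location and scale estimates above are easy to compute. The following is a minimal sketch, assuming NumPy; the function names and the rounding of the trimming count are illustrative choices, and LMS regression is left out because it needs a search over $\beta$.

```python
import numpy as np

def trimmed_mean(x, delta):
    """Mean of the order statistics after dropping a fraction delta/2 from each end."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    k = int(np.floor(n * delta / 2))  # number trimmed per side (rounding is a choice)
    return x[k:n - k].mean()

def mad(x):
    """Median absolute deviation from the median (no consistency factor)."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

def iqr(x):
    """Interquartile range: upper quartile minus lower quartile."""
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1

x = [1, 2, 3, 4, 100]  # one gross outlier
print(np.median(x), trimmed_mean(x, 0.5), mad(x), iqr(x))
print(np.mean(x), np.std(x))  # the classical estimates, for comparison
```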


Examples: multivariate robust estimates
Minimum Covariance Determinant (MCD) by Rousseeuw (1985):
minimize determinant of sample covariance of 50% of data points:
[Figure: plot with legend entries "Sample Covariance", "MCD", and "Clean"; axis range from −6 to 6.]
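
To make the MCD definition concrete, here is a brute-force sketch that searches all subsets of half the points and keeps the one whose sample covariance has the smallest determinant. This exhaustive search is only feasible for tiny data sets and is not the algorithm used in practice; the function name `mcd_exhaustive` and the toy data are assumptions made for illustration.

```python
import itertools
import numpy as np

def mcd_exhaustive(X, h):
    """Exact MCD by enumeration: the h-subset with minimum covariance determinant."""
    n = X.shape[0]
    best_det, best_idx = np.inf, None
    for idx in itertools.combinations(range(n), h):
        det = np.linalg.det(np.cov(X[list(idx)], rowvar=False))
        if det < best_det:
            best_det, best_idx = det, idx
    subset = X[list(best_idx)]
    return subset.mean(axis=0), np.cov(subset, rowvar=False), best_idx

rng = np.random.default_rng(1)
clean = rng.normal(size=(10, 2))                         # rows 0..9
outliers = rng.normal(loc=20.0, scale=0.5, size=(4, 2))  # rows 10..13
X = np.vstack([clean, outliers])

mu_mcd, sigma_mcd, subset = mcd_exhaustive(X, h=7)  # 50% of the 14 points
print("MCD subset:", subset)
print("MCD mean:", mu_mcd, " sample mean:", X.mean(axis=0))
```

The selected subset avoids the outlying cluster, so the MCD mean stays near the clean data while the sample mean is shifted.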


Weighted representation
Many robust estimates can be represented as weighted versions of
familiar estimates
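
$$\hat\mu = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

$$\hat\Sigma = \frac{\sum_{i=1}^{n} w_i (x_i - \hat\mu)(x_i - \hat\mu)^\top}{\sum_{i=1}^{n} w_i},$$

with weights depending on the estimates themselves

$$w_i = w(\mathrm{MD}(x_i; \hat\mu, \hat\Sigma)),$$

where Mahalanobis Distances are given by

$$\mathrm{MD}(x_i; \hat\mu, \hat\Sigma) = (x_i - \hat\mu)^\top \hat\Sigma^{-1} (x_i - \hat\mu).$$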
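
Because the weights depend on the estimates and the estimates on the weights, such estimates are usually computed by iterating. The sketch below is one possible illustration, not a specific published estimator: the hard-rejection weight function, the cutoff 9.21 (roughly the 0.99 quantile of a chi-square with 2 degrees of freedom, so suited to $p = 2$), and the median/MAD starting point are all assumptions.

```python
import numpy as np

def mahalanobis(X, mu, sigma):
    """MD(x_i; mu, sigma) = (x_i - mu)^T sigma^{-1} (x_i - mu) for every row."""
    R = X - mu
    return np.einsum("ij,ij->i", R @ np.linalg.inv(sigma), R)

def weighted_mean_cov(X, w):
    """Weighted mean and covariance as in the formulas above."""
    mu = (w[:, None] * X).sum(axis=0) / w.sum()
    R = X - mu
    sigma = (w[:, None] * R).T @ R / w.sum()
    return mu, sigma

def reweighted_estimate(X, cutoff=9.21, n_iter=20):
    # Assumed start: coordinate-wise median and MAD-based diagonal scatter.
    mu = np.median(X, axis=0)
    scale = 1.4826 * np.median(np.abs(X - mu), axis=0)
    sigma = np.diag(scale ** 2)
    for _ in range(n_iter):
        # Assumed weight function: w = 1 if MD <= cutoff, else 0.
        w = (mahalanobis(X, mu, sigma) <= cutoff).astype(float)
        mu, sigma = weighted_mean_cov(X, w)
    return mu, sigma, w  # w are the weights behind the final estimates

rng = np.random.default_rng(2)
clean = rng.normal(size=(100, 2))
outliers = 10.0 + 0.1 * rng.normal(size=(10, 2))
X = np.vstack([clean, outliers])
mu_hat, sigma_hat, w = reweighted_estimate(X)
print(mu_hat, w[100:].sum())
```

With all weights equal to one, `weighted_mean_cov` reduces to the sample mean and the maximum-likelihood covariance.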


Contaminated cells not cases
[Figure: two panels, "Traditional Contamination" and "Independent Contamination", with ε = 10%.]


Generalized Contamination

Data entry errors, hardware malfunction, etc.
Can express as

$$X_j = (1 - B_j)(X_{\mathrm{good}})_j + B_j (X_{\mathrm{horrible}})_j, \quad j = 1, \ldots, p,$$

or, in matrix form, as

$$X = (I - B) X_{\mathrm{good}} + B X_{\mathrm{horrible}},$$

where $B = \operatorname{diag}(B_1, \ldots, B_p)$ holds the Bernoulli random variables $B_j$.
The dependence structure of the $B_j$ is important.
Will assume Independent Contamination: all $B_j$ are
independent of each other and of the $X$'s.
Also: $P[B_j = 1] = \varepsilon$ for simplicity.
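
A small simulation may help contrast the two contamination mechanisms. In this sketch the function names, the constant value 50 used for the "horrible" data, and the sample sizes are illustrative assumptions; the indicator matrix `B` has one Bernoulli($\varepsilon$) entry per cell in the independent model and one per case in the traditional model.

```python
import numpy as np

def contaminate_cells(X_good, X_horrible, eps, rng):
    """Independent contamination: every cell gets its own Bernoulli(eps) indicator."""
    B = rng.random(X_good.shape) < eps
    return np.where(B, X_horrible, X_good), B

def contaminate_cases(X_good, X_horrible, eps, rng):
    """Traditional contamination: one Bernoulli(eps) indicator per case (row)."""
    b = rng.random(X_good.shape[0]) < eps
    B = np.repeat(b[:, None], X_good.shape[1], axis=1)
    return np.where(B, X_horrible, X_good), B

rng = np.random.default_rng(3)
n, p, eps = 2000, 5, 0.1
X_good = rng.normal(size=(n, p))
X_horrible = np.full((n, p), 50.0)  # assumed nuisance values

X_cell, B_cell = contaminate_cells(X_good, X_horrible, eps, rng)
X_case, B_case = contaminate_cases(X_good, X_horrible, eps, rng)

print("cells hit:", B_cell.mean(), B_case.mean())
print("cases hit:", B_cell.any(axis=1).mean(), B_case.any(axis=1).mean())
```

Both mechanisms spoil about the same share of cells, but the independent model touches a much larger share of cases.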


Number of clean cases

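A case with any contaminated cell will appear as an outlier if diagnosed with Mahalanobis distances (MDs)
$P[\text{case is clean}] = (1 - \varepsilon)^p$
e.g. with $\varepsilon = 0.05$ and $p = 20$, only about 36% of cases are clean
Discarding or downweighting whole cases is a waste of data
The fraction of contaminated cases then exceeds the breakdown point of traditional robust estimates.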
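
The clean-case probability can be checked both analytically and by simulation. This is a minimal sketch; the function names and the simulation size are assumptions chosen for illustration.

```python
import numpy as np

def clean_case_probability(eps, p):
    """P[case is clean] = (1 - eps)^p under independent contamination."""
    return (1.0 - eps) ** p

def simulated_clean_fraction(eps, p, n, rng):
    """Fraction of simulated cases in which no cell is contaminated."""
    B = rng.random((n, p)) < eps
    return (~B.any(axis=1)).mean()

rng = np.random.default_rng(0)
eps, p = 0.05, 20
theory = clean_case_probability(eps, p)
simulated = simulated_clean_fraction(eps, p, n=20000, rng=rng)
print(f"theory = {theory:.4f}, simulated = {simulated:.4f}")

# How quickly the clean fraction decays with the dimension p:
for dim in (2, 5, 10, 20, 50):
    print(dim, round(clean_case_probability(eps, dim), 3))
```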


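Affine-equivariance

Definition: if the data set is transformed as $Y = A + XB$, where every row of $A$ equals the same shift vector $a^\top$ and $B$ is nonsingular, then

$$\hat\mu(Y) = a + B^\top \hat\mu(X)$$
$$\hat\Sigma(Y) = B^\top \hat\Sigma(X) B.$$

Desirable: easy to study, etc. If we know how the estimate behaves on $X$, then we know how it behaves on $Y$, and vice versa
Most "respectable" robust estimates are affine-equivariant (A-E)
Alqallaf et al. (2009) prove that reasonable A-E
estimates cannot be robust against independent contamination (IC)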
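
Affine equivariance can be verified numerically for the sample mean and covariance. In the sketch below the data, the shift `a`, and the matrix `B` are arbitrary illustrative choices. The coordinate-wise median is included for contrast: it is not affine-equivariant in general, so its gap is typically nonzero.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 51, 3
X = rng.normal(size=(n, p))
a = np.array([1.0, -2.0, 0.5])  # shift vector (each row of A equals a)
B = rng.normal(size=(p, p))     # nonsingular with probability one
Y = a + X @ B                   # transformed data set

# Sample mean and covariance satisfy the equivariance identities exactly
# (up to floating-point error).
mean_gap = np.abs(Y.mean(axis=0) - (a + B.T @ X.mean(axis=0))).max()
cov_gap = np.abs(np.cov(Y, rowvar=False) - B.T @ np.cov(X, rowvar=False) @ B).max()

# The coordinate-wise median generally does not.
median_gap = np.abs(np.median(Y, axis=0) - (a + B.T @ np.median(X, axis=0))).max()

print(mean_gap, cov_gap, median_gap)
```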


Affine Transformation of Contaminated Data

[Figure: panels "Original", "Contaminated", and "Transformed", illustrating the transformation X → Y = XB.]
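
One way to see what a linear transformation does to cell-wise contamination is to count the affected cells before and after. The sketch below rests on assumptions made purely for illustration: one spoiled cell in each of ten cases, the value 100 for the spoiled cells, and a matrix `B` with no zero entries.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 4
X_good = rng.normal(size=(n, p))

# Spoil exactly one cell in every tenth case.
X = X_good.copy()
rows = np.arange(0, n, 10)
X[rows, rows % p] = 100.0

B = np.ones((p, p)) + np.eye(p)  # nonsingular, all entries nonzero
Y, Y_good = X @ B, X_good @ B

changed_X = ~np.isclose(X, X_good)
changed_Y = ~np.isclose(Y, Y_good)
print("cells changed before transform:", changed_X.sum())
print("cells changed after transform: ", changed_Y.sum())
```

Each spoiled cell spreads across its whole row after the transformation, so cell-wise contamination in $X$ looks like case-wise contamination in $Y$.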


Pairwise approach

$P[\text{pair of variables is clean}] = (1 - \varepsilon)^2 \gg (1 - \varepsilon)^p$
Estimate all elements $\hat\Sigma_{ab}$, for $a, b = 1, \ldots, p$, separately
Problem: multivariate structure is damaged/destroyed
Particular problem: the resulting $\hat\Sigma$ may not be positive-definite.
May or may not be a problem. Usually is.
Studied to some extent by Alqallaf (2003, PhD thesis)
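
The loss of positive-definiteness is easy to reproduce. In the sketch below, cells flagged as bad are represented by NaN and each correlation is computed from the cases where both variables are available; the plain Pearson correlation stands in for a robust pairwise estimate, and the tiny data set is constructed by hand, so all of this is an illustrative assumption.

```python
import numpy as np

def pairwise_corr(X):
    """Correlation matrix built entry by entry from jointly available cells."""
    p = X.shape[1]
    R = np.eye(p)
    for a in range(p):
        for b in range(a + 1, p):
            ok = ~np.isnan(X[:, a]) & ~np.isnan(X[:, b])
            R[a, b] = R[b, a] = np.corrcoef(X[ok, a], X[ok, b])[0, 1]
    return R

nan = np.nan
X = np.array([
    [1, 1, nan], [2, 2, nan], [3, 3, nan],  # variables 0 and 1 move together
    [nan, 1, 1], [nan, 2, 2], [nan, 3, 3],  # variables 1 and 2 move together
    [1, nan, 3], [2, nan, 2], [3, nan, 1],  # variables 0 and 2 move oppositely
], dtype=float)

R = pairwise_corr(X)
eigenvalues = np.linalg.eigvalsh(R)
print(R)
print("eigenvalues:", eigenvalues)
```

Every entry is a valid correlation on its own, yet the assembled matrix has a negative eigenvalue, so it is not a valid correlation matrix.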


Detecting cells

Some are obvious: univariate outliers
Some only show up with respect to other cells: structural outliers
Van Aelst et al. (2009) use Stahel-Donoho projections
Little and Smith (1987) used partial Mahalanobis distances:

if $\mathrm{MD}(x; \hat\mu, \hat\Sigma)$ is large,
consider $\mathrm{MD}(x_{-j}; \hat\mu, \hat\Sigma)$ for all $j = 1, \ldots, p$,
where $x_{-j}$ is $x$ with the $j$-th cell removed.

Mike explores the MD approach and iterative estimation of
covariances in his thesis.
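
The partial-distance idea can be sketched as follows: drop each cell in turn and recompute the distance from the remaining coordinates; the cell whose removal makes the distance collapse is the natural suspect. This is only an illustration of the idea, not a reproduction of the Little and Smith procedure, and the function names, the known $\mu$ and $\Sigma$, and the example case are assumptions.

```python
import numpy as np

def md(x, mu, sigma):
    """MD(x; mu, sigma) = (x - mu)^T sigma^{-1} (x - mu)."""
    r = x - mu
    return float(r @ np.linalg.solve(sigma, r))

def partial_md(x, mu, sigma):
    """MD computed on x with the j-th cell removed, for every j."""
    p = len(x)
    out = np.empty(p)
    for j in range(p):
        keep = np.arange(p) != j
        out[j] = md(x[keep], mu[keep], sigma[np.ix_(keep, keep)])
    return out

p = 4
mu = np.zeros(p)
sigma = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))  # unit variances, correlation 0.5
x = np.array([0.5, 0.4, 8.0, 0.3])               # cell 2 is suspicious

full = md(x, mu, sigma)
partial = partial_md(x, mu, sigma)
suspect = int(np.argmin(partial))
print("full MD:", full)
print("partial MDs:", partial, "-> suspect cell:", suspect)
```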


Weighted estimate with cell weights

Van Aelst et al. (2009) proposed a weighted estimate, but it is
pairwise and may not be symmetric positive definite (SPD)
Mike knows how to deal with zero weights: remove the values
and treat them as missing completely at random (MCAR). Then do maximum likelihood estimation (MLE) via the EM algorithm, for example.
A proper cell-weighted estimate is still to be developed.
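
For the zero-weight case, the EM step can be sketched for a multivariate normal model with missing cells coded as NaN. This is a textbook-style illustration under the MCAR assumption, not the method from the thesis; the function name `em_mvn_missing`, the starting values, and the stopping rule are assumptions.

```python
import numpy as np

def em_mvn_missing(X, n_iter=200, tol=1e-8):
    """ML estimates of mean and covariance of a normal sample with NaN cells."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)              # assumed start: available-cell moments
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        X_hat = np.where(miss, mu, X)       # missing cells filled in below
        C_sum = np.zeros((p, p))            # summed conditional covariances
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():                 # whole case missing
                C_sum += sigma
                continue
            S_mo = sigma[np.ix_(m, o)]
            K = np.linalg.solve(sigma[np.ix_(o, o)], S_mo.T).T  # S_mo S_oo^{-1}
            # E-step: conditional mean and covariance of the missing cells.
            X_hat[i, m] = mu[m] + K @ (X[i, o] - mu[o])
            C_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - K @ S_mo.T
        # M-step: update the mean and covariance.
        mu_new = X_hat.mean(axis=0)
        R = X_hat - mu_new
        sigma_new = (R.T @ R + C_sum) / n
        change = max(np.abs(mu_new - mu).max(), np.abs(sigma_new - sigma).max())
        mu, sigma = mu_new, sigma_new
        if change < tol:
            break
    return mu, sigma

rng = np.random.default_rng(6)
L = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
X_full = rng.normal(size=(200, 3)) @ L
X_miss = X_full.copy()
X_miss[rng.random(X_full.shape) < 0.1] = np.nan  # cells given zero weight

mu_em, sigma_em = em_mvn_missing(X_miss)
print(mu_em)
print(np.linalg.eigvalsh(sigma_em))  # all positive: the estimate is SPD
```

Unlike a pairwise estimate, the result is a single covariance matrix that stays symmetric positive definite, which is the attraction of treating zero-weight cells as missing.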

