Robustness under Independent Contamination

                Mike Danilov


              November 21, 2009




Traditional robustness
   Definition of contamination
   Simple examples
   Weighted representation


Independent Contamination
   The Idea
   Why traditional robust estimates don’t work
   Naive approaches
   Cell-weighting approach




The Problem (aka Disclaimer) and Terminology


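      Estimation of the mean vector $\mu$ and covariance matrix $\Sigma$ of a
      supposedly i.i.d. multivariate sample $x_1, \dots, x_n \in \mathbb{R}^p$.
      Data matrix

      $$
      X = \begin{pmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{pmatrix}
        = \begin{pmatrix}
            x_{11} & x_{12} & \cdots & x_{1p} \\
            x_{21} & x_{22} & \cdots & x_{2p} \\
            \vdots & \vdots & \ddots & \vdots \\
            x_{n1} & x_{n2} & \cdots & x_{np}
          \end{pmatrix}
      $$

      Vectors $x_i \in \mathbb{R}^p$: data cases
      Values $x_{ij} \in \mathbb{R}$: data values or cells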




Types of error in Statistics
     1. Usual statistical error.
        Every observation is moderately affected

        $$X_{\text{obs}} = X_{\text{mean}} + e, \quad e \sim N(0, \sigma^2),$$
        where the variance of $e$ determines the quality of the data.



     2. Contamination.
        Some observations are ruined:

        $$
        X_{\text{obs}} = \begin{cases}
            X_{\text{good}}, & \text{usually} \\
            X_{\text{horrible}}, & \text{sometimes.}
        \end{cases}
        $$

       Typically comes on top of the usual error:

        $$X_{\text{good}} = X_{\text{mean}} + e.$$

Mixture contamination model
      Observed data come from the mixture distribution
      $$F = (1 - \varepsilon) F_0(\theta) + \varepsilon H$$
          $F_0(\theta)$ is the distribution of interest
          $H$ is an arbitrary unknown nuisance distribution.
      Equivalently
      $$X = (1 - B) X_{\text{good}} + B X_{\text{horrible}},$$
      where $B$ is a Bernoulli($\varepsilon$) indicator.
      Estimate $T(F)$: feed data from $F$, obtain estimates for $\theta$.
          Breakdown point
          $$\varepsilon_{BP}(T) = \sup\Big\{\varepsilon : \sup_H \|T(F(\theta, \varepsilon, H))\| < \infty\Big\},$$
          that is, the maximum $\varepsilon$ such that $T$ can still isolate $F_0$ from $H$.
          Maximum achievable (and desirable)
          $$\varepsilon_{BP}(T) \le 0.5.$$

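The mixture model can be simulated directly from its indicator form. The sketch below is only an illustration: the choices $F_0 = N(0, 1)$ and $H = N(50, 1)$, the function name `sample_mixture`, and the parameter `bad_loc` are assumptions made here, not part of the slides. It compares the sample mean with the median as $\varepsilon$ grows.

```python
import numpy as np

def sample_mixture(n, eps, rng, bad_loc=50.0):
    """Draw n values from F = (1 - eps) * F0 + eps * H.

    Assumptions for illustration: F0 = N(0, 1) and H = N(bad_loc, 1).
    """
    good = rng.normal(0.0, 1.0, size=n)
    horrible = rng.normal(bad_loc, 1.0, size=n)
    b = rng.random(n) < eps            # Bernoulli(eps) indicator B
    return np.where(b, horrible, good), b

rng = np.random.default_rng(0)
for eps in (0.0, 0.1, 0.3):
    x, b = sample_mixture(10_000, eps, rng)
    print(f"eps={eps:.1f}  mean={x.mean():6.2f}  median={np.median(x):6.2f}")
```

Under these assumptions the mean is dragged toward the nuisance distribution roughly in proportion to $\varepsilon$, while the median moves only slightly.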
Examples: simple robust estimates


      Location
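          Median: $x_{(n/2)}$
          Trimmed mean: $\frac{1}{n(1-\delta)} \sum_{i = n\delta/2}^{n(1-\delta/2)} x_{(i)}$, with $\delta \in (0, 1)$.

      Scale
          MAD: $\operatorname{Median}_i \, |x_i - \operatorname{Median}_j \, x_j|$
          IQR: $x_{(3n/4)} - x_{(n/4)}$
      Regression
          LMS: $\arg\min_\beta \operatorname{Median}_i \, (y_i - \beta^\top x_i)^2$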
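The location and scale estimates listed above can be sketched in a few lines of NumPy. The function names are chosen here for illustration, the trimmed mean rounds the number of trimmed points per tail down, and the MAD is returned without any consistency factor.

```python
import numpy as np

def trimmed_mean(x, delta):
    """Mean after dropping a fraction delta/2 of the order statistics from each tail."""
    xs = np.sort(np.asarray(x, dtype=float))
    k = int(len(xs) * delta / 2)       # number trimmed per tail (rounded down)
    return xs[k:len(xs) - k].mean()

def mad(x):
    """Median absolute deviation from the median (no consistency factor)."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

def iqr(x):
    """Interquartile range: upper quartile minus lower quartile."""
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1

x = [1.0, 2.0, 3.0, 4.0, 100.0]        # one gross outlier
print("mean:", np.mean(x), "median:", np.median(x))
print("trimmed mean:", trimmed_mean(x, 0.4), "MAD:", mad(x), "IQR:", iqr(x))
```

The single outlier pulls the plain mean far from the bulk of the data, while the median and the trimmed mean stay at 3.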




Examples: multivariate robust estimates
   Minimum Covariance Determinant (MCD) by Rousseeuw (1985):
   minimize determinant of sample covariance of 50% of data points:
   [Figure: plot comparing the Sample Covariance, MCD, and Clean estimates; axis range −6 to 6.]
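A rough way to approximate the MCD is to start from random subsets and repeatedly keep the points closest in Mahalanobis distance to the current subset, in the spirit of the concentration steps of the FAST-MCD algorithm. The sketch below is an assumption-laden illustration, not the procedure used for the slide: `mcd_sketch`, its defaults, and the random-start strategy are all choices made here.

```python
import numpy as np

def mcd_sketch(X, h=None, n_starts=10, n_steps=30, seed=0):
    """Approximate MCD: search for an h-subset with small covariance determinant."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    h = (n + p + 1) // 2 if h is None else h     # roughly 50% of the points
    rng = np.random.default_rng(seed)
    best_det, best_mu, best_cov = np.inf, None, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=h, replace=False)
        for _ in range(n_steps):
            mu = X[idx].mean(axis=0)
            cov = np.cov(X[idx], rowvar=False)
            diff = X - mu
            d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
            new_idx = np.argsort(d2)[:h]         # keep the h closest points
            if set(new_idx) == set(idx):
                break
            idx = new_idx
        mu = X[idx].mean(axis=0)
        cov = np.cov(X[idx], rowvar=False)
        det = np.linalg.det(cov)
        if det < best_det:
            best_det, best_mu, best_cov = det, mu, cov
    return best_mu, best_cov

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(size=(80, 2)),                       # clean points
               rng.normal(loc=10.0, scale=0.5, size=(20, 2))]) # outlying cluster
mu_mcd, cov_mcd = mcd_sketch(X)
print("sample mean:", X.mean(axis=0), " MCD location:", mu_mcd)
```

The returned covariance is the raw covariance of the selected subset; no consistency or reweighting correction is applied in this sketch.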




Weighted representation
   Many robust estimates can be represented as weighted versions of
   familiar estimates
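   $$\hat\mu = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}$$

   $$\hat\Sigma = \frac{\sum_{i=1}^n w_i (x_i - \hat\mu)(x_i - \hat\mu)^\top}{\sum_{i=1}^n w_i},$$

   with weights depending on the estimates themselves

   $$w_i = w(\mathrm{MD}(x_i; \hat\mu, \hat\Sigma)),$$

   where Mahalanobis Distances are given by

   $$\mathrm{MD}(x_i; \hat\mu, \hat\Sigma) = (x_i - \hat\mu)^\top \hat\Sigma^{-1} (x_i - \hat\mu).$$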
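Because the weights depend on the estimates and the estimates on the weights, the representation above suggests a fixed-point iteration. The sketch below assumes a hard-rejection weight function, $w = 1$ when the distance is at most `cutoff` and $0$ otherwise, and a coordinatewise median/MAD start; both are illustrative choices made here, as is the name `reweighted_estimate`.

```python
import numpy as np

def reweighted_estimate(X, cutoff, n_iter=20):
    """Iterate weighted mean/covariance with weights w_i = 1{MD_i <= cutoff}."""
    X = np.asarray(X, dtype=float)
    # Crude robust start (an assumption): coordinatewise median and MAD-based scales.
    mu = np.median(X, axis=0)
    scale = 1.4826 * np.median(np.abs(X - mu), axis=0)
    sigma = np.diag(scale ** 2)
    w = None
    for _ in range(n_iter):
        diff = X - mu
        md = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
        new_w = (md <= cutoff).astype(float)
        if w is not None and np.array_equal(new_w, w):
            break                                # weights stopped changing
        w = new_w
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
        diff = X - mu
        sigma = (w[:, None] * diff).T @ diff / w.sum()
    return mu, sigma, w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(90, 2)),
               8.0 + 0.1 * rng.normal(size=(10, 2))])   # 10 outlying cases
cutoff = -2.0 * np.log(0.025)    # 97.5% quantile of chi-square with 2 d.f.
mu, sigma, w = reweighted_estimate(X, cutoff)
print("location:", mu, " cases kept:", int(w.sum()))
```

Hard rejection without a consistency correction shrinks the covariance somewhat; smoother weight functions are a common alternative.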

Contaminated cells not cases
  [Figure: two panels, Traditional Contamination (left) and Independent Contamination (right), with $\varepsilon = 10\%$.]




Generalized Contamination

      Data entry errors, hardware malfunction, etc
      Can express as

       $$X_j = (1 - B_j)(X_{\text{Good}})_j + B_j (X_{\text{Horrible}})_j, \quad j = 1, \dots, p,$$

      or, in matrix form, as

       $$X = (1 - B) X_{\text{Good}} + B X_{\text{Horrible}},$$

      where $B$ is a vector of Bernoulli r.v.'s
      $B$'s dependence structure is important
      Will assume Independent Contamination: all $B_j$ are
      independent of each other and of the $X$'s.
      Also: $P[B_j = 1] = \varepsilon$ for simplicity.
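The difference between the two contamination schemes is easy to see in simulation. In the sketch below the function names and the constant "horrible" value of 100 are illustrative assumptions; the only modelling ingredient taken from the slide is that each $B_j$ is an independent Bernoulli($\varepsilon$) indicator.

```python
import numpy as np

def contaminate_cells(X_good, X_horrible, eps, rng):
    """Independent contamination: each cell is replaced with probability eps."""
    B = rng.random(X_good.shape) < eps          # independent Bernoulli(eps) per cell
    return np.where(B, X_horrible, X_good), B

def contaminate_cases(X_good, X_horrible, eps, rng):
    """Traditional contamination: each whole row is replaced with probability eps."""
    b = rng.random(X_good.shape[0]) < eps       # one Bernoulli(eps) per case
    B = np.repeat(b[:, None], X_good.shape[1], axis=1)
    return np.where(B, X_horrible, X_good), B

rng = np.random.default_rng(0)
n, p, eps = 1000, 10, 0.1
X_good = rng.normal(size=(n, p))
X_horrible = np.full((n, p), 100.0)   # arbitrary "horrible" values, for illustration
for name, f in [("cases", contaminate_cases), ("cells", contaminate_cells)]:
    X, B = f(X_good, X_horrible, eps, rng)
    print(name, "cells hit:", B.mean(), "rows hit:", B.any(axis=1).mean())
```

Both schemes damage about the same share of cells, but the cellwise scheme spreads the damage over many more rows.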


Number of clean cases




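      every case containing a contaminated cell will appear as an outlier if diagnosed with MD's
      $P[\text{case is clean}] = (1 - \varepsilon)^p$
      e.g. with $\varepsilon = 0.05$ and $p = 20$, only about 36% of cases are clean
      waste of data
      the share of contaminated cases exceeds the breakdown point of traditional robust estimates.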
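The formula $(1 - \varepsilon)^p$ is simple enough to tabulate. The helper names below are made up for this sketch; the second function returns the smallest dimension at which fewer than half of the cases are expected to be clean, which is where a 50% breakdown point is exceeded.

```python
import math

def clean_case_probability(eps, p):
    """Probability that all p cells of a case are uncontaminated."""
    return (1.0 - eps) ** p

def dimension_where_majority_is_contaminated(eps):
    """Smallest p with (1 - eps)**p < 0.5, for 0 < eps < 1."""
    return math.floor(math.log(0.5) / math.log(1.0 - eps)) + 1

for p in (1, 5, 10, 20, 50):
    print(p, round(clean_case_probability(0.05, p), 3))
print("majority contaminated from p =", dimension_where_majority_is_contaminated(0.05))
```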




Affine-equivariance


      Definition: if data set $Y = A + XB$, where every row of $A$ is the same shift vector $a^\top$, then

      $$\hat\mu(Y) = a + B^\top \hat\mu(X)$$
      $$\hat\Sigma(Y) = B^\top \hat\Sigma(X) B,$$

      Desirable: easy to study etc.
      Most “respectable” robust estimates are A-E
      Alqallaf et al (2009) have a proof that reasonable A-E
      estimates cannot be robust against IC
      if we know how the estimate behaves on $X$, then we know how it behaves on $Y$, and vice versa
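Affine equivariance can be checked numerically for the classical sample mean and covariance, which satisfy it exactly. In this sketch each row of the data matrix is one case, so a transformed case is $y_i = a + B^\top x_i$; the particular `B` and shift `a` are arbitrary values picked for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, p))            # a random p x p matrix (nonsingular with probability 1)
a = np.array([1.0, -2.0, 0.5])         # shift added to every row
Y = a + X @ B

def mean_and_cov(Z):
    """Sample mean vector and sample covariance matrix of the rows of Z."""
    return Z.mean(axis=0), np.cov(Z, rowvar=False)

mu_x, sigma_x = mean_and_cov(X)
mu_y, sigma_y = mean_and_cov(Y)
print(np.allclose(mu_y, a + B.T @ mu_x))          # location transforms with the data
print(np.allclose(sigma_y, B.T @ sigma_x @ B))    # scatter transforms with the data
```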




Affine Transformation of Contaminated Data
   [Figure: two panels, Original Contaminated and Transformed, connected by the map $X \to Y = XB$.]
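One way to read this slide is that an affine map mixes the columns, so a single bad cell in a row of $X$ can affect every cell of the corresponding row of $Y$. The sketch below illustrates that reading under assumed values: a shift of 100 for contaminated cells and a matrix `B` whose entries are all nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 1000, 5, 0.1
X_good = rng.normal(size=(n, p))
hit = rng.random((n, p)) < eps                  # independently contaminated cells
X = np.where(hit, X_good + 100.0, X_good)       # illustrative shift of +100

B = np.ones((p, p)) + np.eye(p)                 # nonsingular, all entries nonzero
changed = ~np.isclose(X @ B, X_good @ B)        # cells of Y that differ from the clean Y

print("contaminated cells in X:", hit.mean())
print("contaminated cells in Y:", changed.mean())
print("rows with any contaminated cell:", hit.any(axis=1).mean())
```

After the transformation, contamination is case-wise: every row that had at least one bad cell is bad in all of its cells.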




Pairwise approach




      $P[\text{pair of variables is clean}] = (1 - \varepsilon)^2 \gg (1 - \varepsilon)^p$
      Estimate all elements $\hat\Sigma_{ab}$, for $a, b = 1, \dots, p$ separately
      Problem: multivariate structure is damaged/destroyed
      Particular problem: may not be positive-definite.
      May or may not be a problem. Usually is.
      Studied to some extent by Alqallaf (2003, PhD thesis)
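The loss of positive definiteness is easy to reproduce. The sketch below assumes contaminated cells have already been flagged and set to NaN, and estimates each $\hat\Sigma_{ab}$ from the rows where both cells are present; this pairwise-complete estimator and the tiny data set are illustrative choices, not the estimators studied in the cited thesis.

```python
import numpy as np

def pairwise_cov(X):
    """Estimate each covariance entry from the rows where both cells are present."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    S = np.empty((p, p))
    for a in range(p):
        for b in range(a, p):
            ok = ~np.isnan(X[:, a]) & ~np.isnan(X[:, b])
            S[a, b] = S[b, a] = np.cov(X[ok, a], X[ok, b])[0, 1]
    return S

nan = np.nan
X = np.array([
    [-1, -1, nan], [0, 0, nan], [1, 1, nan],    # columns 0 and 1 agree
    [-1, nan, -1], [0, nan, 0], [1, nan, 1],    # columns 0 and 2 agree
    [nan, -1, 1], [nan, 0, 0], [nan, 1, -1],    # columns 1 and 2 disagree
])
S = pairwise_cov(X)
print(S)
print("smallest eigenvalue:", np.linalg.eigvalsh(S).min())
```

Each pair of columns is internally consistent, yet the three pairwise relationships cannot hold jointly, so the assembled matrix has a negative eigenvalue.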




Detecting cells


       Some are obvious: univariate outliers
       Some only show up with respect to other cells: structural
       outliers
       Van Aelst et al (2009) use Stahel-Donoho projections
       Little and Smith (1987) used partial Mahalanobis distances:

                 if $\mathrm{MD}(x; \hat\mu, \hat\Sigma)$ is large,
                 consider $\mathrm{MD}(x_{-j}; \hat\mu, \hat\Sigma)$ for all $j = 1, \dots, p$.

       Mike explores MD-approach and iterative estimation of
       covariances in his thesis.
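The partial-distance idea can be sketched as follows: drop one coordinate at a time and recompute the distance from the matching sub-vector of $\hat\mu$ and sub-matrix of $\hat\Sigma$. The function names and the equicorrelated example are assumptions made here; treating the cell whose removal gives the smallest distance as the suspect is one plausible reading, not necessarily the exact rule of Little and Smith.

```python
import numpy as np

def md2(x, mu, sigma):
    """Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ np.linalg.solve(sigma, d))

def partial_md2(x, mu, sigma):
    """Distance of x with coordinate j removed, for every j."""
    p = len(x)
    out = np.empty(p)
    for j in range(p):
        keep = np.delete(np.arange(p), j)
        out[j] = md2(x[keep], mu[keep], sigma[np.ix_(keep, keep)])
    return out

rho = 0.9
sigma = np.full((3, 3), rho) + (1 - rho) * np.eye(3)   # unit variances, correlation 0.9
mu = np.zeros(3)
x = np.array([1.5, 1.5, -1.5])     # no cell is extreme on its own
parts = partial_md2(x, mu, sigma)
print("full:", md2(x, mu, sigma), " partial:", parts, " suspect cell:", parts.argmin())
```

Here no coordinate is a univariate outlier, but the third one contradicts the strong positive correlation, so removing it makes the distance collapse: a structural outlier.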




Weighted estimate with cell weights




      Van Aelst et al (2009) proposed a weighted estimate, but it is
      pairwise and not SPD
      Mike knows how to deal with zero weights - remove the values
      and treat them as MCAR. Then do MLE via EM, for example.
      Proper cell-weighted estimate is still to be developed.
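The zero-weight case mentioned above can be sketched with the textbook EM algorithm for a multivariate normal with missing cells. This is a generic illustration under a Gaussian assumption, with flagged cells encoded as NaN; the function name and stopping rule are choices made here, not the implementation from the thesis.

```python
import numpy as np

def em_mvn_missing(X, n_iter=100, tol=1e-8):
    """EM estimate of mean and covariance for Gaussian data with NaN cells."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)                   # start from observed-cell moments
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        filled = np.where(miss, mu, X)           # E-step: conditional means
        extra = np.zeros((p, p))                 # E-step: conditional covariances
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():                      # nothing observed in this row
                extra += sigma
                continue
            # Regression coefficients of missing cells on observed cells.
            coef = np.linalg.solve(sigma[np.ix_(o, o)], sigma[np.ix_(o, m)])
            filled[i, m] = mu[m] + (X[i, o] - mu[o]) @ coef
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - sigma[np.ix_(m, o)] @ coef
        new_mu = filled.mean(axis=0)             # M-step
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + extra) / n
        change = max(np.abs(new_mu - mu).max(), np.abs(new_sigma - sigma).max())
        mu, sigma = new_mu, new_sigma
        if change < tol:
            break
    return mu, sigma
```

Because every update is a sum of outer products plus conditional covariance blocks, the resulting estimate stays symmetric and positive semidefinite, unlike the pairwise construction.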




The End

