CNRS - UPMC, Laboratoire d’Informatique de Paris 6




    Outskewer:
    Using Skewness to Spot Outliers
    in Samples and Time Series
    Sébastien Heymann, Matthieu Latapy, Clémence Magnien




    ASONAM 2012
Did you know?


Outlier detection is an important problem in data mining.

[Image: xkcd comic, source: https://xkcd.com/539/]


                             How to detect outliers?



         • There is no formal definition; it is a subjective concept.
         • It depends on the case and on the hypotheses made about the data.
         • Intuitively: identify values that deviate markedly from
              the rest of the values (Grubbs, 1969).






               Usual approaches in the literature

         • Hypothesis: the data follow a normal distribution.
         • Distance between data points and theoretical values.






                                      Problem statement



    Most of the time, we cannot make strong assumptions about:
         • the theoretical distribution of the values;
         • how the data should evolve over time (time series).


    We therefore want a method that makes no hypothesis about the data.




Our Method


                                    Skewness coefficient

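    \[
      \gamma = \frac{n}{(n-1)(n-2)} \sum_{x \in X} \left( \frac{x - \text{mean}}{\text{standard deviation}} \right)^{3}
    \]

    [Figure: two density curves over x, the left one with γ < 0 and the right one with γ > 0.]
    Example of skewed distributions.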




    The skewness coefficient is sensitive to extremal values (min/max) far from the mean!
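
As a rough illustration of the formula and of this sensitivity, the Python sketch below computes the coefficient for the sample used later in the skewness signature example, with and without its largest value. The function name `skewness` is ours, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption; it is the reading that reproduces the values printed on the slides.

```python
import math


def skewness(values):
    """Skewness coefficient: n / ((n-1)(n-2)) * sum(((x - mean) / sd) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least 3 values are needed")
    mean = math.fsum(values) / n
    deviations = [x - mean for x in values]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = math.sqrt(math.fsum(d * d for d in deviations) / (n - 1))
    if sd == 0:
        raise ValueError("undefined when all values are equal")
    # Same formula, with the cubes summed first and divided by sd**3 afterwards.
    return n / ((n - 1) * (n - 2)) * math.fsum(d ** 3 for d in deviations) / sd ** 3


sample = [-3, -2, -1, -1, 0, 1, 2, 3, 7]
print(round(skewness(sample), 2))       # 1.09, with the extreme value 7
print(round(skewness(sample[:-1]), 2))  # 0.22, once 7 is removed
```

A single value far from the mean moves the coefficient from about 0.22 to about 1.09 here.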




                                      Skewness signature

    Definition
    Evolution of skewness coefficient γ when extremal values are
    removed one by one from the sample.

    Algorithm
    If γ > 0 then remove max(X),
    else remove min(X).

    Example
    X = {-3, -2, -1, -1, 0, 1, 2, 3, 7}
    γ: 1.09, 0.22, 0.17, 0, 0.4, 0, 1.73

    [Figure: skewness (0.0 to 1.5) plotted against the number of extremal values removed (1 to 7).]
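
The removal loop above can be sketched in Python as follows. The names `skewness` and `skewness_signature` are ours, and two details are assumptions rather than statements from the slides: the standard deviation uses the n − 1 denominator, and the loop stops at 3 values, the fewest for which the formula is defined.

```python
import math


def skewness(values):
    """Skewness coefficient with the sample standard deviation (assumption)."""
    n = len(values)
    mean = math.fsum(values) / n
    deviations = [x - mean for x in values]
    sd = math.sqrt(math.fsum(d * d for d in deviations) / (n - 1))
    return n / ((n - 1) * (n - 2)) * math.fsum(d ** 3 for d in deviations) / sd ** 3


def skewness_signature(values):
    """Skewness recorded before each removal of one extremal value."""
    remaining = sorted(values)
    signature = []
    # Assumption: stop below 3 values, or when all remaining values are equal
    # (zero standard deviation), since the coefficient is undefined there.
    while len(remaining) >= 3 and remaining[0] != remaining[-1]:
        gamma = skewness(remaining)
        signature.append(gamma)
        if gamma > 0:
            remaining.pop()    # remove max(X)
        else:
            remaining.pop(0)   # remove min(X)
    return signature


X = [-3, -2, -1, -1, 0, 1, 2, 3, 7]
print([round(g, 2) for g in skewness_signature(X)])
# [1.09, 0.22, 0.17, 0.0, 0.4, 0.0, 1.73]
```

The rounded output matches the γ values of the example. On real-valued data a skewness that is zero in theory may come out as a tiny positive or negative number, so the `gamma > 0` test may need a tolerance.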




                          Our method: Outskewer

    Our definition
    Outlier = extremal value which skews a distribution of values.

    Implication
    The removal of these extremal values one by one should reduce
    the skewness of the distribution.

    Implication
    Otherwise, there is no outlier as we define it.






               Outskewer: non-relevant cases

    Cases where extremal values far from the mean are common,
    e.g. power-law distributions.






                                             Outskewer: p-stability
    Is the signature p-stable?
    p: fraction of extremal values removed.
    p-stable ⇔ |γ(p′)| ≤ 0.5 − p for each p′ from p to 0.5,
    where γ(p′) is the skewness once a fraction p′ of extremal values has been removed.

    [Figure: left, cumulative distribution of the sample values x; right, |skewness| as a function of p, with marks t and T.]
                                             Example: 0.16-stable but not 0.30-stable
    If the signature is p-stable for some p: there may be outliers.
    If it is p-stable for no p: the skewness coefficient is always too large,
    so no outlier as we define it can lie in the sample.
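
A minimal sketch of the p-stability test, under stated assumptions: the threshold 0.5 − p is held fixed while the skewness is checked at every removal fraction between p and 0.5, and the fraction after removing k of n values is taken to be k / n. The names `is_p_stable` and `stable_fractions` are hypothetical, and the second signature below is made up for illustration.

```python
def is_p_stable(signature, n, p):
    """True if |gamma| <= 0.5 - p at every removal fraction in [p, 0.5].

    signature[k] is the skewness after removing k extremal values out of n,
    so the removal fraction at step k is taken to be k / n (an assumption).
    """
    threshold = 0.5 - p
    in_range = [abs(g) for k, g in enumerate(signature) if p <= k / n <= 0.5]
    # Assumption: with no step inside [p, 0.5] there is nothing to check,
    # so the signature is reported as not p-stable.
    return bool(in_range) and max(in_range) <= threshold


def stable_fractions(signature, n):
    """All fractions k / n at which the signature is p-stable."""
    return [k / n for k in range(len(signature)) if is_p_stable(signature, n, k / n)]


# Rounded signature of the 9-value example from the skewness signature slide.
deck_signature = [1.09, 0.22, 0.17, 0.0, 0.4, 0.0, 1.73]
print(stable_fractions(deck_signature, 9))   # []

# Hypothetical signature for a sample of n = 10 values.
toy_signature = [0.9, 0.6, 0.25, 0.1, 0.05, 0.02]
print(stable_fractions(toy_signature, 10))   # [0.2, 0.3, 0.4]
```

Under this reading, the small 9-value example is stable for no fraction on the grid, while the made-up signature is stable from 0.2 to 0.4.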






                                           Outskewer: outlier detection

    [Figure: left, cumulative frequency of the sample values x, each marked as not outlier, potential outlier, or outlier; right, |skewness| as a function of p, with marks t, T, t′, T′ and three regions: area of outliers, area of potential outliers, area with no outlier.]

                             t: smallest t-stable value; t′: smallest value p such that |γ(p)| ≤ 0.5 − t
                             T: largest T-stable value; T′: smallest value p such that |γ(p)| ≤ 0.5 − T
                             Example: 50 values, including 7 outliers and 5 potential outliers


                                     Outskewer: outcome

       Each value of the sample is given one of the following statuses:
         • not outlier
         • potential outlier
         • outlier
         • unknown, when the method is not applicable (skewness
           signature never p-stable).






                            Extension to time series

    With a sliding window of size w, each value of X is classified w
    times, once per window that contains it.
    The final class of a value is the one that appears most often.
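
A minimal sketch of this majority vote, assuming the per-window classification is available as a function. `sliding_window_vote` and `toy_classifier` are hypothetical names; the toy classifier is a simple stand-in based on the distance to the window median, not Outskewer itself. Values near the ends of the series fall in fewer than w windows, and ties go to the label seen first; both behaviours are choices of this sketch.

```python
from collections import Counter
from statistics import median


def sliding_window_vote(series, w, classify_window):
    """Classify each value in every window containing it, then keep the majority label."""
    votes = [Counter() for _ in series]
    for start in range(len(series) - w + 1):
        labels = classify_window(series[start:start + w])
        for offset, label in enumerate(labels):
            votes[start + offset][label] += 1
    # Assumption: a value covered by no window (series shorter than w) is "unknown".
    return [v.most_common(1)[0][0] if v else "unknown" for v in votes]


def toy_classifier(window):
    """Stand-in classifier (NOT Outskewer): flag values far from the window median."""
    m = median(window)
    return ["outlier" if abs(x - m) > 10 else "not outlier" for x in window]


series = [0, 1, 0, 1, 0, 50, 1, 0, 1, 0]
labels = sliding_window_vote(series, 4, toy_classifier)
print(labels[5])  # outlier
```

Passing the classifier as a parameter keeps the voting logic independent of how each window is classified.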




    [Figure: sliding window along the time axis.]




Experimental Validation
               False positive rate.
                  Regime change.


                                        False positive rate



         • Normal distribution: 3% for n = 10, 0.01% for n = 100


         • Pareto distribution: 5% for n = 100, 0.01% for n = 1000






                                                       Regime change

    [Video frames: time series x plotted against time t (0 to 200), with each value marked as not outlier, potential outlier, outlier, or unknown.]
                              q qq qq          q                1   qq qq
                                                                      q qq
                                                                                    q
                                                                                   qq      q
                                                                                     q qq qq          q                1   qq qq
                                                                                                                             q qq
                                                                                                                                            q
                                                                                                                                           qq      q
                                                                                                                                             q qq qq           q
                                q q
                                 q                                                     q q
                                                                                        q                                                      q q
                                                                                                                                                q                             q
    x




                                                           x




                                                                                                                  x
                   qqq q q qqq
                    q
                   q q
                           q
                                  q                                       qqq q q qqq
                                                                           q
                                                                          q q
                                                                                  q
                                                                                         q                                        qqq q q qqq
                                                                                                                                   q
                                                                                                                                  q q
                                                                                                                                          q
                                                                                                                                                 q
             q                                                      q                                                       q
         0   q qq q q q qqq q qq       q                        0   q qq q q q qqq q qq       q                        0    q qq q q q qqq q qq       q
               qq q q     q q q qqq
                          qq                                          qq q q     q q q qqq
                                                                                 qq                                           qq q q     q q q qqq
                                                                                                                                         qq
                qq       qq q q qqq
                q q q qq qq q qq
             qq qq q qq               q                                qq       qq q q qqq
                                                                       q q q qq qq q qq
                                                                    qq qq q qq               q                                 q
                                                                                                                               q        qq q q qqq
                                                                                                                              q q q qq qq q qq
                                                                                                                           qq qq q qq                q
        −1     q
              qq q     qq    q q                               −1     q
                                                                     qq q     qq    q q                               −1     q
                                                                                                                             qq q     qq    q q
                         qq                                                     qq                                                      qq
        −2       q qq q
                                q     q                        −2       q qq q
                                                                                       q     q                        −2        q qq q
                                                                                                                                               q     q


             0         50         100          150   200            0         50         100          150   200            0         50         100          150      200
                                  t                                                      t                                                      t




Experimental Results
     • French population during the 20th century.
     • Logs of a P2P search engine.


                                           French population
                                         during the 20th century
    Number of inhabitants per year
    [Figure: population (about 40M to 60M) against Year, 1900 to 2000.]

    Difference over years
    [Figure: ∆population (about −1,500,000 to 1,000,000) against Year, 1900 to 2000; each point has a status: not outlier, potential outlier, or outlier.]
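
    The "Difference over years" series is the year-to-year change in population rather than the raw count. A minimal sketch of that transformation follows; the function name and the numbers are illustrative assumptions, not the census data behind the slide.

```python
def yearly_differences(values):
    """Return the change between each consecutive pair of values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Hypothetical yearly counts, made up for illustration only.
population = [40_000_000, 40_100_000, 39_200_000, 39_500_000]
deltas = yearly_differences(population)
print(deltas)  # [100000, -900000, 300000]
```

    Working on differences removes the long-term trend, so a sudden drop or jump stands out from the usual year-to-year variation.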






                                Harry Potter on eDonkey
    Number of outliers per day
    [Figure: # outliers / day (0 to 75) against Date, 15 Jul to 1 Dec; annotated events: in theatre, unknown event, pirate release; legend: outliers, potential outliers.]



    Data:
                        • search logs from the eDonkey P2P network.
                        • number of queries containing “half blood prince” per hour,
                            computed every 10 minutes.
                        • over 28 weeks.
                        • over 205 million queries.
                        • from 24.4 million IP addresses.
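
    The time series described above is an hourly count re-evaluated every 10 minutes, so consecutive windows overlap. A minimal sketch of such a sliding count is given below. It assumes timestamps in seconds, already filtered to matching queries and sorted, and a trailing window [t − 1 h, t); the slide does not say how the window is aligned, and `sliding_counts` is a hypothetical helper.

```python
from bisect import bisect_left


def sliding_counts(timestamps, start, end, window=3600, step=600):
    """Count events in [t - window, t) for t = start + window, then every `step` seconds.

    `timestamps` must be sorted in increasing order (seconds).
    """
    counts = []
    t = start + window
    while t <= end:
        first = bisect_left(timestamps, t - window)  # first event inside the window
        last = bisect_left(timestamps, t)            # first event at or after t
        counts.append((t, last - first))
        t += step
    return counts


# Hypothetical query times in seconds, for illustration only.
query_times = [0, 100, 700, 3500, 3700, 4300]
series = sliding_counts(query_times, start=0, end=4800)
print(series)  # [(3600, 4), (4200, 3), (4800, 3)]
```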



                                                 Contributions



    Our method:
         • is non-parametric, apart from the size of the time window.
         • classifies values only when the statistical conditions are met.
         • generalizes naturally to on-line analysis.






                                                      Conclusion


         • Motivation: outlier detection with no hypothesis on data.
         • Method based on the skewness of distributions.
         • Excellent experimental results.
         • Relevant on various data sets.
         • Open source code in R at
              http://outskewer.sebastien.pro
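
    Since the method rests on the skewness of a sample, a small sketch of a skewness coefficient may help. This part of the deck does not show which estimator is used, so the sketch assumes the adjusted Fisher-Pearson sample skewness; it is written in Python for illustration and is not the R code linked above.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness (an assumed choice of estimator)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    if variance == 0:
        raise ValueError("skewness is undefined for a constant sample")
    sd = math.sqrt(variance)
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)


print(sample_skewness([1, 2, 3, 4, 5]))   # symmetric sample: close to 0
print(sample_skewness([1, 1, 1, 1, 10]))  # one large value: clearly positive
```

    A single extreme value pulls the coefficient away from zero, which is the intuition behind using skewness to spot outliers.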




Questions?
Outskewer: Using Skewness to Spot Outliers
               in Samples and Time Series
            <sebastien.heymann@lip6.fr>


    Homogeneous / heterogeneous data
    Outlier = unexpected extremal value?

    Are extremal values far from the mean?
      • heterogeneous distributions (Pareto, Zipf...): common
      • homogeneous distributions (normal, Laplace...): uncommon

    [Figure: probability density functions of the normal and Pareto laws; density on a log scale (10^0 down to 10^−20) against x from −10 to 10.]
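
    The contrast between the two families can be checked numerically by comparing how much probability each law leaves far from its centre. The sketch below assumes a standard normal law and a Pareto law with scale 1 and shape 2; the parameters behind the slide's plot are not given, so these values are illustrative only.

```python
import math


def normal_tail(x, mu=0.0, sigma=1.0):
    """P(X > x) for a normal law with mean mu and standard deviation sigma."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))


def pareto_tail(x, x_min=1.0, alpha=2.0):
    """P(X > x) for a Pareto law with scale x_min and shape alpha."""
    return 1.0 if x < x_min else (x_min / x) ** alpha


# Print the two tail probabilities side by side for a few thresholds.
for x in (2, 5, 10):
    print(x, normal_tail(x), pareto_tail(x))
```

    Under these assumed parameters the Pareto tail decays polynomially while the normal tail decays faster than exponentially, so a value far from the bulk is unremarkable for the first and highly unexpected for the second.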



                                       Skewness signature
    Normal
    [Figure: s(p) (about −2 to 2) against p from 0.0 to 1.0; curves: median, min, max, q1, q3.]

    Pareto
    [Figure: s(p) (about −2 to 8) against p from 0.0 to 1.0; curves: median, min, max, q1, q3.]
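
    This slide does not define s(p). One plausible reading, taken here purely as an assumption, is that s(p) is the skewness of the sample once a fraction p of its most extremal values has been removed. The sketch below follows that reading with an assumed removal rule (drop the largest value while the skewness is positive, otherwise the smallest) and an assumed estimator (adjusted Fisher-Pearson); the function names are hypothetical and this is not the authors' implementation.

```python
import math


def skewness(values):
    """Adjusted Fisher-Pearson sample skewness; 0.0 for a constant sample."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    if variance == 0:
        return 0.0  # assumption: treat a constant sample as symmetric
    sd = math.sqrt(variance)
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)


def skewness_signature(values, max_fraction=0.5):
    """Return [(p, s(p))] for p = 0, 1/n, 2/n, ... up to max_fraction."""
    remaining = sorted(values)
    n = len(remaining)
    if n < 3:
        raise ValueError("need at least 3 values")
    signature = [(0.0, skewness(remaining))]
    # Always keep at least 3 values so the skewness stays defined.
    max_removed = min(int(max_fraction * n), n - 3)
    for removed in range(1, max_removed + 1):
        # Assumed rule: remove the value on the side the sample leans towards.
        if signature[-1][1] > 0:
            remaining.pop()       # drop the current maximum
        else:
            remaining.pop(0)      # drop the current minimum
        signature.append((removed / n, skewness(remaining)))
    return signature


# Hypothetical sample: nine regular values and one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
for p, s in skewness_signature(sample):
    print(p, s)
```

    On this toy sample the skewness is strongly positive at p = 0 and falls to about zero once the single extreme value is removed, the kind of behaviour a signature plot is meant to expose.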


      Local view of the internet topology
    [Figure: Nb nodes (about 11000 to 13000) against Nb rounds (0 to 5000); legend: outlier, potential outlier, not outlier, unknown.]




    M. Latapy, C. Magnien and F. Ouédraogo, A Radar for the Internet, in Complex Systems, 20 (1), 23-30, 2011.

Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

Outskewer: Using Skewness to Spot Outliers in Samples and Time Series

  • 1. CNRS - UPMC, Laboratoire d’Informatique de Paris 6. Outskewer: Using Skewness to Spot Outliers in Samples and Time Series. Sébastien Heymann, Matthieu Latapy, Clémence Magnien. ASONAM 2012.
  • 2. Did you know? Outlier detection is an important problem in data mining. Source: https://xkcd.com/539/
  • 3. How to detect outliers? • No formal definition: it is a subjective concept. • Depends on the case and on the hypotheses made about the data. • Intuitively: identify values which deviate remarkably from the remainder of the values (Grubbs, 1969).
  • 4. Usual approaches in the literature. • Hypothesis: data ∼ normal distribution. • Distance between data points and theoretical values.
  • 5. Problem statement. Most of the time, we can’t make strong assumptions about: • the theoretical distribution of values. • how the data should evolve over time (time series). Thus we want a method which makes no hypothesis about the data.
  • 8. Skewness coefficient: $\gamma = \frac{n}{(n-1)(n-2)} \sum_{x \in X} \left( \frac{x - \text{mean}}{\text{standard deviation}} \right)^3$. [Figure: example of skewed distributions, density against x, one with γ < 0 and one with γ > 0.] It is sensitive to extremal values (min/max) far from the mean!
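As a worked instance of the coefficient, take the three-value sample X = {−1, −1, 0}. The computation below assumes that "standard deviation" means the sample standard deviation s (divisor n − 1). That reading is an assumption, although it reproduces the γ values quoted on the skewness signature slide.

```latex
\begin{align*}
n &= 3, \qquad \text{mean} = \tfrac{-1-1+0}{3} = -\tfrac{2}{3} \\
s^2 &= \tfrac{1}{n-1} \sum_{x \in X} (x - \text{mean})^2
     = \tfrac{1}{2}\left(\tfrac{1}{9} + \tfrac{1}{9} + \tfrac{4}{9}\right) = \tfrac{1}{3} \\
\sum_{x \in X} (x - \text{mean})^3 &= -\tfrac{1}{27} - \tfrac{1}{27} + \tfrac{8}{27} = \tfrac{2}{9} \\
\gamma &= \frac{n}{(n-1)(n-2)} \cdot \frac{2/9}{s^3}
        = \frac{3}{2} \cdot \frac{2}{9} \cdot 3\sqrt{3} = \sqrt{3} \approx 1.73
\end{align*}
```

The single value 0 lies above the two equal values, so the sample is positively skewed.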
  • 9. Skewness signature. Definition: evolution of the skewness coefficient γ when extremal values are removed one by one from the sample. Algorithm: if γ > 0 then remove max(X), else remove min(X). Example: X = {-3, -2, -1, -1, 0, 1, 2, 3, 7}; γ: 1.09, 0.22, 0.17, 0, 0.4, 0, 1.73. [Figure: skewness against the number of extremal values removed.]
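The signature in the example can be reproduced with a short Python sketch. It is not the authors' R code: the function names `skewness` and `skewness_signature` are made up for illustration, and the sample standard deviation (divisor n − 1) is assumed. The loop stops when fewer than three values remain, since the coefficient is undefined below n = 3.

```python
import math


def skewness(values):
    """Skewness coefficient gamma, assuming the sample standard deviation."""
    n = len(values)
    mean = math.fsum(values) / n
    deviations = [x - mean for x in values]
    # Assumption: "standard deviation" uses the divisor n - 1.
    std = math.sqrt(math.fsum(d * d for d in deviations) / (n - 1))
    third = math.fsum(d ** 3 for d in deviations)
    return n / ((n - 1) * (n - 2)) * third / std ** 3


def skewness_signature(sample):
    """Return gamma after 0, 1, 2, ... extremal values have been removed."""
    remaining = sorted(sample)
    signature = []
    # gamma needs at least 3 values and a non-zero spread.
    while len(remaining) >= 3 and remaining[0] != remaining[-1]:
        gamma = skewness(remaining)
        signature.append(gamma)
        if gamma > 0:
            remaining.pop()    # remove max(X)
        else:
            remaining.pop(0)   # remove min(X)
    return signature


X = [-3, -2, -1, -1, 0, 1, 2, 3, 7]
print([round(g, 2) for g in skewness_signature(X)])
# prints [1.09, 0.22, 0.17, 0.0, 0.4, 0.0, 1.73]
```

Summing the cubed deviations before dividing by the cubed standard deviation keeps γ exactly 0 for the symmetric intermediate samples, so the "else remove min(X)" branch is taken as in the example.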
  • 10. Our method: Outskewer. Our definition: outlier = extremal value which skews a distribution of values. Implication: removing these extremal values one by one should reduce the skewness of the distribution. Otherwise, there is no outlier as we define it.
  • 11. Outskewer: non-relevant cases. Cases where extremal values far from the mean are common, e.g. power-law distributions.
  • 12. Outskewer: p-stability. Is the signature p-stable? p: fraction of extremal values removed. p-stable ⇔ |γ| ≤ 0.5 − p, for each p′ from p to 0.5. [Figure: cumulative distribution against x, with markers t and T, and |skewness| |γ| against p.] Example: 0.16-stable but not 0.30-stable.
  • 14. Outskewer: p-stability. Is the signature p-stable? If yes: there may be outliers. If no for all p: the skewness coefficient is always too large, thus no outlier as we define it can lie in the sample.
  • 15. Outskewer: outlier detection. [Figure: cumulative frequency against x and |skewness| |γ| against p, with markers t, T, t′ and T′ delimiting the area of outliers, the area of potential outliers and the area with no outlier; values are marked as not outlier, potential outlier or outlier.] t: smallest t-stable value; t: smallest value so that |γ| ≤ 0.5 − t. T: largest T-stable value; T: smallest value so that |γ| ≤ 0.5 − T. Example: 50 values, including 7 outliers and 5 potential outliers.
  • 16. Outskewer: outcome. Each value of the sample is classified as follows: not outlier, potential outlier, outlier, or unknown when the method is not applicable (skewness signature never p-stable).
  • 17. Extension to time series. On a sliding window of size w, each value of X is classified w times. The final class of a value is the one that appears the most.
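The sliding-window vote can be sketched in Python as follows. This is only an illustration of the voting scheme: `toy_classifier` is a hypothetical stand-in, not the Outskewer classifier, and the names `classify_time_series` and `classify_sample` are assumptions. Tie-breaking and the handling of series shorter than w are choices made here, not taken from the slides.

```python
from collections import Counter
from statistics import median


def toy_classifier(window):
    """Hypothetical stand-in for the per-sample classifier (NOT Outskewer):
    flags values more than 3 units away from the window median."""
    m = median(window)
    return ["outlier" if abs(x - m) > 3 else "not outlier" for x in window]


def classify_time_series(series, w, classify_sample=toy_classifier):
    """Classify each value in every window of size w containing it,
    then keep the class that appears the most."""
    votes = [Counter() for _ in series]
    for start in range(len(series) - w + 1):
        labels = classify_sample(series[start:start + w])
        for offset, label in enumerate(labels):
            votes[start + offset][label] += 1
    # Interior values collect w votes; values near either end collect fewer.
    # Ties go to the class seen first (an arbitrary choice made here).
    # With no complete window, the value is reported as "unknown".
    return [v.most_common(1)[0][0] if v else "unknown" for v in votes]


series = [0, 1, 0, 1, 10, 0, 1, 0, 1, 0]
print(classify_time_series(series, w=5))
```

In this toy series only the value 10 is flagged, because it is classified as an outlier in all five windows that contain it.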
  • 18. Experimental Validation False positive rate. Regime change.
  • 19. False positive rate. • Normal distribution: 3% for n = 10, 0.01% for n = 100. • Pareto distribution: 5% for n = 100, 0.01% for n = 1000.
  • 20. Regime change. [Video: six panels of x against t (0 to 200), with values classified as not outlier, potential outlier, outlier or unknown.]
  • 21. Experimental Results French population during the 20th century. Logs of a P2P search engine.
  • 22. French population during the 20th century. [Figure, top: number of inhabitants per year, population (40M to 60M) against year (1900 to 2000). Bottom: difference over years, Δpopulation against year, with each value's status shown as not outlier, potential outlier or outlier.]
  • 23. Harry Potter on eDonkey. [Figure: number of outliers and potential outliers per day, 15 Jul to 1 Dec, annotated with "in theatre", "pirate release" and "unknown event".] Data: • search logs on the eDonkey P2P network. • number of queries containing “half blood prince” per hour, computed every 10 minutes. • over 28 weeks. • more than 205 million queries. • from 24.4 million IP addresses.
  • 24. Contributions. Our method: • is non-parametric, except for the size of the time window. • classifies values only when the statistical conditions are met. • generalizes naturally to online analysis.
  • 25. Conclusion. • Motivation: outlier detection with no hypothesis on data. • Method based on the skewness of distributions. • Excellent experimental results. • Relevant on various data sets. • Open source code in R at http://outskewer.sebastien.pro
  • 26. Questions? Outskewer: Using Skewness to Spot Outliers in Samples and Time Series <sebastien.heymann@lip6.fr>
  • 27. Homogeneous / heterogeneous data. Outlier = unexpected extremal value? Extremal values far from the mean? • heterogeneous (Pareto, Zipf...): common • homogeneous (normal, Laplace...): uncommon. [Figure: probability density function of normal and Pareto laws, density on a logarithmic scale against x.]
  • 28. Skewness signature. [Figure: s(p) against p for a normal sample and a Pareto sample, with curves for the median, min, max, q1 and q3.]
  • 29. Local view of the internet topology. [Figure: number of nodes (11000 to 13000) against number of rounds (0 to 5000), with values classified as outlier, potential outlier, not outlier or unknown.] M. Latapy, C. Magnien and F. Ouédraogo, A Radar for the Internet, in Complex Systems, 20 (1), 23-30, 2011.