Outskewer: Using Skewness to Spot Outliers in Samples and Time Series
1. cnrs - upmc laboratoire d’informatique de paris 6
Sébastien Heymann, Matthieu Latapy, Clémence Magnien
ASONAM 2012
2. Did you know?
Outlier detection is an important problem in data mining:
source: https://xkcd.com/539/
3. cnrs - upmc laboratoire d’informatique de paris 6
How to detect outliers?
• No formal definition: it is a subjective concept.
• Depends on the case and on the hypotheses made about the data.
• Intuitively: identify values which deviate remarkably from
the remainder of values (Grubbs, 1969).
4. cnrs - upmc laboratoire d’informatique de paris 6
Usual approaches in literature
Hypothesis: data ∼ normal distribution.
Distance data points / theoretical values.
5. cnrs - upmc laboratoire d’informatique de paris 6
Problem statement
Most of the time, we can’t make strong assumptions about:
• the theoretical distribution of values.
• how the data should evolve over time (time series).
Thus we want a method which makes no hypothesis about the data.
7. cnrs - upmc laboratoire d’informatique de paris 6
Skewness coefficient
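$$\gamma = \frac{n}{(n-1)(n-2)} \sum_{x \in X} \left( \frac{x - \mathrm{mean}}{\mathrm{standard\ deviation}} \right)^{3}$$
[Figure: example of skewed distributions, two density curves over x, one with γ < 0 and one with γ > 0.]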
The skewness coefficient is sensitive to extremal values (min/max) far from the mean!
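A minimal Python sketch of the coefficient may help make this sensitivity concrete. The function name `skewness` is hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption; it is chosen because it reproduces the γ values shown in the skewness signature example of this deck. The sample is the one from that example.

```python
from math import sqrt

def skewness(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / sd)^3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 values")
    mean = sum(values) / n
    # Assumption: "standard deviation" means the sample one (n - 1 denominator).
    sd = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if sd == 0:
        raise ValueError("skewness is undefined for constant values")
    return n / ((n - 1) * (n - 2)) * sum((x - mean) ** 3 for x in values) / sd ** 3

sample = [-3, -2, -1, -1, 0, 1, 2, 3, 7]
with_max = skewness(sample)          # the value 7 lies far from the mean
without_max = skewness(sample[:-1])  # same sample once 7 is removed
print(round(with_max, 2), round(without_max, 2))  # 1.09 0.22
```

Removing the single value 7 brings the coefficient from about 1.09 down to about 0.22.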
9. cnrs - upmc laboratoire d’informatique de paris 6
Skewness signature
Definition
Evolution of skewness coefficient γ when extremal values are
removed one by one from the sample.
Algorithm
If γ > 0 then remove max(X),
else remove min(X).
Example
X = {-3, -2, -1, -1, 0, 1, 2, 3, 7}
γ: 1.09, 0.22, 0.17, 0, 0.4, 0, 1.73
[Figure: skewness versus the number of extremal values removed.]
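The removal rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' R code: the function names are hypothetical, the sample standard deviation is an assumption, and stopping once fewer than three values (or only equal values) remain is a choice made here because the coefficient is undefined in those cases.

```python
from math import sqrt

def skewness(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / sd)^3)."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sample standard deviation (n - 1 denominator).
    sd = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum((x - mean) ** 3 for x in values) / sd ** 3

def skewness_signature(values):
    """Skewness after 0, 1, 2, ... extremal values have been removed."""
    remaining = sorted(values)
    signature = []
    # Stop when the coefficient is undefined (fewer than 3 values, or all equal).
    while len(remaining) >= 3 and remaining[0] != remaining[-1]:
        gamma = skewness(remaining)
        signature.append(gamma)
        if gamma > 0:
            remaining.pop()    # remove max(X)
        else:
            remaining.pop(0)   # remove min(X)
    return signature

X = [-3, -2, -1, -1, 0, 1, 2, 3, 7]
signature = [round(g, 2) for g in skewness_signature(X)]
print(signature)  # [1.09, 0.22, 0.17, 0.0, 0.4, 0.0, 1.73]
```

On the example sample this reproduces the seven γ values listed on the slide.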
10. cnrs - upmc laboratoire d’informatique de paris 6
Our method: Outskewer
Our definition
Outlier = extremal value which skews a distribution of values.
Implication
The removal of these extremal values one by one should reduce
the skewness of the distribution.
Implication
Otherwise, there is no outlier as we define it.
11. cnrs - upmc laboratoire d’informatique de paris 6
Outskewer: non-relevant cases
Cases where extremal values far from the mean are common,
e.g. power-law distributions.
12. cnrs - upmc laboratoire d’informatique de paris 6
Outskewer: p-stability
Is the signature p-stable?
p: fraction of extremal values removed.
p-stable ⇔ |γ(p′)| ≤ 0.5 − p, for each p′ from p to 0.5
[Figure: left, cumulative distribution of x; right, |skewness| versus p, with p = 0.14, 0.16, 0.30 and 0.5 marked on the p axis.]
Example: 0.16-stable but not 0.30-stable
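The p-stability test can be sketched as follows. The sketch reads the condition as "|γ(p′)| ≤ 0.5 − p for every p′ between p and 0.5" and assumes that p only takes the values k/n, where k is the number of values removed and n the sample size. Both the function name `stable_steps` and the first signature are hypothetical; the second signature is the rounded one from the skewness signature example.

```python
def stable_steps(signature, n):
    """Return every k (values removed) for which the signature is (k/n)-stable.

    signature[k] is the skewness after k removals; n is the sample size.
    Assumption: p and p' are restricted to the grid k/n.
    """
    half = min(n // 2, len(signature) - 1)  # p' only ranges up to 0.5
    return [
        k for k in range(half + 1)
        if all(abs(signature[j]) <= 0.5 - k / n for j in range(k, half + 1))
    ]

# Hypothetical signature of a sample of 10 values, for illustration only.
hypothetical = [1.2, 0.6, 0.25, 0.1, 0.05, 0.02]
hypothetical_steps = stable_steps(hypothetical, n=10)
print(hypothetical_steps)  # [2, 3, 4], i.e. p-stable for p = 0.2, 0.3, 0.4

# Rounded signature of X = {-3, -2, -1, -1, 0, 1, 2, 3, 7} (n = 9).
slide_example = [1.09, 0.22, 0.17, 0, 0.4, 0, 1.73]
slide_steps = stable_steps(slide_example, n=9)
print(slide_steps)  # []
```

Under these assumptions the nine-value sample is never p-stable, which suggests that very small samples tend to fall outside the method's scope.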
If yes: there may be outliers.
If no for all p: the skewness coefficient is always too large, thus no
outlier as we define it can lie in the sample.
15. cnrs - upmc laboratoire d’informatique de paris 6
Outskewer: outlier detection
[Figure: left, cumulative frequency of x, each value marked as not outlier, potential outlier, or outlier; right, |skewness| versus p, with t, t′, T and T′ marked and the plot split into the area of outliers, the area of potential outliers, and the area with no outlier.]
t: smallest p for which the signature is p-stable; t′: smallest value so that |γ(t′)| ≤ 0.5 − t
T: largest p for which the signature is p-stable; T′: smallest value so that |γ(T′)| ≤ 0.5 − T
Example: 50 values, including 7 outliers and 5 potential outliers
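The whole classification can be sketched end to end in Python. This is a sketch under stated assumptions, not the authors' R implementation: p is restricted to the grid k/n, values removed before t′ are labelled outliers, values removed between t′ and T′ are labelled potential outliers, and all names (`outskewer`, `first_below`) are hypothetical.

```python
from math import sqrt

def skewness(values):
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))  # sample sd (assumption)
    return n / ((n - 1) * (n - 2)) * sum((x - mean) ** 3 for x in values) / sd ** 3

def outskewer(values):
    """Return one label per value: outlier, potential outlier, not outlier, unknown."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices sorted by value
    lo, hi = 0, n - 1
    signature, removed = [], []  # skewness after k removals, index removed at step k
    while hi - lo + 1 >= 3 and values[order[lo]] != values[order[hi]]:
        gamma = skewness([values[i] for i in order[lo:hi + 1]])
        signature.append(gamma)
        if gamma > 0:
            removed.append(order[hi])  # remove the current maximum
            hi -= 1
        else:
            removed.append(order[lo])  # remove the current minimum
            lo += 1
    half = min(n // 2, len(signature) - 1)
    stable = [k for k in range(half + 1)
              if all(abs(signature[j]) <= 0.5 - k / n for j in range(k, half + 1))]
    if not stable:
        return ["unknown"] * n  # signature never p-stable
    t, T = stable[0] / n, stable[-1] / n

    def first_below(bound):
        # Smallest number of removals after which |skewness| <= bound.
        return next(k for k, g in enumerate(signature) if abs(g) <= bound)

    t_prime, T_prime = first_below(0.5 - t), first_below(0.5 - T)
    labels = ["not outlier"] * n
    for k in range(T_prime):
        labels[removed[k]] = "outlier" if k < t_prime else "potential outlier"
    return labels

data = list(range(1, 20)) + [100]  # evenly spaced values plus one extreme value
labels = outskewer(data)
print(labels[-1], labels.count("not outlier"))  # outlier 19
```

The test sample is deliberately simple: once 100 is removed the remaining values are evenly spaced, so the skewness is exactly zero from then on and only 100 is flagged.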
16. cnrs - upmc laboratoire d’informatique de paris 6
Outskewer: outcome
Each value of the sample is classified as one of:
• not outlier
• potential outlier
• outlier
• unknown, when the method is not applicable (skewness
signature never p-stable).
17. cnrs - upmc laboratoire d’informatique de paris 6
Extension to time series
On a sliding window of size w, each value of X is classified w
times.
The final class of a value is the one that appears most often.
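The sliding-window vote can be sketched independently of the per-window classifier. In the sketch below, `classify_series` and `toy_classifier` are hypothetical names; the toy classifier is only a stand-in (a fixed distance from the window median), not Outskewer itself. Values near the ends of the series belong to fewer than w windows, and ties are broken by first label seen; both are choices made here, not statements from the talk.

```python
from collections import Counter
from statistics import median

def classify_series(series, w, classify_window):
    """Classify each value in every window containing it, then take a majority vote."""
    votes = [Counter() for _ in series]
    for start in range(len(series) - w + 1):
        window = series[start:start + w]
        for offset, label in enumerate(classify_window(window)):
            votes[start + offset][label] += 1
    # A value that belongs to no window (series shorter than w) stays unknown.
    return [v.most_common(1)[0][0] if v else "unknown" for v in votes]

def toy_classifier(window):
    """Stand-in classifier: flags values more than 10 away from the window median."""
    m = median(window)
    return ["outlier" if abs(x - m) > 10 else "not outlier" for x in window]

series = [1, 2, 3, 50, 4, 5, 6, 7]
labels = classify_series(series, w=4, classify_window=toy_classifier)
print(labels[3], labels.count("outlier"))  # outlier 1
```

Any function that maps a window to one label per value can be passed as `classify_window`, so a per-sample classifier can be plugged in unchanged.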
19. cnrs - upmc laboratoire d’informatique de paris 6
False positive rate
• Normal distribution: 3% for n = 10, 0.01% for n = 100
• Pareto distribution: 5% for n = 100, 0.01% for n = 1000
21. Experimental Results
French population during the 20th century.
Logs of a P2P search engine.
22. cnrs - upmc laboratoire d’informatique de paris 6
French population
during the 20th century
[Figure: number of inhabitants per year (population, about 40M to 60M, versus year, 1900 to 2000).]
[Figure: difference over years (∆population versus year, 1900 to 2000), each value marked as not outlier, potential outlier, or outlier.]
23. cnrs - upmc laboratoire d’informatique de paris 6
Harry Potter on eDonkey
[Figure: number of outliers and potential outliers per day, 15 Jul to 1 Dec, with annotated events: in theatre, unknown event, pirate release.]
Data:
• search logs on the P2P network eDonkey.
• number of queries containing “half blood prince” per hour, computed
every 10 minutes.
• over 28 weeks.
• over 205 million queries.
• for 24.4 million IP addresses.
24. cnrs - upmc laboratoire d’informatique de paris 6
Contributions
Our method:
• is non-parametric, except for the size of the time window.
• classifies values only when the statistical conditions are met.
• is naturally generalized to on-line analysis.
25. cnrs - upmc laboratoire d’informatique de paris 6
Conclusion
• Motivation: outlier detection with no hypothesis on data.
• Method based on the skewness of distributions.
• Excellent experimental results.
• Relevant on various data sets.
• Open source code in R on
http://outskewer.sebastien.pro
27. cnrs - upmc laboratoire d’informatique de paris 6
Homogeneous / heterogeneous data
Outlier = unexpected extremal value?
Extremal values far from the mean?
• heterogeneous (Pareto, Zipf...): common
• homogeneous (normal, Laplace...): uncommon
[Figure: probability density function of normal and Pareto laws (density, log scale, versus x).]
28. cnrs - upmc laboratoire d’informatique de paris 6
Skewness signature
[Figure: skewness signature s(p) versus p for the normal law and for the Pareto law, each showing the min, q1, median, q3 and max curves.]
29. cnrs - upmc laboratoire d’informatique de paris 6
Local view of the internet topology
[Figure: number of nodes versus number of rounds (0 to 5000), each value marked as outlier, potential outlier, not outlier, or unknown.]
M. Latapy, C. Magnien and F. Ouédraogo, A Radar for the Internet, in Complex Systems, 20 (1), 23-30, 2011.