This document discusses methods for clustering random walks. It introduces the GNPR (Generic Non-Parametric Representation) method for defining a distance between two random walks that separates dependence and distribution information. The GNPR method is shown to outperform standard approaches on synthetic datasets containing different clusters based on distribution and dependence. The GNPR method is also used to cluster credit default swaps, identifying a cluster of "Western sovereigns". The document concludes that GNPR is an effective way to deal with dependence and distribution information separately without losing information.
Clustering Financial Time Series using their Correlations and their Distributions
1. Introduction
How to define a distance between two random walks?
Applications
Conclusion
How to cluster random walks?
Paris Machine Learning #5 Season 2: Time Series and FinTech
Philippe Donnat (1), Gautier Marti (1,2),
Frank Nielsen (2), Philippe Very (1)
(1) Hellebore Capital Management
(2) Ecole Polytechnique
th January
Philippe Donnat, Gautier Marti, Frank Nielsen, Philippe Very How to cluster random walks?
1 Introduction
Data Science for the CDS market
How to group random walks?
What is a clustering program?
2 How to define a distance between two random walks?
Standard approach on time series
Comovements and distributions
GNPR: the best of both worlds
3 Applications
Results on synthetic datasets
Clustering Credit Default Swaps
4 Conclusion
Hellebore Capital Management & Data Science
Current R&D projects in Data Science:
Data mining: parsing & natural language processing
Inference: incomplete data sources
Portfolio & Risk analysis: understanding joint behaviours
Do you see clusters?
Random walks
French banks and building materials
CDS over 2006-2010
Do you see clusters?
Random walks
French banks and building materials
CDS over 2006-2015
What is a clustering program?
Definition
Clustering is the task of grouping a set of objects in such a way
that objects in the same group (cluster) are more similar to each
other than those in different groups.
Definition
We aim at finding K groups by positioning K group centers
$\{c_1, \ldots, c_K\}$ such that data points $\{x_1, \ldots, x_n\}$ minimize
\[ \min_{c_1, \ldots, c_K} \sum_{i=1}^{n} \min_{j=1}^{K} d(x_i, c_j)^2 \]
But, what is the distance d between two random walks?
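As a minimal sketch of the objective above, the cost of a candidate set of centers can be evaluated for any distance d passed in as a function. The names `clustering_cost` and `euclidean` are hypothetical, and the Euclidean distance is only a placeholder, since the choice of d is exactly the open question here.

```python
from typing import Callable, Sequence

Point = Sequence[float]


def euclidean(x: Point, c: Point) -> float:
    """Placeholder distance (an assumption, not the distance proposed later)."""
    return sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5


def clustering_cost(points: Sequence[Point], centers: Sequence[Point],
                    d: Callable[[Point, Point], float] = euclidean) -> float:
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(d(x, c) for c in centers) ** 2 for x in points)


# Two obvious groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good_centers = [(0.0, 0.5), (10.0, 10.5)]   # one center per group
bad_centers = [(5.0, 5.0), (5.0, 6.0)]      # both centers between the groups

cost_good = clustering_cost(points, good_centers)
cost_bad = clustering_cost(points, bad_centers)
```

A clustering program searches for the centers with the lowest cost; here the centers placed inside each group give a much smaller value than the centers placed between the groups.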
Naive distance between two random walks
random walks $Y, Y'$ $\xrightarrow{\;d\,\cdot\;}$ increments $X, X'$ (covariance, scatterplot)
$X = (y_2 - y_1, \ldots, y_T - y_{T-1})$
$X, X'$ are points in $\mathbb{R}^{T-1}$: $\|X - X'\|^2 = \sum_{i=1}^{T-1} (X_i - X'_i)^2$
apply normalizations: e.g. (X − µ)/σ, (X − min)/(max − min)
capture rather well comovements
drawbacks: not robust to outliers, blind to signal shape
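The naive approach can be sketched in a few lines, assuming each random walk is given as a plain list of levels. The helper names (`increments`, `zscore`, `minmax`, `l2_distance`) and the toy data are illustrative assumptions, not taken from the slides.

```python
import math
from statistics import mean, pstdev


def increments(walk):
    """X = (y_2 - y_1, ..., y_T - y_{T-1})."""
    return [b - a for a, b in zip(walk, walk[1:])]


def zscore(x):
    """(X - mu) / sigma normalization."""
    mu, sigma = mean(x), pstdev(x)
    return [(v - mu) / sigma for v in x]


def minmax(x):
    """(X - min) / (max - min) normalization."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]


def l2_distance(x, x_prime):
    """Euclidean norm ||X - X'|| between two increment vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_prime)))


walk_a = [100.0, 101.0, 103.0, 102.0, 105.0]
walk_b = [50.0, 52.0, 56.0, 54.0, 60.0]   # same moves as walk_a, twice as large

xa, xb = increments(walk_a), increments(walk_b)
raw = l2_distance(xa, xb)
normalized = l2_distance(zscore(xa), zscore(xb))
```

In this toy case the two walks move together perfectly, so the z-scored distance vanishes while the raw distance does not. A single extreme increment would shift the mean, standard deviation, minimum, or maximum and distort either normalization, which is the robustness drawback listed above.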
Our approach: split comovements and distributions
GNPR: A suitable representation
Definition
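GNPR (Generic Non-Parametric Representation) projection:
\[ \mathcal{T} : \mathcal{V}^N \to \mathcal{U}^N \times \mathcal{G}^N, \qquad X \mapsto (G_X(X), G_X) \tag{1} \]
$G_X : x \mapsto P[X \le x]$ is the cumulative distribution function
$G_X(X) \sim \mathcal{U}[0, 1]$
\[ \frac{1}{T}\,\mathrm{rank}(X_t) = \frac{1}{T} \sum_{k \le T} \mathbf{1}\{X_k \le X_t\} \xrightarrow[T \to \infty]{} P[X \le X_t] = G_X(X_t) \]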
Property
T is a bijection.
N.B. It replicates Sklar’s theorem, the seminal result of Copula Theory.
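A finite-sample sketch of this projection, assuming a single variable observed T times without ties: the normalized ranks play the role of G_X(X) and the sorted values stand in for the empirical marginal G_X. The function names are hypothetical, and the quadratic-time rank computation is chosen to mirror the indicator sum above rather than for efficiency.

```python
def empirical_cdf_values(x):
    """Return (1/T) * #{k : X_k <= X_t} for each observation X_t."""
    t_len = len(x)
    return [sum(1 for xk in x if xk <= xt) / t_len for xt in x]


def gnpr(x):
    """Split a sample into (normalized ranks, sorted values).

    The sorted values are a stand-in for the empirical marginal G_X.
    """
    return empirical_cdf_values(x), sorted(x)


def inverse_gnpr(u, sorted_values):
    """Rebuild the sample from its normalized ranks and its sorted values."""
    t_len = len(sorted_values)
    return [sorted_values[round(ut * t_len) - 1] for ut in u]


sample = [0.3, -1.2, 2.5, 0.0, 0.9]
u, g = gnpr(sample)
rebuilt = inverse_gnpr(u, g)
```

Rebuilding the sample from the two parts illustrates, for this toy case, why no information is lost by the split.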
A distance dθ leveraging GNPR
Definition
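Let $(X, Y) \in \mathcal{V}^2$. Let $G_X, G_Y$ be vectors of marginal cdfs. Let
$\theta \in [0, 1]$. We define the following distance
\[ d_\theta^2(X, Y) = \theta\, d_1^2(G_X(X), G_Y(Y)) + (1 - \theta)\, d_0^2(G_X, G_Y), \tag{2} \]
where
\[ d_1^2(G_X(X), G_Y(Y)) = 3\, E\big[\,|G_X(X) - G_Y(Y)|^2\,\big], \tag{3} \]
and
\[ d_0^2(G_X, G_Y) = \frac{1}{2} \int_{\mathbb{R}} \left( \sqrt{\frac{dG_X}{d\lambda}} - \sqrt{\frac{dG_Y}{d\lambda}} \right)^2 d\lambda. \tag{4} \]
$d_0$ is the Hellinger distance;
$d_1^2 = (1 - \rho_S)/2$, with $\rho_S$ the Spearman rank correlation between X and Y (since $G_X(X)$ and $G_Y(Y)$ are uniform with variance 1/12, $3E[|G_X(X) - G_Y(Y)|^2] = 3(1/6 - \rho_S/6)$).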
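An empirical version of this distance can be sketched as follows. Two choices here are assumptions rather than anything stated in the slides: the densities in (4) are estimated with a shared equal-width histogram (the number of `bins` is arbitrary), and the samples are assumed to have no ties. All function names are hypothetical.

```python
import math
import random


def normalized_ranks(x):
    """rank(X_t) / T for each observation (no ties assumed)."""
    t_len = len(x)
    order = sorted(range(t_len), key=lambda i: x[i])
    ranks = [0.0] * t_len
    for r, i in enumerate(order, start=1):
        ranks[i] = r / t_len
    return ranks


def d1_squared(x, y):
    """Empirical 3 * E[|G_X(X) - G_Y(Y)|^2], computed from normalized ranks."""
    u, v = normalized_ranks(x), normalized_ranks(y)
    return 3.0 * sum((a - b) ** 2 for a, b in zip(u, v)) / len(x)


def d0_squared(x, y, bins=20):
    """Squared Hellinger distance between histogram estimates (assumption)."""
    lo, hi = min(min(x), min(y)), max(max(x), max(y))
    width = (hi - lo) / bins or 1.0   # guard against a zero-width range

    def hist(sample):
        counts = [0] * bins
        for v in sample:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(sample) for c in counts]

    p, q = hist(x), hist(y)
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def d_theta(x, y, theta=0.5, bins=20):
    """Square root of theta * d1^2 + (1 - theta) * d0^2."""
    return math.sqrt(theta * d1_squared(x, y)
                     + (1.0 - theta) * d0_squared(x, y, bins))


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
same_order = [2.0 * v + 1.0 for v in x]   # rank correlation +1
reversed_order = [-v for v in x]          # rank correlation -1

d_same = d1_squared(x, same_order)
d_rev = d1_squared(x, reversed_order)
```

With rank correlation +1 the dependence term is zero, and with rank correlation -1 it is close to one, in line with the relation between d_1 and the rank correlation. The weight `theta` then trades off the dependence term against the distribution term.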
GNPR θ = 1: Increase of correlation
[Figure: Distribution of Correlations. Density (y-axis) against correlation (x-axis), comparing Pearson correlation and Spearman correlation.]
On average, 10% more correlation using GNPR θ = 1 (rank statistics) instead of standard correlation.
GNPR θ = 0: Find distribution peculiarities
Parametric modelling: which distribution? Real-life CDS variations:
[Figure: Nokia 5Y CDS. Panels: "Distribution Nokia" (log-density of the increments) and CDS spread over time, Jan 2006 to Jul 2014.]
[Figure: Telecom Italia 5Y CDS. Panels: "Distribution Telecom Italia" (log-density of the increments) and CDS spread over time, Jan 2006 to Mar 2014.]
Description of the testing datasets
We define some interesting test case datasets to study:
distribution clustering (dataset A),
dependence clustering (dataset B),
a mix of both (dataset C).
Results: GNPR works!
Adjusted Rand Index (ARI) on datasets A, B and C

Representation           Algorithm   A      B      C
X                        Ward        0      0.94   0.42
(X − µX)/σX              Ward        0      0.94   0.42
(X − min)/(max − min)    Ward        0      0.48   0.45
GNPR θ = 0               Ward        1      0      0.47
GNPR θ = 1               Ward        0      0.91   0.72
GNPR θ                   Ward        1      0.92   1
X                        k-means++   0      0.90   0.44
(X − µX)/σX              k-means++   0      0.91   0.45
(X − min)/(max − min)    k-means++   0.11   0.55   0.47
GNPR θ = 0               k-means++   1      0      0.53
GNPR θ = 1               k-means++   0.06   0.99   0.80
GNPR θ                   k-means++   1      0.99   1
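For readers unfamiliar with the score in this table, here is a minimal sketch of the Adjusted Rand Index using its standard pair-counting formula. The function name is hypothetical and the toy labels are not the datasets A, B, or C.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting agreement between two partitions, corrected for chance."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))   # contingency table cells
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:   # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


truth = [0, 0, 1, 1]
ari_relabelled = adjusted_rand_index(truth, [1, 1, 0, 0])   # same partition
ari_crossed = adjusted_rand_index(truth, [0, 1, 0, 1])      # every pair split
```

A score of 1 means the recovered partition matches the ground truth up to a relabelling of the clusters, a score near 0 corresponds to chance-level agreement, and negative values are possible for partitions that agree less than chance.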
HCMapper: Compare Hierarchical Clustering
“Western sovereigns” cluster
Conclusion: Take Home Message
GNPR is a way to deal separately with
dependence information,
distribution information,
without losing any.
Avenues for research:
better aggregation: generalized means?
consistency proof?
any idea of interesting random walks outside finance?
Check out www.datagrapple.com
to follow our R&D stuff and news about the CDS market!
Internships
If interested, please contact
gautier.marti@helleborecapital.com
philippe.very@helleborecapital.com
for further details and application.
References I
[BDVL08] Shai Ben-David and Ulrike Von Luxburg, Relating clustering stability to properties of cluster boundaries, COLT, vol. 2008, 2008, pp. 379–390.
[BDVLP06] Shai Ben-David, Ulrike Von Luxburg, and Dávid Pál, A sober look at clustering stability, Learning theory, Springer, 2006, pp. 5–19.
[BHEG01] Asa Ben-Hur, Andre Elisseeff, and Isabelle Guyon, A stability based method for discovering structure in clustered data, Pacific symposium on biocomputing, vol. 7, 2001, pp. 6–17.
[LRBB04] Tilman Lange, Volker Roth, Mikio L Braun, and Joachim M Buhmann, Stability-based validation of clustering solutions, Neural computation 16 (2004), no. 6, 1299–1323.
How to find θ ?
Clustering stability using perturbations due to:
bootstrap
time-sliding window
draws from an oracle
stability pros [BHEG01, LRBB04]
stability cons [BDVLP06, BDVL08]
[Figure: Accuracy landscape. Accuracy as a function of theta and T.]
[Figure: Accuracy and stability of clustering using GNPR. ARI (y-axis) against θ (x-axis); curves: accuracy, bootstrap stability, time stability, cross-datasets stability.]
Properties of dθ
Property
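$\forall \theta \in [0, 1],\ 0 \le d_\theta \le 1$
Property
For $0 < \theta < 1$, $d_\theta$ is a metric.
For $\theta \in \{0, 1\}$, $d_\theta$ is not a metric, since distinct variables can be at distance zero:
$U \sim \mathcal{U}[0, 1]$, $U \ne 1 - U$, yet $d_0(U, 1 - U) = 0$
$V \ne 2V$, yet $d_1(V, 2V) = 0$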
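The two degenerate cases can be checked numerically on simulated samples. This is only a sketch under stated assumptions: the sample size, the seed, and the 10-bin histogram used to approximate the Hellinger distance are arbitrary choices, so the distribution distance comes out small rather than exactly zero.

```python
import math
import random

rng = random.Random(42)
n = 10_000

# Case theta = 1: V and 2V are different variables with identical ranks,
# so a purely rank-based distance cannot tell them apart.
v = [rng.gauss(0.0, 1.0) for _ in range(n)]
two_v = [2.0 * a for a in v]


def rank_order(x):
    """Indices of the observations sorted by value."""
    return sorted(range(len(x)), key=lambda i: x[i])


same_ranks = rank_order(v) == rank_order(two_v)

# Case theta = 0: U and 1 - U are different variables with the same
# uniform distribution, so a purely distributional distance is (near) zero.
u = [rng.random() for _ in range(n)]
one_minus_u = [1.0 - a for a in u]


def histogram(x, bins=10):
    """Bin frequencies on [0, 1]."""
    counts = [0] * bins
    for a in x:
        counts[min(int(a * bins), bins - 1)] += 1
    return [c / len(x) for c in counts]


p, q = histogram(u), histogram(one_minus_u)
hellinger = math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                                for a, b in zip(p, q)))
```

Mixing both terms with a weight strictly between 0 and 1 removes these blind spots: V and 2V differ in distribution, and U and 1 - U differ in dependence.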
Description of the testing datasets
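For $1 \le i \le N = pK \prod_{s=1}^{S} K_s$, we define
\[ X_i = \sum_{s=1}^{S} \sum_{k=1}^{K_s} \beta^s_{k,i}\, Y^s_k + \sum_{k=1}^{K} \alpha_{k,i}\, Z^i_k, \tag{5} \]
where
a) $\alpha_{k,i} = 1$ if $i \equiv k - 1 \pmod{K}$, 0 otherwise;
b) $\beta^s_k \in [0, 1]$;
c) $\beta^s_{k,i} = \beta^s_k$ if $\lceil i K_s / N \rceil = k$, 0 otherwise.
$(X_i)_{i=1}^{N}$ are partitioned into $Q = K \prod_{s=1}^{S} K_s$ clusters of p random
variables each.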
OTC data flow processing
Conjecture: consistency of clustering random walks
[Figure: Clustering convergence to the ground-truth partition. ARI as a function of T and N, and ARI against T for three settings: clustering distribution (θ = 0), clustering dependence (θ = 1), clustering total information (θ = θ*).]