Talk given at CMStatistics 2016 (http://cmstatistics.org/CMStatistics2016/).
The standard methodology for clustering financial time series is brittle to outliers and heavy tails for several reasons: Single Linkage / MST suffers from the chaining phenomenon, and the Pearson correlation coefficient is suited to Gaussian distributions, which financial returns usually are not (especially for credit derivatives). At Hellebore Capital Ltd, we strive to improve the methodology and to put it on firmer ground. We think that stability is a paramount property to verify, and that it is closely linked to the statistical convergence rates of the methodologies (combinations of clustering algorithms and dependence estimators). This gives us a model selection criterion: the best clustering methodology is the one that reaches a given 'accuracy' with the minimum sample size.
Clustering CDS: algorithms, distances, stability and convergence rates
1. HELLEBORECAPITAL
Introduction
The standard methodology
Exploring dependence between returns
Copula-based dependence coefficients (clustering distances)
Empirical convergence rates
Beyond dependence: a (copula,margins) representation
Clustering CDS: algorithms, distances,
stability and convergence rates
CMStatistics 2016, University of Seville, Spain
Gautier Marti, Frank Nielsen, Philippe Donnat
HELLEBORECAPITAL
December 9, 2016
Gautier Marti Clustering CDS: algorithms, distances, stability and convergence r
1 Introduction
2 The standard methodology
3 Exploring dependence between returns
4 Copula-based dependence coefficients (clustering distances)
5 Empirical convergence rates
6 Beyond dependence: a (copula,margins) representation
Introduction
Goal: finding groups of "homogeneous" assets that can help to:
• build alternative measures of risk,
• elaborate trading strategies...
But we need high confidence in these clusters (networks).
So we need appropriate AND fast-converging methodologies [8]:
• to be consistent yet efficient (bias-variance tradeoff),
• to avoid non-stationarity of the time series (which arises when the sample is too large).
A good model selection criterion:
Minimum sample size to reach a given "accuracy".
The standard methodology - description
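The methodology widely adopted in empirical studies [7]:
Let $N$ be the number of assets.
Let $P_i(t)$ be the price at time $t$ of asset $i$, $1 \le i \le N$.
Let $r_i(t)$ be the log-return at time $t$ of asset $i$:
$$r_i(t) = \log P_i(t) - \log P_i(t - 1).$$
For each pair $i, j$ of assets, compute their correlation:
$$\rho_{ij} = \frac{\langle r_i r_j \rangle - \langle r_i \rangle \langle r_j \rangle}{\sqrt{\left(\langle r_i^2 \rangle - \langle r_i \rangle^2\right)\left(\langle r_j^2 \rangle - \langle r_j \rangle^2\right)}}.$$
Convert the correlation coefficients $\rho_{ij}$ into distances:
$$d_{ij} = \sqrt{2(1 - \rho_{ij})}.$$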
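The three steps above (log-returns, pairwise correlation, distance) can be sketched in a few lines of NumPy. This is only an illustration: the function name `correlation_distances` and the simulated prices are assumptions, and the distance uses the square-root form $\sqrt{2(1 - \rho_{ij})}$ of [7].

```python
import numpy as np

def correlation_distances(prices):
    """prices: array of shape (T + 1, N), one column per asset.

    Returns the (N, N) Pearson correlation matrix of the log-returns
    and the associated distance matrix sqrt(2 * (1 - rho)).
    """
    log_returns = np.diff(np.log(prices), axis=0)   # r_i(t), shape (T, N)
    rho = np.corrcoef(log_returns, rowvar=False)    # Pearson correlations
    rho = np.clip(rho, -1.0, 1.0)                   # guard against rounding
    dist = np.sqrt(2.0 * (1.0 - rho))               # d_ij
    return rho, dist

# Hypothetical data: assets 0 and 1 share a common factor, asset 2 does not.
rng = np.random.default_rng(0)
common = rng.normal(size=(250, 1))
noise = rng.normal(size=(250, 3))
returns = 0.01 * np.hstack([common + 0.5 * noise[:, :1],
                            common + 0.5 * noise[:, 1:2],
                            noise[:, 2:]])
prices = 100.0 * np.exp(np.vstack([np.zeros((1, 3)), np.cumsum(returns, axis=0)]))

rho, dist = correlation_distances(prices)
print(np.round(dist, 2))
```

Strongly correlated assets end up close to each other (distance near 0), uncorrelated ones near $\sqrt{2}$, and anti-correlated ones near 2.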
From all the distances $d_{ij}$, compute a minimum spanning tree:
Figure: A minimum spanning tree of stocks (from [1]); stocks from the same industry (represented by color) tend to cluster together.
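As a minimal sketch of this step, a minimum spanning tree can be extracted from a distance matrix with SciPy's `minimum_spanning_tree`. The helper name `mst_edges` and the toy distance matrix are assumptions made for the example; they are not taken from the talk.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_edges(dist):
    """Return the N - 1 edges (i, j, d_ij) of a minimum spanning tree, with i < j."""
    # Only the strict upper triangle is passed. csgraph reads zero entries as
    # "no edge", so exactly-zero distances would be ignored in this sketch.
    tree = minimum_spanning_tree(np.triu(dist, k=1)).tocoo()
    edges = [(int(min(i, j)), int(max(i, j)), float(w))
             for i, j, w in zip(tree.row, tree.col, tree.data)]
    return sorted(edges, key=lambda e: e[2])

# Hypothetical distances: assets 0-1 and 2-3 form two tight pairs.
toy_dist = np.array([[0.0, 0.4, 1.2, 1.3],
                     [0.4, 0.0, 1.1, 1.4],
                     [1.2, 1.1, 0.0, 0.5],
                     [1.3, 1.4, 0.5, 0.0]])
toy_edges = mst_edges(toy_dist)
for i, j, w in toy_edges:
    print(f"{i} - {j}: {w:.1f}")
```

Sorting the edges by weight gives the order in which Single Linkage would merge the clusters, which is why the two approaches share the same weaknesses.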
The standard methodology - limitations
• MST clustering is equivalent to Single Linkage clustering:
  • chaining phenomenon
  • not stable to noise / small perturbations [11]
• Use of the Pearson correlation:
  • can take the value 0 even when the variables are strongly dependent
  • not invariant to monotone transformations of the variables
  • not robust to outliers
Is it still useful for financial time series? For stocks? For CDS?
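The three limitations of the Pearson correlation can be checked numerically. The following is a small illustrative experiment on simulated data (the variables, seed, and outlier value are assumptions chosen for the example), with Spearman's rank correlation shown for comparison.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)
x = rng.normal(size=10_000)

# 1. Near-zero correlation despite full dependence: y is a function of x.
y = x ** 2
pearson_dependent = pearsonr(x, y)[0]

# 2. Not invariant to monotone transformations (here: exp), unlike Spearman.
z = x + 0.5 * rng.normal(size=x.size)
pearson_raw = pearsonr(x, z)[0]
pearson_exp = pearsonr(np.exp(x), np.exp(z))[0]
spearman_raw = spearmanr(x, z)[0]
spearman_exp = spearmanr(np.exp(x), np.exp(z))[0]

# 3. Not robust to outliers: a single extreme point flips the sign.
x_out = np.append(x[:100], 50.0)
z_out = np.append(z[:100], -50.0)
pearson_out = pearsonr(x_out, z_out)[0]
spearman_out = spearmanr(x_out, z_out)[0]

print(f"Pearson(x, x^2)           = {pearson_dependent:+.3f}")
print(f"Pearson raw / after exp   = {pearson_raw:+.3f} / {pearson_exp:+.3f}")
print(f"Spearman raw / after exp  = {spearman_raw:+.3f} / {spearman_exp:+.3f}")
print(f"With one outlier: Pearson = {pearson_out:+.3f}, Spearman = {spearman_out:+.3f}")
```

Rank-based coefficients depend only on the copula of the pair, which motivates the copula-based view of dependence.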
Copulas
Sklar's Theorem [13]
For $(X_i, X_j)$ having continuous marginal cdfs $F_{X_i}, F_{X_j}$, its joint cumulative distribution $F$ is uniquely expressed as
$$F(X_i, X_j) = C(F_{X_i}(X_i), F_{X_j}(X_j)),$$
where $C$ is known as the copula of $(X_i, X_j)$.
The copula has uniform marginals and encodes all the dependence between $X_i$ and $X_j$.
From ranks to empirical copula
$r_i, r_j$ are the rank statistics of $X_i, X_j$ respectively, i.e. $r_i^t$ is the rank of $X_i^t$ in $\{X_i^1, \ldots, X_i^T\}$:
$$r_i^t = \sum_{k=1}^{T} \mathbf{1}\{X_i^k \le X_i^t\}.$$
Deheuvels' empirical copula [3]
Any copula $\hat{C}$ defined on the lattice $L = \{(\frac{t_i}{T}, \frac{t_j}{T}) : t_i, t_j = 0, \ldots, T\}$ by
$$\hat{C}\left(\frac{t_i}{T}, \frac{t_j}{T}\right) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\{r_i^t \le t_i, r_j^t \le t_j\}$$
is an empirical copula.
$\hat{C}$ is a consistent estimator of $C$ with uniform convergence [4].
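The definition above translates almost directly into code. The sketch below evaluates the empirical copula on the whole lattice. The function name `empirical_copula` and the test data are assumptions, and `rankdata(..., method="max")` is used because it counts the observations that are less than or equal to each value, as in the rank formula.

```python
import numpy as np
from scipy.stats import rankdata

def empirical_copula(x_i, x_j):
    """Empirical copula on the lattice {0, 1/T, ..., 1}^2.

    Returns a (T + 1, T + 1) array c with c[a, b] = C_hat(a / T, b / T).
    """
    T = len(x_i)
    r_i = rankdata(x_i, method="max").astype(int)   # r_i^t = #{k : X_i^k <= X_i^t}
    r_j = rankdata(x_j, method="max").astype(int)
    counts = np.zeros((T + 1, T + 1))
    np.add.at(counts, (r_i, r_j), 1.0)              # one unit of mass per observation
    # Double cumulative sum: number of observations with r_i <= a and r_j <= b.
    return counts.cumsum(axis=0).cumsum(axis=1) / T

rng = np.random.default_rng(1)
x = rng.normal(size=50)
c_como = empirical_copula(x, 2.0 * x + 1.0)          # perfectly dependent pair
c_indep = empirical_copula(x, rng.normal(size=50))   # independent pair
print(c_como[25, 25], c_indep[25, 25])
```

For the perfectly dependent pair the result is the upper Fréchet-Hoeffding bound $\min(u, v)$ on the lattice, whatever the marginals are.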
Clustering of bivariate empirical copulas
Generate the $\binom{N}{2}$ bivariate empirical copulas
Find clusters of copulas using optimal transport [10, 9]
Compute and display the clusters' centroids [2]
Some code available at www.datagrapple.com/Tech.
Copula-centers for stocks (CAC 40)
Figure: Stocks: More mass in the bottom-left corner, i.e. lower tail
dependence. Stock prices tend to plummet together.
Copula-centers for Credit Default Swaps (XO index)
Figure: Credit default swaps: More mass in the top-right corner, i.e. upper tail dependence. The cost of insurance against entities' default tends to soar in stressed markets.
Dependence as relative distances between copulas
$C$: copula of $(X_i, X_j)$;
$|u - v| / \sqrt{2}$: distance from $(u, v)$ to the diagonal.
Spearman's $\rho_S$:
$$\rho_S(X_i, X_j) = 12 \int_0^1 \int_0^1 (C(u, v) - uv) \, du \, dv = 1 - 6 \int_0^1 \int_0^1 (u - v)^2 \, dC(u, v)$$
Many correlation coefficients can be expressed as distances to the Fréchet-Hoeffding bounds or to independence [6]. Some are explicitly built this way (e.g. [12, 5, 9]).
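The second expression for Spearman's $\rho_S$ can be checked numerically by replacing $dC$ with the empirical copula, i.e. by averaging $(u - v)^2$ over the normalized ranks. This is an illustrative sketch on simulated data; the function name `spearman_from_copula` is an assumption.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def spearman_from_copula(x, y):
    """Plug-in estimate of 1 - 6 * E[(U - V)^2] using normalized ranks."""
    T = len(x)
    u = rankdata(x) / T
    v = rankdata(y) / T
    # (u - v)^2 is twice the squared distance from (u, v) to the diagonal.
    return 1.0 - 6.0 * np.mean((u - v) ** 2)

rng = np.random.default_rng(7)
x = rng.normal(size=2000)
y = x + rng.normal(size=2000)

rho_plugin = spearman_from_copula(x, y)
rho_scipy = spearmanr(x, y)[0]
print(f"plug-in: {rho_plugin:.4f}, scipy: {rho_scipy:.4f}")
```

Without ties, the plug-in value differs from the classical finite-sample formula only by a factor $(1 - 1/T^2)$ applied to $1 - \rho_S$, so the two agree closely for moderate $T$.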
A metric space for copulas: Optimal Transport
The Target/Forget Dependence Coefficient (TFDC)
Now, we can define our bespoke dependence coefficient:
Build the forget-dependence copulas $\{C^F_l\}_l$
Build the target-dependence copulas $\{C^T_k\}_k$
Compute the empirical copula $C_{ij}$ from $x_i, x_j$
$$\mathrm{TFDC}(C_{ij}) = \frac{\min_l D(C^F_l, C_{ij})}{\min_l D(C^F_l, C_{ij}) + \min_k D(C_{ij}, C^T_k)}$$
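A rough sketch of the coefficient follows, under several assumptions that are not stated on the slide: copulas are represented as clouds of $T$ equally weighted points (normalized ranks), the distance $D$ is an exact optimal-transport cost computed by solving an assignment problem with a Euclidean ground cost, the only forget-dependence copula is independence (a regular grid), and the only target-dependence copula is perfect positive dependence (the diagonal). All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from scipy.stats import rankdata

def ot_distance(a, b):
    """Optimal-transport cost between two clouds of equally many,
    equally weighted points (Euclidean ground cost)."""
    cost = cdist(a, b)
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].mean())

def copula_points(x, y):
    """Empirical copula as a cloud of normalized ranks."""
    T = len(x)
    return np.column_stack([rankdata(x), rankdata(y)]) / T

def tfdc(c_ij, forget, target):
    d_forget = min(ot_distance(c, c_ij) for c in forget)
    d_target = min(ot_distance(c_ij, c) for c in target)
    return d_forget / (d_forget + d_target)

m = 15
T = m * m
grid = (np.arange(1, m + 1) - 0.5) / m
independence = np.array([(u, v) for u in grid for v in grid])   # assumed forget copula
diag = np.arange(1, T + 1) / T
comonotone = np.column_stack([diag, diag])                      # assumed target copula

rng = np.random.default_rng(3)
x = rng.normal(size=T)
tfdc_dependent = tfdc(copula_points(x, np.exp(x)), [independence], [comonotone])
tfdc_independent = tfdc(copula_points(x, rng.normal(size=T)), [independence], [comonotone])
print(f"dependent: {tfdc_dependent:.2f}, independent: {tfdc_independent:.2f}")
```

The coefficient is 1 when the empirical copula coincides with a target and moves toward 0 as it approaches a forget-dependence copula. Other targets (for instance tail-dependence patterns) can be added to the lists without changing the function.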
Spearman vs. TFDC
[Plot: "Spearman & TFDC values as a function of a"; x-axis: discontinuity position a (0 to 1); y-axis: estimated positive dependence (0 to 1); curves: TFDC, Spearman]
Figure: Empirical copulas for $(X, Y)$ where
$X = Z \mathbf{1}\{Z < a\} + \epsilon_X \mathbf{1}\{Z > a\}$,
$Y = Z \mathbf{1}\{Z < a + 0.25\} + \epsilon_Y \mathbf{1}\{Z > a + 0.25\}$, $a = 0, 0.05, \ldots, 0.95, 1$,
and where $Z$ is uniform on $[0, 1]$ and $\epsilon_X, \epsilon_Y$ are independent noises (left). TFDC and Spearman coefficients estimated between $X$ and $Y$ as a function of $a$ (right).
For $a = 0.75$, the Spearman coefficient yields a negative value, yet $X = Y$ over $[0, a]$.
Process: Recovering a simulated ground-truth [8]
A simulation & benchmark process that needs to be refined:
Extract (using a large sample) a filtered correlation matrix R
Generate samples of size T = 10, . . . , 20, . . . from a relevant
distribution (parameterized by R)
Compute the ratio of the number of correct clusterings obtained over the number of trials, as a function of T
Figure (three panels): Empirical rates of convergence for Single Linkage, Average Linkage and Ward. x-axis: sample size (100 to 500); y-axis: score (0 to 1); curves in each panel: Gaussian - Pearson, Gaussian - Spearman, Student - Pearson, Student - Spearman.
A full comparative study will be posted online at www.datagrapple.com/Tech.
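The benchmark process can be sketched as follows, under simplifying assumptions: the "filtered" correlation matrix R is replaced by a hand-made block matrix, samples are Gaussian only, the dependence estimator is Pearson, and a clustering counts as correct only if it matches the ground-truth partition exactly. Function names and parameter values are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def block_correlation(n_clusters, size, within, between):
    """Block correlation matrix and its ground-truth cluster labels."""
    labels = np.repeat(np.arange(n_clusters), size)
    R = np.where(labels[:, None] == labels[None, :], within, between).astype(float)
    np.fill_diagonal(R, 1.0)
    return R, labels

def same_partition(a, b):
    # Two labelings describe the same partition iff their co-membership matrices match.
    return bool(np.array_equal(a[:, None] == a[None, :], b[:, None] == b[None, :]))

def recovery_score(R, truth, T, method, n_trials, rng):
    """Fraction of trials in which the true clusters are exactly recovered."""
    k = len(np.unique(truth))
    hits = 0
    for _ in range(n_trials):
        sample = rng.multivariate_normal(np.zeros(len(R)), R, size=T)
        rho = np.clip(np.corrcoef(sample, rowvar=False), -1.0, 1.0)
        dist = np.sqrt(2.0 * (1.0 - rho))
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method=method)
        found = fcluster(Z, t=k, criterion="maxclust")
        hits += same_partition(found, truth)
    return hits / n_trials

rng = np.random.default_rng(0)
R, truth = block_correlation(n_clusters=3, size=4, within=0.7, between=0.1)
scores = {(method, T): recovery_score(R, truth, T, method, n_trials=20, rng=rng)
          for method in ("single", "average") for T in (10, 50, 500)}
for key, value in scores.items():
    print(key, value)
```

Plotting the score against T for each (algorithm, estimator, distribution) combination gives empirical convergence curves of the kind shown in the figure. The smallest T at which the score reaches a chosen level is the model selection criterion proposed in the introduction.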
References
[1] Ricardo Coelho, Przemyslaw Repetowicz, Stefan Hutzler, and Peter Richmond. Investigation of Cluster Structure in the London Stock Exchange.
[2] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 685-693, 2014.
[3] Paul Deheuvels. La fonction de dépendance empirique et ses propriétés. Un test non paramétrique d'indépendance. Acad. Roy. Belg. Bull. Cl. Sci. (5), 65(6):274-292, 1979.
[4] Paul Deheuvels. A non-parametric test for independence. Publications de l'Institut de Statistique de l'Université de Paris, 26:29-50, 1981.
[5] Fabrizio Durante and Roberta Pappada. Cluster analysis of time series via Kendall distribution. In Strengthening Links Between Data Analysis and Soft Computing, pages 209-216. Springer, 2015.
[6] Eckhard Liebscher et al. Copula-based dependence measures. Dependence Modeling, 2(1):49-64, 2014.
[7] Rosario N Mantegna. Hierarchical structure in financial markets. The European Physical Journal B-Condensed Matter and Complex Systems, 11(1):193-197, 1999.
[8] Gautier Marti, Sébastien Andler, Frank Nielsen, and Philippe Donnat. Clustering financial time series: How long is enough? Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 2583-2589, 2016.
[9] Gautier Marti, Sébastien Andler, Frank Nielsen, and Philippe Donnat. Exploring and measuring non-linear correlations: Copulas, lightspeed transportation and clustering. NIPS 2016 Time Series Workshop, 55, 2016.
[10] Gautier Marti, Sébastien Andler, Frank Nielsen, and Philippe Donnat. Optimal transport vs. Fisher-Rao distance between copulas for clustering multivariate time series. In IEEE Statistical Signal Processing Workshop, SSP 2016, Palma de Mallorca, Spain, June 26-29, 2016, pages 1-5, 2016.
[11] Gautier Marti, Philippe Very, Philippe Donnat, and Frank Nielsen. A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series. In 14th IEEE International Conference on Machine Learning and Applications, ICMLA 2015, Miami, FL, USA, December 9-11, 2015, pages 32-37, 2015.
[12] Barnabás Póczos, Zoubin Ghahramani, and Jeff G. Schneider. Copula-based kernel dependency measures. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.
[13] A Sklar. Fonctions de répartition à n dimensions et leurs marges. Université Paris 8, 1959.