HAL Id: cel-01620878
https://hal.archives-ouvertes.fr/cel-01620878
Submitted on 24 Oct 2017
To cite this version:
Axel Aronio de Romblay. Symbolic Data Analysis: Regression on Gaussian symbols. Master. France.
2015. cel-01620878
SYMBOLIC DATA ANALYSIS:
Regression on Gaussian symbols
Axel Aronio de Romblay
Telecom Paristech
axel.aronio-de-romblay@telecom-paristech.fr
December 2015
1 Abstract
In classical data analysis, data are single values. This is the case if you consider a dataset of n patients whose age and height you know. But what if you record the blood pressure or the weight of each patient throughout a day? Then, for each patient, you no longer have a single value but a set of values, since blood pressure and weight are not constant during the day.
Suppose now that you do not want to record blood pressure a thousand times for each patient and store it all in a database, because your memory space is limited. You therefore need to aggregate each set of values into symbols: intervals (lower and upper bounds only), box plots, histograms or even distributions (a distribution law with mean and variance).
Thus, the issue is to adapt classical statistical tools to symbolic data analysis. More precisely, this article proposes a method to fit a regression on Gaussian distributions. The paper is organized as follows: first it presents the computation of the maximum likelihood estimator, then it compares the new approach with the usual least squares regression.
KEY WORDS: Symbolic data, Big Data, Linear regression, Maximum likelihood, Gaussian symbols
2 Introduction
Symbolic data were first introduced by Diday (1987). A realization of a symbolic-valued variable may take a finite or an infinite set of values in $\mathbb{R}^p$. A random variable that takes a finite set of either quantitative or qualitative values is called a multi-valued variable. A variable that takes an infinite set of numerical values ranging from a low to a high value is called an interval-valued variable. A more complex symbolic-valued variable may have weights, probabilities, capacities, credibilities or even a distribution associated with its values. This class of variable is called modal-valued. Classical data are a special case of symbolic data where the internal distribution puts probability 1 on a single point value in $\mathbb{R}^p$.
Symbolic data analysis offers a solution to massive data problems, especially when the entities of interest are groups or classes of observations. Recent advances in technology enable the storage of extremely large databases. Larger datasets provide more information about the subjects of interest. However, performing even simple exploratory procedures on these datasets requires a lot of computing power. As a result, much research effort in recent years has been steered toward finding more efficient methods to accommodate large datasets. In many situations, characteristics of classes of observations are of higher interest to an analyst than those of individual observations. In these cases, individual observations can be aggregated into classes of observations.
A series of papers, Diday (1995), Diday et al. (1996), Emilion (1997) and Diday and Emilion (1996, 1998), established a mathematical probabilistic framework for classes of symbolic data. These results were extended to Galois lattices in the symbolic data context in Diday and Emilion (1997, 1998, 2003) and Brito and Polaillon (2005). Beyond those fundamental results, a few recent papers (2010) have presented likelihood functions for symbolic data, but none of them came up with concrete results or closed-form solutions.
Thus, in this paper, we mainly concentrate on deriving an expression of the symbolic likelihood that is valid and usable. Furthermore, we focus on Gaussian symbolic data, which are very common. Some existing work considers uniform distributions but struggles to obtain a workable solution.
3 General settings
Let $X$ be a dataset of $n$ observations and $y$ the corresponding target:

$$X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix}, \quad x_i \in \mathbb{R}^p, \; i = 1, \dots, n
\qquad\qquad
y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad y_i \in \mathbb{R}, \; i = 1, \dots, n$$

We suppose $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$, and we want to estimate $\theta = (\beta, \sigma^2)$, where $\beta \in \mathbb{R}^p$ and $\sigma \in \mathbb{R}$.
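For reference, the classical (non-symbolic) maximum likelihood solution in this setting is the usual least squares fit. A minimal NumPy sketch, with illustrative shapes and simulated data that are not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 100, 5, 0.5
X = rng.uniform(0, 100, size=(n, p))
beta_true = rng.uniform(-2, 2, size=p)
y = X @ beta_true + rng.normal(0, np.sqrt(sigma2), size=n)

# Classical Gaussian maximum likelihood = ordinary least squares:
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # (X'X)^{-1} X'y
sigma2_hat = np.mean((y - X @ beta_hat) ** 2)     # ||y - X beta_hat||^2 / n
```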
4 Gaussian symbols
4.1 Definitions
Here, we consider the case where each observation $x_i$ is no longer a point in $\mathbb{R}^p$ but a $p$-dimensional normal distribution with known parameters $\mu_i$ and $\Sigma_i$:

$$y_i \mid x_i \sim \mathcal{N}(x_i^\top \beta, \sigma^2), \qquad x_i \sim \mathcal{N}(\mu_i, \Sigma_i)$$

Furthermore, we suppose that both samples $(y_i \mid x_i)_i$ and $(x_i)_i$ are i.i.d.
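As a concrete illustration of this generative model, here is a short NumPy sketch drawing a single Gaussian symbol and its response; all parameter values below are arbitrary placeholders, not values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
mu_i = rng.uniform(0, 10, size=p)            # known symbol mean
Sigma_i = np.diag(rng.uniform(0.1, 1.0, p))  # known symbol covariance
beta = rng.uniform(-2, 2, size=p)            # regression coefficients
sigma2 = 0.5                                 # residual variance

# x_i ~ N(mu_i, Sigma_i), then y_i | x_i ~ N(x_i' beta, sigma2)
x_i = rng.multivariate_normal(mu_i, Sigma_i)
y_i = x_i @ beta + rng.normal(0.0, np.sqrt(sigma2))
```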
4.2 Symbolic likelihood
First, let us compute the density of $(y, X)$, i.e. of $(y, x_1, \dots, x_n)$:

$$p(y, X) = p(y \mid X)\, p(X)$$

where each term is given by:

$$p(X) = \prod_{i=1}^{n} (2\pi)^{-\frac{p}{2}}\, |\Sigma_i|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}\,(x_i - \mu_i)^\top \Sigma_i^{-1} (x_i - \mu_i)\right) \tag{1}$$

$$p(y \mid X) = \prod_{i=1}^{n} (2\pi)^{-\frac{1}{2}}\, (\sigma^2)^{-\frac{1}{2}} \exp\left(-\frac{1}{2}\, \frac{(y_i - x_i^\top \beta)^2}{\sigma^2}\right) \tag{2}$$

Thus, we can write the joint distribution as:

$$p(y, X) = C \exp\left(-\frac{1}{2}\left[\frac{\|y - X\beta\|^2}{\sigma^2} + \sum_{i=1}^{n} \langle \tilde{x}_i \mid \Sigma_i^{-1} \tilde{x}_i \rangle\right]\right) \tag{3}$$

using the following notations: $C = (2\pi)^{-\frac{n(p+1)}{2}}\, \sigma^{-n} \prod_{i=1}^{n} |\Sigma_i|^{-\frac{1}{2}}$ and $\tilde{x}_i = x_i - \mu_i$.

Finally, we can write the log-likelihood:

$$\ell(\theta; \mu_i, \Sigma_i, y_i) = \log p(y) = \log \int_{x_1} \cdots \int_{x_n} p(y, X)\, dx_1 \cdots dx_n \tag{4}$$
Indeed, from (4), we can easily notice that $y$ follows a normal distribution with parameters $\mu_y$ and $\Sigma_y$:

$$\mathbb{E}[y] = \mathbb{E}[X\beta] = M\beta \tag{5}$$

with: $M = \begin{pmatrix} \mu_1^\top \\ \vdots \\ \mu_n^\top \end{pmatrix}$
If we suppose that $X$ and $\varepsilon$ are independent:

$$\Sigma_y = \mathrm{Var}(y) = \mathrm{Var}(X\beta) + \mathrm{Var}(\varepsilon) = \mathbb{E}[X\beta (X\beta)^\top] - \mathbb{E}[X\beta]\,\mathbb{E}[X\beta]^\top + \sigma^2 I_n$$

where:

$$\mathbb{E}[X\beta (X\beta)^\top] = \mathbb{E}\big[\big((x_i^\top \beta)(x_j^\top \beta)\big)_{i,j}\big] = \mathbb{E}\big[\big(\beta^\top x_i x_j^\top \beta\big)_{i,j}\big] = \big(\beta^\top (\mathrm{Cov}(x_i, x_j) + \mu_i \mu_j^\top)\beta\big)_{i,j} = \big(\beta^\top \delta_{i,j}\,\Sigma_i\, \beta\big)_{i,j} + M\beta\beta^\top M^\top$$

As a conclusion:

$$\Sigma_y = \big(\beta^\top \delta_{i,j}\,\Sigma_i\, \beta\big)_{i,j} + \sigma^2 I_n = \begin{pmatrix} \beta^\top \Sigma_1 \beta + \sigma^2 & & (0) \\ & \ddots & \\ (0) & & \beta^\top \Sigma_n \beta + \sigma^2 \end{pmatrix} \tag{6}$$
And the log-likelihood is given by:

$$\ell(\theta; \mu_i, \Sigma_i, y_i) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}\log(\beta^\top \Sigma_i \beta + \sigma^2) - \frac{1}{2}\,(y - M\beta)^\top \Sigma_y^{-1}(y - M\beta) \tag{7}$$
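To make (7) concrete, here is a minimal NumPy sketch evaluating the log-likelihood. The function name and the array layout (`mu` of shape (n, p) holding the rows of $M$, `Sigma` of shape (n, p, p), `y` of shape (n,)) are our own conventions, not from the paper:

```python
import numpy as np

def symbolic_log_likelihood(beta, sigma2, mu, Sigma, y):
    """Evaluate (7): y ~ N(M beta, diag(beta' Sigma_i beta + sigma^2))."""
    v = np.einsum('j,ijk,k->i', beta, Sigma, beta) + sigma2   # beta' Sigma_i beta + sigma^2
    resid = y - mu @ beta                                     # y_i - mu_i' beta
    n = y.shape[0]
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.sum(np.log(v))
            - 0.5 * np.sum(resid ** 2 / v))
```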
4.3 Maximum Likelihood Estimator
Let us see if we can get a closed form for $\hat{\beta}_{MLE}$ by computing the partial derivative of the log-likelihood with respect to $\beta$. We write $\ell_1(\beta) = \sum_{i=1}^{n} \log(\beta^\top \Sigma_i \beta + \sigma^2)$ and $\ell_2(\beta) = (y - M\beta)^\top \Sigma_y^{-1} (y - M\beta)$:

$$\frac{\partial \ell_1}{\partial \beta}(\beta) = 2\sum_{i=1}^{n} \frac{\Sigma_i \beta}{\beta^\top \Sigma_i \beta + \sigma^2}$$

$$\frac{\partial \ell_2}{\partial \beta}(\beta) = \sum_{i=1}^{n} \frac{\partial}{\partial \beta}\, \frac{(y_i - \langle \mu_i \mid \beta \rangle)^2}{\beta^\top \Sigma_i \beta + \sigma^2}
= \sum_{i=1}^{n} \left[ -\frac{2\,(y_i - \langle \mu_i \mid \beta \rangle)^2\, \Sigma_i \beta}{(\beta^\top \Sigma_i \beta + \sigma^2)^2} - \frac{2\,(y_i - \langle \mu_i \mid \beta \rangle)\, \mu_i}{\beta^\top \Sigma_i \beta + \sigma^2} \right]$$

Now, if we gather the two terms and factorize, we get:

$$\frac{\partial \ell}{\partial \beta}(\theta; \mu_i, \Sigma_i, y_i) = 0 \iff \sum_{i=1}^{n} \frac{1}{\beta^\top \Sigma_i \beta + \sigma^2} \left[ \tilde{y}_i\, \mu_i - \left(1 - \frac{\tilde{y}_i^2}{\beta^\top \Sigma_i \beta + \sigma^2}\right) \Sigma_i \beta \right] = 0 \tag{8}$$

where $\tilde{y}_i = y_i - \langle \mu_i \mid \beta \rangle = y_i - \hat{y}_i$.

With the general settings ($\Sigma_i \neq 0$), we cannot get a closed form for $\hat{\beta}_{MLE}$. Besides, we do not know $\sigma^2$. Let us recall the expression of the partial derivative of the log-likelihood with respect to $\beta$:

$$\frac{\partial \ell}{\partial \beta}(\theta; \mu_i, \Sigma_i, y_i) = \sum_{i=1}^{n} \frac{1}{\beta^\top \Sigma_i \beta + \sigma^2} \left[ \tilde{y}_i\, \mu_i - \left(1 - \frac{\tilde{y}_i^2}{\beta^\top \Sigma_i \beta + \sigma^2}\right) \Sigma_i \beta \right] \tag{9}$$
Now we compute the partial derivative of the log-likelihood with respect to $\sigma^2$:

$$\frac{\partial \ell_1}{\partial \sigma^2}(\sigma^2) = \sum_{i=1}^{n} \frac{1}{\beta^\top \Sigma_i \beta + \sigma^2}
\qquad
\frac{\partial \ell_2}{\partial \sigma^2}(\sigma^2) = -\sum_{i=1}^{n} \frac{\tilde{y}_i^2}{(\beta^\top \Sigma_i \beta + \sigma^2)^2}$$

Thus:

$$\frac{\partial \ell}{\partial \sigma^2}(\theta; \mu_i, \Sigma_i, y_i) = \frac{1}{2} \sum_{i=1}^{n} \frac{1}{\beta^\top \Sigma_i \beta + \sigma^2} \left( \frac{\tilde{y}_i^2}{\beta^\top \Sigma_i \beta + \sigma^2} - 1 \right) \tag{10}$$
To conclude, the gradient of the log-likelihood is a sum of per-observation terms:

$$\nabla \ell(\theta) = \sum_{i=1}^{n} \nabla \ell^{(i)}(\theta) \tag{11}$$

with:

$$\nabla \ell^{(i)}(\theta) = \begin{pmatrix} \dfrac{1}{\beta^\top \Sigma_i \beta + \sigma^2} \left[ \tilde{y}_i\, \mu_i - \left(1 - \dfrac{\tilde{y}_i^2}{\beta^\top \Sigma_i \beta + \sigma^2}\right) \Sigma_i \beta \right] \\[3ex] \dfrac{1}{2}\, \dfrac{1}{\beta^\top \Sigma_i \beta + \sigma^2} \left( \dfrac{\tilde{y}_i^2}{\beta^\top \Sigma_i \beta + \sigma^2} - 1 \right) \end{pmatrix} \tag{12}$$
4.4 Comments
• Finally, we do have an explicit formula for the gradient of the log-likelihood, which is a sum of $n$ vectors in a $(p+1)$-dimensional space. Thus, with (11) and (12), we can run a full-batch or a stochastic gradient descent algorithm to get an approximation of $\hat{\theta}_{MLE}$ (see the sketch after this list).
• We also notice that we can easily recover the 'usual' least squares regression case, where $\Sigma_i \to 0$ and $x_i \leftrightarrow \mu_i$:

$$\sum_{i=1}^{n} \frac{1}{\sigma^2}\,\tilde{y}_i\, \mu_i = 0 \iff (y - X\beta)^\top X = 0 \tag{13}$$

$$\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(\frac{\tilde{y}_i^2}{\sigma^2} - 1\right) = 0 \iff -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\,(y - X\beta)^\top (y - X\beta) = 0 \tag{14}$$

i.e. (13) and (14) give back the usual estimators $\hat{\beta} = (X^\top X)^{-1} X^\top y$ and $\hat{\sigma}^2 = \frac{1}{n}\|y - X\hat{\beta}\|^2$.
• We supposed that there is no intercept. Indeed, we can get rid of it easily if we first apply the following transformation:

$$x_i \sim \mathcal{N}\!\left(\begin{pmatrix} \mu_i \\ 1 \end{pmatrix}, \begin{pmatrix} \Sigma_i & 0 \\ 0 & 0 \end{pmatrix}\right) \quad \text{and} \quad \beta \in \mathbb{R}^{p+1}$$
• After estimating $\beta$, we can deduce an unbiased prediction: $\hat{y}_j = \mu_j^\top \hat{\beta}$.
• We expect $\hat{\sigma}^2 \to +\infty$ when $\Sigma_i \to +\infty$. Indeed, if we plot the log-likelihood as a function of $\sigma^2$ for different values of $\Sigma_i$ (we suppose the $\Sigma_i$ are equal for all $i$; see more details in the Appendix), we get the following result:
Figure 1: log-likelihoods for different Σi
• From (6), we can see that the variance of $y_i$ depends on $\Sigma_i$ but especially on $\beta$, which makes sense. Indeed, if $p$ is large and each coordinate is large, the locations of the $y_i$ are not accurate. This can lead to an issue when fitting the regression, since symbols can overlap.
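As mentioned in the first bullet above, the gradient (11)-(12) can be fed to a gradient-based optimizer. The sketch below is a minimal full-batch gradient ascent under the same assumed array layout as the earlier sketch; the step size and iteration count are arbitrary choices (they may need tuning), and the least squares start $\beta_0 = (M^\top M)^{-1} M^\top y$ is the one recommended in the conclusion:

```python
import numpy as np

def symbolic_grad(beta, sigma2, mu, Sigma, y):
    """Gradient of the log-likelihood: sum over i of the per-observation terms (12)."""
    Sb = np.einsum('ijk,k->ij', Sigma, beta)        # Sigma_i beta, shape (n, p)
    v = Sb @ beta + sigma2                          # beta' Sigma_i beta + sigma^2, shape (n,)
    resid = y - mu @ beta                           # tilde y_i = y_i - <mu_i | beta>
    g_beta = np.sum((resid[:, None] * mu - (1 - resid ** 2 / v)[:, None] * Sb) / v[:, None], axis=0)
    g_sigma2 = 0.5 * np.sum((resid ** 2 / v - 1) / v)
    return g_beta, g_sigma2

def fit_mle(mu, Sigma, y, lr=1e-4, n_iter=5000):
    """Full-batch gradient ascent towards (beta_MLE, sigma2_MLE)."""
    beta = np.linalg.solve(mu.T @ mu, mu.T @ y)     # naive least squares start
    sigma2 = np.mean((y - mu @ beta) ** 2)
    for _ in range(n_iter):
        g_beta, g_sigma2 = symbolic_grad(beta, sigma2, mu, Sigma, y)
        beta = beta + lr * g_beta                   # ascent step (we maximise the likelihood)
        sigma2 = max(sigma2 + lr * g_sigma2, 1e-8)  # keep the variance positive
    return beta, sigma2
```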
4.5 Results
In this section, we simulate different datasets and compare the maximum likelihood estimation with the naive solution of fitting a regression on the $(\mu_i, y_i)$. We recall the settings: we are given $(\mu_i, \Sigma_i, y_i)$ and we want to estimate $\beta$ (and $\sigma^2$). Let us denote by $m$, $\Sigma$, $s_1$ and $s_2$ the following parameters:
• M = uniform(0, m, (n, p)): each coordinate of $M$ (see the definition in Section 4.2) is generated uniformly between 0 and $m$.
• Σ = [np.diag(uniform(s1, s2, p)) for i in range(n)]: we suppose here that within a symbol the $p$ features are independent, and for each $\Sigma_i$ we generate a diagonal matrix whose diagonal coefficients are drawn uniformly in $[s_1, s_2]$ (see the sketch below).
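A possible NumPy implementation of this simulation setup; the function name, the seed handling and the return of the true $\beta$ are our own choices, only the parameter roles come from the text:

```python
import numpy as np

def simulate(n=100, p=5, m=100, s1=0.1, s2=1.0, sigma2=0.5, seed=0):
    """Generate (mu, Sigma, y) as described above; each coordinate of beta is drawn in [-2, 2]."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0, m, size=(n, p))                                    # rows of M
    Sigma = np.stack([np.diag(rng.uniform(s1, s2, p)) for _ in range(n)])  # diagonal Sigma_i
    beta = rng.uniform(-2, 2, size=p)
    X = np.stack([rng.multivariate_normal(mu[i], Sigma[i]) for i in range(n)])
    y = X @ beta + rng.normal(0, np.sqrt(sigma2), size=n)
    return mu, Sigma, y, beta

# Example: Dataset 1 (reference) settings correspond to the defaults above.
# mu, Sigma, y, beta_true = simulate()
```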
4.5.1 Dataset 1 (reference dataset): $n=100$, $p=5$, $m=100$, $[s_1, s_2]=[0.1, 1]$, $\sigma^2=0.5$, $\beta \in [-2, 2]$
Figure 2: Convergence to the ML estimator
Figure 3: MSE for each method
4.5.2 Dataset 2: $n=100$, $p=5$, $m=100$, $[s_1, s_2]=[0.1, 1]$, $\sigma^2=0.5$, $\beta \in [-0.1, 0.1]$
Figure 4: Convergence to the ML estimator
Figure 5: MSE for each method
4.5.3 Dataset 3: $n=100$, $p=5$, $m=100$, $[s_1, s_2]=[0.1, 1]$, $\sigma^2=5$, $\beta \in [-2, 2]$
Figure 6: Convergence to the ML estimator
Figure 7: MSE for each method
4.5.4 Dataset 4: $n=1000$, $p=5$, $m=100$, $[s_1, s_2]=[0.1, 1]$, $\sigma^2=0.5$, $\beta \in [-2, 2]$
Figure 8: Convergence to the ML estimator
Figure 9: MSE for each method
4.5.5 Dataset 5: $n=100$, $p=5$, $m=100$, $[s_1, s_2]=[1, 5]$, $\sigma^2=0.5$, $\beta \in [-2, 2]$
Figure 10: Convergence to the ML estimator
Figure 11: MSE for each method
4.5.6 Dataset 6: $n=100$, $p=20$, $m=100$, $[s_1, s_2]=[0.1, 1]$, $\sigma^2=0.5$, $\beta \in [-2, 2]$
Figure 12: Convergence to the ML estimator
Figure 13: MSE for each method
5 Conclusion
From these results, we can see that the maximum likelihood estimator is always more accurate than fitting a naive least squares regression on the $(\mu_i, y_i)$. Nevertheless, the computation time is much longer with a simple gradient descent, even with an 'appropriate' learning rate and a 'good' initialization. To address this issue, it is highly recommended to start with $\beta_0 = (M^\top M)^{-1} M^\top y$. It then takes just a few seconds to get an 'accurate' approximation of $\hat{\beta}_{MLE}$. Besides, from the graphics, we can notice that:
• When the quantity $\beta^\top \Sigma_i \beta + \sigma^2$ is 'large' (cf. Figures 5, 7 and 11), the MLE is much more accurate than the naive estimator. Indeed, the naive estimator does not take into account the internal variation inside a symbol and is therefore very biased. But in this case, we need to run more iterations.
• When $n$ is 'large' (cf. Figure 9): we need fewer iterations but each iteration takes more time. The more symbols there are, the more accurate both estimators become. Here, it is preferable to run a stochastic gradient descent algorithm: it will definitely be faster, but a bit less accurate (unless you run some more iterations).
• When $p$ is 'large' (cf. Figure 13): the difference between the two estimators is negligible. The MLE approximation needs a lot of iterations, and subsampling is not a good solution to speed up the convergence.
6 Acknowledgment
I would like to thank my supervisor Scott Sisson, who guided me through my research exchange and helped me. I would also like to thank the administration team at UNSW for their welcome and their helpful assistance in setting up my office.
7 References
[1] Billard, L. and Diday, E. (2000). Regression Analysis for Interval-Valued Data. Data Analysis, Classification, and Related Methods (eds. H.A.L. Kiers, J.-P. Rassoon, P.J.F. Groenen, and M. Schader). Springer-Verlag, Berlin, 369-374.
[2] Billard, L. and Diday, E. (2007). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester.
[3] de Carvalho, F.A.T., Lima Neto, E.A. and Tenorio, C.P. (2004). A New Method to Fit a Linear Regression Model for Interval-valued Data. Lecture Notes in Computer Science, KI 2004: Advances in Artificial Intelligence. Springer-Verlag, 295-306.
[4] Le-Rademacher, J. and Billard, L. (2010). Likelihood Functions and Some Maximum Likelihood Estimators for Symbolic Data. Journal of Statistical Planning and Inference, submitted.
[5] Lima Neto, E.A. and de Carvalho, F.A.T. (2010). Constrained Linear Regression Models for Symbolic Interval-valued Variables. Computational Statistics & Data Analysis, 54(2), 333-347.
8 Appendix
(*) Figure 1 has been plotted with these parameters: $n=500$, $p=5$, $m=10$, $s_1=s_2$, $\sigma^2=0.5$, $\beta \in [-2, 2]$.
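A sketch of how a plot in the spirit of Figure 1 could be produced, assuming the `symbolic_log_likelihood` and `simulate` helpers defined in the earlier sketches; the grid of $\sigma^2$ values and the levels $s = s_1 = s_2$ below are arbitrary choices:

```python
import numpy as np
import matplotlib.pyplot as plt

sigma2_grid = np.linspace(0.05, 5.0, 200)
for s in [0.1, 1.0, 10.0]:                       # common value s1 = s2 = s for all Sigma_i
    mu, Sigma, y, beta = simulate(n=500, p=5, m=10, s1=s, s2=s, sigma2=0.5, seed=0)
    ll = [symbolic_log_likelihood(beta, v, mu, Sigma, y) for v in sigma2_grid]
    plt.plot(sigma2_grid, ll, label=f"Sigma_i = {s} * I")

plt.xlabel("sigma^2")
plt.ylabel("log-likelihood")
plt.legend()
plt.show()
```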