SlideShare a Scribd company logo
1 of 52
Download to read offline
1/52
MASSICCC: A SaaS Platform for
Clustering and Co-Clustering of Mixed Data
https://massiccc.lille.inria.fr/
F. Laporte
with B. Auder, C. Biernacki, G. Celeux, J. Demont, F. Langrognet, V. Kubicki, C. Poli, J. Renault, S. Iovleff
May 29th 2019, TechTalk, Paris
2/52
Outline
1 Introduction
2 Model-based clustering
3 Mixmod in MASSICCC
4 MixtComp in MASSICCC
5 BlockCluster in MASSICCC
6 Conclusion
3/52
MASSICCC?
massiccc.lille.inria.fr
SaaS: Software as a Service
4/52
MASSICCC: Examples of Applications
Market sales
Cities’ similarities
Predictive maintenance
Health
Data mining
Large dataset
Complex dataset
5/52
MASSICCC??
A high quality and easy to use web platform
where are transfered mature research clustering (and more) software
towards (non academic) professionals
6/52
Here is the computer you need!
7/52
Clustering?
Detect hidden structures in data sets
−1 −0.5 0 0.5 1 1.5 2 2.5
−1.5
−1
−0.5
0
0.5
1
1.5
1st MCA axis
2ndMCAaxis
−1 −0.5 0 0.5 1 1.5 2 2.5
−1.5
−1
−0.5
0
0.5
1
1.5
1st MCA axis
2ndMCAaxis
Low income
Average income
High income
8/52
Large data sets1
1
S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering:
Algorithms and Applications, 29
9/52
An opportunity for detecting weak signal
10/52
Todays features: full mixed/missing
11/52
Notations
Data: n individuals: x = {x1, . . . , xn} = {xO , xM } in a space X of dimension d
Observed individuals xO
Missing individuals xM
Aim: estimation of the partition z and the number of clusters K
Partition in K clusters G1, . . . , GK : z = (z1, . . . , zn), zi = (zi1, . . . , ziK )
xi ∈ Gk ⇔ zih = I{h=k}
Mixed, missing, uncertain
Individuals x Partition z ⇔ Group
? 0.5 red 5 ? ? ? ⇔ ???
0.3 0.1 green 3 ? ? ? ⇔ ???
0.3 0.6 {red,green} 3 ? ? ? ⇔ ???
0.9 [0.25 0.45] red ? ? ? ? ⇔ ???
↓ ↓ ↓ ↓
continuous continuous categorical integer
12/52
Outline
1 Introduction
2 Model-based clustering
3 Mixmod in MASSICCC
4 MixtComp in MASSICCC
5 BlockCluster in MASSICCC
6 Conclusion
13/52
Parametric mixture model
Parametric assumption:
pk (x1) = p(x1; αk )
thus
p(x1) = p(x1; θ) =
K
k=1
πk p(x1; αk )
Mixture parameter:
θ = (π, α) with α = (α1, . . . , αK )
Model: it includes both the family p(·; αk ) and the number of groups K
m = {p(x1; θ) : θ ∈ Θ}
The number of free continuous parameters is given by
ν = dim(Θ)
Clustering becomes a well-posed problem. . .
14/52
The clustering process in mixtures
1 Estimation of θ by ˆθ
2 Estimation of the conditional probability that xi ∈ Gk
tik (ˆθ) = p(Zik = 1|Xi = xi ; ˆθ) =
ˆπk p(xi ; ˆαk )
p(xi ; ˆθ)
3 Estimation of zi by maximum a posteriori (MAP)
ˆzik = I{k=arg maxh=1,...,K tih( ˆθ)}
4 Model selection: BIC, ICL, . . .
15/52
Outline
1 Introduction
2 Model-based clustering
3 Mixmod in MASSICCC
4 MixtComp in MASSICCC
5 BlockCluster in MASSICCC
6 Conclusion
16/52
Possible datasets
Continuous data
14 models on correlation matrix between variables (models’ selection)
Categorical data
Conditional independence
Mixed data
Conditional independence inter-type
Conditional independence intra-type (symmetry between types)
Distributions
Continuous: Gaussian
Categorical: Multinomial
17/52
Estimation of θ
Maximize the complete-likelihood over (θ, z)
c (θ; x, z) =
n
i=1
K
k=1
zik ln {πk p(xi ; αk )} CEM
Maximize the observe-likelihood on θ
(θ; x) =
n
i=1
ln p(xi ; θ) EM
18/52
Principle of EM and CEM
Initialization: θ0
Iteration noq:
Step E: estimate probabilities tq
= {tik (θq
)}
Step C: classify by setting tq
= MAP({tik (θq
)})
Step M: maximize θq+1
= arg maxθ c (θ; x, tq
)
Stopping rule: iteration number or criterion stability
Properties
⊕: simplicity, monotony, low memory requirement
: local maxima (depends on θ0), linear convergence (EM)
19/52
Prostate cancer data (without mixing data)
Individuals: n = 475 patients with prostatic cancer grouped on clinical criteria
into two Stages 3 and 4 of the disease
Variables: d = 12 pre-trial variates were measured on each patient, composed by
eight continuous variables (age, weight, systolic blood pressure, diastolic blood
pressure, serum haemoglobin, size of primary tumour, index of tumour stage and
histolic grade, serum prostatic acid phosphatase) and four categorical variables
with various numbers of levels (performance rating, cardiovascular disease history,
electrocardiogram code, bone metastases)
Model: cond. indep. p(x1; αk ) = p(x1; αcont
k ) · p(x1; αcat
k )
20/52
Mixed data
21/52
Why should I use mixmod?
Avdantage(s)
Compare a lot of different models (continuous data)
Analyse mixed data
Disadvantage(s)
Do not handle missing data
Only continuous or categorical data
22/52
Outline
1 Introduction
2 Model-based clustering
3 Mixmod in MASSICCC
4 MixtComp in MASSICCC
5 BlockCluster in MASSICCC
6 Conclusion
23/52
Full mixed data: conditional independence everywhere2
The aim is to combine continuous, categorical, integer data, ordinal, ranking and
functional data
x1 = (xcont
1 , xcat
1 , xint
1 , . . .)
The proposed solution is to mixed all types by inter-type conditional independence
p(x1; αk ) = p(xcont
1 ; αcont
k ) × p(xcat
1 ; αcat
k ) × p(xint
1 ; αint
k ) × . . .
In addition, for symmetry between types, intra-type conditional independence
Only need to define the univariate pdf for each variable type!
Continuous: Gaussian
Categorical: multinomial
Integer: Poisson
. . .
2
MixtComp software on the MASSICCC platform: https://massiccc.lille.inria.fr/
24/52
Missing data: MAR assumption and estimation
Assumption on the missingness mecanism
Missing At Randon (MAR): the probability that a variable is missing does not
depend on its own value given the observed variables.
Observed log-likelihood. . .
(θ; xO
) =
n
i=1
log
K
k=1
πk p(xO
i ; αk ) =
n
i=1
log






K
k=1
πk
xM
i
p(xO
i , xM
i ; αk)dxM
i
MAR assumption






25/52
SEM algorithm3
A SEM algorithm to estimate θ by maximizing the observed-data log-likelihood
Initialisation: θ(0)
Iteration nb q:
E-step: compute conditional probabilities p(xM
, z|x0
; θ(q)
)
S-step: draw (xM(q)
, z(q)
) from p(xM
, z|x0
; θ(q)
)
M-step: maximize θ(q+1)
= arg maxθ ln p(xO
, xM(q)
, z(q)
; θ)
Stopping rule: iteration number
Properties: simpler than EM and interesting properties!
Avoid possibly difficult E-step in an EM
Classical M steps
Avoids local maxima
The mean of the sequence (θ(q)) approximates ˆθ
The variance of the sequence (θ(q)) gives confidence intervals
3
MixtComp software on the MASSICCC platform: https://massiccc.lille.inria.fr/
26/52
Prostate cancer data (with missing data)4
Individuals: 506 patients with prostatic cancer grouped on clinical criteria into
two Stages 3 and 4 of the disease
Variables: d = 12 pre-trial variates were measured on each patient, composed by
eight continuous variables (age, weight, systolic blood pressure, diastolic blood
pressure, serum haemoglobin, size of primary tumour, index of tumour stage and
histolic grade, serum prostatic acid phosphatase) and four categorical variables
with various numbers of levels (performance rating, cardiovascular disease history,
electrocardiogram code, bone metastases)
Some missing data: 62 missing values (≈ 1%)
We forget the classes (Stages of the desease) for performing clustering
Questions
How many clusters?
Which partition?
4
Byar DP, Green SB (1980): Bulletin Cancer, Paris 67:477-488
27/52
Data upload without preprocessing
28/52
Run clustering analysis
29/52
Several quick result overviews. . . without post-processing
30/52
Variable significance on global partition
+ similarity between variables
31/52
Variable “SG” difference between clusters
32/52
Variable “BM” difference between clusters
33/52
Companies and MixtComp
Modal (Rougegorge)
Inriatech (Alstom, ArcelorMittal, D´ecathlon, ...)
DiagRAMS technologies (predictive maintenance)
34/52
Why should I use MixtComp?
Avdantage(s)
Analyse different kinds of data
Handle missing and partially missing data
Disadvantage(s)
Do not use correlation structure (even with continuous data)
35/52
Outline
1 Introduction
2 Model-based clustering
3 Mixmod in MASSICCC
4 MixtComp in MASSICCC
5 BlockCluster in MASSICCC
6 Conclusion
36/52
High-dimensional (HD) data5
5
S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering:
Algorithms and Applications, 29
37/52
From clustering to co-clustering
[Govaert, 2011]
38/52
Notations
zi : the cluster of the row i
wj : the cluster of the column j
(zi , wj ): the block of the element xij (row i, column j)
z = (z1, . . . , zn): partition of individuals in K custers of rows
w = (w1, . . . , wd ): partition of variables in L clusters of columns
(z, w): bi-partition of the whole data set x
Both space partitions are respectively denoted by Z and W
Restriction
All variables are of the same kind (research in progress for overcoming that. . . )
39/52
MLE estimation: EM algorithm
Observed log-likelihood: (θ; x) = log p(x; θ)
Complete log-likelihood:
c (θ; x, z, w) = log p(x, z, w; θ)
=
i,k
zik log πk +
k,l
wjl log ρl +
i,j,k,l
zik wjl log p(xj
i ; αkl )
E-step of EM (iteration q):
Q(θ, θ(q)
) = E[ c (θ; x, z, w)|x; θ(q)
]
=
i,k
p(zi = k|x; θ(q)
)
t
(q)
ik
ln πk +
j,l
p(wi = l|x; θ(q)
)
s
(q)
jl
ln ρl
+
i,j,k,l
p(zi = k, wj = l|x; θ(q)
)
e
(q)
ijkl
ln p(xij ; αkl )
M-step of EM (iteration q): classical. For instance, for the Bernoulli case, it gives
π
(q+1)
k = i t
(q)
ik
n
, ρ
(q+1)
l =
j s
(q)
jl
d
, α
(q+1)
kl =
i,j e
(q)
ijkl xij
i,j e
(q)
ijkl
40/52
MLE: intractable E step
e
(q)
ijkl is usually intractable. . .
Consequence of dependency between xij s (link between rows and columns)
Involve KnLd calculus (number of possible blocks)
Example: if n = d = 20 and K = L = 2 then 1012 blocks
Example (cont’d): 33 years with a computer calculating 100,000 blocks/second
Alternatives to EM
Variational EM (numerical approx.): conditional independence assumption
p(z, w|x; θ) ≈ p(z|x; θ)p(w|x; θ)
SEM-Gibbs (stochastic approx.): replace E-step by a S-step approx. by Gibbs
z|x, w; θ and w|x, z; θ
41/52
Document clustering (1/2)
Mixture of 1033 medical summaries and 1398 aeronautics summaries
Lines: 2431 documents
Columns: present words (except stop), thus 9275 unique words
Data matrix: cross counting document×words
Poisson model
42/52
Document clustering (2/2)
Results with 2×2 blocs
Medline Cranfield
Medline 1033 0
Cranfield 0 1398
43/52
Running BlockCluster
44/52
Running BlockCluster
45/52
Running BlockCluster
46/52
Running BlockCluster
47/52
Why should I use BlockCluster?
Advantage(s)
Co-clustering (HD data)
Disadvantage(s)
Analyse one kind of data at a time
48/52
Outline
1 Introduction
2 Model-based clustering
3 Mixmod in MASSICCC
4 MixtComp in MASSICCC
5 BlockCluster in MASSICCC
6 Conclusion
49/52
Current work
mixmod
Missing Not At Random data (using logit distribution) =⇒ missing values
impact the probability of missingness
Patnership between CMAP/INRIA/Traumabase
Work with C. Biernacki, G. Celeux, J. Josse and Y. Stroppa
MixtComp
Publish an R package on CRAN
With you?
New kind of data
Complex missingness
2D clusters’ plot
50/52
Use probabilistic modelling as a mathematical guideline
Use the MASSICCC platform for user-friendly implementation
User-friendly interpretation
”One for all” of clustering
Low computer requirement needed
Free software
https://massiccc.lille.inria.fr/
Also check R packages (https://cran.r-project.org/)
Rmixmod: https://cran.r-project.org/web/packages/Rmixmod/index.html
blockcluster: https://cran.r-project.org/web/packages/blockcluster/index.html
51/52
MERCI
!
52/52

More Related Content

What's hot

Bayseian decision theory
Bayseian decision theoryBayseian decision theory
Bayseian decision theory
sia16
 

What's hot (20)

SASA 2016
SASA 2016SASA 2016
SASA 2016
 
A quantum-inspired optimization heuristic for the multiple sequence alignment...
A quantum-inspired optimization heuristic for the multiple sequence alignment...A quantum-inspired optimization heuristic for the multiple sequence alignment...
A quantum-inspired optimization heuristic for the multiple sequence alignment...
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Convolutional networks and graph networks through kernels
Convolutional networks and graph networks through kernelsConvolutional networks and graph networks through kernels
Convolutional networks and graph networks through kernels
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practice
 
A short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction modelsA short and naive introduction to using network in prediction models
A short and naive introduction to using network in prediction models
 
Linear models for classification
Linear models for classificationLinear models for classification
Linear models for classification
 
From RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphsFrom RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphs
 
CSC446: Pattern Recognition (LN3)
CSC446: Pattern Recognition (LN3)CSC446: Pattern Recognition (LN3)
CSC446: Pattern Recognition (LN3)
 
Investigating the 3D structure of the genome with Hi-C data analysis
Investigating the 3D structure of the genome with Hi-C data analysisInvestigating the 3D structure of the genome with Hi-C data analysis
Investigating the 3D structure of the genome with Hi-C data analysis
 
ABC-SysBio – Approximate Bayesian Computation in Python with GPU support
ABC-SysBio – Approximate Bayesian Computation in Python with GPU supportABC-SysBio – Approximate Bayesian Computation in Python with GPU support
ABC-SysBio – Approximate Bayesian Computation in Python with GPU support
 
Polynomial Matrix Decompositions
Polynomial Matrix DecompositionsPolynomial Matrix Decompositions
Polynomial Matrix Decompositions
 
P1121133727
P1121133727P1121133727
P1121133727
 
Machine learning in science and industry — day 3
Machine learning in science and industry — day 3Machine learning in science and industry — day 3
Machine learning in science and industry — day 3
 
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
 
ICPR 2016
ICPR 2016ICPR 2016
ICPR 2016
 
Pres metabief2020jmm
Pres metabief2020jmmPres metabief2020jmm
Pres metabief2020jmm
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Bayseian decision theory
Bayseian decision theoryBayseian decision theory
Bayseian decision theory
 
Kernel methods for data integration in systems biology
Kernel methods for data integration in systems biologyKernel methods for data integration in systems biology
Kernel methods for data integration in systems biology
 

Similar to Inria Tech Talk - La classification de données complexes avec MASSICCC

Scaling Multinomial Logistic Regression via Hybrid Parallelism
Scaling Multinomial Logistic Regression via Hybrid ParallelismScaling Multinomial Logistic Regression via Hybrid Parallelism
Scaling Multinomial Logistic Regression via Hybrid Parallelism
Parameswaran Raman
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzer
butest
 
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Pooyan Jamshidi
 
isabelle_webinar_jan..
isabelle_webinar_jan..isabelle_webinar_jan..
isabelle_webinar_jan..
butest
 
Introduction
IntroductionIntroduction
Introduction
butest
 
Computing near-optimal policies from trajectories by solving a sequence of st...
Computing near-optimal policies from trajectories by solving a sequence of st...Computing near-optimal policies from trajectories by solving a sequence of st...
Computing near-optimal policies from trajectories by solving a sequence of st...
Université de Liège (ULg)
 
MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...
MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...
MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...
The Statistical and Applied Mathematical Sciences Institute
 

Similar to Inria Tech Talk - La classification de données complexes avec MASSICCC (20)

ML unit-1.pptx
ML unit-1.pptxML unit-1.pptx
ML unit-1.pptx
 
Bayesian Deep Learning
Bayesian Deep LearningBayesian Deep Learning
Bayesian Deep Learning
 
Scaling Multinomial Logistic Regression via Hybrid Parallelism
Scaling Multinomial Logistic Regression via Hybrid ParallelismScaling Multinomial Logistic Regression via Hybrid Parallelism
Scaling Multinomial Logistic Regression via Hybrid Parallelism
 
MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...
MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...
MUMS Opening Workshop - An Overview of Reduced-Order Models and Emulators (ED...
 
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzer
 
Bayesian phylogenetic inference_big4_ws_2016-10-10
Bayesian phylogenetic inference_big4_ws_2016-10-10Bayesian phylogenetic inference_big4_ws_2016-10-10
Bayesian phylogenetic inference_big4_ws_2016-10-10
 
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
 
Triggering patterns of topology changes in dynamic attributed graphs
Triggering patterns of topology changes in dynamic attributed graphsTriggering patterns of topology changes in dynamic attributed graphs
Triggering patterns of topology changes in dynamic attributed graphs
 
isabelle_webinar_jan..
isabelle_webinar_jan..isabelle_webinar_jan..
isabelle_webinar_jan..
 
The Concurrent Constraint Programming Research Programmes -- Redux (part2)
The Concurrent Constraint Programming Research Programmes -- Redux (part2)The Concurrent Constraint Programming Research Programmes -- Redux (part2)
The Concurrent Constraint Programming Research Programmes -- Redux (part2)
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big Data
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Introduction
IntroductionIntroduction
Introduction
 
H2O World - Generalized Low Rank Models - Madeleine Udell
H2O World - Generalized Low Rank Models - Madeleine UdellH2O World - Generalized Low Rank Models - Madeleine Udell
H2O World - Generalized Low Rank Models - Madeleine Udell
 
Computing near-optimal policies from trajectories by solving a sequence of st...
Computing near-optimal policies from trajectories by solving a sequence of st...Computing near-optimal policies from trajectories by solving a sequence of st...
Computing near-optimal policies from trajectories by solving a sequence of st...
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...
MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...
MUMS Opening Workshop - Quantifying Nonparametric Modeling Uncertainty with B...
 
Dual-time Modeling and Forecasting in Consumer Banking (2016)
Dual-time Modeling and Forecasting in Consumer Banking (2016)Dual-time Modeling and Forecasting in Consumer Banking (2016)
Dual-time Modeling and Forecasting in Consumer Banking (2016)
 

More from Stéphanie Roger

More from Stéphanie Roger (20)

Workshop IA : accélérez vos projets grâce au calcul intensif
Workshop IA : accélérez vos projets grâce au calcul intensif Workshop IA : accélérez vos projets grâce au calcul intensif
Workshop IA : accélérez vos projets grâce au calcul intensif
 
Workshop - Le traitement de données biométriques par la CNIL
Workshop - Le traitement de données biométriques par la CNILWorkshop - Le traitement de données biométriques par la CNIL
Workshop - Le traitement de données biométriques par la CNIL
 
Masterclass Welcome to France with Business France
Masterclass Welcome to France with Business FranceMasterclass Welcome to France with Business France
Masterclass Welcome to France with Business France
 
Inria Tech Talk : Validez vos protocoles IoT avec la plateforme FIT/IoT-LAB
Inria Tech Talk : Validez vos protocoles IoT avec la plateforme FIT/IoT-LABInria Tech Talk : Validez vos protocoles IoT avec la plateforme FIT/IoT-LAB
Inria Tech Talk : Validez vos protocoles IoT avec la plateforme FIT/IoT-LAB
 
Dossier de Presse - chatbot
Dossier de Presse - chatbotDossier de Presse - chatbot
Dossier de Presse - chatbot
 
Masterclass pour s'implanter en Inde avec Business France Export et INPI
Masterclass pour s'implanter en Inde avec Business France Export et INPIMasterclass pour s'implanter en Inde avec Business France Export et INPI
Masterclass pour s'implanter en Inde avec Business France Export et INPI
 
Workshop CNIL - "Privacy Impact Assessment" : comment réaliser une analyse de...
Workshop CNIL - "Privacy Impact Assessment" : comment réaliser une analyse de...Workshop CNIL - "Privacy Impact Assessment" : comment réaliser une analyse de...
Workshop CNIL - "Privacy Impact Assessment" : comment réaliser une analyse de...
 
Masterclass - Vendre au secteur public de santé par l'UGAP
Masterclass - Vendre au secteur public de santé par l'UGAPMasterclass - Vendre au secteur public de santé par l'UGAP
Masterclass - Vendre au secteur public de santé par l'UGAP
 
Inria Tech Talk : IceSL, le logiciel d'impression 3D
Inria Tech Talk : IceSL, le logiciel d'impression 3DInria Tech Talk : IceSL, le logiciel d'impression 3D
Inria Tech Talk : IceSL, le logiciel d'impression 3D
 
Workshop CNIL - RGPD & Objets connectés
Workshop CNIL - RGPD & Objets connectésWorkshop CNIL - RGPD & Objets connectés
Workshop CNIL - RGPD & Objets connectés
 
Masterclass pour se développer en zone ASEAN @BF Export @INPI
 Masterclass pour se développer en zone ASEAN @BF Export @INPI Masterclass pour se développer en zone ASEAN @BF Export @INPI
Masterclass pour se développer en zone ASEAN @BF Export @INPI
 
Inria Tech Talk : Comment améliorer la qualité de vos logiciels avec STAMP
Inria Tech Talk : Comment améliorer la qualité de vos logiciels avec STAMPInria Tech Talk : Comment améliorer la qualité de vos logiciels avec STAMP
Inria Tech Talk : Comment améliorer la qualité de vos logiciels avec STAMP
 
Workshop CNIL - RGPD & données de santé 22 février
Workshop CNIL - RGPD & données de santé 22 févrierWorkshop CNIL - RGPD & données de santé 22 février
Workshop CNIL - RGPD & données de santé 22 février
 
Workshop les bonnes pratiques pour scaler sur le marché américain - 18 février
Workshop les bonnes pratiques pour scaler sur le marché américain - 18 févrierWorkshop les bonnes pratiques pour scaler sur le marché américain - 18 février
Workshop les bonnes pratiques pour scaler sur le marché américain - 18 février
 
Workshop Financement par la CCIPARIS-IDF
Workshop Financement par la CCIPARIS-IDF Workshop Financement par la CCIPARIS-IDF
Workshop Financement par la CCIPARIS-IDF
 
Masterclass : les grands enjeux de la #Smartcity
Masterclass : les grands enjeux de la #SmartcityMasterclass : les grands enjeux de la #Smartcity
Masterclass : les grands enjeux de la #Smartcity
 
Inria Tech Talk : Améliorez vos applications de robotique & réalité augmentée
Inria Tech Talk : Améliorez vos applications de robotique & réalité augmentéeInria Tech Talk : Améliorez vos applications de robotique & réalité augmentée
Inria Tech Talk : Améliorez vos applications de robotique & réalité augmentée
 
La Masterclass #RGPD #International @CNIL
La Masterclass #RGPD #International @CNILLa Masterclass #RGPD #International @CNIL
La Masterclass #RGPD #International @CNIL
 
Workshop IA : supercalculateur pour booster vos projets par GENCI
Workshop IA : supercalculateur pour booster vos projets par GENCIWorkshop IA : supercalculateur pour booster vos projets par GENCI
Workshop IA : supercalculateur pour booster vos projets par GENCI
 
Workshop Recrutement #Associés #Fondateurs
Workshop Recrutement #Associés #FondateursWorkshop Recrutement #Associés #Fondateurs
Workshop Recrutement #Associés #Fondateurs
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 

Inria Tech Talk - La classification de données complexes avec MASSICCC

  • 2. MASSICCC: A SaaS Platform for Clustering and Co-Clustering of Mixed Data https://massiccc.lille.inria.fr/ F. Laporte with B. Auder, C. Biernacki, G. Celeux, J. Demont, F. Langrognet, V. Kubicki, C. Poli, J. Renault, S. Iovleff May 29th 2019, TechTalk, Paris 2/52
  • 3. Outline 1 Introduction 2 Model-based clustering 3 Mixmod in MASSICCC 4 MixtComp in MASSICCC 5 BlockCluster in MASSICCC 6 Conclusion 3/52
  • 5. MASSICCC: Examples of Applications Market sales Cities’ similarities Predictive maintenance Health Data mining Large dataset Complex dataset 5/52
  • 6. MASSICCC?? A high quality and easy to use web platform where are transfered mature research clustering (and more) software towards (non academic) professionals 6/52
  • 7. Here is the computer you need! 7/52
  • 8. Clustering? Detect hidden structures in data sets −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5 1st MCA axis 2ndMCAaxis −1 −0.5 0 0.5 1 1.5 2 2.5 −1.5 −1 −0.5 0 0.5 1 1.5 1st MCA axis 2ndMCAaxis Low income Average income High income 8/52
  • 9. Large data sets1 1 S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29 9/52
  • 10. An opportunity for detecting weak signal 10/52
  • 11. Todays features: full mixed/missing 11/52
  • 12. Notations Data: n individuals: x = {x1, . . . , xn} = {xO , xM } in a space X of dimension d Observed individuals xO Missing individuals xM Aim: estimation of the partition z and the number of clusters K Partition in K clusters G1, . . . , GK : z = (z1, . . . , zn), zi = (zi1, . . . , ziK ) xi ∈ Gk ⇔ zih = I{h=k} Mixed, missing, uncertain Individuals x Partition z ⇔ Group ? 0.5 red 5 ? ? ? ⇔ ??? 0.3 0.1 green 3 ? ? ? ⇔ ??? 0.3 0.6 {red,green} 3 ? ? ? ⇔ ??? 0.9 [0.25 0.45] red ? ? ? ? ⇔ ??? ↓ ↓ ↓ ↓ continuous continuous categorical integer 12/52
  • 13. Outline 1 Introduction 2 Model-based clustering 3 Mixmod in MASSICCC 4 MixtComp in MASSICCC 5 BlockCluster in MASSICCC 6 Conclusion 13/52
  • 14. Parametric mixture model Parametric assumption: pk (x1) = p(x1; αk ) thus p(x1) = p(x1; θ) = K k=1 πk p(x1; αk ) Mixture parameter: θ = (π, α) with α = (α1, . . . , αK ) Model: it includes both the family p(·; αk ) and the number of groups K m = {p(x1; θ) : θ ∈ Θ} The number of free continuous parameters is given by ν = dim(Θ) Clustering becomes a well-posed problem. . . 14/52
  • 15. The clustering process in mixtures 1 Estimation of θ by ˆθ 2 Estimation of the conditional probability that xi ∈ Gk tik (ˆθ) = p(Zik = 1|Xi = xi ; ˆθ) = ˆπk p(xi ; ˆαk ) p(xi ; ˆθ) 3 Estimation of zi by maximum a posteriori (MAP) ˆzik = I{k=arg maxh=1,...,K tih( ˆθ)} 4 Model selection: BIC, ICL, . . . 15/52
  • 16. Outline 1 Introduction 2 Model-based clustering 3 Mixmod in MASSICCC 4 MixtComp in MASSICCC 5 BlockCluster in MASSICCC 6 Conclusion 16/52
  • 17. Possible datasets Continuous data 14 models on correlation matrix between variables (models’ selection) Categorical data Conditional independence Mixed data Conditional independence inter-type Conditional independence intra-type (symmetry between types) Distributions Continuous: Gaussian Categorical: Multinomial 17/52
  • 18. Estimation of θ Maximize the complete-likelihood over (θ, z) c (θ; x, z) = n i=1 K k=1 zik ln {πk p(xi ; αk )} CEM Maximize the observe-likelihood on θ (θ; x) = n i=1 ln p(xi ; θ) EM 18/52
  • 19. Principle of EM and CEM Initialization: θ0 Iteration noq: Step E: estimate probabilities tq = {tik (θq )} Step C: classify by setting tq = MAP({tik (θq )}) Step M: maximize θq+1 = arg maxθ c (θ; x, tq ) Stopping rule: iteration number or criterion stability Properties ⊕: simplicity, monotony, low memory requirement : local maxima (depends on θ0), linear convergence (EM) 19/52
  • 20. Prostate cancer data (without mixing data) Individuals: n = 475 patients with prostatic cancer grouped on clinical criteria into two Stages 3 and 4 of the disease Variables: d = 12 pre-trial variates were measured on each patient, composed by eight continuous variables (age, weight, systolic blood pressure, diastolic blood pressure, serum haemoglobin, size of primary tumour, index of tumour stage and histolic grade, serum prostatic acid phosphatase) and four categorical variables with various numbers of levels (performance rating, cardiovascular disease history, electrocardiogram code, bone metastases) Model: cond. indep. p(x1; αk ) = p(x1; αcont k ) · p(x1; αcat k ) 20/52
  • 22. Why should I use mixmod? Avdantage(s) Compare a lot of different models (continuous data) Analyse mixed data Disadvantage(s) Do not handle missing data Only continuous or categorical data 22/52
  • 23. Outline 1 Introduction 2 Model-based clustering 3 Mixmod in MASSICCC 4 MixtComp in MASSICCC 5 BlockCluster in MASSICCC 6 Conclusion 23/52
  • 24. Full mixed data: conditional independence everywhere2 The aim is to combine continuous, categorical, integer data, ordinal, ranking and functional data x1 = (xcont 1 , xcat 1 , xint 1 , . . .) The proposed solution is to mixed all types by inter-type conditional independence p(x1; αk ) = p(xcont 1 ; αcont k ) × p(xcat 1 ; αcat k ) × p(xint 1 ; αint k ) × . . . In addition, for symmetry between types, intra-type conditional independence Only need to define the univariate pdf for each variable type! Continuous: Gaussian Categorical: multinomial Integer: Poisson . . . 2 MixtComp software on the MASSICCC platform: https://massiccc.lille.inria.fr/ 24/52
  • 25. Missing data: MAR assumption and estimation Assumption on the missingness mecanism Missing At Randon (MAR): the probability that a variable is missing does not depend on its own value given the observed variables. Observed log-likelihood. . . (θ; xO ) = n i=1 log K k=1 πk p(xO i ; αk ) = n i=1 log       K k=1 πk xM i p(xO i , xM i ; αk)dxM i MAR assumption       25/52
  • 26. SEM algorithm3 A SEM algorithm to estimate θ by maximizing the observed-data log-likelihood Initialisation: θ(0) Iteration nb q: E-step: compute conditional probabilities p(xM , z|x0 ; θ(q) ) S-step: draw (xM(q) , z(q) ) from p(xM , z|x0 ; θ(q) ) M-step: maximize θ(q+1) = arg maxθ ln p(xO , xM(q) , z(q) ; θ) Stopping rule: iteration number Properties: simpler than EM and interesting properties! Avoid possibly difficult E-step in an EM Classical M steps Avoids local maxima The mean of the sequence (θ(q)) approximates ˆθ The variance of the sequence (θ(q)) gives confidence intervals 3 MixtComp software on the MASSICCC platform: https://massiccc.lille.inria.fr/ 26/52
  • 27. Prostate cancer data (with missing data)4 Individuals: 506 patients with prostatic cancer grouped on clinical criteria into two Stages 3 and 4 of the disease Variables: d = 12 pre-trial variates were measured on each patient, composed by eight continuous variables (age, weight, systolic blood pressure, diastolic blood pressure, serum haemoglobin, size of primary tumour, index of tumour stage and histolic grade, serum prostatic acid phosphatase) and four categorical variables with various numbers of levels (performance rating, cardiovascular disease history, electrocardiogram code, bone metastases) Some missing data: 62 missing values (≈ 1%) We forget the classes (Stages of the desease) for performing clustering Questions How many clusters? Which partition? 4 Byar DP, Green SB (1980): Bulletin Cancer, Paris 67:477-488 27/52
  • 28. Data upload without preprocessing 28/52
  • 30. Several quick result overviews. . . without post-processing 30/52
  • 31. Variable significance on global partition + similarity between variables 31/52
  • 32. Variable “SG” difference between clusters 32/52
  • 33. Variable “BM” difference between clusters 33/52
  • 34. Companies and MixtComp Modal (Rougegorge) Inriatech (Alstom, ArcelorMittal, D´ecathlon, ...) DiagRAMS technologies (predictive maintenance) 34/52
  • 35. Why should I use MixtComp? Avdantage(s) Analyse different kinds of data Handle missing and partially missing data Disadvantage(s) Do not use correlation structure (even with continuous data) 35/52
  • 36. Outline 1 Introduction 2 Model-based clustering 3 Mixmod in MASSICCC 4 MixtComp in MASSICCC 5 BlockCluster in MASSICCC 6 Conclusion 36/52
  • 37. High-dimensional (HD) data5 5 S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29 37/52
  • 38. From clustering to co-clustering [Govaert, 2011] 38/52
  • 39. Notations zi : the cluster of the row i wj : the cluster of the column j (zi , wj ): the block of the element xij (row i, column j) z = (z1, . . . , zn): partition of individuals in K custers of rows w = (w1, . . . , wd ): partition of variables in L clusters of columns (z, w): bi-partition of the whole data set x Both space partitions are respectively denoted by Z and W Restriction All variables are of the same kind (research in progress for overcoming that. . . ) 39/52
  • 40. MLE estimation: EM algorithm Observed log-likelihood: (θ; x) = log p(x; θ) Complete log-likelihood: c (θ; x, z, w) = log p(x, z, w; θ) = i,k zik log πk + k,l wjl log ρl + i,j,k,l zik wjl log p(xj i ; αkl ) E-step of EM (iteration q): Q(θ, θ(q) ) = E[ c (θ; x, z, w)|x; θ(q) ] = i,k p(zi = k|x; θ(q) ) t (q) ik ln πk + j,l p(wi = l|x; θ(q) ) s (q) jl ln ρl + i,j,k,l p(zi = k, wj = l|x; θ(q) ) e (q) ijkl ln p(xij ; αkl ) M-step of EM (iteration q): classical. For instance, for the Bernoulli case, it gives π (q+1) k = i t (q) ik n , ρ (q+1) l = j s (q) jl d , α (q+1) kl = i,j e (q) ijkl xij i,j e (q) ijkl 40/52
  • 41. MLE: intractable E step e (q) ijkl is usually intractable. . . Consequence of dependency between xij s (link between rows and columns) Involve KnLd calculus (number of possible blocks) Example: if n = d = 20 and K = L = 2 then 1012 blocks Example (cont’d): 33 years with a computer calculating 100,000 blocks/second Alternatives to EM Variational EM (numerical approx.): conditional independence assumption p(z, w|x; θ) ≈ p(z|x; θ)p(w|x; θ) SEM-Gibbs (stochastic approx.): replace E-step by a S-step approx. by Gibbs z|x, w; θ and w|x, z; θ 41/52
  • 42. Document clustering (1/2) Mixture of 1033 medical summaries and 1398 aeronautics summaries Lines: 2431 documents Columns: present words (except stop), thus 9275 unique words Data matrix: cross counting document×words Poisson model 42/52
  • 43. Document clustering (2/2) Results with 2×2 blocs Medline Cranfield Medline 1033 0 Cranfield 0 1398 43/52
  • 48. Why should I use BlockCluster? Advantage(s) Co-clustering (HD data) Disadvantage(s) Analyse one kind of data at a time 48/52
  • 49. Outline 1 Introduction 2 Model-based clustering 3 Mixmod in MASSICCC 4 MixtComp in MASSICCC 5 BlockCluster in MASSICCC 6 Conclusion 49/52
  • 50. Current work mixmod Missing Not At Random data (using logit distribution) =⇒ missing values impact the probability of missingness Patnership between CMAP/INRIA/Traumabase Work with C. Biernacki, G. Celeux, J. Josse and Y. Stroppa MixtComp Publish an R package on CRAN With you? New kind of data Complex missingness 2D clusters’ plot 50/52
  • 51. Use probabilistic modelling as a mathematical guideline Use the MASSICCC platform for user-friendly implementation User-friendly interpretation ”One for all” of clustering Low computer requirement needed Free software https://massiccc.lille.inria.fr/ Also check R packages (https://cran.r-project.org/) Rmixmod: https://cran.r-project.org/web/packages/Rmixmod/index.html blockcluster: https://cran.r-project.org/web/packages/blockcluster/index.html 51/52