DYS875-006

Reproducible Clusters from Microarray Research:
Whither?
Garge, Nikhil et al. 2005. BMC Bioinformatics.

Gota Morota

Dec. 9, 2009

Seminar

Clustering Gene Expression Profiles

Given: expression profiles for a set of genes or
experiments/individuals/time points
Do: organize profiles into clusters such that
genes in the same cluster are highly similar to each other
genes from different clusters have low similarity to each other

Goal
Understand general characteristics of data and infer something
about a gene based on how it relates to other genes
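As a toy illustration of what "highly similar" can mean for expression profiles, here is Pearson correlation, one common similarity choice (and the one mentioned later in these notes for K-means), in Python. This is illustrative only, not the paper's code:

```python
def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Two genes with parallel profiles are highly similar;
# an anti-correlated gene is dissimilar.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.1, 4.2, 5.9, 8.0]   # roughly 2 * g1
g3 = [4.0, 3.0, 2.0, 1.0]   # reversed trend
print(pearson(g1, g2))       # close to +1
print(pearson(g1, g3))       # exactly -1
```

A clustering then groups g1 with g2 (same cluster) and separates g3.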


Validity of Clustering Analysis
Clustering presents challenges because
there is no null hypothesis to test and no right answer
the result of clustering may be method sensitive (distance
metric, clustering algorithm)
there is no direct way to evaluate the validity of a cluster solution

⇓
Measure the replicability of clustering algorithms: algorithms that produce
classifications with greater replicability would be considered more valid.

Objective
Determine the replicability and degree of stability of commonly
used non-hierarchical clustering algorithms

Data

Real datasets

Simulated datasets


Table 1: List of microarray datasets considered for the study. Each dataset is
described by its name, source, and sample size (n). The table lists 37 datasets
in total: 19 GEO series, plus 7 named datasets and 11 further GEO series.

Name of the dataset                Source        Sample size (n)
GDS22                              GEO           80
GDS171                             GEO           30
GDS184                             GEO           30
GDS232                             GEO           46
GDS274                             GEO           80
GDS285                             GEO           20
GDS365                             GEO           66
GDS465                             GEO           90
GDS331                             GEO           70
GDS534                             GEO           74
GDS565                             GEO           48
GDS427                             GEO           24
GDS402                             GEO           12
GDS356                             GEO           14
GDS389                             GEO           16
GDS388                             GEO           18
GDS352                             GEO           12
GDS531                             GEO           172
GDS535                             GEO           12
Leukemia dataset                   [30]          70
Medulloblastoma Data Set           [31]          34
Prostate Cancer dataset            [32]          100
Gaffney Head and Neck data         [33]          60
Affymetrix Hu133A Latin Square     [34]          42
CNGI design experiment             Unpublished   24
Paired pre and post euglycaemic
insulin clamp skeletal muscle
biopsies                           Unpublished   106
GDS156                             GEO           12
GDS254                             GEO           16
GDS268                             GEO           24
GDS287                             GEO           16
GDS288                             GEO           16
GDS472                             GEO           14
GDS473                             GEO           12
GDS511                             GEO           12
GDS520                             GEO           20
GDS564                             GEO           28
GDS540                             GEO           18

Table 2: List of simulated microarray datasets. Each dataset has clustering
structure k = 6 (six clusters) with within-cluster correlation ρ set to
(0.33)^(1/2).

Dataset name   Sample size   Number of genes   Clusters
Dataset1       20            1200              6
Dataset2       100           1200              6
Dataset3       200           1200              6
Dataset4       500           1200              6
Dataset5       1000          1200              6
Dataset6       40            1200              6
Dataset7       60            1200              6
Dataset8       80            1200              6

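One standard way to induce the stated within-cluster correlation is a shared latent factor per cluster: g = sqrt(r)·f + sqrt(1−r)·ε gives pairwise correlation r between genes sharing the factor f. This is a sketch under that assumption, not the authors' simulation code:

```python
import random

def simulate_cluster(n_genes, n_samples, r, rng):
    """One simulated cluster: every gene shares a per-sample latent
    factor f, so any two genes in the cluster have correlation ~ r;
    genes in different clusters (independent factors) correlate ~ 0."""
    a, b = r ** 0.5, (1 - r) ** 0.5
    f = [rng.gauss(0, 1) for _ in range(n_samples)]
    return [[a * f[j] + b * rng.gauss(0, 1) for j in range(n_samples)]
            for _ in range(n_genes)]

def corr(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
r = 0.33 ** 0.5                 # within-cluster correlation from the paper
genes = simulate_cluster(2, 5000, r, rng)
print(round(corr(genes[0], genes[1]), 2))   # near r ~ 0.57
```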
Missing values
If we represent microarray data as a matrix with rows representing genes and
columns representing chips or samples, we filtered out all rows (genes) which
contained at least one null or missing expression value, because we do not know
the exact source(s) of the missing observation. Missing values can be due to
array damage, transcription errors, etc. Conventional clustering algorithms
require complete datasets to run, and extending these routines to accommodate
missing data was beyond the scope of the inquiry.

Standardization
Expression values were standardized to zero mean and unit variance:

Z_ij = (Ig_ij − mean(Ig_i)) / SD(g_i)

where Z_ij is the Z score computed for the expression level observed for gene i
in sample/subject j, and Ig_ij is the corresponding intensity.

Simulated datasets
The true clustering structure is not available in real datasets, so 8 datasets
were simulated with 1200 genes and sample sizes ranging from n = 20 to 1000,
where n is the number of subjects. All simulated datasets were structured for
6 clusters (k = 6) with correlation ρ set to (0.33)^(1/2) for all pairwise
combinations of genes within clusters and zero for all pairwise combinations of
genes in different clusters. To validate the methodology, higher stability
scores are predicted when 6 clusters are extracted in the fitted solutions.
Simulated datasets also help in understanding the stability behaviour for
values other than k = 6 (i.e., when the wrong number of clusters is extracted).
Table 2 lists the details of the simulated datasets.
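The filtering and standardization steps can be expressed compactly. The paper used R; this is a Python sketch, not the authors' code, and the use of the sample standard deviation (n − 1) is an assumption, since the denominator is not specified:

```python
import math

def preprocess(matrix):
    """Drop genes (rows) with any missing value, then z-score each
    remaining gene across samples (sample SD is an assumption here)."""
    complete = [row for row in matrix if all(v is not None for v in row)]
    standardized = []
    for row in complete:
        n = len(row)
        mean = sum(row) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
        standardized.append([(v - mean) / sd for v in row])
    return standardized

data = [
    [2.0, 4.0, 6.0],      # kept, then standardized
    [1.0, None, 3.0],     # dropped: contains a missing value
]
z = preprocess(data)
print(len(z))                        # 1
print([round(v, 3) for v in z[0]])   # [-1.0, 0.0, 1.0]
```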

Four Algorithms Considered

Four non-hierarchical (partitional) clustering algorithms were considered.
Non-hierarchical clustering requires the number of clusters (k) to be
pre-specified.
K-means (kmeans {stats})
Self Organizing Maps (SOM) (som {cluster})
Clustering LARge Applications (CLARA) (clara {cluster})
Fuzzy C-means (fanny {cluster})

Gota Morota

Reproducible Clusters from Microarray Research: Whither?
Seminar

DYS875-006

K-means

K-Means Clustering
1. Choose the number of clusters k
2. Randomly assign items to the k clusters
3. Calculate a new centroid for each of the k clusters
4. Calculate the distance of all items to the k centroids
5. Assign items to the closest centroid
6. Repeat until cluster assignments are stable

[Figure: instances and cluster centers (+) shown in a two-dimensional space]

Figure 1: K-Means
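The six steps above can be sketched in a few lines. This is a self-contained Python illustration, not the R kmeans() used in the paper; the empty-cluster re-seeding rule is an implementation choice of this sketch:

```python
import random

def kmeans(items, k, rng, iters=100):
    """K-means following the six steps above: random initial assignment,
    then alternate centroid update and reassignment until stable."""
    assign = [rng.randrange(k) for _ in items]          # step 2
    for _ in range(iters):
        centroids = []
        for c in range(k):                              # step 3: new centroids
            members = [x for x, a in zip(items, assign) if a == c]
            if not members:                             # re-seed an empty cluster
                members = [rng.choice(items)]
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        def dist2(x, c):                                # step 4: squared distance
            return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        new_assign = [min(range(k), key=lambda c: dist2(x, centroids[c]))
                      for x in items]                   # step 5: closest centroid
        if new_assign == assign:                        # step 6: stable -> stop
            break
        assign = new_assign
    return assign

# two well-separated groups in 2-D
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = kmeans(pts, 2, random.Random(1))
print(labels[:3], labels[3:])   # first three share one label, last three the other
```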

K-means

K-Means Clustering
Each iteration involves two steps:
1. assignment of instances to clusters
2. re-computation of the means

[Figure: instances are assigned to the nearest cluster center, then the means
are recomputed]

Figure 2: K-Means


Other Clustering Methods
Self Organizing Map (SOM)
Similar to K-means, but centroids are restricted to a
two-dimensional grid
Clustering LARge Applications (CLARA)
Extension of PAM (Partitioning Around Medoids)
can deal with much larger datasets than PAM
Fuzzy C-means
each gene belongs to each cluster with a membership degree (0–1)
a gene can therefore be associated with more than one cluster
for a hard partition, assign the gene to the cluster showing the
maximum degree of membership
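The Fuzzy C-means membership degrees follow the standard formula u_i = 1 / Σ_j (d_i/d_j)^(2/(m−1)). A small Python sketch (not the fanny() implementation; the fuzzifier m = 2 is an assumed default, and distinct cluster centres are assumed):

```python
def memberships(dists, m=2.0):
    """Fuzzy C-means membership degrees for one gene, given its
    distances to each cluster centre (standard formula; fuzzifier m)."""
    if any(d == 0.0 for d in dists):          # gene sits exactly on a centre
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    e = 2.0 / (m - 1.0)
    return [1.0 / sum((d / dj) ** e for dj in dists) for d in dists]

u = memberships([1.0, 2.0, 4.0])      # distances to 3 cluster centres
print([round(x, 3) for x in u])       # [0.762, 0.19, 0.048] -- sums to 1
print(u.index(max(u)))                # 0: hard assignment to closest centre
```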


Cluster Stability
Cramér's V²

V² = χ² / (N(k − 1))

where

χ² is the ordinary χ² test statistic for independence in
contingency tables

N is the number of genes to be clustered
k is the number of clusters extracted
Stability score
0: no relationship
1: perfect reproducibility
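Cramér's V² is straightforward to compute from two clusterings of the same genes. A Python sketch (not the authors' code) using the formula above:

```python
def cramers_v2(labels_a, labels_b):
    """Cramer's V^2 between two clusterings of the same N genes:
    the chi-square statistic of their contingency table divided by
    N(k - 1), where k is the number of clusters extracted."""
    a_vals, b_vals = sorted(set(labels_a)), sorted(set(labels_b))
    n = len(labels_a)
    # contingency table of joint cluster assignments
    table = {(a, b): 0 for a in a_vals for b in b_vals}
    for a, b in zip(labels_a, labels_b):
        table[(a, b)] += 1
    row = {a: sum(table[(a, b)] for b in b_vals) for a in a_vals}
    col = {b: sum(table[(a, b)] for a in a_vals) for b in b_vals}
    chi2 = sum((table[(a, b)] - row[a] * col[b] / n) ** 2 / (row[a] * col[b] / n)
               for a in a_vals for b in b_vals)
    k = max(len(a_vals), len(b_vals))
    return chi2 / (n * (k - 1))

left  = [0, 0, 1, 1, 2, 2]
right = [2, 2, 0, 0, 1, 1]        # same partition, clusters merely relabelled
print(cramers_v2(left, right))    # close to 1.0: perfect reproducibility
print(cramers_v2([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0: no relationship
```

Note that relabelling the clusters does not change the score, which is exactly what a replicability measure needs.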

Approach to Compute Cluster Stability

Microarray dataset with S subjects and N genes

Split dataset into “left” and “right” datasets

Left dataset with
S/2 subjects

Sub-sample left
dataset into sets of
various sample
sizes (3 to S/2)

Sub-sample right
dataset into sets of
various sample
sizes (3 to S/2)

Left sub-sampled
set of sample size
“x” (x ranges from
3 to S/2)

Right subsampled set of
sample size “x” (x
ranges from 3 to
S/2)

Cluster left set of
sample size “x” with
k (2 to 10) number
of clusters

Repeat 3
times

Right dataset with
S/2 subjects

Cluster right set of
sample size “x” with
k (2 to 10) number
of clusters

Compute chi-square (χ²) between clustering results

Cluster stability S(x,k) = Cramér's V²

Algorithm: cluster stability computation. The cluster stability score S(x,k) is
computed for every k (number of clusters) and every pair of sub-sampled sets of
sample size x.
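A schematic of this computation in Python (hypothetical code, not the authors' implementation; the quantile "clusterer" is a stand-in so the sketch stays self-contained, and in practice one of the four algorithms would be plugged in instead):

```python
import random

def cramers_v2(a, b, k):
    """Cramer's V^2 = chi^2 / (N(k-1)) over the k x k contingency table."""
    n = len(a)
    chi2 = 0.0
    for i in range(k):
        for j in range(k):
            obs = sum(1 for p, q in zip(a, b) if p == i and q == j)
            ra = sum(1 for p in a if p == i)
            cb = sum(1 for q in b if q == j)
            if ra and cb:
                exp = ra * cb / n
                chi2 += (obs - exp) ** 2 / exp
    return chi2 / (n * (k - 1))

def quantile_clusterer(sub_data, k):
    """Stand-in clusterer: rank genes by total expression and cut into
    k equal groups (NOT one of the paper's four algorithms)."""
    genes = sorted(sub_data, key=lambda g: sum(sub_data[g]))
    return {g: min(k - 1, i * k // len(genes)) for i, g in enumerate(genes)}

def stability(data, k, x, reps, clusterer, rng):
    """S(x, k): split subjects into left/right halves, draw sub-samples
    of size x from each half, cluster the genes of both sub-samples with
    k clusters, and score agreement with Cramer's V^2 (averaged over reps)."""
    genes = sorted(data)
    n_sub = len(data[genes[0]])
    subjects = list(range(n_sub))
    rng.shuffle(subjects)
    left, right = subjects[:n_sub // 2], subjects[n_sub // 2:]
    scores = []
    for _ in range(reps):
        lx, rx = rng.sample(left, x), rng.sample(right, x)
        la = clusterer({g: [data[g][j] for j in lx] for g in genes}, k)
        lb = clusterer({g: [data[g][j] for j in rx] for g in genes}, k)
        scores.append(cramers_v2([la[g] for g in genes],
                                 [lb[g] for g in genes], k))
    return sum(scores) / len(scores)

rng = random.Random(0)
# toy data: 30 genes x 12 subjects, two built-in gene groups
data = {g: [(1.0 if g < 15 else -1.0) + rng.gauss(0, 0.3) for _ in range(12)]
        for g in range(30)}
s = stability(data, k=2, x=3, reps=3, clusterer=quantile_clusterer, rng=rng)
print(0.0 <= s <= 1.0)   # True: the score lies between 0 and 1
```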


Figure 3: Cluster Stability Computation

Result on Real Datasets – (SOM)

Table 3: Stability results produced on a real dataset of sample size n = 16
(SOM). We split the dataset into two halves, each containing 8 subjects. The
left dataset is resampled 6 times, producing samples of sizes 3 to 8,
respectively; the right dataset is resampled likewise. We measured the strength
of the association between the clusters produced on every pair of samples (one
from the left dataset and one from the right, both of the same sample size)
using Cramér's V². Columns represent the number of clusters (k) and rows
represent sample sizes. The stability score for k = 10 and sample size 8 is
0.3699, i.e., 37% agreement between the clusters produced (k = 10) on a pair of
samples of size 8.

Sample size    k=2      k=3      k=4      k=5      k=6      k=7      k=8      k=9      k=10
3              0.5883   0.47091  0.4503   0.4028   0.3809   0.3600   0.3313   0.3107   0.2992
4              0.5799   0.48045  0.4244   0.3894   0.3650   0.3469   0.3132   0.2970   0.2858
5              0.5738   0.48296  0.4297   0.3982   0.3644   0.3430   0.3195   0.3013   0.2790
6              0.6433   0.54638  0.5142   0.4727   0.4405   0.4066   0.3817   0.3616   0.3396
7              0.6534   0.54821  0.5250   0.4826   0.4462   0.4211   0.3915   0.3679   0.3480
8              0.6759   0.58447  0.5520   0.5045   0.4700   0.4592   0.4160   0.3975   0.3699

As we deviate from k = 6, we observed a decline in stability scores. This
phenomenon can be clearly observed in CLARA, K-means and Fuzzy C-means
(Figure 5); hence, scores observed at k = 7 were always higher than those at
k = 2, since k = 7 is nearer to k = 6. CLARA and Fuzzy C-means maintained low
stability scores until a sample size of 30 was attained; stability scores then
gradually increased after this threshold. K-means and SOM showed superior
stability scores compared to CLARA until the sample size reached n = 30. It is
interesting to note that the average stability achieved is not greater than
0.55 for all four clustering routines, even at a sample size of n = 50. K-means
and SOM showed similar stability behaviors until the sample size reached
n = 100, and K-means showed high stability at smaller sample sizes as compared
to the other methods.

Figure 4: Stability result on a real dataset of sample size 16

Result on Real Datasets – among different algorithms
Real datasets

[Figure: stability coefficient (0 to 0.6) vs. sample size (3 to 48) on real
datasets, one curve each for SOM, K-means, Fuzzy C-means and CLARA]

Cluster Stability results. Stability scores for various values of k (2 to 10)
are computed on all 37 datasets. For each dataset, we selected the column (k)
showing the maximum summation of scores across sample sizes. Finally, all 37
columns selected on the 37 datasets were merged into one resultant column
representing stability scores with respect to sample size for that clustering
routine.

Figure 5: Stability results on real datasets


Result on Simulated Datasets – 1
[Figure: stability coefficient vs. sample size on simulated datasets with
rho = sqrt(0.33) and k = 6, one curve each for SOM, K-means, Fuzzy C-means
and CLARA]

Cluster Stability results on simulated datasets for k = 6. Datasets are
simulated with a clustering structure of k = 6 (6 clusters). The figure shows
high stability scores observed for k = 6 on all four clustering routines.

Figures 3 and 4 suggest that K-means showed more stable performance than the
other clustering routines considered (SOM, CLARA and Fuzzy C-means). K-means
and SOM showed similar behavior, as they are closely related to each other:
K-means centroids move freely in multidimensional space, whereas SOM centroids
are constrained to a two-dimensional grid. In SOM, the distance of each input
from all centroids weighted by the neighborhood kernel is considered, instead
of just the closest centroid [29]; SOM behaves as a conventional clustering
algorithm when the neighborhood kernel is zero [29]. We do not claim that these
scores capture the exact stability nature of a given dataset at a given sample
size, since they are generated over a number and variety of datasets.

• A stability criterion for choosing the number of clusters (k) for a given
dataset may be accomplished by computing scores for various values of k and
selecting the value that provides the maximum stability score.
• K-means, Fuzzy C-means and SOM showed fluctuation in scores even at large
sample sizes, whereas CLARA showed consistent behavior (a constant level of
scores) at larger sample sizes.
• CLARA maintained 100% stability for larger sample sizes (300–500), whereas
SOM and Fuzzy C-means did not.

Figure 6: Stability result on simulated datasets for k = 6

Result on Simulated Datasets – 2
[Figure, four panels of stability coefficient vs. sample size (3 to 498), one
curve per k (2 to 10): "Clara: rho = sqrt(0.33) & k=2 to 10", "K-means: rho =
sqrt(0.33) & k=2 to 10", "Fuzzy C-means: rho = sqrt(0.33) & k=2 to 10",
"SOM: rho = sqrt(0.33) & k=2 to 10"]

Cluster Stability results on simulated datasets for k = 2 to k = 10. Stability
scores for various values of k (2 to 10) are computed on all 8 simulated
datasets. For each dataset, we generate an output table of scores (explained in
the Algorithms section). We merge the 8 output tables into one table, with each
cell computed as the average of the corresponding cells in the 8 tables.
Finally, scores are plotted for all k values with respect to sample size. For
cleaner visualization, stability curves are not shown for all k values in
panels (c) and (d). (a) Scores plotted for CLARA for each k (2–10). (b) Scores
plotted for K-means for each k (2–10). (c) Scores plotted for Fuzzy C-means for
each k (2–10). (d) Scores plotted for SOM for each k (2–10).


Conclusion

microarray datasets may lack natural clustering structure
thereby producing low stability scores on all four methods
the algorithms studied may not be well suited to producing
reliable results
sample sizes typically used in microarray research may be too
small to support derivation of reliable clustering results


Clustering and Visualisation using R programming
Clustering and Visualisation using R programmingClustering and Visualisation using R programming
Clustering and Visualisation using R programming
 
Kaggle digits analysis_final_fc
Kaggle digits analysis_final_fcKaggle digits analysis_final_fc
Kaggle digits analysis_final_fc
 
LE03.doc
LE03.docLE03.doc
LE03.doc
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with R
 
CCC-Bicluster Analysis for Time Series Gene Expression Data
CCC-Bicluster Analysis for Time Series Gene Expression DataCCC-Bicluster Analysis for Time Series Gene Expression Data
CCC-Bicluster Analysis for Time Series Gene Expression Data
 
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
 
unsupervised classification.pdf
unsupervised classification.pdfunsupervised classification.pdf
unsupervised classification.pdf
 
MyStataLab Assignment Help
MyStataLab Assignment HelpMyStataLab Assignment Help
MyStataLab Assignment Help
 
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO...
 ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO... ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO...
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO...
 
iiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdfiiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdf
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
 
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
 
Machine learning for_finance
Machine learning for_financeMachine learning for_finance
Machine learning for_finance
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
 
Data mining maximumlikelihood
Data mining maximumlikelihoodData mining maximumlikelihood
Data mining maximumlikelihood
 
Data mining maximumlikelihood
Data mining maximumlikelihoodData mining maximumlikelihood
Data mining maximumlikelihood
 
Data mining maximumlikelihood
Data mining maximumlikelihoodData mining maximumlikelihood
Data mining maximumlikelihood
 
Data miningmaximumlikelihood
Data miningmaximumlikelihoodData miningmaximumlikelihood
Data miningmaximumlikelihood
 
Data mining maximumlikelihood
Data mining maximumlikelihoodData mining maximumlikelihood
Data mining maximumlikelihood
 

Kürzlich hochgeladen

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Kürzlich hochgeladen (20)

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Garge, Nikhil et al. 2005. Reproducible Clusters from Microarray Research: Whither? BMC Bioinformatics.

  • 1. DYS875-006 Reproducible Clusters from Microarray Research: Whither? Garge, Nikhil et al. Presented by Gota Morota, Dec. 9, 2009.
  • 2. Clustering Gene Expression Profiles. Given: expression profiles for a set of genes or experiments/individuals/time points. Do: organize the profiles into clusters such that genes in the same cluster are highly similar to each other, and genes from different clusters have low similarity to each other. Goal: understand general characteristics of the data and infer something about a gene based on how it relates to other genes.
  • 3. Validity of Clustering Analysis. Clustering presents challenges because there is no null hypothesis to test and no right answer; the result of clustering may be method-sensitive (distance metric, clustering algorithm); and there is no way to evaluate the validity of a cluster solution. ⇒ Measure the replicability of clustering algorithms: clustering methods that produce classifications with greater replicability would be considered more valid. Objective: determine the replicability and degree of stability of commonly used non-hierarchical clustering algorithms.
  • 4. Data. Real and simulated datasets were considered.

    Table 1: List of microarray datasets considered for the study. Each dataset is described by its name, source, and sample size (n); 37 datasets in total (19 in the first three columns, 18 in the last three). Thirty of the real datasets are GEO series (GDS22, GDS171, GDS184, GDS232, GDS274, GDS285, GDS365, GDS465, GDS331, GDS534, GDS565, GDS427, GDS402, GDS356, GDS389, GDS388, GDS352, GDS531, GDS535, GDS156, GDS254, GDS268, GDS287, GDS288, GDS472, GDS473, GDS511, GDS520, GDS564, GDS540), with sample sizes ranging from n = 12 to n = 172. The remainder are the Leukemia dataset, the Medulloblastoma dataset, the Prostate Cancer dataset, the Gaffney Head and Neck data, the Affymetrix Hu133A Latin Square, a CNGI design experiment, and paired pre- and post-euglycaemic-insulin-clamp skeletal muscle biopsies.

    Table 2: List of simulated microarray datasets. Each has clustering structure k = 6 (six clusters) with correlation rho = (0.33)^(1/2) within clusters.

      Dataset    Sample size (n)   Genes   Clusters (k)
      Dataset1         20           1200        6
      Dataset2        100           1200        6
      Dataset3        200           1200        6
      Dataset4        500           1200        6
      Dataset5       1000           1200        6
      Dataset6         40           1200        6
      Dataset7         60           1200        6
      Dataset8         80           1200        6

    Representing microarray data as a matrix with rows as genes and columns as chips/samples, all rows (genes) containing at least one missing or null expression value were filtered out, since the source of a missing value (array damage, transcription errors, etc.) is unknown and conventional clustering algorithms require complete datasets; extending the clustering routines to accommodate missing data was beyond the scope of the inquiry. Expression values were standardized to zero mean and unit variance: Z_ij = (I_gij − mean(I_gi)) / SD_gi, where Z_ij is the Z score for the expression level observed for gene i in sample/subject j, and I_gij is the observed intensity. Eight datasets were simulated with 1200 genes and sample sizes ranging from n = 20 to 1000 subjects, all structured for 6 clusters (k = 6) with correlation rho = (0.33)^(1/2) for all pairwise combinations of genes within a cluster and zero correlation for genes in different clusters. To validate the methodology, higher stability scores would be predicted when extracting 6 clusters from the fitted solutions; the simulated datasets also help characterize the stability behaviour for values other than k = 6, i.e., when the wrong number of clusters is extracted.
  • 5. Four Algorithms Considered. Four non-hierarchical (partitional) clustering algorithms; non-hierarchical clustering requires that the number of clusters (k) be pre-specified: K-means (kmeans {stats}), Self-Organizing Maps (SOM) (som {cluster}), Clustering LARge Applications (CLARA) (clara {cluster}), and Fuzzy C-means (fanny {cluster}).
  • 6. K-means. K-Means Clustering:
    1. Choose the number of clusters, k
    2. Randomly assign items to the k clusters
    3. Calculate a new centroid for each of the k clusters
    4. Calculate the distance of all items to the k centroids
    5. Assign each item to its closest centroid
    6. Repeat until cluster assignments are stable
    Instances are represented by vectors of real values; the k cluster centers live in the same space as the instances, and each cluster is represented by a centroid vector. [Figure 1: K-Means — instances and a cluster center in a two-dimensional example]
  • 7. K-means. Each iteration involves two steps: (1) assignment of instances to clusters, and (2) re-computation of the means. [Figure 2: K-Means — assignment, then re-computation of means]
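The six steps and the two-step iteration above can be sketched in a few lines of Python. This is a minimal illustration only: the paper itself ran R's kmeans from the stats package, and the function and variable names here are my own.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means following the slide's six steps (illustrative only)."""
    rng = random.Random(seed)
    # Steps 1-2: choose k and randomly assign items to the k clusters
    assign = [rng.randrange(k) for _ in points]
    centroids = []
    dim = len(points[0])
    for _ in range(max_iter):
        # Step 3: recompute the centroid of each cluster
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if not members:                      # reseed an empty cluster
                members = [rng.choice(points)]
            centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(dim)))
        # Steps 4-5: assign every item to its closest centroid
        new_assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Step 6: repeat until assignments are stable
        if new_assign == assign:
            break
        assign = new_assign
    return assign, centroids

# Two well-separated groups of 2-D "expression profiles"
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = kmeans(points, k=2)
```

Because the initial assignment is random (here made repeatable with a fixed seed), different starts can converge to different solutions — one reason the paper measures replicability rather than trusting a single run.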
  • 8. Other Clustering Methods. Self-Organizing Map (SOM): similar to K-means, but the centroids are restricted to a two-dimensional grid. Clustering LARge Applications (CLARA): an extension of PAM (Partitioning Around Medoids) that can deal with much larger datasets than PAM. Fuzzy C-means: each gene belongs to each cluster with a membership degree (0–1), so a gene can effectively be assigned to more than one cluster; here each gene is assigned to the cluster showing the maximum degree of membership.
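The max-membership rule for Fuzzy C-means is just an argmax over each gene's membership vector. A toy sketch, with made-up gene names and membership degrees (all values hypothetical):

```python
# Hypothetical fuzzy memberships: one row per gene, one degree per cluster,
# each row summing to 1 as in Fuzzy C-means.
memberships = {
    "geneA": [0.70, 0.20, 0.10],
    "geneB": [0.10, 0.15, 0.75],
    "geneC": [0.34, 0.33, 0.33],
}

# Hard assignment: give each gene the cluster with maximum membership degree.
hard = {gene: max(range(len(deg)), key=deg.__getitem__)
        for gene, deg in memberships.items()}
```

Note how geneC is assigned almost arbitrarily (0.34 vs 0.33): hardening a fuzzy solution discards exactly the ambiguity information that fuzzy clustering provides.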
  • 9. Cluster Stability. Cramér's v²:

      v² = χ² / (N(k − 1))

    where χ² is the ordinary χ² test statistic for independence in contingency tables, N is the number of genes to be clustered, and k is the number of clusters extracted. The stability score ranges from 0 (no relationship) to 1 (perfect reproducibility).
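The formula is easy to check with a small pure-Python function (my own sketch; the χ² is computed on the contingency table that cross-tabulates two cluster solutions over the same genes):

```python
def cramers_v2(table):
    """Cramer's v^2 = chi^2 / (N * (k - 1)) for a contingency table
    cross-tabulating two cluster solutions over the same N genes."""
    k = len(table)                      # number of clusters extracted
    n_cols = len(table[0])
    N = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[r][c] for r in range(k)) for c in range(n_cols)]
    chi2 = 0.0
    for r in range(k):
        for c in range(n_cols):
            expected = row_tot[r] * col_tot[c] / N
            if expected:
                chi2 += (table[r][c] - expected) ** 2 / expected
    return chi2 / (N * (k - 1))

# Perfect reproducibility: the two solutions agree on every gene -> v^2 = 1
perfect = [[10, 0], [0, 10]]
# No relationship: the two solutions are independent -> v^2 = 0
independent = [[5, 5], [5, 5]]
```

For the diagonal table, χ² = N = 20 and v² = 20 / (20 · (2 − 1)) = 1; for the uniform table, every observed cell equals its expected count, so χ² = 0 and v² = 0.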
  • 10. Approach to Compute Cluster Stability. Given a microarray dataset with S subjects and N genes:
    1. Split the dataset into "left" and "right" datasets of S/2 subjects each.
    2. Sub-sample each half into sets of various sample sizes x (x ranges from 3 to S/2).
    3. Cluster the left and the right sub-sampled set of sample size x with k (2 to 10) clusters; repeat 3 times.
    4. Compute the chi-square (χ²) statistic between the two clustering results.
    5. Cluster stability S(x, k) = Cramér's v².
    [Figure 3: Cluster stability computation. The stability score S(x, k) is computed for every k (number of clusters) and every pair of sub-sampled sets of sample size x.]
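End to end, the split / sub-sample / cluster / score loop might look like the following self-contained sketch. It is deliberately simplified: the gene clusterer here is a toy mean-expression binning routine standing in for K-means/SOM/CLARA/fanny, the single split stands in for the paper's repeated resampling, and all names are my own.

```python
import random

def cluster_genes(X, subjects, k):
    # Toy deterministic clusterer (stand-in for K-means/SOM/CLARA/fanny):
    # bin the genes into k equal-sized groups by their mean expression
    # over the chosen subset of subjects.
    means = [sum(row[s] for s in subjects) / len(subjects) for row in X]
    order = sorted(range(len(X)), key=means.__getitem__)
    labels = [0] * len(X)
    for rank, g in enumerate(order):
        labels[g] = rank * k // len(X)
    return labels

def v2_from_labels(left, right, k):
    # Contingency table of left vs right labels, then v^2 = chi^2 / (N (k-1)).
    N = len(left)
    table = [[0] * k for _ in range(k)]
    for a, b in zip(left, right):
        table[a][b] += 1
    rt = [sum(row) for row in table]
    ct = [sum(table[r][c] for r in range(k)) for c in range(k)]
    chi2 = sum((table[r][c] - rt[r] * ct[c] / N) ** 2 / (rt[r] * ct[c] / N)
               for r in range(k) for c in range(k) if rt[r] and ct[c])
    return chi2 / (N * (k - 1))

def stability(X, k, x, seed=0):
    # Split the S subjects into left/right halves, draw x subjects from each
    # half, cluster the genes on each sub-sample, and score the agreement.
    rng = random.Random(seed)
    S = len(X[0])
    subjects = list(range(S))
    rng.shuffle(subjects)
    left_half, right_half = subjects[:S // 2], subjects[S // 2:]
    left = cluster_genes(X, rng.sample(left_half, x), k)
    right = cluster_genes(X, rng.sample(right_half, x), k)
    return v2_from_labels(left, right, k)

# A dataset whose genes have constant expression reproduces perfectly:
X = [[float(g)] * 16 for g in range(30)]   # 30 genes, 16 subjects
score = stability(X, k=3, x=4)
```

With constant per-gene expression, both halves yield identical labels and the score is 1.0; with noisy data the score drops toward 0, which is exactly the sample-size and k dependence the paper tabulates.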
  • 11. Result on Real Datasets — (SOM). Table 3: stability scores produced on a real dataset of sample size n = 16. The dataset is split into two halves of 8 subjects each; each half is resampled 6 times, producing samples of sizes 3 to 8. The strength of the association between the clusters produced on every pair of samples (one from the left and one from the right dataset, both of the same size) is measured with Cramér's v². Columns give the number of clusters (k) and rows the sample size. For example, the score for k = 10 at sample size 8 is 0.3699, i.e., 37% agreement between the clusters produced on that pair of samples.

      Sample size   k=2     k=3      k=4     k=5     k=6     k=7     k=8     k=9     k=10
      3             0.5883  0.47091  0.4503  0.4028  0.3809  0.3600  0.3313  0.3107  0.2992
      4             0.5799  0.48045  0.4244  0.3894  0.3650  0.3469  0.3132  0.2970  0.2858
      5             0.5738  0.48296  0.4297  0.3982  0.3644  0.3430  0.3195  0.3013  0.2790
      6             0.6433  0.54638  0.5142  0.4727  0.4405  0.4066  0.3817  0.3616  0.3396
      7             0.6534  0.54821  0.5250  0.4826  0.4462  0.4211  0.3915  0.3679  0.3480
      8             0.6759  0.58447  0.5520  0.5045  0.4700  0.4592  0.4160  0.3975  0.3699

    CLARA and Fuzzy C-means maintained low stability scores until a sample size of 30 was attained; scores then gradually increased past this threshold. K-means and SOM showed superior stability scores compared with CLARA until the sample size reached n = 30. Notably, the average stability achieved did not exceed 0.55 for any of the four clustering routines, even at a sample size of n = 50. On the simulated data, deviating from k = 6 produced a decline in stability scores; scores observed at k = 7 were always higher than at k = 2, since k = 7 is nearer to k = 6 (Figure 5).
  • 12. Result on Real Datasets — comparison among algorithms. [Figure: Cluster Stability results on real datasets. Stability scores for k = 2 to 10 were computed on all 37 datasets; for each dataset, the column (k) with the maximum sum of scores across sample sizes was selected, and the 37 selected columns were merged into one resultant column of stability score versus sample size for that clustering routine. Curves are shown for SOM, K-means, Fuzzy C-means, and CLARA, with stability coefficients plotted on a 0–0.6 axis against sample sizes 3 to about 50.] K-means showed high stability at smaller sample sizes compared with the other methods.
  • 13. Result on Simulated Datasets — 1. [Figure 4: Cluster Stability results on simulated datasets for k = 6. Datasets were simulated with clustering structure k = 6 (six clusters); high stability scores were observed for k = 6 on all four clustering routines.] Figures 3 and 4 suggest that K-means showed more stable performance than the other clustering routines considered (SOM, CLARA, and Fuzzy C-means). K-means and SOM showed similar behaviour because they are closely related: K-means centroids move freely in multidimensional space, whereas SOM centroids are constrained to a two-dimensional grid. In SOM, the distance of each input from all centroids weighted by the neighborhood kernel is considered, instead of just the closest centroid; SOM behaves as a conventional clustering algorithm when the neighborhood kernel is zero [29]. Microarray datasets, in general, may lack clustering structure. The authors do not claim these results capture the exact stability nature of a given dataset at a given sample size, since they are generated over a variety of datasets; nonetheless, they suggest performing cluster analysis with a stability criterion to obtain more stable clustering solutions. Selecting the number of clusters (k) for a given dataset may be accomplished by computing stability for various values of k and choosing the k that provides the maximum stability score. K-means, Fuzzy C-means, and SOM showed fluctuation in scores even at large sample sizes, whereas CLARA showed consistent behaviour (a constant level of scores) at larger sample sizes. CLARA maintained 100% stability for larger sample sizes (300–500), whereas SOM and Fuzzy C-means failed to do so.
  • 14. Result on Simulated Datasets — 2. [Figure 5: Cluster Stability results on simulated datasets for k = 2 to k = 10. Stability scores for k = 2 to 10 were computed on all 8 simulated datasets (rho = (0.33)^(1/2)); for each dataset an output table of scores was generated, and the 8 tables were merged into one by averaging corresponding cells. Scores are plotted for all k values against sample size. Panels: (a) CLARA, (b) K-means, (c) Fuzzy C-means, (d) SOM. For cleaner visualization, stability curves for not all k values are shown in panels c and d.]
  • 15. Conclusion. Microarray datasets may lack natural clustering structure, thereby producing low stability scores on all four methods; the algorithms studied may not be well suited to producing reliable results; and the sample sizes typically used in microarray research may be too small to support derivation of reliable clustering results.