A. Ekbal, S. Saha, D. Mollá, and K. Ravikumar.
Multi-Objective Optimization for Clustering of Medical
Publications (2013). Proceedings of the Australasian
Language Technology Association Workshop 2013
(ALTA 2013),
pp. 53-61, Brisbane, Australia. http://aclweb.org/anthology/U/U13/
Multi-Objective Optimization for Clustering of Medical Publications
Asif Ekbal (1), Sriparna Saha (1), Diego Mollá (2), K Ravikumar (1)
(1) Indian Institute of Technology Patna, Bihar, India
(2) Centre for Language Technology, Macquarie University, Sydney, Australia
ALTA 2013, Brisbane, Australia
Contents
Clustering for Evidence Based Medicine
Clustering as a MOO Problem
AMOSA-clus
Results
MOO for Medical Clustering
Asif Ekbal, Sriparna Saha, Diego Mollá, K Ravikumar
Evidence Based Medicine
http://laikaspoetnik.wordpress.com/2009/04/04/evidence-based-medicine-the-facebook-of-medicine/
The Dream
The Bottom-line Answer
A Means of Getting There
Input
QUESTION: Which treatments work best for hemorrhoids?
DOCUMENTS: [11289288] [12972967] [1442682] [15486746]
[16235372] [16252313] [17054255] [17380367]
⇒ clustering, summarisation ⇒
Output
1. Excision is the most effective treatment for thrombosed external
hemorrhoids. [11289288] [12972967] [15486746]
2. For prolapsed internal hemorrhoids, the best definitive treatment is
traditional hemorrhoidectomy. [17054255] [17380367]
3. Of nonoperative techniques, rubber band ligation produces the
lowest rate of recurrence. [1442682] [16252313] [16235372]
This Work
Each question is formulated as an independent clustering task.
Input
QUESTION: Which treatments work best for hemorrhoids?
DOCUMENTS: [11289288] [12972967] [1442682] [15486746]
[16235372] [16252313] [17054255] [17380367]
⇒ clustering ⇒
Output
1. [11289288] [12972967] [15486746]
2. [17054255] [17380367]
3. [1442682] [16252313] [16235372]
Related Work
Uses of Document Clustering
  Web search
  Topic detection and tracking
  Training data expansion
  Multi-document summarisation
Clustering in EBM
  Cluster search results
  Cluster based on interventions
  Shash & Mollá (2013): k-means clustering on our data set
Clustering and Multi-Objective Optimization
Most existing clustering techniques are based on a single
criterion of goodness.
Several criteria of goodness have been proposed.
So why not try several criteria at once?
Internal Validity: BIC-index, CH-index, Silhouette-index, DB-index, ...
External Validity: Minkowski scores, F-measures, ...
Information in Internal Validity Indices
Compactness
Measures the distance among the various elements of the
cluster.
We want clusters with short distances between their elements.
Separability
Measures the distance between clusters.
We want relatively large distances between clusters.
I-Index (Maulik & Bandyopadhyay, 2002)
\[ I(K) = \left( \frac{1}{K} \times \frac{E_1}{E_K} \times D_K \right)^p \]
where
  $K$ = number of clusters
  $E_K = \sum_{k=1}^{K} \sum_{j=1}^{n_k} d_e(c_k, x_j^k)$
  $D_K = \max_{i,j=1}^{K} d_e(c_i, c_j)$
  $c_j$ = centroid of the jth cluster
  $x_j^k$ = jth point of the kth cluster
  $n_k$ = total number of points present in the kth cluster
$E_1 / E_K$ increases I as the clusters become more compact.
$D_K$ increases I as the separation between clusters increases.
(p is a parameter set to 2 in this paper)
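The I-index can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: `E_1` is taken to be the value of `E_K` when all points form a single cluster (following the original definition of the index), centroids are cluster means, `d_e` is the Euclidean distance, and the function names are hypothetical.

```python
import math

def euclid(a, b):
    """Euclidean distance d_e between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    """Centroid (coordinate-wise mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def i_index(clusters, p=2):
    """I-index of a partition given as a list of K non-empty clusters.

    Assumes E_K > 0, i.e. at least one point differs from its centroid.
    """
    K = len(clusters)
    centroids = [mean_point(c) for c in clusters]
    all_points = [pt for c in clusters for pt in c]
    grand = mean_point(all_points)
    # E_1: total distance to the single centroid of the whole data set (assumption).
    e_1 = sum(euclid(grand, pt) for pt in all_points)
    # E_K: total distance of every point to the centroid of its own cluster.
    e_k = sum(euclid(centroids[k], pt)
              for k, c in enumerate(clusters) for pt in c)
    # D_K: largest distance between any two centroids.
    d_k = max(euclid(ci, cj) for ci in centroids for cj in centroids)
    return ((1.0 / K) * (e_1 / e_k) * d_k) ** p

tight = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
mixed = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]
print(i_index(tight) > i_index(mixed))  # the compact, well-separated partition scores higher
```

Larger values are better here, so a search procedure would treat the I-index as an objective to maximize.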
XB-Index (Xie & Beni, 1991)
\[ XB(K) = \frac{\sum_{i=1}^{K} \sum_{j=1}^{n} u_{ij}^2 \, \lVert x_j - c_i \rVert^2}{n \left( \min_{i \neq k} \lVert c_i - c_k \rVert^2 \right)} \]
where
  $K$ = number of clusters
  $c_j$ = centroid of the jth cluster
  $x_j^k$ = jth point of the kth cluster
  $n$ = total number of points present in the dataset
  $[u_{ij}]_{K \times n}$ = cluster membership matrix
The numerator quantifies the compactness of the clusters.
The denominator quantifies the separation between clusters.
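A small Python sketch of the XB-index follows. It assumes crisp memberships (each u_ij is 0 or 1) and squared Euclidean distances; the function names and the toy data are hypothetical and only illustrate the ratio of compactness to separation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def xb_index(points, centroids, membership):
    """XB-index; membership[i][j] is u_ij, the membership of point j in cluster i.

    Requires at least two distinct centroids so that the separation is non-zero.
    """
    n = len(points)
    K = len(centroids)
    # Numerator: membership-weighted squared distances (compactness).
    compactness = sum(membership[i][j] ** 2 * sq_dist(points[j], centroids[i])
                      for i in range(K) for j in range(n))
    # Denominator: n times the smallest squared distance between two centroids.
    separation = min(sq_dist(centroids[i], centroids[k])
                     for i in range(K) for k in range(K) if i != k)
    return compactness / (n * separation)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centroids = [(0.0, 0.5), (10.0, 0.5)]
membership = [[1, 1, 0, 0],
              [0, 0, 1, 1]]
print(xb_index(points, centroids, membership))
```

Because compactness sits in the numerator and separation in the denominator, smaller values indicate a better partition, so this index is an objective to minimize.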
MOO: The Pareto Optimal Front
[Figure: five candidate solutions, labelled 1 to 5, plotted against the two objectives f1 (maximize) and f2 (minimize).]
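The idea of a Pareto optimal front can be illustrated with a short Python sketch. A solution is kept if no other solution is at least as good on both objectives and strictly better on one. The objective values below are made up for illustration and are not read off the figure.

```python
def dominates(a, b):
    """True if solution a dominates b; each is an (f1, f2) pair.

    f1 is maximized and f2 is minimized, as on the slide.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(solutions):
    """Return the non-dominated solutions, in their original order."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]

# Hypothetical (f1, f2) values for five candidate solutions.
candidates = [(1, 1), (2, 2), (3, 4), (2, 3), (1, 2)]
print(pareto_front(candidates))  # [(1, 1), (2, 2), (3, 4)]
```

In this toy example (2, 3) is dominated by (2, 2), and (1, 2) is dominated by (1, 1), so neither belongs to the front.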
String Representation
AMOSA-clus implements simulated annealing (SA).
Centroid-based real-encoding:
Each member of the archive is encoded as a string that
represents the centroids of the partitions.
Each centroid is indivisible.
Given a fixed maximum number of clusters Kmax, the initial
number of clusters and their centroids are determined
randomly.
< 12.3 1.4 22.1 0.01 0.0 15.3 10.2 7.5 >
Represents four cluster centroids:
(12.3, 1.4), (22.1, 0.01), (0.0, 15.3), (10.2, 7.5)
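As a minimal sketch of this encoding, the flat string can be cut into consecutive groups of `dim` values, one group per centroid. The helper names `decode` and `encode` are hypothetical, and the dimensionality is assumed to be known in advance.

```python
def decode(string, dim):
    """Split a flat list of reals into centroids of dimension dim."""
    if len(string) % dim != 0:
        raise ValueError("string length must be a multiple of dim")
    return [tuple(string[i:i + dim]) for i in range(0, len(string), dim)]

def encode(centroids):
    """Flatten a list of centroids back into a single string of reals."""
    return [value for c in centroids for value in c]

string = [12.3, 1.4, 22.1, 0.01, 0.0, 15.3, 10.2, 7.5]
print(decode(string, dim=2))
# [(12.3, 1.4), (22.1, 0.01), (0.0, 15.3), (10.2, 7.5)]
```

Keeping each centroid as one tuple mirrors the statement that a centroid is indivisible: operators add, remove, or change whole centroids, never a fragment of one.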
Assignment of Points to the Clusters
Assignment of points and update of cluster centroids resemble an
iteration of the K-means clustering algorithm.
1. A point j is assigned to the cluster k whose centroid has the
minimum distance to j:
\[ k = \operatorname{argmin}_{i=1,\dots,K} d(x_j, c_i) \quad (1) \]
2. After all points are assigned to a cluster, the cluster centroids
are updated:
\[ c_i = \frac{\sum_{j=1}^{n_i} x_j^i}{n_i}, \quad 1 \le i \le K \quad (2) \]
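The two steps can be sketched in Python as follows. The distance d is assumed to be Euclidean here (the experiments also use cosine distance), and leaving the centroid of an empty cluster unchanged is an assumption of this sketch, not something the slides specify.

```python
import math

def euclid(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centroids):
    """Step 1: index of the nearest centroid for every point (Eq. 1)."""
    return [min(range(len(centroids)), key=lambda i: euclid(x, centroids[i]))
            for x in points]

def update(points, labels, centroids):
    """Step 2: replace each centroid by the mean of its members (Eq. 2)."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [x for x, lab in zip(points, labels) if lab == i]
        if not members:
            new_centroids.append(old)  # assumption: keep an empty cluster's centroid
            continue
        new_centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(len(old))))
    return new_centroids

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(1.0, 1.0), (9.0, 1.0)]
labels = assign(points, centroids)
print(labels)                             # [0, 0, 1, 1]
print(update(points, labels, centroids))  # [(0.0, 1.0), (10.0, 1.0)]
```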
Search Operators
Mutation 1: Perturb the centroid of a random cluster using a
Laplacian distribution:
\[ p(\epsilon) \propto e^{-\frac{|\epsilon - \mu|}{\delta}} \]
Mutation 2: Delete a random cluster centroid.
Mutation 3: Add a new cluster centroid.
Example string: < 3.5 1.5 2.1 4.9 1.6 1.2 >
1. If we choose centroid 2, then update centroid (2.1, 4.9). The
new string is: < 3.5 1.5 1.2 3.6 1.6 1.2 >
2. If we choose centroid 3, the new string will be:
< 3.5 1.5 2.1 4.9 >.
3. New string: < 3.5 1.5 2.1 4.9 1.6 1.2 9.7 2.5 >
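The three mutations could be sketched as below. Several details are assumptions made for illustration only: the location µ is taken to be the current coordinate value, the scale `delta` defaults to 1.0, and a new centroid is drawn uniformly from a caller-supplied range. Laplacian noise is generated as the difference of two exponential variates, which follows a Laplace distribution with location 0 and scale 1. All function names are hypothetical.

```python
import random

def perturb_centroid(centroids, rng, delta=1.0):
    """Mutation 1: add Laplacian noise to every coordinate of one random centroid."""
    out = list(centroids)
    i = rng.randrange(len(out))
    out[i] = tuple(mu + delta * (rng.expovariate(1.0) - rng.expovariate(1.0))
                   for mu in out[i])
    return out

def delete_centroid(centroids, rng):
    """Mutation 2: remove one random centroid (caller keeps at least two)."""
    out = list(centroids)
    del out[rng.randrange(len(out))]
    return out

def add_centroid(centroids, rng, low, high):
    """Mutation 3: append a new centroid drawn uniformly from [low, high] (assumption)."""
    new = tuple(rng.uniform(low, high) for _ in centroids[0])
    return list(centroids) + [new]

rng = random.Random(0)
current = [(3.5, 1.5), (2.1, 4.9), (1.6, 1.2)]
print(len(perturb_centroid(current, rng)))        # 3
print(len(delete_centroid(current, rng)))         # 2
print(len(add_centroid(current, rng, 0.0, 10.0))) # 4
```

Each operator returns a new list and leaves its input untouched, so a rejected move in the annealing loop can simply be discarded. A full implementation would also need to respect the limit Kmax when adding centroids.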
Selecting a Solution
The algorithm produces a set of alternative solutions.
Each solution is optimal according to some criteria.
Unsupervised Setting
  Choose one solution randomly.
Semi-supervised Setting
  Each question has a portion of known clustering assignments.
  Select the solution with best entropy in known assignments.
[Figure: the candidate solutions, labelled 1 to 5, plotted against f1 (maximize) and f2 (minimize).]
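The semi-supervised selection step might look like the sketch below. The slides do not define the entropy measure, so this code assumes the common size-weighted cluster entropy over the known labels and treats "best" as "lowest"; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def clustering_entropy(predicted, gold):
    """Size-weighted average entropy of gold labels inside each predicted cluster."""
    n = len(predicted)
    total = 0.0
    for cluster in set(predicted):
        inside = [g for p, g in zip(predicted, gold) if p == cluster]
        counts = Counter(inside)
        h = -sum((c / len(inside)) * math.log(c / len(inside))
                 for c in counts.values())
        total += (len(inside) / n) * h
    return total

def select_solution(solutions, known):
    """Pick the solution with the lowest entropy on the known assignments.

    solutions: list of label lists, one label per document.
    known: dict mapping a document index to its known cluster label.
    """
    idx = sorted(known)
    gold = [known[i] for i in idx]
    return min(solutions,
               key=lambda s: clustering_entropy([s[i] for i in idx], gold))

solutions = [[0, 0, 1, 1], [0, 1, 0, 1]]
known = {0: "a", 1: "a", 2: "b"}   # the label of document 3 is unknown
print(select_solution(solutions, known))  # [0, 0, 1, 1]
```

The first candidate keeps the two documents known to belong together in one cluster, so its entropy on the known part is zero and it is selected.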
Data
Clinical Inquiries from the Journal of Family Practice.
276 clinical questions (276 clustering tasks).
Each question has an average of 5.89 documents.
Which treatments work best for hemorrhoids?
1. Excision is the most effective treatment for thrombosed external
hemorrhoids. [11289288] [12972967] [15486746]
2. For prolapsed internal hemorrhoids, the best definitive treatment is
traditional hemorrhoidectomy. [17054255] [17380367]
3. Of nonoperative techniques, rubber band ligation produces the
lowest rate of recurrence. [1442682] [16252313] [16235372]
Results
Distance     AMOSA-clus1         AMOSA-clus2         K-means
Measure      best     average    best     average    (baseline)
Euclidean    0.190    0.249      0.177    0.235      0.240
Cosine       0.187    0.231      0.177    0.230      0.237
Unsupervised: Average solution is slightly better than baseline
(differences statistically significant).
Semi-supervised: Best solution is clearly better than baseline
(differences statistically significant).
Finding the Number of Clusters
AMOSA-clus1: Number of clusters as given by the original data.
Average 2.38 clusters.
AMOSA-clus2: Try several numbers of clusters and select the
solution that optimises I -index and XB-index.
Euclidean distance: Average 2.34 clusters.
Cosine distance: Average 2.51 clusters.
Finding the Number of Clusters
\[ \mathrm{error} = \frac{\sum_i (\mathrm{target}_i - \mathrm{predicted}_i)^2}{\text{\# of questions}} \]
Method                   Error
AMOSA-clus2 Cosine       1.90
AMOSA-clus2 Euclidean    1.91
k=1                      3.91
k=2                      2.14
k=3                      2.38
k=4                      4.61
Rule of Thumb            2.56
Cover                    1.98
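The error measure is a mean squared difference between the target and the predicted number of clusters, averaged over questions. A minimal Python sketch, with a hypothetical function name and made-up numbers:

```python
def cluster_count_error(targets, predictions):
    """Mean squared difference between target and predicted numbers of clusters."""
    if len(targets) != len(predictions):
        raise ValueError("one prediction is needed per question")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

# Hypothetical targets for four questions, compared with always predicting k=2.
targets = [2, 3, 1, 4]
print(cluster_count_error(targets, [2, 2, 2, 2]))  # 1.5
```

The fixed baselines in the table (k=1 to k=4) correspond to passing the same constant prediction for every question, as in the usage line above.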
Conclusions
Unsupervised setting: slight improvement over k-means baseline.
Semi-supervised setting: clear improvement over k-means baseline.
Number of clusters: better than standard methods.
Further Work
Test on other domains.
Test using other cluster validity indices.
Compare with other semi-supervised methods.
Questions?