1. A Framework for Online Clustering
Based on Evolving Semi-Supervision
Guilherme Alves, Maria Camila N. Barioni and Elaine Faria
Universidade Federal de Uberlândia
2. Outline
■ Introduction
■ The CABESS Framework
■ Pointwise CABESS
■ Experiments and Results
■ Conclusion
6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 2
3. Introduction
Motivation
■ The desired organization for the data may change over time.
Semi-supervised approaches may be useful for guiding clustering
algorithms in the adaptation process.
■ Additional information (semi-supervision) may change over time.
■ It may cause clustering transitions: birth, split, or merge.
7. Introduction
Objective and Research Questions
Goal: to provide a framework that is able to use and maintain
semi-supervision correctly to enable efficient and effective online
clustering processes.
Q1. To what extent does semi-supervision improve clustering
effectiveness when there are external clustering transitions over time?
Q2. Are there major differences in clustering effectiveness between
semi-supervised clustering approaches based on feedback and those
based on labels?
Q3. How does the feedback window size variation affect the
semi-supervision information and clustering effectiveness?
Q4. How efficient is our approach compared to existing semi-supervised
approaches?
16. Introduction
Main Contributions
■ The introduction of CABESS (Cluster Adaptation Based on
Evolving Semi-Supervision)
■ A framework for online clustering using semi-supervision in the form
of feedback.
■ A strategy that extracts semi-supervision information from
feedback given in the form of labels.
■ An approach to keep the labels consistent over time.
17. The CABESS Framework
[Architecture diagram] The CABESS framework processes the stream 𝒟 through five components: (1) Summarization, (2) Clustering, (3) Semi-Supervision Deduction, (4) Semi-supervised Clustering, and (5) Transition Detection, backed by a Partition Set Storage. Decision point A verifies whether this is the first clustering performed; decision point B verifies whether new instances have been generated or the user is satisfied with the cluster quality.
18. The CABESS Framework
Pointwise CABESS
[Architecture diagram] Pointwise CABESS instantiates the framework's components as follows:
■ (1) Summarization: BIRCH [Zhang et al. 1996]
■ (2) Clustering: DBScan [Ester et al. 1996]
■ (3) Semi-Supervision Deduction
■ (4) Semi-supervised Clustering: SSDBScan [Lelis and Sander 2009]
■ (5) Transition Detection: MONIC [Spiliopoulou et al. 2006]
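To make the instantiation concrete, one iteration of the pipeline could be sketched as below, assuming scikit-learn's Birch and DBSCAN as stand-ins for components (1) and (2). SSDBScan and MONIC have no scikit-learn implementations, so those steps are stubbed; the function name and parameter values are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.cluster import Birch, DBSCAN

def pointwise_cabess_step(X, labels=None):
    """One iteration of the Pointwise CABESS pipeline (sketch)."""
    # (1) Summarization: BIRCH compresses the incoming batch into
    # subcluster centroids (the "summarized instances").
    summarizer = Birch(threshold=0.3, n_clusters=None)
    summarizer.fit(X)
    summaries = summarizer.subcluster_centers_

    if labels is None:
        # (2) First clustering: plain DBSCAN over the summaries.
        partition = DBSCAN(eps=3.0, min_samples=1).fit_predict(summaries)
    else:
        # (4) Later iterations would run SSDBScan over the summaries plus
        # the deduced labels; SSDBScan is not in scikit-learn, so this
        # branch is stubbed with plain DBSCAN here.
        partition = DBSCAN(eps=3.0, min_samples=1).fit_predict(summaries)

    # (5) A MONIC-style detector would compare `partition` against the
    # stored previous partition to flag birth/split/merge transitions.
    return summaries, partition

# Two well-separated groups of points should yield two clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(5, 0.2, (30, 2))])
summaries, partition = pointwise_cabess_step(X)
print(len(set(partition)))  # -> 2
```

The summarization step is what keeps the pipeline online: later stages operate on a small set of centroids rather than on every raw instance.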
20. The CABESS Framework » Pointwise CABESS
Extracting labels from feedback
1. Feedback → (extraction) → Instance-level labels
■ A neighborhood 𝑁 is defined as a set of instances that must be in the
same cluster.
[Figure: instances receiving positive feedback (✓), panels (a) and (b)]
21. The CABESS Framework » Pointwise CABESS
Extracting labels from feedback
1. Feedback → (extraction) → Instance-level labels
■ A neighborhood 𝑁 is defined as a set of instances that must be in the
same cluster.
■ Same previous cluster AND received positive feedback → same label.
■ Received displacement feedback → assign the label of the destination
neighborhood.
[Figure: feedback marks (✓) converted into instance-level labels 1 and 2, panels (a)–(c)]
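The two extraction rules above can be sketched in a few lines of Python. The function and its data shapes are hypothetical simplifications for illustration, not part of CABESS itself.

```python
def extract_labels(prev_cluster, feedback, neighborhood_label):
    """Turn user feedback into instance-level labels (sketch).

    prev_cluster: {instance_id: cluster_id} from the previous partition.
    feedback: {instance_id: ("positive", None) | ("displace", dest_cluster)}.
    neighborhood_label: {cluster_id: label} -- the label shared by each
        neighborhood (set of instances that must be clustered together).
    """
    labels = {}
    for i, (kind, dest) in feedback.items():
        if kind == "positive":
            # Rule 1: same previous cluster + positive feedback -> same label.
            labels[i] = neighborhood_label[prev_cluster[i]]
        elif kind == "displace":
            # Rule 2: displacement feedback -> label of the destination
            # neighborhood.
            labels[i] = neighborhood_label[dest]
    return labels

labels = extract_labels(
    {"a": 1, "b": 1, "c": 2},                         # previous partition
    {"a": ("positive", None), "c": ("displace", 1)},  # user feedback
    {1: "L1", 2: "L2"})                               # neighborhood labels
print(labels)  # -> {'a': 'L1', 'c': 'L1'}
```

Instance "b" received no feedback, so it stays unlabeled: only explicit feedback produces semi-supervision.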
22. The CABESS Framework » Pointwise CABESS
Extracting labels from feedback
2. Instance-level labels → (deduction) → Summarized-level labels
■ Performed as a propagation task.
■ If a summarized instance holds labeled instances with different labels,
split it to obtain purified summarized instances, i.e., summarized
instances whose labeled instances all carry the same label.
[Figure: a summarized instance mixing labels 1 and 2 is split into pure summarized instances, panels (c) and (d)]
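A minimal sketch of this splitting step, under the illustrative simplification that each summarized instance is represented only by the multiset of labels of its labeled members:

```python
def purify(summarized):
    """Propagate instance labels to the summarized level (sketch).

    summarized: list of label multisets, one per summarized instance,
    e.g. [['1', '1'], ['1', '2']]. Summarized instances whose labeled
    members disagree are split by label, so every output summarized
    instance is pure (carries a single label).
    """
    pure = []
    for labels in summarized:
        distinct = sorted(set(labels))
        if len(distinct) <= 1:
            pure.append(labels)        # already pure: keep as-is
        else:
            for lab in distinct:       # mixed: split into one group per label
                pure.append([l for l in labels if l == lab])
    return pure

print(purify([["1", "1"], ["1", "2", "2"]]))  # -> [['1', '1'], ['1'], ['2', '2']]
```

In the real framework the split would also partition the underlying raw instances of the summary, not just its labels; the sketch keeps only the label bookkeeping.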
23. The CABESS Framework » Pointwise CABESS
Dealing with obsolete labels
■ Obsolete labels: labels assigned to instances whose clusters no
longer exist.
■ Minimizing the problem: adoption of a transition detector.
■ Better neighborhood management: when a cluster survives, both its
neighborhood and its associated labels are preserved.
[Figure: a cluster split between timestamps — the original cluster does not exist at 𝑡 anymore, so label 2 becomes obsolete]
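A sketch of the pruning idea, assuming the transition detector (e.g. a MONIC-style component) reports which clusters survive into the current timestamp; the names are illustrative, not the authors' API:

```python
def prune_obsolete(labels, surviving_clusters):
    """Discard obsolete labels (sketch).

    labels: {instance_id: cluster_label} accumulated from feedback.
    surviving_clusters: set of clusters the transition detector reports
        as surviving into the current timestamp.
    Labels of surviving clusters are preserved along with their
    neighborhoods; labels of dissolved clusters (e.g. after a split)
    become obsolete and are dropped.
    """
    return {i: c for i, c in labels.items() if c in surviving_clusters}

print(prune_obsolete({"a": 1, "b": 2, "c": 1}, {1}))  # -> {'a': 1, 'c': 1}
```

Without this step, labels from dissolved clusters would keep constraining future partitions toward a structure that no longer exists in the stream.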
24. Experiments
Datasets
Name     # instances   d    # classes   Reference                Type
DB7      9,050         2    8           [Silva et al. 2015]      Synthetic
SYN3     5,000         2    3           streamMOA                Synthetic
SYN4     10,000        3    5           streamMOA                Synthetic
FROGS    1,484         8    4           [Colonna et al. 2016]    Real
IPEA     5,564         5    27          IPEA                     Real
KDD'99   24,692        19   11          UCI                      Real
25. Experiments
Comparison Methods
■ Unsupervised (DBScan)
■ Consists of periodically executing a clustering algorithm without any
semi-supervision.
■ Semi-supervised (SSDBScan)
■ Static Approach: consists of periodically applying a semi-supervised
clustering algorithm; it never discards labels over time.
■ Window-based Approach: a variation of the previous approach in which,
instead of executing the clustering algorithm over the entire label set,
old labels are removed.
26. Experiments
Protocol and Evaluation
■ Adjusted Rand Index (ARI)
■ Prequential Protocol
■ For each timestamp 𝑡, only one label is considered valid for a data
instance, according to the grouping tree.
■ Online arrival of instances and feedback: the arrival of data
instances and user feedback follows a uniform distribution.
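ARI can be computed with scikit-learn's adjusted_rand_score. It is invariant to cluster renaming, which matters when comparing partitions produced at different timestamps, since cluster identifiers carry no meaning across runs:

```python
from sklearn.metrics import adjusted_rand_score

# ARI compares a produced partition against the labels considered valid
# at timestamp t: 1.0 means identical partitions, values near 0.0 mean
# agreement no better than chance.
truth     = [0, 0, 0, 1, 1, 1]
predicted = [1, 1, 1, 0, 0, 0]   # same grouping, clusters renamed
print(adjusted_rand_score(truth, predicted))  # -> 1.0
```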
27. Experimental Results
Semi-supervised vs. Unsupervised
Q1. To what extent does semi-supervision improve clustering
effectiveness when there are external clustering transitions over time?
29. Experimental Results
Feedback vs. Labels
Q2. Are there major differences in clustering effectiveness between
semi-supervised clustering approaches based on feedback and those based
on labels?
31. Experimental Results
Feedback Window Size Variation
Q3. How does the feedback window size variation affect the
semi-supervision information and clustering effectiveness?
33. Conclusion
■ Higher efficiency when compared to other semi-supervised approaches,
while keeping equivalent effectiveness.
■ Future work:
■ Exploring other types of semi-supervision information, such as
instance-level constraints.
■ Tackling other strategies for detecting transitions.
34. References
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for
discovering clusters in large spatial databases with noise. In KDD, pages 226–231.
AAAI Press
Colonna, J. G., Gama, J., and Nakamura, E. F. (2016). Recognizing Family, Genus, and
Species of Anuran Using a Hierarchical Classification Approach. pages 198–212.
Springer, Cham.
Lai, H. P., Visani, M., Boucher, A., and Ogier, J.-M. (2014). A new interactive semi-
supervised clustering model for large image database indexing. Pattern
Recognition Letters, 37(1):94–106.
Lelis, L. and Sander, J. (2009). Semi-supervised Density-Based Clustering. In IEEE
ICDM, pages 842–847.
Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., and Schult, R. (2006). MONIC: Modeling
and Monitoring Cluster Transitions. In ACM SIGKDD, page 706, New York, NY, USA.
ACM Press.
Zhang, T., Ramakrishnan, R., and Livny, M. (1996). BIRCH: An Efficient Data
Clustering Method for very Large Databases. ACM SIGMOD Record, 25(2):103–114.
35. A Framework for Online
Clustering Based on
Evolving Semi-Supervision
Guilherme Alves guilhermealves@ufu.br
Maria Camila N. Barioni camila.barioni@ufu.br
Elaine Faria elaine@ufu.br
Acknowledgments