1. A Framework for Online Clustering
Based on Evolving Semi-Supervision
Guilherme Alves, Maria Camila N. Barioni and Elaine Faria
Universidade Federal de Uberlândia
2. Outline
■ Introduction
■ The CABESS Framework
■ Pointwise CABESS
■ Experiments and Results
■ Conclusion
6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 2
3. Introduction
Motivation
■ The desired organization for the data may change over time.
Semi-supervised approaches may be useful for guiding clustering
algorithms in the adaptation process.
■ Additional information (semi-supervision) may change over time.
■ It may cause clustering transitions: birth, split, or merge.
7. Introduction
Objective and Research Questions
Goal: to provide a framework that is able to use and maintain
semi-supervision correctly to enable efficient and effective online
clustering processes.
Q1. To what extent does semi-supervision improve clustering
effectiveness when there are external clustering transitions over time?
Q2. Are there major differences in clustering effectiveness between
semi-supervised clustering approaches based on feedback and those
based on labels?
Q3. How does the feedback window size variation affect the
semi-supervision information and clustering effectiveness?
Q4. How efficient is our approach compared to existing semi-supervised
approaches?
16. Introduction
Main Contributions
■ The introduction of CABESS (Cluster Adaptation Based on
Evolving Semi-Supervision)
■ A framework for online clustering using semi-supervision in the form
of feedback.
■ A strategy that extracts semi-supervision information from
feedback given in the form of labels.
■ An approach to keep the labels consistent over time.
17. The CABESS Framework
[Architecture diagram] The CABESS framework processes the stream 𝒟 through five components: (1) Summarization, (2) Clustering, (3) Semi-Supervision Deduction, (4) Semi-supervised Clustering, and (5) Transition Detection, backed by a Partition Set Storage. Decision point A verifies whether this is the first clustering performed; decision point B verifies whether new instances have been generated or the user is satisfied with the cluster quality.
18. The CABESS Framework
Pointwise CABESS
[Architecture diagram] Pointwise CABESS instantiates the framework's components as follows:
■ (1) Summarization: BIRCH [Zhang et al. 1996]
■ (2) Clustering: DBScan [Ester et al. 1996]
■ (3) Semi-Supervision Deduction
■ (4) Semi-supervised Clustering: SSDBScan [Lelis and Sander 2009]
■ (5) Transition Detection: MONIC [Spiliopoulou et al. 2006]
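To make the instantiation concrete, one iteration of the pipeline could be sketched as below, assuming scikit-learn's Birch and DBSCAN as stand-ins for components (1) and (2). SSDBScan and MONIC have no scikit-learn implementations, so those steps are stubbed; the function name and parameter values are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.cluster import Birch, DBSCAN

def pointwise_cabess_step(X, labels=None):
    """One iteration of the Pointwise CABESS pipeline (sketch)."""
    # (1) Summarization: BIRCH compresses the incoming batch into
    # subcluster centroids (the "summarized instances").
    summarizer = Birch(threshold=0.3, n_clusters=None)
    summarizer.fit(X)
    summaries = summarizer.subcluster_centers_

    if labels is None:
        # (2) First clustering: plain DBSCAN over the summaries.
        partition = DBSCAN(eps=3.0, min_samples=1).fit_predict(summaries)
    else:
        # (4) Later iterations would run SSDBScan over the summaries plus
        # the deduced labels; SSDBScan is not in scikit-learn, so this
        # branch is stubbed with plain DBSCAN here.
        partition = DBSCAN(eps=3.0, min_samples=1).fit_predict(summaries)

    # (5) A MONIC-style detector would compare `partition` against the
    # stored previous partition to flag birth/split/merge transitions.
    return summaries, partition

# Two well-separated groups of points should yield two clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(5, 0.2, (30, 2))])
summaries, partition = pointwise_cabess_step(X)
print(len(set(partition)))  # -> 2
```

The summarization step is what keeps the pipeline online: later stages operate on a small set of centroids rather than on every raw instance.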
20. The CABESS Framework » Pointwise CABESS
Extracting labels from feedback
1. Feedback → (extraction) → Instance-level labels
■ A neighborhood 𝑁 is defined as a set of instances that must be in the
same cluster.
[Figure: instances receiving positive feedback (✓), panels (a) and (b)]
21. The CABESS Framework » Pointwise CABESS
Extracting labels from feedback
1. Feedback → (extraction) → Instance-level labels
■ A neighborhood 𝑁 is defined as a set of instances that must be in the
same cluster.
■ Same previous cluster AND received positive feedback → same label.
■ Received displacement feedback → assign the label of the destination
neighborhood.
[Figure: feedback marks (✓) converted into instance-level labels 1 and 2, panels (a)–(c)]
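The two extraction rules above can be sketched in a few lines of Python. The function and its data shapes are hypothetical simplifications for illustration, not part of CABESS itself.

```python
def extract_labels(prev_cluster, feedback, neighborhood_label):
    """Turn user feedback into instance-level labels (sketch).

    prev_cluster: {instance_id: cluster_id} from the previous partition.
    feedback: {instance_id: ("positive", None) | ("displace", dest_cluster)}.
    neighborhood_label: {cluster_id: label} -- the label shared by each
        neighborhood (set of instances that must be clustered together).
    """
    labels = {}
    for i, (kind, dest) in feedback.items():
        if kind == "positive":
            # Rule 1: same previous cluster + positive feedback -> same label.
            labels[i] = neighborhood_label[prev_cluster[i]]
        elif kind == "displace":
            # Rule 2: displacement feedback -> label of the destination
            # neighborhood.
            labels[i] = neighborhood_label[dest]
    return labels

labels = extract_labels(
    {"a": 1, "b": 1, "c": 2},                         # previous partition
    {"a": ("positive", None), "c": ("displace", 1)},  # user feedback
    {1: "L1", 2: "L2"})                               # neighborhood labels
print(labels)  # -> {'a': 'L1', 'c': 'L1'}
```

Instance "b" received no feedback, so it stays unlabeled: only explicit feedback produces semi-supervision.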
22. The CABESS Framework » Pointwise CABESS
Extracting labels from feedback
2. Instance-level labels → (deduction) → Summarized-level labels
■ Performed as a propagation task.
■ If a summarized instance holds labeled instances with different labels,
split it to obtain purified summarized instances, i.e., summarized
instances whose labeled instances all carry the same label.
[Figure: a summarized instance mixing labels 1 and 2 is split into pure summarized instances, panels (c) and (d)]
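A minimal sketch of this splitting step, under the illustrative simplification that each summarized instance is represented only by the multiset of labels of its labeled members:

```python
def purify(summarized):
    """Propagate instance labels to the summarized level (sketch).

    summarized: list of label multisets, one per summarized instance,
    e.g. [['1', '1'], ['1', '2']]. Summarized instances whose labeled
    members disagree are split by label, so every output summarized
    instance is pure (carries a single label).
    """
    pure = []
    for labels in summarized:
        distinct = sorted(set(labels))
        if len(distinct) <= 1:
            pure.append(labels)        # already pure: keep as-is
        else:
            for lab in distinct:       # mixed: split into one group per label
                pure.append([l for l in labels if l == lab])
    return pure

print(purify([["1", "1"], ["1", "2", "2"]]))  # -> [['1', '1'], ['1'], ['2', '2']]
```

In the real framework the split would also partition the underlying raw instances of the summary, not just its labels; the sketch keeps only the label bookkeeping.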
23. The CABESS Framework » Pointwise CABESS
Dealing with obsolete labels
■ Obsolete labels: labels assigned to instances whose clusters no
longer exist.
■ Minimizing the problem: adoption of a transition detector.
■ Better neighborhood management: when a cluster survives, both its
neighborhood and its associated labels are preserved.
[Figure: a cluster split between timestamps — the original cluster does not exist at 𝑡 anymore, so label 2 becomes obsolete]
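A sketch of the pruning idea, assuming the transition detector (e.g. a MONIC-style component) reports which clusters survive into the current timestamp; the names are illustrative, not the authors' API:

```python
def prune_obsolete(labels, surviving_clusters):
    """Discard obsolete labels (sketch).

    labels: {instance_id: cluster_label} accumulated from feedback.
    surviving_clusters: set of clusters the transition detector reports
        as surviving into the current timestamp.
    Labels of surviving clusters are preserved along with their
    neighborhoods; labels of dissolved clusters (e.g. after a split)
    become obsolete and are dropped.
    """
    return {i: c for i, c in labels.items() if c in surviving_clusters}

print(prune_obsolete({"a": 1, "b": 2, "c": 1}, {1}))  # -> {'a': 1, 'c': 1}
```

Without this step, labels from dissolved clusters would keep constraining future partitions toward a structure that no longer exists in the stream.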
24. Experiments
Datasets
Name     # instances   d    # classes   Reference                Type
DB7      9,050         2    8           [Silva et al. 2015]      Synthetic
SYN3     5,000         2    3           streamMOA                Synthetic
SYN4     10,000        3    5           streamMOA                Synthetic
FROGS    1,484         8    4           [Colonna et al. 2016]    Real
IPEA     5,564         5    27          IPEA                     Real
KDD'99   24,692        19   11          UCI                      Real
25. Experiments
Comparison Methods
■ Unsupervised (DBScan)
■ Consists of periodically executing a clustering algorithm without any
semi-supervision.
■ Semi-supervised (SSDBScan)
■ Static Approach: consists of periodically applying a semi-supervised
clustering algorithm; it never discards labels over time.
■ Window-based Approach: a variation of the previous approach in which,
instead of executing the clustering algorithm over the entire label set,
old labels are removed.
26. Experiments
Protocol and Evaluation
■ Adjusted Rand Index (ARI)
■ Prequential Protocol
■ For each timestamp 𝑡, only one label is considered valid for a data
instance, according to the grouping tree.
■ Online arrival of instances and feedback: the arrival of data
instances and user feedback follows a uniform distribution.
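ARI can be computed with scikit-learn's adjusted_rand_score. It is invariant to cluster renaming, which matters when comparing partitions produced at different timestamps, since cluster identifiers carry no meaning across runs:

```python
from sklearn.metrics import adjusted_rand_score

# ARI compares a produced partition against the labels considered valid
# at timestamp t: 1.0 means identical partitions, values near 0.0 mean
# agreement no better than chance.
truth     = [0, 0, 0, 1, 1, 1]
predicted = [1, 1, 1, 0, 0, 0]   # same grouping, clusters renamed
print(adjusted_rand_score(truth, predicted))  # -> 1.0
```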
27. Experimental Results
Semi-supervised vs. Unsupervised
Q1. To what extent does semi-supervision improve clustering
effectiveness when there are external clustering transitions over time?
29. Experimental Results
Feedback vs. Labels
Q2. Are there major differences in clustering effectiveness between
semi-supervised clustering approaches based on feedback and those based
on labels?
31. Experimental Results
Feedback Window Size Variation
Q3. How does the feedback window size variation affect the
semi-supervision information and clustering effectiveness?
33. Conclusion
■ Higher efficiency when compared to other semi-supervised approaches,
while keeping equivalent effectiveness.
■ Future work:
■ Exploring other types of semi-supervision information, such as
instance-level constraints.
■ Tackling other strategies for detecting transitions.
34. References
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for
discovering clusters in large spatial databases with noise. In KDD, pages 226–231.
AAAI Press
Colonna, J. G., Gama, J., and Nakamura, E. F. (2016). Recognizing Family, Genus, and
Species of Anuran Using a Hierarchical Classification Approach. pages 198–212.
Springer, Cham.
Lai, H. P., Visani, M., Boucher, A., and Ogier, J.-M. (2014). A new interactive semi-
supervised clustering model for large image database indexing. Pattern
Recognition Letters, 37(1):94–106.
Lelis, L. and Sander, J. (2009). Semi-supervised Density-Based Clustering. In IEEE
ICDM, pages 842–847.
Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., and Schult, R. (2006). MONIC: Modeling
and Monitoring Cluster Transitions. In ACM SIGKDD, page 706, New York, NY, USA.
ACM Press.
Zhang, T., Ramakrishnan, R., and Livny, M. (1996). BIRCH: An Efficient Data
Clustering Method for very Large Databases. ACM SIGMOD Record, 25(2):103–114.
35. A Framework for Online
Clustering Based on
Evolving Semi-Supervision
Guilherme Alves guilhermealves@ufu.br
Maria Camila N. Barioni camila.barioni@ufu.br
Elaine Faria elaine@ufu.br
Acknowledgments