INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print), ISSN 0976 – 6375(Online)
Volume 3, Issue 3, October-December (2012), pp. 377-383
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2012): 3.9580 (Calculated by GISI), www.jifactor.com
A FRAMEWORK FOR CLUSTERING TIME-EVOLVING DATA USING THE
SLIDING WINDOW TECHNIQUE
Y. Swapna1, S. Ravi Sankar2
1 (Faculty, CSE Department, National Institute of Technology, Goa, India, spr@nitgoa.ac.in)
2 (Faculty, CSE Department, National Institute of Technology, Goa, India, srs@nitgoa.ac.in)
ABSTRACT
Clustering is the process of dividing a dataset into mutually exclusive groups
such that the members of each group are as "close" as possible to one another and different
groups are as "far" as possible from one another. Sampling represents a large data set by a
smaller random sample and is used to improve the efficiency of clustering. However, when
sampling is applied, the points that are not sampled are left without cluster labels after the
normal clustering process. This problem has been solved for the numerical domain, whereas
clustering of time-evolving data in the categorical domain remains a challenging issue. In this
paper, a sliding window is used to form subsets of a specified size from the dataset, i.e.,
collections of data drawn from the database and transferred to the clustering module. A
drifting-concept detection (DCD) algorithm is proposed that counts the number of outliers that
cannot be assigned to any cluster. The objective of this algorithm is to compare the
distribution of clusters and outliers between the last clustering result and the current temporal
clustering result. The experimental evaluation shows that performing DCD is faster than
clustering the entire data set at once and that DCD provides high-quality clustering results
with correctly detected drifting concepts.
Keywords: clustering, sampling, categorical domain, labels, sliding window, drifting concept
detection.
I. INTRODUCTION
Our present information age society thrives and evolves on knowledge. Knowledge is
derived from information gleaned from a wide variety of reservoirs of data (databases).
Clustering is an important technique for exploratory data analysis and has been the focus of
substantial research in several domains for decades. Clusters are connected regions of a multi-
dimensional space containing a relatively high density of points, separated from other such
regions by regions containing a low density of points. Clustering is useful for classification
and for revealing structure in high-dimensional data spaces (where outliers may themselves be
interesting), and it is applied in statistical pattern recognition, machine learning, and
information retrieval, among a wide range of other applications. Cluster analysis is the
assignment of a set of observations into subsets (called clusters) so that observations in the
same cluster are similar in some sense, and it helps us gain insight into the data distribution.
In real-world domains, the concept of interest may depend
on some hidden context, not given plainly in the form of predictive features, which has become a
problem as these concepts drift with time. A suitable example would be buying preferences of
customers which may change with time, depending on their needs, climatic conditions, discounts
etc. Since the concepts behind the data evolve with time, the underlying clusters may also change
significantly with time. Concept drift not only decreases the quality of clusters but also
disregards the expectations of users, who usually require recent clustering results. Much work
has explored the problem of clustering time-evolving data in the numerical domain.
Categorical attributes also prevalently exist in real data with drifting concepts; for
example, Web logs that record the browsing history of users, stock market details, and buying
records of customers often evolve with time. Previous works on clustering categorical data
focus on clustering the entire data set, and drifting concepts were not taken into consideration.
Consequently, the problem of clustering time evolving data in the categorical domain remains a
challenging issue. The objective is to propose a framework for performing clustering on the
categorical time-evolving data. The goal is a generalized clustering framework that utilizes
existing clustering algorithms and detects whether there is a drifting concept in the incoming
data, instead of designing a specific clustering algorithm. The sliding window technique is
adopted to detect the drifting concepts.
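The sliding window decomposition itself can be illustrated with a short sketch (ours, not the authors' code): a categorical stream is cut into consecutive, non-overlapping windows of a fixed size N. The record values below are hypothetical and only mimic the shape of the example used later in the paper.

```python
def sliding_windows(stream, n):
    """Split a stream of categorical records into consecutive,
    non-overlapping windows of size n (the last window may be shorter)."""
    return [stream[i:i + n] for i in range(0, len(stream), n)]

# Hypothetical categorical records (tuples of attribute values), N = 5.
stream = [("C", "W", "D"), ("I", "W", "M"), ("C", "W", "N"), ("S", "W", "M"),
          ("C", "W", "D"), ("B", "E", "F"), ("I", "T", "H"), ("B", "E", "G"),
          ("S", "I", "H"), ("B", "O", "G")]
s1, s2 = sliding_windows(stream, 5)
print(len(s1), len(s2))  # each window holds 5 data points
```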
II. RELATED WORK
Many numerical clustering algorithms have been proposed that consider time-evolving data, in
addition to traditional categorical clustering algorithms [1]. An effective and efficient method
called CluStream for clustering large evolving data streams was proposed in [5]. This method tries
to cluster the whole stream at one time rather than viewing the stream as a changing process over time. A
density-based method called DenStream was proposed in [2] for discovering clusters in an evolving data
stream.
Evolutionary clustering algorithms were proposed in [5] and [3]. Both adopt the same approach of
performing data clustering over time while trying to optimize two potentially conflicting criteria: first,
without a drifting concept, the present clustering should be similar to the previous one; and second, with
a drifting concept, the clustering should reflect the data arriving at that time step. In [6], a generic
framework for this problem extended the k-means and agglomerative hierarchical clustering algorithms
according to the problem domain. In [5], a measure of temporal smoothness is integrated into the overall
measure of clustering quality. As a result, the method produces stable and consistent clustering results
that are less sensitive to short-term noise while remaining adaptive to long-term cluster drifts. These
previously proposed methods concentrate on the problem of clustering time-evolving data in the
numerical domain. In [4], the problem of clustering categorical data is discussed, where clustering is
performed on customer transaction data in a market database.
In [6] and [4], a framework to perform clustering on categorical time-evolving data has been
proposed. In particular, the rough membership function in rough set theory represents a concept that
induces a fuzzy set. Several extensions based on k-modes have been presented for different objectives,
e.g., fuzzy k-modes [6] and initial-point refinement [2]. These categorical algorithms focus on
clustering the entire data set and do not consider time-evolving trends.
III. THE PROPOSED APPROACH
We propose a generalized clustering framework that utilizes existing clustering
algorithms and detects if there is a drifting concept or not in the incoming data. In order to detect
the drifting concepts at different sliding windows, we propose the algorithm DCD to compare the
cluster distributions between the last clustering result and the temporal current clustering result.
The data to be clustered is a collection extracted from the database, and this data is
time-evolving categorical data (it does not arrive in a strictly sequential manner). We used a
synthetic data generator [5] to generate data sets with different numbers of data points and
attributes. The number of data points varies from 10,000 to 100,000, and the dimensionality is
in the range 10-50. In all synthetic data sets, each dimension has 20 attribute values.
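A minimal sketch of such a generator follows (our illustration only, not the actual generator of [5]); the ranges match those stated above, and the function and symbol names are hypothetical.

```python
import random

def generate_categorical_dataset(num_points, num_attrs, values_per_attr=20, seed=0):
    """Generate num_points records, each with num_attrs categorical attributes
    drawn uniformly from values_per_attr symbolic attribute values."""
    rng = random.Random(seed)  # seeded for reproducible data sets
    symbols = [f"v{j}" for j in range(values_per_attr)]
    return [tuple(rng.choice(symbols) for _ in range(num_attrs))
            for _ in range(num_points)]

# Smallest configuration described in the paper: 10,000 points, 10 attributes.
data = generate_categorical_dataset(num_points=10_000, num_attrs=10)
print(len(data), len(data[0]))  # 10000 10
```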
A sliding window is used to form subsets of a specified size from the dataset, i.e.,
collections of data drawn from the database and transferred to the clustering module. In this
paper, a practical categorical cluster representative, named the "Node Importance
Representative" (abbreviated as NIR), is utilized. It represents clusters by measuring the
importance of each attribute value (node) in the clusters. The Drifting Concept Detection
(DCD) algorithm is used to detect the difference in cluster distribution between the current
data subset and the last clustering result. To perform a proper evaluation, we label the
clusters, and points that do not belong to any cluster are called outliers. Reclustering is
performed if the difference between the clustering results is large enough. A data point p_j is
assigned to the cluster c_k (1 ≤ k ≤ l) for which it obtains maximal resemblance. The
resemblance for a given data point p_j and the NIR table of a cluster c_k is defined by the
following equation:

R(p_j, c_k) = Σ_{r=1..m} w(c_k, I_r)        (1)

where I_r is one entry (node) in the NIR table of cluster c_k. As shown in equation (1), the
resemblance is obtained by summing up the importance of the point's nodes in the NIR table of
cluster c_k. The resemblance is larger if the data point contains nodes that are more important
in one cluster than in another, and the point is assigned where it obtains maximal resemblance.
If the resemblance values in every cluster are small, the point is treated as an outlier.
Therefore, a threshold λ_i is set in each cluster to identify outliers. The decision function is
defined as follows:

Label(p_j) = c_i*, if max_i R(p_j, c_i) ≥ λ_i where 1 ≤ i ≤ l;
             outlier, otherwise.
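Under this resemblance definition, the computation and the decision function can be sketched as follows (our illustration, not the authors' code: an NIR table is represented as a dict mapping a node, i.e. an (attribute index, value) pair, to its importance weight, and all weights and cluster names below are hypothetical).

```python
def resemblance(point, nir_table):
    """R(p_j, c_k): sum of the importance w(c_k, I_r) of each node
    [Aa = value] of the point that appears in the cluster's NIR table."""
    return sum(nir_table.get((a, v), 0.0) for a, v in enumerate(point))

def label_point(point, clusters, thresholds):
    """Assign the point to the cluster with maximal resemblance if that
    resemblance reaches the cluster's threshold; otherwise mark it an outlier."""
    best, best_r = None, -1.0
    for name, nir in clusters.items():
        r = resemblance(point, nir)
        if r > best_r:
            best, best_r = name, r
    return best if best_r >= thresholds[best] else "outlier"

# Hypothetical NIR tables for two clusters over attributes A1..A3.
clusters = {
    "c1": {(0, "C"): 0.8, (1, "W"): 1.0, (2, "D"): 0.5},
    "c2": {(0, "B"): 0.5, (1, "E"): 1.0, (2, "F"): 0.5},
}
thresholds = {"c1": 0.5, "c2": 0.5}
print(label_point(("B", "E", "F"), clusters, thresholds))  # c2
print(label_point(("Z", "Q", "X"), clusters, thresholds))  # outlier
```

A point whose nodes are important in no cluster falls below every threshold and is reported as an outlier, exactly as the decision function above prescribes.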
As shown in Fig. 1, the data points in the second sliding window undergo data
labeling with thresholds λ1 = λ2 = 0.5. The first data point p6 = (B, E, F) in S2 is decomposed
into three nodes, i.e., {[A1=B]}, {[A2=E]}, {[A3=F]}. The resemblance of p6 in c1^1 is zero,
and in c2^1 it is also zero; since the maximal resemblance is not larger than the threshold, the
data point is considered an outlier. The resemblance of the next data point in c1^1 is 0.037
and in c2^1 it is 1.537 (0.5 + 0.037 + 1). The maximal resemblance is therefore obtained in
c2^1, and since this value is larger than the threshold λ2 = 0.5, the point is labeled with
cluster c2^1.
        p1  p2  p3  p4  p5 | p6  p7  p8  p9  p10
A1:     C   I   C   S   C  | B   I   B   S   B
A2:     W   W   W   W   W  | E   T   E   I   O
A3:     D   M   N   M   D  | F   H   G   H   G
               S1          |         S2

        p11 p12 p13 p14 p15
A1:     S   I   Z   I   S
A2:     W   W   P   W   W
A3:     P   P   T   P   P
                S3

(The clustering result on S1 consists of clusters c1^1 and c2^1; labeling S2 against it yields
the temporal clusters c'1^2 and c'2^2 plus outliers.)
Fig. 1: The temporal clustering result C'^t obtained by data labeling.
Algorithm Used:

Let temp = C[t', t-1]
DriftingConceptDetecting(temp, S^t)
  out = 0   {number of outliers}
  while there is a next tuple in S^t do
    read in data point p_j from S^t
    divide p_j into nodes I_1 to I_m
    for all clusters temp_i in temp do
      calculate resemblance R(p_j, temp_i)
    end for
    find the cluster temp_m with maximal resemblance
    if R(p_j, temp_m) ≥ λ_m then
      assign p_j to c'_m^t
    else
      out = out + 1
    end if
  end while
  Outlier = out
  {data labeling on the current sliding window is done}
  numdiffclusters = 0
  for all clusters temp_i in temp do
    if | m_i^[t', t-1] / Σ_{x=1..k} m_x^[t', t-1]  −  m'_i^t / Σ_{x=1..k'} m'_x^t | > ε then
      numdiffclusters = numdiffclusters + 1
    end if
  end for
  if out / N > θ or numdiffclusters / k^[t', t-1] > η then
    {concept drifts}
    dump out temp
    call initial clustering on S^t
  else
    {concept does not drift}
    add C'^t into temp
    update NIR as C[t', t]
  end if
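A compact executable rendering of the DCD loop above may help (our sketch, not the authors' implementation): NIR tables are dicts mapping (attribute index, value) nodes to importance weights, and all names, weights, and the example window are hypothetical. Re-clustering on drift is left out; the function only returns the drift decision.

```python
def dcd_detects_drift(last_clusters, last_sizes, window, thresholds,
                      theta=0.1, epsilon=0.1, eta=0.5):
    """DCD sketch: label the window S^t against the last clustering result,
    then test the outlier ratio against theta and the fraction of clusters
    whose share of points changed by more than epsilon against eta."""
    def resemblance(point, nir):
        return sum(nir.get((a, v), 0.0) for a, v in enumerate(point))

    temp_sizes = {name: 0 for name in last_clusters}
    outliers = 0
    for point in window:
        # Find the cluster with maximal resemblance; below-threshold = outlier.
        best = max(last_clusters, key=lambda n: resemblance(point, last_clusters[n]))
        if resemblance(point, last_clusters[best]) >= thresholds[best]:
            temp_sizes[best] += 1
        else:
            outliers += 1

    # Count clusters whose proportion of points changed by more than epsilon.
    last_total = sum(last_sizes.values())
    cur_total = max(sum(temp_sizes.values()), 1)
    changed = sum(1 for n in last_clusters
                  if abs(last_sizes[n] / last_total
                         - temp_sizes[n] / cur_total) > epsilon)
    return (outliers / len(window) > theta
            or changed / len(last_clusters) > eta)

# Hypothetical example: the last result split points on A1 = C vs A1 = B;
# a window full of unseen values is flagged as a drifting concept.
clusters = {"c1": {(0, "C"): 1.0}, "c2": {(0, "B"): 1.0}}
sizes = {"c1": 3, "c2": 2}
drifted = dcd_detects_drift(clusters, sizes, [("Z",), ("Z",), ("Z",)],
                            {"c1": 0.5, "c2": 0.5})
print(drifted)  # True: every point in the window is an outlier
```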
Since we measure the similarity between a data point p_j and a cluster c_i as R(p_j, c_i),
the cluster with the maximal resemblance is the most appropriate cluster for that data point. If
the maximal resemblance (in the most appropriate cluster) is smaller than the threshold λ_i of
that cluster, the data point is regarded as an outlier. In order to observe the relationship
between different clustering results, cluster relationship analysis is used to analyze and show
the changes between different clustering results. It measures the similarity of clusters between
the clustering results at different time stamps and links the similar clusters.
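The cosine measure between two cluster (NIR) vectors can be sketched like this (our illustration; sparse vectors are dicts over nodes, and the vectors c_a, c_b, c_c below are hypothetical):

```python
import math

def cosine_measure(u, v):
    """CM(u, v): cosine of the angle between two sparse cluster vectors,
    each a dict mapping node -> importance weight."""
    dot = sum(w * v.get(node, 0.0) for node, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical NIR vectors: c_a and c_b share their important nodes, c_c none.
c_a = {("A1", "B"): 0.5, ("A2", "E"): 1.0}
c_b = {("A1", "B"): 0.6, ("A2", "E"): 0.9}
c_c = {("A1", "S"): 1.0}
print(cosine_measure(c_a, c_b) > cosine_measure(c_a, c_c))  # True
```

Clusters from consecutive windows are then linked wherever this measure is large, which is exactly what the similarity tables in Fig. 2 record.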
              c1^2      c2^2
Cluster c1^1  0.012     0.182
Cluster c2^1  0.567     0

              c1^3      c2^3
Cluster c1^2  1         0
Cluster c2^2  0         0

Fig. 2: The similarity tables between clustering results
The cosine measure CM(c̄2^1, c̄1^2) = 0.567, which is larger than CM(c̄1^1, c̄1^2) = 0.012.
Therefore cluster c1^2 is said to be more similar to cluster c2^1 than to cluster c1^1.
Table 1: Symbols used in the algorithm

A_a           The a-th attribute in the data set.
C[t1, t2]     The clustering result from t1 to t2.
C^t           The clustering result on sliding window t.
C'^t          The temporal clustering result on sliding window t.
c_j           The j-th cluster in C.
c̄_j           The node importance vector of c_j.
I_r           The r-th node in c_j.
|I_r|         The number of occurrences of I_r.
k             The number of clusters in C.
m_j           The number of data points in c_j.
N             The size of the sliding window.
S^t           The sliding window t.
t             The timestamp index of the sliding window.
w(c_j, I_r)   The importance of I_r in c_j.
θ             The outlier threshold.
ε             The cluster variation threshold.
η             The cluster difference threshold.
CM(c̄_i, c̄_j)  The cosine measure between cluster vectors c̄_i and c̄_j.
IV. RESULTS
The following table shows, in terms of precision and recall, that DCD is effective at
detecting drifting concepts.

N = 1000
Setting   Drifting   Precision   Recall
D1        35.6       0.557       0.873
D2        39.2       0.825       0.992
D3        46         0.816       0.98
D4        44.5       0.443       0.97

Fig. 3: The precision and recall of DCD
We change clustering pairs to obtain data sets with drifting concepts and then test the
detection accuracy of the DCD algorithm on those data sets. The outlier threshold θ is set to
0.1, the cluster variation threshold ε is set to 0.1, and the cluster difference threshold η is
set to 0.5. The number of clusters k, which is the required parameter in the initial clustering
step and the reclustering step, is set to the maximum number of clusters in each setting, e.g.,
k = 10 in D1 and k = 20 in D3. In addition, each synthetic data set is generated by randomly
combining 50 clustering results for that setting, and the precision and recall shown in Fig. 3
are averages over 20 experiments. The precision and recall exceed 80 percent when the size of
the sliding window is larger than 2,000. They are somewhat lower when the size of the sliding
window is set to 1,000 because a drifting concept often crosses two windows; we count only one as a
correct hit, and the other window is counted as a miss. However, the detection recall is
highest when the size of the sliding window is set to 1,000, since drifting concepts are less
likely to be missed when the data set is partitioned more finely. If we take two examples of
bank datasets synthesized with settings D1 and D2 and evaluate the clustering results on each
sliding window, the framework generates a new clustering result whenever a drifting concept is
detected and thus responds quickly to the trend of the evolving dataset.
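For reference, precision and recall figures of the kind reported in Fig. 3 can be computed from detected versus true drift windows as follows (our sketch; the window index sets are hypothetical, not the paper's data):

```python
def precision_recall(detected, actual):
    """Precision and recall of drift detection: detected and actual are
    sets of window indices where a drift was reported / truly occurred."""
    true_pos = len(detected & actual)  # correctly detected drift windows
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

p, r = precision_recall(detected={3, 7, 9, 12}, actual={3, 7, 12, 15})
print(p, r)  # 0.75 0.75
```

A drift that crosses two windows counts as one hit and one miss under this scheme, which is why the paper's precision dips at the smallest window size.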
V. CONCLUSION
In this paper we have proposed a framework to perform clustering on categorical time-
evolving data. In order to detect the drifting concepts at different sliding windows, we proposed
the algorithm DCD to compare the cluster distributions between the last clustering result and the
temporal current clustering result. If the results are quite different, the last clustering result will
be dumped out, and the current data in this sliding window will perform reclustering. In order to
observe the relationship between different clustering results, cluster relationship analysis is used
to analyze and show the changes between different clustering results. The experimental
evaluation shows that performing DCD is faster than doing clustering once on the entire data set
and DCD can provide high-quality clustering results with correctly detected drifting concepts.
Therefore, the result demonstrates that our framework is practical for detecting drifting concepts
in time-evolving categorical data.
VI. REFERENCES
[1] D. Barbara, Y. Li, and J. Couto, Coolcat: An Entropy-Based Algorithm for Categorical
Clustering, Proc. ACM Int’l Conf. Information and Knowledge Management (CIKM), 2002.
[2] F. Cao, M. Ester, W. Qian, and A. Zhou, Density-Based Clustering over an Evolving Data
Stream with Noise, Proc. Sixth SIAM Int’l Conf. Data Mining (SDM), 2006.
[3] H.-L. Chen, K.-T. Chuang, and M.-S. Chen, Labeling Unclustered Categorical Data into
Clusters Based on the Important Attribute Values, Proc. Fifth IEEE Int’l Conf. Data Mining
(ICDM), 2005.
[4] O. Nasraoui, M. Soliman, E. Saka, A. Badia, and R. Germain, A Web Usage Mining
Framework for Mining Evolving User Profiles in Dynamic Web Sites, IEEE Trans. Knowledge
and Data Eng., vol. 20, no. 2, pp. 202-215, Feb. 2008.
[5] Hung-Leng Chen, Ming-Syan Chen, and Su-Chen Lin, Catching the Trend: A Framework
for Clustering Concept-Drifting Categorical Data, IEEE Trans. Knowledge and Data Eng., vol.
21, no. 5, May 2009.
[6] Z. Huang and M.K. Ng, A Fuzzy k-Modes Algorithm for Clustering Categorical Data, IEEE
Trans. Fuzzy Systems, 1999.