INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 3, Issue 3, October - December (2012), pp. 377-383
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2012): 3.9580 (Calculated by GISI)
www.jifactor.com



  A FRAMEWORK FOR CLUSTERING TIME EVOLVING DATA USING
               SLIDING WINDOW TECHNIQUE

                                Y. Swapna (1), S. Ravi Sankar (2)
   (1) Faculty, CSE Department, National Institute of Technology, Goa, India, spr@nitgoa.ac.in
   (2) Faculty, CSE Department, National Institute of Technology, Goa, India, srs@nitgoa.ac.in


 ABSTRACT
         Clustering is the process of dividing a data set into mutually exclusive groups such
 that the members of each group are as "close" as possible to one another and different groups
 are as "far" as possible from one another. Sampling represents a large data set by a smaller
 random sample of the data and is used to improve the efficiency of clustering. When sampling is
 applied, however, the points that are not sampled have no cluster labels after the normal
 clustering process. This problem has been solved in the numerical domain, whereas clustering of
 time-evolving data in the categorical domain remains a challenging issue. In this paper, a
 sliding window of a specified size is used to form subsets of the data set, i.e., to collect
 data from the database and transfer it to the clustering module. A drifting concept detection
 (DCD) algorithm is proposed that counts the outliers that cannot be assigned to any cluster.
 The objective of this algorithm is to compare the distribution of clusters and outliers between
 the last clustering result and the current temporal clustering result. The experimental
 evaluation shows that performing DCD is faster than clustering the entire data set once and that
 DCD can provide high-quality clustering results with correctly detected drifting concepts.
 Keywords: clustering, sampling, categorical domain, labels, sliding window, drifting concept
 detection.

 I. INTRODUCTION
        Our present information-age society thrives and evolves on knowledge. Knowledge is
 derived from information gleaned from a wide variety of reservoirs of data (databases).
 Clustering is an important technique for exploratory data analysis and has been the focus of
 substantial research in several domains for decades. Clusters are connected regions of a
 multi-dimensional space containing a relatively high density of points, separated from other
 such regions by a region containing a low density of points. Clustering is useful for
 classification, can reveal structure and interesting outliers in high-dimensional data spaces,
 and is used in a wide range of applications in statistical pattern recognition, machine
 learning, and information retrieval. Cluster analysis is the assignment of a set of observations
 into subsets (called clusters) so that observations in the same cluster are similar in some
 sense. It helps us to gain insight into the data distribution. In real-world domains, the concept
 of interest may depend on some hidden context that is not given plainly in the form of
 predictive features, which becomes a problem because these concepts drift with time. A suitable
 example is the buying preferences of customers, which may change with time depending on their
 needs, climatic conditions, discounts, etc. Since the concepts behind the data evolve with time,
 the underlying clusters may also change significantly with time. Drifting concepts not only
 decrease the quality of clusters but also disregard the expectations of users, who usually
 require recent clustering results. Many works have addressed the problem of clustering
 time-evolving data in the numerical domain.

        Categorical attributes also exist prevalently in real data with drifting concepts; for
example, Web logs that record the browsing history of users, stock market details, and buying
records of customers often evolve with time. Previous works on clustering categorical data focus
on clustering the entire data set and do not take drifting concepts into consideration.
Consequently, the problem of clustering time-evolving data in the categorical domain remains a
challenging issue. The objective is to propose a framework for performing clustering on
categorical time-evolving data. The goal is a generalized clustering framework that utilizes
existing clustering algorithms and detects whether there is a drifting concept in the incoming
data, instead of designing a specific clustering algorithm. The sliding window technique is
adopted to detect the drifting concepts.

II. RELATED WORK

         Many numerical clustering algorithms that consider time-evolving data, as well as traditional
categorical clustering algorithms [1], have been proposed. An effective and efficient method called
CluStream for clustering large evolving data streams was proposed in [5]. This method tries to cluster
the whole stream at one time rather than viewing the stream as a changing process over time. A
density-based method called DenStream was proposed in [2] for discovering clusters in an evolving data
stream.

         Evolutionary clustering algorithms were proposed in [5] and [3]. They adopt the same approach,
which performs data clustering over time and tries to optimize two potentially conflicting criteria:
first, the previous and the present clustering must be similar when no concept drifts, and second, the
clustering should reflect the data that arrived at that time step when a concept drifts. In [6], a
generic framework for this problem extended the k-means and agglomerative hierarchical clustering
algorithms according to the problem domain. In [5], a measure of temporal smoothness is integrated into
the overall measure of clustering quality. As a result, the method produces stable and consistent
clustering results that are less sensitive to short-term noise while remaining adaptive to long-term
cluster drifts. These previously proposed methods concentrate on the problem of clustering
time-evolving data in the numerical domain. In [4], the problem of clustering categorical data is
discussed, with clustering performed on customer transaction data in a market database.





         In [6] and [4], a framework to perform clustering on categorical time-evolving data has been
proposed. In particular, the rough membership function in rough set theory represents a concept that
induces a fuzzy set. Several extensions of k-modes have been presented for different objectives, e.g.,
fuzzy k-modes [6] and initial points refinement [2]. These categorical algorithms focus on clustering
the entire data set and do not consider time-evolving trends.

III. THE PROPOSED APPROACH

         We propose a generalized clustering framework that utilizes existing clustering
algorithms and detects whether there is a drifting concept in the incoming data. In order to
detect the drifting concepts at different sliding windows, we propose the algorithm DCD, which
compares the cluster distributions between the last clustering result and the current temporal
clustering result.
      The input is a collection of data extracted from the database to be clustered; the data are
time-evolving categorical data (not on a sequential basis). We used a synthetic data generator [5]
to generate data sets with different numbers of data points and attributes. The number of data
points varies from 10,000 to 100,000, and the dimensionality is in the range of 10-50. In all
synthetic data sets, each dimension possesses 20 attribute values.
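
The partitioning of the data into sliding windows can be sketched as follows. This is a minimal Python illustration, assuming non-overlapping windows of a fixed size N (as fig.1 suggests, with windows of five points each); the function name `sliding_windows` and the way records are stored as tuples are hypothetical choices, not part of the original framework.

```python
from typing import Iterator, List, Sequence, Tuple

Point = Tuple[str, ...]  # one categorical record, e.g. ("C", "W", "D")


def sliding_windows(data: Sequence[Point], n: int) -> Iterator[List[Point]]:
    """Yield consecutive, non-overlapping windows of at most n points."""
    if n <= 0:
        raise ValueError("window size n must be positive")
    for start in range(0, len(data), n):
        yield list(data[start:start + n])


# Toy data: points p1..p10 of fig.1 (attributes A1, A2, A3).
points = [
    ("C", "W", "D"), ("I", "W", "M"), ("C", "W", "N"), ("S", "W", "M"),
    ("C", "W", "D"), ("B", "E", "F"), ("I", "T", "H"), ("B", "E", "G"),
    ("S", "I", "H"), ("B", "O", "G"),
]
windows = list(sliding_windows(points, n=5))  # two windows of five points each
```

The last window may be shorter than N when the data set size is not a multiple of N; how such a remainder should be handled is left open here.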

        A sliding window of a specified size is used to form subsets of the data set, i.e., to
collect data from the database and transfer it to the clustering module. In this paper, a practical
categorical clustering representative, named “Node Importance Representative” (abbreviated as
NIR), is utilized. It represents clusters by measuring the importance of each attribute value in the
clusters. The Drifting Concept Detection (DCD) algorithm (fig.2) is used to detect the difference in
cluster distribution between the current data subset and the last clustering result. In order to
perform a proper evaluation, we label the data points with clusters, and those that do not belong
to any cluster are called outliers. Reclustering is performed if the difference between the cluster
distributions is large enough. A data point $p_j$ is labeled with the cluster $c_k$,
$1 \le k \le l$, with which it attains the maximal resemblance. The resemblance for a given data
point $p_j$ and an NIR table of cluster $c_k$ is defined by the following equation:

                \[ R(p_j, c_k) = \sum_{r=1}^{q} w(c_k, I_{kr}) \tag{1} \]

where $I_{kr}$ is one entry in the NIR table of cluster $c_k$. As shown in equation (1), the
resemblance can be obtained directly by summing up the importance of the nodes of $p_j$ in the NIR
table of cluster $c_k$. The resemblance is larger if the data point contains nodes that are more
important in one cluster than in another, and the point is assigned to the cluster with the maximal
resemblance. If the resemblance values between the data point and every cluster are small, the
point is treated as an outlier. Therefore, a threshold $\lambda_i$ is set in each cluster to
identify outliers. The decision function is defined as follows:

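        \[ \text{Label} = \begin{cases} c_{i^*}, & \text{if } \max_{1 \le i \le l} R(p_j, c_i) \ge \lambda_{i^*}, \\ \text{outlier}, & \text{otherwise,} \end{cases} \]
where $c_{i^*}$ is the cluster that attains the maximal resemblance.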
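
Equation (1) and the decision function can be illustrated with a short Python sketch. It assumes that each NIR table is available as a dictionary from nodes to importance values; the names `resemblance` and `label` and all importance values below are hypothetical and chosen only for illustration, and the sketch does not show how the importance values themselves are computed.

```python
from typing import Dict, List, Optional, Tuple

Node = Tuple[int, str]        # (attribute index, value), e.g. (0, "C") for [A1=C]
NIRTable = Dict[Node, float]  # importance w(c_k, I_kr) of each node in one cluster


def resemblance(point: Tuple[str, ...], nir: NIRTable) -> float:
    """Equation (1): sum the importance of the point's nodes in one cluster."""
    return sum(nir.get((a, v), 0.0) for a, v in enumerate(point))


def label(point: Tuple[str, ...], nir_tables: List[NIRTable],
          thresholds: List[float]) -> Optional[int]:
    """Return the index of the best cluster, or None for an outlier.

    Assumes at least one cluster and one threshold per cluster.
    """
    scores = [resemblance(point, nir) for nir in nir_tables]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= thresholds[best] else None


# Hypothetical importance values, not taken from the paper.
nir_c1 = {(0, "C"): 1.0, (1, "W"): 0.03, (2, "D"): 0.5}
nir_c2 = {(0, "I"): 0.5, (1, "W"): 0.03, (2, "M"): 1.0}
tables, lambdas = [nir_c1, nir_c2], [0.5, 0.5]

print(label(("B", "E", "F"), tables, lambdas))  # None: shares no node with any cluster
print(label(("I", "W", "M"), tables, lambdas))  # 1: the second cluster wins
```

Returning `None` for outliers keeps the caller free to count them, which is what the drift detection step needs.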

        As shown in fig.1, the data points in the second sliding window are labeled with
thresholds $\lambda_1 = \lambda_2 = 0.5$. The first data point $p_6$ = (B, E, F) in $S^2$ is
decomposed into 3 nodes, i.e., {[A1=B]}, {[A2=E]}, {[A3=F]}. The resemblance of $p_6$ in $c^1_1$
is zero, and in $c^1_2$ it is also zero. Since the maximal resemblance is not larger than the
threshold, the data point $p_6$ is considered an outlier. The resemblance of $p_7$ in $c^1_1$ is
0.037 and in $c^1_2$ it is 1.537 (0.5 + 0.037 + 1). The maximal resemblance value is therefore
$R(p_7, c^1_2)$, which is larger than the threshold $\lambda_2 = 0.5$, so $p_7$ is labeled with
cluster $c^1_2$.




                  p1   p2   p3   p4   p5  |  p6   p7   p8   p9   p10 |  p11  p12  p13  p14  p15
          A1      C    I    C    S    C   |  B    I    B    S    B   |  S    I    Z    I    S
          A2      W    W    W    W    W   |  E    T    E    I    O   |  W    W    P    W    W
          A3      D    M    N    M    D   |  F    H    G    H    G   |  P    P    T    P    P
                            S1                        S2                         S3

          Clustering result on S1:
            $c^1_1$:    (C, W, D), (C, W, N), (C, W, D)
            $c^1_2$:    (I, W, M), (S, W, M)

          Temporal clustering result on S2:
            $c'^2_1$:   (no data points)
            $c'^2_2$:   (I, T, H), (S, T, H)
            outliers:   (B, E, F), (B, E, G), (B, O, G)

        Fig. 1: The temporal clustering result $C'^2$ that is obtained by data labeling.
Algorithm Used:

       Let temp = $C^{[t_e, t-1]}$
       DriftingConceptDetecting(temp, $S^t$)
         out = 0
         while there is a next tuple in $S^t$ do
           read in data point $p_j$ from $S^t$
           divide $p_j$ into nodes $I_1$ to $I_q$
           for all clusters $temp_i$ in temp do
             calculate resemblance $R(p_j, temp_i)$
           end for
           find the cluster $temp_m$ with maximal resemblance
           if $R(p_j, temp_m) \ge \lambda_m$ then
             $p_j$ is assigned to $c'^t_m$
           else
             out = out + 1
           end if
         end while
         outlier = out
         {Do data labeling on current sliding window}
         numdiffclusters = 0
         for all clusters $temp_i$ in temp do
           if $\left| \frac{m_i^{[t_e,t-1]}}{\sum_{x=1}^{k^{[t_e,t-1]}} m_x^{[t_e,t-1]}} - \frac{m_i^{t}}{\sum_{x=1}^{k^{[t_e,t-1]}} m_x^{t}} \right| > \varepsilon$ then
             numdiffclusters = numdiffclusters + 1
           end if
         end for
         if $\frac{outlier}{N} > \theta$ or $\frac{numdiffclusters}{k^{[t_e,t-1]}} > \eta$ then
           {Concept drifts}
           dump out temp
           call initial clustering on $S^t$
         else
           {Concept does not drift}
           add $C'^t$ into temp
           update NIR as $C^{[t_e, t]}$
         end if
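
The drift decision at the end of the algorithm can be sketched in Python as below. This is an illustrative sketch rather than the authors' implementation: it assumes the labeling step has already produced the cluster sizes and the outlier count, the function name `concept_drifts` is hypothetical, and treating an empty clustering as having ratio 0 is a convention chosen here. The default thresholds are the values reported in the results section.

```python
from typing import List


def concept_drifts(
    last_sizes: List[int],     # m_i in the last clustering result
    current_sizes: List[int],  # m_i after labeling the current window
    outliers: int,             # points that matched no cluster
    window_size: int,          # N
    theta: float = 0.1,        # outlier threshold
    epsilon: float = 0.1,      # cluster variation threshold
    eta: float = 0.5,          # cluster difference threshold
) -> bool:
    """Return True if the current window is judged to carry a drifting concept."""
    k = len(last_sizes)
    if k == 0 or len(current_sizes) != k or window_size <= 0:
        raise ValueError("need k >= 1 matching clusters and a positive window size")
    last_total = sum(last_sizes)
    current_total = sum(current_sizes)
    num_diff = 0
    for m_last, m_cur in zip(last_sizes, current_sizes):
        # Share of points held by this cluster before and after labeling.
        ratio_last = m_last / last_total if last_total else 0.0
        ratio_cur = m_cur / current_total if current_total else 0.0
        if abs(ratio_last - ratio_cur) > epsilon:
            num_diff += 1
    return outliers / window_size > theta or num_diff / k > eta


# Fig.1: clusters of sizes 3 and 2 on S1; on S2, two points join the second
# cluster and three of the five points are outliers.
drift_s2 = concept_drifts([3, 2], [0, 2], outliers=3, window_size=5)
```

With these inputs the outlier ratio is 3/5, which exceeds the outlier threshold, so the window would trigger reclustering.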

        Since we measure the similarity between the data point $p_j$ and the cluster $c_i$ as
$R(p_j, c_i)$, the cluster with the maximal resemblance is the most appropriate cluster for that
data point. If the maximal resemblance (that of the most appropriate cluster) is smaller than the
threshold $\lambda_i$ in that cluster, the data point is seen as an outlier. In order to observe
the relationship between different clustering results, cluster relationship analysis is used to
analyze and show the changes between different clustering results. It measures the similarity of
clusters between the clustering results at different time stamps and links the similar clusters.

                                                     Cluster $c^2_1$   Cluster $c^2_2$
                                   Cluster $c^1_1$        0.012             0.182
                                   Cluster $c^1_2$        0.567             0

                                                     Cluster $c^3_1$   Cluster $c^3_2$
                                   Cluster $c^2_1$        1                 0
                                   Cluster $c^2_2$        0                 0

                         Fig. 2: The similarity table between clustering results

The cosine measure $CM(\bar{c}^1_2, \bar{c}^2_1)$ is 0.567, which is larger than
$CM(\bar{c}^1_1, \bar{c}^2_1)$. Therefore, cluster $c^2_1$ is said to be more similar to $c^1_2$
than to cluster $c^1_1$.
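
A similarity table of this kind can be computed with the standard cosine measure between node importance vectors. The Python sketch below assumes each vector is stored sparsely as a dictionary from nodes to importance values; the function names and the sample vectors are hypothetical and are not the values behind fig.2.

```python
import math
from typing import Dict, List, Tuple

Node = Tuple[int, str]  # (attribute index, value)
Vector = Dict[Node, float]


def cosine_measure(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two sparse node importance vectors."""
    dot = sum(w * v.get(node, 0.0) for node, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


def similarity_table(old: List[Vector], new: List[Vector]) -> List[List[float]]:
    """Rows: clusters of the earlier result; columns: clusters of the later one."""
    return [[cosine_measure(a, b) for b in new] for a in old]


# Hypothetical importance vectors for two consecutive clustering results.
old = [{(0, "C"): 1.0, (1, "W"): 0.5}, {(0, "I"): 1.0}]
new = [{(0, "C"): 2.0, (1, "W"): 1.0}, {(2, "P"): 1.0}]
table = similarity_table(old, new)
```

In this toy input the first old cluster and the first new cluster point in the same direction, so their entry is 1 up to rounding, while clusters that share no node get 0.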





                               Table 1: Symbols used in Algorithm

       $A_a$                 The a-th attribute in the data set.
       $C^{[t_1, t_2]}$      The clustering result from $t_1$ to $t_2$.
       $C^t$                 The clustering result on sliding window t.
       $C'^t$                The temporal clustering result on sliding window t.
       $c_j$                 The j-th cluster in C.
       $\bar{c}_i$           The node importance vector of $c_i$.
       $I_{ir}$              The r-th node in $c_i$.
       $|I_{ir}|$            The number of occurrences of $I_{ir}$.
       $k$                   The number of clusters in C.
       $m_i$                 The number of data points in $c_i$.
       $N$                   The size of the sliding window.
       $S^t$                 The sliding window t.
       $t$                   The timestamp index of the sliding window.
       $w(c_i, I_{ir})$      The importance of $I_{ir}$ in $c_i$.
       $\theta$              The outlier threshold.
       $\varepsilon$         The cluster variation threshold.
       $\eta$                The cluster difference threshold.
       $CM(c_i, c_j)$        The cosine measure between cluster vectors $\bar{c}_i$ and $\bar{c}_j$.

IV. RESULTS

        The following table shows, in terms of precision and recall, that DCD is effective at
detecting drifting concepts.

                                                               N=1000
                           Settings      drifting         precision   recall
                             D1           35.6              0.557     0.873
                             D2           39.2              0.825     0.992
                             D3           46                0.816     0.98
                             D4           44.5              0.443     0.97

                           Fig. 3: The precision and recall of the DCD

        We change clustering pairs to obtain data sets with drifting concepts and then test the
detection accuracy of algorithm DCD on those data sets. The outlier threshold θ is set to 0.1,
the cluster variation threshold ε is set to 0.1, and the cluster difference threshold η is set to
0.5. The number of clusters k, which is the required parameter in the initial clustering step and
the reclustering step, is set to the maximum number of clusters in each setting, e.g., k = 10 in
D1 and k = 20 in D3. In addition, each synthetic data set is generated by randomly combining 50
clustering results on that data set setting, and the precision and recall shown in fig.3 are the
averages of 20 experiments. The precision and recall are more than 80 percent when the size of
the sliding window is larger than 2,000. The precision is a little low when the size of the
sliding window is set to 1,000 because the drifting concepts often cross two windows: we count
only one window as a correct hit, and the other window is considered a miss. However, the
detection recall is highest when the size of the sliding window is set to 1,000, since drifting
concepts are unlikely to be missed when the data set is divided finely. For two example bank data
sets synthesized by settings D1 and D2, evaluating the clustering results on each sliding window
shows that the framework generates a new clustering result when a drifting concept is detected
and responds quickly to the trend of the evolving data set.
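
As a rough illustration of how precision and recall of drift detection can be computed, the Python sketch below compares the set of windows flagged by the detector with the set of windows where a drift was actually injected. The exact counting rule used in the experiments is not spelled out in the text, so the window-level matching and the function name `precision_recall` are assumptions.

```python
from typing import Set, Tuple


def precision_recall(detected: Set[int], actual: Set[int]) -> Tuple[float, float]:
    """Window-level precision and recall of drift detection."""
    hits = len(detected & actual)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical run: drifts start in windows 3 and 7; the detector also flags
# window 4 because the first drift crosses two windows.
p, r = precision_recall(detected={3, 4, 7}, actual={3, 7})
```

Here both true drifts are found, so recall is 1, while the extra flagged window lowers precision to 2/3, which mirrors the effect described for small windows.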

V. CONCLUSION

      In this paper, we have proposed a framework to perform clustering on categorical
time-evolving data. In order to detect the drifting concepts at different sliding windows, we
proposed the algorithm DCD, which compares the cluster distributions between the last clustering
result and the current temporal clustering result. If the results are quite different, the last
clustering result is dumped out, and the current data in this sliding window are reclustered. In
order to observe the relationship between different clustering results, cluster relationship
analysis is used to analyze and show the changes between different clustering results. The
experimental evaluation shows that performing DCD is faster than clustering the entire data set
once and that DCD can provide high-quality clustering results with correctly detected drifting
concepts. Therefore, the results demonstrate that our framework is practical for detecting
drifting concepts in time-evolving categorical data.

VI. REFERENCES

[1] D. Barbara, Y. Li, and J. Couto, Coolcat: An Entropy-Based Algorithm for Categorical
Clustering, Proc. ACM Int’l Conf. Information and Knowledge Management (CIKM), 2002.
[2] F. Cao, M. Ester, W. Qian, and A. Zhou, Density-Based Clustering over an Evolving Data
Stream with Noise, Proc. Sixth SIAM Int’l Conf. Data Mining (SDM), 2006.
[3] H.-L. Chen, K.-T. Chuang, and M.-S. Chen, Labeling Unclustered Categorical Data into
Clusters Based on the Important Attribute Values, Proc. Fifth IEEE Int’l Conf. Data Mining
(ICDM), 2005.
[4] O. Nasraoui, M. Soliman, E. Saka, A. Badia, and R. Germain, A Web Usage Mining
Framework for Mining Evolving User Profiles in Dynamic Web Sites, IEEE Trans. Knowledge
and Data Eng., vol. 20, no. 2, pp. 202-215, Feb. 2008.
[5] H.-L. Chen, M.-S. Chen, and S.-C. Lin, Catching the Trend: A Framework for Clustering
Concept-Drifting Categorical Data, IEEE Trans. Knowledge and Data Eng., vol. 21, no. 5,
May 2009.
[6] Z. Huang and M.K. Ng, A Fuzzy k-Modes Algorithm for Clustering Categorical Data, IEEE
Trans. Fuzzy Systems, 1999.




                                                 383

Weitere ähnliche Inhalte

Was ist angesagt?

Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringIJERD Editor
 
7. 10083 12464-1-pb
7. 10083 12464-1-pb7. 10083 12464-1-pb
7. 10083 12464-1-pbIAESIJEECS
 
A Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAA Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAEditor Jacotech
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmIRJET Journal
 
Scalable Constrained Spectral Clustering
Scalable Constrained Spectral ClusteringScalable Constrained Spectral Clustering
Scalable Constrained Spectral Clustering1crore projects
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsIJDKP
 
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSSVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSijscmcj
 
Clustering Approach Recommendation System using Agglomerative Algorithm
Clustering Approach Recommendation System using Agglomerative AlgorithmClustering Approach Recommendation System using Agglomerative Algorithm
Clustering Approach Recommendation System using Agglomerative AlgorithmIRJET Journal
 
Survey Paper on Clustering Data Streams Based on Shared Density between Micro...
Survey Paper on Clustering Data Streams Based on Shared Density between Micro...Survey Paper on Clustering Data Streams Based on Shared Density between Micro...
Survey Paper on Clustering Data Streams Based on Shared Density between Micro...IRJET Journal
 
GRAPH BASED LOCAL RECODING FOR DATA ANONYMIZATION
GRAPH BASED LOCAL RECODING FOR DATA ANONYMIZATIONGRAPH BASED LOCAL RECODING FOR DATA ANONYMIZATION
GRAPH BASED LOCAL RECODING FOR DATA ANONYMIZATIONijdms
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Alexander Decker
 
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace DataMPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace DataIRJET Journal
 
IRJET- A Survey of Text Document Clustering by using Clustering Techniques
IRJET- A Survey of Text Document Clustering by using Clustering TechniquesIRJET- A Survey of Text Document Clustering by using Clustering Techniques
IRJET- A Survey of Text Document Clustering by using Clustering TechniquesIRJET Journal
 
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...acijjournal
 
The improved k means with particle swarm optimization
The improved k means with particle swarm optimizationThe improved k means with particle swarm optimization
The improved k means with particle swarm optimizationAlexander Decker
 
Mine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means ClusteringMine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means Clusteringijcsity
 

Was ist angesagt? (18)

Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
 
7. 10083 12464-1-pb
7. 10083 12464-1-pb7. 10083 12464-1-pb
7. 10083 12464-1-pb
 
A Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAA Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCA
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering Algorithm
 
Scalable Constrained Spectral Clustering
Scalable Constrained Spectral ClusteringScalable Constrained Spectral Clustering
Scalable Constrained Spectral Clustering
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSSVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
 
Clustering Approach Recommendation System using Agglomerative Algorithm
Clustering Approach Recommendation System using Agglomerative AlgorithmClustering Approach Recommendation System using Agglomerative Algorithm
Clustering Approach Recommendation System using Agglomerative Algorithm
 
A0360109
A0360109A0360109
A0360109
 
Survey Paper on Clustering Data Streams Based on Shared Density between Micro...
Survey Paper on Clustering Data Streams Based on Shared Density between Micro...Survey Paper on Clustering Data Streams Based on Shared Density between Micro...
Survey Paper on Clustering Data Streams Based on Shared Density between Micro...
 
GRAPH BASED LOCAL RECODING FOR DATA ANONYMIZATION
GRAPH BASED LOCAL RECODING FOR DATA ANONYMIZATIONGRAPH BASED LOCAL RECODING FOR DATA ANONYMIZATION
GRAPH BASED LOCAL RECODING FOR DATA ANONYMIZATION
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)
 
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace DataMPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
 
IRJET- A Survey of Text Document Clustering by using Clustering Techniques
IRJET- A Survey of Text Document Clustering by using Clustering TechniquesIRJET- A Survey of Text Document Clustering by using Clustering Techniques
IRJET- A Survey of Text Document Clustering by using Clustering Techniques
 
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
COMPARATIVE PERFORMANCE ANALYSIS OF RNSC AND MCL ALGORITHMS ON POWER-LAW DIST...
 
The improved k means with particle swarm optimization
The improved k means with particle swarm optimizationThe improved k means with particle swarm optimization
The improved k means with particle swarm optimization
 
Mine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means ClusteringMine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means Clustering
 

A frame work for clustering time evolving data

International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online)
Volume 3, Issue 3, October-December (2012), pp. 377-383
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2012): 3.9580 (Calculated by GISI), www.jifactor.com

A FRAME WORK FOR CLUSTERING TIME EVOLVING DATA USING SLIDING WINDOW TECHNIQUE

Y. Swapna (1), S. Ravi Sankar (2)
(1) Faculty, CSE Department, National Institute of Technology, Goa, India, spr@nitgoa.ac.in
(2) Faculty, CSE Department, National Institute of Technology, Goa, India, srs@nitgoa.ac.in

ABSTRACT

Clustering is the process of dividing a data set into mutually exclusive groups such that the members of each group are as "close" as possible to one another and different groups are as "far" as possible from one another. Sampling represents a large data set by a smaller random sample of its data and is used to improve the efficiency of clustering. When sampling is applied, however, the points that are not sampled do not receive cluster labels after the normal process. This problem has been solved for the numerical domain, whereas clustering of time-evolving data in the categorical domain remains a challenging issue. In this paper, a sliding window of a specified size is used to form subsets of the data set, that is, to collect data from the database and transfer it to the module. A drifting concept detection (DCD) algorithm is proposed that finds the number of outliers that cannot be assigned to any cluster. The objective of this algorithm is to compare the distribution of clusters and outliers between the last clustering result and the current temporal clustering result.
The experimental evaluation shows that performing DCD is faster than clustering the entire data set once, and that DCD can provide high-quality clustering results with correctly detected drifting concepts.

Keywords: clustering, sampling, categorical domain, labels, sliding window, drifting concept detection.

I. INTRODUCTION

Our present information-age society thrives and evolves on knowledge. Knowledge is derived from information gleaned from a wide variety of reservoirs of data (databases). Clustering is an important technique for exploratory data analysis and has been the focus of substantial research in several domains for decades. Clusters are connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a low density of points. Clustering is useful for classification and can reveal structure in high-dimensional data spaces, where outliers may also be of interest. It has a wide range of applications, including statistical pattern recognition, machine learning, and information retrieval. Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. It helps us to gain insight into the data distribution.

In real-world domains, the concept of interest may depend on some hidden context that is not given explicitly in the form of predictive features. This becomes a problem because such concepts drift with time. A suitable example is the buying preferences of customers, which may change with time depending on their needs, climatic conditions, discounts, etc. Since the concepts behind the data evolve with time, the underlying clusters may also change significantly with time. Ignoring drifting concepts not only decreases the quality of the clusters but also disregards the expectations of users, who usually require recent clustering results.

Many works have explored the problem of clustering time-evolving data in the numerical domain. Categorical attributes also prevalently exist in real data with drifting concepts; for example, Web logs that record the browsing history of users, stock market details, and buying records of customers often evolve with time. Previous works on clustering categorical data focus on clustering the entire data set and do not take drifting concepts into consideration. Consequently, the problem of clustering time-evolving data in the categorical domain remains a challenging issue.
The objective is to propose a framework for performing clustering on categorical time-evolving data. The goal is to use a generalized clustering framework that utilizes existing clustering algorithms and detects whether there is a drifting concept in the incoming data, instead of designing a specific clustering algorithm. The sliding window technique is adopted to detect the drifting concepts.

II. RELATED WORK

Many different numerical clustering algorithms that consider time-evolving data, as well as traditional categorical clustering algorithms, have been proposed [1]. An effective and efficient method called CluStream for clustering large evolving data streams was proposed in [5]. This method tries to cluster the whole stream at one time rather than viewing the stream as a changing process over time. A density-based method called DenStream was proposed in [2] for discovering clusters in an evolving data stream.

Evolutionary clustering algorithms were proposed in [5] and [3]. Both adopt the same approach, which performs data clustering over time and tries to optimize two potentially conflicting criteria: first, the previous and the present clusterings should be similar when there is no drifting concept; second, the clustering should reflect the data that arrived at that time step when there is a drifting concept. In [6], a generic framework for this problem was presented in which the k-means and agglomerative hierarchical clustering algorithms were extended according to the problem domain. In [5], a measure of temporal smoothness is integrated into the overall measure of clustering quality. As a result, that method yields stable and consistent clustering results that are less sensitive to short-term noise while remaining adaptive to long-term cluster drifts. The previously proposed methods have concentrated on the problem of clustering time-evolving data in the numerical domain.
In [4], the problem of clustering categorical data is discussed, with clustering performed on customer transaction data in a market database.
In [6] and [4], frameworks for clustering categorical time-evolving data have been proposed. In particular, the rough membership function of rough set theory represents a concept that induces a fuzzy set. Several extensions of k-modes have been presented for different objectives, such as fuzzy k-modes [6] and initial points refinement [2]. These categorical algorithms focus on clustering the entire data set and do not consider time-evolving trends.

III. THE PROPOSED APPROACH

We propose a generalized clustering framework that utilizes existing clustering algorithms and detects whether or not there is a drifting concept in the incoming data. To detect drifting concepts at different sliding windows, we propose the algorithm DCD, which compares the cluster distributions of the last clustering result and the current temporal clustering result. The data set is a collection of data extracted from the database to be clustered; it is time-evolving categorical data (it is not organized on a sequential basis). We used a synthetic data generator [5] to generate data sets with different numbers of data points and attributes. The number of data points varies from 10,000 to 100,000, and the dimensionality is in the range of 10-50. In all synthetic data sets, each dimension possesses 20 attribute values. The sliding window is used to form subsets of a specified size from the data set; that is, each window collects data from the database and transfers it to the clustering module. In this paper, a practical categorical clustering representative named "Node Importance Representative" (abbreviated as NIR) is utilized. It represents clusters by measuring the importance of each attribute value in the clusters.
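The sliding-window step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes consecutive, non-overlapping windows of a fixed size, as suggested by the windows S1, S2, and S3 in Fig. 1, and the function name `sliding_windows` and the variable names are hypothetical. The sample points are the fifteen data points listed in Fig. 1.

```python
from typing import Iterator, List, Sequence, Tuple

Point = Tuple[str, ...]  # one categorical data point, e.g. ("C", "W", "D")


def sliding_windows(points: Sequence[Point], window_size: int) -> Iterator[List[Point]]:
    """Yield consecutive, non-overlapping windows of at most window_size points.

    Assumption: windows do not overlap; the last window may be shorter.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    for start in range(0, len(points), window_size):
        yield list(points[start:start + window_size])


# The 15 data points p1..p15 of Fig. 1 (attributes A1, A2, A3).
POINTS: List[Point] = [
    ("C", "W", "D"), ("I", "W", "M"), ("C", "W", "N"), ("S", "W", "M"), ("C", "W", "D"),
    ("B", "E", "F"), ("I", "T", "H"), ("B", "E", "G"), ("S", "I", "H"), ("B", "O", "G"),
    ("S", "W", "P"), ("I", "W", "P"), ("Z", "P", "T"), ("I", "W", "P"), ("S", "W", "P"),
]

# With a window size of 5 the data set splits into S1, S2, and S3.
WINDOWS = list(sliding_windows(POINTS, window_size=5))
for t, window in enumerate(WINDOWS, start=1):
    print(f"S{t}: {window}")
```

Each window would then be handed to the clustering module in turn; only the window size is a parameter of this step.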
The Drifting Concept Detection (DCD) algorithm (listed under "Algorithm Used" below) is used to detect the difference in cluster distribution between the current data subset and the last clustering result. For proper evaluation, each data point is labeled with a cluster, and points that do not belong to any cluster are called outliers. Reclustering is performed if the difference between the cluster distributions is large enough. A data point $p_j$ is assigned to the cluster $c_k$, $1 \le k \le l$, with which it obtains the maximal resemblance. The resemblance between a given data point $p_j$ and the NIR table of cluster $c_k$ is defined by the following equation:

$$R(p_j, c_k) = \sum_{r=1}^{q} w(c_k, I_{kr}) \qquad (1)$$

where $I_{kr}$ is one entry in the NIR table of cluster $c_k$. As shown in equation (1), the resemblance can be obtained directly by summing up the nodes' importance in the NIR table of cluster $c_k$. A data point that contains nodes that are more important in one cluster than in another obtains a larger resemblance to that cluster. If the resemblance of a data point to every cluster is small, the point is treated as an outlier. Therefore, a threshold $\lambda_i$ is set in each cluster to identify outliers. The decision function is defined as follows:

$$\text{Label} = \begin{cases} c_{i^*}, & \text{if } \max_{1 \le i \le l} R(p_j, c_i) \ge \lambda_{i^*}, \\ \text{outlier}, & \text{otherwise,} \end{cases}$$

where $c_{i^*}$ is the cluster that attains the maximal resemblance.

As shown in Fig. 1, the data points in the second sliding window undergo data labeling with thresholds $\lambda_1 = \lambda_2 = 0.5$. The first data point $p_6 = (B, E, F)$ in $S^2$ is decomposed into 3 nodes, i.e., {[A1=B]}, {[A2=E]}, {[A3=F]}. The resemblance of $p_6$ in $c_1^1$ is zero, and in $c_2^1$ it is also zero. Since the maximal resemblance is not larger than the threshold, the data point $p_6$ is considered an outlier. The resemblance of $p_7$ in $c_1^1$ is 0.037 and in $c_2^1$ it is 1.537 (0.5 + 0.037 + 1). The maximal resemblance value is therefore $R(p_7, c_2^1)$, and since it is larger than the threshold $\lambda_2 = 0.5$, $p_7$ is labeled to cluster $c_2^1$.

        p1  p2  p3  p4  p5  | p6  p7  p8  p9  p10 | p11 p12 p13 p14 p15
A1      C   I   C   S   C   | B   I   B   S   B   | S   I   Z   I   S
A2      W   W   W   W   W   | E   T   E   I   O   | W   W   P   W   W
A3      D   M   N   M   D   | F   H   G   H   G   | P   P   T   P   P
Window          S1          |         S2          |         S3

Clustering result on S1: $c_1^1$ = {(C, W, D), (C, W, N), (C, W, D)}, $c_2^1$ = {(I, W, M), (S, W, M)}.
Temporal clustering result on S2 (columns $c_1'^2$, $c_2'^2$, outliers), points in the order listed: (I, T, H), (S, T, H), (B, E, F), (B, E, G), (B, O, G).

Fig. 1: The temporal clustering result $C'^2$ that is obtained by data labeling.

Algorithm Used:

DriftingConceptDetecting(temp, $S^t$), where temp = $C^{[t_e, t-1]}$
    out = 0
    while there is next tuple in $S^t$ do
        read in data point $p_j$ from $S^t$
        divide $p_j$ into nodes $I_1$ to $I_q$
        for all clusters $temp_i$ in temp do
            calculate Resemblance $R(p_j, temp_i)$
        end for
        find Maximal Resemblance $temp_m$
        if $R(p_j, temp_m) \ge \lambda_m$ then
            $p_j$ is assigned to $c_m'^t$
        else
            out = out + 1
        end if
    end while
    outlier = out   {Do data labeling on current sliding window}
    numdiffclusters = 0
    for all clusters $temp_i$ in temp do
        if $\left| \frac{m_i^{[t_e,t-1]}}{\sum_{x=1}^{k^{[t_e,t-1]}} m_x^{[t_e,t-1]}} - \frac{m_i^{t}}{\sum_{x=1}^{k^{[t_e,t-1]}} m_x^{t}} \right| > \varepsilon$ then
            numdiffclusters = numdiffclusters + 1
        end if
    end for
    if $\frac{outlier}{N} > \theta$ or $\frac{numdiffclusters}{k^{[t_e,t-1]}} > \eta$ then
        {Concept drifts}
        dump out temp
        call initial clustering on $S^t$
    else
        {Concept does not drift}
        add $C'^t$ into temp
        update NIR as $C^{[t_e, t]}$
    end if

Since the similarity between the data point $p_j$ and the cluster $c_i$ is measured as $R(p_j, c_i)$, the cluster with the maximal resemblance is the most appropriate cluster for that data point. If the maximal resemblance is smaller than the threshold of that most appropriate cluster, the data point is seen as an outlier. To observe the relationship between different clustering results, cluster relationship analysis is used to analyze and show the changes between them. It measures the similarity of clusters between the clustering results at different time stamps and links the similar clusters.

                    Cluster $c_1^2$    Cluster $c_2^2$
Cluster $c_1^1$     0.012              0.182
Cluster $c_2^1$     0.567              0

                    Cluster $c_1^3$    Cluster $c_2^3$
Cluster $c_1^2$     1                  0
Cluster $c_2^2$     0                  0

Fig. 2: The similarity table between clustering results

The cosine measure $CM(c_2^1, c_1^2)$ is 0.567, which is larger than $CM(c_1^1, c_1^2) = 0.012$. Therefore cluster $c_1^2$ is said to be more similar to $c_2^1$ than to cluster $c_1^1$.
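The data labeling of equation (1), the two drift tests of the DCD algorithm, and the cosine measure can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: each NIR table is assumed to be available as a mapping from node to importance, the importance values in `C1` and `C2` are hypothetical (the paper does not list them, so the resemblance values differ from the worked example), and all function and variable names are invented for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Node = Tuple[int, str]        # (attribute index, value), e.g. (1, "B") for [A1=B]
NIRTable = Dict[Node, float]  # node -> importance w(c_i, I_ir) within one cluster


def resemblance(point: Sequence[str], nir: NIRTable) -> float:
    """Equation (1): sum the importance of the point's nodes in one NIR table."""
    return sum(nir.get((a, v), 0.0) for a, v in enumerate(point, start=1))


def label_point(point: Sequence[str], nir_tables: List[NIRTable],
                thresholds: List[float]) -> Optional[int]:
    """Return the index of the cluster with maximal resemblance, or None for an outlier."""
    scores = [resemblance(point, nir) for nir in nir_tables]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= thresholds[best] else None


def concept_drifts(window: List[Sequence[str]], nir_tables: List[NIRTable],
                   thresholds: List[float], last_sizes: List[int],
                   theta: float, epsilon: float, eta: float) -> bool:
    """Label every point of the window, then apply the outlier and cluster-ratio tests."""
    labels = [label_point(p, nir_tables, thresholds) for p in window]
    outliers = labels.count(None)
    new_sizes = [labels.count(i) for i in range(len(nir_tables))]
    last_total, new_total = sum(last_sizes), sum(new_sizes)
    num_diff = 0
    for m_last, m_new in zip(last_sizes, new_sizes):
        new_ratio = m_new / new_total if new_total else 0.0
        if abs(m_last / last_total - new_ratio) > epsilon:
            num_diff += 1  # this cluster's share of the data changed noticeably
    return outliers / len(window) > theta or num_diff / len(nir_tables) > eta


def cosine_measure(a: NIRTable, b: NIRTable) -> float:
    """Cosine of the angle between two node importance vectors."""
    dot = sum(w * b.get(node, 0.0) for node, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical NIR tables for the two clusters found on S1 (values are made up).
C1: NIRTable = {(1, "C"): 1.0, (2, "W"): 0.03, (3, "D"): 0.67, (3, "N"): 0.33}
C2: NIRTable = {(1, "I"): 0.5, (1, "S"): 0.5, (2, "W"): 0.02, (3, "M"): 1.0}
S1 = [("C", "W", "D"), ("I", "W", "M"), ("C", "W", "N"), ("S", "W", "M"), ("C", "W", "D")]
S2 = [("B", "E", "F"), ("I", "T", "H"), ("B", "E", "G"), ("S", "I", "H"), ("B", "O", "G")]

# Thresholds follow the settings quoted in the paper: lambda = 0.5, theta = 0.1,
# epsilon = 0.1, eta = 0.5. last_sizes holds the cluster sizes of the last result.
DRIFT_ON_S2 = concept_drifts(S2, [C1, C2], [0.5, 0.5], last_sizes=[3, 2],
                             theta=0.1, epsilon=0.1, eta=0.5)
print("p6 label:", label_point(S2[0], [C1, C2], [0.5, 0.5]))
print("drift detected on S2:", DRIFT_ON_S2)
```

With these hypothetical importances, p6 shares no node with either cluster and is treated as an outlier, while p7 is labeled to the second cluster, in agreement with the worked example. Returning `None` for outliers keeps the labeling step separate from the drift decision, so the thresholds can be varied independently.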
Table 1: Symbols used in Algorithm

Symbol                  Description
$A_a$                   The a-th attribute in the data set.
$C^{[t_1, t_2]}$        The clustering result from $t_1$ to $t_2$.
$C^t$                   The clustering result on sliding window t.
$C'^t$                  The temporal clustering result on sliding window t.
$c_j$                   The j-th cluster in C.
$\bar{c}_i$             The node importance vector of $c_i$.
$I_{ir}$                The r-th node in $c_i$.
$|I_{ir}|$              The number of occurrences of $I_{ir}$.
$k$                     The number of clusters in C.
$m_i$                   The number of data points in $c_i$.
$N$                     The size of the sliding window.
$S^t$                   The sliding window t.
$t$                     The timestamp index of the sliding window.
$w(c_i, I_{ir})$        The importance of $I_{ir}$ in $c_i$.
$\theta$                The outlier threshold.
$\varepsilon$           The cluster variation threshold.
$\eta$                  The cluster difference threshold.
$CM(c_i, c_j)$          The cosine measure between cluster vectors $\bar{c}_i$ and $\bar{c}_j$.

IV. RESULTS

The following table shows, in terms of precision and recall, how effective DCD is at detecting drifting concepts.

N = 1000
Settings    Drifting    Precision    Recall
D1          35.6        0.557        0.873
D2          39.2        0.825        0.992
D3          46          0.816        0.98
D4          44.5        0.443        0.97

Fig. 3: The precision and recall of the DCD

We change clustering pairs to obtain data sets with drifting concepts and then test the detection accuracy of algorithm DCD on those data sets. The outlier threshold θ is set to 0.1, the cluster variation threshold ε is set to 0.1, and the cluster difference threshold η is set to 0.5. The number of clusters k, which is the required parameter of the initial clustering step and the reclustering step, is set to the maximum number of clusters in each setting, e.g., k = 10 in D1 and k = 20 in D3. In addition, each synthetic data set is generated by randomly combining 50 clustering results for that data set setting, and the precision and recall shown in Fig. 3 are the averages of 20 experiments.
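For reference, the precision and recall of drift detection in a single experiment can be computed as sketched below. This is a generic illustration under assumptions the paper does not state: drifts are identified by the index of the sliding window in which they occur, a detection counts as a hit only if it matches a true drift window exactly, and the function name and the sample window indices are hypothetical.

```python
from typing import Set, Tuple


def precision_recall(detected: Set[int], actual: Set[int]) -> Tuple[float, float]:
    """Return (precision, recall) of detected drift windows against the true ones."""
    hits = len(detected & actual)  # windows that were both detected and truly drifting
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical window indices for one experiment.
DETECTED = {3, 7, 8, 12}
ACTUAL = {3, 7, 12, 15}
PRECISION, RECALL = precision_recall(DETECTED, ACTUAL)
print(f"precision={PRECISION:.2f}, recall={RECALL:.2f}")
```

Averaging these two values over repeated experiments would give figures comparable in form to those in Fig. 3.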
The precision and recall are more than 80 percent when the size of the sliding window is larger than 2,000. The precision is somewhat lower when the size of the sliding window is set to 1,000, because drifting concepts often cross two windows; in that case only one window is counted as a
correct hit, and the other window is considered a miss. However, the detection recall is highest when the size of the sliding window is set to 1,000, because drifting concepts are unlikely to be missed when the data set is divided into finer windows. When two example bank data sets synthesized with settings D1 and D2 are used and the clustering results are evaluated on each sliding window, the framework generates a new clustering result whenever a drifting concept is detected, and it responds quickly to the trend of the evolving data set.

V. CONCLUSION

In this paper we have proposed a framework to perform clustering on categorical time-evolving data. To detect drifting concepts at different sliding windows, we proposed the algorithm DCD, which compares the cluster distributions of the last clustering result and the current temporal clustering result. If the results are quite different, the last clustering result is dumped out, and the current data in the sliding window is reclustered. To observe the relationship between different clustering results, cluster relationship analysis is used to analyze and show the changes between them. The experimental evaluation shows that performing DCD is faster than clustering the entire data set once, and that DCD can provide high-quality clustering results with correctly detected drifting concepts. Therefore, the results demonstrate that our framework is practical for detecting drifting concepts in time-evolving categorical data.

VI. REFERENCES

[1] D. Barbara, Y. Li, and J. Couto, Coolcat: An Entropy-Based Algorithm for Categorical Clustering, Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM), 2002.
[2] F. Cao, M. Ester, W. Qian, and A. Zhou, Density-Based Clustering over an Evolving Data Stream with Noise, Proc. Sixth SIAM Int'l Conf. Data Mining (SDM), 2006.
[3] H.-L. Chen, K.-T. Chuang, and M.-S. Chen, Labeling Unclustered Categorical Data into Clusters Based on the Important Attribute Values, Proc. Fifth IEEE Int'l Conf. Data Mining (ICDM), 2005.
[4] O. Nasraoui, M. Soliman, E. Saka, A. Badia, and R. Germain, A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites, IEEE Trans. Knowledge and Data Eng., vol. 20, no. 2, pp. 202-215, Feb. 2008.
[5] H.-L. Chen, M.-S. Chen, and S.-C. Lin, Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data, IEEE Trans. Knowledge and Data Eng., vol. 21, no. 5, May 2009.
[6] Z. Huang and M.K. Ng, A Fuzzy k-Modes Algorithm for Clustering Categorical Data, IEEE Trans. Fuzzy Systems, 1999.