GET IEEE BIG DATA, JAVA, DOTNET, ANDROID, NS2, MATLAB, AND EMBEDDED PROJECTS AT LOW COST WITH BEST QUALITY. PLEASE CONTACT THE NUMBER BELOW.
FOR MORE INFORMATION, PLEASE FIND THE DETAILS BELOW:
Nexgen Technology
No. 66, 4th Cross, Venkata Nagar,
Near SBI ATM,
Puducherry.
Email Id: praveen@nexgenproject.com
Mobile: 9791938249
Telephone: 0413-2211159
www.nexgenproject.com
CONTACT: PRAVEEN KUMAR. L (+91 9791938249)
MAIL ID: sunsid1989@gmail.com, praveen@nexgenproject.com
Web: www.nexgenproject.com, www.finalyear-ieeeprojects.com
A HYBRID APPROACH TO CLUSTERING IN BIG DATA
ABSTRACT:
Clustering of big data has received much attention recently. In this paper, we
present a new clusiVAT algorithm and compare it with four other popular data
clustering algorithms. Three of the four comparison methods are based on the
well known, classical batch k-means model. Specifically, we use k-means, single
pass k-means, online k-means, and clustering using representatives (CURE) for
numerical comparisons. clusiVAT is based on sampling the data, imaging the
reordered distance matrix to estimate the number of clusters in the data visually,
clustering the samples using a relative of single linkage (SL), and then
noniteratively extending the labels to the rest of the dataset using the nearest
prototype rule. Previous work has established that clusiVAT produces true SL
clusters in compact-separated data. We have performed experiments to show
that k-means and its modified algorithms suffer from initialization issues that
cause many failures. On the other hand, clusiVAT needs no initialization, and
almost always finds partitions that accurately match ground truth labels in labeled
data. CURE also finds SL type partitions but is much slower than the other four
algorithms. In our experiments, clusiVAT proves to be the fastest and most
accurate of the five algorithms; e.g., it recovers 97% of the ground truth labels in
the real world KDD-99 cup data (4 292 637 samples in 41 dimensions) in 76 s.
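The final, noniterative label-extension step (the nearest prototype rule) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name and toy data are ours:

```python
import numpy as np

def nearest_prototype_labels(samples, sample_labels, data):
    """Extend labels from a small clustered sample to the whole dataset:
    each point receives the label of its nearest sample (prototype).
    One pass over the data, no iteration."""
    # pairwise squared Euclidean distances, shape (n_data, n_samples)
    d2 = ((data[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)
    return sample_labels[d2.argmin(axis=1)]

# Toy usage: two labeled prototypes, three unlabeled points
samples = np.array([[0.0, 0.0], [10.0, 10.0]])
sample_labels = np.array([0, 1])
data = np.array([[0.5, -0.2], [9.8, 10.1], [0.1, 0.3]])
print(nearest_prototype_labels(samples, sample_labels, data))  # → [0 1 0]
```

Because the expensive clustering is done only on the sample, this extension step is what lets the scheme scale to millions of points.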
EXISTING SYSTEM:
Data clustering is primarily concerned with separating objects into k different
groups, which presupposes one important preclustering task, namely, estimating
the number of clusters in the data (clustering tendency). The visual assessment of
tendency (VAT) algorithm [16] addresses the question of clustering tendency by
reordering the dissimilarity matrix D to obtain D∗ so that different clusters may be
displayed as dark blocks along the diagonal of the image of D∗. SL proceeds by
repeatedly connecting the nearest unconnected vertex to the growing tree until
the complete minimum spanning tree (MST) is formed. k clusters are then
obtained by cutting the largest k − 1 edges of the MST.
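The MST-cutting step can be sketched with SciPy as follows. This is an illustrative sketch under our own assumptions, not the paper's code (note that ties at the cut weight may remove more than k − 1 edges):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def single_linkage_mst(points, k):
    """Single-linkage clustering via the MST: build the minimum spanning
    tree of the complete graph on the points, delete the k - 1 largest
    edges, and label each connected component as one cluster."""
    d = squareform(pdist(points))              # dense pairwise distances
    mst = minimum_spanning_tree(d).toarray()
    if k > 1:
        cut = np.sort(mst[mst > 0])[-(k - 1)]  # weight of the (k-1)-th largest edge
        mst[mst >= cut] = 0                    # ties here may cut extra edges
    _, labels = connected_components(mst, directed=False)
    return labels

# Two chain-like groups, well separated
pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
print(single_linkage_mst(pts, 2))  # → [0 0 1 1]
```

Cutting only the heaviest bridges is exactly why SL handles elongated, chain-like clouds that centroid methods miss.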
SL performs best if the clusters are long, chain-like clouds, well separated from
each other. As cluster separation decreases and the clusters in the data start
merging with each other, SL becomes unreliable. Nonetheless, SL has been
successfully used in many data clustering applications. In the field of astronomy,
dark matter halos were discovered by Lacey and Cole [17] using SL. In the field of
wireless sensor networks, Moshtaghi et al. [18] used SL for anomaly detection.
Dendrograms, which are visual representations of linkage clusters, are used in
many numerical taxonomy applications [19]. In the field of healthcare, SL has
been used to segment time-series sensor data for patient monitoring at eldercare
facilities [20].
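The VAT reordering described at the start of this section can be sketched, under our own simplifying assumptions, as a Prim-like pass over the dissimilarity matrix:

```python
import numpy as np

def vat_reorder(D):
    """VAT reordering of a symmetric dissimilarity matrix D: start from
    one endpoint of the largest dissimilarity, then repeatedly append the
    unvisited object closest to the visited set. Dark blocks along the
    diagonal of the reordered matrix suggest the number of clusters."""
    n = D.shape[0]
    order = [int(np.unravel_index(D.argmax(), D.shape)[0])]
    remaining = [i for i in range(n) if i != order[0]]
    while remaining:
        # distances from every ordered object to every remaining object
        sub = D[np.ix_(order, remaining)]
        order.append(remaining.pop(int(sub.min(axis=0).argmin())))
    return D[np.ix_(order, order)], order

# Four objects on a line: two groups, {10, 11} and {0, 1}
pts = np.array([10.0, 11.0, 0.0, 1.0])
D = np.abs(pts[:, None] - pts[None, :])
reordered, order = vat_reorder(D)
print(order)  # → [1, 0, 3, 2]: each group ends up contiguous
```

After reordering, the two 2×2 low-dissimilarity blocks sit on the diagonal, which is what a user "counts" visually to estimate k.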
PROPOSED SYSTEM:
In this paper, we discuss two connectivity-based algorithms, clusiVAT and
clustering using representatives (CURE). Centroid-based clustering algorithms
represent clusters as groups located in close proximity to their cluster centers.
Most centroid-based models depend on optimizing an objective function, which
typically measures a property such as: 1) intercluster separation; 2) within-cluster
variance; or 3) both. Technologies such as social media, mobile computing, and
the realization of the Internet of Things (IoT) generate an exorbitant amount of
data every day, which comprise the big data problem. Big data approaches
currently consider one or more aspects of the so-called 5Vs (volume, velocity,
variety, value, and veracity). This paper concentrates on the volume aspect of big
data, which cannot be handled by conventional data clustering algorithms and
therefore requires novel techniques. In this paper, “k-means” refers to the batch
version. The k-means algorithm is easy to implement and is computationally
efficient, but it has
various limitations. For example, the number of clusters is an input for k-means,
which is usually not known. More worrisome is the fact that k-means often gets
stuck at a local trap state of its objective function, which may lead to incorrect
cluster interpretations. This problem is usually ascribed to poor initialization.
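The initialization sensitivity can be demonstrated with a plain NumPy implementation of batch (Lloyd's) k-means; the toy data and seedings below are our own illustration:

```python
import numpy as np

def kmeans(X, centers, iters=50):
    """Plain batch (Lloyd's) k-means from a given initialization."""
    for _ in range(iters):
        # assign each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute centers; keep the old center if a cluster goes empty
        centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(len(centers))])
    return labels, centers

# Three tight groups on a line, at 0, 5, and 10
X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0], [10.1]])
good, _ = kmeans(X, np.array([[0.0], [5.0], [10.0]]))
bad, _ = kmeans(X, np.array([[0.0], [0.1], [7.5]]))  # two seeds in one group
print(good)  # → [0 0 1 1 2 2]  (matches the three groups)
print(bad)   # → [0 1 2 2 2 2]  (stuck: one group split, two merged)
```

Both runs converge, but only the first recovers the true structure; no amount of further iteration rescues the second, which is the "local trap state" behavior described above.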
Another limitation of k-means is that its distance-based model for identifying
good clusters depends on the topology of the norm used in its objective function.
The usual model uses an inner product norm whose topology matches well with
elliptically shaped clusters. Furthermore, k-means tries to impose the same shape
on all k clusters. Thus, in some sense k-means and SL work well for data
distributions at geometrically opposite extremes. A large number of algorithms
based on both SL and k-means have been proposed for the big data clustering
problem. To the best of our knowledge, the first scalable SL-based algorithm was
proposed, where it was called scalable-VAT (sVAT)-SL. The clusiVAT model and
algorithm proposed in this paper are extensions of ideas presented in earlier work. Another
scalable relative of sVAT-SL was discussed and compared to a fast MST algorithm
called filter-Kruskal. As for the big data versions of k-means, a hierarchical version
that divides the data into two parts at each step before clustering, named
bisecting k-means, was proposed. A fast, scalable version of k-means was
presented, which does not require all the data to be stored in main memory at
the same time. A fuzzy algorithm based on k-means for big data has also been proposed.
Eschrich et al. replaced group points with the group centroid to speed up a fuzzy
version of k-means for big data. Feldman et al. used coresets to approximate a
large number of datapoints from big data by a single point. In this paper, we have
used two big data adaptations of k-means, namely single pass k-means (spkm) and
online k-means (okm), which split the big dataset into small chunks of data before
clustering for a faster run time. An application of k-means-based clustering is also
presented.
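A chunk-wise variant in the spirit of spkm/okm can be sketched as follows. This simplified single-pass scheme is our own illustration, not the exact algorithms compared in the paper:

```python
import numpy as np

def streaming_kmeans(chunks, centers):
    """Chunk-wise k-means: process the data one chunk at a time, keeping
    only k weighted centroids between chunks, so the whole dataset never
    needs to sit in main memory at once."""
    centers = centers.astype(float)
    counts = np.zeros(len(centers))
    for X in chunks:
        # assign the chunk's points to the nearest current centroid
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # fold each group into its centroid as a running weighted mean
        for j in range(len(centers)):
            pts = X[labels == j]
            if len(pts):
                centers[j] = ((counts[j] * centers[j] + pts.sum(axis=0))
                              / (counts[j] + len(pts)))
                counts[j] += len(pts)
    return centers

# Two Gaussian blobs, fed to the algorithm in 10 chunks
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.1, (500, 2)),
                  rng.normal(5.0, 0.1, (500, 2))])
rng.shuffle(data)
chunks = np.array_split(data, 10)   # only one chunk "in memory" at a time
centers = streaming_kmeans(chunks, np.array([[1.0, 1.0], [4.0, 4.0]]))
print(centers)  # centroids end up near (0, 0) and (5, 5)
```

The memory footprint is one chunk plus k centroids, which is the property that makes such variants attractive for the volume aspect of big data; like batch k-means, they remain sensitive to the initial seeds.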
CONCLUSION:
In this paper, we have illustrated our new clusiVAT algorithm for big datasets and
have compared its performance to four other popular clustering algorithms: 1) k-
means; 2) spkm; 3) okm; and 4) CURE. To show the usefulness of clusiVAT in
terms of CPU time and partition accuracy (PA), we performed experiments on 24
2-D synthetic datasets (having a maximum of 1 000 000 datapoints), nine high-dimensional
synthetic datasets (having a maximum of 500 000 datapoints in 500 dimensions),
and two real-life big datasets (the largest of which has 4 292 637 vectors with 41
features each). We found that for compact-separated (CS) datasets our new clusiVAT
gives an accuracy of 100% in much less time than k-means and its variants, and CURE.
For 2-D non-CS datasets, clusiVAT gives quite high accuracy (≥99.8%) in 12–18 times
less CPU time than k-means and its relatives, and 60–90 times less CPU time than CURE.