This document summarizes and analyzes clustering algorithms for big data mining. It discusses traditional clustering techniques (partitioning, hierarchical, density-based, etc.) and evaluates them based on their ability to handle big data's volume, variety, and velocity characteristics. The document also proposes a MapReduce framework for implementing clustering algorithms for big data in a parallel and distributed manner. It experimentally compares execution times of traditional k-means clustering versus k-means using the proposed MapReduce approach.
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapReduce Capability
1. A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapReduce Capability
Kamlesh Kumar Pandey
Research Scholar
Dept. of Computer Science & Applications
Dr. HariSingh Gour Vishwavidyalaya (A Central University), Sagar, M.P.
E-mail: kamleshamk@gmail.com
Paper presentation at the International Conference on Social Networking and Computational Intelligence (Paper ID: 173)
2. Content
• Objectives
• Big Data
• Big Data Mining
• Clustering taxonomy
• Analysis of Clustering Algorithm for Big Data Mining
• Summary of Clustering Algorithms Based on the Three Dimensions of Big Data
• Proposed MapReduce Framework for the Clustering Algorithm
• Experimental Results
3. Objectives
• The objective of this study is to identify which traditional clustering algorithms
suit big data with respect to volume, variety, and velocity, and to build a common
executable framework for clustering algorithms using the MapReduce
approach for big data mining.
4. Big Data
• Technology is growing very fast. Organizations, industries, and individuals are
moving towards the Internet of Things, cloud computing, wireless sensor networks, social
media, and the internet. These sources generate data that grows every second, minute, or
hour, reaching terabytes or petabytes in size.
• Diebold et al. (2000) were the first to discuss the term Big Data in a research
paper. These authors define Big Data as follows: if a data set is larger than a
gigabyte, it is known as Big Data.
• Doug Laney (2001) was the first person to give a proper definition of Big Data.
He gave three characteristics of Big Data, Volume, Variety, and Velocity, which are
known as the 3 V’s of Big Data management. If traditional data meets two of these
basic characteristics at the same time, it falls under Big Data.
• Gartner (2012): “Big data is high-volume, high-velocity and high-variety information
assets that demand cost-effective, innovative forms of information processing for
enhanced insight and decision making.”
5. Big Data V’s
• Seven V’s are currently used to describe Big Data. The first three, Volume,
Variety, and Velocity, are the main characteristics of big data. The additional four,
Veracity, Variability, Value, and Visualization, depend on the organization.
6. Big Data Mining
• Big Data Mining means fetching requested information, uncovering
hidden relationships or patterns, or extracting needed information or
knowledge from datasets that meet the three V’s of Big
Data and have higher complexity.
7. Clustering
• Clustering is one of the approaches for analyzing and discovering
complex relations and patterns in data, in the form of underlying groups of
unlabeled objects. From a Big Data perspective, a clustering algorithm must
deal with high volume, high variety, and high velocity in a scalable way.
8. Clustering Taxonomy
• Partitioning-Based Clustering: These methods divide the dataset into
K partitions based on a distance function.
• Hierarchical-Based Clustering: In this approach, large data are organized
hierarchically based on proximity, which makes relationships between data
points easy to detect.
• Density-Based Clustering: These methods divide the dataset
based on regions of higher density in the data space.
• Grid-Based Clustering: The core idea of grid clustering algorithms is that the original
data space is converted into a grid whose cells define the size used for clustering.
• Model-Based Clustering: These methods divide the dataset based
on models such as mathematical models and statistical distributions.
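The partitioning idea in the first bullet above can be illustrated with a minimal k-means sketch. This is not the implementation evaluated in the presentation. The function name `kmeans`, the Euclidean distance, the caller-supplied starting centroids, and the toy points are all assumptions made for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Minimal k-means sketch: K partitions chosen by a distance function.

    `points` is a list of equal-length numeric tuples and `centroids` is the
    caller-supplied list of K starting centroids (an assumption; the slides
    do not say how centroids are initialised).
    """
    centroids = list(centroids)
    k = len(centroids)
    for _ in range(iterations):
        # Assignment step: each point joins the partition of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its partition.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a partition is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels


# Toy data (hypothetical): two well-separated groups in two dimensions.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

Each iteration touches every point once per centroid, so the cost grows with the number of points, the number of clusters, and the number of iterations.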
9. Analysis of Clustering Algorithm for Big Data Mining
• Designing clustering algorithms for big data mining needs criteria
defined in terms of Volume, Velocity, and Variety, which increase the
efficiency of clustering.
• Volume-related criteria: the clustering algorithm must deal with huge dataset size, high
dimensionality, and noisy data.
• Variety-related criteria: the clustering algorithm must recognize the dataset
type and the cluster shape.
• Velocity-related criteria define the complexity, scalability, and performance of
the clustering algorithm when it is executed on a real dataset.
10. Summary of Clustering Algorithms Based on the Three Dimensions of Big Data

| Clustering category | Clustering algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity |
|---|---|---|---|---|---|---|---|---|
| Partition-based | K-Means | Large | No | High | Numerical | Convex | Medium | O(knt) |
| Partition-based | K-Medoids | Small | No | Low | Categorical | Convex | Low | O(k(n-k)^2) |
| Partition-based | PAM | Small | No | Low | Numerical | Convex | Low | O(k^3 * n^2) |
| Partition-based | CLARA | Large | No | Low | Numerical | Convex | High | O(ks^2 + k(n-k)) |
| Partition-based | CLARANS | Large | No | Low | Numerical | Convex | Medium | O(n^2) |
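To give a feel for what the time-complexity column means at big data scale, the short sketch below compares the K-Means and CLARANS growth terms for a dataset of 512,320 points, the size of the Power dataset used in the experiment later in this deck. The values of `k` and `t` are assumptions, since the slides do not report them, and the counts are rough operation estimates rather than measured run times.

```python
n = 512_320  # data points in the Power dataset used in the experiment
k = 5        # assumed number of clusters (not given in the slides)
t = 20       # assumed number of iterations (not given in the slides)

kmeans_ops = k * n * t   # growth term for K-Means, O(knt)
clarans_ops = n ** 2     # growth term for CLARANS, O(n^2)
ratio = clarans_ops / kmeans_ops

print(f"K-Means ~ {kmeans_ops:,} steps, CLARANS ~ {clarans_ops:,} steps")
print(f"ratio ~ {ratio:,.1f}")
```

Under these assumptions the quadratic term is several thousand times larger than the linear one, which is why the linear-time algorithms in these tables are the natural candidates for large datasets.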
11. Summary of Clustering Algorithms Based on the Three Dimensions of Big Data (2)

| Clustering category | Clustering algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity |
|---|---|---|---|---|---|---|---|---|
| Hierarchical-based | BIRCH | Large | No | Low | Numerical | Convex | High | O(n) |
| Hierarchical-based | CURE | Large | Yes | High | Numerical | Arbitrary | High | O(n^2 log n) |
| Hierarchical-based | ROCK | Small | Yes | Low | Numerical/Categorical | Arbitrary | Medium | O(n^2 log n) |
| Hierarchical-based | Chameleon | Small | No | Low | All data types | Arbitrary | High | O(n^2) |
| Hierarchical-based | ECHIDNA | Large | No | Low | Multivariate | Convex | High | O(nb(1 + log_b m)) |
12. Summary of Clustering Algorithms Based on the Three Dimensions of Big Data (3)

| Clustering category | Clustering algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity |
|---|---|---|---|---|---|---|---|---|
| Density-based | DBSCAN | Large | No | Low | Numerical | Arbitrary | Medium | O(n log n) |
| Density-based | OPTICS | Large | No | Low | Numerical | Arbitrary | Medium | O(n log n) |
| Density-based | Mean-shift | Small | No | Low | Numerical | Arbitrary | Low | O(kernel) |
| Density-based | DENCLUE | Large | Yes | High | Numerical | Arbitrary | Medium | O(log \|d\|) |
| Density-based | GDBSCAN | Large | No | Low | Numerical | Arbitrary | Medium | - |
13. Summary of Clustering Algorithms Based on the Three Dimensions of Big Data (4)

| Clustering category | Clustering algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity |
|---|---|---|---|---|---|---|---|---|
| Grid-based | STING | Large | Yes | Small | Spatial | Arbitrary | High | O(n) |
| Grid-based | CLIQUE | Small | Yes | Medium | Numerical | Convex | High | O(n + k^2) |
| Grid-based | WaveCluster | Large | No | High | Spatial | Arbitrary | Medium | O(n) |
| Grid-based | OptiGrid | Large | Yes | High | Spatial | Arbitrary | Medium | O(nd) to O(nd log n) |
| Grid-based | MAFIA | Large | No | High | Numerical | Arbitrary | High | O(c^p + pn) |
14. Summary of Clustering Algorithms Based on the Three Dimensions of Big Data (5)

| Clustering category | Clustering algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity |
|---|---|---|---|---|---|---|---|---|
| Model-based | COBWEB | Large | No | Medium | Numerical | Arbitrary | Medium | O(n^2) |
| Model-based | SLINK | Large | No | Medium | Numerical | Arbitrary | Medium | O(n^2) |
| Model-based | SOM | Small | Yes | Low | Multivariate | Arbitrary | Low | O(n^2 m) |
| Model-based | ART | Large | No | High | Multivariate | Arbitrary | High | O(type + layer) |
| Model-based | EM | Large | Yes | Low | Spatial | Convex |  | O(knp) |
15. Proposed MapReduce Framework for the Clustering Algorithm
• A clustering algorithm is suitable for big data mining if it scales to huge or high-dimensional datasets and
handles heterogeneous data with clusters of arbitrary shape.
• A clustering algorithm designed for big data mining needs the capability for parallel and distributed
computing. MapReduce is one of the programming models for implementing big data mining.
• MapReduce is inspired by the Map and Reduce functions.
• The Map function breaks a task down into phases and executes these phases in
parallel without the phases disturbing one another. It also assigns an appropriate key/value pair
to every data item.
• The Reduce function collects all Map results, combines the values that share the same key, and gives
the final result of the MapReduce computation. This reduces the computational time
for big data mining.
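The Map and Reduce roles described above can be sketched in plain Python. This is only an in-memory illustration, not Hadoop code. The helper names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and word counting is chosen only because it is a compact way to show key/value pairs being grouped by key.

```python
from collections import defaultdict


def map_fn(record):
    """Map: emit a (key, value) pair for every item in one input record."""
    for word in record.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: combine all values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential stand-in for a MapReduce job (illustrative helper)."""
    grouped = defaultdict(list)
    # Map phase: records are independent, so a real cluster could process
    # them in parallel; here they are processed one after another.
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)  # group values by key
    # Reduce phase: one call per key.
    return dict(reduce_fn(key, values) for key, values in grouped.items())


counts = run_mapreduce(["big data mining", "big data clustering"], map_fn, reduce_fn)
print(counts)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is where the reduction in computation time comes from.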
16. Proposed MapReduce Framework for the Clustering Algorithm (2)
Step 1: The big data set is transformed into <key, value> pairs, because MapReduce uses
HDFS for parallel and distributed computing.
Step 2: The Mapper function takes <key, value> pairs as input and executes in parallel
according to the existing clustering algorithm.
Step 3: The Combiner function combines all Map results, sorts every <value> according to
its <key>, and outputs them in <key, list(value)> format.
Step 4: The Reduce function takes the output of the Combiner function, maps one <key,
list(value)> to another <key, list(value)> according to the existing clustering algorithm, and
calculates the final cluster result.
Step 5: The Reduce function gives the accurate and unique number of clusters.
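As one possible reading of the steps above, the sketch below applies them to k-means in plain Python. It is an assumption-laden illustration rather than the authors' implementation. It runs sequentially in memory instead of on Hadoop and HDFS, and the names `mapper`, `combiner`, `reducer`, and `kmeans_mapreduce`, the input splits, and the starting centroids are all hypothetical.

```python
import math
from collections import defaultdict


def mapper(point, centroids):
    """Step 2: emit <key, value> = <index of nearest centroid, point>."""
    key = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
    return key, point


def combiner(mapped):
    """Step 3: gather Map results into <key, list(value)>, sorted by key."""
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reducer(key, values):
    """Step 4: turn <key, list(points)> into <key, new centroid>."""
    return key, tuple(sum(c) / len(values) for c in zip(*values))


def kmeans_mapreduce(splits, centroids, iterations=5):
    """Repeat the map, combine, and reduce steps for a fixed number of rounds."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Step 1 is implicit: every point of every split is one input record.
        # On a cluster each split would be mapped on a different node.
        mapped = [mapper(p, centroids) for split in splits for p in split]
        for key, values in combiner(mapped).items():
            _, centroids[key] = reducer(key, values)
    return centroids


# Hypothetical input: six 2-D points stored as two splits.
splits = [
    [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8)],
    [(0.9, 1.1), (8.2, 7.9), (7.8, 8.1)],
]
centroids = kmeans_mapreduce(splits, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
```

Only the mapper needs to see individual points. The reducer works on grouped values per key, so the expensive distance computations are the part that parallelizes across splits.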
18. Experimental Results
• K-Means, BIRCH, CLARA, CURE, DBSCAN, DENCLUE, and WaveCluster are
good clustering algorithms for big data mining because they fulfill the
criteria of big data clustering.
• Dataset: Power (512,320 real data points with 7 dimensions)
• System: Intel i3 processor, 4 GB RAM, 320 GB hard disk, Windows 7.
The table shows the execution time of the existing K-Means and the MapReduce-based K-Means
clustering algorithms.

| Algorithm | Execution time (seconds) |
|---|---|
| K-Means (existing) | 60 |
| K-Means (proposed, MapReduce-based) | 20 |
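Taking the two reported execution times at face value, the improvement can be expressed as a speedup ratio. The symbols S, T_existing, and T_MapReduce are labels introduced here for illustration and do not appear in the slides.

```latex
S = \frac{T_{\text{existing}}}{T_{\text{MapReduce}}}
  = \frac{60\ \text{s}}{20\ \text{s}}
  = 3
```

On this dataset and machine, the MapReduce-based K-Means therefore ran about three times faster than the existing K-Means.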