A Comprehensive Study of Clustering
Algorithms for Big Data Mining with
MapReduce Capability
Kamlesh Kumar Pandey
Research Scholar
Dept. of Computer Science & Applications
Dr. HariSingh Gour Vishwavidyalaya (A Central University), Sagar, M.P.
E-mail: kamleshamk@gmail.com
International Conference on Social Networking and Computational Intelligence
(Paper ID : 173)
Paper Presentation
on
Content
• Objectives
• Big Data
• Big Data Mining
• Clustering taxonomy
• Analysis of Clustering Algorithms for Big Data Mining
• Summarization of Clustering Algorithms Based on the Three Dimensions of Big Data
• Proposed MapReduce Framework for the Clustering Algorithm
• Experimental
Objectives
• The objective of this study is to identify traditional clustering algorithms
suitable for big data with respect to volume, variety, and velocity, and to
build a common executable framework for clustering algorithms using the
MapReduce approach for big data mining.
Big Data
• Technology is growing very fast. Organizations, industries, and individuals are all
moving towards the Internet of Things, cloud computing, wireless sensor networks, social
media, and the internet. These sources generate data that grows every second, minute, or
hour, reaching terabytes or petabytes in size.
• Diebold et al. (2000) were the first to discuss the term Big Data in a research
paper. These authors define Big Data as follows: if a data set is larger than a
gigabyte, it is known as Big Data.
• Doug Laney et al. (2001) gave the first proper definition of Big Data, based on
three characteristics: Volume, Variety, and Velocity. These are known as the
3 V’s of Big Data management. If data meet two of these basic characteristics at
the same time, they fall under Big Data.
• Gartner (2012), “Big data is high-volume, high-velocity and high-variety information
assets that demand cost-effective, innovative forms of information processing for
enhanced insight and decision making”
Big Data V’s
• At present, seven V’s are used to describe Big Data. The first three, Volume,
Variety, and Velocity, are its main characteristics. The additional four,
Veracity, Variability, Value, and Visualization, depend on the organization.
Big Data Mining
• Big Data Mining means fetching requested information, uncovering hidden
relationships or patterns, or extracting needed information or knowledge
from a dataset that meets the three V’s of Big Data and has higher
complexity.
Clustering
• Clustering is one of the approaches for analysing and discovering complex
relations, patterns, and underlying groups in unlabeled data. From a Big Data
perspective, a clustering algorithm must handle high volume, high variety,
and high velocity in a scalable way.
Clustering Taxonomy
• Partitioning-Based Clustering: These methods divide the dataset into
K partitions based on a distance function.
• Hierarchical Clustering: Large data are organized in a hierarchy based on
proximity, which makes relationships between data points easy to detect.
• Density-Based Clustering: These methods divide the dataset based on
regions of higher density in the data space.
• Grid-Based Clustering: The core idea is that the original data space is
converted into a grid structure, which defines the size used for clustering.
• Model-Based Clustering: These methods divide the dataset based on
models, such as mathematical models and statistical distributions.
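To make the partitioning-based category concrete, the sketch below shows Lloyd's K-Means in plain Python. It is only an illustration under stated assumptions: the function name `kmeans`, the fixed iteration count, and the random choice of initial centroids are choices made here, not details taken from the slides.

```python
import math
import random

def kmeans(points, k, iterations=10, seed=0):
    """Partition points (tuples of floats) into k clusters with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k input points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(centroids))
```

The distance function (`math.dist`, Euclidean) is what makes this a partitioning method in the sense described above. Swapping it for another metric changes the partitions but not the structure of the loop.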
Analysis of Clustering Algorithms for Big Data Mining
• Designing clustering algorithms for big data mining requires criteria
defined in terms of Volume, Velocity, and Variety that increase the
efficiency of the clustering.
• Volume-related criteria: the clustering algorithm must deal with huge,
high-dimensional, and noisy datasets.
• Variety-related criteria: the clustering algorithm must recognize the
dataset type and the cluster shape.
• Velocity-related criteria: these define the complexity, scalability, and
performance of the clustering algorithm when it runs on a real dataset.
Summarization of Clustering Algorithms Based on the Three Dimensions of Big Data
Clustering category: Partition-based clustering
Algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity
K-Means | Large | No | High | Numerical | Convex | Medium | O(knt)
K-Medoids | Small | No | Low | Categorical | Convex | Low | O(k(n-k)^2)
PAM | Small | No | Low | Numerical | Convex | Low | O(k^3 * n^2)
CLARA | Large | No | Low | Numerical | Convex | High | O(ks^2 + k(n-k))
CLARANS | Large | No | Low | Numerical | Convex | Medium | O(n^2)
Summarization of Clustering Algorithms Based on the Three Dimensions of Big Data (2)
Clustering category: Hierarchical clustering
Algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity
BIRCH | Large | No | Low | Numerical | Convex | High | O(n)
CURE | Large | Yes | High | Numerical | Arbitrary | High | O(n^2 log n)
ROCK | Small | Yes | Low | Numerical/Categorical | Arbitrary | Medium | O(n^2 log n)
Chameleon | Small | No | Low | All data types | Arbitrary | High | O(n^2)
ECHIDNA | Large | No | Low | Multivariate | Convex | High | O(nb(1 + log_b m))
Summarization of Clustering Algorithms Based on the Three Dimensions of Big Data (3)
Clustering category: Density-based clustering
Algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity
DBSCAN | Large | No | Low | Numerical | Arbitrary | Medium | O(n log n)
OPTICS | Large | No | Low | Numerical | Arbitrary | Medium | O(n log n)
Mean-shift | Small | No | Low | Numerical | Arbitrary | Low | O(kernel)
DENCLUE | Large | Yes | High | Numerical | Arbitrary | Medium | O(log |d|)
GDBSCAN | Large | No | Low | Numerical | Arbitrary | Medium | -
Summarization of Clustering Algorithms Based on the Three Dimensions of Big Data (4)
Clustering category: Grid-based clustering
Algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity
STING | Large | Yes | Small | Spatial | Arbitrary | High | O(n)
CLIQUE | Small | Yes | Medium | Numerical | Convex | High | O(n + k^2)
WaveCluster | Large | No | High | Spatial | Arbitrary | Medium | O(n)
OptiGrid | Large | Yes | High | Spatial | Arbitrary | Medium | O(nd) to O(nd log n)
MAFIA | Large | No | High | Numerical | Arbitrary | High | O(c^p + pn)
Summarization of Clustering Algorithms Based on the Three Dimensions of Big Data (5)
Clustering category: Model-based clustering
Algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity
COBWEB | Large | No | Medium | Numerical | Arbitrary | Medium | O(n^2)
SLINK | Large | No | Medium | Numerical | Arbitrary | Medium | O(n^2)
SOM | Small | Yes | Low | Multivariate | Arbitrary | Low | O(n^2 m)
ART | Large | No | High | Multivariate | Arbitrary | High | O(type + layer)
EM | Large | Yes | Low | Spatial | Convex | (not given) | O(knp)
Proposed MapReduce Framework for the Clustering Algorithm
• A clustering algorithm is suitable for big data mining if it scales to huge or high-dimensional
datasets and handles heterogeneous data with clusters of arbitrary shape.
• A clustering algorithm designed for big data mining should support parallel and distributed
computing. MapReduce is one of the programming models for implementing big data mining.
• MapReduce is inspired by the Map and Reduce functions.
• The Map function breaks a task down into independent sub-tasks and executes them in parallel
without any of them interfering with the others. It also assigns an appropriate key/value pair
to every data item.
• The Reduce function collects all Map results, combines the values that share the same key, and
gives the final result of the MapReduce task. This approach reduces the computation time
for big data mining.
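The Map and Reduce roles described above can be sketched in a few lines of Python. This is a single-process illustration only: `map_reduce`, `count_mapper`, and `count_reducer` are hypothetical names introduced here, and a real framework such as Hadoop would run the map calls on separate nodes rather than in a loop.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # map phase: emit (key, value) pairs
            groups[key].append(value)      # shuffle: collect values sharing a key
    # reduce phase: one reducer call per distinct key
    return {key: reducer(key, values) for key, values in groups.items()}

def count_mapper(line):
    """Emit (word, 1) for every word in one line of text."""
    for word in line.split():
        yield word.lower(), 1

def count_reducer(key, values):
    """Combine all counts emitted for the same word."""
    return sum(values)

result = map_reduce(["big data mining", "Big Data clustering"],
                    count_mapper, count_reducer)
print(result)
```

Because each `mapper` call depends only on its own record, the map phase can be split across machines without coordination. Only the grouping by key requires data to move between them.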
Proposed MapReduce Framework for the Clustering Algorithm (2)
Step 1: The big data set is transformed into <key, value> pairs, because MapReduce works on
HDFS with parallel and distributed computing.
Step 2: The Mapper function takes <key, value> pairs as input and executes in parallel
according to the existing clustering algorithm.
Step 3: The Combiner function combines all Map results, sorts every <value> according to
its <key>, and outputs them in <key, list(value)> format.
Step 4: The Reduce function takes the output of the Combiner function, maps one <key,
list(value)> to another <key, list(value)> according to the existing clustering algorithm, and
calculates the final cluster result.
Step 5: The Reduce function gives the accurate and unique number of clusters.
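As one possible reading of the steps above, the following sketch runs a single K-Means iteration in MapReduce style. It is a minimal in-memory illustration, not the authors' implementation: the function names, the list-of-lists `splits` standing in for HDFS blocks, and the combiner that pre-aggregates sums per split are all assumptions made here.

```python
import math
from collections import defaultdict

def mapper(point, centroids):
    """Step 2: emit <index of nearest centroid, (point, 1)> for one data point."""
    nearest = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
    return nearest, (point, 1)

def combiner(pairs):
    """Step 3 (assumed variant): per split, add up coordinates and counts per key."""
    partial = {}
    for key, (point, count) in pairs:
        if key in partial:
            total, n = partial[key]
            partial[key] = (tuple(a + b for a, b in zip(total, point)), n + count)
        else:
            partial[key] = (tuple(point), count)
    return list(partial.items())

def reducer(values):
    """Step 4: merge the partial sums for one key into a new centroid."""
    dims = len(values[0][0])
    total, n = [0.0] * dims, 0
    for partial_sum, count in values:
        for d in range(dims):
            total[d] += partial_sum[d]
        n += count
    return tuple(t / n for t in total)

def kmeans_iteration(splits, centroids):
    """One MapReduce round; a centroid with no assigned points is left unchanged."""
    grouped = defaultdict(list)
    for split in splits:  # each split stands in for one map task
        for key, value in combiner([mapper(p, centroids) for p in split]):
            grouped[key].append(value)
    new_centroids = list(centroids)
    for key, values in grouped.items():
        new_centroids[key] = reducer(values)
    return new_centroids

splits = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
updated = kmeans_iteration(splits, [(1.0, 1.0), (9.0, 9.0)])
print(updated)
```

A driver would call `kmeans_iteration` repeatedly, feeding each result back in as the next set of centroids, until the centroids stop changing.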
Experimental
• K-Means, BIRCH, CLARA, CURE, DBSCAN, DENCLUE, and WaveCluster are
good clustering algorithms for big data mining because they fulfil the
criteria of big data clustering.
• Dataset: Power (512,320 real data points with 7 dimensions)
• System: Intel i3 processor, 4 GB RAM, 320 GB hard disk, Windows 7
• The table below shows the execution time of the existing K-Means and the
MapReduce-based K-Means clustering algorithm.
Algorithm | Execution time (seconds)
K-Means (existing) | 60
K-Means (proposed, MapReduce-based) | 20
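The two execution times reported above imply a speedup that can be worked out directly. The snippet below is only arithmetic on those two figures; the variable names are chosen here for illustration.

```python
baseline_s = 60   # K-Means (existing), seconds, as reported above
mapreduce_s = 20  # K-Means (proposed, MapReduce-based), seconds, as reported above

speedup = baseline_s / mapreduce_s
reduction_pct = (baseline_s - mapreduce_s) / baseline_s * 100
print(f"speedup: {speedup:.1f}x, time reduction: {reduction_pct:.1f}%")
```

On this single dataset and machine, the MapReduce-based run is three times faster, which is a reduction of about two thirds in execution time.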
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapReduce Capability

More Related Content

What's hot

A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithmijsrd.com
 
Clustering Using Shared Reference Points Algorithm Based On a Sound Data Model
Clustering Using Shared Reference Points Algorithm Based On a Sound Data ModelClustering Using Shared Reference Points Algorithm Based On a Sound Data Model
Clustering Using Shared Reference Points Algorithm Based On a Sound Data ModelWaqas Tariq
 
Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010BOSC 2010
 
Skytree big data london meetup - may 2013
Skytree   big data london meetup - may 2013Skytree   big data london meetup - may 2013
Skytree big data london meetup - may 2013bigdatalondon
 
K-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log DataK-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log Dataidescitation
 
IEEE Datamining 2016 Title and Abstract
IEEE  Datamining 2016 Title and AbstractIEEE  Datamining 2016 Title and Abstract
IEEE Datamining 2016 Title and Abstracttsysglobalsolutions
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAdnan Akhter
 
Query evaluation over network of data aggregators
Query evaluation over network of data aggregatorsQuery evaluation over network of data aggregators
Query evaluation over network of data aggregatorsIAEME Publication
 
EFFICIENT R-TREE BASED INDEXING SCHEME FOR SERVER-CENTRIC CLOUD STORAGE SYSTEM
EFFICIENT R-TREE BASED INDEXING SCHEME FOR SERVER-CENTRIC CLOUD STORAGE SYSTEMEFFICIENT R-TREE BASED INDEXING SCHEME FOR SERVER-CENTRIC CLOUD STORAGE SYSTEM
EFFICIENT R-TREE BASED INDEXING SCHEME FOR SERVER-CENTRIC CLOUD STORAGE SYSTEMNexgen Technology
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureIOSR Journals
 
Interactive Latency in Big Data Visualization
Interactive Latency in Big Data VisualizationInteractive Latency in Big Data Visualization
Interactive Latency in Big Data Visualizationbigdataviz_bay
 
Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016ijcsbi
 
On the Support of a Similarity-Enabled Relational Database Management System ...
On the Support of a Similarity-Enabled Relational Database Management System ...On the Support of a Similarity-Enabled Relational Database Management System ...
On the Support of a Similarity-Enabled Relational Database Management System ...Universidade de São Paulo
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning ClusteringMapR Technologies
 
Data stream mining techniques: a review
Data stream mining techniques: a reviewData stream mining techniques: a review
Data stream mining techniques: a reviewTELKOMNIKA JOURNAL
 
Distributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasetsDistributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasetsBita Kazemi
 

What's hot (20)

A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithm
 
Clustering Using Shared Reference Points Algorithm Based On a Sound Data Model
Clustering Using Shared Reference Points Algorithm Based On a Sound Data ModelClustering Using Shared Reference Points Algorithm Based On a Sound Data Model
Clustering Using Shared Reference Points Algorithm Based On a Sound Data Model
 
Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010
 
Skytree big data london meetup - may 2013
Skytree   big data london meetup - may 2013Skytree   big data london meetup - may 2013
Skytree big data london meetup - may 2013
 
K-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log DataK-means Clustering Method for the Analysis of Log Data
K-means Clustering Method for the Analysis of Log Data
 
IEEE Datamining 2016 Title and Abstract
IEEE  Datamining 2016 Title and AbstractIEEE  Datamining 2016 Title and Abstract
IEEE Datamining 2016 Title and Abstract
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning Techniques
 
Query evaluation over network of data aggregators
Query evaluation over network of data aggregatorsQuery evaluation over network of data aggregators
Query evaluation over network of data aggregators
 
EFFICIENT R-TREE BASED INDEXING SCHEME FOR SERVER-CENTRIC CLOUD STORAGE SYSTEM
EFFICIENT R-TREE BASED INDEXING SCHEME FOR SERVER-CENTRIC CLOUD STORAGE SYSTEMEFFICIENT R-TREE BASED INDEXING SCHEME FOR SERVER-CENTRIC CLOUD STORAGE SYSTEM
EFFICIENT R-TREE BASED INDEXING SCHEME FOR SERVER-CENTRIC CLOUD STORAGE SYSTEM
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity Measure
 
Clustering
ClusteringClustering
Clustering
 
Interactive Latency in Big Data Visualization
Interactive Latency in Big Data VisualizationInteractive Latency in Big Data Visualization
Interactive Latency in Big Data Visualization
 
Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016
 
On the Support of a Similarity-Enabled Relational Database Management System ...
On the Support of a Similarity-Enabled Relational Database Management System ...On the Support of a Similarity-Enabled Relational Database Management System ...
On the Support of a Similarity-Enabled Relational Database Management System ...
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning Clustering
 
Data stream mining techniques: a review
Data stream mining techniques: a reviewData stream mining techniques: a review
Data stream mining techniques: a review
 
Clustering
ClusteringClustering
Clustering
 
Distributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasetsDistributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasets
 

Similar to A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapReduce Capability

K- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxK- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxSaiPragnaKancheti
 
K- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxK- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxSaiPragnaKancheti
 
A fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming dataA fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming dataAlexander Decker
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data ScientistsRichard Garris
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniquesPoonam Kshirsagar
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Editor IJARCET
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Editor IJARCET
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Alexander Decker
 
A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce IJECEIAES
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETScsandit
 
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
 ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATANexgen Technology
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmIRJET Journal
 
Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspectiveপল্লব রায়
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingIRJET Journal
 
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화NAVER Engineering
 
A frame work for clustering time evolving data
A frame work for clustering time evolving dataA frame work for clustering time evolving data
A frame work for clustering time evolving dataiaemedu
 
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Experfy
 
An Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional DataAn Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional DataIJSTA
 

Similar to A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapReduce Capability (20)

Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
 
K- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxK- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptx
 
K- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxK- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptx
 
A fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming dataA fuzzy clustering algorithm for high dimensional streaming data
A fuzzy clustering algorithm for high dimensional streaming data
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniques
 
Big Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy GaussianBig Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy Gaussian
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)
 
A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce A Novel Approach for Clustering Big Data based on MapReduce
A Novel Approach for Clustering Big Data based on MapReduce
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
 
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
 ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering Algorithm
 
Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspective
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
 
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
 
A frame work for clustering time evolving data
A frame work for clustering time evolving dataA frame work for clustering time evolving data
A frame work for clustering time evolving data
 
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Unsupervised Learning: Clustering
Unsupervised Learning: Clustering
 
An Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional DataAn Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional Data
 

A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapReduce Capability

  • 1. Paper presentation on "A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapReduce Capability"
      • Kamlesh Kumar Pandey, Research Scholar, Dept. of Computer Science & Applications, Dr. HariSingh Gour Vishwavidyalaya (A Central University), Sagar, M.P.
      • E-mail: kamleshamk@gmail.com
      • International Conference on Social Networking and Computational Intelligence (Paper ID: 173)
  • 2. Content
      • Objectives
      • Big Data
      • Big Data Mining
      • Clustering Taxonomy
      • Analysis of Clustering Algorithms for Big Data Mining
      • Summary of Clustering Algorithms Based on the Three Dimensions of Big Data
      • Proposed MapReduce Framework for Clustering Algorithms
      • Experimental Results
  • 3. Objectives
      • The objective of this study is to identify traditional clustering algorithms suited to big data with respect to volume, variety, and velocity, and to build a common executable framework for clustering algorithms using the MapReduce approach to big data mining.
  • 4. Big Data
      • Technology is growing very fast. Every organization, industry, and person is moving towards the Internet of Things, cloud computing, wireless sensor networks, social media, and the internet. These sources generate data that grows every second, minute, or hour, reaching terabytes or petabytes in size.
      • Diebold et al. (2000) were the first to discuss the term Big Data in a research paper. These authors define Big Data as any data set larger than a gigabyte.
      • Doug Laney (2001) was the first to give a proper definition of Big Data. He identified three characteristics, Volume, Variety, and Velocity, known as the 3 V's of Big Data management. Traditional data that meets two of these basic characteristics at the same time falls under Big Data.
      • Gartner (2012): "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
  • 5. Big Data V's
      • At present, seven V's are used to describe Big Data. The first three, Volume, Variety, and Velocity, are its main characteristics. The additional four, Veracity, Variability, Value, and Visualization, depend on the organization.
  • 6. Big Data Mining
      • Big Data Mining is the process of fetching requested information, uncovering hidden relationships or patterns, and extracting needed information or knowledge from datasets that meet the three V's of Big Data and have higher complexity.
  • 7. Clustering
      • Clustering is one of the approaches for analyzing and discovering complex relations, patterns, and data in the form of underlying groups of unlabeled objects. From a Big Data perspective, a clustering algorithm must handle high volume, high variety, and high velocity with scalability.
  • 8. Clustering Taxonomy
      • Partitioning-Based Clustering: These methods divide the dataset into K partitions based on a distance function.
      • Hierarchical Clustering: Large data are organized hierarchically based on a measure of proximity, which makes relationships between data points easy to detect.
      • Density-Based Clustering: These methods divide the dataset based on regions of higher density in the data space.
      • Grid-Based Clustering: The core idea is that the original data space is converted into a grid format, which defines the size for clustering.
      • Model-Based Clustering: These methods divide the dataset based on models such as mathematical and statistical distributions.
  • 9. Analysis of Clustering Algorithms for Big Data Mining
      • The design of clustering algorithms for big data mining needs criteria defined in terms of Volume, Velocity, and Variety; meeting them increases the efficiency of the clustering.
      • Volume-related criteria: the algorithm must deal with huge, high-dimensional, and noisy datasets.
      • Variety-related criteria: the algorithm must recognize the dataset type and the cluster shape.
      • Velocity-related criteria: these define the complexity, scalability, and performance of the clustering algorithm when it runs on real datasets.
  • 10. Summary of Clustering Algorithms Based on the Three Dimensions of Big Data
      Column groups: Volume (dataset size, high-dimensional data, handling noisy data); Variety (dataset type, cluster shape); Velocity (scalability, time complexity).
      Category: Partition-based clustering
      Algorithm | Dataset size | High-dimensional data | Handling noisy data | Dataset type | Cluster shape | Scalability | Time complexity
      K-Means | Large | No | High | Numerical | Convex | Medium | O(knt)
      K-Medoids | Small | No | Low | Categorical | Convex | Low | O(k(n-k)^2)
      PAM | Small | No | Low | Numerical | Convex | Low | O(k^3 * n^2)
      CLARA | Large | No | Low | Numerical | Convex | High | O(ks^2 + k(n-k))
      CLARANS | Large | No | Low | Numerical | Convex | Medium | O(n^2)
  • 11. Summary of Clustering Algorithms Based on the Three Dimensions of Big Data (2)
      Column groups: Volume (dataset size, high-dimensional data, handling noisy data); Variety (dataset type, cluster shape); Velocity (scalability, time complexity).
      Category: Hierarchical clustering
      Algorithm | Dataset size | High-dimensional data | Handling noisy data | Dataset type | Cluster shape | Scalability | Time complexity
      BIRCH | Large | No | Low | Numerical | Convex | High | O(n)
      CURE | Large | Yes | High | Numerical | Arbitrary | High | O(n^2 log n)
      ROCK | Small | Yes | Low | Numerical/Categorical | Arbitrary | Medium | O(n^2 log n)
      Chameleon | Small | No | Low | All types of data | Arbitrary | High | O(n^2)
      ECHIDNA | Large | No | Low | Multivariate | Convex | High | O(nb(1 + log_b m))
  • 12. Summary of Clustering Algorithms Based on the Three Dimensions of Big Data (3)
      Column groups: Volume (dataset size, high-dimensional data, handling noisy data); Variety (dataset type, cluster shape); Velocity (scalability, time complexity).
      Category: Density-based clustering
      Algorithm | Dataset size | High-dimensional data | Handling noisy data | Dataset type | Cluster shape | Scalability | Time complexity
      DBSCAN | Large | No | Low | Numerical | Arbitrary | Medium | O(n log n)
      OPTICS | Large | No | Low | Numerical | Arbitrary | Medium | O(n log n)
      Mean-shift | Small | No | Low | Numerical | Arbitrary | Low | O(kernel)
      DENCLUE | Large | Yes | High | Numerical | Arbitrary | Medium | O(log |d|)
      GDBSCAN | Large | No | Low | Numerical | Arbitrary | Medium | (not given)
  • 13. Summary of Clustering Algorithms Based on the Three Dimensions of Big Data (4)
      Column groups: Volume (dataset size, high-dimensional data, handling noisy data); Variety (dataset type, cluster shape); Velocity (scalability, time complexity).
      Category: Grid-based clustering
      Algorithm | Dataset size | High-dimensional data | Handling noisy data | Dataset type | Cluster shape | Scalability | Time complexity
      STING | Large | Yes | Small | Spatial | Arbitrary | High | O(n)
      CLIQUE | Small | Yes | Medium | Numerical | Convex | High | O(n + k^2)
      WaveCluster | Large | No | High | Spatial | Arbitrary | Medium | O(n)
      OptiGrid | Large | Yes | High | Spatial | Arbitrary | Medium | O(nd) to O(nd log n)
      MAFIA | Large | No | High | Numerical | Arbitrary | High | O(c^p + pn)
  • 14. Summary of Clustering Algorithms Based on the Three Dimensions of Big Data (5)
      Column groups: Volume (dataset size, high-dimensional data, handling noisy data); Variety (dataset type, cluster shape); Velocity (scalability, time complexity).
      Category: Model-based clustering
      Algorithm | Dataset size | High-dimensional data | Handling noisy data | Dataset type | Cluster shape | Scalability | Time complexity
      COBWEB | Large | No | Medium | Numerical | Arbitrary | Medium | O(n^2)
      SLINK | Large | No | Medium | Numerical | Arbitrary | Medium | O(n^2)
      SOM | Small | Yes | Low | Multivariate | Arbitrary | Low | O(n^2 m)
      ART | Large | No | High | Multivariate | Arbitrary | High | O(type + layer)
      EM | Large | Yes | Low | Spatial | Convex | (not given) | O(knp)
  • 15. Proposed MapReduce Framework for Clustering Algorithms
      • A clustering algorithm is suitable for big data mining if it works on huge or high-dimensional datasets with scalability, and on heterogeneous data with arbitrarily shaped clusters.
      • A clustering algorithm designed for big data mining needs the capability for parallel and distributed computing. MapReduce is one of the programming models for implementing big data mining.
      • MapReduce is inspired by the Map and Reduce functions.
      • The Map function breaks a task down into phases and executes them in parallel without one phase disturbing another. It also assigns an appropriate key/value pair to every data item.
      • The Reduce function collects all Map results, combines all values that share the same key, and gives the final result of the MapReduce computation. This reduces the computational time for big data mining.
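The Map and Reduce roles described on this slide can be illustrated with a minimal single-process sketch. It only emulates the programming model in plain Python and is not Hadoop code; the names `map_fn`, `reduce_fn`, and `run_mapreduce`, and the toy sensor records, are hypothetical and not taken from the presentation.

```python
from collections import defaultdict

def map_fn(record):
    """Map: turn one input record into a (key, value) pair."""
    sensor_id, reading = record
    return sensor_id, reading

def reduce_fn(key, values):
    """Reduce: combine every value that shares a key into one result."""
    return key, sum(values) / len(values)

def run_mapreduce(records):
    # Map phase: records are independent, so a real cluster could
    # process them in parallel on different nodes.
    mapped = [map_fn(r) for r in records]
    # Shuffle phase: group the mapped values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

records = [("a", 1.0), ("b", 4.0), ("a", 3.0)]
result = run_mapreduce(records)
print(result)
```

Here the key is the sensor name and the reduce step averages the readings per key; in a clustering setting the key would typically identify a cluster instead.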
  • 16. Proposed MapReduce Framework for Clustering Algorithms (2)
      • Step 1: The big data set is transformed into <key, value> pairs, because MapReduce works on HDFS with parallel and distributed computing.
      • Step 2: The Mapper function takes the <key, value> pairs as input and processes them in parallel according to the existing clustering algorithm.
      • Step 3: The Combiner function combines all Map results, sorts every <value> by its <key>, and outputs them in <key, list(value)> format.
      • Step 4: The Reduce function takes the output of the Combiner function, maps each <key, list(value)> to another <key, list(value)> according to the existing clustering algorithm, and calculates the final cluster result.
      • Step 5: The Reduce function gives the accurate and unique number of clusters.
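As one possible reading of Steps 1 to 5, the sketch below plugs K-Means into the mapper/combiner/reducer pattern. It is a single-process Python emulation under stated assumptions, not the author's implementation: using the nearest-centroid index as the key, the squared Euclidean distance, the stopping rule, and all function names are hypothetical choices made for illustration.

```python
from collections import defaultdict

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mapper(point, centroids):
    # Steps 1-2: emit <key, value> = <index of nearest centroid, point>.
    key = min(range(len(centroids)),
              key=lambda i: squared_distance(point, centroids[i]))
    return key, point

def combiner(pairs):
    # Step 3: group values by key into <key, list(value)>, sorted by key.
    grouped = defaultdict(list)
    for key, point in pairs:
        grouped[key].append(point)
    return dict(sorted(grouped.items()))

def reducer(grouped, centroids):
    # Step 4: turn each <key, list(value)> into a new centroid (the mean).
    new_centroids = list(centroids)  # an empty cluster keeps its centroid
    for key, points in grouped.items():
        new_centroids[key] = tuple(sum(dim) / len(points)
                                   for dim in zip(*points))
    return new_centroids

def kmeans_mapreduce(points, centroids, max_iterations=10):
    for _ in range(max_iterations):
        # Each mapper call is independent and could run in parallel.
        pairs = [mapper(p, centroids) for p in points]
        updated = reducer(combiner(pairs), centroids)
        if updated == centroids:  # Step 5: clusters are stable
            break
        centroids = updated
    return centroids

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final_centroids = kmeans_mapreduce(points, [(0.0, 1.0), (9.0, 9.0)])
print(final_centroids)
```

On a real cluster the mapper calls would be spread across nodes and the grouping would be handled by the framework's shuffle; the loop above only mimics that flow on toy data.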
  • 17. Proposed MapReduce Framework for the Clustering Algorithm(3)
  • 18. Experimental Results
      • K-Means, BIRCH, CLARA, CURE, DBSCAN, DENCLUE, and WaveCluster are good clustering algorithms for big data mining because they fulfill the criteria for big data clustering.
      • Dataset: Power (512,320 real data points with 7 dimensions)
      • System: Intel i3 processor, 4 GB RAM, 320 GB hard disk, Windows 7
      • The table shows the execution time of the existing K-Means and the MapReduce-based K-Means clustering algorithm.
      Algorithm | Execution time (seconds)
      K-Means (existing) | 60
      K-Means (proposed, MapReduce-based) | 20
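The two reported timings can be restated as a speedup and a relative time reduction. The short calculation below uses only the 60 s and 20 s figures from the slide; the variable names are illustrative.

```python
baseline_seconds = 60   # existing K-Means, as reported on the slide
mapreduce_seconds = 20  # proposed MapReduce-based K-Means, as reported

speedup = baseline_seconds / mapreduce_seconds
reduction_pct = 100 * (baseline_seconds - mapreduce_seconds) / baseline_seconds

print(f"speedup: {speedup:.1f}x, time reduction: {reduction_pct:.1f}%")
# prints "speedup: 3.0x, time reduction: 66.7%"
```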
  • 19. References [1]. Sivarajah U. and Kamal M.M.: Critical analysis of Big Data challenges and analytical methods, Journal of Business Research (Elsevier), Vol 70, pp 263-286, DOI: 10.1016/j.jbusres.2016.08.001, (2017). [2]. Wasastjerna M.C.: The role of big data and digital privacy in merger review. European Competition Journal, vol. 14, no. 2-3, pp. 417- 444, DOI: 10.1080/17441056.2018.1533364, (2018). [3]. Gandomi A., and H. M.: Beyond the hype Big data concepts methods and analytics. I.J. of Info. Man., vol. 35, no. 2, pp. 137 -144, DOI: 10.1016/j.ijinfomgt.2014.10.007, (2015). [4]. Pandey K.K.: Mining on Relationship in Big Data era Using Apriori Algorithm, Proc. Of NCDAMLS, pp. 55-60, ISBN: 978-93-5291- 457-9, (2018). [5]. Che D., P. Z., and S.M., and From Big Data to Big Data Mining Challenges Issues and Opportunities. LNCS, vol. 7827, pp. 1-12 , doi 10.1007/978-3-642-40270-8_1, (2013). [6]. Li N., Zeng L., Qing H., and Zhongzhi S.: Parallel Implementation of Apriori Algorithm Based on MapReduce. Proc of 13th IEEE ACIS International Conference on SEAIPDC, DOI: 10.1109/SNPD.2012.31, (2017). [7]. Oussous A., Benjelloun F.Z., Lahcen A.A., and Belfkih S.: Big Data technologies: A survey, Journal of King Saud University – Computer and Information Sciences, Vol-30, pp 431–448, DOI: 10.1016/j.jksuci.2017.06.001, (2018). [8]. Chen M., M.S., and L.Y.: Big Data A Survey. Mob. Netw. Appl., vol. 19, no. 2, pp. 171–209, doi 10.1007/s11036-013-0489-0, (2014). [9]. Gole S., and Tidke B.: A survey of Big Data in social media using data mining techniques. Proc. of IEEE ICACCS, doi 10.1109/ICACCS.2015.7324059, (2015). [10]. Elgendy N., and E. A.: Big Data Analytics A Literature Review Paper. LNAI, vol. 8557, pp. 214–227, doi 10.1007/978-3-319-08976- 8_16, (2014). [11]. Ozkose H., Ari E.S., and Gencer C.: Yesterday, Today and Tomorrow of Big Data, Procedia - Social and Behavioral Sciences, vol. 195, pp. 1042-1050, doi 10.1016/j.sbspro.2015.06.147, (2015).
  • 20. References [12]. Kaur P. and Kaur K., :Comparative Study of Techniquesand Issues in Data Clustering, Lecture Notes in Networks and Systems, Vol-8, pp 1-8, DOI 10.1007/978-981-10-3818-1_1,(2017). [13]. Nagpal A., Jatain A. and Gaur D.:Review based on Data Clustering Algorithms, Proc. of IEEE Conference on ICT, published by IEEE Xplore,pp 298-303, DOI: 10.1109/CICT.2013.6558109, (2013). [14]. Berkhin P.,:Survey of Clustering Data Mining Techniques, M. (eds) Grouping Multidimensional Data, pp. 25-71, doi 10.1007/3-540- 28349-8_2, (2006). [15]. Chen W.,OliverioJ.,Kim H.O, and Shen J., The Modeling and Simulation of Data Clustering Algorithms in Data Mining with Big Data, Journal of Industrial Integration and Management: Innovation and Entrepreneurship, DOI:10.1142/S2424862218500173,(2018). [16]. Xu R.,and Wunsch D. : Survey of Clustering Algorithms, IEEE TRANSACTIONS ON NEURAL NETWORKS, Vol. 16, Issue 3, pp 645-678, (2005). [17]. Xu D., and Tian Y.: A Comprehensive Survey of Clustering Algorithms, Annals of Data Science, Vol 2, Issue 2, pp 165–193,DOI: 10.1007/s40745-015-0040-1,(2015). [18]. Pandove D.and Goel S.: A Comprehensive Study on ClusteringApproaches for Big Data Mining, Proc. Of IEEE ICECS, pp 1333- 1338,(2015). [19]. Fahad A; Alshatri N, Tari Z, Alamri A, Khalil I, AND ZomayaA.Y.,:A Survey of Clustering Algorithms for BigData: Taxonomy and Empirical Analysis, IEEE Transactions on Emerging Topics in Computing, Vol 2, Issue 3,pp 267 - 279, DOI: 10.1109/TETC.2014.2330519 , (2014). [20]. Jain A. K., Murty M. N. and Flynn P. J., Data clustering: a review, ACM Computing Surveys, Vol 31,Issue 3, pp 264-323, DOI: 10.1145/331499.331504,(1999). [21]. Shirkhorshidi A.S., Aghabozorgi S, Wah T.Y. and HerawanT.:Big Data Clustering: A Review, published by Lecture Notes in Computer Science(Springer), Vol 8583, DOI: 10.1007/978-3-319-09156-3_49,(2014).
  • 21. References [22]. Berkhin P., A Survey of Clustering Data Mining Techniques, Grouping Multidimensional Data (Springer), DOI: 10.1007/3-540-28349- 8_2 (2006). [23]. Pujari A.K, Rajesh K. & Reddy D.S.: Clustering Techniques in Data Mining—A Survey, IETE Journal of Research, vol 47, Issue 1-2, pp 19-28, DOI: 10.1080/03772063.2001.11416199,(2001). [24]. Dave M., and Gianey R. : Different Clustering Algorithms for Big Data Analytics: A Review, Proc of IEEE SMART, pp 328-333,(2016). [25]. Macqueen J.: Some methods for classification and analysis of multivariate observations. Proceedings 5th Berkeley Symposium on Mathematical Statistics Probability, Vol 1,pp 281–297,(1967). [26]. Emani C.K., Cullot N. and Nicolle C: Understandable Big Data: A survey, Computer Science Review, Vol-17, pp 70-81, DOI: dx.doi.org/10.1016/j.cosrev.2015.05.002, (2015).