Which NoSQL Database to Combine with Spark for Real-Time Big Data Analytics?
Abstract— Big Data is an evolution of Business Intelligence (BI). Whereas traditional BI relies on data warehouses limited in size (a few terabytes) and struggles with unstructured data and real-time analysis, the Big Data era opens a new technological period, offering advanced architectures and infrastructures that allow sophisticated analyses of these new data integrated into the ecosystem of the business. In this article, we present the results of an experimental study on the performance of the leading Big Analytics framework (Spark) with the most popular NoSQL databases, MongoDB and Hadoop. The objective of this study is to determine the software combination that allows sophisticated analysis in real time.
Keywords- big data analytics; NoSQL databases; Apache Spark; Hadoop; MongoDB; performance.
I. INTRODUCTION
For companies, the Big Data phenomenon covers two realities: on the one hand, the continuous explosion of data; on the other, the capacity to process and analyze this great mass of data profitably. With Big Data, organizations can now manage and process massive data to extract value, decide, and act in real time.
NoSQL databases were developed to provide a set of new data management features while overcoming some limitations of currently used relational databases [1]. NoSQL databases are not relational and do not require a fixed model or structure for data storage, which facilitates both storage and search. In addition, they allow horizontal scalability: administrators can increase the number of server machines to reduce the overall system load, and new nodes are integrated and operated automatically by the system. Horizontal scalability reduces query response time at low cost.
Alongside NoSQL databases (Hadoop, MongoDB, Cassandra, HBase, Redis, Riak, etc.), a new profession has appeared: the data scientist. Data science is the extraction of knowledge from data sets [2, 3]. It employs techniques and theories derived from several broader areas of mathematics, mainly statistics, probabilistic models, and machine learning. Thus, to develop algorithms in a distributed environment, the analyst must master Big Data analytics tools (Mahout, MapReduce, Spark, and Storm) and learn the syntax of functional languages such as Scala, Erlang, or Clojure.
Big Data analytics therefore favors a return to favor of functional languages and of robust methods that are easily distributable (MapReduce) over thousands of nodes: decision trees [4, 5], random forests [6], k-means [7], and the Naive Bayes classifier [8].
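To illustrate why a method such as k-means distributes so naturally, here is a minimal single-process sketch in plain Python: the assignment step is a "map" over points and the centroid update is a "reduce" per cluster. The 1-D data and function name are invented for illustration; this is not tied to any of the frameworks cited above.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Toy 1-D k-means; illustrative only."""
    for _ in range(iterations):
        # Map step: assign each point to its nearest centroid.
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Reduce step: each centroid becomes the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in clusters.items()]
    return centroids

centroids = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
# centroids converge to roughly [1.0, 9.0]
```

In a distributed setting, the assignment of points and the partial sums per cluster can be computed independently on each node, which is exactly what makes the method MapReduce-friendly.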
For storing collected data, any NoSQL database can fulfill this role. However, the need to analyze this data pushes us to choose the database carefully, especially since in the field of Big Data the analytic part becomes more and more important. For advanced, real-time analytics, the best framework available is Apache Spark [9, 10]. In its official distribution, Spark uses the Hadoop HDFS file system.
In a previous study [11] based on a multicriteria analysis method, the MongoDB system obtained the highest score. Today, this result is confirmed: the system has become very popular [12]. According to a white paper [13] published by MongoDB, the combination of the fastest analysis engine (Spark) with the fastest-growing database (MongoDB) allows companies to easily perform reliable real-time analysis. This led us to compare Spark's performance with the most popular NoSQL databases, MongoDB and Hadoop. In this article, we present and discuss the results of our experimental study and determine the software combination that allows sophisticated analyses in real time.
This paper is organized as follows: Section II presents Big Data analytics on Hadoop and MongoDB. Section III presents the results of an experimental study on the performance of the Spark framework with MongoDB and Hadoop. Section IV concludes.
II. BIG DATA ANALYTICS
In this part, we will introduce the data analysis technologies
used on Hadoop and MongoDB.
A. Big Data Analytics on Hadoop
The first data analysis solution integrated with Hadoop is the MapReduce framework. MapReduce is not in itself a database component. This distributed information processing approach takes an input list and produces one in return. It can be used in many situations and is well suited to distributed processing needs and decision-making processes.
Omar HAJOUI, Mohamed TALEA
LTI Laboratory, Faculty of Science Ben M’Sik
Hassan II University, Casablanca, Morocco
{hajouio, taleamohamed}@yahoo.fr
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 16, No. 1, January 2018
43 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
MapReduce was defined in 2004 in an article written by Google. The principle is simple: to distribute a computation, Google imagined a two-step operation: first, operations are assigned to each machine (Map), then processing is followed by a grouping of results (Reduce). The needs that gave birth to MapReduce at Google were twofold: how to handle gigantic volumes of unstructured data (web pages analyzed to feed the Google search engine, or the logs produced by its indexing engines, for example), and how to derive results from calculations, aggregates, and summaries; in short, from analysis.
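The two-step principle above can be sketched in a few lines of Python: map emits (key, 1) pairs, a shuffle groups them by key, and reduce sums each group. This is a single-process illustration only; a real Hadoop job would spread these phases across machines.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    # Emit one (word, 1) pair per word.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between Map and Reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts of each group.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

docs = ["big data", "big analytics"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
# counts == {'big': 2, 'data': 1, 'analytics': 1}
```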
The free reference implementation of MapReduce is called Hadoop, a system developed in Java by a team led by Doug Cutting for the purposes of the Nutch distributed indexing engine at Yahoo!. Hadoop directly implements Google's paper on MapReduce and bases its distributed storage on HDFS (Hadoop Distributed File System), which implements Google's paper on GFS (Google File System). Later, the Hadoop MapReduce framework (running on the YARN resource manager) was adopted by several NoSQL databases such as HBase and Cassandra.
Facebook then developed the HQL language (the Hive query language) on Hive, a language close to SQL used to query data stored in HDFS. Another language, called Pig and developed by Yahoo!, is similar in syntax to Perl and aims at the same goals as Hive. In addition, Cloudera, another Hadoop distribution, integrates the Impala query engine; analysts and data scientists favor this latter option to perform analysis on data stored in Hadoop via SQL or business intelligence tools. Finally, the Mahout project provides algorithm implementations for business intelligence, for example machine-learning algorithms such as k-means and random forests.
B. Big Data Analytics on MongoDB
MongoDB is an open-source document-oriented database developed in C++ and designed for exceptionally high performance. Data is stored and queried in BSON, a format similar to JSON. It has dynamic and flexible schemas, making data integration easier and faster than with traditional databases. Unlike NoSQL databases that offer only basic queries, MongoDB lets developers use its native query and data mining capabilities to generate many classes of analysis before having to adopt dedicated frameworks such as Spark or MapReduce for more specialized tasks.
Several organizations, including McAfee, Salesforce, Buzzfeed, Amadeus, KPMG and many others, rely on MongoDB's powerful query language, aggregations and indexing to generate real-time analytics directly on their operational data. MongoDB users have access to a wide range of query, projection and update operators that support real-time analytic queries on operational data:
• The MongoDB Aggregation Pipeline is similar in concept to the SQL GROUP BY statement, enabling users to generate aggregations of values returned by the query (e.g., count, minimum, maximum, average, intersections) that can be used to power analytics dashboards and visualizations.
• Range queries return results based on values defined as inequalities (e.g., greater than, less than or equal to, between).
• Search queries return results in relevance order and in faceted groups, based on text arguments using Boolean operators (e.g., AND, OR, NOT), and through bucketing, grouping and counting of query results.
• MongoDB provides native support for MapReduce, allowing complex JavaScript processing. Multiple MapReduce jobs can run simultaneously on the same server and on sharded collections.
• JOINs, graph queries, key-value queries, and more.
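To make the GROUP BY analogy concrete, here is a plain-Python emulation of what a $group stage computes (count, min, max, average per key). The field names and records are invented for illustration; in MongoDB itself the same result would be expressed as a list of stage documents passed to a collection's aggregate() method.

```python
from collections import defaultdict

def group_by(docs, key, value):
    """Emulate a $group-style aggregation over a list of documents."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc[value])
    return {k: {"count": len(v), "min": min(v), "max": max(v),
                "avg": sum(v) / len(v)}
            for k, v in groups.items()}

# Hypothetical records loosely shaped like the crime dataset used later.
crimes = [{"type": "THEFT", "beat": 4},
          {"type": "THEFT", "beat": 8},
          {"type": "ASSAULT", "beat": 6}]
stats = group_by(crimes, "type", "beat")
# stats["THEFT"] == {"count": 2, "min": 4, "max": 8, "avg": 6.0}
```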
C. Big Data Analytics with Apache Spark
Although the MapReduce framework is widely used by companies for Big Data analysis, its response time is not satisfactory and its programs execute only in batch form. After each map or reduce operation, the result must be written to disk; this disk-written data is how mappers and reducers communicate with each other. Writing to disk also provides a certain fault tolerance: if a map or reduce operation fails, it is enough to read the data back from disk to resume where we were. However, these writes and reads are time consuming. In addition, the expression set composed exclusively of map and reduce operations is limited and not very expressive; in other words, it is difficult to express complex operations using only these two operations.
Apache Spark is an alternative to Hadoop MapReduce for distributed computing that aims to solve both of these problems. The fundamental difference between Hadoop MapReduce and Spark is that Spark keeps data in RAM rather than writing it to disk. This has several important consequences for the speed of processing as well as for Spark's overall architecture.
Spark offers a complete and unified framework (Figure 1) to meet Big Data processing needs for datasets that are varied both in nature (text, graph, etc.) and in source type (batch or real-time stream). It allows applications to be written quickly in Java, Scala or Python and includes a set of more than 80 high-level operators; it can also be used interactively to query data from a shell. In addition to Map and Reduce operations, Spark supports SQL queries and data streaming, and offers machine learning and graph-oriented processing functions. Developers can use these capabilities stand-alone or combine them into a complex processing chain.
Figure 1: Apache Spark Ecosystem
Spark's programming model is similar to MapReduce, except that Spark introduces a new abstraction called Resilient Distributed Datasets (RDDs). Using RDDs, Spark can provide solutions for several applications that previously required the integration of multiple technologies, including SQL, streaming, machine learning and graph processing.
A Dataset is a distributed collection of data. It can be viewed as a conceptual evolution of RDDs (Resilient Distributed Datasets), historically the first distributed data structure used by Spark. A DataFrame is a Dataset organized into named columns, like tables in a database. In the Scala programming interface, the DataFrame type is simply an alias of the Dataset[Row] type.
It is possible to apply actions to Datasets, which produce values, and transformations, which produce new Datasets, as well as certain functions that do not fit into either category.
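The transformation/action split can be sketched in plain Python: transformations only record work, and an action forces evaluation. The class below mimics Spark's lazy evaluation with generators and is purely conceptual; it is not part of any Spark API.

```python
class LazyDataset:
    """Conceptual model of lazy RDD/Dataset evaluation (illustrative only)."""
    def __init__(self, data):
        self._data = data

    # Transformations: return a new LazyDataset; nothing is computed yet.
    def map(self, fn):
        return LazyDataset(fn(x) for x in self._data)

    def filter(self, pred):
        return LazyDataset(x for x in self._data if pred(x))

    # Actions: trigger evaluation and produce a value.
    def count(self):
        return sum(1 for _ in self._data)

    def first(self):
        return next(iter(self._data))

ds = LazyDataset(range(10)).map(lambda x: x * x).filter(lambda x: x > 10)
n = ds.count()  # the squares and the filter run only here; n == 6
```

This is also why an action such as first() can return almost immediately while count() must traverse the whole dataset, a distinction that matters in the measurements of Section III.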
Figure 2: Spark Command lines Example
Spark exposes RDDs through a functional programming
API in Scala, Java, Python, and R, where users can simply
pass local functions to run on the cluster.
III. COMPARISON
A. The Experiments Results
We made the comparison on files of the same size and type (.CSV). The test files are available at "https://catalog.data.gov/dataset/crimes-2001-to-present-398a4". We copied each file into the Hadoop file system; the same file was then imported into MongoDB.
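The loading steps can be sketched with the standard command-line tools (file paths and database names here are hypothetical; the flags are standard `hdfs` and `mongoimport` options):

```shell
# Copy the CSV file into HDFS for the Spark-on-Hadoop runs.
hdfs dfs -put crimes.csv /data/crimes.csv

# Import the same CSV into MongoDB, using the header line as field names.
mongoimport --db crimes --collection incidents \
            --type csv --headerline --file crimes.csv
```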
We ran the test on one, three and four nodes. The machines used had the following configuration:
• 8 GB RAM
• Linux Fedora 26
• 120 GB SSD
• 6th-generation Intel i5 processor
Table 1: Spark's performance with Hadoop and MongoDB

Nodes | File size (GB) | Action | Hadoop | MongoDB
------|----------------|--------|--------|--------
1     | 1.55           | first  | 96 ms  | 77 ms
1     | 1.55           | count  | 10 s   | 2.0 min
3     | 3.11           | first  | 90 ms  | 65 ms
3     | 3.11           | count  | 19 s   | 3.4 min
4     | 4.66           | first  | 0.1 s  | 57 s
4     | 4.66           | count  | 29 s   | 5.3 min
These results are illustrated in the following figure:
Figure 3: Comparison of Spark's performance versus Hadoop
and MongoDB
B. Results Interpretation
According to the results of this study, the execution time of the first operation, which looks up the first record of the file, is roughly the same on Hadoop and MongoDB, and sometimes Spark is even faster with MongoDB. For the count operation, however, which requires loading the entire file into memory as an RDD, Spark is much faster with Hadoop.
For the moment, Hadoop remains the best global storage solution, with more advanced administration, security and monitoring tools. This is the choice Oracle made for its data discovery and analysis solution, Big Data Discovery: the product installs on a Hadoop cluster (exclusively Cloudera) and relies heavily on Spark for its processing.
IV. CONCLUSION
In this article, we presented the results of an experimental study on the performance of the leading Big Analytics framework (Spark) with the most popular NoSQL databases, MongoDB and Hadoop. The aim of this study was to determine the software combination that allows sophisticated analysis in real time. According to the results of this study, Spark is much faster with Hadoop.
REFERENCES
[1] NoSQL databases, http://nosql-database.org/, 2018.
[2] Vasant Dhar, "Data Science and Prediction", Communications of the ACM, no. 12, December 2013, pp. 64-73.
[3] T. Davenport and D. J. Patil, "Data Scientist: The Sexiest Job of the 21st Century", Harvard Business Review, 2012.
[4] Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Chapman & Hall, New York.
[5] L. Breiman. Bagging predictors. Machine Learning, 24(2), 1996.
[6] L. Breiman. Random forests. Machine Learning, 45, 2001.
[7] MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, pp. 281-297.
[8] Maron, M. E. (1961). Automatic Indexing: An Experimental Inquiry. Journal of the ACM (JACM), Vol. 8, Iss. 3, pp. 404-417.
[9] Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., ... & Ghodsi, A. (2016). Apache Spark: A unified engine for big data processing. Communications of the ACM, 59(11), 56-65.
[10] Gopalani, S., & Arora, R. (2015). Comparing Apache Spark and MapReduce with performance analysis using K-means. International Journal of Computer Applications, 113(1).
[11] Omar, H., Rachid, D., Mohammed, T., Zouhair, I.: An Advanced Comparative Study of the Most Promising NoSQL and NewSQL Databases With a Multi-Criteria Analysis Method. Journal of Theoretical and Applied Information Technology, Vol. 81, No. 3.
[12] Solid IT (2018), https://db-engines.com
[13] https://www.mongodb.com/collateral/apache-spark-and-mongodb-turning-analytics-into-real-time-action
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 

Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?

Abstract— Big Data is an evolution of Business Intelligence (BI). Whereas traditional BI relies on data warehouses limited in size (a few terabytes) and struggles with unstructured data and real-time analysis, the era of Big Data opens a new technological period, offering advanced architectures and infrastructures that allow sophisticated analyses of these new data integrated into the ecosystem of the business. In this article, we present the results of an experimental study on the performance of the leading Big Analytics framework (Spark) with the most popular NoSQL databases, MongoDB and Hadoop. The objective of this study is to determine the software combination that allows sophisticated analysis in real time.

Keywords— big data analytics; NoSQL databases; Apache Spark; Hadoop; MongoDB; performance.

I. INTRODUCTION

The Big Data phenomenon covers two realities for companies: on the one hand, the continuous explosion of data; on the other hand, the capacity to process and analyze this great mass of data profitably. With Big Data, organizations can now manage and process massive data to extract value, decide and act in real time.

NoSQL databases were developed to provide a set of new data management features while overcoming some limitations of the relational databases currently in use [1]. NoSQL databases are not relational and do not require a model or structure for data storage, which facilitates storage and data search. In addition, they allow horizontal scalability: administrators can increase the number of server machines to reduce overall system load, and new nodes are integrated and operated automatically by the system. Horizontal scalability reduces query response time at low cost.
Alongside the NoSQL databases (Hadoop, MongoDB, Cassandra, HBase, Redis, Riak, etc.), a new profession has appeared: the data scientist. Data science is the extraction of knowledge from data sets [2, 3]. It employs techniques and theories derived from several broader areas of mathematics, mainly statistics, probabilistic models and machine learning. Thus, to develop algorithms in a distributed environment, the analyst must master big data analytics tools (Mahout, MapReduce, Spark and Storm) and learn the syntax of functional languages such as Scala, Erlang or Clojure. Big data analytics therefore favors a return to grace of functional languages and of robust methods that are easily distributable (MapReduce) on thousands of nodes: decision trees [4, 5], random forests [6], k-means [7] and the naive Bayes classifier [8].

For storing the collected data, any NoSQL database can fulfill this role. However, the need to analyze these data pushes us to choose the database carefully, especially since in the field of Big Data the analytic part becomes more and more important. For advanced, real-time analytics, the best framework available is Apache Spark [9, 10]. According to the official version, Spark uses the Hadoop HDFS file system. In a previous study [11] based on a multi-criteria analysis method, the MongoDB system obtained the highest score; today this result is confirmed, and the system has become popular [12]. According to a white paper [13] published by MongoDB, the combination of the fastest analysis engine (Spark) with the fastest-growing database (MongoDB) allows companies to easily perform reliable real-time analysis. This led us to compare Spark's performance with the most popular NoSQL databases, MongoDB and Hadoop. In this article, we present and discuss the results of our experimental study and determine the software combination that allows sophisticated analyses in real time.
This paper is organized as follows: Section II presents big data analytics on Hadoop and MongoDB. In Section III, we present the results of an experimental study on the performance of the Spark framework with MongoDB and Hadoop. Section IV provides a conclusion.

II. BIG DATA ANALYTICS

In this part, we introduce the data analysis technologies used on Hadoop and MongoDB.

A. Big Data Analytics on Hadoop

The first solution integrated with Hadoop for data analysis is the MapReduce framework. MapReduce is not in itself a database component: it is a distributed information processing approach that takes an input list and produces one in return. It can be used in many situations and is well suited to distributed processing needs and decision-making processes.

Omar HAJOUI, Mohamed TALEA
LTI Laboratory, Faculty of Science Ben M'Sik, Hassan II University, Casablanca, Morocco
{hajouio, taleamohamed}@yahoo.fr

International Journal of Computer Science and Information Security (IJCSIS), Vol. 16, No. 1, January 2018, ISSN 1947-5500, https://sites.google.com/site/ijcsis/
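The map/reduce flow just described can be sketched in plain Python as a word count, the canonical MapReduce example. This is a single-process analogy of the model, not the distributed Hadoop implementation: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(records):
    # Each "mapper" turns one input record into (key, value) pairs.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Group all values by key (done by the framework in real MapReduce).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate the values associated with each key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data analytics", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'analytics': 1}
```

In the real framework, the map and reduce functions run on different machines and the shuffle moves data across the network, but the contract between the phases is exactly the one sketched here.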
MapReduce was defined in 2004 in an article written by Google. The principle is simple: to distribute a processing task, Google imagined a two-step operation: first, an assignment of operations to each machine (Map), followed by a grouping of the results (Reduce). The needs that gave birth to MapReduce at Google were twofold: handling gigantic volumes of unstructured data (web pages to analyze to feed the Google search engine, or the logs produced by its indexing engines, for example), and deriving results from calculations, aggregates and summaries, in short, from analysis.

The free reference implementation of MapReduce is called Hadoop, a system developed in Java by a team led by Doug Cutting for the needs of the Nutch distributed indexing engine for Yahoo!. Hadoop directly implements the Google paper on MapReduce and bases its distributed storage on HDFS (Hadoop Distributed File System), which implements the Google paper on GFS (Google File System). The Hadoop MapReduce framework (YARN) was subsequently adopted by several NoSQL databases such as HBase and Cassandra.

Facebook then developed the HQL language (Hive query language) on Hive, a language close to SQL used to query HDFS. Another language, called Pig, was developed by Yahoo; it is similar in syntax to Perl and pursues the same goals as Hive. In addition, Cloudera, another Hadoop distribution, integrates the Impala query engine; analysts and data scientists favor this latter to perform analysis on data stored in Hadoop via SQL tools or business intelligence tools. The Mahout project provides algorithm implementations for business intelligence, for example machine-learning algorithms (k-means, random forest).

B. Big Data Analytics on MongoDB

MongoDB is an open-source document-oriented database designed for exceptionally high performance and developed in C++. Data is stored and queried in BSON, a format similar to JSON.
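The document model can be illustrated with plain Python dictionaries, which map directly onto JSON (and thus onto BSON). The collection and field names below are hypothetical, and no MongoDB instance is needed for this sketch; the point is that documents in the same collection need not share a schema.

```python
import json

# Two documents in the same hypothetical "crimes" collection;
# the second carries an extra field, which the document model allows.
crimes = [
    {"type": "THEFT", "year": 2018},
    {"type": "BATTERY", "year": 2018, "arrest": True},
]

# A find()-like filter expressed as a plain predicate.
thefts = [doc for doc in crimes if doc["type"] == "THEFT"]
assert thefts == [{"type": "THEFT", "year": 2018}]

# BSON is a binary cousin of JSON; the textual form round-trips cleanly.
assert json.loads(json.dumps(crimes[0])) == {"type": "THEFT", "year": 2018}
```

With an actual database, the filter above would be written as a query document passed to `find()`, e.g. `{"type": "THEFT"}`, but the flexible-schema property is already visible in the in-memory sketch.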
MongoDB has dynamic and flexible schemas, making data integration easier and faster than with traditional databases. Unlike NoSQL databases that offer only basic queries, MongoDB lets developers use native queries and data mining capabilities to generate many classes of analysis before having to adopt dedicated frameworks such as Spark or MapReduce for more specialized tasks.

Several organizations, including McAfee, Salesforce, Buzzfeed, Amadeus and KPMG, rely on MongoDB's powerful query language, aggregations and indexing to generate real-time analytics directly on their operational data. MongoDB users have access to a wide range of query, projection and update operators that support real-time analytic queries on operational data:

• The MongoDB Aggregation Pipeline is similar in concept to the SQL GROUP BY statement, enabling users to generate aggregations of the values returned by a query (e.g., count, minimum, maximum, average, intersections) that can power analytics dashboards and visualizations.
• Range queries return results based on values defined as inequalities (e.g., greater than, less than or equal to, between).
• Search queries return results in relevance order and in faceted groups, based on text arguments using Boolean operators (e.g., AND, OR, NOT), and through bucketing, grouping and counting of query results.
• MongoDB provides native support for MapReduce, allowing complex JavaScript processing. Multiple MapReduce jobs can run simultaneously on the same server and on sharded collections.
• JOINs, graph queries, key-value queries, etc.

C. Big Data Analytics on Spark

The MapReduce framework, despite being widely used by companies for Big Data analysis, has an unsatisfactory response time, and its programs execute only as batches. After a map or reduce operation, the result must be written to disk; this disk-written data is what allows mappers and reducers to communicate with each other.
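The stage-to-stage disk handoff, and the in-memory alternative, can be caricatured in a single Python process. This is an analogy only (neither engine's actual code); the file path and the two stages are illustrative.

```python
import json
import os
import tempfile

data = list(range(10))

def disk_pipeline(values):
    # MapReduce-style: stage 1 materializes its output on disk so that
    # stage 2 can read it back (how mappers hand data to reducers).
    path = os.path.join(tempfile.mkdtemp(), "stage1.json")
    with open(path, "w") as f:
        json.dump([v * v for v in values], f)   # stage 1: square each value
    with open(path) as f:
        squared = json.load(f)                  # stage 2: re-read from disk
    return sum(squared)

def memory_pipeline(values):
    # Spark-style: the intermediate result stays in RAM between stages.
    squared = [v * v for v in values]
    return sum(squared)

# Both give the same answer; only the intermediate storage differs.
assert disk_pipeline(data) == memory_pipeline(data) == 285
```

On a real cluster the disk round trip is repeated for every pair of stages, which is precisely the overhead Spark's in-memory model removes.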
This write to disk also provides a certain tolerance to failures: if a map or reduce operation fails, it suffices to read the data back from disk and resume where we were. However, these writes and reads are time-consuming. In addition, the expression set composed exclusively of map and reduce operations is very limited and not very expressive; in other words, it is difficult to express complex operations using only these two operations.

Apache Spark is an alternative to Hadoop MapReduce for distributed computing that aims to solve both of these problems. The fundamental difference between Hadoop MapReduce and Spark is that Spark keeps data in RAM rather than writing it to disk. This has several important consequences for the speed of processing as well as for the overall architecture of Spark.

Spark offers a complete and unified framework (Figure 1) to meet the needs of Big Data processing for various datasets, varied both in their nature (text, graph, etc.) and in their type of source (batch or real-time flow). It allows applications to be written quickly in Java, Scala or Python and includes a set of more than 80 high-level operators; it can also be used interactively to query data from a shell. In addition to Map and Reduce operations, Spark supports SQL queries and data streaming and offers machine learning and graph-oriented processing functions. Developers can use these capabilities stand-alone or combine them into a complex processing chain.
Figure 1: Apache Spark Ecosystem

Spark's programming model is similar to MapReduce, except that Spark introduces a new abstraction called Resilient Distributed Datasets (RDDs). Using RDDs, Spark can provide solutions for several applications that previously required the integration of multiple technologies, including SQL, streaming, machine learning and graph processing.

A Dataset is a distributed collection of data. It can be viewed as a conceptual evolution of RDDs, historically the first distributed data structure used by Spark. A DataFrame is a Dataset organized into named columns, like the tables of a database. With the Scala programming interface, the DataFrame type is simply an alias of the Dataset[Row] type.

Datasets support actions, which produce values, and transformations, which produce new Datasets, as well as certain functions that do not fit into either category.

Figure 2: Spark Command Lines Example

Spark exposes RDDs through a functional programming API in Scala, Java, Python and R, where users can simply pass local functions to run on the cluster.

III. COMPARISON

A. The Experimental Results

We made the comparison on files of the same size and type (.csv). The test files are available at "https://catalog.data.gov/dataset/crimes-2001-to-present 398a4". We copied each file to the Hadoop file system; the same file was then imported by MongoDB. We ran the test on one node, three nodes and four nodes.
The machines used have the following configuration:

• 8 GB RAM
• Linux Fedora 26
• 120 GB SSD
• 6th-generation i5 processor

Table 1: Spark's performance with Hadoop and MongoDB

Nodes | File size (GB) | Action | Hadoop | MongoDB
  1   |     1.55       | first  | 96 ms  | 77 ms
      |                | count  | 10 s   | 2.0 min
  3   |     3.11       | first  | 90 ms  | 65 ms
      |                | count  | 19 s   | 3.4 min
  4   |     4.66       | first  | 0.1 s  | 57 s
      |                | count  | 29 s   | 5.3 min

These results are illustrated in the following figure:

Figure 3: Comparison of Spark's performance with Hadoop and MongoDB

B. Results Interpretation

According to the results of this study, the execution time of the first operation, which looks up the first record of the file, is about the same on Hadoop and MongoDB, and Spark is sometimes faster with MongoDB. For the count operation, however, which requires loading the entire file into memory as an RDD, Spark is much faster with Hadoop.
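The asymmetry between first and count can be mimicked with a plain Python iterator (an analogy, not the Spark API): a "first"-like action materializes only one record, while a "count"-like action must pull every record through memory, which is where the storage backend's throughput dominates.

```python
consumed = []

def records(n):
    # Simulates reading records one by one from storage,
    # keeping track of how many are actually fetched.
    for i in range(n):
        consumed.append(i)
        yield i

# "first": stops after a single record has been read.
first = next(records(1000))
assert first == 0
assert len(consumed) == 1

# "count": forces a full scan of the dataset.
count = sum(1 for _ in records(1000))
assert count == 1000
assert len(consumed) == 1001  # 1 from "first" + 1000 from "count"
```

This is consistent with Table 1: both backends serve a single record quickly, but the full scan exposes the difference in bulk read throughput.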
For the moment, Hadoop remains the best global storage solution, with more advanced administration, security and monitoring tools. Oracle made this choice for its new data discovery and analysis solution, Big Data Discovery: the product installs on a Hadoop cluster (exclusively Cloudera) and relies heavily on Spark for its processing.

IV. CONCLUSION

In this article, we presented the results of an experimental study on the performance of the best framework of Big Analytics (Spark) with the most popular NoSQL databases, MongoDB and Hadoop. The aim of this study was to determine the software combination that allows sophisticated analysis in real time. According to the results of this study, Spark is much faster with Hadoop.

REFERENCES
[1] NoSQL, http://nosql-database.org/, 2018.
[2] V. Dhar, "Data Science and Prediction," Communications of the ACM, no. 12, December 2013, pp. 64-73.
[3] T. Davenport and D. J. Patil, "Data Scientist: The Sexiest Job of the 21st Century," Harvard Business Review, 2012.
[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Chapman & Hall, New York, 1984.
[5] L. Breiman, "Bagging predictors," Machine Learning, 24(2), 1996.
[6] L. Breiman, "Random forests," Machine Learning, 45, 2001.
[7] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, 1967, pp. 281-297.
[8] M. E. Maron, "Automatic Indexing: An Experimental Inquiry," Journal of the ACM (JACM), Vol. 8, Iss. 3, 1961, pp. 404-417.
[9] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, et al., "Apache Spark: A unified engine for big data processing," Communications of the ACM, 59(11), 2016, pp. 56-65.
[10] S. Gopalani and R. Arora, "Comparing Apache Spark and MapReduce with performance analysis using K-means," International Journal of Computer Applications, 113(1), 2015.
[11] H. Omar, D. Rachid, T. Mohammed, and I. B. Zouhair, "An Advanced Comparative Study of the Most Promising NoSQL and NewSQL Databases with a Multi-Criteria Analysis Method," Journal of Theoretical and Applied Information Technology, Vol. 81, No. 3.
[12] Solid IT (2018), https://db-engines.com
[13] https://www.mongodb.com/collateral/apache-spark-and-mongodb-turning-analytics-into-real-time-action