Which NoSQL Database to Combine with Spark for Real-Time Big Data Analytics?
Abstract— Big Data is an evolution of Business Intelligence (BI). Whereas traditional BI relies on data warehouses limited in size (a few terabytes) and struggles with unstructured data and real-time analysis, the Big Data era opens a new technological period, offering advanced architectures and infrastructures that allow sophisticated analyses of these new data integrated into the ecosystem of the business. In this article, we present the results of an experimental study on the performance of the leading Big Analytics framework (Spark) with the most popular NoSQL databases, MongoDB and Hadoop. The objective of this study is to determine the software combination that allows sophisticated analysis in real time.
Keywords- big data analytics; NoSQL databases; Apache Spark; Hadoop; MongoDB; performance.
I. INTRODUCTION
For companies, the Big Data phenomenon covers two realities: on the one hand, the continuous explosion of data; on the other, the capacity to process and analyze this great mass of data profitably. With Big Data, organizations can now manage and process massive data to extract value, decide, and act in real time.
NoSQL databases were developed to provide a set of new data management features while overcoming some limitations of currently used relational databases [1]. NoSQL databases are not relational and do not require a fixed model or structure for data storage, which facilitates both storage and search. In addition, they allow horizontal scalability: administrators can increase the number of server machines to reduce the overall system load, and new nodes are integrated and operated automatically by the system. Horizontal scalability reduces query response time at low cost.
Alongside NoSQL databases (Hadoop, MongoDB, Cassandra, HBase, Redis, Riak, etc.), a new profession has appeared: the data scientist. Data science is the extraction of knowledge from data sets [2, 3]. It employs techniques and theories derived from several broader areas of mathematics, mainly statistics, probabilistic models, and machine learning. Thus, to develop algorithms in a distributed environment, the analyst must master Big Data analytics tools (Mahout, MapReduce, Spark, and Storm) and learn the syntax of functional languages such as Scala, Erlang, or Clojure.
Big Data analytics therefore favors a return to favor of functional languages and of robust methods that are easily distributable (MapReduce) over thousands of nodes: decision trees [4, 5], random forests [6], k-means [7], and the Naive Bayes classifier [8].
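To illustrate why a method such as k-means distributes so naturally, here is a minimal single-process sketch in plain Python: the assignment step is a "map" over points and the centroid update is a "reduce" per cluster. The 1-D data and function name are invented for illustration; this is not tied to any of the frameworks cited above.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Toy 1-D k-means; illustrative only."""
    for _ in range(iterations):
        # Map step: assign each point to its nearest centroid.
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Reduce step: each centroid becomes the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in clusters.items()]
    return centroids

centroids = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
# centroids converge to roughly [1.0, 9.0]
```

In a distributed setting, the assignment of points and the partial sums per cluster can be computed independently on each node, which is exactly what makes the method MapReduce-friendly.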
For storing collected data, any NoSQL database can fulfill this role. However, the need to analyze this data pushes us to choose the database carefully, especially since in the field of Big Data the analytic part becomes more and more important. For advanced, real-time analytics, the best framework available is Apache Spark [9, 10]. In its official distribution, Spark uses the Hadoop HDFS file system.
In a previous study [11] based on a multicriteria analysis method, the MongoDB system obtained the highest score. Today, this result is confirmed: the system has become very popular [12]. According to a white paper [13] published by MongoDB, the combination of the fastest analysis engine (Spark) with the fastest-growing database (MongoDB) allows companies to easily perform reliable real-time analysis. This led us to compare Spark's performance with the most popular NoSQL databases, MongoDB and Hadoop. In this article, we present and discuss the results of our experimental study and determine the software combination that allows sophisticated analyses in real time.
This paper is organized as follows: Section II presents Big Data analytics on Hadoop and MongoDB. Section III presents the results of an experimental study on the performance of the Spark framework with MongoDB and Hadoop. Section IV concludes.
II. BIG DATA ANALYTICS
In this part, we will introduce the data analysis technologies
used on Hadoop and MongoDB.
A. Big Data Analytics on Hadoop
The first data analysis solution integrated with Hadoop is the MapReduce framework. MapReduce is not in itself a database component. This distributed information processing approach takes an input list and produces one in return. It can be used in many situations and is well suited to distributed processing needs and decision-making processes.
Omar HAJOUI, Mohamed TALEA
LTI Laboratory, Faculty of Science Ben M’Sik
Hassan II University, Casablanca, Morocco
{hajouio, taleamohamed}@yahoo.fr
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 16, No. 1, January 2018
43 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
MapReduce was defined in 2004 in an article written by Google. The principle is simple: to distribute a computation, Google imagined a two-step operation: first, operations are assigned to each machine (Map), then processing is followed by a grouping of results (Reduce). The needs that gave birth to MapReduce at Google were twofold: how to handle gigantic volumes of unstructured data (web pages analyzed to feed the Google search engine, or the logs produced by its indexing engines, for example), and how to derive results from calculations, aggregates, and summaries; in short, from analysis.
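The two-step principle above can be sketched in a few lines of Python: map emits (key, 1) pairs, a shuffle groups them by key, and reduce sums each group. This is a single-process illustration only; a real Hadoop job would spread these phases across machines.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    # Emit one (word, 1) pair per word.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between Map and Reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts of each group.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

docs = ["big data", "big analytics"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
# counts == {'big': 2, 'data': 1, 'analytics': 1}
```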
The free reference implementation of MapReduce is called Hadoop, a system developed in Java by a team led by Doug Cutting for the purposes of the Nutch distributed indexing engine at Yahoo!. Hadoop directly implements Google's paper on MapReduce and bases its distributed storage on HDFS (Hadoop Distributed File System), which implements Google's paper on GFS (Google File System). Later, the Hadoop MapReduce framework (running on the YARN resource manager) was adopted by several NoSQL databases such as HBase and Cassandra.
Facebook then developed the HQL language (the Hive query language) on Hive, a language close to SQL used to query data stored in HDFS. Another language, called Pig and developed by Yahoo!, is similar in syntax to Perl and aims at the same goals as Hive. In addition, Cloudera, another Hadoop distribution, integrates the Impala query engine; analysts and data scientists favor this latter option to perform analysis on data stored in Hadoop via SQL or business intelligence tools. Finally, the Mahout project provides algorithm implementations for business intelligence, for example machine-learning algorithms such as k-means and random forests.
B. Big Data Analytics on MongoDB
MongoDB is an open-source document-oriented database developed in C++ and designed for exceptionally high performance. Data is stored and queried in BSON, a format similar to JSON. It has dynamic and flexible schemas, making data integration easier and faster than with traditional databases. Unlike NoSQL databases that offer only basic queries, MongoDB lets developers use its native query and data mining capabilities to generate many classes of analysis before having to adopt dedicated frameworks such as Spark or MapReduce for more specialized tasks.
Several organizations, including McAfee, Salesforce, Buzzfeed, Amadeus, KPMG and many others, rely on MongoDB's powerful query language, aggregations and indexing to generate real-time analytics directly on their operational data. MongoDB users have access to a wide range of query, projection and update operators that support real-time analytic queries on operational data:
• The MongoDB Aggregation Pipeline is similar in concept to the SQL GROUP BY statement, enabling users to generate aggregations of values returned by the query (e.g., count, minimum, maximum, average, intersections) that can be used to power analytics dashboards and visualizations.
• Range queries return results based on values defined as inequalities (e.g., greater than, less than or equal to, between).
• Search queries return results in relevance order and in faceted groups, based on text arguments using Boolean operators (e.g., AND, OR, NOT), and through bucketing, grouping and counting of query results.
• MongoDB provides native support for MapReduce, allowing complex JavaScript processing. Multiple MapReduce jobs can run simultaneously on the same server and on sharded collections.
• JOINs, graph queries, key-value queries, and more.
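To make the GROUP BY analogy concrete, here is a plain-Python emulation of what a $group stage computes (count, min, max, average per key). The field names and records are invented for illustration; in MongoDB itself the same result would be expressed as a list of stage documents passed to a collection's aggregate() method.

```python
from collections import defaultdict

def group_by(docs, key, value):
    """Emulate a $group-style aggregation over a list of documents."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc[value])
    return {k: {"count": len(v), "min": min(v), "max": max(v),
                "avg": sum(v) / len(v)}
            for k, v in groups.items()}

# Hypothetical records loosely shaped like the crime dataset used later.
crimes = [{"type": "THEFT", "beat": 4},
          {"type": "THEFT", "beat": 8},
          {"type": "ASSAULT", "beat": 6}]
stats = group_by(crimes, "type", "beat")
# stats["THEFT"] == {"count": 2, "min": 4, "max": 8, "avg": 6.0}
```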
C. Big Data Analytics with Apache Spark
Although the MapReduce framework is widely used by companies for Big Data analysis, its response time is not satisfactory and its programs execute only in batch form. After each map or reduce operation, the result must be written to disk; this disk-written data is how mappers and reducers communicate with each other. Writing to disk also provides a certain fault tolerance: if a map or reduce operation fails, it is enough to read the data back from disk to resume where we were. However, these writes and reads are time consuming. In addition, the expression set composed exclusively of map and reduce operations is limited and not very expressive; in other words, it is difficult to express complex operations using only these two operations.
Apache Spark is an alternative to Hadoop MapReduce for distributed computing that aims to solve both of these problems. The fundamental difference between Hadoop MapReduce and Spark is that Spark keeps data in RAM rather than writing it to disk. This has several important consequences for the speed of processing as well as for Spark's overall architecture.
Spark offers a complete and unified framework (Figure 1) to meet Big Data processing needs for datasets that are varied both in nature (text, graph, etc.) and in source type (batch or real-time stream). It allows applications to be written quickly in Java, Scala or Python and includes a set of more than 80 high-level operators; it can also be used interactively to query data from a shell. In addition to Map and Reduce operations, Spark supports SQL queries and data streaming, and offers machine learning and graph-oriented processing functions. Developers can use these capabilities stand-alone or combine them into a complex processing chain.
Figure 1: Apache Spark Ecosystem
Spark's programming model is similar to MapReduce, except that Spark introduces a new abstraction called Resilient Distributed Datasets (RDDs). Using RDDs, Spark can provide solutions for several applications that previously required the integration of multiple technologies, including SQL, streaming, machine learning and graph processing.
A Dataset is a distributed collection of data. It can be viewed as a conceptual evolution of RDDs (Resilient Distributed Datasets), historically the first distributed data structure used by Spark. A DataFrame is a Dataset organized into named columns, like tables in a database. In the Scala programming interface, the DataFrame type is simply an alias of the Dataset[Row] type.
It is possible to apply actions to Datasets, which produce values, and transformations, which produce new Datasets, as well as certain functions that do not fit into either category.
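The transformation/action split can be sketched in plain Python: transformations only record work, and an action forces evaluation. The class below mimics Spark's lazy evaluation with generators and is purely conceptual; it is not part of any Spark API.

```python
class LazyDataset:
    """Conceptual model of lazy RDD/Dataset evaluation (illustrative only)."""
    def __init__(self, data):
        self._data = data

    # Transformations: return a new LazyDataset; nothing is computed yet.
    def map(self, fn):
        return LazyDataset(fn(x) for x in self._data)

    def filter(self, pred):
        return LazyDataset(x for x in self._data if pred(x))

    # Actions: trigger evaluation and produce a value.
    def count(self):
        return sum(1 for _ in self._data)

    def first(self):
        return next(iter(self._data))

ds = LazyDataset(range(10)).map(lambda x: x * x).filter(lambda x: x > 10)
n = ds.count()  # the squares and the filter run only here; n == 6
```

This is also why an action such as first() can return almost immediately while count() must traverse the whole dataset, a distinction that matters in the measurements of Section III.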
Figure 2: Spark Command lines Example
Spark exposes RDDs through a functional programming
API in Scala, Java, Python, and R, where users can simply
pass local functions to run on the cluster.
III. COMPARISON
A. The Experiments Results
We made the comparison on files of the same size and type (.CSV). The test files are available at "https://catalog.data.gov/dataset/crimes-2001-to-present-398a4". We copied each file into the Hadoop file system; the same file was then imported into MongoDB.
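The loading steps can be sketched with the standard command-line tools (file paths and database names here are hypothetical; the flags are standard `hdfs` and `mongoimport` options):

```shell
# Copy the CSV file into HDFS for the Spark-on-Hadoop runs.
hdfs dfs -put crimes.csv /data/crimes.csv

# Import the same CSV into MongoDB, using the header line as field names.
mongoimport --db crimes --collection incidents \
            --type csv --headerline --file crimes.csv
```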
We ran the test on one, three and four nodes. The machines used had the following configuration:
• 8 GB RAM
• Linux Fedora 26
• 120 GB SSD
• 6th-generation Intel i5 processor
Table 1: Spark's performance with Hadoop and MongoDB

Nodes | File size (GB) | Action | Hadoop | MongoDB
------|----------------|--------|--------|--------
1     | 1.55           | first  | 96 ms  | 77 ms
1     | 1.55           | count  | 10 s   | 2.0 min
3     | 3.11           | first  | 90 ms  | 65 ms
3     | 3.11           | count  | 19 s   | 3.4 min
4     | 4.66           | first  | 0.1 s  | 57 s
4     | 4.66           | count  | 29 s   | 5.3 min
These results are illustrated in the following figure:
Figure 3: Comparison of Spark's performance versus Hadoop
and MongoDB
B. Results Interpretation
According to the results of this study, the execution time of the first operation, which looks up the first record of the file, is roughly the same on Hadoop and MongoDB, and sometimes Spark is even faster with MongoDB. For the count operation, however, which requires loading the entire file into memory as an RDD, Spark is much faster with Hadoop.
For the moment, Hadoop remains the best global storage solution, with more advanced administration, security and monitoring tools. This is the choice Oracle made for its data discovery and analysis solution, Big Data Discovery: the product installs on a Hadoop cluster (exclusively Cloudera) and relies heavily on Spark for its processing.
IV. CONCLUSION
In this article, we presented the results of an experimental study on the performance of the leading Big Analytics framework (Spark) with the most popular NoSQL databases, MongoDB and Hadoop. The aim of this study was to determine the software combination that allows sophisticated analysis in real time. According to the results of this study, Spark is much faster with Hadoop.
REFERENCES
[1] NoSQL databases, http://nosql-database.org/, 2018.
[2] Vasant Dhar, "Data Science and Prediction", Communications of the ACM, no. 12, December 2013, pp. 64-73.
[3] T. Davenport and D. J. Patil, "Data Scientist: The Sexiest Job of the 21st Century", Harvard Business Review, 2012.
[4] Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Chapman & Hall, New York.
[5] L. Breiman. Bagging predictors. Machine Learning, 24(2), 1996.
[6] L. Breiman. Random forests. Machine Learning, 45, 2001.
[7] MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, pp. 281-297.
[8] Maron, M. E. (1961). Automatic Indexing: An Experimental Inquiry. Journal of the ACM (JACM), Vol. 8, Iss. 3, pp. 404-417.
[9] Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., ... & Ghodsi, A. (2016). Apache Spark: A unified engine for big data processing. Communications of the ACM, 59(11), 56-65.
[10] Gopalani, S., & Arora, R. (2015). Comparing Apache Spark and MapReduce with performance analysis using K-means. International Journal of Computer Applications, 113(1).
[11] Omar, H., Rachid, D., Mohammed, T., Zouhair, I.: An Advanced Comparative Study of the Most Promising NoSQL and NewSQL Databases With a Multi-Criteria Analysis Method. Journal of Theoretical and Applied Information Technology, Vol. 81, No. 3.
[12] Solid IT (2018), https://db-engines.com
[13] https://www.mongodb.com/collateral/apache-spark-and-mongodb-turning-analytics-into-real-time-action
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 

Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?

Abstract— Big Data is an evolution of Business Intelligence (BI). Whereas traditional BI relies on data warehouses limited in size (a few terabytes) and struggles with unstructured data and real-time analysis, the era of Big Data opens a new technological period, offering advanced architectures and infrastructures that allow sophisticated analyses of these new data integrated into the ecosystem of the business. In this article, we present the results of an experimental study on the performance of the leading Big Analytics framework (Spark) with the most popular NoSQL databases, MongoDB and Hadoop. The objective of this study is to determine the software combination that allows sophisticated analysis in real time.

Keywords— big data analytics; NoSQL databases; Apache Spark; Hadoop; MongoDB; performance.

I. INTRODUCTION

The Big Data phenomenon covers two realities for companies: on the one hand, the continuous explosion of data; on the other hand, the capacity to process and analyze this great mass of data profitably. With Big Data, organizations can now manage and process massive data to extract value, decide and act in real time.

NoSQL databases were developed to provide a set of new data management features while overcoming some limitations of the relational databases currently in use [1]. NoSQL databases are not relational and do not require a model or structure for data storage, which facilitates storage and data search. In addition, they allow horizontal scalability: administrators can increase the number of server machines to reduce overall system load, and new nodes are integrated and operated automatically by the system. Horizontal scalability reduces query response time at low cost.
Alongside the NoSQL databases (Hadoop, MongoDB, Cassandra, HBase, Redis, Riak, etc.), a new profession has appeared: the data scientist. Data science is the extraction of knowledge from data sets [2, 3]. It employs techniques and theories derived from several broader areas of mathematics, mainly statistics, probabilistic models and machine learning. Thus, to develop algorithms in a distributed environment, the analyst must master big data analytics tools (Mahout, MapReduce, Spark and Storm) and learn the syntax of functional languages such as Scala, Erlang or Clojure. Big data analytics therefore favors a return to grace of functional languages and of robust methods that are easily distributable (MapReduce) on thousands of nodes: decision trees [4, 5], random forests [6], k-means [7] and the naive Bayes classifier [8].

For storing the collected data, any NoSQL database can fulfill this role. However, the need to analyze these data pushes us to choose the database carefully, especially since in the field of Big Data the analytic part becomes more and more important. For advanced, real-time analytics, the best framework available is Apache Spark [9, 10]. According to the official version, Spark uses the Hadoop HDFS file system. In a previous study [11] based on a multi-criteria analysis method, the MongoDB system obtained the highest score; today this result is confirmed, and the system has become popular [12]. According to a white paper [13] published by MongoDB, the combination of the fastest analysis engine (Spark) with the fastest-growing database (MongoDB) allows companies to easily perform reliable real-time analysis. This led us to compare Spark's performance with the most popular NoSQL databases, MongoDB and Hadoop. In this article, we present and discuss the results of our experimental study and determine the software combination that allows sophisticated analyses in real time.
This paper is organized as follows: Section II presents big data analytics on Hadoop and MongoDB. In Section III, we present the results of an experimental study on the performance of the Spark framework with MongoDB and Hadoop. Section IV provides a conclusion.

II. BIG DATA ANALYTICS

In this part, we introduce the data analysis technologies used on Hadoop and MongoDB.

A. Big Data Analytics on Hadoop

The first solution integrated with Hadoop for data analysis is the MapReduce framework. MapReduce is not in itself a database component: it is a distributed information processing approach that takes an input list and produces one in return. It can be used in many situations and is well suited to distributed processing needs and decision-making processes.

Omar HAJOUI, Mohamed TALEA
LTI Laboratory, Faculty of Science Ben M'Sik, Hassan II University, Casablanca, Morocco
{hajouio, taleamohamed}@yahoo.fr

International Journal of Computer Science and Information Security (IJCSIS), Vol. 16, No. 1, January 2018, ISSN 1947-5500, https://sites.google.com/site/ijcsis/
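The map/reduce flow just described can be sketched in plain Python as a word count, the canonical MapReduce example. This is a single-process analogy of the model, not the distributed Hadoop implementation: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(records):
    # Each "mapper" turns one input record into (key, value) pairs.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Group all values by key (done by the framework in real MapReduce).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate the values associated with each key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data analytics", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'analytics': 1}
```

In the real framework, the map and reduce functions run on different machines and the shuffle moves data across the network, but the contract between the phases is exactly the one sketched here.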
MapReduce was defined in 2004 in an article written by Google. The principle is simple: to distribute a processing task, Google imagined a two-step operation: first, an assignment of operations to each machine (Map), followed by a grouping of the results (Reduce). The needs that gave birth to MapReduce at Google were twofold: handling gigantic volumes of unstructured data (web pages to analyze to feed the Google search engine, or the logs produced by its indexing engines, for example), and deriving results from calculations, aggregates and summaries, in short, from analysis.

The free reference implementation of MapReduce is called Hadoop, a system developed in Java by a team led by Doug Cutting for the needs of the Nutch distributed indexing engine for Yahoo!. Hadoop directly implements the Google paper on MapReduce and bases its distributed storage on HDFS (Hadoop Distributed File System), which implements the Google paper on GFS (Google File System). The Hadoop MapReduce framework (YARN) was subsequently adopted by several NoSQL databases such as HBase and Cassandra.

Facebook then developed the HQL language (Hive query language) on Hive, a language close to SQL used to query HDFS. Another language, called Pig, was developed by Yahoo; it is similar in syntax to Perl and pursues the same goals as Hive. In addition, Cloudera, another Hadoop distribution, integrates the Impala query engine; analysts and data scientists favor this latter to perform analysis on data stored in Hadoop via SQL tools or business intelligence tools. The Mahout project provides algorithm implementations for business intelligence, for example machine-learning algorithms (k-means, random forest).

B. Big Data Analytics on MongoDB

MongoDB is an open-source document-oriented database designed for exceptionally high performance and developed in C++. Data is stored and queried in BSON, a format similar to JSON.
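The document model can be illustrated with plain Python dictionaries, which map directly onto JSON (and thus onto BSON). The collection and field names below are hypothetical, and no MongoDB instance is needed for this sketch; the point is that documents in the same collection need not share a schema.

```python
import json

# Two documents in the same hypothetical "crimes" collection;
# the second carries an extra field, which the document model allows.
crimes = [
    {"type": "THEFT", "year": 2018},
    {"type": "BATTERY", "year": 2018, "arrest": True},
]

# A find()-like filter expressed as a plain predicate.
thefts = [doc for doc in crimes if doc["type"] == "THEFT"]
assert thefts == [{"type": "THEFT", "year": 2018}]

# BSON is a binary cousin of JSON; the textual form round-trips cleanly.
assert json.loads(json.dumps(crimes[0])) == {"type": "THEFT", "year": 2018}
```

With an actual database, the filter above would be written as a query document passed to `find()`, e.g. `{"type": "THEFT"}`, but the flexible-schema property is already visible in the in-memory sketch.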
MongoDB has dynamic and flexible schemas, making data integration easier and faster than with traditional databases. Unlike NoSQL databases that offer only basic queries, MongoDB lets developers use native queries and data mining capabilities to generate many classes of analysis before having to adopt dedicated frameworks such as Spark or MapReduce for more specialized tasks.

Several organizations, including McAfee, Salesforce, Buzzfeed, Amadeus and KPMG, rely on MongoDB's powerful query language, aggregations and indexing to generate real-time analytics directly on their operational data. MongoDB users have access to a wide range of query, projection and update operators that support real-time analytic queries on operational data:

• The MongoDB Aggregation Pipeline is similar in concept to the SQL GROUP BY statement, enabling users to generate aggregations of the values returned by a query (e.g., count, minimum, maximum, average, intersections) that can power analytics dashboards and visualizations.
• Range queries return results based on values defined as inequalities (e.g., greater than, less than or equal to, between).
• Search queries return results in relevance order and in faceted groups, based on text arguments using Boolean operators (e.g., AND, OR, NOT), and through bucketing, grouping and counting of query results.
• MongoDB provides native support for MapReduce, allowing complex JavaScript processing. Multiple MapReduce jobs can run simultaneously on the same server and on sharded collections.
• JOINs, graph queries, key-value queries, etc.

C. Big Data Analytics on Spark

The MapReduce framework, despite being widely used by companies for Big Data analysis, has an unsatisfactory response time, and its programs execute only as batches. After a map or reduce operation, the result must be written to disk; this disk-written data is what allows mappers and reducers to communicate with each other.
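The stage-to-stage disk handoff, and the in-memory alternative, can be caricatured in a single Python process. This is an analogy only (neither engine's actual code); the file path and the two stages are illustrative.

```python
import json
import os
import tempfile

data = list(range(10))

def disk_pipeline(values):
    # MapReduce-style: stage 1 materializes its output on disk so that
    # stage 2 can read it back (how mappers hand data to reducers).
    path = os.path.join(tempfile.mkdtemp(), "stage1.json")
    with open(path, "w") as f:
        json.dump([v * v for v in values], f)   # stage 1: square each value
    with open(path) as f:
        squared = json.load(f)                  # stage 2: re-read from disk
    return sum(squared)

def memory_pipeline(values):
    # Spark-style: the intermediate result stays in RAM between stages.
    squared = [v * v for v in values]
    return sum(squared)

# Both give the same answer; only the intermediate storage differs.
assert disk_pipeline(data) == memory_pipeline(data) == 285
```

On a real cluster the disk round trip is repeated for every pair of stages, which is precisely the overhead Spark's in-memory model removes.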
This write to disk also provides a certain tolerance to failures: if a map or reduce operation fails, it suffices to read the data back from disk and resume where we were. However, these writes and reads are time-consuming. In addition, the expression set composed exclusively of map and reduce operations is very limited and not very expressive; in other words, it is difficult to express complex operations using only these two operations.

Apache Spark is an alternative to Hadoop MapReduce for distributed computing that aims to solve both of these problems. The fundamental difference between Hadoop MapReduce and Spark is that Spark keeps data in RAM rather than writing it to disk. This has several important consequences for the speed of processing as well as for the overall architecture of Spark.

Spark offers a complete and unified framework (Figure 1) to meet the needs of Big Data processing for various datasets, varied both in their nature (text, graph, etc.) and in their type of source (batch or real-time flow). It allows applications to be written quickly in Java, Scala or Python and includes a set of more than 80 high-level operators; it can also be used interactively to query data from a shell. In addition to Map and Reduce operations, Spark supports SQL queries and data streaming and offers machine learning and graph-oriented processing functions. Developers can use these capabilities stand-alone or combine them into a complex processing chain.
Figure 1: Apache Spark Ecosystem

Spark's programming model is similar to MapReduce, except that Spark introduces a new abstraction called Resilient Distributed Datasets (RDDs). Using RDDs, Spark can provide solutions for several applications that previously required the integration of multiple technologies, including SQL, streaming, machine learning and graph processing.

A Dataset is a distributed collection of data. It can be viewed as a conceptual evolution of RDDs, historically the first distributed data structure used by Spark. A DataFrame is a Dataset organized into named columns, like the tables of a database. With the Scala programming interface, the DataFrame type is simply an alias of the Dataset[Row] type.

Datasets support actions, which produce values, and transformations, which produce new Datasets, as well as certain functions that do not fit into either category.

Figure 2: Spark Command Lines Example

Spark exposes RDDs through a functional programming API in Scala, Java, Python and R, where users can simply pass local functions to run on the cluster.

III. COMPARISON

A. The Experimental Results

We made the comparison on files of the same size and type (.csv). The test files are available at "https://catalog.data.gov/dataset/crimes-2001-to-present 398a4". We copied each file to the Hadoop file system; the same file was then imported by MongoDB. We ran the test on one node, three nodes and four nodes.
The machines used have the following configuration:

• 8 GB RAM
• Linux Fedora 26
• 120 GB SSD
• 6th-generation i5 processor

Table 1: Spark's performance with Hadoop and MongoDB

Nodes | File size (GB) | Action | Hadoop | MongoDB
  1   |     1.55       | first  | 96 ms  | 77 ms
      |                | count  | 10 s   | 2.0 min
  3   |     3.11       | first  | 90 ms  | 65 ms
      |                | count  | 19 s   | 3.4 min
  4   |     4.66       | first  | 0.1 s  | 57 s
      |                | count  | 29 s   | 5.3 min

These results are illustrated in the following figure:

Figure 3: Comparison of Spark's performance with Hadoop and MongoDB

B. Results Interpretation

According to the results of this study, the execution time of the first operation, which looks up the first record of the file, is about the same on Hadoop and MongoDB, and Spark is sometimes faster with MongoDB. For the count operation, however, which requires loading the entire file into memory as an RDD, Spark is much faster with Hadoop.
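The asymmetry between first and count can be mimicked with a plain Python iterator (an analogy, not the Spark API): a "first"-like action materializes only one record, while a "count"-like action must pull every record through memory, which is where the storage backend's throughput dominates.

```python
consumed = []

def records(n):
    # Simulates reading records one by one from storage,
    # keeping track of how many are actually fetched.
    for i in range(n):
        consumed.append(i)
        yield i

# "first": stops after a single record has been read.
first = next(records(1000))
assert first == 0
assert len(consumed) == 1

# "count": forces a full scan of the dataset.
count = sum(1 for _ in records(1000))
assert count == 1000
assert len(consumed) == 1001  # 1 from "first" + 1000 from "count"
```

This is consistent with Table 1: both backends serve a single record quickly, but the full scan exposes the difference in bulk read throughput.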
For the moment, Hadoop remains the best global storage solution, with more advanced administration, security and monitoring tools. Oracle made this choice for its new data discovery and analysis solution, Big Data Discovery: the product installs on a Hadoop cluster (exclusively Cloudera) and relies heavily on Spark for its processing.

IV. CONCLUSION

In this article, we presented the results of an experimental study on the performance of the best framework of Big Analytics (Spark) with the most popular NoSQL databases, MongoDB and Hadoop. The aim of this study was to determine the software combination that allows sophisticated analysis in real time. According to the results of this study, Spark is much faster with Hadoop.

REFERENCES
[1] NoSQL, http://nosql-database.org/, 2018.
[2] V. Dhar, "Data Science and Prediction," Communications of the ACM, no. 12, December 2013, pp. 64-73.
[3] T. Davenport and D. J. Patil, "Data Scientist: The Sexiest Job of the 21st Century," Harvard Business Review, 2012.
[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Chapman & Hall, New York, 1984.
[5] L. Breiman, "Bagging predictors," Machine Learning, 24(2), 1996.
[6] L. Breiman, "Random forests," Machine Learning, 45, 2001.
[7] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, 1967, pp. 281-297.
[8] M. E. Maron, "Automatic Indexing: An Experimental Inquiry," Journal of the ACM (JACM), Vol. 8, Iss. 3, 1961, pp. 404-417.
[9] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, et al., "Apache Spark: A unified engine for big data processing," Communications of the ACM, 59(11), 2016, pp. 56-65.
[10] S. Gopalani and R. Arora, "Comparing Apache Spark and MapReduce with performance analysis using K-means," International Journal of Computer Applications, 113(1), 2015.
[11] H. Omar, D. Rachid, T. Mohammed, and I. B. Zouhair, "An Advanced Comparative Study of the Most Promising NoSQL and NewSQL Databases with a Multi-Criteria Analysis Method," Journal of Theoretical and Applied Information Technology, Vol. 81, No. 3.
[12] Solid IT (2018), https://db-engines.com
[13] https://www.mongodb.com/collateral/apache-spark-and-mongodb-turning-analytics-into-real-time-action