Report Title:
MapReduce advantages over parallel databases
Name: Ahmad Ali Taweel
Lecturer: Dr. Rafiq Haque
Date: 30/12/2017
Table of Contents
I. Introduction
II. State of the art
a. Parallel Database Technology
b. MapReduce Technology
III. Comparative Study
IV. Conclusion
1. Introduction
Big data refers to voluminous data objects that are varied in nature, generated at a high velocity, and uncertain in pattern, which makes them hard to process with standard software. To describe big data, there is no way we can avoid describing its four major characteristics, the 4 Vs: volume, velocity, variety, and veracity.
Volume
Volume: This is the quantity of data generated, not only from the Internet but also from transaction data internal to companies. As the data grow, the capacity required to store them has increased, and volume is rising radically with every passing second.
Variety
Variety: This is the category of big data. Big data originate from messages, social networks, government data, and media outlets. Variety refers to the different types or forms of data: structured, unstructured, and semi-structured. Structured data are like relational data; unstructured data include text, images, audio, and video; semi-structured data include XML data.
Velocity
Velocity: This is the speed of data creation. Compared to volume, the velocity of data creation is even more important to many companies, because obtaining real-time information allows them to react more quickly in the digital world [6].
Veracity
Veracity: This is the accuracy, trustworthiness, and quality of the data. Big data quality, which depends on the veracity of the source data, is very important for analysts to estimate their data accurately, since it affects the accuracy of the analysis.
Today, more and more people use the Internet for communication, shopping, transactions, and so on. According to IBM research, 5 exabytes of data were generated every two days in 2012. This rapidly growing flood of big data represents huge opportunities. Determining how to quickly handle challenges such as analysis, searching, sharing, storage, and transfer of these data is an essential key to success in the competitive digital world. To address these challenges, two technologies have been proposed: MapReduce and parallel databases.
2. State of the art
Back in the 1990s, the data generated by companies were initially not that large, so database management systems could find the best approach to solve data-related problems. With the Structured Query Language (SQL) becoming the standard query language, data scientists found it quite effective to handle data problems using SQL. However, as technology developed, data grew geometrically, and this method became infeasible because of the size of the data. Two decades ago, a terabyte of data was considered an uncommonly large volume, but today such sizes are common, even in a small company's database or file system. For example, Google processes 20 petabytes per day; Walmart processes more than one million transactions per hour, amounting to more than 2.5 petabytes of data; and AT&T has a 312-terabyte database that includes 1.9 trillion phone call records.
2.1 Parallel Database Systems
A parallel database system is a high-performance database system built on massively parallel processing or parallel computing environments. It allows multiple instances to share one physical database, so that the shared devices, software, and data can be accessed by multiple client instances.
Relational queries are ideally suited to parallel execution. Every relational query can be translated into operations such as scan or sort. From the figure below, we can easily see that each data stream comes from the data source and becomes an input of operator 1, which produces an output that is used as an input of operator 2; eventually, the final output is generated by merging the results of operator 2.
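As a minimal sketch of this scan → operator → merge pipeline (illustrative only, not from any particular database; the table, partitions, and operators are assumptions for the example), each partition can be processed by its own worker before a final merge:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table, horizontally partitioned across three "nodes".
partitions = [
    [(1, "ann", 300), (2, "bob", 120)],
    [(3, "cat", 450), (4, "dan", 80)],
    [(5, "eve", 500), (6, "fay", 210)],
]

def scan_and_filter(partition):
    # Operator 1: scan one partition and keep rows with value > 200.
    return [row for row in partition if row[2] > 200]

def sort_operator(rows):
    # Operator 2: sort the surviving rows by value.
    return sorted(rows, key=lambda r: r[2])

# Each partition is handled by its own worker, mimicking the same
# query operator running on every node in parallel.
with ThreadPoolExecutor() as pool:
    partial_results = list(pool.map(scan_and_filter, partitions))

# Merge step: combine the per-node outputs into one result stream.
merged = sort_operator([row for part in partial_results for row in part])
print(merged)  # rows with value > 200, sorted by value
```

The key property shown here is that the scan operator never sees more than its own partition, so adding nodes (partitions) adds scan capacity; only the merge is centralized.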
This approach requires a merge-based server that can handle the parallel execution of those operations. Without a high-speed network, this approach would be impossible. Today, most parallel database systems use a high-speed LAN to connect their workstations. Meanwhile, some companies use high-speed networks and distributed database technology to construct their parallel database systems, which costs more money.
A variety of hardware architectures allow multiple computers to share access to data,
software, or peripheral devices. A parallel database is designed to take advantage of
such architectures by running multiple instances which "share" a single physical
database. In appropriate applications, a parallel server can allow access to a single
database by users on multiple machines, with increased performance.
Tools that support parallel databases:
Speedment is an open-source Stream ORM Java toolkit and runtime that wraps an existing database and its tables into Java 8 streams. We can run the Speedment tool on an existing database, and it will generate POJO classes that correspond to the tables we have selected using the tool. One distinctive feature of Speedment is that it supports parallel database streams and can use different parallel strategies to further optimize performance.
2.2 MapReduce
Over the past years at Google, many computations were implemented that process large amounts of data, and it was found that processing such huge amounts of data requires distributing the work across hundreds or thousands of machines in order to finish in a reasonable amount of time. Google therefore designed a new abstraction that allows simple computations to be expressed while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library. They realized that most of their computations involved applying a map operation to each record in the input to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key. MapReduce is a programming model for processing and generating large data sets.
Programming Model:
There are two functions, Map and Reduce. Map takes an input and produces a set of intermediate key/value pairs. Reduce takes the output of the Map functions and merges all values that have the same key to produce a set of key/value pairs. Both functions are written by the user.
Example: counting the occurrences of each word in a document:

Map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

Reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
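The same word-count logic can be sketched as runnable Python (a rough simulation, not Google's library: the grouping step stands in for the shuffle the MapReduce library performs between phases):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Map: emit an intermediate (word, 1) pair for every word.
    return [(w, 1) for w in contents.split()]

def reduce_fn(word, counts):
    # Reduce: sum all the counts emitted for one word.
    return word, sum(counts)

def word_count(documents):
    # Shuffle/group: collect all intermediate values by key,
    # mimicking what the MapReduce library does between phases.
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            grouped[word].append(count)
    return dict(reduce_fn(w, c) for w, c in grouped.items())

docs = {"doc1": "the quick fox", "doc2": "the lazy dog the"}
print(word_count(docs))  # {'the': 3, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```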
Implementation:
1. The MapReduce library in the user program
• Splits the input files into M pieces, typically 16 to 64 megabytes (MB) per piece
• Starts up many copies of the program on a cluster of machines
• One copy is the master
• The rest are workers that are assigned work by the master
2. Master
• There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task
3. Worker who is assigned a map task
• Reads the contents of the corresponding input split, parses key/value pairs out of the input data, and passes each pair to the user-defined Map function
• The intermediate key/value pairs produced by the Map function are buffered in memory
4. On local disks
• Periodically, the buffered pairs are written to local disk
• They are partitioned into R regions by the partitioning function
• The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers
5. Worker who is assigned a reduce task
• After a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers
• When a reduce worker has read all the intermediate data, it sorts the data by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used
• It passes each unique intermediate key, with the corresponding set of intermediate values, to the user's Reduce function
• The output of the Reduce function is appended to a final output file for this reduce partition
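The steps above can be compressed into a toy, single-process sketch (illustrative only: the input, the M/R values, and the hash-based partitioning function are assumptions, and a real implementation runs the phases on separate machines with the master coordinating):

```python
from collections import defaultdict

M, R = 3, 2  # number of map tasks and reduce tasks

def map_fn(split):
    # Map phase: emit (word, 1) for every word in this input split.
    return [(w, 1) for w in split.split()]

def partition(key):
    # Partitioning function: decides which of the R reduce
    # "regions" an intermediate key belongs to.
    return hash(key) % R

text = "a b a c b a"
words = text.split()
# Step 1: split the input into M pieces (real splits are 16-64 MB each).
splits = [" ".join(words[i::M]) for i in range(M)]

# Steps 2-4: run each map task; buffer its output into R partitions,
# as the library would do on each map worker's local disk.
regions = [defaultdict(list) for _ in range(R)]
for split in splits:
    for key, value in map_fn(split):
        regions[partition(key)][key].append(value)

# Step 5: each reduce task sorts its region by key, then reduces.
output = {}
for region in regions:
    for key in sorted(region):
        output[key] = sum(region[key])
print(output)  # {'a': 3, 'b': 2, 'c': 1}
```

Note how the partitioning function guarantees that every occurrence of a given key lands in the same region, which is what lets each reduce task run independently of the others.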
Tools that support MapReduce:
Hadoop Distributed File System
HBase
Hive
ZooKeeper
CouchDB
MongoDB
Riak
3. Comparative Study:
In 2009, a paper by Andrew Pavlo et al. was published discussing the difference in performance between MapReduce and parallel databases. This paper is known as the comparison paper of Pavlo et al., in which MapReduce was called a major step backwards. Here I will address several misconceptions about MapReduce.
Heterogeneous Systems:
MapReduce provides a simple model for analyzing data in heterogeneous systems: one simply defines reader and writer implementations that operate on the underlying storage systems, such as relational databases or file systems.
In a parallel database, the input must first be copied in. This is where the issues start, with the inconvenience of the loading phase and its unacceptably slow speed. Only after loading is done can the analysis begin.
Complex Functions:
Map and Reduce functions are simple and straightforward compared to their SQL equivalents. Pavlo et al. pointed out that some computations are very complicated to express in SQL.
Parallel databases try to address such computations with user-defined functions (UDFs) that can be combined with SQL queries, but UDF support is buggy or missing in many parallel databases, while MapReduce handles these complex functions natively.
MapReduce is a better framework for doing more complicated tasks (such as those listed earlier) than the selection and aggregation that are SQL's forte.
Fault Tolerance:
There are two models for transferring data between mappers and reducers: the pull model and the push model. In the pull model, reducers pull data from the mappers, while in the push model, mappers write their data directly to the reducers.
As Pavlo et al. said, the pull model creates many small files and disk seeks. But MapReduce uses batching, sorting, and grouping of the intermediate data, along with smart scheduling of reads, to mitigate these costs.
MapReduce does not use the push model because of the fault-tolerance property required by Google's developers: under a push model, the failure of a reducer would force re-execution of all the map tasks.
Fault tolerance becomes more important as data sets grow larger, and data sets are clearly getting much larger over time, so a fault-tolerant system like MapReduce is needed to process these data efficiently.
Performance:
Cost of merging results: Pavlo et al. said that the final phase of MapReduce, where all results are merged into one file, is very expensive. But merging is not necessary when the next consumer of a MapReduce is another MapReduce, since it can operate directly on the files produced by the first one; and even when it is not, the reducer processes in the initial MapReduce can write directly to a merged destination (such as a Bigtable or a parallel database table).
Data loading: Hadoop can analyze data 5 to 50 times faster than the time needed to load the data into a parallel database. It is possible to run 50 separate MapReduce analyses over the data before it is even possible to load the data into the database and complete a single analysis.
4. Conclusion
MapReduce has been used successfully at Google because it is a highly effective and efficient tool for large-scale, fault-tolerant data analysis.
The MapReduce model is easy to use, even for programmers without parallel-programming experience, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. Many problems are also easily expressed in MapReduce, such as sorting, data mining, and machine learning.
MapReduce implementations can scale to large clusters of machines (hundreds to thousands of machines).
MapReduce is very useful for handling data processing and data loading in heterogeneous systems, as it provides a good framework for the execution of more complicated functions.