Report Title:
MapReduce advantages over parallel databases
Name: Ahmad Ali Taweel
Lecturer: Dr. Rafiq Haque
Date: 30/12/2017
Table of Contents
I. Introduction
II. State of the art
   a. Parallel Database Technology
   b. MapReduce Technology
III. Comparative study
IV. Conclusion
1. Introduction
Big data refers to voluminous data objects that are varied in nature, generated at a high velocity and with uncertain patterns, which makes them hard to process with standard software. To describe big data, we cannot avoid describing its four major characteristics, the 4Vs: volume, velocity, variety, and veracity.
Volume
Volume: This is the quantity of data generated, not only from the Internet but also from companies' internal transaction data. As the data grow, the capacity required to store them increases. Volume is increasing radically every second.
Variety
Variety: This is the category of big data. Big data originate from messages, social networks, government data, and media outlets. Variety refers to the different types or forms of data: structured, unstructured, and semi-structured. Structured data include relational data; unstructured data include text, images, audio, and video; semi-structured data include XML data.
Velocity
Velocity: This is the speed of data creation. Compared to volume, the velocity of data creation is even more important to many companies, because obtaining real-time information allows them to react more quickly in the digital world [6].
Veracity
Veracity: This is the accuracy, trustworthiness, and quality of the data. Big data quality, which depends on the veracity of the source data, is essential for analysts to estimate their data correctly, since it directly affects the accuracy of the analysis.
Today, more and more people use the Internet for communication, shopping, transactions, and so on. According to IBM research, about 5 exabytes of data were generated every two days in 2012. This rapidly growing flood of big data represents huge opportunities. Determining how to quickly address challenges such as analysis, search, sharing, storage, and transfer of these data is an essential key to success in the competitive digital world, and two technologies were proposed to address them: MapReduce and parallel databases.
2. State of the art
Back in the 1990s, the data generated by companies were not yet that large, so database management systems could offer the best approach to data-related problems. With the Structured Query Language (SQL) becoming the standard query language, data scientists found that it was quite effective to deal with data problems using SQL. However, as technology developed, data grew geometrically, and this method became infeasible because of the sheer size of the data. Two decades ago, a terabyte was considered an uncommonly large volume of data, but such sizes are now common even in a small company's database or file system. For example, Google processes 20 petabytes per day; Walmart handles more than one million transactions per hour, amounting to more than 2.5 petabytes of data; and AT&T maintains a 312-terabyte database containing 1.9 trillion phone call records.
2.1 Parallel Database Systems
A parallel database system is a high-performance database system established using
massively parallel processing or parallel computing environments. It allows multiple
instances to share one physical database so that the shared device, software and data can
be accessed by multiple client instances.
Relational queries are ideally suited to parallel execution. Every relational query can be translated into operations such as scan or sort. From the figure below, we can easily see that each data stream comes from the data source and becomes an input of operator 1, which produces an output that is used as the input of operator 2; eventually, the final output is generated by merging the results of operator 2.
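To make this pipeline concrete, here is a minimal Python sketch of the same idea; the partition contents and operator names are illustrative assumptions, not part of any particular database system. Each partition is scanned and sorted in parallel by an "operator 1", and an "operator 2" merges the sorted streams into one output.

import heapq
from concurrent.futures import ThreadPoolExecutor

partitions = [
    [42, 7, 19],   # data stream from node 1
    [3, 88, 15],   # data stream from node 2
    [27, 1, 64],   # data stream from node 3
]

def scan_and_sort(partition):
    # Operator 1: scan one partition and sort it locally.
    return sorted(partition)

with ThreadPoolExecutor() as pool:
    sorted_streams = list(pool.map(scan_and_sort, partitions))

# Operator 2: merge the already-sorted streams into a single output.
merged = list(heapq.merge(*sorted_streams))
print(merged)  # [1, 3, 7, 15, 19, 27, 42, 64, 88]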
This approach requires a merge-based server that can handle the parallel execution of those operations, and it looks impossible without a high-speed network. Today, most parallel database systems connect their workstations with high-speed LANs, while some companies use high-speed networks together with distributed database technology to build their parallel database systems, which costs more money.
A variety of hardware architectures allow multiple computers to share access to data,
software, or peripheral devices. A parallel database is designed to take advantage of
such architectures by running multiple instances which "share" a single physical
database. In appropriate applications, a parallel server can allow access to a single
database by users on multiple machines, with increased performance.
Tools that support parallel databases:
Speedment is an open-source Stream ORM Java toolkit and runtime that wraps an existing database and its tables into Java 8 streams. We can point the Speedment tool at an existing database and it will generate POJO classes that correspond to the tables we have selected. One distinct feature of Speedment is that it supports parallel database streams and can apply different parallel strategies to further optimize performance.
2.2 MapReduce
Over the past years, Google implemented many computations that process large amounts of data, and found that processing such huge amounts requires the work to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. They designed a new abstraction that lets users express the simple computations they are trying to perform while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library. They realized that most of their computations involved applying a map operation to each record in the input to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that share the same key. MapReduce is a programming model for processing and generating large data sets.
Programming Model:
There are two functions, Map and Reduce. Map takes an input and produces a set of intermediate key/value pairs. Reduce takes the output of the Map functions and merges all values that have the same key to produce a set of key/value pairs. Both functions are written by the user.
Example: counting the occurrences of each word in a document:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
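For illustration, here is a small, runnable Python sketch of the same word-count job. The tiny local run_mapreduce driver and the sample documents are assumptions made for demonstration only; they are not part of any MapReduce library.

from collections import defaultdict

def map_fn(doc_name, contents):
    # Map: emit (word, 1) for every word in the document.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for the same word.
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    # Reduce phase: apply reduce_fn to each key group.
    results = {}
    for k, vs in groups.items():
        for out_k, out_v in reduce_fn(k, vs):
            results[out_k] = out_v
    return results

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog the end")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1, 'end': 1}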
Implementation:
1. The MapReduce library in the user program
• Splits the input files into M pieces, typically 16 to 64 megabytes (MB) per piece
• Starts up many copies of the program on a cluster of machines
• One copy becomes the master; the rest are workers that are assigned work by the master
2. The master
• There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task
3. A worker that is assigned a map task
• Reads the contents of the corresponding input split, parses key/value pairs out of the input data, and passes each pair to the user-defined Map function
• The intermediate key/value pairs produced by the Map function are buffered in memory
4. On local disks
• Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function (a small sketch of this partitioning step follows after this list)
• The locations of these buffered pairs on the local disk are passed back to the master, which is responsible for forwarding these locations to the reduce workers
5. A worker that is assigned a reduce task
• After a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers
• When a reduce worker has read all intermediate data, it sorts the data by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used
• It passes each unique intermediate key with the corresponding set of intermediate values to the user's Reduce function
• The output of the Reduce function is appended to a final output file for this reduce partition
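As a small illustration of the partitioning step mentioned in step 4, here is a hedged Python sketch of the hash-based scheme commonly described for MapReduce (hash(key) mod R); the keys and the value of R are illustrative assumptions.

R = 4  # number of reduce tasks / intermediate regions (illustrative)

def partition(key, R):
    # Assign an intermediate key to one of R reduce regions.
    # Note: Python's string hash is randomized per process, so the
    # actual region numbers vary between runs, but they are stable
    # within a single run, which is what matters here.
    return hash(key) % R

regions = {r: [] for r in range(R)}
for key, value in [("the", 1), ("quick", 1), ("the", 1), ("fox", 1)]:
    regions[partition(key, R)].append((key, value))

# All pairs with the same key land in the same region, so a single
# reduce task sees every value emitted for that key.
print(regions)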
Tools that support MapReduce:
• Hadoop Distributed File System
• HBase
• Hive
• ZooKeeper
• CouchDB
• MongoDB
• Riak
3. Comparative Study:
In 2009, a paper by Andrew Pavlo et al. was published that discussed the difference in performance between MapReduce and parallel databases. This paper is known as the comparison paper of Pavlo et al., in which MapReduce was described as a major step backwards. Here I will address several misconceptions about MapReduce.
Heterogeneous Systems:
MapReduce provides a simple model for analyzing data in heterogeneous systems: it is enough to define simple reader and writer implementations that operate on each storage system, such as a relational database or a file system. In a parallel database, the input must first be copied in, and that is where the issues start, with an inconvenient loading phase and unacceptably slow speed; only after loading is done can the analysis begin.
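To illustrate what such readers might look like, here is a hypothetical Python sketch; the interface and both readers are assumptions made for demonstration and are not part of any MapReduce framework. Each storage system only needs a small adapter that yields key/value records for the map phase.

import csv
import sqlite3

def file_reader(path):
    # Hypothetical reader for a CSV file: yield (row_number, row) records.
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            yield i, row

def table_reader(db_path, table):
    # Hypothetical reader for a relational table: yield (rowid, row) records.
    conn = sqlite3.connect(db_path)
    try:
        for rowid, *cols in conn.execute(f"SELECT rowid, * FROM {table}"):
            yield rowid, tuple(cols)
    finally:
        conn.close()

# Either reader can feed the same map function, so the data can be
# analyzed in place without first loading everything into one system.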
Complex Functions:
Some Map and Reduce functions have simple, straightforward SQL equivalents, but Pavlo et al. pointed out that others are very complicated to express in SQL. Parallel databases try to address this with user-defined functions (UDFs) that can be combined with SQL queries, but UDF support is sometimes buggy or missing in parallel database systems, whereas MapReduce expresses such logic directly in the user's Map and Reduce code.
MapReduce is therefore a better framework for doing more complicated tasks (such as those listed earlier) than the selection and aggregation that are SQL's forte.
Fault Tolerance:
There are two models for transferring data between mappers and reducers: the pull model and the push model. In the pull model, reducers pull the intermediate data from the mappers' local disks, while in the push model, mappers write their data directly to the reducers.
As Pavlo et al. said, the pull model creates many small files and many disk seeks, but MapReduce uses batching, sorting, and grouping of the intermediate data, together with smart scheduling of reads, to mitigate these costs.
MapReduce does not use the push model because of the fault-tolerance property required by Google's developers: with a push model, the failure of a single reducer would force re-execution of all Map tasks.
Fault tolerance becomes more important as data sets grow larger, and data sets are clearly getting much larger over time, so a fault-tolerant system like MapReduce is needed to process these data efficiently.
Performance:
Cost of merging results: Pavlo et al. said that the final phase of MapReduce, where all results are merged into one file, is very expensive. But merging is not necessary when the next consumer of a MapReduce job is another MapReduce job, since the latter can operate directly on the files produced by the first one; and even when the next consumer is not MapReduce, the reducer processes in the initial job can write directly to a merged destination (such as a Bigtable or a parallel database table).
Data loading: Hadoop can analyze data 5 to 50 times faster than the time needed merely to load the same data into a parallel database. It is possible to run 50 separate MapReduce analyses over the data before it is even possible to load the data into the database and complete a single analysis.
4. Conclusion
MapReduce is used successfully by Google because it is a highly effective and efficient tool for large-scale, fault-tolerant data analysis.
The MapReduce model is easy to use, even for programmers without distributed-systems experience, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. Many problems are also easily expressed in MapReduce, such as sorting, data mining, and machine learning.
A MapReduce implementation can scale to large clusters of machines (hundreds to thousands of machines).
MapReduce is very useful for handling data processing and data loading in heterogeneous systems, and it provides a good framework for the execution of more complicated functions.