MAPREDUCE INTRODUCTION
Boris Lublinsky
Doing things by the book
Boris Lublinsky, Kevin T. Smith and Alexey Yakubovich, "Professional Hadoop Solutions"
What is MapReduce?
• MapReduce is a framework for executing highly parallelizable and distributable algorithms across huge datasets using a large number of commodity computers.
• The MapReduce model originates from the map and reduce combinators in functional programming languages such as Lisp. In Lisp, a map takes as input a function and a sequence of values and applies the function to each value in the sequence. A reduce combines all the elements of a sequence using a binary operation.
• MapReduce is based on divide-and-conquer principles: the input data sets are split into independent chunks, which are processed by the mappers in parallel. Additionally, execution of the maps is typically co-located with the data. The framework then sorts the outputs of the maps and uses them as input to the reducers.
MapReduce in a nutshell

• A mapper takes input in the form of key/value pairs (k1, v1) and transforms them into another key/value pair (k2, v2).
• The MapReduce framework sorts the mapper's output key/value pairs, combines each unique key with all its values (k2, {v2, v2, …}), and delivers these key/value combinations to the reducers.
• A reducer takes the (k2, {v2, v2, …}) key/value combinations and translates them into yet another key/value pair (k3, v3).
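The canonical illustration of this key/value flow is word counting. The following is a minimal sketch of a Hadoop (new API) word count, assuming plain text input split into lines: the mapper emits (word, 1) pairs, and the reducer sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // (k1, v1) = (line offset, line text) -> (k2, v2) = (word, 1)
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // (k2, {v2, v2, ...}) = (word, {1, 1, ...}) -> (k3, v3) = (word, total count)
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}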
Hadoop execution
MapReduce framework features
• Scheduling ensures that multiple tasks from multiple jobs are executed on the cluster. Another aspect of scheduling is speculative execution, an optimization implemented by MapReduce. If the job tracker notices that one of the tasks is taking too long to execute, it can start an additional instance of the same task (using a different task tracker).
• Synchronization. MapReduce execution provides synchronization between the map and reduce phases of processing (the reduce phase cannot start until all of the maps' key/value pairs are emitted).
• Error and fault handling. In order to accomplish job execution in an environment where errors and faults are the norm, the job tracker will attempt to restart failed task executions.
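Speculative execution can be tuned per job. The sketch below shows one way to turn it off for a job whose tasks have side effects; the property names are the Hadoop 2.x ones (older releases use mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution).

import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {

  // Sketch: disable speculative execution for a job whose tasks have side effects
  // (for example, tasks that write to an external system and must not run twice in parallel).
  public static void disableSpeculation(Job job) {
    job.getConfiguration().setBoolean("mapreduce.map.speculative", false);
    job.getConfiguration().setBoolean("mapreduce.reduce.speculative", false);
  }
}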
MapReduce execution
MapReduce design
• How to break up a large problem into smaller tasks? More specifically, how to decompose the problem so that the smaller tasks can be executed in parallel?
• What are the key/value pairs that can be viewed as the inputs/outputs of every task?
• How to bring together all the data required for calculation? More specifically, how to organize processing in such a way that all the data necessary for a calculation is in memory at the same time?
Examples of MapReduce design
• MapReduce for parallel processing (face recognition example)
• Simple data processing with MapReduce (inverted indexes example)
• Building joins using MapReduce (road enrichment example)
• Building bucketing joins using MapReduce (link elevation example)
• Iterative MapReduce applications (stranding bivalent links example)
Face Recognition Example
To run face recognition on millions of pictures stored in Hadoop in the form of a sequence file, a simple map-only job can be used to parallelize execution.
Process Phase | Description
Mapper | In this job, the mapper is first initialized with the set of recognizable features. For every image, the map function invokes the face recognition algorithm's implementation, passing it the image itself along with the list of recognizable features. The result of the recognition, along with the original image ID, is output from the map.
Result | The result of this job execution is the recognition results for all of the original images.
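A minimal sketch of how such a map-only job could be wired up follows. The FaceRecognizer class, the feature-set job parameter, and the key/value types are hypothetical placeholders, not the book's code; the essential point is setNumReduceTasks(0), which makes the mapper output the final job output.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FaceRecognitionJob {

  // Input: sequence file of (imageId, imageBytes). Output: (imageId, recognition result).
  public static class RecognitionMapper extends Mapper<Text, BytesWritable, Text, Text> {
    private FaceRecognizer recognizer;  // hypothetical face recognition implementation

    @Override
    protected void setup(Context context) {
      // Initialize once per mapper with the set of recognizable features.
      recognizer = new FaceRecognizer(context.getConfiguration().get("recognizable.features"));
    }

    @Override
    protected void map(Text imageId, BytesWritable image, Context context)
        throws IOException, InterruptedException {
      String result = recognizer.recognize(image.getBytes());
      context.write(imageId, new Text(result));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("recognizable.features", args[2]);   // hypothetical: features passed as a job parameter
    Job job = Job.getInstance(conf, "face recognition");
    job.setJarByClass(FaceRecognitionJob.class);
    job.setMapperClass(RecognitionMapper.class);
    job.setNumReduceTasks(0);                     // map-only job: mapper output is the final output
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}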
Building Joins using MapReduce
• The generic join problem can be described as follows. Given multiple data sets (S1 through Sn) sharing the same key (the join key), build records containing the key and all required data from every record.
• There are two "standard" implementations for joining data in MapReduce: the reduce-side join and the map-side join.
• The most common implementation of a join is the reduce-side join. In this case, all datasets are processed by a mapper that emits the join key as the intermediate key and a value in the form of an intermediate record capable of holding either of the sets' values. Shuffle and sort will group all intermediate records by the join key, so a reducer sees every record for a given join key and can merge them. A sketch of such a reducer follows this list.
• A special case of a reduce-side join is called a "bucketing" join. In this case, the datasets might not have common keys, but they support the notion of proximity, for example geographic proximity. A proximity value can be used as a join key for the data sets.
• In the cases when one of the data sets is small enough to fit in memory ("baby" joins), a map-side in-memory join can be used. In this case the small data set is brought into memory in the form of a key-based hash map providing fast lookups. It is then possible to iterate over the larger data set, do all the joins in memory, and output the resulting records.
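The following is a minimal sketch of a reduce-side join, under the assumption that the intermediate record is simply a Text value tagged with the name of its source data set (a real implementation would typically use a custom Writable for this):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce-side join sketch: mappers emit (joinKey, "<sourceName>\t<record>"),
// shuffle/sort groups all tagged records by joinKey, and the reducer merges them.
public class JoinReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text joinKey, Iterable<Text> taggedRecords, Context context)
      throws IOException, InterruptedException {
    List<String> left = new ArrayList<>();   // records from data set S1
    List<String> right = new ArrayList<>();  // records from data set S2

    for (Text tagged : taggedRecords) {
      String[] parts = tagged.toString().split("\t", 2);
      if ("S1".equals(parts[0])) {
        left.add(parts[1]);
      } else {
        right.add(parts[1]);
      }
    }

    // Emit the cross product of matching records for this join key.
    for (String l : left) {
      for (String r : right) {
        context.write(joinKey, new Text(l + "\t" + r));
      }
    }
  }
}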
Inverted indexes

• An inverted index is a data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents.
• To build an inverted index, it is possible to feed the mapper each document (or the lines within a document). The mapper will parse the words in the document and emit [word, descriptor] pairs. The reducer can simply be an identity function that just writes out the list, or it can perform some statistical aggregation per word.
Calculation of Inverted Indexes Job
Process Phase | Description
Mapper | In this job, the role of the mapper is to build a unique record containing a word index and information describing this word's occurrence in the document. It reads every input document, parses it, and for every unique word in the document builds an index descriptor. This descriptor contains the document ID, the number of times the index occurs in the document, and any additional information, for example the index positions as offsets from the beginning of the document. Every (index, descriptor) pair is written out.
Shuffle and Sort | MapReduce's shuffle and sort will sort all the records based on the index value, which guarantees that a reducer will get all the indexes with a given key together.
Reducer | The role of the reducer in this job is to build the inverted index structure(s). Depending on the system requirements there can be one or more reducers. The reducer gets all the descriptors for a given index and builds an index record, which is written to the desired index storage.
Result | The result of this job execution is an inverted index (or indexes) for a set of the original documents.
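A minimal sketch of this job, assuming the descriptor is simply "<docId>:<count>" encoded as text, the document ID is taken from the input file name, and the desired index storage is the job's output directory:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

  // Emits (word, "<docId>:<count>") for every unique word in the input line.
  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumption: the document ID is carried in the input file name.
      String docId = ((FileSplit) context.getInputSplit()).getPath().getName();

      Map<String, Integer> counts = new HashMap<>();
      StringTokenizer itr = new StringTokenizer(line.toString());
      while (itr.hasMoreTokens()) {
        counts.merge(itr.nextToken().toLowerCase(), 1, Integer::sum);
      }
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        context.write(new Text(e.getKey()), new Text(docId + ":" + e.getValue()));
      }
    }
  }

  // Concatenates all descriptors for a word into a single index record.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> descriptors, Context context)
        throws IOException, InterruptedException {
      StringBuilder postings = new StringBuilder();
      for (Text d : descriptors) {
        if (postings.length() > 0) postings.append(", ");
        postings.append(d.toString());
      }
      context.write(word, new Text(postings.toString()));
    }
  }
}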
Road Enrichment

• Find all links connected to a given node; for example, node N1 has links L1, L2, L3 and L4, and node N2 has links L4, L5 and L6.
• Based on the number of lanes for every link at the node, calculate the road width at the intersection.
• Based on the road width, calculate the intersection geometry.
• Based on the intersection geometry, move the road's end point to tie it to the intersection geometry.
Calculation of Intersection Geometry job
Process Phase | Description
Mapper | In this job, the role of the mapper is to build LNNi records out of the source records, NNi and LLi. It reads every input record from both source files and then looks at the object type. If it is a node with the key NNi, a new record of the type LNNi is written to the output. If it is a link, the keys of both adjacent nodes (NNi and NNj) are extracted from the link, and two records, LNNi and LNNj, are written out.
Shuffle and Sort | MapReduce's shuffle and sort will sort all the records based on the nodes' keys, which guarantees that every node with all of its adjacent links will be processed by a single reducer, and that a reducer will get all the LN records for a given key together.
Reducer | The role of the reducer in this job is to calculate the intersection geometry and move the roads' ends. It gets all LN records for a given node key and stores them in memory. Once all of them are in memory, the intersection's geometry can be calculated and written out as an intersection record with node key SNi. At the same time, all the link records connected to a given node can be converted to road records and written out with link key RLi.
Result | The result of this job execution is a set of intersection records ready to use, and road records that have to be merged together (every road is connected to two intersections; dead ends are typically modeled with a node that has a single link connected to it). It is necessary to implement a second MapReduce job to merge them.
Merge Roads Job
Process Phase | Description
Mapper | The role of the mapper in this job is very simple. It is just a pass-through (or identity mapper in Hadoop's terminology). It reads road records and writes them directly to the output.
Shuffle and Sort | MapReduce's shuffle and sort will sort all the records based on the links' keys, which guarantees that both road records with the same key will come to the same reducer together.
Reducer | The role of the reducer in this job is to merge roads with the same ID. Once the reducer reads both records for a given road, it merges them together and writes them out as a single road record.
Result | The result of this job execution is a set of road records ready to use.
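For the pass-through step, the base Mapper class itself can be used, since its default map() simply forwards every key/value pair it reads. A minimal driver sketch under that assumption follows; the reducer class and the key/value types are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeRoadsDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merge roads");
    job.setJarByClass(MergeRoadsDriver.class);
    // The default Mapper implementation is an identity mapper: it writes every
    // (key, value) it reads straight to the output.
    job.setMapperClass(Mapper.class);
    job.setReducerClass(MergeRoadsReducer.class);  // hypothetical reducer merging the two road halves
    job.setOutputKeyClass(Text.class);             // assumption: link key as text
    job.setOutputValueClass(Text.class);           // assumption: road record as text
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}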
Link Elevation
Given a links graph and a terrain model, convert two-dimensional (x, y) links into three-dimensional (x, y, z) links:
• Split every link into fixed-length (for example, 10 meter) fragments.
• For every piece, calculate the heights, from the terrain model, for both the start and end point of each link.
• Combine the pieces together into the original links.
Link Elevation in MapReduce
Process Phase | Description
Mapper | In this job, the role of the mapper is to build link pieces and assign them to individual tiles. It reads every link record and splits it into fixed-length fragments, which effectively converts the link representation into a set of points. For every point it calculates the tile this point belongs to, and produces one or more link piece records of the form (tid, {lid, [point]}), where tid is the tile ID, lid is the original link ID, and the points array is an array of points from the original link that belong to the given tile.
Shuffle and Sort | MapReduce's shuffle and sort will sort all the records based on the tile IDs, which guarantees that all of the link pieces belonging to the same tile will be processed by a single reducer.
Reducer | The role of the reducer in this job is to calculate the elevation for every link piece. For every tile ID it loads the terrain information for the given tile from HBase, then reads all of the link pieces and calculates the elevation for every point. It outputs the record (lid, [point]), where lid is the original link ID and the points array is an array of three-dimensional points.
Result | The result of this job execution will contain fully elevated link pieces. It is necessary to implement a second MapReduce job to merge them.
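The tiling step itself is simple arithmetic. A minimal sketch of the point-to-tile assignment the mapper performs, assuming square tiles of a fixed size and a tile ID composed of the tile's column and row (both are illustrative assumptions):

// Sketch of the mapper-side tiling logic, assuming square tiles of fixed size.
// The tile size and the tile ID encoding are illustrative, not the book's code.
public final class Tiling {
  private static final double TILE_SIZE_METERS = 1000.0;

  // A two-dimensional point on a link.
  public static final class Point {
    final double x;
    final double y;
    Point(double x, double y) { this.x = x; this.y = y; }
  }

  // Computes the tile ID ("column_row") that a point falls into.
  public static String tileId(Point p) {
    long col = (long) Math.floor(p.x / TILE_SIZE_METERS);
    long row = (long) Math.floor(p.y / TILE_SIZE_METERS);
    return col + "_" + row;
  }
}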
Stranding bivalent links

• Two connected links are called bivalent if they are connected via a bivalent node. A bivalent node is a node that has only two connections. For example, nodes N6, N7, N8 and N9 are bivalent; links L5, L6, L7, L8 and L9 are also bivalent. A degenerate case of a bivalent link is link L4.
• An algorithm for calculating stretches of bivalent links looks very straightforward: any stretch of links between two non-bivalent nodes is a stretch of bivalent links.
• Build the set of partial bivalent strands.
• Loop:
  • Combine partial strands.
  • If no partial strands were combined, exit the loop.
• End of loop.
A driver for this iterative process is sketched below.
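A minimal sketch of such an iterative driver, assuming the strand-merging job (the "Merging Partial Strands" job described later) reports how many partial strands still need processing through a job counter in group "strands" named "toBeProcessed":

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// Iterative MapReduce driver sketch: re-run the merge job until a run reports
// that no partial strands are left to process. The job wiring (mapper, reducer,
// input/output formats) is stubbed out in buildMergeJob.
public class StrandingDriver {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    int iteration = 0;

    while (true) {
      Path output = new Path(args[1] + "/iteration-" + iteration);
      Job job = buildMergeJob(conf, input, output);
      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }
      long toBeProcessed =
          job.getCounters().findCounter("strands", "toBeProcessed").getValue();
      if (toBeProcessed == 0) {
        break;                 // no partial strands were combined: the loop terminates
      }
      input = output;          // feed the remaining partial strands into the next iteration
      iteration++;
    }
  }

  private static Job buildMergeJob(Configuration conf, Path in, Path out) throws Exception {
    Job job = Job.getInstance(conf, "merge partial strands, input " + in);
    // ... set mapper, reducer, input/output formats and paths here ...
    return job;
  }
}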
Elimination of Non-Bivalent Nodes Job
Process Phase | Description
Mapper | In this job, the role of the mapper is to build LNNi records out of the source records, NNi and LLi. It reads every input record and then looks at the object type. If it is a node with the key Ni, a new record of the type LNNi is written to the output. If it is a link, the keys of both adjacent nodes (Ni and Nj) are extracted from the link, and two records, LNNi and LNNj, are written out.
Shuffle and Sort | MapReduce's shuffle and sort will sort all the records based on the nodes' keys, which guarantees that every node with all of its adjacent links will be processed by a single reducer, and that a reducer will get all the LN records for a given key together.
Reducer | The role of the reducer in this job is to eliminate non-bivalent nodes and build partial link strands. It reads all LN records for a given node key and stores them in memory. If the number of links for this node is 2, then this is a bivalent node, and a new strand combining these two links is written to the output (see, for example, pairs L5, L6 or L7, L8). If the number of links is not equal to 2 (it can be either a dead end or a node at which multiple links intersect), then this is a non-bivalent node. For this type of node, a special strand containing a single link connected to this non-bivalent node (see, for example, L4 or L5) is created. The number of strands for such a node is equal to the number of links connected to this node.
Result | The result of this job execution contains partial strand records, some of which can be repeated (L4, for example, will appear twice, from processing N4 and N5).
Merging Partial Strands Job
Process Phase | Description
Mapper | The role of the mapper in this job is to bring together strands sharing the same links. For every strand it reads, it outputs several key/value pairs. The keys are the keys of the links in the strand, and the values are the strands themselves.
Shuffle and Sort | MapReduce's shuffle and sort will sort all the records based on the end link keys, which guarantees that all strand records containing the same link ID will come to the same reducer together.
Reducer | The role of the reducer in this job is to merge strands sharing the same link ID. For every link ID, all strands are loaded into memory and processed based on the strand types. If there are two strands containing the same links, the produced strand is complete and can be written directly to the final result directory. Otherwise the resulting strand, a strand containing all unique links, is created and written to the output for further processing. In this case a "to be processed" counter is incremented.
Result | The result of this job execution contains all complete strands in a separate directory. It also contains a directory of all bivalent partial strands along with the value of the "to be processed" counter.
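The "to be processed" counter maps naturally onto a Hadoop job counter, which the iterative driver sketched earlier can read after the job completes. A minimal illustration of the reducer-side increment follows; the counter group/name match that driver, the merge and completeness checks are hypothetical placeholders, and a real job would route complete and partial strands to separate output directories (for example with MultipleOutputs), which is omitted here.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Fragment of the merge-strands reducer showing only the counter bookkeeping.
// Strand parsing, merging, and per-directory routing are omitted.
public class MergeStrandsReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text linkId, Iterable<Text> strands, Context context)
      throws IOException, InterruptedException {
    Text merged = mergeStrands(strands);       // hypothetical merge of all strands for this link
    boolean complete = isComplete(merged);     // hypothetical completeness check

    context.write(linkId, merged);
    if (!complete) {
      // Partial strand: it will need another iteration, so bump the job counter
      // that the driver checks after waitForCompletion().
      context.getCounter("strands", "toBeProcessed").increment(1);
    }
  }

  private Text mergeStrands(Iterable<Text> strands) { return new Text(); }  // placeholder
  private boolean isComplete(Text strand) { return false; }                 // placeholder
}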
When to MapReduce?
MapReduce is a technique aimed at solving a relatively simple problem: there is a lot of data, and it has to be processed in parallel, preferably on multiple machines. Alternatively, it can be used for parallelizing compute-intensive calculations, where it is not necessarily the amount of data that matters, but rather the overall calculation time. The following has to be true in order for MapReduce to be applicable:
• The calculations that need to be run have to be composable. This means it should be possible to run the calculation on a subset of the data and merge the partial results.
• The dataset size is big enough (or the calculations are long enough) that the infrastructure overhead of splitting it up for independent computations and merging the results will not hurt overall performance.
• The calculation depends mostly on the dataset being processed; additional small data sets can be added using HBase, the distributed cache, or some other technique.
When not to MapReduce?
• MapReduce is not applicable in scenarios where the dataset has to be accessed randomly to perform the operation. MapReduce calculations need to be composable: it should be possible to run the calculation on a subset of the data and merge the partial results.
• MapReduce is not applicable to general recursive problems, because computation of the current value requires knowledge of the previous one, which means that you cannot break it apart into sub-computations that can be run independently.
• If the data is small enough to fit in a single machine's memory, it is probably going to be faster to process it as a standalone application. Usage of MapReduce in this case makes the implementation unnecessarily complex and typically slower.
• MapReduce is an inherently batch implementation and as such is not applicable to online computations.
General design guidelines - 1
When splitting data between map tasks, make sure that you do not create too many or too few of them. Having the right number of maps matters for the following reasons:
• Having too many mappers creates scheduling and infrastructure overhead and, in extreme cases, can even kill the job tracker. Also, having too many mappers typically increases overall resource utilization (due to excessive JVM creation) and execution time (due to the limited number of execution slots).
• Having too few mappers can lead to underutilization of the cluster and the creation of too much load on some of the nodes (where the maps actually run). Additionally, in the case of very large map tasks, retries and speculative execution become very expensive and take longer.
• A large number of small mappers creates a lot of seeks during the shuffle of the map output to the reducers. It also creates too many connections delivering map outputs to the reducers.
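The number of maps is normally driven by the input split size. A minimal sketch of how it can be influenced for FileInputFormat-based jobs follows; the split sizes and the input path are illustrative, not recommendations.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizing {

  // Raising the minimum split size produces fewer, larger map tasks;
  // lowering the maximum split size produces more, smaller ones.
  public static void configureSplits(Job job) throws IOException {
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);  // illustrative: 256 MB minimum
    FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);  // illustrative: 512 MB maximum
    FileInputFormat.addInputPath(job, new Path("/data/input"));     // hypothetical input path
  }
}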
General design guidelines - 2
• The number of reduces configured for the application is another crucial factor. Having too many or too few reduces can be counterproductive:
  • Having too many reducers, in addition to the scheduling and infrastructure overhead, results in the creation of too many output files (remember, each reducer creates its own output file), which has a negative impact on the name node. It also makes it more complicated for other MapReduce jobs to leverage the results of this one.
  • Having too few reducers has the same negative effects as having too few mappers: underutilization of the cluster and very expensive retries.
• Use job counters appropriately.
  • Counters are appropriate for tracking a few, important, global bits of information. They are definitely not meant to aggregate very fine-grained statistics of applications.
  • Counters are fairly expensive, since the JobTracker has to maintain every counter of every map/reduce task for the entire duration of the application.
A short snippet on setting the reducer count follows.
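Unlike the number of maps, the number of reduces is set directly on the job. A minimal sketch (the value 16 is purely illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCount {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reducer count example");
    // Each reducer produces one output file, so this also fixes the number of output files.
    job.setNumReduceTasks(16);  // illustrative value; tune to cluster size and data volume
  }
}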
General design guidelines - 3
• Do not create applications that write out multiple small output files from each reduce.
  • Consider compressing the application's output with an appropriate compressor (trading compression speed against efficiency) to improve write performance.
  • Use an appropriate file format for the output of the reduces. Using SequenceFiles is often the best option, since they are both compressible and splittable.
  • Consider using a larger output block size when the individual input/output files are large (multiple GBs).
• Try to avoid creating new object instances inside the map and reduce methods. These methods are executed many times inside the loop, which means object creation and disposal will increase execution time and create extra work for the garbage collector. A better approach is to create the majority of your intermediate objects in the corresponding setup method and then repopulate them inside the map/reduce methods, as the sketch below shows.
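A minimal sketch of that object reuse pattern, using the word count mapper from earlier:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReusingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  // Created once per mapper instance and repopulated on every map() call,
  // instead of allocating new Text/IntWritable objects per record.
  private Text word;
  private IntWritable one;

  @Override
  protected void setup(Context context) {
    word = new Text();
    one = new IntWritable(1);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());   // repopulate the existing object
      context.write(word, one);    // the framework serializes the pair immediately, so reuse is safe
    }
  }
}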
General design guidelines - 4
• Do not write directly to user-defined files from either the mapper or the reducer. The current implementation of the file writer in Hadoop is single-threaded, which means that execution of multiple mappers/reducers trying to write to such a file will be serialized.
• Do not create MapReduce implementations that scan an HBase table in order to create a new HBase table (or write to the same table). The TableInputFormat used for HBase MapReduce implementations is based on a table scan, which is time sensitive. HBase writes, on the other hand, can cause significant write delays due to HBase table splits. As a result, region servers can hang up, or some data can even be lost. A better solution is to split such a job into two: one scanning a table and writing intermediate results into HDFS, and another reading from HDFS and writing to HBase.
• Do not try to re-implement existing Hadoop classes; extend them and explicitly specify your implementation in the configuration. Unlike application servers, the Hadoop command specifies user classes last, which means that existing Hadoop classes always take precedence.
Thank you

Any Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataHaluan Irsad
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsLeila panahi
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Real time big data stream processing
Real time big data stream processing Real time big data stream processing
Real time big data stream processing Luay AL-Assadi
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
MapReduce : Simplified Data Processing on Large Clusters
MapReduce : Simplified Data Processing on Large ClustersMapReduce : Simplified Data Processing on Large Clusters
MapReduce : Simplified Data Processing on Large ClustersAbolfazl Asudeh
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDatabricks
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsSkillspeed
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...Edureka!
 
Predicting house prices_Regression
Predicting house prices_RegressionPredicting house prices_Regression
Predicting house prices_RegressionSruti Jain
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 

Was ist angesagt? (20)

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Real time big data stream processing
Real time big data stream processing Real time big data stream processing
Real time big data stream processing
 
Unit-3_BDA.ppt
Unit-3_BDA.pptUnit-3_BDA.ppt
Unit-3_BDA.ppt
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
MapReduce : Simplified Data Processing on Large Clusters
MapReduce : Simplified Data Processing on Large ClustersMapReduce : Simplified Data Processing on Large Clusters
MapReduce : Simplified Data Processing on Large Clusters
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...
 
Predicting house prices_Regression
Predicting house prices_RegressionPredicting house prices_Regression
Predicting house prices_Regression
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 

Andere mochten auch

Introduction to MapReduce & hadoop
Introduction to MapReduce & hadoopIntroduction to MapReduce & hadoop
Introduction to MapReduce & hadoopColin Su
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyJay Nagar
 
Stock Analyzer Hadoop MapReduce Implementation
Stock Analyzer Hadoop MapReduce ImplementationStock Analyzer Hadoop MapReduce Implementation
Stock Analyzer Hadoop MapReduce ImplementationMaruthi Nataraj K
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduceShrihari Rathod
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceUwe Printz
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduceFARUK BERKSÖZ
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word countJeff Patti
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduceRyan Tabora
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsAnju Singh
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyonddatasalt
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Hadoop MapReduce joins
Hadoop MapReduce joinsHadoop MapReduce joins
Hadoop MapReduce joinsShalish VJ
 

Andere mochten auch (20)

Introduction to MapReduce & hadoop
Introduction to MapReduce & hadoopIntroduction to MapReduce & hadoop
Introduction to MapReduce & hadoop
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data Technology
 
Stock Analyzer Hadoop MapReduce Implementation
Stock Analyzer Hadoop MapReduce ImplementationStock Analyzer Hadoop MapReduce Implementation
Stock Analyzer Hadoop MapReduce Implementation
 
Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduce
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduce
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
MapReduce in Simple Terms
MapReduce in Simple TermsMapReduce in Simple Terms
MapReduce in Simple Terms
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Hadoop MapReduce joins
Hadoop MapReduce joinsHadoop MapReduce joins
Hadoop MapReduce joins
 

Ähnlich wie Introduction to MapReduce

Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationateeq ateeq
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)anh tuan
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigKhanKhaja1
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfBikalAdhikari4
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
Map reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingMap reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingcoolmirza143
 
Map reduce
Map reduceMap reduce
Map reducexydii
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyKyong-Ha Lee
 
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceComputing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceLeonidas Akritidis
 

Ähnlich wie Introduction to MapReduce (20)

MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptx
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
MapReduce-Notes.pdf
MapReduce-Notes.pdfMapReduce-Notes.pdf
MapReduce-Notes.pdf
 
Map reduce
Map reduceMap reduce
Map reduce
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)
 
Lecture 1 mapreduce
Lecture 1  mapreduceLecture 1  mapreduce
Lecture 1 mapreduce
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pig
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdf
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
Map reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingMap reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreading
 
Map reduce
Map reduceMap reduce
Map reduce
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
2 mapreduce-model-principles
2 mapreduce-model-principles2 mapreduce-model-principles
2 mapreduce-model-principles
 
Unit3 MapReduce
Unit3 MapReduceUnit3 MapReduce
Unit3 MapReduce
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A Survey
 
Main map reduce
Main map reduceMain map reduce
Main map reduce
 
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceComputing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
 

Mehr von Chicago Hadoop Users Group

Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Chicago Hadoop Users Group
 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChicago Hadoop Users Group
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieChicago Hadoop Users Group
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopChicago Hadoop Users Group
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917Chicago Hadoop Users Group
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Chicago Hadoop Users Group
 

Mehr von Chicago Hadoop Users Group (17)

Kinetica master chug_9.12
Kinetica master chug_9.12Kinetica master chug_9.12
Kinetica master chug_9.12
 
Chug dl presentation
Chug dl presentationChug dl presentation
Chug dl presentation
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
 
Meet Spark
Meet SparkMeet Spark
Meet Spark
 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your Business
 
An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
 
Hadoop and Big Data Security
Hadoop and Big Data SecurityHadoop and Big Data Security
Hadoop and Big Data Security
 
Advanced Oozie
Advanced OozieAdvanced Oozie
Advanced Oozie
 
Scalding for Hadoop
Scalding for HadoopScalding for Hadoop
Scalding for Hadoop
 
Financial Data Analytics with Hadoop
Financial Data Analytics with HadoopFinancial Data Analytics with Hadoop
Financial Data Analytics with Hadoop
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about Oozie
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917
 
Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416
 

Kürzlich hochgeladen

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Kürzlich hochgeladen (20)

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

Introduction to MapReduce

  • 2. Doing things by the book Boris Lublinsky, Kevin T. Smith and Alexey Yakubovich. ―Professional Hadoop Solutions‖
  • 3. What is MapReduce? • MapReduce is a framework for executing highly parallelizable and distributable algorithms across huge datasets using a large number of commodity computers. • MapReduce model originates from the map and reduce combinators concept in functional programming languages, for example, Lisp. In Lisp, a map takes as input a function and a sequence of values. It then applies the function to each value in the sequence. A reduce combines all the elements of a sequence using a binary operation. • MapReduce is based on divide-and-conquer principles—the input data sets are split into independent chunks, which are processed by the mappers in parallel. Additionally execution of the maps is typically co-located with the data. The framework then sorts the outputs of the maps, and uses them as an input to the reducers.
  • 4. MapReduce in a nutshell • A mapper takes input in a form of key/value pairs (k1, v1) and transforms them into another key/value pair (k2, v2). • MapReduce framework sorts mapper’s output key/value pairs and combines each unique key with all its values (k2, {v2, v2,…}) and deliver these key/value combinations to reducers • Reducer takes (k2, {v2, v2,…}) key/value combinations and translates them into yet another key/value pair (k3, v3).
  • 6. MapReduce framework features • Scheduling ensures that multiple tasks from multiple jobs are executed on the cluster. Another aspect of scheduling is speculative execution, which is an optimization that is implemented by MapReduce. If the job tracker notices that one of the tasks is taking too long to execute, it can start additional instance of the same task (using different task tracker). • Synchronization. MapReduce execution provides synchronization between the map and reduce phases of processing (reduce phase cannot start until all map’ key/value pairs are emitted). • Error and fault handling. In order to accomplish job execution in the environment where errors and faults are the norm job tracker will attempt to restart failed tasks execution
  • 8. MapReduce design • How to break up a large problem into smaller tasks? More specifically, how to decompose the problem so that the smaller tasks can be executed in parallel? • What are key/value pairs, that can be viewed as inputs/outputs of every task • How to bring together all data required for calculation? More specifically how to organize processing the way that all the data necessary for calculation is in memory at the same time?
  • 9. Examples of MapReduce design • MapReduce for parallel processing (face recognition • • • • example) Simple data processing with MapReduce(inverted indexes example) Building joins using MapReduce(Road Enrichment Example) Building bucketing joins using MapReduce (Link elevation example) Iterative MapReduce applications (Stranding Bivalent Links Example)
  • 10. Face Recognition Example To run face recognition on millions of pictures stored in Hadoop in a form of sequence file a simple Map only job can be used to parallelize execution. Process Phase Description Mapper In this job, mapper is first initialized with the set of recognizable features. For every image a map function invokes face recognition algorithm’s implementation, passing it an image itself along with a list of recognizable features. The result of recognition along with the original imageID is outputted from the map. A result of this job execution is results of recognitions of all the images, contained in the original images. Result
  • 11. Building Joins using MapReduce • Generic join problem can be described as follows. Given multiple data sets (S1 • • • • through Sn), sharing the same key (join key) build records, containing a key and all required data from every record. There are two ―standard‖ implementations for joining data in MapReduce – reduce-side join and map-side join. A most common implementation of join is a reduce side join. In this case, all datasets are processed by the mapper that emits a join key as the intermediate key, and the value as the intermediate record capable of holding either of the set’s values. Shuffle and sort will group all intermediate records by the join key. A special case of a reducer side join is called a ―Bucketing‖ join. In this case, the datasets might not have common keys, but support the notion of proximity, for example geographic proximity. A proximity value can be used as a join key for the data sets. In the cases when one of the data sets is small enough (―baby‖ joins) to fit in memory – a map in memory side join can be used. In this case a small data set is brought in memory in a form of key-based hash providing fast lookups. Now it is possible iterate over the larger data set, do all the joins in memory and output the resulting records.
  • 12. Inverted indexes • An inverted index is a data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents • To build an inverted index, it is possible to feed the mapper each document (or lines within a document). The Mapper will parse the words in the document to emit [word, descriptor] pairs. The reducer can simply be an identity function that just writes out the list, or it can perform some statistic aggregation per word.
  • 13. Calculation of Inverted Indexes Job Process Phase Mapper Description In this job, the role of the mapper is to build a unique record containing a word-index and information describing this word occurrence in the document. It reads every input document, parses it and for every unique word in the document builds an index descriptor. This descriptor contains a document ID, number of times index occurs in the document and any additional information, for example index positions as offset from the beginning of the document. Every index descriptor pair is written out. Shuffle and Sort MapReduce’ shuffle and sort will sort all the records based on the index value, which will guarantee that a reducer will get all the indexes with a given key together. Reducer The role of reducer, in this job, is to build inverted indexes structure(s). Depending on the system requirements there can be one or more reducers. Reducer gets all the descriptors for a given index and builds an index record, which is written to the desired index storage. Result A result of this job execution is an inverted index (s) for a set of the original documents.
• 14. Road Enrichment
• Find all links connected to a given node; for example, node N1 has links L1, L2, L3 and L4, and node N2 has L4, L5 and L6.
• Based on the number of lanes for every link at the node, calculate the road width at the intersection.
• Based on the road width, calculate the intersection geometry.
• Based on the intersection geometry, move each road's end point to tie it to the intersection geometry.
• 15. Calculation of Intersection Geometry Job
Mapper: In this job, the role of the mapper is to build LNNi records out of the source records (NNi and LLi). It reads every input record from both source files and looks at the object type. If it is a node with the key NNi, a new record of the type LNNi is written to the output. If it is a link, the keys of both adjacent nodes (NNi and NNj) are extracted from the link and two records, LNNi and LNNj, are written out.
Shuffle and Sort: MapReduce's shuffle and sort will sort all the records based on the node keys, which guarantees that every node with all its adjacent links will be processed by a single reducer and that a reducer will get all the LN records for a given key together.
Reducer: The role of the reducer in this job is to calculate the intersection geometry and move the roads' ends. It gets all LN records for a given node key and stores them in memory. Once all of them are in memory, the intersection's geometry can be calculated and written out as an intersection record with node key SNi. At the same time, all the link records connected to the given node can be converted to road records and written out with link key RLi.
Result: The result of this job execution is a set of intersection records ready to use, plus road records that still have to be merged together (every road is connected to two intersections; dead ends are typically modeled with a node that has a single link connected to it). A second MapReduce job is necessary to merge them.
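A minimal sketch of the mapper's re-keying pattern is shown below. It assumes (purely for illustration) that each input record is a text line whose first field is the object type ("N" for node, "L" for link) and, for links, whose second and third fields are the adjacent node IDs; a node is emitted under its own key, while a link is emitted under both adjacent node keys so that each reducer sees a node together with every link connected to it.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IntersectionGeometryMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] fields = record.toString().split(",");
        if ("N".equals(fields[0])) {
            // Node record: keep its own key (fields[1] = node ID)
            context.write(new Text(fields[1]), record);
        } else if ("L".equals(fields[0])) {
            // Link record: emit once per adjacent node (fields[1], fields[2] = node IDs)
            context.write(new Text(fields[1]), record);
            context.write(new Text(fields[2]), record);
        }
    }
}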
• 16. Merge Roads Job
Mapper: The role of the mapper in this job is very simple: it is just a pass-through (an identity mapper in Hadoop's terminology). It reads road records and writes them directly to the output.
Shuffle and Sort: MapReduce's shuffle and sort will sort all the records based on the link keys, which guarantees that both road records with the same key will come to the same reducer together.
Reducer: The role of the reducer in this job is to merge roads with the same ID. Once the reducer has read both records for a given road, it merges them together and writes them out as a single road record.
Result: The result of this job execution is a set of road records ready to use.
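A minimal sketch of the merging reducer is shown below. It assumes (for illustration only) that a road record is a Text value holding a comma-separated list of point coordinates and that merging the two halves simply means concatenating their point lists; a real implementation would also reconcile point ordering and road attributes.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MergeRoadsReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text linkKey, Iterable<Text> roadHalves, Context context)
            throws IOException, InterruptedException {
        List<String> parts = new ArrayList<>();
        for (Text half : roadHalves) {
            parts.add(half.toString());   // copy the value, Hadoop reuses the Text object
        }
        // Merge both halves of the road into a single record
        context.write(linkKey, new Text(String.join(",", parts)));
    }
}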
• 17. Link Elevation
Given a links graph and a terrain model, convert two-dimensional (x, y) links into three-dimensional (x, y, z) links:
• Split every link into fixed-length (for example, 10 meter) fragments.
• For every fragment, calculate heights from the terrain model for both its start and end points.
• Combine the fragments back together into the original links.
• 18. Link Elevation in MapReduce
Mapper: In this job, the role of the mapper is to build link pieces and assign them to individual tiles. It reads every link record and splits it into fixed-length fragments, which effectively converts the link representation into a set of points. For every point it calculates the tile this point belongs to and produces one or more link piece records of the form (tid, {lid, [points]}), where tid is the tile ID, lid is the original link ID, and the points array contains the points from the original link that belong to the given tile.
Shuffle and Sort: MapReduce's shuffle and sort will sort all the records based on the tile IDs, which guarantees that all of the link pieces belonging to the same tile will be processed by a single reducer.
Reducer: The role of the reducer in this job is to calculate the elevation for every link piece. For every tile ID it loads the terrain information for the given tile from HBase, then reads all the link pieces and calculates the elevation for every point. It outputs records of the form (lid, [points]), where lid is the original link ID and the points array contains three-dimensional points.
Result: The result of this job execution contains fully elevated link pieces. A second MapReduce job is necessary to merge them.
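A minimal sketch of the tiling mapper is shown below. It assumes (for illustration only) that a link has already been densified into points and arrives as a Text value of the form "x1:y1;x2:y2;..." keyed by the link ID, and that tiles form a simple fixed-size grid; the tile size and record formats are assumptions, not part of the original design.

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LinkElevationMapper extends Mapper<Text, Text, Text, Text> {

    private static final double TILE_SIZE = 1000.0;   // assumed tile edge length

    @Override
    protected void map(Text linkId, Text points, Context context)
            throws IOException, InterruptedException {
        // Group the link's points by the tile they fall into
        Map<String, StringBuilder> byTile = new LinkedHashMap<>();
        for (String point : points.toString().split(";")) {
            String[] xy = point.split(":");
            long tx = (long) (Double.parseDouble(xy[0]) / TILE_SIZE);
            long ty = (long) (Double.parseDouble(xy[1]) / TILE_SIZE);
            String tileId = tx + "_" + ty;
            byTile.computeIfAbsent(tileId, k -> new StringBuilder(linkId + "|"))
                  .append(point).append(";");
        }
        // Emit one record per (tile, link) pair: (tileId, linkId|points-in-tile)
        for (Map.Entry<String, StringBuilder> e : byTile.entrySet()) {
            context.write(new Text(e.getKey()), new Text(e.getValue().toString()));
        }
    }
}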
• 19. Stranding bivalent links
• Two connected links are called bivalent if they are connected via a bivalent node. A bivalent node is a node that has only two connections. For example, nodes N6, N7, N8 and N9 are bivalent; links L5, L6, L7, L8 and L9 are also bivalent. A degenerate case of a bivalent link is link L4.
• An algorithm for calculating stretches of bivalent links looks very straightforward: any stretch of links between two non-bivalent nodes is a stretch of bivalent links.
• Build a set of partial bivalent strands.
• Loop:
• Combine partial strands.
• If no partial strands were combined, exit the loop.
• End of loop.
• 20. Elimination of Non-Bivalent Nodes Job
Mapper: In this job, the role of the mapper is to build LNNi records out of the source records (NNi and LLi). It reads every input record and looks at the object type. If it is a node with the key Ni, a new record of the type LNNi is written to the output. If it is a link, the keys of both adjacent nodes (Ni and Nj) are extracted from the link and two records, LNNi and LNNj, are written out.
Shuffle and Sort: MapReduce's shuffle and sort will sort all the records based on the node keys, which guarantees that every node with all its adjacent links will be processed by a single reducer and that a reducer will get all the LN records for a given key together.
Reducer: The role of the reducer in this job is to eliminate non-bivalent nodes and build partial link strands. It reads all LN records for a given node key and stores them in memory. If the number of links for this node is two, then this is a bivalent node and a new strand combining these two links is written to the output (see, for example, the pairs L5, L6 or L7, L8). If the number of links is not equal to two (it can be either a dead end or a node at which multiple links intersect), then this is a non-bivalent node. For this type of node, a special strand containing a single link connected to this non-bivalent node (see, for example, L4 or L5) is created. The number of strands for such a node is equal to the number of links connected to it.
Result: The result of this job execution contains partial strand records, some of which can be repeated (L4, for example, will appear twice, from processing N4 and N5).
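A minimal sketch of this reducer is shown below. It assumes (for illustration only) that each incoming value is simply the ID of a link connected to the node, and that a partial strand is written out as a comma-separated list of link IDs keyed by its first link; the real implementation would use dedicated Writable types.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BivalentNodeReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text nodeKey, Iterable<Text> linkIds, Context context)
            throws IOException, InterruptedException {
        List<String> links = new ArrayList<>();
        for (Text linkId : linkIds) {
            links.add(linkId.toString());   // copy the value, Hadoop reuses the Text object
        }
        if (links.size() == 2) {
            // Bivalent node: its two links form one partial strand
            context.write(new Text(links.get(0)), new Text(links.get(0) + "," + links.get(1)));
        } else {
            // Dead end or intersection: one single-link strand per connected link
            for (String link : links) {
                context.write(new Text(link), new Text(link));
            }
        }
    }
}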
• 21. Merging Partial Strands Job
Mapper: The role of the mapper in this job is to bring together strands sharing the same links. For every strand it reads, it outputs several key/value pairs: the keys are the keys of the end links in the strand and the values are the strands themselves.
Shuffle and Sort: MapReduce's shuffle and sort will sort all the records based on the end link keys, which guarantees that all strand records containing the same link ID will come to the same reducer together.
Reducer: The role of the reducer in this job is to merge strands sharing the same link ID. For every link ID all strands are loaded into memory and processed based on the strand types. If two strands contain the same links, the produced strand is complete and can be written directly to the final result directory. Otherwise a resulting strand containing all unique links is created and written to the output for further processing; in this case a "to be processed" counter is incremented.
Result: The result of this job execution contains all complete strands in a separate directory. It also contains a directory of all partial bivalent strands along with the value of the "to be processed" counter.
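A minimal sketch of the merging mapper is shown below, keeping the same illustrative strand representation as above (a Text value holding a comma-separated list of link IDs). The strand is re-emitted once under each of its end link IDs so that strands sharing an end link meet at the same reducer.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MergeStrandsMapper extends Mapper<Text, Text, Text, Text> {

    @Override
    protected void map(Text strandKey, Text strand, Context context)
            throws IOException, InterruptedException {
        String[] linkIds = strand.toString().split(",");
        if (linkIds.length == 0) {
            return;
        }
        // Emit the strand under its first and last link IDs (its end links)
        context.write(new Text(linkIds[0]), strand);
        if (linkIds.length > 1) {
            context.write(new Text(linkIds[linkIds.length - 1]), strand);
        }
    }
}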
• 22. When to MapReduce? MapReduce is a technique aimed at solving a relatively simple problem: there is a lot of data and it has to be processed in parallel, preferably on multiple machines. Alternatively, it can be used for parallelizing compute-intensive calculations, where it is not necessarily the amount of data that matters but rather the overall calculation time. The following has to be true in order for MapReduce to be applicable:
• The calculations that need to be run have to be composable. This means it should be possible to run the calculation on a subset of the data and merge the partial results.
• The data set is big enough (or the calculations are long enough) that the infrastructure overhead of splitting it up for independent computations and merging the results will not hurt overall performance.
• The calculation depends mostly on the data set being processed; additional small data sets can be added using HBase, the distributed cache, or other techniques.
• 23. When not to MapReduce?
• MapReduce is not applicable in scenarios where the data set has to be accessed randomly to perform the operation. MapReduce calculations need to be composable: it should be possible to run the calculation on a subset of the data and merge the partial results.
• MapReduce is not applicable for general recursive problems, because computation of the current value requires knowledge of the previous one, which means the problem cannot be broken apart into sub-computations that can run independently.
• If the data is small enough to fit into a single machine's memory, it is probably going to be faster to process it as a standalone application. Using MapReduce in this case makes the implementation unnecessarily complex and typically slower.
• MapReduce is an inherently batch implementation and as such is not applicable for online computations.
• 24. General design guidelines - 1 When splitting data between map tasks, make sure that you do not create too many or too few of them. Getting the number of maps right matters for the following reasons (a configuration sketch follows this list):
• Having too many mappers creates scheduling and infrastructure overhead and, in extreme cases, can even kill the job tracker. Too many mappers also typically increases overall resource utilization (due to excessive JVM creation) and execution time (due to the limited number of execution slots).
• Having too few mappers can lead to underutilization of the cluster and too much load on some of the nodes (where the maps actually run). Additionally, with very large map tasks, retries and speculative execution become very expensive and take longer.
• A large number of small mappers creates a lot of seeks during the shuffle of the map output to the reducers. It also creates too many connections delivering map outputs to the reducers.
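A minimal sketch of influencing the number of map tasks through split sizing is shown below; the property name and classes are those of the Hadoop 2.x mapreduce API (older releases use mapred.min.split.size / mapred.max.split.size), and the 256 MB figure is only an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SplitSizingExample {
    public static Job configure(Configuration conf) throws Exception {
        // Larger minimum split size -> fewer, bigger map tasks
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024);
        Job job = Job.getInstance(conf, "split sizing example");
        // For many small input files, a combining input format packs several
        // files into one split instead of spawning one mapper per file
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        return job;
    }
}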
• 25. General design guidelines - 2
• The number of reducers configured for the application is another crucial factor. Having too many or too few reducers can be counter-productive:
• Having too many reducers, in addition to scheduling and infrastructure overhead, results in the creation of too many output files (remember that each reducer creates its own output file), which has a negative impact on the NameNode. It also makes it more complicated for other MapReduce jobs to leverage the results of this one.
• Having too few reducers has the same negative effects as having too few mappers: underutilization of the cluster and very expensive retries.
• Use job counters appropriately (a counter sketch follows this list):
• Counters are appropriate for tracking a few important, global bits of information. They are definitely not meant to aggregate very fine-grained statistics of applications.
• Counters are fairly expensive, since the JobTracker has to maintain every counter of every map/reduce task for the entire duration of the application.
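A minimal sketch of appropriate counter usage is shown below; the enum and counter names are illustrative. The driver can read the totals after the job completes via job.getCounters().findCounter(...).getValue().

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CounterExampleMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // A handful of coarse-grained, globally interesting counters
    public enum RecordQuality { VALID, MALFORMED }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (line.getLength() == 0) {
            context.getCounter(RecordQuality.MALFORMED).increment(1);
            return;
        }
        context.getCounter(RecordQuality.VALID).increment(1);
        context.write(new Text("records"), new LongWritable(1));
    }
}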
• 26. General design guidelines - 3
• Do not create applications that write out multiple small output files from each reducer.
• Consider compressing the application's output with an appropriate compressor (trading compression speed against efficiency) to improve write performance.
• Use an appropriate file format for the output of the reducers. Using SequenceFiles is often the best option, since they are both compressible and splittable.
• Consider using a larger output block size when the individual input/output files are large (multiple GBs).
• Try to avoid creating new object instances inside the map and reduce methods. These methods are executed many times inside the loop, which means object creation and disposal increase execution time and create extra work for the garbage collector. A better approach is to create the majority of your intermediate objects once (for example in the corresponding setup method) and then repopulate them inside the map/reduce methods. A sketch illustrating object reuse and compressed SequenceFile output follows this list.
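A minimal sketch combining two of these guidelines is shown below: reusing Writable instances created once per task instead of allocating them per record, and writing block-compressed SequenceFile output. The choice of SnappyCodec is only an example and requires the native Snappy libraries to be installed.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ReuseAndCompressExample {

    public static class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
        // Created once per mapper instance and repopulated for every record
        private final Text word = new Text();
        private final IntWritable one = new IntWritable(1);

        @Override
        protected void map(Object key, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                word.set(token);              // reuse, no new Text per token
                context.write(word, one);
            }
        }
    }

    public static void configureOutput(Job job) {
        // Block-compressed SequenceFiles remain splittable for downstream jobs
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
    }
}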
• 27. General design guidelines - 4
• Do not write directly to user-defined files from either the mapper or the reducer. The current implementation of the file writer in Hadoop is single-threaded, which means that multiple mappers/reducers trying to write to such a file will be serialized.
• Do not create MapReduce implementations that scan an HBase table in order to create a new HBase table (or write to the same table). The TableInputFormat used for HBase MapReduce implementations is based on a table scan, which is time-sensitive. HBase writes, on the other hand, can cause significant write delays due to HBase table splits. As a result, region servers can hang or some data can even be lost. A better solution is to split such a job into two: one scanning the table and writing intermediate results to HDFS, and another reading from HDFS and writing to HBase.
• Do not try to reimplement existing Hadoop classes; extend them and explicitly specify your implementation in the configuration. Unlike application servers, the Hadoop command specifies user classes last, which means that existing Hadoop classes always take precedence.