Hadoop Training
MapReduce
Agenda
Meet MapReduce
Word Count Algorithm – Traditional Approach
Traditional Approach on a Distributed System
Traditional Approach – Drawbacks
MapReduce Approach
Input & Output Forms of an MR Program
Map, Shuffle & Sort, Reduce Phases
WordCount Code Walkthrough
Workflow & Transformation of Data
Input Split & HDFS Block
Relation between Split & Block
Data Locality Optimization
Speculative Execution
MR Flow with a Single Reduce Task
MR Flow with Multiple Reducers
Input Format & Hierarchy
Output Format & Hierarchy
"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." – Grace Hopper
Meet MapReduce
• MapReduce is a programming model for distributed processing
• Advantage - easy scaling of data processing over multiple computing nodes
• The basic entities in this model are – mappers & reducers
• Decomposing a data processing application into mappers and reducers is the task of the developer
• Once you write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change
WordCount – Traditional Approach
• Input: do as I say not as I do
• Output:
Word   Count
as     2
do     2
I      2
not    1
say    1
WordCount – Traditional Approach
• The program loops through all the documents. For each document, the
words are extracted one by one using a tokenization process. For each
word, its corresponding entry in a multiset called wordCount is
incremented by one. At the end, a display() function prints out all the
entries in wordCount.
• A multiset is a set where each element also has a count. The word count
we’re trying to generate is a canonical example of a multiset. In practice, it’s
usually implemented as a hash table.
define wordCount as Multiset;
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);
Traditional Approach – Distributed Processing
define wordCount as Multiset;
for each document in documentSubset {
    <same code as in previous slide>
}
sendToSecondPhase(wordCount);

define totalWordCount as Multiset;
for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}
Traditional Approach – Drawbacks
• Central storage – the server's bandwidth becomes the bottleneck
• Multiple storage locations – the splits must be managed by hand
• The program keeps the entire wordCount multiset in memory
• When processing large document sets, the number of unique words can exceed the RAM of a single machine
• Can one machine handle all of phase 2?
• If multiple machines are used for phase 2, how should the data be partitioned?
MapReduce Approach
• Has two execution phases – mapping & reducing
• These phases are defined by data processing functions called the mapper and the reducer
• Mapping phase – MR takes the input data and feeds each data element to the mapper
• Reducing phase – the reducer processes all the outputs from the mapper and arrives at a final result
Input & Output forms:
• In order for mapping, reducing, partitioning, and shuffling (and a few other steps not covered here) to work together seamlessly, we need to agree on a common structure for the data being processed
• The InputFormat class is responsible for creating input splits and dividing them into records

            Input              Output
map()       <k1, v1>           list(<k2, v2>)
reduce()    <k2, list(v2)>     list(<k3, v3>)
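As a concrete sketch of these forms (modeled on the Apache WordCount tutorial; the class name is hypothetical), here is a mapper whose <k1, v1> is (byte offset, line of text) and whose list(<k2, v2>) is a list of (word, 1) pairs:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper<k1, v1, k2, v2>: (offset, line) in, (word, 1) out
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line; value = the line itself
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // one <k2, v2> pair per word
            }
        }
    }
}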
Input & Output forms:
• Input & output forms should be flexible and powerful enough to handle most of the targeted data processing applications. MapReduce uses lists and (key/value) pairs as its main data primitives.
• The keys and values are often integers or strings but can also be dummy
values to be ignored or complex object types.
Map Phase
Reduce Phase
Shuffle & Sort Phase
• The default partitioning is hash partitioning
MR – Workflow & Transformation of Data
• From input files to the mapper
• From the mapper to the intermediate results
• From the intermediate results to the reducer
• From the reducer to output files
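For the earlier input line "do as I say not as I do" (the initial key is the line's byte offset under the default TextInputFormat; keys sort in raw byte order, which puts the uppercase "I" first), the data transforms roughly like this:

From input files to the mapper:        (0, "do as I say not as I do")
From the mapper to the intermediate:   (do,1) (as,1) (I,1) (say,1) (not,1) (as,1) (I,1) (do,1)
After shuffle & sort, to the reducer:  (I,[1,1]) (as,[1,1]) (do,[1,1]) (not,[1]) (say,[1])
From the reducer to output files:      (I,2) (as,2) (do,2) (not,1) (say,1)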
Word Count: Source Code
• Key points to note:
1. In MR, map() processes one record at a time, whereas traditional approaches process one document at a time.
2. The new classes we have seen (Text, IntWritable, LongWritable, etc.) have additional serialization capabilities. (Discussed in detail later.)
• Source Code: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
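The full program at the link above is short; a condensed sketch of the remaining pieces (reusing the TokenizerMapper sketched earlier, Hadoop 2 style) looks like this:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Reducer <k2, list(v2)> -> list(<k3, v3>): sums the 1s emitted per word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class); // the mapper from the earlier sketch
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}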
Input Split & HDFS Block
A chunk of data is divided in two ways:
• HDFS Block (physical division)
• Input Split (logical division)
Relation Between Input Split & HDFS Block
[Figure: a file of lines 1–10 laid out across HDFS block boundaries and grouped into three input splits]
• Logical records do not fit neatly into HDFS blocks.
• Logical records are lines, and lines can cross block boundaries.
• The first split contains line 5 even though line 5 spans two blocks.
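The split size itself follows from the block size. In Hadoop 2, FileInputFormat computes it as

splitSize = max(minimumSplitSize, min(maximumSplitSize, blockSize))

where the minimum and maximum are controlled by mapreduce.input.fileinputformat.split.minsize and .maxsize. With the defaults (minimum 1 byte, maximum Long.MAX_VALUE), the split size is simply the block size, which is why splits and blocks usually line up.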
Data Locality Optimization
• An MR job is split into various map & reduce tasks
• Map tasks run on the input splits
• Ideally, the task JVM is started on the node where the split/block of data resides
• In some scenarios, though, the JVMs on that node may not be free to accept another task
• In that case, the task is started on a TaskTracker on a different node
• Scenario a) Same-node execution
• Scenario b) Off-node execution
• Scenario c) Off-rack execution
Speculative Execution
• An MR job is split into various map & reduce tasks, and they execute in parallel.
• Overall job execution time is dominated by the slowest task.
• Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is running slower than expected and launches another, equivalent task as a backup. This is termed speculative execution of tasks.
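Speculative execution can be toggled per job. A sketch, assuming Hadoop 2 property names (Hadoop 1 used mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution):

// e.g. disable backup tasks for reducers that write to an external system
Configuration conf = job.getConfiguration();
conf.setBoolean("mapreduce.map.speculative", true);     // default: enabled
conf.setBoolean("mapreduce.reduce.speculative", false);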
MapReduce Dataflow With A Single Reduce Task
MapReduce Dataflow With Multiple Reduce Tasks
MapReduce Dataflow With No Reduce Tasks
Combiner
• A combiner is a mini-reducer
• It is executed on the mapper output, on the mapper side
• The combiner's output is fed to the reducer
• Because the mapper output is pre-aggregated by the combiner, the data that has to be shuffled across the cluster is minimized
• Because the combiner function is an optimization, Hadoop provides no guarantee of how many times it will call it for a particular map output record, if at all
• So, calling the combiner function zero, one, or many times should produce the same output from the reducer
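Wiring in a combiner is one line in the driver. For word count, the reducer already satisfies the combiner's contract (summing is commutative and associative), so it can double as the combiner:

job.setCombinerClass(IntSumReducer.class); // runs a mini-reduce on each mapper's output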
Combiner’s Contract
• Only functions that are commutative & associative can be used as combiners.
• For example:
  max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
  whereas
  mean(0, 20, 10, 25, 15) = 14, but
  mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
Can a combiner replace a reducer?
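(For mean, the usual workaround is to restructure the job so that the combined value is associative: have the map side emit partial (sum, count) pairs and divide only in the reducer. For the numbers above, (0, 20, 10) combines to (30, 3) and (25, 15) to (40, 2); the reducer then computes (30 + 40) / (3 + 2) = 70 / 5 = 14, the correct mean.)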
Partitioner
• We know that all values for a given key always go to a single reducer.
• The partitioner is responsible for sending key/value pairs to a reducer based on the key's content.
• The default partitioner is the hash partitioner. It takes the mapper output, computes a hash value for each key, and takes that value modulo the number of reducers. The result of this calculation determines the reducer that a particular key goes to.
Partitioner
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative,
        // then take it modulo the number of reducers
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
Example partition assignments (hash value modulo the number of reducers):
With 3 reducers: 2%3=2, 3%3=0, 4%3=1, 5%3=2, 6%3=0
With 4 reducers: 2%4=2, 3%4=1, 4%4=0, 5%4=1, 6%4=2, 7%4=3
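A custom partitioner just overrides getPartition(). A hypothetical sketch that routes all words starting with the same letter to the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String s = key.toString();
        char first = s.isEmpty() ? ' ' : Character.toLowerCase(s.charAt(0));
        // Same first letter -> same partition -> same reducer
        return (first & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// In the driver:
// job.setPartitionerClass(FirstLetterPartitioner.class);
// job.setNumReduceTasks(3);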
Partitioner
[Figure: each mapper's output passes through a partitioner, which routes each key/value pair to one of the reducers]
InputFormat Hierarchy
[Figure: the InputFormat class hierarchy]
• An input split is a chunk of the input that is processed by a single map. Each
map processes a single split. Each split is divided into records, and the map
processes each record—a key-value pair—in turn. Splits and records are
logical: there is nothing that requires them to be tied to files, for example,
although in their most common incarnations, they are.
• In a database context, a split might correspond to a range of rows from a
table and a record to a row in that range.
• An InputFormat is responsible for creating the input splits and dividing
them into records.
InputFormat Hierarchy
public abstract class InputFormat<K, V> {
    public abstract List<InputSplit> getSplits(JobContext context)
        throws IOException, InterruptedException;

    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
        TaskAttemptContext context) throws IOException, InterruptedException;
}
The client calls getSplits(); each map task calls createRecordReader().
• FileInputFormat is the base class for all implementations
of InputFormat that use files as their data source
• It provides two things: a place to define which files are included as the input
to a job, and an implementation for generating splits for the input files. The
job of dividing splits into records is performed by subclasses.
InputFormat Hierarchy
public static void addInputPath(Job job, Path path)
public static void addInputPaths(Job job, String commaSeparatedPaths)
public static void setInputPaths(Job job, Path... inputPaths)
public static void setInputPaths(Job job, String commaSeparatedPaths)
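For example (paths hypothetical): the add* methods append to the job's input list, while setInputPaths replaces it.

FileInputFormat.addInputPath(job, new Path("/data/2014/logs"));               // append one path
FileInputFormat.addInputPaths(job, "/data/a,/data/b");                        // append several
FileInputFormat.setInputPaths(job, new Path("/data/a"), new Path("/data/b")); // replace the list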
InputFormat
[Figure: the InputFormat produces input splits; a RecordReader turns each split into records for its mapper]
OutputFormat
[Figure: each reducer writes its output file through a RecordWriter]
OutputFormat Hierarchy
Counters
• Counters are a useful channel for gathering statistics about a job: for quality control or for application-level statistics.
• Often used for debugging purposes, e.g. counting the number of good records and bad records in the input.
• Two types – built-in & custom counters (a custom counter is sketched below)
• Examples of built-in counters:
  • Map input records
  • Map output records
  • Filesystem bytes read
  • Launched map tasks
  • Failed map tasks
  • Killed reduce tasks
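A custom counter is just a Java enum, incremented through the task context. A sketch of the good/bad record example above (isWellFormed is a hypothetical validation helper; this code belongs inside the mapper class):

// Inside the mapper class:
enum RecordQuality { GOOD, BAD } // counter group: one counter per constant

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    if (isWellFormed(value)) { // hypothetical validation helper
        context.getCounter(RecordQuality.GOOD).increment(1);
        // ... normal processing ...
    } else {
        context.getCounter(RecordQuality.BAD).increment(1);
    }
}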
Joins
• Map-side join (replication join): a map-side join that works in situations where one of the datasets is small enough to cache
• Reduce-side join (repartition join): a reduce-side join for situations where you're joining two or more large datasets together
• Semi-join (a map-side join): another map-side join, where one dataset is initially too large to fit in memory but, after some filtering, can be reduced to a size that fits in memory
Distributed Cache
• Side data can be defined as extra read-only data needed by a job to process the main dataset
• To make side data available to all map or reduce tasks, we distribute those datasets using Hadoop's Distributed Cache mechanism
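With the Hadoop 2 API, the wiring is roughly as follows (file path hypothetical). The "#lookup" fragment makes the cached file appear as a symlink named "lookup" in each task's working directory:

// In the driver: ship a small side-data file to every task
job.addCacheFile(new java.net.URI("/side/lookup.txt#lookup"));

// In the mapper: read the cached copy once, before any records arrive
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    try (BufferedReader in = new BufferedReader(new FileReader("lookup"))) {
        String line;
        while ((line = in.readLine()) != null) {
            // ... populate an in-memory lookup table ...
        }
    }
}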
Map Join (Using Distributed Cache)
Some Useful Links:
• http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
• http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Thank you!
