The Story of Sam
Saliya Ekanayake
SALSA HPC Group
http://salsahpc.indiana.edu
Pervasive Technology Institute, Indiana University, Bloomington
 Believed “an apple a day keeps a doctor away”
Mother
Sam
An Apple
 Sam thought of “drinking” the apple
 He used a knife to cut the apple and a blender to make juice.
 Sam applied his invention to all the fruits he could find in the fruit basket
 (map '(apple orange pineapple …)) → '(apple-juice orange-juice pineapple-juice …)
 (reduce '(apple-juice orange-juice pineapple-juice …)) → a single mixed juice
Classical Notion of MapReduce in Functional Programming
A list of values mapped into another list of values, which gets
reduced into a single value
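This classical notion can be sketched in a few lines of Python (the fruit and juice names are illustrative):

```python
from functools import reduce

fruits = ["apple", "orange", "pineapple"]  # the list of values

# map: each value is turned into another value (fruit -> juice)
juices = list(map(lambda fruit: fruit + " juice", fruits))

# reduce: the mapped list collapses into a single value (one mixed juice)
mix = reduce(lambda a, b: a + " + " + b, juices)

print(mix)  # apple juice + orange juice + pineapple juice
```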
 Sam got his first job in JuiceRUs for his talent in making juice
 Now, it's not just one basket but a whole container of fruits
 Also, they produce a list of juice types separately
NOT ENOUGH !!
 But, Sam had just ONE knife and ONE blender
Large data and a list of values for output
Wait!
 Implemented a parallel version of his innovation
(<a, apple> , <o, orange> , <p, pineapple> , …)
Each input to a map is a list of <key, value> pairs
Each output of a map is a list of <key, value> pairs
(<a', apple juice> , <o', orange juice> , <p', pineapple juice> , …)
Grouped by key
Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism)
e.g. <a', (apple juice …)>
Reduced into a list of values
The idea of MapReduce in Data Intensive Computing
A list of <key, value> pairs mapped into another list of <key, value> pairs, which gets grouped by the key and
reduced into a list of values
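The whole pipeline — map over <key, value> pairs, group the intermediate pairs by key, then reduce each group — can be sketched as follows (the keys and values are illustrative stand-ins for the fruits and juices):

```python
from collections import defaultdict

def map_fn(key, value):
    # each input <key, value> pair emits a list of intermediate <key, value> pairs
    return [(key + "'", value + " juice")]

inputs = [("a", "apple"), ("o", "orange"), ("p", "pineapple")]

intermediate = [pair for k, v in inputs for pair in map_fn(k, v)]

# group by key: collect every value emitted under the same intermediate key
groups = defaultdict(list)
for k, v in intermediate:
    groups[k].append(v)

# reduce: each <key, value-list> is reduced into a value
result = {k: " + ".join(vs) for k, vs in groups.items()}
print(result)  # {"a'": 'apple juice', "o'": 'orange juice', "p'": 'pineapple juice'}
```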
 Sam realized,
◦ To create his favorite mixed fruit juice he can use a combiner after the reducers
◦ If several <key, value-list> pairs fall into the same group (based on the grouping/hashing algorithm), then use the blender (reducer) separately on each of them
◦ The knife (mapper) and blender (reducer) should not contain residue after use – Side Effect Free
◦ In general, the reducer should be associative and commutative
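The last point matters because the framework is free to regroup and reorder the values a reducer sees (for instance, when a combiner pre-reduces part of them). A quick check in Python: addition qualifies, subtraction does not:

```python
from functools import reduce

values = [3, 1, 4, 1, 5]

# addition is associative and commutative: any grouping gives the same answer,
# so partial "combiner" sums can safely be reduced again
assert reduce(lambda a, b: a + b, values) == 14
assert sum(values[:2]) + sum(values[2:]) == 14

# subtraction is neither: regrouping changes the answer, so it is unsafe as a reducer
sequential = reduce(lambda a, b: a - b, values)  # ((((3-1)-4)-1)-5) = -8
regrouped = (values[0] - values[1]) - (values[2] - values[3]) - values[4]  # 2 - 3 - 5 = -6
print(sequential, regrouped)  # -8 -6
```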
 We think Sam was you 
Hadoop: Nuts and Bolts
Big Data Mining and Analytics
Dr. Latifur Khan
Department of Computer Science
University of Texas at Dallas
Source:
http://developer.yahoo.com/hadoop/tutorial/module4.html
 MapReduce is designed to efficiently process large
volumes of data by connecting many commodity
computers together to
work in parallel
 A theoretical 1000-CPU machine would cost a very
large amount of money, far more than 1000 single-CPU
or 250 quad-core machines
 MapReduce ties smaller and more reasonably priced
machines together into a single cost-effective
commodity cluster
 MapReduce divides the workload into multiple independent tasks and schedules them across cluster nodes
 The work performed by each task is done in isolation from the others
 The amount of communication that can be performed by tasks is limited, mainly for scalability reasons
 In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded in
 An underlying distributed file system (e.g., GFS) splits large data files into chunks, which are managed by different nodes in the cluster
 Even though the file chunks are distributed across several machines, they form a single namespace
[Figure: a large input file split into chunks, with one chunk of input data on each of Nodes 1-3]
 In MapReduce, chunks are processed in isolation by tasks called Mappers
 The outputs from the mappers are denoted as intermediate outputs (IOs) and are brought into a second set of tasks called Reducers
 The process of bringing together IOs into a set of Reducers is known as the shuffling process
 The Reducers produce the final outputs (FOs)
 Overall, MapReduce breaks the data flow into two phases, the map phase and the reduce phase
[Figure: chunks C0-C3 feed Mappers M0-M3, whose intermediate outputs IO0-IO3 are shuffled to Reducers R0-R1, producing final outputs FO0-FO1; the map phase ends and the reduce phase begins at the shuffle]
 The programmer in MapReduce has to specify two functions, the map function and the reduce function, that implement the Mapper and the Reducer in a MapReduce program
 In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs
 The map and reduce functions receive and emit (K, V) pairs
[Figure: input splits as (K, V) pairs → Map function → intermediate (K', V') pairs → Reduce function → final (K'', V'') pairs]
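A minimal word count, the canonical MapReduce example (used here purely for illustration), makes the (K, V) → (K', V') → (K'', V'') flow concrete:

```python
from collections import defaultdict

def map_fn(offset, line):
    # input (K, V) = (byte offset, line) -> intermediate (K', V') = (word, 1)
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # (K', list of V') -> final (K'', V'') = (word, total count)
    return (word, sum(counts))

lines = [(0, "to be or not"), (13, "to be")]

intermediate = [pair for off, line in lines for pair in map_fn(off, line)]
groups = defaultdict(list)
for k, v in intermediate:
    groups[k].append(v)

final = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(final)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```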
 In MapReduce, intermediate output values are not usually reduced together
 All values with the same key are presented to a single Reducer together
 More specifically, a different subset of the intermediate key space is assigned to each Reducer
 These subsets are known as partitions
[Figure: different colors represent different keys (potentially) from different Mappers; the resulting partitions are the input to Reducers]
 In this part, the following concepts of MapReduce will be described:
 Basics
 A close look at MapReduce data flow
 Additional functionality
 Scheduling and fault-tolerance in MapReduce
 Comparison with existing techniques and models
 Since its debut on the computing stage, MapReduce has
frequently been associated with Hadoop
 Hadoop is an open source implementation of
MapReduce and is currently enjoying wide popularity
 Hadoop presents MapReduce as an analytics engine and
under the hood uses a distributed storage layer referred
to as Hadoop Distributed File System (HDFS)
 HDFS mimics Google File System (GFS)
 Highly scalable distributed file system for large data-intensive applications.
◦ E.g. 10K nodes, 100 million files, 10 PB
 Provides redundant storage of massive amounts of data on cheap and unreliable computers
◦ Files are replicated to handle hardware failure
◦ Detects failures and recovers from them
 Provides a platform over which other systems, such as MapReduce and BigTable, operate.
 Single Namespace for entire cluster
 Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
 Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
 Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
HDFS Architecture
[Figure: a Client interacting with the NameNode (backed by a Secondary NameNode) and with the DataNodes]
NameNode : Maps a file to a file-id and a list of DataNodes
DataNode : Maps a block-id to a physical location on disk
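The two mappings can be pictured as plain dictionaries (the file name, block ids, and host names below are invented for illustration):

```python
# NameNode side: file name -> ordered list of block ids (metadata only)
namenode = {"/logs/day1": ["blk_1", "blk_2"]}

# DataNode side: block id -> hosts holding a physical replica of that block
block_locations = {
    "blk_1": ["node1", "node2", "node3"],  # each block replicated on multiple DataNodes
    "blk_2": ["node2", "node3", "node4"],
}

def locate(path):
    # the "intelligent client": ask the NameNode for block ids, then read
    # each block directly from one of its DataNodes
    return [(blk, block_locations[blk][0]) for blk in namenode[path]]

print(locate("/logs/day1"))  # [('blk_1', 'node1'), ('blk_2', 'node2')]
```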
[Figure: MapReduce data flow on Node 1 and Node 2. On each node, files loaded from the local HDFS store pass through InputFormat → Splits → RecordReaders (RR) → Map, producing intermediate (K, V) pairs, then through Partitioner → Sort → Reduce → OutputFormat, with final (K, V) pairs written back to the local HDFS store. During the shuffling process, intermediate (K, V) pairs are exchanged by all nodes.]
 Input files are where the data for a MapReduce task is initially stored
 The input files typically reside in a distributed file system (e.g. HDFS)
 The format of input files is arbitrary:
 Line-based log files
 Binary files
 Multi-line input records
 Or something else entirely
 How the input files are split up and read is defined by the InputFormat
 InputFormat is a class that does the following:
 Selects the files that should be used for input
 Defines the InputSplits that break up a file
 Provides a factory for RecordReader objects that read the file
 Several InputFormats are provided with Hadoop:

InputFormat | Description | Key | Value
TextInputFormat | Default format; reads lines of text files | The byte offset of the line | The line contents
KeyValueInputFormat | Parses lines into (K, V) pairs | Everything up to the first tab character | The remainder of the line
SequenceFileInputFormat | A Hadoop-specific high-performance binary format | user-defined | user-defined
 An input split describes a unit of work that comprises a single map task in a MapReduce program
 By default, the InputFormat breaks a file up into 64MB splits
 By dividing the file into splits, we allow several map tasks to operate on a single file in parallel
 If the file is very large, this can improve performance significantly through parallelism
 Each map task corresponds to a single input split
 The input split defines a slice of work but does not describe how to access it
 The RecordReader class actually loads data from its source and converts it into (K, V) pairs suitable for reading by Mappers
 The RecordReader is invoked repeatedly on the input until the entire split is consumed
 Each invocation of the RecordReader leads to another call of the map function defined by the programmer
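For a line-based input, the RecordReader's behavior can be sketched as a generator that walks a split and yields one (byte offset, line) record per invocation (a sketch of the idea, not the actual Hadoop API):

```python
def record_reader(split: bytes):
    # yield one (K, V) = (byte offset, line) record per line in the split;
    # each yielded record corresponds to one call of the map function
    offset = 0
    for raw in split.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\n").decode()
        offset += len(raw)

split = b"first line\nsecond line\n"
print(list(record_reader(split)))  # [(0, 'first line'), (11, 'second line')]
```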
 The Mapper performs the user-defined work of the first phase of the MapReduce program
 A new instance of Mapper is created for each split
 The Reducer performs the user-defined work of the second phase of the MapReduce program
 A new instance of Reducer is created for each partition
 For each key in the partition assigned to a Reducer, the Reducer is called once
 Each mapper may emit (K, V) pairs to any partition
 Therefore, the map nodes must all agree on where to send different pieces of intermediate data
 The partitioner class determines which partition a given (K, V) pair will go to
 The default partitioner computes a hash value for a given key and assigns it to a partition based on this result
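The default hash partitioner can be sketched as below. Hadoop's HashPartitioner hashes the key's bytes; a stable CRC32 stands in for it here so the example is deterministic:

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    # a deterministic hash of the key, mapped onto one of the reducer partitions;
    # every map node running this same function agrees on where a key goes
    return zlib.crc32(key.encode()) % num_reducers

# the same key always lands in the same partition, on any node
assert partition("apple", 4) == partition("apple", 4)
print({k: partition(k, 4) for k in ["apple", "orange", "pineapple"]})
```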
 Each Reducer is responsible for reducing the values associated with (several) intermediate keys
 The set of intermediate keys on a single node is automatically sorted by MapReduce before they are presented to the Reducer
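Sorting is what lets a Reducer consume its partition one key at a time: after a sort by key, grouping is a single linear scan. A sketch:

```python
from itertools import groupby
from operator import itemgetter

intermediate = [("be", 1), ("to", 1), ("be", 1), ("or", 1), ("to", 1)]

# MapReduce sorts the intermediate keys on each node before reduction...
intermediate.sort(key=itemgetter(0))

# ...so each key, with all of its values, is presented to the reducer exactly once
reduced = [(key, sum(v for _, v in group))
           for key, group in groupby(intermediate, key=itemgetter(0))]
print(reduced)  # [('be', 2), ('or', 1), ('to', 2)]
```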
 The OutputFormat class defines the way (K, V) pairs produced by Reducers are written to output files
 The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS
 Several OutputFormats are provided by Hadoop:

OutputFormat | Description
TextOutputFormat | Default; writes lines in "key \t value" (tab-separated) format
SequenceFileOutputFormat | Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat | Generates no output files
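TextOutputFormat's behavior amounts to writing one tab-separated line per final (K, V) pair; a sketch, writing to an in-memory buffer rather than HDFS:

```python
import io

def text_output_format(pairs, out):
    # default output: one "key<TAB>value" line per final (K, V) pair
    for k, v in pairs:
        out.write(f"{k}\t{v}\n")

buf = io.StringIO()
text_output_format([("to", 2), ("be", 2)], buf)
print(repr(buf.getvalue()))  # 'to\t2\nbe\t2\n'
```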
 MapReduce applications are limited by the bandwidth available
on the cluster
 It pays to minimize the data shuffled between map and reduce tasks
 Hadoop allows the user to specify a combiner function (just like the
reduce function) to be run on a map output
[Figure: map tasks (MT) on nodes (N) in two racks (R) feed one reduce task (RT) with (Year, Temperature) pairs; one map's output (1950, 0), (1950, 20), (1950, 10) is collapsed by the combiner to (1950, 20) before being shuffled]
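Assuming the year/temperature job in the figure keeps the maximum temperature per year, the combiner is simply that same reduction run locally on one map's output, shrinking what must cross the network:

```python
from collections import defaultdict

def combine_max(map_output):
    # run the (max) reduce function locally on a single mapper's output,
    # so fewer intermediate pairs need to be shuffled
    best = defaultdict(lambda: float("-inf"))
    for year, temp in map_output:
        best[year] = max(best[year], temp)
    return sorted(best.items())

map_output = [(1950, 0), (1950, 20), (1950, 10)]
print(combine_max(map_output))  # [(1950, 20)]: one pair shuffled instead of three
```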
 MapReduce adopts a master-slave architecture
 The master node in MapReduce is referred to as the Job Tracker (JT)
 Each slave node in MapReduce is referred to as a Task Tracker (TT)
 MapReduce adopts a pull scheduling strategy rather than a push one
 I.e., the JT does not push map and reduce tasks to TTs; rather, TTs pull them by making the appropriate requests
[Figure: the JT holds a queue of pending tasks T0-T2; each TT exposes task slots and pulls tasks (here T0 and T1) into its free slots]
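Pull scheduling can be sketched as TTs asking for work via heartbeats against the JT's task queue (a toy model, not Hadoop's actual RPC protocol):

```python
from collections import deque

class JobTracker:
    def __init__(self, tasks):
        self.queue = deque(tasks)  # pending map/reduce tasks

    def heartbeat(self, free_slots):
        # the JT never pushes work; it only answers a TT's request,
        # handing out at most as many tasks as the TT has free slots
        assigned = []
        while self.queue and len(assigned) < free_slots:
            assigned.append(self.queue.popleft())
        return assigned

jt = JobTracker(["T0", "T1", "T2"])
print(jt.heartbeat(free_slots=2))  # ['T0', 'T1']
print(jt.heartbeat(free_slots=2))  # ['T2']
```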
 Every TT periodically sends a heartbeat message to the JT, encompassing a request for a map or a reduce task to run
I. Map Task Scheduling:
 The JT satisfies requests for map tasks by attempting to schedule mappers in the vicinity of their input splits (i.e., it considers locality)
II. Reduce Task Scheduling:
 In contrast, the JT simply assigns the next yet-to-run reduce task to a requesting TT regardless of the TT's network location and its implied effect on the reducer's shuffle time (i.e., it does not consider locality)
 In MapReduce, an application is represented as a job
 A job encompasses multiple map and reduce tasks
 MapReduce in Hadoop comes with a choice of schedulers:
 The default is the FIFO scheduler which schedules jobs
in order of submission
 There is also a multi-user scheduler called the Fair scheduler
which aims to give every user a fair share of the cluster
capacity over time
 MapReduce can guide jobs toward a successful completion even when jobs are run on a large cluster where the probability of failures increases
 The primary way that MapReduce achieves fault tolerance is through restarting tasks
 If a TT fails to communicate with the JT for a period of time (by default, 1 minute in Hadoop), the JT will assume that the TT in question has crashed
 If the job is still in the map phase, the JT asks another TT to re-execute all Mappers that previously ran at the failed TT
 If the job is in the reduce phase, the JT asks another TT to re-execute all Reducers that were in progress on the failed TT
 A MapReduce job is dominated by the slowest task
 MapReduce attempts to locate slow tasks (stragglers) and run redundant (speculative) tasks that will optimistically commit before the corresponding stragglers
 This process is known as speculative execution
 Only one copy of a straggler is allowed to be speculated
 Whichever copy (among the two copies) of a task commits first becomes the definitive copy, and the other copy is killed by the JT
 How does Hadoop locate stragglers?
 Hadoop monitors each task's progress using a progress score between 0 and 1
 If a task's progress score is less than (average – 0.2), and the task has run for at least 1 minute, it is marked as a straggler
[Figure: over time, task T1 reaches progress score 2/3 and is not a straggler, while task T2 remains at progress score 1/12 and is marked as a straggler]
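Applying the stated rule to the two tasks in the figure (both assumed to have run for at least 1 minute):

```python
def is_straggler(score, avg, runtime_minutes):
    # straggler rule: the task ran for at least 1 minute and trails the
    # average progress score by more than 0.2
    return runtime_minutes >= 1 and score < avg - 0.2

scores = [2/3, 1/12]              # progress scores of T1 and T2 from the figure
avg = sum(scores) / len(scores)   # 0.375, so the threshold is 0.175

print(is_straggler(2/3, avg, 5))   # False: T1 is not a straggler
print(is_straggler(1/12, avg, 5))  # True: T2 is a straggler
```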

Weitere ähnliche Inhalte

Was ist angesagt?

Map Reduce Execution Architecture
Map Reduce Execution Architecture Map Reduce Execution Architecture
Map Reduce Execution Architecture Rupak Roy
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialFarzad Nozarian
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Soumee Maschatak
 
Introduction to PIG components
Introduction to PIG components Introduction to PIG components
Introduction to PIG components Rupak Roy
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explainedDmytro Sandu
 
Hadoop online-training
Hadoop online-trainingHadoop online-training
Hadoop online-trainingGeohedrick
 
Introduction to R and R Studio
Introduction to R and R StudioIntroduction to R and R Studio
Introduction to R and R StudioRupak Roy
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hiveSubhas Kumar Ghosh
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringGeorge Ang
 
Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Rupak Roy
 

Was ist angesagt? (18)

Map Reduce Execution Architecture
Map Reduce Execution Architecture Map Reduce Execution Architecture
Map Reduce Execution Architecture
 
Map Reduce basics
Map Reduce basicsMap Reduce basics
Map Reduce basics
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)
 
Unit 2
Unit 2Unit 2
Unit 2
 
Introduction to PIG components
Introduction to PIG components Introduction to PIG components
Introduction to PIG components
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explained
 
Hadoop online-training
Hadoop online-trainingHadoop online-training
Hadoop online-training
 
Cppt
CpptCppt
Cppt
 
Unit 5-apache hive
Unit 5-apache hiveUnit 5-apache hive
Unit 5-apache hive
 
Introduction to R and R Studio
Introduction to R and R StudioIntroduction to R and R Studio
Introduction to R and R Studio
 
Bd class 2 complete
Bd class 2 completeBd class 2 complete
Bd class 2 complete
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means Clustering
 
Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Introduction to hadoop ecosystem
Introduction to hadoop ecosystem
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 

Ähnlich wie Map reducefunnyslide

Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introductionrajsandhu1989
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Applying stratosphere for big data analytics
Applying stratosphere for big data analyticsApplying stratosphere for big data analytics
Applying stratosphere for big data analyticsAvinash Pandu
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010BOSC 2010
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advancedChirag Ahuja
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 

Ähnlich wie Map reducefunnyslide (20)

Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introduction
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Applying stratosphere for big data analytics
Applying stratosphere for big data analyticsApplying stratosphere for big data analytics
Applying stratosphere for big data analytics
 
hadoop
hadoophadoop
hadoop
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 

Kürzlich hochgeladen

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 

Kürzlich hochgeladen (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Map reducefunnyslide

  • 1. The Story of Sam Saliya Ekanayake SALSA HPC Group http://salsahpc.indiana.edu Pervasive Technology Institute, Indiana University, Bloomington
  • 2.  Believed “an apple a day keeps a doctor away” Mother Sam An Apple
  • 3.  Sam thought of “drinking” the apple  He used a knife to cut the apple and a blender to make juice.
  • 4.  (map '( )) ( )  Sam applied his invention to all the fruits he could find in the fruit basket  (reduce '( )) Classical Notion of MapReduce in Functional Programming A list of values mapped into another list of values, which gets reduced into a single value
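The classical notion above can be sketched in a few lines of Python; the fruit names are illustrative stand-ins for the slide's pictures:

```python
from functools import reduce

# map: a list of values -> another list of values (each fruit "cut" into juice)
fruits = ["apple", "orange", "pear"]
juices = [fruit + " juice" for fruit in fruits]

# reduce: a list of values -> a single value (all the juices blended into one mix)
mix = reduce(lambda a, b: a + " + " + b, juices)
print(mix)  # apple juice + orange juice + pear juice
```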
  • 5.  Sam got his first job in JuiceRUs for his talent in making juice  Now, it's not just one basket but a whole container of fruits  Also, they produce a list of juice types separately NOT ENOUGH !!  But, Sam had just ONE knife and ONE blender Large data and list of values for output Wait!
  • 6.  Implemented a parallel version of his innovation (<a, > , <o, > , <p, > , …) Each input to a map is a list of <key, value> pairs Each output of a map is a list of <key, value> pairs (<a', > , <o', > , <p', > , …) Grouped by key Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism) e.g. <a', ( …)> Reduced into a list of values The idea of MapReduce in Data Intensive Computing A list of <key, value> pairs mapped into another list of <key, value> pairs which gets grouped by the key and reduced into a list of values
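The data-intensive notion can be sketched the same way; the grouping dictionary below stands in for the shuffle, and the fruit keys are illustrative:

```python
from collections import defaultdict

def map_fn(fruit):
    # each map input/output is a list of <key, value> pairs
    return [(fruit + "'", 1)]

inputs = ["apple", "orange", "apple", "pear", "apple"]
intermediate = [pair for fruit in inputs for pair in map_fn(fruit)]

# group by key: every pair with the same key ends up in one <key, value-list>
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# reduce each <key, value-list> into a value
results = sorted((key, sum(values)) for key, values in groups.items())
print(results)  # [("apple'", 3), ("orange'", 1), ("pear'", 1)]
```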
  • 7.  Sam realized, ◦ To create his favorite mixed fruit juice he can use a combiner after the reducers ◦ If several <key, value-list> fall into the same group (based on the grouping/hashing algorithm) then use the blender (reducer) separately on each of them ◦ The knife (mapper) and blender (reducer) should not contain residue after use – Side Effect Free ◦ In general, the reducer should be associative and commutative
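Sam's last point is what makes a combiner safe: the reduce function must give the same answer no matter how the value list is split up or ordered. A small check, using addition as a valid reducer and subtraction as an invalid one:

```python
from functools import reduce
from operator import add, sub

values = [3, 1, 4, 1, 5]

# addition is associative and commutative: pre-reducing partial lists
# per mapper and then reducing gives the same answer as one big reduce
assert reduce(add, values) == add(reduce(add, values[:2]), reduce(add, values[2:]))

# subtraction is neither, so it could NOT be used with a combiner
assert reduce(sub, values) != sub(reduce(sub, values[:2]), reduce(sub, values[2:]))
```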
  • 8.  We think Sam was you 
  • 9. Hadoop: Nuts and Bolts Big Data Mining and Analytics Dr. Latifur Khan Department of Computer Science University of Texas at Dallas Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
  • 10.  MapReduce is designed to efficiently process large volumes of data by connecting many commodity computers together to work in parallel  A theoretical 1000-CPU machine would cost a very large amount of money, far more than 1000 single-CPU or 250 quad-core machines  MapReduce ties smaller and more reasonably priced machines together into a single cost-effective commodity cluster 10
  • 11.  MapReduce divides the workload into multiple independent tasks and schedules them across cluster nodes  The work performed by each task is done in isolation from the others  The amount of communication tasks can perform is deliberately limited for scalability reasons 11
  • 12.  In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded in  An underlying distributed file system (e.g., GFS) splits large data files into chunks which are managed by different nodes in the cluster  Even though the file chunks are distributed across several machines, they form a single namespace 12 Input data: A large file Node 1 Chunk of input data Node 2 Chunk of input data Node 3 Chunk of input data
  • 13.  In MapReduce, chunks are processed in isolation by tasks called Mappers  The outputs from the mappers are denoted as intermediate outputs (IOs) and are brought into a second set of tasks called Reducers  The process of bringing together IOs into a set of Reducers is known as the shuffling process  The Reducers produce the final outputs (FOs)  Overall, MapReduce breaks the data flow into two phases: the map phase and the reduce phase  [Diagram: chunks C0–C3 → mappers M0–M3 → intermediate outputs IO0–IO3 → shuffling → reducers R0–R1 → final outputs FO0–FO1]
  • 14.  The programmer in MapReduce has to specify two functions, the map function and the reduce function, that implement the Mapper and the Reducer in a MapReduce program  In MapReduce data elements are always structured as key-value (i.e., (K, V)) pairs  The map and reduce functions receive and emit (K, V) pairs (K, V) Pairs Map Function (K', V') Pairs Reduce Function (K'', V'') Pairs Input Splits Intermediate Outputs Final Outputs
  • 15.  In MapReduce, intermediate output values are not usually reduced together  All values with the same key are presented to a single Reducer together  More specifically, a different subset of intermediate key space is assigned to each Reducer  These subsets are known as partitions Different colors represent different keys (potentially) from different Mappers Partitions are the input to Reducers
  • 16.  In this part, the following concepts of MapReduce will be described:  Basics  A close look at MapReduce data flow  Additional functionality  Scheduling and fault-tolerance in MapReduce  Comparison with existing techniques and models 16
  • 17.  Since its debut on the computing stage, MapReduce has frequently been associated with Hadoop  Hadoop is an open source implementation of MapReduce and is currently enjoying wide popularity  Hadoop presents MapReduce as an analytics engine and under the hood uses a distributed storage layer referred to as Hadoop Distributed File System (HDFS)  HDFS mimics Google File System (GFS) 17
  • 18.  Highly scalable distributed file system for large data-intensive applications. ◦ E.g. 10K nodes, 100 million files, 10 PB  Provides redundant storage of massive amounts of data on cheap and unreliable computers ◦ Files are replicated to handle hardware failure ◦ Detects failures and recovers from them  Provides a platform over which other systems like MapReduce and BigTable operate.
  • 19.  Single Namespace for entire cluster  Data Coherency – Write-once-read-many access model – Client can only append to existing files  Files are broken up into blocks – Typically 128 MB block size – Each block replicated on multiple DataNodes  Intelligent Client – Client can find location of blocks – Client accesses data directly from DataNode
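The block arithmetic implied by this slide can be sketched as follows; the 3x replication factor is an assumed common default, not stated on the slide:

```python
BLOCK_SIZE = 128 * 1024 * 1024   # typical HDFS block size from the slide
REPLICATION = 3                  # assumed common default replication factor

def hdfs_blocks(file_size):
    # number of blocks a file occupies; the last block may be partial
    full, remainder = divmod(file_size, BLOCK_SIZE)
    return full + (1 if remainder else 0)

file_size = 300 * 1024 * 1024                 # a 300 MB file
print(hdfs_blocks(file_size))                 # 3 blocks: 128 + 128 + 44 MB
print(hdfs_blocks(file_size) * REPLICATION)   # 9 block replicas stored in total
```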
  • 20. Secondary NameNode Client HDFS Architecture NameNode DataNodes NameNode: Maps a file to a file-id and a list of DataNodes DataNode: Maps a block-id to a physical location on disk
  • 21. [Data flow diagram, repeated on Node 1 and Node 2: files loaded from the local HDFS store → InputFormat → Splits → RecordReaders (RR) → Map, over input (K, V) pairs → Partitioner → intermediate (K, V) pairs exchanged by all nodes (the shuffling process) → Sort → Reduce → OutputFormat → final (K, V) pairs written back to the local HDFS store]
  • 22.  Input files are where the data for a MapReduce task is initially stored  The input files typically reside in a distributed file system (e.g. HDFS)  The format of input files is arbitrary  Line-based log files  Binary files  Multi-line input records  Or something else entirely 22 file file
  • 23.  How the input files are split up and read is defined by the InputFormat  InputFormat is a class that does the following:  Selects the files that should be used for input  Defines the InputSplits that break a file  Provides a factory for RecordReader objects that read the file 23 file file InputFormat Files loaded from local HDFS store
  • 24.  Several InputFormats are provided with Hadoop: 24
  TextInputFormat: Default format; reads lines of text files. Key: the byte offset of the line. Value: the line contents
  KeyValueInputFormat: Parses lines into (K, V) pairs. Key: everything up to the first tab character. Value: the remainder of the line
  SequenceFileInputFormat: A Hadoop-specific high-performance binary format. Key and Value: user-defined
  • 25.  An input split describes a unit of work that comprises a single map task in a MapReduce program  By default, the InputFormat breaks a file up into 64MB splits  By dividing the file into splits, we allow several map tasks to operate on a single file in parallel  If the file is very large, this can improve performance significantly through parallelism  Each map task corresponds to a single input split file file InputFormat Split Split Split Files loaded from local HDFS store
  • 26.  The input split defines a slice of work but does not describe how to access it  The RecordReader class actually loads data from its source and converts it into (K, V) pairs suitable for reading by Mappers  The RecordReader is invoked repeatedly on the input until the entire split is consumed  Each invocation of the RecordReader leads to another call of the map function defined by the programmer file file InputFormat Split Split Split Files loaded from local HDFS store RR RR RR
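A line-oriented RecordReader can be sketched as a generator that yields one (byte offset, line contents) pair per record, matching the TextInputFormat convention from the earlier table; each yielded pair corresponds to one call of the map function:

```python
def record_reader(split_text):
    # converts a raw split into (K, V) pairs: K is the byte offset of the
    # line, V is the line contents (TextInputFormat's convention)
    offset = 0
    for line in split_text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)

records = list(record_reader("first line\nsecond line\n"))
print(records)  # [(0, 'first line'), (11, 'second line')]
```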
  • 27.  The Mapper performs the user-defined work of the first phase of the MapReduce program  A new instance of Mapper is created for each split  The Reducer performs the user-defined work of the second phase of the MapReduce program  A new instance of Reducer is created for each partition  For each key in the partition assigned to a Reducer, the Reducer is called once file file InputFormat Split Split Split Files loaded from local HDFS store RR RR RR Map Map Map Partitioner Sort Reduce
  • 28.  Each mapper may emit (K, V) pairs to any partition  Therefore, the map nodes must all agree on where to send different pieces of intermediate data  The partitioner class determines which partition a given (K,V) pair will go to  The default partitioner computes a hash value for a given key and assigns it to a partition based on this result file file InputFormat Split Split Split Files loaded from local HDFS store RR RR RR Map Map Map Partitioner Sort Reduce
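The default partitioner can be sketched as hashing the key modulo the number of reducers; the toy byte-sum hash below stands in for Java's hashCode(), since Python's built-in hash() is randomized between runs:

```python
NUM_REDUCERS = 3

def partition(key):
    # hash the key, then map the hash into one of NUM_REDUCERS partitions
    key_hash = sum(key.encode())   # toy deterministic hash, not Java's hashCode()
    return key_hash % NUM_REDUCERS

# every occurrence of a key maps to the same partition, so a single
# reducer sees all of that key's intermediate values
assert partition("apple") == partition("apple")
assert all(0 <= partition(k) < NUM_REDUCERS for k in ["apple", "orange", "pear"])
```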
  • 29.  Each Reducer is responsible for reducing the values associated with (several) intermediate keys  The set of intermediate keys on a single node is automatically sorted by MapReduce before they are presented to the Reducer file file InputFormat Split Split Split Files loaded from local HDFS store RR RR RR Map Map Map Partitioner Sort Reduce
  • 30.  The OutputFormat class defines the way (K,V) pairs produced by Reducers are written to output files  The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS  Several OutputFormats are provided by Hadoop: file file InputFormat Split Split Split Files loaded from local HDFS store RR RR RR Map Map Map Partitioner Sort Reduce OutputFormat
  TextOutputFormat: Default; writes lines in "key \t value" format
  SequenceFileOutputFormat: Writes binary files suitable for reading into subsequent MapReduce jobs
  NullOutputFormat: Generates no output files
  • 31.  In this part, the following concepts of MapReduce will be described:  Basics  A close look at MapReduce data flow  Additional functionality  Scheduling and fault-tolerance in MapReduce  Comparison with existing techniques and models 31
  • 32.  MapReduce applications are limited by the bandwidth available on the cluster  It pays to minimize the data shuffled between map and reduce tasks  Hadoop allows the user to specify a combiner function (just like the reduce function) to be run on a map output MT MT MT MT MT MT RT LEGEND: • R = Rack • N = Node • MT = Map Task • RT = Reduce Task • Y = Year • T = Temperature MT (1950, 0) (1950, 20) (1950, 10) (1950, 20 Map output Combiner output N N R N N R (Y, T)
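For the (Y, T) pairs in the legend, a combiner that keeps the local maximum per year reproduces the slide's example; this assumes the job computes the maximum temperature per year:

```python
def combine(map_output):
    # pre-reduce on the map node: keep only the max temperature per year,
    # so one (Y, T) pair per year crosses the network instead of many
    best = {}
    for year, temp in map_output:
        best[year] = max(temp, best.get(year, temp))
    return sorted(best.items())

map_output = [(1950, 0), (1950, 20), (1950, 10)]
print(combine(map_output))  # [(1950, 20)]
```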
  • 33.  In this part, the following concepts of MapReduce will be described:  Basics  A close look at MapReduce data flow  Additional functionality  Scheduling and fault-tolerance in MapReduce  Comparison with existing techniques and models 33
  • 34.  MapReduce adopts a master-slave architecture  The master node in MapReduce is referred to as Job Tracker (JT)  Each slave node in MapReduce is referred to as Task Tracker (TT)  MapReduce adopts a pull scheduling strategy rather than a push one  I.e., JT does not push map and reduce tasks to TTs but rather TTs pull them by making pertaining requests 34 JT T0 T1 T2 Tasks Queue TT Task Slots TT Task Slots T0 T1
  • 35.  Every TT sends a heartbeat message periodically to JT encompassing a request for a map or a reduce task to run I. Map Task Scheduling:  JT satisfies requests for map tasks via attempting to schedule mappers in the vicinity of their input splits (i.e., it considers locality) II. Reduce Task Scheduling:  However, JT simply assigns the next yet-to-run reduce task to a requesting TT regardless of TT's network location and its implied effect on the reducer's shuffle time (i.e., it does not consider locality) 35
  • 36.  In MapReduce, an application is represented as a job  A job encompasses multiple map and reduce tasks  MapReduce in Hadoop comes with a choice of schedulers:  The default is the FIFO scheduler which schedules jobs in order of submission  There is also a multi-user scheduler called the Fair scheduler which aims to give every user a fair share of the cluster capacity over time 36
  • 37.  MapReduce can guide jobs toward a successful completion even when jobs are run on a large cluster where the probability of failures increases  The primary way that MapReduce achieves fault tolerance is through restarting tasks  If a TT fails to communicate with JT for a period of time (by default, 1 minute in Hadoop), JT will assume that the TT in question has crashed  If the job is still in the map phase, JT asks another TT to re-execute all Mappers that previously ran at the failed TT  If the job is in the reduce phase, JT asks another TT to re-execute all Reducers that were in progress on the failed TT 37
  • 38.  A MapReduce job is dominated by the slowest task  MapReduce attempts to locate slow tasks (stragglers) and run redundant (speculative) tasks that will optimistically commit before the corresponding stragglers  This process is known as speculative execution  Only one copy of a straggler is allowed to be speculated  Whichever copy (among the two copies) of a task commits first, it becomes the definitive copy, and the other copy is killed by JT
  • 39.  How does Hadoop locate stragglers?  Hadoop monitors each task's progress using a progress score between 0 and 1  If a task's progress score is less than (average - 0.2), and the task has run for at least 1 minute, it is marked as a straggler  T1: PS = 2/3 (not a straggler); T2: PS = 1/12 (a straggler)
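The straggler test reduces to a small predicate; with the slide's two progress scores (2/3 and 1/12) the average is 0.375, so only the second task falls below average - 0.2:

```python
def is_straggler(progress, all_scores, runtime_seconds):
    # flagged if progress < (average - 0.2) and runtime >= 1 minute
    average = sum(all_scores) / len(all_scores)
    return runtime_seconds >= 60 and progress < average - 0.2

scores = [2 / 3, 1 / 12]                  # T1 and T2 from the slide
print(is_straggler(2 / 3, scores, 120))   # False: T1 is not a straggler
print(is_straggler(1 / 12, scores, 120))  # True: T2 is a straggler
```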

Editor's notes

  1. By processing a file in chunks, we allow several map tasks to operate on a single file in parallel. If the file is very large, this can improve performance significantly through parallelism
  2. Every value is identified by an associated key