Hadoop Map/Reduce

    Owen O’Malley
      July 2006
Map/Reduce Goals
– Distribution
   • The data is available where needed.
   • Application does not care how many computers
     are being used.
– Reliability
   • Application does not care that computers or
     networks may have temporary or permanent
     failures.
Application Perspective
• Define Mapper and Reducer classes and a
  “launching” program.
• Mapper
  – Is given a stream of (key1, value1) pairs
  – Generates a stream of (key2, value2) pairs
• Reducer
  – Is given a key2 and a stream of value2 values
  – Generates a stream of (key3, value3) pairs
• Launching Program
  – Creates a JobConf to define a job.
  – Submits JobConf to JobTracker and waits for
    completion.
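The three roles above can be sketched without Hadoop itself. This is a minimal, framework-free Python illustration, not the Hadoop API: `max_mapper`, `max_reducer`, and `run_job` are hypothetical names, the sensor data is made up, and the in-memory grouping stands in for the work the framework does between map and reduce.

```python
from collections import defaultdict

def max_mapper(key1, value1):
    """Mapper: takes (line number, 'sensor,reading') and emits (sensor, reading)."""
    sensor, reading = value1.split(",")
    yield sensor, int(reading)

def max_reducer(key2, values2):
    """Reducer: takes a sensor and all of its readings, emits (sensor, max)."""
    yield key2, max(values2)

def run_job(mapper, reducer, records):
    """Hypothetical launcher: run map, group by key2, then run reduce."""
    grouped = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in mapper(key1, value1):
            grouped[key2].append(value2)
    output = []
    for key2 in sorted(grouped):
        output.extend(reducer(key2, grouped[key2]))
    return output

# (key1, value1) pairs: line number and line contents.
records = list(enumerate(["a,3", "b,7", "a,9", "b,2"]))
result = run_job(max_mapper, max_reducer, records)
print(result)
```

In a real job the launcher would only describe the job and hand it to the JobTracker; here everything runs in one process to keep the data flow visible.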
Application Dataflow
  [Diagram: application dataflow; image not preserved]
Input & Output Formats
• The application also chooses input and output
  formats, which define how the persistent data
  is read and written. These are interfaces and
  can be defined by the application.
• InputFormat
  – Splits the input to determine the input to each map
    task.
  – Defines a RecordReader that reads key, value
    pairs that are passed to the map task
• OutputFormat
  – Given the key, value pairs and a filename, writes
    the reduce task output to persistent store.
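As an illustration of the split and record-reading responsibilities, here is a small Python sketch. It is not Hadoop's InputFormat or OutputFormat interface: `get_splits`, `read_records`, and `write_records` are hypothetical names, and the rule that a line belongs to the split in which it begins is an assumed convention for line-oriented text.

```python
import io

def get_splits(data, split_size):
    """Cut the input into (start, length) byte ranges, one per map task."""
    return [(start, min(split_size, len(data) - start))
            for start in range(0, len(data), split_size)]

def read_records(data, start, length):
    """RecordReader sketch: yield (byte offset, line) for lines that begin in the split."""
    end = start + length
    pos = start
    if start != 0:
        # A line that straddles the boundary belongs to the previous split,
        # so skip ahead to the first line that begins at or after `start`.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return
        pos = newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1

def write_records(records, out):
    """OutputFormat sketch: write each pair as a tab-separated line."""
    for key, value in records:
        out.write(f"{key}\t{value}\n")

data = b"alpha\nbeta\ngamma\ndelta\n"
splits = get_splits(data, 8)
per_split = [list(read_records(data, s, n)) for s, n in splits]
out = io.StringIO()
write_records([rec for split in per_split for rec in split], out)
```

Each line is read by exactly one split even though the byte ranges cut through the middle of lines.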
Output Ordering
• The application can control the sort order and
  partitions of the output via
  OutputKeyComparator and Partitioner.
• OutputKeyComparator
   – Defines how to compare serialized keys.
   – Defaults to WritableComparable, but should be
     defined for any application-defined key types.
      • key1.compareTo(key2)
• Partitioner
   – Given a map output key and the number of
     reduces, chooses a reduce.
   – Defaults to HashPartitioner
      • key.hashCode() % numReduces
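The two hooks can be illustrated in plain Python. This is a sketch rather than Hadoop code: `java_string_hash`, `hash_partition`, and `compare_keys` are hypothetical names, and masking off the sign bit is an assumption added here so that a negative hash code still yields a valid reduce index.

```python
from functools import cmp_to_key

def java_string_hash(s):
    """Mimic Java's String.hashCode() for characters in the Basic Multilingual
    Plane: a base-31 polynomial wrapped to a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def hash_partition(key, num_reduces):
    """Choose a reduce for a key. The mask is an assumption made for this
    sketch so that negative hash codes still map to a valid index."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduces

def compare_keys(key1, key2):
    """Comparator in the style of key1.compareTo(key2): negative, zero, or positive."""
    return (key1 > key2) - (key1 < key2)

keys = ["hi", "Owen", "bye", "Owen"]
partitions = {k: hash_partition(k, 4) for k in keys}
ordered = sorted(keys, key=cmp_to_key(compare_keys))
```

Equal keys always land in the same partition, which is what lets a single reduce see every value for a key.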
Combiners
• Combiners are an optimization for jobs with
  reducers that can merge multiple values into
  a single value.
• Typically, the combiner is the same as the
  reducer and runs on the map outputs before they
  are transferred to the reducer’s machine.
• For example, WordCount’s mapper generates
  (word, count) and the combiner and reducer
  generate the sum for each word.
  – Input: “hi Owen bye Owen”
  – Map output: (“hi”, 1), (“Owen”, 1), (“bye”, 1), (“Owen”, 1)
  – Combiner output: (“Owen”, 2), (“bye”, 1), (“hi”, 1)
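The WordCount example above can be reproduced in a few lines of Python. This is an illustrative sketch, not Hadoop code: `wordcount_map` and `sum_counts` are hypothetical names, and the second input string is made up to show the reducer merging already-combined output from two map tasks.

```python
from collections import defaultdict

def wordcount_map(text):
    """Mapper: emit (word, 1) for every word in the input."""
    return [(word, 1) for word in text.split()]

def sum_counts(pairs):
    """Used as both combiner and reducer: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

map_output = wordcount_map("hi Owen bye Owen")
combiner_output = sum_counts(map_output)

# Hypothetical second map task; the reducer merges both combined outputs.
other_output = sum_counts(wordcount_map("bye Owen"))
reduce_output = sum_counts(combiner_output + other_output)
```

Reusing the reducer as the combiner works here because addition is associative and commutative, so partial sums can be summed again without changing the result.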
Process Communication
• Uses a custom RPC implementation
  – Easy to change/extend
  – Defined as Java interfaces
  – Server objects implement the interface
  – Client proxy objects are created automatically
• All messages originate at the client
  – Prevents cycles and therefore deadlocks
• Errors
  – Include timeouts and communication problems.
  – Are signaled to client via IOException.
  – Are NEVER signaled to the server.
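The proxy idea can be sketched in Python, with the caveat that this is only an analogy to the Java mechanism: `JobTrackerServer`, `RpcProxy`, `local_transport`, and `broken_transport` are hypothetical names, and the transport is an in-process function rather than a network connection.

```python
class JobTrackerServer:
    """Hypothetical server object implementing the interface."""
    def heartbeat(self, tracker_name):
        return f"ack {tracker_name}"

class RpcProxy:
    """Client proxy: forwards any method call through a transport callable."""
    def __init__(self, transport):
        self._transport = transport

    def __getattr__(self, method):
        def call(*args):
            try:
                return self._transport(method, args)
            except (TimeoutError, ConnectionError) as exc:
                # Failures surface on the client side only.
                raise IOError(f"RPC {method} failed: {exc}") from exc
        return call

def local_transport(server):
    """Stand-in for the network: dispatch by method name on the server object."""
    return lambda method, args: getattr(server, method)(*args)

def broken_transport(method, args):
    """Simulates a call that never gets a reply."""
    raise TimeoutError("no reply")

proxy = RpcProxy(local_transport(JobTrackerServer()))
reply = proxy.heartbeat("tracker_1")

try:
    RpcProxy(broken_transport).heartbeat("tracker_1")
    failure = None
except IOError as exc:
    failure = str(exc)
```

Because every call starts at the client, the server never has to block waiting on a client, which is the property the slide credits with preventing deadlocks.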
Map/Reduce Processes
• Launching Application
  – User application code
  – Submits a specific kind of Map/Reduce job
• JobTracker
  – Handles all jobs
  – Makes all scheduling decisions
• TaskTracker
  – Manager for all tasks on a given node
• Task
  – Runs an individual map or reduce fragment for a
    given job
  – Forks from the TaskTracker
Process Diagram
  [Diagram: Map/Reduce processes; image not preserved]
Job Control Flow
• Application launcher creates and submits job.
• JobTracker initializes job, creates FileSplits,
  and adds tasks to queue.
• TaskTrackers ask for a new map or reduce
  task every 10 seconds or when the previous
  task finishes.
• As tasks run, the TaskTracker reports status
  to the JobTracker every 10 seconds.
• When job completes, the JobTracker tells the
  TaskTrackers to delete temporary files.
• Application launcher notices job completion
  and stops waiting.
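The pull-based flow above can be simulated in a few lines. This Python sketch makes strong simplifying assumptions that are not in the slides: every task finishes within one polling interval, reduces do not wait for maps, and the class and task names are hypothetical.

```python
from collections import deque

class JobTracker:
    """Toy JobTracker: holds a task queue and the last reported status."""
    def __init__(self, tasks):
        self.queue = deque(tasks)
        self.status = {}

    def request_task(self, tracker):
        """A TaskTracker asks for work; returns None when the queue is empty."""
        if not self.queue:
            return None
        task = self.queue.popleft()
        self.status[task] = f"running on {tracker}"
        return task

    def report(self, task, state):
        """A TaskTracker reports the status of one of its tasks."""
        self.status[task] = state

    def job_complete(self):
        return not self.queue and all(s == "done" for s in self.status.values())

job_tracker = JobTracker(["map_0", "map_1", "reduce_0"])
task_trackers = ["tt_a", "tt_b"]
rounds = 0
while not job_tracker.job_complete():
    rounds += 1  # one polling interval (10 seconds in the slides)
    assigned = [(tt, job_tracker.request_task(tt)) for tt in task_trackers]
    for tt, task in assigned:
        if task is not None:
            job_tracker.report(task, "done")
```

The JobTracker never contacts a TaskTracker on its own initiative here; work is only handed out in response to a request.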
Application Launcher
• Application code to create JobConf and set
  the parameters.
  – Mapper, Reducer classes
  – InputFormat and OutputFormat classes
  – Combiner class, if desired
• Writes JobConf and the application jar to DFS
  and submits job to JobTracker.
• Can exit immediately or wait for the job to
  complete or fail.

JobTracker
• Takes JobConf and creates an instance of
  the InputFormat. Calls the getSplits method to
  generate map inputs.
• Creates a JobInProgress object and a set of
  TaskInProgress (“TIP”) and Task objects.
  – JobInProgress is the status of the job.
  – TaskInProgress is the status of a fragment of
    work.
  – Task is an attempt to do a TIP.
• As TaskTrackers request work, they are given
  Tasks to execute.
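The relationship between the three objects can be modeled with small data classes. This Python sketch is only an illustration: the fields, state names, and ID format are assumptions, and the real Java classes carry far more state.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One attempt to run a TIP."""
    attempt_id: str
    state: str = "running"

@dataclass
class TaskInProgress:
    """Status of one fragment of work; may accumulate several attempts."""
    tip_id: str
    attempts: list = field(default_factory=list)

    def new_attempt(self):
        task = Task(f"{self.tip_id}_{len(self.attempts)}")
        self.attempts.append(task)
        return task

    def is_complete(self):
        return any(t.state == "succeeded" for t in self.attempts)

@dataclass
class JobInProgress:
    """Status of the whole job: complete when every TIP is complete."""
    job_id: str
    tips: list = field(default_factory=list)

    def is_complete(self):
        return all(tip.is_complete() for tip in self.tips)

job = JobInProgress("job_0001", [TaskInProgress("tip_m_0"), TaskInProgress("tip_m_1")])
first = job.tips[0].new_attempt()
first.state = "failed"
retry = job.tips[0].new_attempt()   # a second attempt at the same fragment
retry.state = "succeeded"
job.tips[1].new_attempt().state = "succeeded"
```

Keeping attempts separate from the fragment they execute is what allows a failed attempt to be retried without losing track of the work itself.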
TaskTracker
• All Tasks
  – Create the TaskRunner
  – Copy the job.jar and job.xml from DFS.
  – Localize the JobConf for this Task.
  – Call task.prepare() (details later)
  – Launch the Task in a new JVM under
    TaskTracker.Child.
  – Catch output from Task and log it at the info level.
  – Take Task status updates and send to JobTracker
    every 10 seconds.
  – If job is killed, kill the task.
  – If task dies or completes, tell the JobTracker.
TaskTracker for Reduces
• For Reduces, task.prepare() fetches all of
  the relevant map outputs for this reduce.
• Files are fetched over HTTP from the Jetty
  server of the map’s TaskTracker.
• Files are fetched in parallel threads, but with
  only one fetch at a time per host.
• When fetches fail, a backoff scheme is used
  to keep from overloading TaskTrackers.
• Fetching accounts for the first 33% of the
  reduce’s progress.
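The fetch strategy can be sketched as follows. This Python illustration rests on assumptions the slides do not state: the backoff is taken to be exponential with a doubling delay, one worker thread is used per host, and `fetch_with_backoff`, `fetch_all`, and `fake_fetch` are hypothetical names. The sleep function is injected so the example runs instantly.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def fetch_with_backoff(fetch, location, sleep, max_tries=4, base_delay=1.0):
    """Retry a fetch, doubling the delay after each failure (assumed policy)."""
    delay = base_delay
    for attempt in range(max_tries):
        try:
            return fetch(location)
        except IOError:
            if attempt == max_tries - 1:
                raise
            sleep(delay)
            delay *= 2

def fetch_all(fetch, locations, sleep):
    """Fetch in parallel across hosts, but serially within each host."""
    by_host = defaultdict(list)
    for host, map_id in locations:
        by_host[host].append((host, map_id))

    def fetch_host(host_locations):
        return [fetch_with_backoff(fetch, loc, sleep) for loc in host_locations]

    with ThreadPoolExecutor(max_workers=max(1, len(by_host))) as pool:
        results = pool.map(fetch_host, by_host.values())
    return sorted(item for host_items in results for item in host_items)

failures = {("host_b", "map_2"): 2}   # this output fails twice before succeeding
delays = []

def fake_fetch(location):
    """Stand-in for an HTTP request to a TaskTracker."""
    if failures.get(location, 0) > 0:
        failures[location] -= 1
        raise IOError("connection refused")
    return f"{location[1]}@{location[0]}"

locations = [("host_a", "map_0"), ("host_a", "map_1"), ("host_b", "map_2")]
fetched = fetch_all(fake_fetch, locations, delays.append)
```

Serializing requests per host and backing off after failures both serve the same goal named in the slide: not overloading a TaskTracker that is already struggling.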
Map Tasks
• Use the InputFormat object to create a
  RecordReader from the FileSplit.
• Loop through the keys and values in the
  FileSplit and feed each to the mapper.
• Without a combiner, a SequenceFile is written
  for the keys destined for each reduce.
• With a combiner, the framework buffers
  100,000 keys and values, sorts, combines,
  and writes them to SequenceFiles for each
  reduce.
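The buffer, sort, combine, and write cycle can be sketched in Python. This is an illustration under stated assumptions, not the framework's code: the buffer limit is tiny instead of 100,000, in-memory lists stand in for SequenceFiles, and `toy_partition`, `sum_combiner`, and `run_map_task` are hypothetical names.

```python
from collections import defaultdict

def toy_partition(key, num_reduces):
    """Toy stand-in for a partitioner: deterministic, based on key length."""
    return len(key) % num_reduces

def sum_combiner(sorted_pairs):
    """Merge adjacent pairs with equal keys by summing their values."""
    combined = []
    for key, value in sorted_pairs:
        if combined and combined[-1][0] == key:
            combined[-1] = (key, combined[-1][1] + value)
        else:
            combined.append((key, value))
    return combined

def run_map_task(words, num_reduces, buffer_limit):
    """Emit (word, 1) pairs, flushing the buffer whenever it fills up."""
    outputs = defaultdict(list)   # reduce index -> list of sorted, combined runs
    buffer = []

    def flush():
        by_reduce = defaultdict(list)
        for key, value in sorted(buffer):
            by_reduce[toy_partition(key, num_reduces)].append((key, value))
        for reduce_index, pairs in by_reduce.items():
            outputs[reduce_index].append(sum_combiner(pairs))
        buffer.clear()

    for word in words:
        buffer.append((word, 1))
        if len(buffer) >= buffer_limit:
            flush()
    if buffer:
        flush()
    return dict(outputs)

runs = run_map_task("hi Owen bye Owen hi hi".split(), num_reduces=2, buffer_limit=4)
```

Note that "hi" appears in two separate runs for the same reduce: combining happens per buffer, so the reducer still has to merge partial results.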
Reduce Tasks: Sort
• Sort
  – 33% to 66% of reduce’s progress
  – Base
     • Read 100 MB (io.sort.mb) of keys and values into
       memory.
     • Sort them in memory.
     • Write the sorted run to disk.
  – Merge
     • Read 10 (io.sort.factor) files and merge them into 1 file.
     • Repeat as many times as required (2 levels for 100 files,
       3 levels for 1000 files, etc.)
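The level counts quoted above follow from repeatedly dividing the number of files by the merge factor. The Python sketch below checks that arithmetic and shows the same base-then-merge structure on a small list; the function names are hypothetical, and in-memory lists stand in for files on disk.

```python
import heapq

def merge_levels(num_files, factor=10):
    """Number of merge passes needed to reduce num_files sorted files to one.
    The factor must be at least 2."""
    levels = 0
    while num_files > 1:
        num_files = -(-num_files // factor)   # ceiling division
        levels += 1
    return levels

def merge_pass(runs, factor):
    """Merge sorted runs in groups of `factor`, producing fewer, longer runs."""
    return [list(heapq.merge(*runs[i:i + factor]))
            for i in range(0, len(runs), factor)]

def external_sort(records, chunk_size, factor):
    """Sort fixed-size chunks in memory, then merge until one run remains."""
    runs = [sorted(records[i:i + chunk_size])
            for i in range(0, len(records), chunk_size)]
    while len(runs) > 1:
        runs = merge_pass(runs, factor)
    return runs[0] if runs else []

data = [9, 4, 7, 1, 8, 2, 6, 3, 5, 0]
result = external_sort(data, chunk_size=2, factor=2)
```

With a factor of 10, 100 files shrink to 10 and then to 1 (two levels), and 1000 files need one more pass.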

Reduce Tasks: Reduce
• Reduce
  – 66% to 100% of reduce’s progress
  – Use a SequenceFile.Reader to read sorted input
    and pass to reducer one key at a time along with
    the associated values.
  – Output keys and values are written to the
    OutputFormat object, which usually writes a file to
    DFS.
  – The output from the reduce is NOT re-sorted, so it
    keeps the order and fragmentation of the map output
    keys.
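Handing the reducer one key at a time relies on the input already being sorted, so that equal keys are adjacent. A minimal Python sketch of that grouping, plus the progress split into thirds described across the reduce slides; `run_reduce`, `sum_reducer`, and `reduce_progress` are hypothetical names, and a list stands in for the SequenceFile.

```python
from itertools import groupby
from operator import itemgetter

def run_reduce(sorted_pairs, reducer):
    """Hand the reducer one key at a time, with an iterator over its values."""
    output = []
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = (value for _, value in group)
        output.extend(reducer(key, values))
    return output

def sum_reducer(key, values):
    """Example reducer: sum the values for a key."""
    yield key, sum(values)

def reduce_progress(phase, fraction):
    """Overall progress: fetch, sort, and reduce each cover one third."""
    phases = ["fetch", "sort", "reduce"]
    return (phases.index(phase) + fraction) / 3

sorted_input = [("Owen", 2), ("Owen", 1), ("bye", 1), ("hi", 1)]
output = run_reduce(sorted_input, sum_reducer)
```

The output comes out in the same key order as the sorted input, since nothing re-sorts it after the reducer runs.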
Hadoop Map Reduce Arch

  • 1. Hadoop Map/Reduce Owen O’Malley July 2006
  • 2. Map/Reduce Goals
      – Distribution
          • The data is available where needed.
          • Application does not care how many computers are being used.
      – Reliability
          • Application does not care that computers or networks may have temporary or permanent failures.
  • 3. Application Perspective
      • Define Mapper and Reducer classes and a “launching” program.
      • Mapper
          – Is given a stream of key1, value1 pairs
          – Generates a stream of key2, value2 pairs
      • Reducer
          – Is given a key2 and a stream of value2’s
          – Generates a stream of key3, value3 pairs
      • Launching Program
          – Creates a JobConf to define a job.
          – Submits JobConf to JobTracker and waits for completion.
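The Mapper and Reducer contract on this slide can be sketched as a toy in-memory model. This is Python, not Hadoop's Java API: `run_job`, `word_mapper`, and `sum_reducer` are hypothetical names, and the grouping step merely stands in for the framework's shuffle and sort.

```python
from collections import defaultdict


def word_mapper(key1, value1):
    """Mapper: takes one (key1, value1) pair, yields (key2, value2) pairs."""
    for word in value1.split():
        yield word, 1


def sum_reducer(key2, values):
    """Reducer: takes a key2 and its value2 stream, yields (key3, value3) pairs."""
    yield key2, sum(values)


def run_job(records, mapper, reducer):
    """Hypothetical launcher: map every record, group by key2, then reduce."""
    grouped = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in mapper(key1, value1):
            grouped[key2].append(value2)
    output = []
    for key2 in sorted(grouped):  # keys reach the reducer in sorted order
        output.extend(reducer(key2, grouped[key2]))
    return output


result = run_job([(0, "hi Owen"), (1, "bye Owen")], word_mapper, sum_reducer)
```

Here `result` holds one (word, total) pair per distinct word, ordered by key.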
  • 5. Input & Output Formats
      • The application also chooses input and output formats, which define how the persistent data is read and written. These are interfaces and can be defined by the application.
      • InputFormat
          – Splits the input to determine the input to each map task.
          – Defines a RecordReader that reads key, value pairs that are passed to the map task
      • OutputFormat
          – Given the key, value pairs and a filename, writes the reduce task output to persistent store.
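As a rough illustration of the two InputFormat responsibilities, the sketch below splits an in-memory list of lines into index ranges and reads each range back as key, value pairs. Everything here is an assumption made for illustration: `get_splits` and `read_records` are hypothetical helpers, splits are record ranges rather than file regions, and the line index is used as the key.

```python
def get_splits(num_records, records_per_split):
    """Divide the input into (start, end) ranges, one per map task."""
    if records_per_split < 1:
        raise ValueError("records_per_split must be at least 1")
    return [
        (start, min(start + records_per_split, num_records))
        for start in range(0, num_records, records_per_split)
    ]


def read_records(lines, split):
    """Toy record reader: yields (key, value) pairs for one split."""
    start, end = split
    for index in range(start, end):
        yield index, lines[index]


lines = ["a", "b", "c", "d", "e"]
splits = get_splits(len(lines), 2)
records_per_task = [list(read_records(lines, split)) for split in splits]
```

Each entry of `records_per_task` is the input one map task would see; together the splits cover every record exactly once.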
  • 6. Output Ordering
      • The application can control the sort order and partitions of the output via OutputKeyComparator and Partitioner.
      • OutputKeyComparator
          – Defines how to compare serialized keys.
          – Defaults to WritableComparable, but should be defined for any application defined key types.
              • key1.compareTo(key2)
      • Partitioner
          – Given a map output key and the number of reduces, chooses a reduce.
          – Defaults to HashPartitioner
              • key.hashCode % numReduces
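The `key.hashCode % numReduces` rule can be tried out with a small sketch. It is written in Python, so `java_string_hash` is a hypothetical helper that imitates Java's 32-bit `String.hashCode()` for strings of ordinary (BMP) characters. Masking off the sign bit is an addition made here so that negative hash codes still map to a valid reduce; the slide does not state it.

```python
def java_string_hash(s):
    """Imitate Java's String.hashCode(): 31-based polynomial, 32-bit signed."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed int, as Java would.
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key, num_reduces):
    """Choose a reduce index in [0, num_reduces) from the key's hash code."""
    # Assumption: clear the sign bit so the modulo result is never negative.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduces


partitions = {key: hash_partition(key, 4) for key in ["hi", "Owen", "bye"]}
```

Because the partition depends only on the key, every value for a given key is sent to the same reduce, whichever map task produced it.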
  • 7. Combiners
      • Combiners are an optimization for jobs with reducers that can merge multiple values into a single value.
      • Typically, the combiner is the same as the reducer and runs on the map outputs before they are transferred to the reducer’s machine.
      • For example, WordCount’s mapper generates (word, count) and the combiner and reducer generate the sum for each word.
          – Input: “hi Owen bye Owen”
          – Map output: (“hi”, 1), (“Owen”, 1), (“bye”, 1), (“Owen”, 1)
          – Combiner output: (“Owen”, 2), (“bye”, 1), (“hi”, 1)
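The WordCount example on this slide can be reproduced with a minimal sketch. The function names `map_words` and `combine` are hypothetical, and the code is plain Python rather than Hadoop's API. The same `combine` function is reused as the reducer, which works because summing is associative.

```python
def map_words(line):
    """Mapper: emit (word, 1) for each word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner (and reducer): sum the counts per word, output sorted by key."""
    totals = {}
    for word, count in pairs:
        totals[word] = totals.get(word, 0) + count
    return sorted(totals.items())


map_output = map_words("hi Owen bye Owen")
combined = combine(map_output)

# The reducer sees pre-combined output from several maps and sums again.
other_map = combine(map_words("hi hi"))
reduced = combine(combined + other_map)
```

The combiner shrinks four map output pairs to three before anything is transferred, and the final counts match what the reducer would have computed from the raw map output.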
  • 8. Process Communication
      • Use a custom RPC implementation
          – Easy to change/extend
          – Defined as Java interfaces
          – Server objects implement the interface
          – Client proxy objects automatically created
      • All messages originate at the client
          – Prevents cycles and therefore deadlocks
      • Errors
          – Include timeouts and communication problems.
          – Are signaled to client via IOException.
          – Are NEVER signaled to the server.
  • 9. Map/Reduce Processes
      • Launching Application
          – User application code
          – Submits a specific kind of Map/Reduce job
      • JobTracker
          – Handles all jobs
          – Makes all scheduling decisions
      • TaskTracker
          – Manager for all tasks on a given node
      • Task
          – Runs an individual map or reduce fragment for a given job
          – Forks from the TaskTracker
  • 11. Job Control Flow
      • Application launcher creates and submits job.
      • JobTracker initializes job, creates FileSplits, and adds tasks to queue.
      • TaskTrackers ask for a new map or reduce task every 10 seconds or when the previous task finishes.
      • As tasks run, the TaskTracker reports status to the JobTracker every 10 seconds.
      • When job completes, the JobTracker tells the TaskTrackers to delete temporary files.
      • Application launcher notices job completion and stops waiting.
  • 12. Application Launcher
      • Application code to create JobConf and set the parameters.
          – Mapper, Reducer classes
          – InputFormat and OutputFormat classes
          – Combiner class, if desired
      • Writes JobConf and the application jar to DFS and submits job to JobTracker.
      • Can exit immediately or wait for the job to complete or fail.
  • 13. JobTracker
      • Takes JobConf and creates an instance of the InputFormat. Calls the getSplits method to generate map inputs.
      • Creates a JobInProgress object and a bunch of TaskInProgress “TIP” and Task objects.
          – JobInProgress is the status of the job.
          – TaskInProgress is the status of a fragment of work.
          – Task is an attempt to do a TIP.
      • As TaskTrackers request work, they are given Tasks to execute.
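The relationship between the three objects can be modelled roughly as below. This is a hypothetical Python model, not the JobTracker's real classes: the field names, the `new_attempt` method, and the rule that a failed attempt is followed by a fresh one are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """One attempt to carry out a fragment of work."""
    attempt: int
    succeeded: bool = False


@dataclass
class TaskInProgress:
    """Status of one fragment of work, across all of its attempts."""
    split: str
    attempts: List[Task] = field(default_factory=list)

    def new_attempt(self):
        task = Task(attempt=len(self.attempts))
        self.attempts.append(task)
        return task

    @property
    def complete(self):
        return any(task.succeeded for task in self.attempts)


@dataclass
class JobInProgress:
    """Status of the whole job: complete once every fragment is complete."""
    tips: List[TaskInProgress]

    @property
    def complete(self):
        return all(tip.complete for tip in self.tips)


job = JobInProgress([TaskInProgress("split-0"), TaskInProgress("split-1")])
job.tips[0].new_attempt().succeeded = True
first = job.tips[1].new_attempt()      # assumed to fail, so it stays unsuccessful
done_after_failure = job.complete
job.tips[1].new_attempt().succeeded = True
done_after_retry = job.complete
```

Keeping the TIP separate from its Tasks lets the job track one fragment through several attempts without losing its overall status.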
  • 14. TaskTracker
      • All Tasks
          – Create the TaskRunner
          – Copy the job.jar and job.xml from DFS.
          – Localize the JobConf for this Task.
          – Call task.prepare() (details later)
          – Launch the Task in a new JVM under TaskTracker.Child.
          – Catch output from Task and log it at the info level.
          – Take Task status updates and send to JobTracker every 10 seconds.
          – If job is killed, kill the task.
          – If task dies or completes, tell the JobTracker.
  • 15. TaskTracker for Reduces
      • For Reduces, the task.prepare() fetches all of the relevant map outputs for this reduce.
      • Files are fetched using HTTP from the Jetty server of the map’s TaskTracker.
      • Files are fetched in parallel threads, but only 1 to each host.
      • When fetches fail, a backoff scheme is used to keep from overloading TaskTrackers.
      • Fetching accounts for the first 33% of the reduce’s progress.
  • 16. Map Tasks
      • Use the InputFormat object to create a RecordReader from the FileSplit.
      • Loop through the keys and values in the FileSplit and feed each to the mapper.
      • Without a combiner, a SequenceFile is written for the keys to each reduce.
      • With a combiner, the framework buffers 100,000 keys and values, sorts, combines, and writes them to SequenceFiles for each reduce.
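The buffer, sort, combine, and write sequence for the combiner case can be sketched as follows. This is a simplified in-memory model in Python: `spill` and `map_side_output` are hypothetical names, lists stand in for SequenceFiles, and the length-based `partition` in the usage lines is a placeholder chosen only to keep the example deterministic.

```python
from itertools import groupby
from operator import itemgetter


def spill(buffer, num_reduces, combiner, partition):
    """Sort and combine one full buffer; return one sorted run per reduce."""
    runs = [[] for _ in range(num_reduces)]
    buffer.sort(key=itemgetter(0))
    for key, group in groupby(buffer, key=itemgetter(0)):
        combined = combiner(value for _, value in group)
        runs[partition(key, num_reduces)].append((key, combined))
    return runs


def map_side_output(pairs, num_reduces, combiner, partition, buffer_limit=100_000):
    """Buffer map output and spill each time the buffer fills up."""
    spills, buffer = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= buffer_limit:
            spills.append(spill(buffer, num_reduces, combiner, partition))
            buffer = []
    if buffer:  # flush whatever is left at the end of the split
        spills.append(spill(buffer, num_reduces, combiner, partition))
    return spills


def by_length(key, num_reduces):
    """Placeholder partitioner for the demo only."""
    return len(key) % num_reduces


pairs = [("hi", 1), ("Owen", 1), ("bye", 1), ("Owen", 1)]
one_spill = map_side_output(pairs, 2, sum, by_length)
two_spills = map_side_output(pairs, 2, sum, by_length, buffer_limit=3)
```

With a small buffer the two occurrences of a key can land in different spills, so the combiner only reduces data volume; the reducer still has to produce the final sum.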
  • 17. Reduce Tasks: Sort
      • Sort
          – 33% to 66% of reduce’s progress
          – Base
              • Read 100 MB (io.sort.mb) of keys and values into memory.
              • Sort them in memory.
              • Write to disk.
          – Merge
              • Read 10 (io.sort.factor) files and do a merge into 1 file.
              • Repeat as many times as required (2 levels for 100 files, 3 levels for 1000 files, etc.)
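The level counts quoted on this slide follow from dividing the number of files by the merge factor until one file remains. The sketch below checks that arithmetic and shows a single merge pass over in-memory sorted runs. `merge_levels` and `merge_pass` are hypothetical helpers, and lists stand in for on-disk files.

```python
import heapq


def merge_levels(num_files, factor=10):
    """Count merge passes needed to reduce num_files sorted files to one."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    levels = 0
    while num_files > 1:
        num_files = -(-num_files // factor)  # ceiling division
        levels += 1
    return levels


def merge_pass(runs, factor=10):
    """Merge sorted runs in groups of `factor`, giving fewer, longer runs."""
    return [
        list(heapq.merge(*runs[i:i + factor]))
        for i in range(0, len(runs), factor)
    ]


levels_for_100 = merge_levels(100)
levels_for_1000 = merge_levels(1000)
after_one_pass = merge_pass([[1, 4], [2, 3], [0, 5]], factor=2)
```

Each pass reads and rewrites all of the data once, so a larger merge factor trades more open files for fewer passes.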
  • 18. Reduce Tasks: Reduce
      • Reduce
          – 66% to 100% of reduce’s progress
          – Use a SequenceFile.Reader to read sorted input and pass to reducer one key at a time along with the associated values.
          – Output keys and values are written to the OutputFormat object, which usually writes a file to DFS.
          – The output from the reduce is NOT resorted, so it is in the order and fragmentation of the map output keys.
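Passing the reducer one key at a time along with its values amounts to grouping consecutive records of the sorted input. A minimal sketch, assuming the input is already sorted by key and using a hypothetical `run_reduce` driver with a list in place of the SequenceFile.Reader and OutputFormat:

```python
from itertools import groupby
from operator import itemgetter


def sum_reducer(key, values):
    """Example reducer: emit the total of all values for the key."""
    yield key, sum(values)


def run_reduce(sorted_pairs, reducer):
    """Feed the reducer one key at a time, with that key's values."""
    output = []
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = (value for _, value in group)
        # Output is appended as produced; it is not sorted again.
        output.extend(reducer(key, values))
    return output


sorted_input = [("Owen", 1), ("Owen", 1), ("bye", 1), ("hi", 1)]
reduce_output = run_reduce(sorted_input, sum_reducer)
```

Since the output is written in the order the keys arrive, it inherits the ordering of the map output keys, as the last point on the slide notes.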