This presentation explains the life cycle of a classic MapReduce job and how it works as an efficient data-processing model. For more detailed information, visit http://hashprompt.blogspot.in/search/label/Hadoop
6. SHUFFLE / SORT - MAP SIDE
Mappers run on the unsorted data in the input splits.
Mappers generate multiple key/value pairs.
[Diagram: mappers turn unsorted input data into key/value pairs]
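The map step above can be sketched as a plain function that turns an input record into key/value pairs. This is a minimal illustration, assuming a word-count style job (the classic MapReduce example); the record format and the `(word, 1)` pairs are assumptions, not from the slides.

```python
# A minimal map-function sketch, assuming a word-count style job:
# each input record is an unsorted line of text, and the mapper
# emits one (word, 1) key/value pair per token.
def mapper(record):
    for word in record.split():
        yield (word, 1)

pairs = list(mapper("the quick brown fox the"))
# → [("the", 1), ("quick", 1), ("brown", 1), ("fox", 1), ("the", 1)]
```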
Mappers write their output to a circular memory buffer (default size = 100 MB).
The buffer spills its contents to disk once a threshold is reached (default = 80% of the buffer, i.e. 80 MB).
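The threshold arithmetic above follows from two settings of classic (pre-YARN) MapReduce, `io.sort.mb` and `io.sort.spill.percent`; the snippet below just shows how their defaults combine, as a sketch rather than actual Hadoop configuration code.

```python
# Spill-threshold arithmetic, using the classic configuration names
# io.sort.mb (buffer size, default 100 MB) and io.sort.spill.percent
# (spill threshold fraction, default 0.80).
io_sort_mb = 100        # circular memory buffer size in MB
spill_percent = 0.80    # fraction of the buffer that triggers a spill

spill_threshold_mb = io_sort_mb * spill_percent
# 100 MB * 0.80 = 80 MB, matching the default spill point above
```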
Spilled data is first partitioned.
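Partitioning can be sketched as the default hash scheme: each key is assigned to one partition per reduce task by hashing. Hadoop's `HashPartitioner` uses the key's `hashCode()`; Python's built-in `hash()` stands in for it here as an assumption.

```python
# Sketch of hash partitioning: a key always maps to the same
# partition, and there is one partition per reduce task.
def partition(key, num_reduce_tasks):
    return hash(key) % num_reduce_tasks

# All pairs with the same key land in the same partition,
# so a single reducer sees all values for that key.
p = partition("k1", num_reduce_tasks=3)   # some value in 0..2
```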
Data in each partition is sorted so that identical keys are grouped together, e.g. (k1,v1), (k1,v2).
An optional combiner then merges the values for each key within a partition, e.g. (k1,v1), (k1,v2) → (k1,v3), reducing the amount of data written to disk.
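The sort and combine steps within one partition can be sketched together; this assumes a combiner that simply sums the values, as in word count (the actual combiner is job-specific).

```python
from itertools import groupby
from operator import itemgetter

# Sort a partition by key, then combine values for identical keys --
# a sketch of the sort and optional combine steps, assuming the
# combiner sums values (word-count style).
def sort_and_combine(pairs):
    pairs = sorted(pairs, key=itemgetter(0))      # group same keys together
    return [(k, sum(v for _, v in group))         # combine values per key
            for k, group in groupby(pairs, key=itemgetter(0))]

sort_and_combine([("k2", 1), ("k1", 1), ("k1", 1)])
# → [("k1", 2), ("k2", 1)]
```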
Finally, the output is spilled to the local disk of the tasktracker.
# HashPrompt
7. SHUFFLE / SORT - REDUCE SIDE
[Diagram: sorted map outputs on each tasktracker's local disk are copied to the reducers, merged, and fed in as reduce inputs; the reducers then write the reduce outputs.]
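Because each copied map output is already sorted by key, the reduce side can combine them with a k-way merge instead of a full re-sort. A minimal sketch, using Python's `heapq.merge` and two assumed example runs:

```python
import heapq

# Reduce-side merge sketch: each mapper's spilled output is already
# sorted by key, so the reducer merges the copied runs rather than
# re-sorting everything from scratch.
run_a = [("apple", 1), ("cherry", 2)]   # sorted output from one mapper
run_b = [("banana", 3), ("cherry", 1)]  # sorted output from another mapper

merged = list(heapq.merge(run_a, run_b))
# → [("apple", 1), ("banana", 3), ("cherry", 1), ("cherry", 2)]
```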
8. SUMMARY OF MAPREDUCE DATA FLOW
[Diagram: input files in HDFS are divided into input splits; each split feeds a mapper, whose output is partitioned, sorted, and spilled to disk; the spills are merged, shuffled and sorted, and passed to the reduce tasks, which run reduce and write the output files.]
1. Client stores input files into HDFS.
2. Client submits the job.
3. Input files are split by the client.
4. Jobtracker retrieves the input splits.
5. Input splits are assigned map tasks by the jobtracker.
6. Map outputs are sorted and shuffled.
[Diagram: the user's MapReduce application runs a JobClient in the client JVM on the client node; the jobtracker JVM assigns work and launches map and reduce tasks in child JVMs under the tasktracker JVM; the namenode JVM runs on the namenode, with datanode JVMs on the datanodes.]
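The whole data flow summarized above can be sketched end to end in memory. This is an illustration only, again assuming a word-count job; the splitting, tracking, and JVM machinery of real Hadoop is collapsed into a single function.

```python
from collections import defaultdict

# End-to-end in-memory sketch of the MapReduce data flow, assuming a
# word-count job: split input, map to key/value pairs, partition and
# sort (shuffle), then reduce per key.
def run_job(lines, num_reducers=2):
    # Map phase: one (word, 1) pair per token.
    pairs = [(w, 1) for line in lines for w in line.split()]
    # Shuffle: partition pairs by key, one partition per reducer.
    partitions = defaultdict(list)
    for k, v in pairs:
        partitions[hash(k) % num_reducers].append((k, v))
    # Reduce phase: sort each partition, then sum values per key.
    output = {}
    for part in partitions.values():
        for k, v in sorted(part):
            output[k] = output.get(k, 0) + v
    return output

run_job(["the quick fox", "the fox"])
# → {"the": 2, "quick": 1, "fox": 2}
```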