2. What is Hadoop
Hadoop is a framework and system for parallel processing of large amounts of data in a distributed computing environment
http://searchbusinessintelligence.techtarget.in/tutorial/Apache-Hadoop-FAQ-for-BI-professionals
Apache project
open source
java based
clone of Google's systems
GFS -> HDFS
MapReduce -> MapReduce
3. Distributed Processing System
How to process data in a distributed environment
how to read/write data
how to control nodes
load balancing
Monitoring
node status
task status
Fault tolerance
error detection
process error, network error, hardware error, …
error handling
temporary error: retry -> risk of duplication, data corruption, …
permanent error: fail over (to which node?)
process hang: timeout & retry
• timeout too long -> long response time
• timeout too short -> endless retry loop
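The timeout trade-off above can be made concrete with a toy simulation. This is an illustrative sketch, not Hadoop scheduler code; `run_with_retry` and its parameters are hypothetical:

```python
def run_with_retry(durations, timeout, max_retries=5):
    """Simulate timeout & retry: durations[i] is how long attempt i
    would take. An attempt 'succeeds' only if it finishes within
    `timeout`; otherwise it is treated as hung, killed, and retried.
    (Toy model, not the Hadoop task scheduler.)"""
    for attempt, duration in enumerate(durations[:max_retries], start=1):
        if duration <= timeout:
            return attempt       # finished before the timeout
    return None                  # every attempt timed out -> give up

# A task that genuinely needs 8 time units:
# timeout too long would just delay failure detection;
# timeout too short kills the task every time -> endless retries.
assert run_with_retry([8, 8, 8], timeout=10) == 1
assert run_with_retry([8, 8, 8], timeout=5) is None
```

The second assertion shows the "too short" failure mode: a healthy but slow task never gets to finish.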
4. Hadoop System Architecture
HDFS + MapReduce
[Diagram: masters – Name Node (with Secondary Name Node) and Job Tracker; slaves – three nodes, each running a Data Node and a Task Tracker; slaves send heartbeats to the masters and handle data reads/writes. Legend: Node, Process, Heart Beat, Data Read/Write]
5. HDFS
vs. Filesystem
inode – namespace
cylinder / track – data node
blocks (bytes) – blocks (MBytes)
Features
very large files
write once, read many times
support for usual file system operations
ls, cp, mv, rm, chmod, chown, put, cat, …
no support for multiple writers or arbitrary modifications
7. HDFS - Read
Data Read
1. Read request: Client -> Name Node
2. Response: Name Node returns block locations
3. Request: Client -> Data Nodes
4. Read data from the Data Nodes
[Diagram: Client, Name Node, three Data Nodes. Legend: Node, Data Block, Data I/O, Operation Message]
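The read path above can be sketched as a toy simulation. This is not the real HDFS client API; the dictionaries and `hdfs_read` are hypothetical stand-ins that show the key point: the Name Node serves only metadata, while block bytes come straight from Data Nodes.

```python
# Hypothetical metadata table held by the Name Node:
# filename -> ordered list of (block id, data node) locations.
namenode = {
    "/logs/a.txt": [("blk_1", "dn1"), ("blk_2", "dn3")],
}
# Hypothetical block storage on each Data Node.
datanodes = {
    "dn1": {"blk_1": b"hello "},
    "dn3": {"blk_2": b"world"},
}

def hdfs_read(path):
    # Steps 1-2: ask the Name Node for the block locations.
    locations = namenode[path]
    # Steps 3-4: fetch each block directly from its Data Node, in order.
    return b"".join(datanodes[dn][blk] for blk, dn in locations)

assert hdfs_read("/logs/a.txt") == b"hello world"
```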
8. HDFS - Write
Data Write
1. Write request: Client -> Name Node
2. Response: Name Node returns the target Data Nodes
3. Write data: Client -> first Data Node
4. Write replica: each Data Node forwards the block to the next Data Node
5. Write done: acknowledgment returns to the Client
[Diagram: Client, Name Node, three Data Nodes. Legend: Node, Data Block, Data I/O, Operation Message]
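The replication chain in steps 3-5 can be sketched as follows. This is a toy model, not Hadoop code; `write_pipeline` and the node names are hypothetical:

```python
def write_pipeline(block, pipeline, storage):
    """Simulate steps 3-5: the client sends `block` to the first Data
    Node; each node stores it and forwards a replica to the next node
    in `pipeline` (the ordered list chosen by the Name Node in step 2).
    The ack propagates back once the chain completes."""
    for dn in pipeline:
        storage.setdefault(dn, []).append(block)   # store, then forward
    return "Write Done"                            # step 5 ack to client

storage = {}
ack = write_pipeline("blk_7", ["dn1", "dn2", "dn3"], storage)
assert ack == "Write Done"
assert all("blk_7" in storage[dn] for dn in ("dn1", "dn2", "dn3"))
```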
9. HDFS – Write (Failure)
Data Write (one replica fails)
Steps 1-3 proceed as in the normal write; in step 4, replication to one Data Node fails, but the client still receives "5. Write Done" for the replicas that succeeded
[Diagram: same write pipeline with one failed Data Node. Legend: Node, Data Block, Data I/O, Operation Message]
10. HDFS – Write (Failure)
Data Write (recovery after failure)
Replica arrangement: the partial block on the failed Data Node is deleted, and a new replica is written to another Data Node
[Diagram: Name Node directs deletion of the partial block and a replica write to a healthy Data Node. Legend: Node, Data Block, Data I/O, Operation Message]
11. MapReduce
Definition
map: (+1) [ 1, 2, 3, 4, …, 10 ] -> [ 2, 3, 4, 5, …, 11 ]
reduce: (+) [ 2, 3, 4, 5, …, 11 ] -> 65
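The definition above translates directly into Python: `map` applies (+1) element-wise, and `reduce` folds the result with (+).

```python
from functools import reduce

# map: (+1) over [1, 2, ..., 10] -> [2, 3, ..., 11]
mapped = list(map(lambda x: x + 1, range(1, 11)))
# reduce: (+) over [2, 3, ..., 11] -> 65
total = reduce(lambda a, b: a + b, mapped)

assert mapped[0] == 2 and mapped[-1] == 11
assert total == 65
```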
Programming Model for processing data sets in Hadoop
projection, filter -> map task
aggregation, join -> reduce task
sort -> partitioning
Job Tracker & Task Trackers
master / slave
job = many tasks
# of map tasks = # of file splits (default: # of blocks)
# of reduce tasks = user configuration
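The default split rule above means the map-task count follows from the file and block sizes. A minimal sketch, assuming the common 128 MB block size (the value is configurable, and older versions defaulted to 64 MB):

```python
import math

def num_map_tasks(file_size, block_size=128 * 1024 * 1024):
    """One map task per block-sized split (the default split rule)."""
    return math.ceil(file_size / block_size)

# A 1 GB file with 128 MB blocks yields 8 splits -> 8 map tasks.
assert num_map_tasks(1024 * 1024 * 1024) == 8
# A tiny file is still one split, hence one map task.
assert num_map_tasks(1) == 1
```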
12. MapReduce
Map / Reduce Task
[Diagram: input data records in splits on the distributed file system feed map tasks; map output records (key/value pairs) are grouped into partitions, shuffled & sorted, then reduce tasks write reduce output records (key/value pairs) back to the distributed file system]
19. Mapper - partitioning
double-indexed structure
output buffer (default: 100 MB): serialized key value records
1st index: per record – partition, key offset, value offset
2nd index: key offsets, used for sorting
spill thread
data sorting: quick sort on the 2nd index
spill file generation: spill data file & index file
flush
merge sort (by key) per partition across the spill files
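The spill-and-merge behavior above can be sketched with a toy version of the buffer. This is not Hadoop's actual `MapOutputBuffer`; the partitioner and functions are hypothetical, but the ordering is the real one: records sort by (partition, key) within a spill, and spills are merge-sorted with the same order.

```python
from heapq import merge

NUM_REDUCERS = 2

def partition(key):
    """Hypothetical partitioner: assign integer keys round-robin."""
    return key % NUM_REDUCERS

def spill(buffer):
    """The quick-sort step: order a full buffer by (partition, key)."""
    return sorted(buffer, key=lambda kv: (partition(kv[0]), kv[0]))

def final_merge(spills):
    """The flush step: merge sort the spill files, same ordering."""
    return list(merge(*spills, key=lambda kv: (partition(kv[0]), kv[0])))

s1 = spill([(3, "c"), (0, "a")])   # keys 0, 2 -> partition 0; 1, 3 -> partition 1
s2 = spill([(2, "b"), (1, "d")])
out = final_merge([s1, s2])
# All partition-0 records come first, each partition sorted by key.
assert out == [(0, "a"), (2, "b"), (1, "d"), (3, "c")]
```

Because every spill is already sorted, the final pass is a cheap k-way merge rather than a full re-sort.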
29. Distributed Processing System – Hadoop's answers
How to process data in a distributed environment
how to read/write data & control nodes -> HDFS client, master / slave, replication / rack awareness
load balancing -> job scheduler
Monitoring
node status -> heart beat
task status -> job/task status, reporter / metrics
Fault tolerance -> black list, timeout & retry, speculative execution
error detection
process error, network error, hardware error, …
error handling
temporary error: retry -> risk of duplication, data corruption, …
permanent error: black list & fail over
process hang: timeout & retry, speculative execution
• timeout too long -> long response time
• timeout too short -> endless retry loop
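Speculative execution, mentioned above as the answer to stragglers, can be modeled in a few lines. A toy sketch, not the Hadoop scheduler; the function and parameters are hypothetical:

```python
def speculative_run(primary_time, backup_time, launch_backup_at):
    """When a task attempt looks like a straggler, launch a duplicate
    (backup) attempt at time `launch_backup_at` and take whichever
    attempt finishes first (all times in arbitrary units)."""
    backup_finish = launch_backup_at + backup_time
    return min(primary_time, backup_finish)

# A straggler needing 100 units; a backup launched at t=10 needs 20:
assert speculative_run(100, 20, launch_backup_at=10) == 30
# A healthy primary is unaffected by the backup attempt:
assert speculative_run(15, 20, launch_backup_at=10) == 15
```

The trade-off is duplicated work: the losing attempt's effort is thrown away, which is why speculation is worthwhile only for genuine stragglers.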
33. Limitations
map -> reduce network overhead (shuffle)
iterative processing (each iteration runs as a separate job)
full (or theta) join
data that is small in size but has many splits
Low latency
polling & pulling
job initialization
optimized for throughput, not latency
job scheduling
data access