This document provides an overview of Apache Hadoop, a distributed processing framework for large datasets. It describes how Hadoop uses the Hadoop Distributed File System (HDFS) to provide a unified view of large amounts of data across clusters of computers. It also explains how the MapReduce programming model allows distributed computations to be run efficiently across large datasets in parallel. Key aspects of Hadoop's architecture like scalability, fault tolerance and the MapReduce programming model are discussed at a high level.
2. Author
WANdisco, Chief Architect
- NonStop Hadoop
- Free Training
Founder of AltoStor and AltoScale
Hadoop, HDFS at Yahoo! & eBay. Since 2005
Data structures and algorithms for large-scale distributed
storage systems
Apache Hadoop committer and member of PMC
3. Computing
The domain of computing is changing, and so is computing itself
The history of computing started a long time ago
Fascination with numbers
- Vast universe with simple, strict rules
- Computing devices
- Crunch numbers
The Internet
- Universe of words, fuzzy rules
- Different type of computing
- Understand the meaning of things
- Human thinking
- Errors & deviations are a part of the study
Computer History Museum, San Jose
4. Words vs. Numbers
From Big Numbers to Big Data
In 1997 IBM built the Deep Blue supercomputer
- Playing a game of chess against the champion G. Kasparov
- Human race defeated
- Strict rules for chess
- Fast, deep analysis of the current state
In 2011 IBM built the Watson computer to play Jeopardy
- Questions and hints in human terms
- Natural language processing
- Reborn as a diagnostics machine: oncology
5. Big Data
Computations that need the power of many computers
- Large datasets: hundreds of TBs, PBs
- Or use of thousands of CPUs in parallel
- Or both
Cluster as a computer
What is a PB?
1 KB = 1000 Bytes
1 MB = 1000 KB
1 GB = 1000 MB
1 TB = 1000 GB
1 PB = 1000 TB
???? = 1000 PB
6. Examples – Science
Fundamental physics: Large Hadron Collider (LHC)
- Smashing high-energy protons at close to the speed of light
- 1 PB of event data per sec, most filtered out
- 15 PB of data per year
- 160 PB of disk + 90 PB of tape storage
Math: Big Numbers
- The 2 quadrillionth (2 × 10^15) digit of π is 0
- Pure CPU workload: 12 days of cluster time
- 208 years of CPU-time on a cluster with 7600 CPU cores
Healthcare
- Patient records, Sensors, Drug design
- Genome
7. Examples – Web
Search engine
- Webmap - Map of the Internet
- 2008 @ Yahoo, 1500 nodes, 5 PB raw storage
- Internet Search Index
- Traditional Big Data applications
Behavioural Analysis
- Recommendation engine: You may buy this too
- Intelligence: fraud detection
- Sentiment analysis: who will win elections
- Matching interests: you should like him / her
8. The Sorting Problem
Turns into a Big Data problem as the data set grows
Classic in-memory sorting
- Complexity: number of comparisons
               Worst        Average      Space
  Bubble Sort  O(n^2)       O(n^2)       In-place
  Quicksort    O(n^2)       O(n log n)   In-place
  Merge Sort   O(n log n)   O(n log n)   Double
External sorting
- Cannot load all data in memory
- 16 GB RAM vs. 200 GB file
- Complexity: + disk IOs (bytes read or written)
Distributed sorting
- Cannot load data on a single server
- 12 drives * 2 TB = 24 TB disk space vs. 200 TB data set
- Complexity: + network transfers
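The external-sorting case can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a tuned implementation: the function name `external_sort`, the `chunk_size` parameter, and the one-record-per-line text format are all made up for illustration. It sorts memory-sized chunks into temporary run files and then merges the runs, so the extra cost shows up as disk reads and writes.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(input_path, output_path, chunk_size=100_000):
    """Sort a text file line by line, holding at most chunk_size lines in RAM."""
    run_paths = []
    with open(input_path) as src:
        while True:
            # Read the next chunk that fits in memory.
            chunk = list(itertools.islice(src, chunk_size))
            if not chunk:
                break
            # Make sure every record ends with a newline before sorting.
            chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
            chunk.sort()
            # Spill the sorted run to a temporary file on disk.
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)

    # k-way merge of the sorted runs; heapq.merge reads its inputs lazily.
    runs = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as dst:
            dst.writelines(heapq.merge(*runs))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```

Every record is written once to a run file and read once during the merge, which is the "+ disk IOs" term in the complexity.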
10. Hadoop
A reliable, scalable, high performance distributed computing system
Apache Hadoop is an ecosystem of tools for processing “Big Data”
- Started in 2005 by D. Cutting and M. Cafarella
- Scaled by the Yahoo! Hadoop team from a few nodes to thousands (4K-node cluster)
Consists of two main components providing a unified cluster view
1. HDFS – a distributed file system
• File system API connecting thousands of drives
2. MapReduce – a framework for distributed computations
• Splitting jobs into parts executable on one node
• Scheduling and monitoring of job execution
Used everywhere today: becoming a standard of distributed computing
Hadoop is an open source project
11. Hadoop: Architecture Principles
Linear scalability: more nodes can do more work within the same time
- Linear on data size:
- Linear on compute resources:
Move computation to data
- Minimize expensive data transfers
- Data are large, programs are small
Reliability and Availability: Commodity hardware
- 1 drive fails every 3 years => Probability of failing today 1/1000
- How many drives per day fail on 1000 node cluster with 10 drives per node?
Sequential data processing: avoid random reads / writes
Simple computational model
- hides complexity in efficient execution framework
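The drive-failure question on this slide is a one-line expected-value calculation. A small sketch, assuming drives fail independently and rounding 3 years to 1000 days as the slide does (the function name is illustrative):

```python
def expected_daily_drive_failures(nodes, drives_per_node, drive_lifetime_days):
    """Expected number of drives failing per day, assuming independent failures."""
    total_drives = nodes * drives_per_node
    # Each drive fails today with probability 1 / drive_lifetime_days.
    return total_drives / drive_lifetime_days


failures = expected_daily_drive_failures(
    nodes=1000, drives_per_node=10, drive_lifetime_days=1000
)
print(failures)  # 10.0
```

With about ten drive failures expected every day, failure handling has to be a routine part of the framework rather than an exceptional event.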
12. The Hadoop Family
Ecosystem of tools for processing Big Data
HDFS – Distributed file system
YARN, MapReduce – Computational framework
ZooKeeper – Distributed coordination
HBase – Key-value store
Pig – Dataflow language, SQL
Hive – Data warehouse, SQL
Oozie – Complex job workflow
BigTop – Packaging and testing
14. MapReduce
MapReduce
- 2004 Jeffrey Dean, Sanjay Ghemawat. Google.
- “MapReduce: Simplified Data Processing on Large Clusters”
Parallel Computational Model
- Examples of computational models
  • Turing or Post machines. Programming languages – C++, Java
  • Finite automaton, lambda calculus
- Split large input data into small enough pieces, process in parallel
Distributed Execution Framework
- Compilers, interpreters
- Scheduling, Processing, Coordination
- Failure recovery
15. Functional Programming
Map: a higher-order function
- applies a given function to each element of a list
- returns the list of results
Map( f(x), X[1:n] ) -> [ f(X[1]), …, f(X[n]) ]
Example. Map( x^2, [0,1,2,3,4,5] ) = [0,1,4,9,16,25]
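The same example using Python's built-in `map`, as a quick illustration (the helper name `square` is ours):

```python
def square(x):
    return x * x


# map applies square to every element; list() materialises the results.
squares = list(map(square, [0, 1, 2, 3, 4, 5]))
print(squares)  # [0, 1, 4, 9, 16, 25]
```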
16. Functional Programming: reduce
Reduce / fold: a higher-order function
- Iterates a given function over a list of elements
- Applies the function to the previous result and the current element
- Returns a single result
Example. Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15
Example. Reduce( x * y, [0,1,2,3,4,5] ) = 0
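Both reduce examples can be checked with `functools.reduce` from the Python standard library; this is only an illustration of the fold described here:

```python
from functools import reduce
from operator import add, mul

xs = [0, 1, 2, 3, 4, 5]

# reduce folds the list from the left: (((((0 + 1) + 2) + 3) + 4) + 5)
total = reduce(add, xs)
# The product is 0 because the first element is 0.
product = reduce(mul, xs)

print(total, product)  # 15 0
```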
19. Example: Sum of Squares
Composition of
- a map followed by
- a reduce applied to the results of the map
Square Pyramid Number
1 + 4 + … + n^2 = n(n+1)(2n+1) / 6
Example.
- Map( x^2, [1,2,3,4,5] ) = [1,4,9,16,25]
- Reduce( x + y, [1,4,9,16,25] ) = ((((1 + 4) + 9) + 16) + 25) = 55
Map is easily parallelizable
- Compute x^2 for 1,2,3 on one node and for 4,5 on another
Reduce is notoriously sequential
- Need all squares at one node to compute the total sum.
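A small sketch of this composition, with the two "nodes" simulated as two list partitions in one process (an assumption for illustration only; nothing here runs on a cluster):

```python
from functools import reduce
from operator import add


def sum_of_squares(partitions):
    # Map phase: each partition could be squared on a different node.
    mapped = [[x * x for x in part] for part in partitions]
    # Reduce phase: all squares are gathered in one place and folded into a sum.
    gathered = [sq for part in mapped for sq in part]
    return reduce(add, gathered, 0)


n = 5
result = sum_of_squares([[1, 2, 3], [4, 5]])
# Cross-check against the closed form n(n+1)(2n+1)/6.
assert result == n * (n + 1) * (2 * n + 1) // 6
print(result)  # 55
```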
20. Computational Model
Map-Reduce is a Parallel Computational Model
Map-Reduce algorithm = job
Operates with key-value pairs: (k, V)
- Primitive types, Strings or more complex Structures
Map-Reduce job input and output are collections of pairs {(k, V)}
MR Job is defined by 2 functions
map: (k1, v1) → {(k2, v2)}
reduce: (k2, {v2}) → {(k3, v3)}
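To make the two signatures concrete, here is a minimal single-process sketch of a job runner. The name `run_job` and the word-count job are illustrative assumptions, not part of Hadoop's API; the runner only shows how map output is grouped by key before reduce is called.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Run a map-reduce job in one process over a list of (k1, v1) pairs."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):          # map: (k1, v1) -> {(k2, v2)}
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):                  # keys are handed to reduce in sorted order
        output.extend(reduce_fn(k2, groups[k2]))  # reduce: (k2, {v2}) -> {(k3, v3)}
    return output


def word_count_map(_, line):
    return [(word, 1) for word in line.split()]


def word_count_reduce(word, counts):
    return [(word, sum(counts))]


result = run_job(
    [(0, "dogs like cats"), (1, "cats like dogs")],
    word_count_map,
    word_count_reduce,
)
print(result)  # [('cats', 2), ('dogs', 2), ('like', 2)]
```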
23. Computation Framework
Job is executed on a cluster of computers
Two virtual clusters: HDFS and MapReduce
- Physically tightly coupled
- Designed to work together
The Hadoop Distributed File System
- Reliable storage layer
- View data as files and directories
MapReduce as Computation Framework
- Job scheduling
- Resource management
- Lifecycle coordination
- Task execution module
[Figure: JobTracker, NameNode, and three TaskTracker / DataNode pairs holding Tasks and Blocks]
24. HDFS Architecture Principles
The name space is a hierarchy of files and directories
Files are divided into blocks (typically 128 MB)
Namespace (metadata) is decoupled from data
- Fast namespace operations, not slowed down by data streaming
Single NameNode keeps the entire name space in RAM
DataNodes store data blocks on local drives
Blocks are replicated on 3 DataNodes for redundancy and availability
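The block and replica arithmetic can be sketched as below. The round-robin placement is a deliberately simple assumption made for illustration; it is not HDFS's actual placement policy, and the names `place_blocks`, `BLOCK_SIZE`, and `REPLICATION` are ours.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the typical block size quoted above
REPLICATION = 3                  # each block is stored on 3 DataNodes


def place_blocks(file_size, datanodes):
    """Return a list of (block_index, [replica DataNodes]) for one file."""
    if len(datanodes) < REPLICATION:
        raise ValueError("need at least REPLICATION DataNodes")
    n_blocks = math.ceil(file_size / BLOCK_SIZE)
    placement = []
    for b in range(n_blocks):
        # Hypothetical round-robin choice of REPLICATION distinct DataNodes.
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
        placement.append((b, replicas))
    return placement


layout = place_blocks(file_size=1024 ** 3, datanodes=["dn1", "dn2", "dn3", "dn4"])
print(len(layout))  # 8
print(layout[0])    # (0, ['dn1', 'dn2', 'dn3'])
```

The NameNode side of this picture is only the mapping from file to blocks to DataNodes, which is why it can be kept entirely in RAM.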
25. MapReduce Framework
Job input is a file or a set of files in a distributed file system (HDFS)
- Input is split into blocks of roughly the same size
- Blocks are replicated to multiple nodes
- Block holds a list of key-value pairs
Map task is scheduled to one of the nodes containing the block
- Map task input is node-local
- Map task result is node-local
Map task results are grouped: one group per reducer
Each group is sorted
Reduce task is scheduled to a node
- Reduce task transfers the targeted groups from all mapper nodes
- Computes and stores results in a separate HDFS file
Job output is a set of files in HDFS, with #files = #reducers
26. Map Reduce Example: Mean
Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
Input: large text file
Output: average length of words in the file µ
Example: µ({dogs, like, cats}) = 4
27. Mean Mapper
Map input is the set of words {w} in the partition
- Key = null, Value = w
Map computes
- Number of words in the partition
- Total length of the words ∑length(w)
Map output
- <“count”, #words>
- <“length”, #totalLength>
Map(null, w)
    Emit(“count”, 1)
    Emit(“length”, length(w))
28. Single Mean Reducer
Reduce input
- {<key, {value}>}, where
- key = “count”, “length”
- value is an integer
Reduce computes
- Total number of words: N = sum of all “count” values
- Total length of words: L = sum of all “length” values
Reduce Output
- <“count”, N>
- <“length”, L>
The result
- µ = L/N
Reduce(key, {n1, n2, …})
    nRes = n1 + n2 + …
    Emit(key, nRes)
Analyze()
    read(“part-r-00000”)
    print(“mean = ” + L/N)
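The Mean mapper, reducer, and final analysis step can be tried out in a single process. This is a sketch that mirrors the pseudocode, not Hadoop code: the grouping loop in `run` stands in for the shuffle, and all function names are illustrative.

```python
from collections import defaultdict


def mean_map(_, word):
    # Map(null, w): emit one count and the word's length.
    yield ("count", 1)
    yield ("length", len(word))


def sum_reduce(key, values):
    # Reduce(key, {n1, n2, ...}): add up all values for the key.
    yield (key, sum(values))


def run(words):
    groups = defaultdict(list)
    for w in words:
        for k, v in mean_map(None, w):
            groups[k].append(v)        # stand-in for shuffle: group values by key
    return dict(kv for k in groups for kv in sum_reduce(k, groups[k]))


totals = run(["dogs", "like", "cats"])
mean = totals["length"] / totals["count"]   # µ = L / N
print(mean)  # 4.0
```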
29. MapReduce Implementation
A single master JobTracker shepherds the distributed herd of TaskTrackers
1. Job scheduling and resource allocation
2. Job monitoring and job lifecycle coordination
3. Cluster health and resource tracking
Job is defined
- Program: myJob.jar file
- Configuration: job.xml
- Input, output paths
JobClient submits the job to the JobTracker
- Calculates and creates splits based on the input
- Writes myJob.jar and job.xml to HDFS
30. MapReduce Implementation
JobTracker divides the job into tasks: one map task per split.
- Assigns a TaskTracker for each task, collocated with the split
TaskTrackers execute tasks and report status to the JobTracker
- TaskTracker can run multiple map and reduce tasks
- Map and Reduce Slots
Failed attempts reassigned to other TaskTrackers
Job execution status and results reported back to the client
Scheduler lets many jobs run in parallel
31. Example: Standard Deviation
Standard deviation: $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$
Input: large text file
Output: standard deviation σ of word lengths
Example: σ({dogs, like, cats}) = 0
How many jobs?
33. Standard Deviation Mapper
Map input is the set of words {w} in the partition
- Key = null, Value = w
Map computes
- Number of words in the partition
- Total length of the words ∑length(w)
- The sum of lengths squared ∑length(w)^2
Map output
- <“count”, #words>
- <“length”, #totalLength>
- <“squared”, #sumLengthSquared>
Map(null, w)
    Emit(“count”, 1)
    Emit(“length”, length(w))
    Emit(“squared”, length(w)^2)
34. Standard Deviation Reducer
Reduce input
- {<key, {value}>}, where
- key = “count”, “length”, “squared”
- value is an integer
Reduce(key, {n1, n2, …})
    nRes = n1 + n2 + …
    Emit(key, nRes)
Reduce computes
- Total number of words: N = sum of all “count” values
- Total length of words: L = sum of all “length” values
- Sum of length squares: S = sum of all “squared” values
Reduce Output
- <“count”, N>
- <“length”, L>
- <“squared”, S>
The result
- µ = L/N
- σ = sqrt(S/N - µ^2)
Analyze()
    read(“part-r-00000”)
    print(“mean = ” + L/N)
    print(“std.dev = ” + sqrt(S/N - (L*L) / (N*N)))
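The standard-deviation job can be sketched the same way, in one pass over the words. As before, this is a single-process illustration with made-up function names; the grouping loop stands in for the shuffle, and the final step applies σ = sqrt(S/N - µ^2).

```python
import math
from collections import defaultdict


def stddev_map(_, word):
    n = len(word)
    yield ("count", 1)
    yield ("length", n)
    yield ("squared", n * n)


def sum_reduce(key, values):
    yield (key, sum(values))


def word_length_stats(words):
    groups = defaultdict(list)
    for w in words:
        for k, v in stddev_map(None, w):
            groups[k].append(v)        # stand-in for shuffle: group values by key
    t = dict(kv for k in groups for kv in sum_reduce(k, groups[k]))
    n, total_len, sum_sq = t["count"], t["length"], t["squared"]
    mu = total_len / n
    # max(..., 0.0) guards against a tiny negative value caused by rounding.
    sigma = math.sqrt(max(sum_sq / n - mu * mu, 0.0))
    return mu, sigma


print(word_length_stats(["dogs", "like", "cats"]))  # (4.0, 0.0)
```

Because the mapper emits the sum of squares alongside the count and the total length, one job is enough here; the mean does not have to be computed by a separate job first.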
35. Combiner, Partitioner
Combiners perform local aggregation before the shuffle & sort phase
- Optimization to reduce data transfers during shuffle
- In the Mean example, a combiner cuts each map task's output from two pairs per word to only two pairs
Partitioners assign intermediate (map) key-value pairs to reducers
- Responsible for dividing up the intermediate key space
- Not used with a single Reducer
[Figure: Input Data → Map → Combiner → Partitioner → Shuffle & sort → Reduce → Output Data]
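A sketch of what a combiner and a partitioner do for the Mean example. Everything here is a single-process illustration with hypothetical names; in particular, the CRC-based `partition` function is just one deterministic way to spread keys over reducers, not a statement about Hadoop's own partitioner.

```python
import zlib
from collections import Counter


def map_partition(words):
    """Map every word of one input partition to ("count", 1) and ("length", len(w))."""
    for w in words:
        yield ("count", 1)
        yield ("length", len(w))


def combine(pairs):
    """Combiner: sum values per key locally, before anything crosses the network."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())


def partition(key, num_reducers):
    """Partitioner: pick a reducer index from a deterministic hash of the key."""
    return zlib.crc32(key.encode()) % num_reducers


raw = list(map_partition(["dogs", "like", "cats"]))
combined = combine(raw)
print(len(raw), len(combined))  # 6 2
print(combined)                 # [('count', 3), ('length', 12)]
```

With a single reducer every key goes to the same place, so the partitioner has nothing to decide; it only matters once `num_reducers` is greater than one.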
36. Distributed Sorting
Sort a dataset, which cannot be entirely stored on one node.
Input:
- Set of files. 100 byte records.
- The first 10 bytes of each record is the key and the rest is the value.
Output:
- Ordered list of files: f1, … fN
- Each file fi is sorted, and
- If i < j then k ≤ r for any keys k ∈ fi and r ∈ fj
- Concatenation of files in the given order must form a completely sorted record set
37. Naïve MapReduce Sorting
If the output could be stored on one node
The input to any Reducer is always sorted by key
- Shuffle sorts Map outputs
One identity Mapper and one identity Reducer would do the trick
- Identity: <k,v> → <k,v>
[Figure: Input Data → Map → Shuffle → Reduce → Output Data, sorting the words dogs, like, cats]
38. Sorting with Multiple Maps
Multiple identity Mappers and one identity Reducer – same result
- Does not work for multiple Reducers
[Figure: Input Data → multiple Maps → Shuffle → Reduce → Output Data, with the words dogs, cats, like]
39. Sorting: Generalization
Define a hash function, such that
- h: {k} → [1,N]
- Preserves the order: k ≤ s → h(k) ≤ h(s)
- h(k) is a fixed-size prefix of string k (first 2 bytes)
Identity Mapper
With a specialized Partitioner
- Computes the hash h(k) of the key and assigns <k,v> to reducer R_h(k)
Identity Reducer
- Number of reducers is N: R1, …, RN
- Inputs for Ri are all pairs whose key k has h(k) = i
- Ri is an identity reducer, which writes output to HDFS file fi
- Hash function choice guarantees that keys from fi are less than keys from fj if i < j
The algorithm was implemented to win Gray’s Terasort Benchmark in 2008
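The generalized scheme can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark implementation: reducers are numbered from 0 instead of 1, `prefix_partition` scales the first two key bytes onto the reducer range, and the "files" are in-memory lists.

```python
def prefix_partition(key, num_reducers, prefix_len=2):
    """Order-preserving h(k): map the first prefix_len bytes of key to [0, num_reducers)."""
    prefix = key[:prefix_len].ljust(prefix_len, b"\x00")
    value = int.from_bytes(prefix, "big")
    return value * num_reducers // (256 ** prefix_len)


def distributed_sort(records, num_reducers):
    """records: list of (key: bytes, value) pairs. Returns one sorted list per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        # Identity map followed by the specialised partitioner.
        buckets[prefix_partition(key, num_reducers)].append((key, value))
    # The shuffle sorts each reducer's input; the identity reducer writes it out.
    return [sorted(bucket, key=lambda kv: kv[0]) for bucket in buckets]


records = [(b"dogs", 1), (b"like", 2), (b"cats", 3), (b"Zebu", 4), (b"\xf0x", 5)]
files = distributed_sort(records, num_reducers=4)
merged = [kv for f in files for kv in f]
# Concatenating the per-reducer outputs in order yields a fully sorted record set.
assert merged == sorted(records, key=lambda kv: kv[0])
```

Note that with this prefix rule all lowercase ASCII keys land in the same bucket: the output is still correctly sorted, only the load across reducers is unbalanced.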
41. Single NameNode of HDFS
Why Is High Availability Important?
Scheduled downtime dominates Unscheduled
- OS maintenance
- Configuration changes
Reasons for Unscheduled Downtime
- 60 incidents in 500 days on 30,000 nodes
- 24 Full GC – the majority
- System bugs / Bad application / Insufficient resources
- “Data Availability and Durability with HDFS”
Lack of Availability due to Performance Problems
- A handful of nodes can saturate NameNode
43. WANdisco Active-Active Architecture
Fully replicated NameNodes available for reads and writes
Multiple equal-role NameNodes share namespace state via a Coordination Engine
[Figure labels: Proposals, Agreements, Coordinated updates]
44. WANdisco: Scaling Across Data Centers
Continuous availability, and Disaster Recovery over a WAN
Wide Area Network replication
Metadata – online
Data – offline
45. What is Apache HBase
A distributed key-value store for real-time access to semi-structured data
Table: big, sparse, loosely structured
Collection of rows, sorted by row keys
- Rows can have arbitrary number of columns
- Table is split horizontally into Regions
- Dynamic Table partitioning
- Region Servers serve regions to applications
Columns grouped into Column families
- Vertical partition of tables
Distributed Cache:
- Regions are loaded in nodes’ RAM
- Real-time access to data
[Figure: HBase Master, NameNode, and JobTracker; three nodes, each with a TaskTracker, a RegionServer, and a DataNode]
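Because rows are sorted by row key and the table is split horizontally, finding the Region Server for a row is a search over region start keys. The toy routing table below only illustrates that idea; `RegionMap`, the start keys, and the server names are hypothetical, and this is not the HBase client API.

```python
from bisect import bisect_right


class RegionMap:
    """Toy routing table: sorted region start keys -> region server names."""

    def __init__(self, regions):
        # regions: list of (start_key, server); the first start_key must be "".
        self.regions = sorted(regions)
        self.starts = [start for start, _ in self.regions]

    def server_for(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        idx = bisect_right(self.starts, row_key) - 1
        return self.regions[idx][1]


table = RegionMap([("", "rs1"), ("g", "rs2"), ("p", "rs3")])
print(table.server_for("cats"))   # rs1
print(table.server_for("like"))   # rs2
print(table.server_for("zebra"))  # rs3
```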
46. HBase Challenge
Failure of a region requires failover
- Regions reassigned to other Region Servers
- Clients fail over and reconnect to new servers
Regions in high demand
- Many client connections to one server introduce a bottleneck
Good idea to replicate popular regions on multiple Region Servers
- Open Problem: consistent updates
Solution: Coordinated updates
47. Giraffa File System
A distributed highly scalable file system using HDFS and HBase
Challenge: RAM - namespace size limitation
Giraffa is a distributed, highly available file system
Utilizes features of HDFS and HBase
New open source project in experimental stage
48. Giraffa Requirements
Availability – the primary goal
- Load balancing of metadata traffic
Same data streaming speed to / from DataNodes
- Continuous Availability: No SPOF
Cluster operability, management
- Cost of running larger clusters same as a smaller one
More files & more data
                     HDFS          Federated HDFS   Giraffa
Space                25 PB         120 PB           1 EB = 1000 PB
Files + blocks       200 million   1 billion        100 billion
Concurrent Clients   40,000        100,000          1 million
49. Giraffa Architecture
[Figure: Giraffa architecture. An App with a NamespaceAgent talks to the Namespace Service in HBase, which holds the Namespace Table (path, attrs, block[], DN[][]) and a Block Management Processor; the Block Management Layer consists of Block Managers (BM) over DataNodes (DN).]
1. Giraffa client gets files and blocks from HBase
2. Block Manager handles block operations
3. Stream data to or from DataNodes
50. Thank you
Contact: Samantha Leggat | t: 925.396.1194 | samantha.leggat@wandisco.com
WANdisco, Bishop Ranch 8, 5000 Executive Pkwy, Suite 270, San Ramon, CA 94583