5. Client
HDFS Map Reduce
Secondary
Name Node
Data
Blocks
….
Data Node
Name Node Job Tracker
Task Tracker
Map Reduce
Data Node Task Tracker
Map Reduce
Hadoop 1.0 – In Summary
Data Node Data NodeTask Tracker
Map Reduce
Task Tracker
Map Reduce
Slide 5 www.edureka.in/hadoop
6. Hadoop 1.0 - Challenges
Slide 6 www.edureka.in/hadoop
Problem Description
NameNode – No Horizontal
Scalability
Single NameNode and Single Namespaces, limited by
NameNode RAM
NameNode – No High Availability
(HA)
NameNode is Single Point of Failure, Need manual recovery
using Secondary NameNode in case of failure
Job Tracker – Overburdened Spends significant portion of time and effort managing the life
cycle of applications
MRv1 – Only Map and Reduce tasks Humongous Data stored in HDFS remains unutilized and
cannot be used for other workloads such as Graph processing
etc.
7. NameNode – Scale and HA
NameNode - No High Availability
NameNode - No Horizontal Scale
Data
Node
Data
Node
Data
Node
….
Client get Block Locations
Block Management
Read Data
NameNode
NS
Slide 7 www.edureka.in/hadoop
8. Secondary NameNode:
“Not a hot standby” for the NameNode
Connects to NameNode every hour*
Housekeeping, backup of NemeNode metadata
Saved metadata can build a failed NameNode
You give me
metadata every
hour, I will make
it secure
Secondary
NameNode
Single Point
of FailureNameNode
metadata
Slide 8 www.edureka.in/hadoop
metadata
Name Node –Single Point of Failure
9. Job Tracker – Overburdened
CPU
Spends a very significant portion of time and
effort managing the life cycle of applications
Network
Single Listener Thread to communicate
thousands of Map and Reduce Jobs
with
Task Tracker Task Tracker Task Tracker….
Job
Tracker
Slide 9 www.edureka.in/hadoop
10. MRv1 – Unpredictability in Large Clusters
As the cluster size grow and reaches to 4000 Nodes
Cascading Failures
The DataNode failures results in a serious
deterioration of the overall cluster
performance because of attempts to
replicate data and overload live nodes,
through network flooding.
Multi-tenancy
As clusters increase in size, you may want
to employ these clusters for a variety of
models. MRv1 dedicates its nodes to
Hadoop and cannot be re-purposed for
other applications and workloads in an
Organization. With the growing popularity
and adoption of cloud computing among
enterprises, this becomes more important.
Slide 10 www.edureka.in/hadoop
11. Terabytes and Petabytes of data in HDFS can be used only for MapReduce processing
Unutilized Data in HDFS
www.edureka.in/hadoop
12. Hadoop 2.0 New Features
Slide 12 www.edureka.in/hadoop
Other important Hadoop 2.0 features
HDFS Snapshots
NFSv3 access to data in HDFS
Support for running Hadoop on MS Windows
Binary Compatibility for MapReduce applications built on Hadoop 1.0
Substantial amount of Integration testing with rest of the projects (such as
Hadoop ecosystem
PIG, HIVE) in
Property Hadoop 1.0 Hadoop 2.0
Federation One Namenode and
Namespaces
Multiple Namenode and
Namespaces
High Availability Not present Highly Available
YARN - Processing
Control and Multi-
tenancy
JobTracker, Task Tracker Resource Manager, Node
Manager, App Master,
Capacity Scheduler
13. Hadoop 2.0 Cluster Architecture - Federation
Namenode
Block Management
NS
Storage
Datanode Datanode…
NamespaceBlockStorage
Namespace
NS1 NSk NSn
NN-1 NN-k NN-n
Common Storage
Datanode 1
…
Datanode 2
…
Datanode m
…
BlockStorage
Pool 1 Pool k Pool n
Block Pools
… …
Hadoop 1.0 Hadoop 2.0
Slide 13 www.edureka.in/hadoop
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
14. - provides cross-data center (non-
Hlo
eca
lll
o) s
Tup
hp
eo
rr
et
!fo
!r HDFS, allowing
Slide 14 www.edureka.in/hadoop
My name is Annie.
I love quizzes and
puzzles and I am here to
make you guys think and
answer my questions.
How does HDFS Federation help HDFS Scale horizontally?
- reduces the load on any single NameNode by using the multiple,
independent NameNode to manage individual parts of the
filesystem namespace
a cluster administrator to split the Block Storage outside the local
cluster.
Annie’s Question
15. A. In order to scale the name service horizontally, HDFS federation
uses multiple independent Namenodes. The Namenodes are
federated, that is, the Namenodes are independent and don’t require
coordination with each other.
Slide 15 www.edureka.in/hadoop
Annie’s Answer
16. /finance respectively. What will happ
Hen
elif
loyo
Tu
htr
ey
reto
!!put a file in
Slide 16 www.edureka.in/hadoop
My name is Annie.
I love quizzes and
puzzles and I am here to
make you guys think and
answer my questions.
You have configured two name nodes to manage /marketing and
/accounting directory?
Annie’s Question
17. The put will fail. None of the namespace will manage the file and you
will get an IOException with a No such file or directory error.
Slide 17 www.edureka.in/hadoop
Annie’s Answer
18. Node Manager
HDFS
YARN
Resource
Manager
Shared
edit logs
All name space edits
logged to shared NFS
storage; single writer
(fencing)
Read edit logs and applies
to its own namespace
Secondary
Name Node
Data Node
Standby
NameNode
Active
NameNode
Container
App
Master
Node Manager
Data Node
Container
App
Master
Data Node
Container
Data Node
NDoadtae NMoadneager
Client
App
Master
Node Manager
Container
App
Master
Hadoop 2.0 Cluster Architecture - HA
NameNode High
Availability
Slide 18 www.edureka.in/hadoop
Next Generation
MapReduce
HDFS HIGH AVAILABILITY
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
*Not necessary to
configure Secondary
NameNode
19. NameNode Recovery Vs. Failover
High Availability in
Hadoop 2.0
NameNode recovery in
Hadoop 1.0
Secondary
Name Node
Standby
NameNode
Active
NameNode
Secondary
Name Node
NameNode
Edit logs
Meta-Data
Automatic failover
to Standby
NameNode
Manually Recover
using Secondary
NameNode
FSImage
www.edureka.in/hadoop
20. HDFS HA was developed to overcome the following disadvantage in
Hadoop 1.0?
- Single Point Of Failure Of Name•Node
- Only one version can be run in cHlaseslilcoMTaph•eRreed!u!ce
- Too much burden on Job TracMkeyr name is Annie.
I love quizzes and
puzzles and I am here to
make you guys think and
answer my questions.
Slide 20 www.edureka.in/hadoop
Annie’s Question
21. Single Point of Failure of NameNode.
Slide 21 www.edureka.in/hadoop
Annie’s Answer
22. Hadoop 2.0 – In Summary
Client
HDFS
Distributed Data Storage
Active
NameNode
YARN
Resource Manager
Applications
Secondary
Name Node
Standby
NameNode
Distributed Data Processing
Data Node
Node Manager
Container
App
Master …….
Masters
Slaves
Node Manager
Data Node
Container
App
Master
Data Node
Node Manager
Container
App
Master
Shared
edit logs
Scheduler Manager
(AsM)
NameNode High
Availability
www.edureka.in/hadoop
Next Generation
MapReduce
23. YARN and Hadoop Ecosystem
Apache Oozie (Workflow)
HDFS
(Hadoop Distributed File System)
Pig Latin
Data Analysis
Hive
DW System
MapReduce Framework
HBase
Apache Oozie (Workflow)
HDFS
(Hadoop Distributed File System)
Pig Latin
Data Analysis
Hive
DW System
MapReduce Framework HBase
Other
YARN
Frameworks
(MPI, GIRAPH)
framework
Slide 23 www.edureka.in/hadoop
YARN
Cluster Resource Management
YARN adds a more general interface to run non-MapReduce
jobs (such as Graph Processing) within the Hadoop
25. Multi-tenancy - Capacity Scheduler
Slide 25 www.edureka.in/hadoop
Organizes jobs into queues
Queue shares as %’s of cluster
FIFO scheduling within each
queue
Data locality-aware Scheduling
Hierarchical Queues
To manage the resource within an organization.
Capacity Guarantees
A fraction to the total available capacity allocated to each Queue.
Security
To safeguard applications from other users.
Elasticity
Resources are available in a predictable and elastic manner to queues.
Multi-tenancy
Set of limit to prevent over-utilization of resources by a single application.
Operability
Runtime configuration of Queues.
Resource-based scheduling
If needed, Applications can request more resources than the default.
34. YARN was developed to overcome the following disadvantage in
Hadoop 1.0 MapReduce framework?
- Single Point Of Failure Of Name•Node
- Only one version can be run in cHlaseslilcoMTaph•eRreed!u!ce
- Too much burden on Job TracMkeyr name is Annie.
I love quizzes and
puzzles and I am here to
make you guys think and
answer my questions.
Slide 34 www.edureka.in/hadoop
Annie’s Question
35. Too much burden on Job Tracker.
Slide 35 www.edureka.in/hadoop
Annie’s Answer
36. Hello There!!
My name is Annie.
I love quizzes and
puzzles and I am here to
make you guys think and
answer my questions.
In YARN, the functionality of JobTracker has been replaced by which
of the following YARN features:
Slide 36 www.edureka.in/hadoop
- Job Scheduling
- Task Monitoring
- Resource Management
- Node management
Annie’s Question
37. Task Monitoring and Resource Management. The fundamental
idea of YARN is to split up the two major functionalities of the
JobTracker, i.e. resource management and job
scheduling/monitoring, into separate daemons. A global
ResourceManager (RM) for resources and per-application
ApplicationMaster (AM) for task monitoring.
Slide 37 www.edureka.in/hadoop
Annie’s Answer
38. Hello There!!
My name is Annie.
I love quizzes and
puzzles and I am here to
make you guys think and
answer my questions.
In YARN, which of the following daemons takes care of the container
and the resource utilization by the applications?:
Slide 38 www.edureka.in/hadoop
- Node Manager
- Job Tracker
- Task tracker
- Application Master
- Resource manager
Annie’s Question
40. Hello There!!
My name is Annie.
I love quizzes and
puzzles and I am here to
make you guys think and
answer my questions.
Can we run MRv1 Jobs in a YARN enabled Hadoop Cluster?
Slide 40 www.edureka.in/hadoop
- Yes
- No
Annie’s Question
41. Yes. You need to recompile the Jobs in MRv2 after enabling YARN to
run the Job successfully in a YARN enabled Hadoop Cluster.
Slide 41 www.edureka.in/hadoop
Annie’s Answer
42. Hello There!!
My name is Annie.
I love quizzes and
puzzles and I am here to
make you guys think and
answer my questions.
Which of the following YARN/MRv2 daemon is responsible for
launching the tasks?
Slide 42 www.edureka.in/hadoop
- Task Tracker
- Resource Manager
- ApplicationMaster
- ApplicationMasterServer
Annie’s Question
44. Write the MapReduce programs using MRv2 for your Module-3 assignments.
Assignment Programs – WordCount , Patents, Temperature, and Alphabets
Create a Single Node Apache Hadoop Cluster using the documents present in the LMS.
Execute the ‘teragen’ example to test the Cluster
Slide 44 www.edureka.in/hadoop
Module-10 Pre-work
46. www.edureka.in/hadoop
NN High Availability – Quorum Journal Manager
Standby Name
Node
Active Name
Node
Data Node Data Node Data Node Data Node
….
Data Blocks
Journal
Node
Write namespace
modification to edit
logs; only Active Name
Node is the writer
Journal
Node
Journal
Node
Read edit logs and applies to its own
namespace
Data Nodes are configured with the location of both
Name Nodes, and send block location information and
heartbeats to both.