The document provides an overview of Hadoop training. It discusses the need for Hadoop in today's data-heavy world, describes what Hadoop is, and covers its ecosystem, including HDFS for storage, MapReduce for processing, and YARN for resource management. It also presents a bank use case and explains the architecture and working of HDFS and MapReduce in processing large datasets in parallel across clusters.
3. What’s in it for you?
Need for Hadoop
What is Hadoop?
Hadoop Ecosystem
Hadoop Features
What is HDFS?
What is MapReduce?
What is YARN?
Bank case study
10. Need for Hadoop
In today’s world, data is growing rapidly from heterogeneous sources like social media, aviation, logistics, e-commerce, etc.
11. Need for Hadoop
All this digital data is expected to reach 163 zettabytes by 2025 (1 ZB = 10^9 TB).
12. Need for Hadoop
Companies face problems in storing and processing such vast volumes of data.
13. Need for Hadoop
The solution is big data technologies such as Hadoop.
15. What is Hadoop?
Hadoop is an open-source framework to store and process huge volumes of data.
It stores large volumes of data across multiple data nodes (DN1, DN2, DN3, DN4) and processes the data in parallel on those nodes.
19. Components of Hadoop
1. HDFS – Distributed data storage
2. MapReduce – Parallel data processing
3. YARN – Cluster resource management
20. Hadoop Ecosystem
[Diagram: the layers of the Hadoop ecosystem]
Data collection and ingestion
Workflow
Pig (scripting)
Hive (SQL query)
Interactive analysis
Machine learning
Streaming
Read/write access to data
Hadoop Distributed File System (storage)
Cluster resource management
Data processing
Management and monitoring
22. Hadoop Features
Hadoop’s features: distributed storage, scalable, fault tolerant, flexible, cost effective, and a robust ecosystem.
Flexible: Hadoop is flexible in storing any type of data, be it structured, semi-structured, or unstructured.
26. Hadoop Features
Robust ecosystem: Hadoop has a robust ecosystem that suits the analytical needs of small and big organizations. It includes Spark, Pig, Hive, Mahout, etc.
28. Hadoop use case
Before the 2008 economic recession, every bank maintained a legacy data warehouse. Home mortgage details, credit card transactions, and other financial details of every customer were restricted to local database systems.
29. Hadoop use case
As a result, banks could not store and process data efficiently and failed to build a comprehensive risk portfolio for their customers.
30. Hadoop use case
After the 2008 economic recession, most financial institutions and national monetary associations started maintaining a single Hadoop cluster containing petabytes of financial data.
32. Hadoop use case
Along with transaction data, the cluster could also store call records, email, chat, and web logs.
33. Hadoop use case
This data is analyzed to perform sentiment analysis, text processing, and pattern matching.
34. Hadoop use case
JP Morgan is a banking and financial giant with services in more than 100 nations. With over 150 petabytes of data, 30,000 databases, and 3.5 billion log records, data is the oil for JP Morgan.
35. Hadoop use case
Storing vast volumes of unstructured data allows the company to collect web logs, transaction data, social media data, etc.
36. Hadoop use case
JP Morgan uses the Hadoop framework for risk management and detecting fraudulent transactions.
37. What is HDFS?
The Hadoop Distributed File System (HDFS) is the storage layer of Hadoop; it stores data in multiple data nodes.
38. What is HDFS?
[Diagram: Big Data distributed across Datanode1, Datanode2, Datanode3, Datanode4]
39. What is HDFS?
In HDFS, data gets divided into multiple blocks, and the blocks are stored on multiple nodes.
40. What is HDFS?
Each block of data is stored on multiple data nodes and by default holds 128 MB of data. For example, 300 MB of data is split into blocks of 128 MB, 128 MB, and 44 MB, each stored on a different datanode.
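As a minimal sketch of that block arithmetic (plain Java, not the Hadoop API; the 128 MB figure corresponds to Hadoop's dfs.blocksize default):

    // Sketch: how a file is divided into HDFS-style blocks.
    // Assumes the default block size of 128 MB (dfs.blocksize).
    public class BlockSplit {
        static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB in bytes

        public static void main(String[] args) {
            long fileSize = 300L * 1024 * 1024; // a 300 MB file
            long fullBlocks = fileSize / BLOCK_SIZE;
            long remainder = fileSize % BLOCK_SIZE;
            for (long i = 0; i < fullBlocks; i++) {
                System.out.println("Block " + (i + 1) + ": 128 MB");
            }
            if (remainder > 0) {
                // The last block holds only the leftover data (here, 44 MB)
                System.out.println("Block " + (fullBlocks + 1) + ": "
                        + remainder / (1024 * 1024) + " MB");
            }
        }
    }

For the 300 MB file this prints two 128 MB blocks and one 44 MB block, matching the example above.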
42. HDFS Architecture
[Diagram: Namenode (master) with a Secondary Namenode; metadata is kept in RAM and on disk in the Edit log and Fsimage, e.g., Metadata (name, replicas, ...): /home/foo/data, 3, ...]
The Namenode holds metadata information about the various Datanodes: their location, the size of each block, etc.
43. HDFS Architecture
The Namenode helps execute file system namespace operations: opening, closing, and renaming files and directories.
44. HDFS Architecture
The Secondary Namenode server is responsible for maintaining a copy of the metadata on disk (the Edit log and Fsimage).
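To make the Namenode's role concrete, here is a hedged sketch using Hadoop's Java FileSystem API to ask for a file's block metadata; the cluster URI and path are placeholder values:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            // hdfs://namenode:9000 and /home/foo/data are placeholder values
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode:9000"), new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/home/foo/data"));
            // The Namenode answers this query from its in-memory metadata
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                        + " length=" + b.getLength()
                        + " hosts=" + String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }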
52. HDFS Write
The client asks the Master, “Where can I write and store my data?” The Master finds the available datanodes (A1–A4 on Rack 1, B1–B4 on Rack 2, C1–C4 on Rack 3).
The 300 MB file is split into blocks of 128 MB, 128 MB, and 44 MB, and the Master replies, “Write the 1st block of data to A3, B2, B4.”
Each data block is replicated thrice on different datanodes.
53. HDFS Write
Similarly, the other two blocks (the second 128 MB block and the 44 MB block) are written onto different datanodes, each replicated thrice.
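A hedged sketch of the client side of such a write, using Hadoop's Java FileSystem API (the cluster URI and file path are placeholder values; block splitting and replication happen behind the scenes):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            // hdfs://namenode:9000 and /data/file.txt are placeholder values
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode:9000"), new Configuration());
            // The Namenode picks the datanodes; the client just streams bytes.
            // HDFS splits the stream into blocks and replicates each block (3x by default).
            FSDataOutputStream out = fs.create(new Path("/data/file.txt"));
            out.write("sample record\n".getBytes());
            out.close();
            fs.close();
        }
    }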
55. HDFS Read
The client tells the Master, “I want to read my file.” The Master finds the datanodes to read from.
56. HDFS Read
The Master replies, “Read data from A2, A3, B1,” and the client fetches the 128 MB, 128 MB, and 44 MB blocks from those datanodes.
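And the client side of a read, again as a hedged sketch with placeholder names; the Namenode resolves which datanodes hold each block before the bytes flow back:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            // hdfs://namenode:9000 and /data/file.txt are placeholder values
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode:9000"), new Configuration());
            FSDataInputStream in = fs.open(new Path("/data/file.txt"));
            // Copy the file's contents to stdout; blocks are fetched from
            // whichever datanodes the Namenode reports as holding them.
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.close();
            fs.close();
        }
    }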
57. Importing data to HDFS
Sqoop is used to import data from relational databases (RDBMS and data warehouses) onto HDFS.
58. Importing data to HDFS
Flume is used to import streaming data from sensors and web servers onto HDFS.
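For example, a typical Sqoop import command looks like the following sketch; the JDBC URL, credentials, table, and target directory are hypothetical:

    sqoop import \
      --connect jdbc:mysql://dbhost/bankdb \
      --username bank_user -P \
      --table transactions \
      --target-dir /bank/transactions \
      -m 4

Here -P prompts for the database password, and -m 4 runs the import with four parallel map tasks.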
60. What is MapReduce?
MapReduce is a programming model to process large datasets in parallel on different nodes. Data is processed simultaneously on different slave nodes (e.g., slave nodes 1–4), coordinated by a master node.
66. MapReduce Example
Input
Square Red Triangle Blue Circle Green
Square Green Triangle White Cube Blue
Cube Yellow Circle Red Cube Blue
Hexagon Green Square Blue Cube Yellow
67. MapReduce Example
Split step: the input is divided into two splits.
Split 1:
Square Red Triangle Blue Circle Green
Square Green Triangle White Cube Blue
Split 2:
Cube Yellow Circle Red Cube Blue
Hexagon Green Square Blue Cube Yellow
68. MapReduce Example
Map step: the map function emits each word as a key with the value 1.
Split 1: Square = 1, Red = 1, Triangle = 1, Blue = 1, Circle = 1, Green = 1, Square = 1, Green = 1, Triangle = 1, White = 1, Cube = 1, Blue = 1
Split 2: Cube = 1, Yellow = 1, Circle = 1, Red = 1, Cube = 1, Blue = 1, Hexagon = 1, Green = 1, Square = 1, Blue = 1, Cube = 1, Yellow = 1
69. MapReduce Example
Merge step: values for the same key are grouped, first within each split, then across both splits.
Split 1: Square = {1,1}, Red = {1}, Triangle = {1,1}, Blue = {1,1}, Circle = {1}, Green = {1,1}, White = {1}, Cube = {1}
Split 2: Cube = {1,1,1}, Yellow = {1,1}, Circle = {1}, Red = {1}, Blue = {1,1}, Hexagon = {1}, Green = {1}, Square = {1}
Merged: Square = {1,1,1}, Red = {1,1}, Triangle = {1,1}, Blue = {1,1,1,1}, Circle = {1,1}, Green = {1,1,1}, White = {1}, Cube = {1,1,1,1}, Yellow = {1,1}, Hexagon = {1}
70. MapReduce Example
Shuffle and sort step: the merged keys are sorted alphabetically.
Blue = {1,1,1,1}, Circle = {1,1}, Cube = {1,1,1,1}, Green = {1,1,1}, Hexagon = {1}, Red = {1,1}, Square = {1,1,1}, Triangle = {1,1}, White = {1}, Yellow = {1,1}
Reduce step: the values for each key are summed to produce the final counts.
Blue = 4, Circle = 2, Cube = 4, Green = 3, Hexagon = 1, Red = 2, Square = 3, Triangle = 2, White = 1, Yellow = 2
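This walkthrough is the classic word count. Below is a hedged sketch of how it could be written with Hadoop's Java MapReduce API; the class name and input/output paths are illustrative, not from the slides. The combiner plays the role of the per-split merge step above.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ShapeColorCount {

        // Map step: emit (word, 1) for every word in the input split
        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE); // e.g., (Square, 1)
                }
            }
        }

        // Reduce step: sum the 1s for each key, e.g., Blue = {1,1,1,1} -> 4
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "shape color count");
            job.setJarByClass(ShapeColorCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class); // per-split merge, as in the example
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/input/shapes.txt"));  // placeholder
            FileOutputFormat.setOutputPath(job, new Path("/output/counts"));   // placeholder
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Running job.waitForCompletion(true) submits the job to the cluster, where YARN (described next) schedules it.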
71. YARN – Yet Another Resource Negotiator
YARN was introduced in Hadoop 2.0 to solve the issues in Hadoop 1.0 (MR 1)
such as scalability, availability of nodes, resource utilization, etc.
72. YARN – Yet Another Resource Negotiator
YARN is the cluster resource management layer of Hadoop that schedules jobs and assigns resources to running applications.
73. YARN – Yet Another Resource Negotiator
A running application, such as a MapReduce application, is assigned memory (RAM) and CPU.
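As a hedged illustration of requesting such resources, a MapReduce job can declare per-task memory and CPU through standard configuration properties; the values below are arbitrary examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class YarnResourceDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Per-task resource requests that YARN's ResourceManager grants
            // as containers (values are illustrative):
            conf.setInt("mapreduce.map.memory.mb", 2048);    // 2 GB per map task
            conf.setInt("mapreduce.reduce.memory.mb", 4096); // 4 GB per reduce task
            conf.setInt("mapreduce.map.cpu.vcores", 1);      // 1 virtual core per map
            Job job = Job.getInstance(conf, "yarn resource demo");
            // ... set the mapper, reducer, and paths as usual, then submit:
            // job.waitForCompletion(true);
        }
    }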
77. YARN – Yet Another Resource Negotiator
[Diagram: a Client submits a job request to the ResourceManager; NodeManagers host containers, and an App Master runs inside a container on a NodeManager]
Each NodeManager sends its status to the ResourceManager.
78. YARN – Yet Another Resource Negotiator
The ApplicationMaster contacts the related NodeManagers, and the MapReduce status is reported back.
79. YARN – Yet Another Resource Negotiator
The ApplicationMaster sends resource requests to the ResourceManager, and a container executes the ApplicationMaster.
80. Bank case study
You own a virtual bank that generates a lot of customer transaction data and uses an RDBMS to store it.
81. Bank case study
But the bank’s data is rapidly increasing, and the RDBMS has become inefficient at handling such large volumes of data.
82. Bank case study
You need a solution to move the bank’s data from the traditional RDBMS to more flexible and scalable storage.
83. Bank case study
What if I use the Hadoop Distributed File System (HDFS) to store the data?
84. Bank case study
HDFS can easily store large volumes of data. So, let me use Sqoop to move all the bank’s data from the RDBMS onto HDFS.
85. Bank case study
This will also allow us to analyze customer data using Sqoop commands. Now, let’s see how to do this.
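As an illustration, one such command could be a hedged sqoop eval sketch like the following, which runs a SQL query against the source database and prints the result; the connection details, table, and query are hypothetical:

    sqoop eval \
      --connect jdbc:mysql://dbhost/bankdb \
      --username bank_user -P \
      --query "SELECT customer_id, SUM(amount) FROM transactions GROUP BY customer_id"

Note that sqoop eval evaluates the query on the RDBMS side; sqoop import (shown earlier) is what moves the data onto HDFS.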