Hadoop Architecture
Agenda
• Different Hadoop daemons and their roles
• What a Hadoop cluster looks like
• Under the hood: how HDFS writes a file
• Under the hood: how HDFS reads a file
• Under the hood: how HDFS replicates a file
• Under the hood: how a job runs
• How to balance an unbalanced Hadoop cluster
Hadoop – A bit of background
• An open source project
• Based on two technical papers published by Google
• A well-known platform for distributed applications
• Easy to scale out
• Works well with commodity hardware (not entirely true)
• Very well suited for background (batch) applications
Hadoop Architecture
• Two primary components
 Hadoop Distributed File System (HDFS): handles file operations such as read, write, and delete
 Map Reduce engine: handles parallel computation
Hadoop Distributed File System
• Runs on top of the existing file system
• A file is broken into fixed-size blocks, each stored individually
• Designed to handle very large files
• Not well suited to huge numbers of small files (see the block-layout sketch below)
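As a concrete illustration of the block layout, assuming a file has already been loaded at a hypothetical path /user/demo/File.txt, the fsck tool reports how it was split into blocks and where each replica lives:

```bash
# Hypothetical path; the output lists each block of the file together with
# the Data Nodes (and racks) that hold its replicas.
hadoop fsck /user/demo/File.txt -files -blocks -locations -racks
```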
Map Reduce Engine
• A Map Reduce program consists of a map function and a reduce function
• A Map Reduce job is broken into tasks that run in parallel
• Prefers to process data locally whenever possible (a driver sketch follows)
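A minimal driver sketch for the kind of job described above, using the Hadoop 1.x Java API. The class, jar, and path names (FraudCountJob, FraudMapper, FraudReducer, /user/demo/...) are illustrative placeholders, not part of the original deck; the mapper and reducer themselves are sketched under the "Data Processing" slides later.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FraudCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml, mapred-site.xml, hdfs-site.xml
        Job job = new Job(conf, "fraud count");          // Hadoop 1.x style; later versions use Job.getInstance(conf)
        job.setJarByClass(FraudCountJob.class);
        job.setMapperClass(FraudMapper.class);           // sketched under "Data Processing: Map"
        job.setReducerClass(FraudReducer.class);         // sketched under "Data Processing: Reduce"
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/File.txt"));   // hypothetical input
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/results"));  // hypothetical output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```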
Hadoop Server Roles
[Diagram: server roles. Masters: Name Node and Secondary Name Node (HDFS) and Job Tracker (Map Reduce). Slaves: many nodes, each running a Data Node and a Task Tracker. Clients talk to both masters and slaves. Map Reduce provides distributed data analytics; HDFS provides distributed data storage.]
Hadoop Cluster
[Diagram: a multi-rack cluster. Every rack holds Data Node + Task Tracker slaves behind a top-of-rack switch; the Name Node, Job Tracker and Secondary Name Node live in Racks 1-3, a client machine sits in Rack 4, and core switches connect the racks to each other and to the outside world. Cluster and network diagrams adapted from BRAD HEDLUND .com.]
Typical Workflow
• Load data into the cluster (HDFS writes)
• Analyze the data (Map Reduce)
• Store results in the cluster (HDFS writes)
• Read the results from the cluster (HDFS reads)
Sample scenario: how many times did our customers type the word "Fraud" into emails sent to customer service? File.txt is a huge file containing all emails sent to customer service.
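The same workflow expressed as shell commands, assuming the hypothetical FraudCountJob jar and paths from the earlier driver sketch:

```bash
hadoop fs -put File.txt /user/demo/File.txt        # load data into the cluster (HDFS writes)
hadoop jar fraud-count.jar FraudCountJob           # analyze the data (Map Reduce); jar/class names are placeholders
hadoop fs -cat /user/demo/results/part-r-00000     # read the results from the cluster (HDFS reads)
```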
Writing files to HDFS
• Client consults the Name Node
• Client writes a block directly to one Data Node
• Data Nodes replicate the block
• The cycle repeats for the next block
[Diagram: the Client asks the Name Node "I want to write Blocks A, B, C of File.txt" and gets back "OK. Write to Data Nodes 1, 5, 6"; the Client then sends the block to Data Node 1. A minimal client-side API sketch follows.]
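A minimal Java sketch of the client side of this write path, using the HDFS FileSystem API; the NameNode URI and paths are placeholders, not part of the original deck.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // hypothetical Name Node address
        InputStream in = new BufferedInputStream(new FileInputStream("File.txt"));
        // create() asks the Name Node where to place the file; the stream then sends
        // each block to the first Data Node, which pipelines it to the replicas.
        FSDataOutputStream out = fs.create(new Path("/user/demo/File.txt"));
        IOUtils.copyBytes(in, out, 4096, true);     // copy and close both streams
    }
}
```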
Preparing HDFS writes
• The Name Node picks two nodes in the same rack and one node in a different rack (it is rack aware)
• Data protection
• Locality for Map Reduce
[Diagram: the Client tells the Name Node "I want to write Block A of File.txt" and is told "OK. Write to Data Nodes 1, 5, 6" (Data Node 1 in Rack 1; Data Nodes 5 and 6 in Rack 5). The Client asks Data Node 1 to get Data Nodes 5 and 6 ready, Data Node 1 asks Data Node 5, which asks Data Node 6, and "Ready!" acknowledgements flow back to the Client.]
Pipelined Write
• The Data Nodes in the pipeline pass the data along as it is received (rack aware)
• Replication traffic uses TCP port 50010
[Diagram: the Client streams Block A to Data Node 1 in Rack 1, which forwards it to Data Node 5 in Rack 5, which forwards it to Data Node 6 in the same rack, so all three replicas are written in a single pipeline.]
Pipelined Write (continued)
[Diagram: once Block A has been written, the Data Nodes report "Block received", a "Success" acknowledgement goes back to the Client, and the Name Node records in its metadata that Block A of File.txt is stored on DN1, DN2 and DN3.]
Multi-block Replication Pipeline
[Diagram: every block of File.txt (A, B, C) gets its own replication pipeline, each spanning Data Nodes in multiple racks. With 3x replication, a 1 TB file consumes 3 TB of storage and roughly 3 TB of network traffic.]
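The replication factor is a per-file setting; a shell sketch of changing it (the path is a placeholder):

```bash
# 1 TB file x replication factor 3 = roughly 3 TB of raw storage across the cluster.
hadoop fs -setrep -w 3 /user/demo/File.txt   # set the factor to 3 and wait for re-replication to finish
```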
Client writes span the HDFS cluster
• Factors: block size and file size
• More blocks = wider spread
[Diagram: the blocks of File.txt end up scattered across Data Nodes in every rack of the cluster.]
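Block size is fixed per file when the file is written, so the spread can be influenced at load time. A hedged sketch using the Hadoop 1.x property name dfs.block.size; the path is a placeholder:

```bash
# More, smaller blocks spread a file across more Data Nodes (at the cost of more
# Name Node metadata). 134217728 bytes = 128 MB blocks for this one file.
hadoop fs -D dfs.block.size=134217728 -put File.txt /user/demo/File.txt
```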
Hadoop Rack Awareness – Why?
• Never lose all replicas of a block if an entire rack fails
• Keep bulky flows in-rack when possible
• Assumes in-rack links have higher bandwidth and lower latency than cross-rack links
[Diagram: the Name Node's metadata maps File.txt to its block replicas (Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9), and its rack-awareness table maps Data Nodes to Racks 1, 5 and 9, so every block has replicas in more than one rack. A topology-script sketch follows.]
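Rack awareness is not automatic: the Name Node calls an admin-supplied topology script, wired in through the Hadoop 1.x property topology.script.file.name (core-site.xml). A minimal, hypothetical sketch that maps Data Node addresses to rack paths; the subnets are made up:

```bash
#!/bin/bash
# Hadoop passes one or more Data Node IPs/hostnames as arguments;
# print one rack path per argument.
for node in "$@"; do
  case "$node" in
    10.1.1.*) echo "/rack1" ;;
    10.1.5.*) echo "/rack5" ;;
    10.1.9.*) echo "/rack9" ;;
    *)        echo "/default-rack" ;;
  esac
done
```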
Name Node
• Data Nodes send heartbeats over TCP every 3 seconds
• Every 10th heartbeat is a block report
• The Name Node builds its metadata from block reports
• If the Name Node is down, HDFS is down
[Diagram: Data Nodes 1, 2 and 3 report "I'm alive! I have blocks A, C"; the Name Node's file-system metadata records File.txt = A, C and, per Data Node, the blocks it holds.]
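To see what the Name Node currently knows from heartbeats and block reports, Hadoop 1.x ships an admin command:

```bash
# Lists live and dead Data Nodes, their capacity, and how much of it is in use.
hadoop dfsadmin -report
```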
Re-replicating missing replicas
• Missing heartbeats signify lost Data Nodes
• The Name Node consults its metadata and finds the affected data
• The Name Node consults the rack-awareness script
• The Name Node tells a surviving Data Node to re-replicate
[Diagram: a Data Node stops sending heartbeats; the Name Node notices the missing replicas of blocks A and C ("Uh oh! Missing replicas") and instructs a surviving node to copy blocks A and C to Data Node 8 in another rack.]
Secondary Name Node
• Not a hot standby for the Name Node
• Connects to the Name Node every hour*
• Performs housekeeping and keeps a backup copy of the Name Node metadata
• The saved metadata can be used to rebuild a failed Name Node
[Diagram: the Secondary Name Node tells the Name Node "It's been an hour, give me your metadata".]
Client reading files from HDFS
• The Client receives a Data Node list for each block
• The Client picks the first Data Node listed for each block
• The Client reads the blocks sequentially
[Diagram: the Client asks the Name Node for the block locations of Results.txt and gets back Blk A = Data Nodes 1, 5, 6; Blk B = 8, 1, 2; Blk C = 5, 8, 9, taken from the Name Node's metadata; the replicas are spread across Racks 1, 5 and 9. A minimal read-path API sketch follows.]
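A minimal Java sketch of the client read path; as with the write sketch, the NameNode URI and path are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // hypothetical Name Node address
        // open() fetches the block locations from the Name Node; the stream then
        // reads each block from a Data Node that holds a replica of it.
        FSDataInputStream in = fs.open(new Path("/user/demo/Results.txt"));
        IOUtils.copyBytes(in, System.out, 4096, true);   // print the file and close the stream
    }
}
```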
Data Processing: Map
• Map: "run this computation on your local data"
• The Job Tracker delivers Java code to the nodes that hold the data locally
[Diagram: the Client asks "How many times does 'Fraud' appear in File.txt?"; the Job Tracker launches a Map Task on Data Nodes 1, 5 and 9, one per local block (A, B, C), and the tasks report Fraud = 3, Fraud = 0 and Fraud = 11. A mapper sketch follows.]
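A hypothetical mapper for the "Fraud" count, matching the FraudCountJob driver sketched earlier; each map task sees only its local split of File.txt.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Runs on a Data Node holding the block; emits ("fraud", n) for each line of its local split.
public class FraudMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Text KEY = new Text("fraud");

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        int count = 0;
        for (String word : line.toString().toLowerCase().split("\\W+")) {
            if (word.equals("fraud")) {
                count++;
            }
        }
        if (count > 0) {
            ctx.write(KEY, new IntWritable(count));
        }
    }
}
```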
What if the data isn't local?
• The Job Tracker tries to select a node in the same rack as the data
• It relies on the Name Node's rack awareness
[Diagram: no node holding Block A has a free Map task slot ("no Map tasks left"), so the Job Tracker assigns the task to another node in the same rack (Rack 1); that node requests Block A ("I need Block A") from a Data Node in its rack.]
Data Node reading files from HDFS
• The Name Node lists rack-local Data Nodes first
• This leverages in-rack bandwidth and a single switch hop
[Diagram: a Data Node asks the Name Node "Tell me the locations of Block A of File.txt" and receives Block A = Data Nodes 1, 5, 6, ordered by the rack-aware Name Node so that the replicas in the requester's own rack come first.]
Data Processing: Reduce
• Reduce: "run this computation across the Map results"
• Map Tasks deliver their output data over the network
• The Reduce Task's output is written to, and later read from, HDFS
[Diagram: the Map Tasks on Data Nodes 1, 5 and 9 (Fraud = 3, 0, 11) send their results to a Reduce Task on Data Node 3, which sums them to Fraud = 14 and writes Results.txt to HDFS. A matching reducer sketch follows.]
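And the matching hypothetical reducer, which receives the per-block counts over the network and writes the total that ends up in Results.txt.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives the counts emitted by the mappers and writes the grand total to HDFS.
public class FraudReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();                       // e.g. 3 + 0 + 11 = 14 in the slide's example
        }
        ctx.write(key, new IntWritable(total));     // lands in part-r-00000 under the job's output directory
    }
}
```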
Unbalanced Cluster
• Hadoop prefers local processing if possible
• New servers are underutilized for Map Reduce and HDFS*
• You might see more network traffic and slower job times**
*"I'm bored!" (a new, empty Data Node)
**"I was assigned a Map Task but don't have the block. Guess I need to get it."
[Diagram: two newly added racks of empty Data Nodes sit alongside Racks 1 and 2, which hold all the blocks of File.txt.]
Cluster Balancing
• The Balancer utility (if used) runs in the background
• It does not interfere with Map Reduce or HDFS
• Default speed limit: 1 MB/s
brad@cloudera-1:~$ hadoop balancer
[Diagram: after balancing, the blocks of File.txt are spread across the old and the newly added racks.]
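The balancer's behaviour can be tuned; a sketch of the common knobs in Hadoop 1.x:

```bash
# Rebalance until every Data Node is within 10% of the cluster's average utilization.
hadoop balancer -threshold 10
# The 1 MB/s default ceiling comes from dfs.balance.bandwidthPerSec (hdfs-site.xml),
# expressed in bytes per second (1048576 = 1 MB/s); raise it for faster balancing.
```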
Quiz
• If you write a 1 TB file into HDFS with a replication factor of 2, how much storage does HDFS actually need for it?
• True/False? Even if the Name Node goes down, I will still be able to read files from HDFS.
Quiz
• True/False? In a Hadoop cluster, we can have a secondary Job Tracker to improve fault tolerance.
• True/False? If the Job Tracker goes down, you will not be able to write any file into HDFS.
Quiz
• True/False? The Name Node stores the actual data itself.
• True/False? The Name Node can be rebuilt using the Secondary Name Node.
• True/False? If a Data Node goes down, Hadoop takes care of re-replicating the affected data blocks.
Quiz
• In which scenario does one Data Node read data from another Data Node?
• What are the benefits of the Name Node's rack awareness?
• True/False? HDFS is well suited to applications that write a huge number of small files.
Quiz
• True/False? Hadoop balances the cluster automatically.
• True/False? The output of Map tasks is written to HDFS.
• True/False? The output of Reduce tasks is written to HDFS.
Quiz
• True/False? In a production cluster, commodity hardware can be used to set up the Name Node.
• Thank You