Naveen P.N
Trainer
NPN Training. Training is the essence of success and we are committed to it.
www.npntraining.com
Module 01 - Understanding Big Data and Hadoop
Includes (Hadoop 1.x & 2.x Architecture)
Topics for the Module
What is Big Data
OLTP VS OLAP
Limitation of existing Data Analytics
Moving Data into Code
Moving Code into Data
Hadoop 1.0 / 2.0 Core Components
Hadoop 2.0 Core Components
Hadoop Master Slave Architecture
After completing the module, you will be able to understand:
File Blocks
Rack Awareness
Anatomy of File Read and Write
Hadoop 1.x Challenges
Scala REPL
Scala Variable Types
Big Data is the term for a collection of data sets so large and complex that they
become difficult to process using traditional data processing applications.
What is Big Data
www.npntraining.com/masters-program/big-data-architect-training.php
Where Is This "Big Data" Coming From?
 12+ TBs of tweet data every day
 25+ TBs of log data every day
 ? TBs of data every day
 2+ billion people on the Web by end 2011
 30 billion RFID tags today (1.3B in 2005)
 4.6 billion camera phones worldwide
 100s of millions of GPS-enabled devices sold annually
 76 million smart meters in 2009… 200M by 2014
www.npntraining.com/masters-program/big-data-architect-training.php
About RDBMS
Why do I need an RDBMS?
 For quick response times.
 It enables relationships between data elements to be defined and managed.
 It enables one database to be utilized for all applications.
If data is presently stored in an RDBMS, then what is the problem? Why did the problem of Big Data arise?
www.npntraining.com/masters-program/big-data-architect-training.php
OLTP VS OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that
OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.
www.npntraining.com/masters-program/big-data-architect-training.php
"Big Data are high-volume, high-velocity, and/or high-variety information assets that require new
forms of processing to enable enhanced decision making, insight discovery and process
optimization”
Big Data spans three dimensions (3Vs)
www.npntraining.com/masters-program/big-data-architect-training.php
Limitation of Existing Data Analytics Architecture
Storage-only Grid / SAN (Raw Data) --> ETL Compute Grid --> RDBMS (Aggregated Data)
 Can't explore the original high-fidelity raw data
 Premature data death: 90% of data is archived, and a meagre 10% of data is available for BI
www.npntraining.com/masters-program/big-data-architect-training.php
Solution: A Combined Storage Compute Layer
Hadoop: Storage + Compute Grid --> RDBMS (Aggregated Data) --> BI Reports + Interactive Apps
 Scalable throughput for ETL & aggregation
 Data exploration & advanced analytics
 No data archiving: keep data alive forever
 Both storage and compute grid together
 The entire data set is available for processing
www.npntraining.com/masters-program/big-data-architect-training.php
Processing Data in Enterprise: Traditional Approach
Processing 1 TB of data on 1 machine with 4 I/O channels, each channel at 100 MB/s, takes about 45 minutes.
Limitation: a single machine's aggregate I/O bandwidth caps the processing rate.
www.npntraining.com/masters-program/big-data-architect-training.php
Processing Data in DFS: Hadoop Approach
Processing 1 TB of data across 10 machines, each with 4 I/O channels at 100 MB/s per channel, takes about 4.3 minutes.
www.npntraining.com/masters-program/big-data-architect-training.php
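The arithmetic behind the two figures above is worth making explicit. A small illustrative Java sketch (the machine counts and the 100 MB/s channel speed are the assumptions stated on the slides, not measurements):

public class ScanTime {
    public static void main(String[] args) {
        double dataMb = 1024.0 * 1024;   // 1 TB expressed in MB
        double channelMbPerSec = 100.0;  // throughput of one I/O channel
        int channels = 4;                // channels per machine

        // One machine: all 4 channels read in parallel
        double oneMachineMin = dataMb / (channelMbPerSec * channels) / 60;
        // Ten machines: each machine scans one tenth of the data
        double tenMachinesMin = oneMachineMin / 10;

        System.out.printf("1 machine  : %.1f minutes%n", oneMachineMin);   // ~43.7
        System.out.printf("10 machines: %.1f minutes%n", tenMachinesMin);  // ~4.4
    }
}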
What is Apache Hadoop
Apache Hadoop is a framework that allows for the
distributed processing of large data sets across clusters of
commodity computers using a simple programming
model.
To solve the Big Data problem, a new framework has evolved: Hadoop.
Hadoop provides:
 Commodity Hardware
 Big Cluster
 Map Reduce
 Failover
 Data distribution
 Moving code to data
 Heterogeneous Hardware
 Scalable
Hadoop is based on work done by Google in the early 2000s
 Google File System (GFS) paper published in 2003
 MapReduce paper published in 2004
It is an architecture that can scale with huge volumes,
variety and speed requirements of Big Data by distributing
the work across dozens, hundreds, or even thousands of
commodity servers that process the data in parallel.
www.npntraining.com/masters-program/big-data-architect-training.php
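To illustrate the "simple programming model", here is the canonical WordCount Mapper written against the Hadoop MapReduce API; a minimal sketch, with the package declaration and job wiring omitted:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word of every input line; the framework
// handles input splitting, distribution, scheduling and retries.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}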
Moving data into code
A client with terabytes of data wants to analyze it.
Traditional data processing architecture:
 Nodes are broken up into separate processing and storage nodes connected by a high-capacity link.
 Many data-intensive applications are CPU-demanding, and moving the data to them causes bottlenecks in the network.
 Latency in transferring data.
www.npntraining.com/masters-program/big-data-architect-training.php
Moving code to data
A client that wants to analyze the data writes MapReduce jobs, and the jobs are shipped to the nodes that hold the data.
Hadoop takes a radically new approach to the problem of distributed computing:
 Distribute the data to multiple nodes.
 Distribute the program for computation to these multiple nodes.
 Individual nodes then work on the data that stays on their nodes.
 No data transfer over the network is required for the initial processing.
 Additional nodes can be added for scalability.
Distribution Vendors
 Cloudera Distribution for Hadoop (CDH)
 MapR Distribution
 Hortonworks Data Platform
 Apache BigTop Distribution
www.npntraining.com/masters-program/big-data-architect-training.php
Hadoop 1.0 Core Components
Hadoop has two main components:
1. HDFS – Hadoop Distributed File System (Storage)
2. MapReduce (Processing)
HDFS (Storage): responsible for storing the data in chunks, by splitting files into blocks of 64 MB each.
HDFS daemons:
 NameNode (Master)
 DataNode (Slave)
 Secondary NameNode
MapReduce (Processing): processes the data in a massively parallel manner.
MapReduce daemons:
 JobTracker (Master)
 TaskTracker (Slave)
www.npntraining.com/masters-program/big-data-architect-training.php
Hadoop 2.0 Core Components
Hadoop 2.0 has two main components:
1. HDFS – Hadoop Distributed File System (Storage)
2. YARN/MRv2 (Processing)
HDFS (Storage): responsible for storing the data in chunks, by splitting files into blocks of 128 MB each.
HDFS daemons:
 NameNode (Master)
 DataNode (Slave)
 Secondary NameNode
YARN/MRv2 (Processing): processes the data in a massively parallel manner.
YARN daemons:
 ResourceManager (Master)
 NodeManager (Slave)
www.npntraining.com/masters-program/big-data-architect-training.php
HDFS – Hadoop Distributed File System
HDFS is a distributed and scalable file system designed for storing very large files with
streaming data access patterns, running on clusters of commodity hardware.
The HDFS architecture follows a Master/Slave pattern, where a cluster comprises a single NameNode
(Master node) and a number of DataNodes (Slave nodes).
For comparison, in a conventional file system each and every file is divided into blocks as small as 512 bytes.
www.npntraining.com/masters-program/big-data-architect-training.php
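A client interacts with this Master/Slave architecture through Hadoop's FileSystem API. A minimal read sketch; the NameNode address hdfs://namenode:8020 and the path /weblog.dat are placeholders, not values from a real cluster:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client asks the NameNode only for metadata; the actual
        // block contents are streamed directly from the DataNodes.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/weblog.dat"))))) {
            in.lines().forEach(System.out::println);
        }
    }
}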
File Blocks
By default, the block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.
Why is the block size so large?
 The main reason for having large HDFS blocks is the cost of seek time relative to transfer time.
 The large block size also accounts for proper usage of storage space while considering the limit on the
memory of the NameNode, which keeps per-block metadata in RAM.
www.npntraining.com/masters-program/big-data-architect-training.php
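These defaults can be overridden per client. A hedged sketch; dfs.blocksize and dfs.replication are the standard Hadoop 2.x property keys, and the 256 MB value is only an example:

import org.apache.hadoop.conf.Configuration;

public class BlockSizeConfig {
    public static Configuration clientConf() {
        Configuration conf = new Configuration();
        // dfs.blocksize applies to files created by this client;
        // existing files keep the block size they were written with.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB
        conf.setInt("dfs.replication", 3);                 // replicas per block
        return conf;
    }
}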
1.0 Master Slave Architecture – Simple cluster setup with Hadoop Daemons
Master daemons:
 NameNode: single box
 JobTracker: single box
 Secondary NameNode: single box (optionally the NameNode and Secondary NameNode can be placed in two boxes)
Slave daemons, each in a separate box (many):
 Slave1, Slave2, Slave3, …: each runs a DataNode and a TaskTracker
website : www.npntraining.com
2.0 Master Slave Architecture – Simple cluster setup with Hadoop Daemons
Master daemons:
 NameNode (Active): single box
 NameNode (Standby): single box
 ResourceManager: single box
 Secondary NameNode: single box
Slave daemons, each in a separate box (many):
 Slave1, Slave2, Slave3, …: each runs a DataNode and a NodeManager
website : www.npntraining.com
Master Slave Architecture – Simple cluster setup with Hadoop Daemons
Master: NameNode (HDFS) + ResourceManager (MapReduce/YARN)
Slaves: Slave1, Slave2, Slave3, each running a DataNode (HDFS) and a NodeManager (YARN)
www.npntraining.com/masters-program/big-data-architect-training.php
Hadoop Cluster: A Typical Use Case
www.npntraining.com/masters-program/big-data-architect-training.php
File Blocks in HDFS
A client that wants to save 400 MB of data into the cluster/HDFS communicates with the Master Node (NameNode), which decides which nodes to write the data to. The first copy is always stored on nodes in close proximity to the client.
In HDFS the data is broken into blocks; with the 128 MB block size shown in the figure, the 400 MB file becomes three 128 MB blocks plus one 16 MB block.
Hadoop creates 3 replicas of each block by default (configurable), which achieves fault tolerance.
website : www.npntraining.com
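The block split on this slide is simple integer arithmetic; a small sketch that reproduces the figure (400 MB file, 128 MB blocks, replication factor 3):

public class BlockMath {
    public static void main(String[] args) {
        long fileMb = 400, blockMb = 128, replication = 3;
        long fullBlocks = fileMb / blockMb;       // 3 blocks of 128 MB
        long lastBlockMb = fileMb % blockMb;      // 1 trailing block of 16 MB
        long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);
        System.out.println(totalBlocks + " blocks, "
                + totalBlocks * replication + " replicas, "
                + fileMb * replication + " MB of raw storage");
        // -> 4 blocks, 12 replicas, 1200 MB of raw storage
    }
}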
File Blocks in HDFS Contd…
Example: the NameNode splits a 200 MB file, weblog.dat, into block b1 (128 MB) and block b2 (72 MB). With replication factor 3, each block gets three replicas (b1r1, b1r2, b1r3 and b2r1, b2r2, b2r3). The NameNode's namespace records which slave holds each replica: b1r1 and b2r1 on s1, b1r2 and b2r2 on s2, b1r3 and b2r3 on s3.
website : www.npntraining.com
Rack Awareness & Replication Factor
A rack is a group of servers placed in a single location. Hadoop writes one replica in one rack and the other replicas in a different rack (administrators can change this placement), so HDFS also provides rack-level fault tolerance.
www.npntraining.com/masters-program/big-data-architect-training.php
NameNode
The NameNode does not store the files themselves, only the file metadata.
The NameNode keeps track of all file-system-related information (metadata) such as:
 Block locations
 Information about file permissions and ownership
 Last access time for each file
 User permissions, i.e. which users have access to a file
The NameNode oversees the health of the DataNodes and coordinates access to the data stored on them.
The entire metadata is held in main memory.
www.npntraining.com/masters-program/big-data-architect-training.php
NameNode Metadata
The entire metadata is held in main memory; there is no demand paging of file-system metadata.
The NameNode maintains two files:
1. fsimage
2. edit log
The fsimage is a file that represents a point-in-time snapshot of the filesystem's metadata. While the
fsimage file format is very efficient to read, it is unsuitable for making small incremental updates such as
renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the
NameNode records each modifying operation in the edit log for durability.
www.npntraining.com/masters-program/big-data-architect-training.php
Secondary NameNode or CheckPoint Node
The Secondary NameNode periodically pulls the metadata (fsimage and edit logs) from the NameNode.
 It is not a hot standby for the NameNode.
 It connects to the NameNode every hour (by default).
 It performs housekeeping and keeps a backup of the NameNode metadata.
 The saved metadata can be used to rebuild a failed NameNode.
www.npntraining.com/masters-program/big-data-architect-training.php
Hadoop Components Contd…
Master Node: NameNode (with Secondary NameNode) and ResourceManager. Slave Nodes: a DataNode and a NodeManager on each.
NameNode:
 Maintains and manages the blocks present on the Slave Nodes.
 Periodically receives a Heartbeat and a Block report from each of the DataNodes in the cluster; a Heartbeat, sent once every 3 seconds, implies that the DataNode is functioning properly.
 The HDFS architecture is built in such a way that user data is never stored on the NameNode; it stores only metadata.
 Records the metadata of all the files stored in the cluster, e.g. the location, the size of the files, permissions, hierarchy, etc.
DataNode:
 Performs the low-level read and write requests from the file system's clients.
 Responsible for creating blocks, deleting blocks and replicating them based on the decisions taken by the NameNode.
Anatomy of File Write – High Level
A user wants to write data to Hadoop:
hdfs dfs -put 2016-apache-logs.txt /
 The client "cuts" the input file into chunks of "block size".
 The client then contacts the NameNode to request the write operation, sending the number of blocks and the replication factor.
 The NameNode responds with a pipeline of DataNodes to write each block's replicas to.
 The client reaches out to the first DataNode in the pipeline and performs the write.
 No actual data transfer takes place through the NameNode.
 The client writes blocks in parallel: all the blocks are written at a time, not one by one.
www.npntraining.com/masters-program/big-data-architect-training.php
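Programmatically, the same write path is exercised through FileSystem.create(). A minimal sketch; the cluster address is a placeholder and the log line is invented for illustration:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf);
             // create() asks the NameNode for a DataNode pipeline; the bytes
             // are streamed to the first DataNode, which forwards them on.
             FSDataOutputStream out = fs.create(new Path("/2016-apache-logs.txt"))) {
            out.write("127.0.0.1 - GET /index.html\n"
                    .getBytes(StandardCharsets.UTF_8));
        }
    }
}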
Anatomy of File Write – Full Example
2016-apache-logs.txt: 200 MB file, block size 64 MB, replication factor 3. (In the figure, the Hadoop client splits the file into chunks before writing.)
The client sends a -put request to the NameNode, which responds with a write pipeline for each block:
 blk_000 to DN1, DN5, DN6
 blk_001 to DN4, DN8, DN9
 blk_002 to DN7, DN3, DN3
Nine DataNodes (DataNode1 through DataNode9) are arranged in racks; the client streams each block to the first DataNode in its pipeline, which forwards it along the pipeline.
website : www.npntraining.com
Anatomy of File Read – Full Example
2016-apache-logs.txt: 200 MB file, block size 64 MB, replication factor 3.
The Hadoop client sends a -get request to the NameNode, which responds with the location of a replica for each block:
 blk_000 from DN1
 blk_001 from DN4
 blk_002 from DN7
The client then reads the blocks directly from those DataNodes.
website : www.npntraining.com
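The block-to-DataNode mapping that the NameNode returns for a read can also be inspected from the client API. A sketch using getFileBlockLocations; the cluster address is again a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(new Path("/2016-apache-logs.txt"));
            // One BlockLocation per block: offset, length and replica hosts.
            for (BlockLocation blk
                    : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(blk);
            }
        }
    }
}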
In HDFS, blocks of a file are written in parallel; however,
the replication of each block is done sequentially.
a) True
b) False
Hadoop is a framework that allows for the distributed
processing of:
a) Small data sets
b) Large data sets
A file of 400 MB is being copied to HDFS. The system has
finished copying 250 MB. What happens if a client tries to
access that file?
a) Can read up to the block that has been successfully written
b) Can read up to the last bit successfully written
c) Will throw an exception
d) Cannot see that file until copying is finished
www.npntraining.com/masters-program/big-data-architect-training.php
Hadoop Eco-System
www.npntraining.com/masters-program/big-data-architect-training.php
What could be the limitations of Hadoop 1 / Gen 1?
 Can a Hadoop 1.x cluster have multiple HDFS Namespaces?
 Can you use anything other than MapReduce for processing in Hadoop 1.x?
Which of the following are significant disadvantages in Hadoop 1.0?
a) Single Point of Failure on the NameNode
b) Too much burden on the JobTracker
www.npntraining.com/masters-program/big-data-architect-training.php
Hadoop 1.x - Challenges
NameNode – No Horizontal Scalability
A single NameNode and a single namespace, limited by NameNode RAM.
NameNode – No High Availability (HA)
The NameNode is a Single Point of Failure; manual recovery using the Secondary NameNode is needed in case
of failure.
JobTracker - Overburdened
Spends a significant portion of time and effort managing the life cycle of applications.
MRv1 – Only Map & Reduce tasks
Humongous data stored in HDFS remains underutilized and cannot be used for other workloads
such as graph processing.
www.npntraining.com/masters-program/big-data-architect-training.php
Limitation 1 – No Horizontal Scalability
A single NameNode runs and manages a single namespace, maintaining all metadata in RAM. Whether a
cluster has 100 or 1,000 slaves, they are managed by that single NameNode; the maximum tested was
around 4,000 servers with a single NameNode and a single namespace.
Let's assume the /VOICE directory has too many files and folders, so we configure a separate NameNode
for this directory:
/VOICE/...  NameNode01
/SMS/...  NameNode02
/Data/...  NameNode03
So based on the directory structure we can configure NameNodes; in Hadoop 2 around 10,000 servers can
be configured, because each NameNode separately manages its own part of the directory structure. That
is why we call it Federation.
www.npntraining.com/masters-program/big-data-architect-training.php
Hadoop 2.x Architecture - Federation
Hadoop 1.x has a single NameNode for the entire namespace; Hadoop 2.x can run multiple independent NameNodes (NameNode1, NameNode2, NameNode3), each managing a part of it.
www.npntraining.com/masters-program/big-data-architect-training.php
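On the client side, a federated namespace is commonly stitched together with ViewFS mount tables. A hedged sketch reusing the directory split from the previous slide; the NameNode host names, the port and the cluster name clusterX are assumptions:

import org.apache.hadoop.conf.Configuration;

public class FederationClientConf {
    public static Configuration viewFsConf() {
        Configuration conf = new Configuration();
        // viewfs:// presents one logical namespace to clients...
        conf.set("fs.defaultFS", "viewfs://clusterX");
        // ...while each mount point is served by its own NameNode.
        conf.set("fs.viewfs.mounttable.clusterX.link./VOICE", "hdfs://nn01:8020/VOICE");
        conf.set("fs.viewfs.mounttable.clusterX.link./SMS",   "hdfs://nn02:8020/SMS");
        conf.set("fs.viewfs.mounttable.clusterX.link./Data",  "hdfs://nn03:8020/Data");
        return conf;
    }
}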
How does HDFS Federation help HDFS scale horizontally?
It reduces the load on any single NameNode by using
multiple, independent NameNodes to manage individual
parts of the file system namespace.
You have configured two NameNodes to manage
/voice and /sms respectively. What will happen if you try to
put a file in the /lte directory?
The put will fail. No namespace manages that file,
and you will get an IOException with a "no such file or
directory" error.
www.npntraining.com/masters-program/big-data-architect-training.php
Limitation 2 – No High Availability
If you lose the NameNode you lose the cluster details; manual intervention is required to
start a new NameNode and copy the backup from the Secondary NameNode.
Problem:
10:00 am --> backup to the Secondary NameNode
10:45 am --> NameNode breaks down --> you can only recover data up to 10:00 am from the Secondary NameNode (the problem in Gen 1)
Solution:
High Availability: Active and Standby NameNodes manage the same data at any given point in time.
--> In case the Active NameNode fails, the Standby NameNode becomes Active and serves requests.
www.npntraining.com/masters-program/big-data-architect-training.php
Hadoop 2.x Architecture - HA
https://hadoop.apache.org/docs/r2.5.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
www.npntraining.com/masters-program/big-data-architect-training.php
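The Active/Standby pair from the linked page is exposed to clients through a handful of configuration keys. A hedged sketch; the nameservice id mycluster and the host names are assumptions, while the failover proxy provider class is the one shipped with HDFS:

import org.apache.hadoop.conf.Configuration;

public class HaClientConf {
    public static Configuration haConf() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster");       // logical nameservice
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");  // Active + Standby
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "master1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "master2:8020");
        // Client-side logic that retries against whichever NameNode is Active.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha."
                        + "ConfiguredFailoverProxyProvider");
        return conf;
    }
}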
HDFS HA was developed to overcome which of the following
disadvantages in Hadoop 1.0?
a) Single Point of Failure of the NameNode
b) Only one version can be run in classic MapReduce
c) Too much burden on the JobTracker
YARN – Yet Another Resource Negotiator
YARN is the core component of Hadoop 2 and was added to improve performance in Hadoop.
Hadoop 1.x stack: MapReduce (Cluster Resource Management & Data Processing) on top of HDFS (File Storage).
Hadoop 2.x stack: YARN (Cluster Resource Management) on top of HDFS (File Storage), with MapReduce (Data Processing) and other frameworks (Data Processing) running on YARN.
It is the next-generation computing platform, which offers various advantages compared to
classic MapReduce.
It is a layer that separates the resource management layer from the processing components layer.
MapReduce 2 moves resource management (such as the infrastructure to monitor nodes, allocate
resources and schedule jobs) into YARN.
MapReduce 1.x Execution Framework
A client submits a program to the JobTracker as a job. The JobTracker consults the NameNode for block locations and assigns tasks to the TaskTrackers running alongside the DataNodes (each DataNode in the figure holds a 128 MB and a 72 MB block); each TaskTracker runs its tasks against the local data.
www.npntraining.com/courses/big-data-and-hadoop.php
MapReduce 1.x Execution Framework Contd…
Example: DataNode1 (4 GB RAM), DataNode2 (4 GB RAM), DataNode3 (8 GB RAM) and DataNode4 (8 GB RAM) each run map tasks. Each Map/Reduce task takes 1 GB of RAM, so the resources are not properly utilized.
YARN Components
YARN consists of 3 components
1. ResourceManager
i. Scheduler
ii. Application Manager
2. NodeManager
3. Application Master
www.npntraining.com/masters-program/big-data-architect-training.php
YARN Architecture
1. The Client submits a job to the ResourceManager.
2. The ResourceManager contacts one of the NodeManagers.
3. That NodeManager creates a daemon named ApplicationMaster on the same node; there is one per job.
4. The ApplicationMaster communicates with the ResourceManager to find where the data is.
Each slave DataNode runs a NodeManager, and tasks execute inside Containers.
www.npntraining.com/masters-program/big-data-architect-training.php
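From the application side this whole negotiation is hidden behind the Job API: submitting a MapReduce job on Hadoop 2.x walks exactly through the steps above. A minimal sketch; the input/output paths are placeholders and WordCountMapper is the mapper sketched earlier in this module:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(SubmitJob.class);
        job.setMapperClass(WordCountMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/2016-apache-logs.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/out"));
        // waitForCompletion() submits the job to the ResourceManager; YARN
        // then starts a per-job ApplicationMaster that requests containers.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}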
Limitation 3 – JobTracker Overburdened
Problem: a single JobTracker had to manage thousands of jobs and was overburdened.
Solution:
YARN, with multiple daemons: ResourceManager, NodeManager and ApplicationMaster (one per
application).
Container --> variable resources allocated per task (on a slave machine) --> CPU, memory, disk, network
1. ResourceManager --> entire cluster level
2. NodeManager --> per node/slave/machine/server
3. ApplicationMaster --> life cycle of a job (one ApplicationMaster per job)
www.npntraining.com/masters-program/big-data-architect-training.php
YARN (Yet Another Resource Negotiator) is a new component added in Hadoop 2.0.
Hadoop 1.x stack: MapReduce (Cluster Resource Management & Data Processing) on top of HDFS (File Storage).
Hadoop 2.x stack: YARN (Cluster Resource Management) on top of HDFS (File Storage), with MapReduce (Data Processing) and other frameworks (Data Processing) running on YARN.
Introduction of the new YARN layer in Hadoop 2.0.
www.npntraining.com/masters-program/big-data-architect-training.php
Key Takeaways
Hard work beats talent when talent fails to work hard.
Weitere ähnliche Inhalte

Was ist angesagt?

OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing WorldOMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing WorldAllan Cantle
 
Hive partitioning best practices
Hive partitioning  best practicesHive partitioning  best practices
Hive partitioning best practicesNabeel Moidu
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonIgor Anishchenko
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
System design for video streaming service
System design for video streaming serviceSystem design for video streaming service
System design for video streaming serviceNirmik Kale
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Simplilearn
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 

Was ist angesagt? (20)

Hadoop at Ebay
Hadoop at EbayHadoop at Ebay
Hadoop at Ebay
 
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing WorldOMI - The Missing Piece of a Modular, Flexible and Composable Computing World
OMI - The Missing Piece of a Modular, Flexible and Composable Computing World
 
Hive partitioning best practices
Hive partitioning  best practicesHive partitioning  best practices
Hive partitioning best practices
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
NoSql
NoSqlNoSql
NoSql
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
System design for video streaming service
System design for video streaming serviceSystem design for video streaming service
System design for video streaming service
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Apache hadoop hbase
Apache hadoop hbaseApache hadoop hbase
Apache hadoop hbase
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hadoop
HadoopHadoop
Hadoop
 

Ähnlich wie Understanding Big Data and Hadoop Architecture

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopMr. Ankit
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
BDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfBDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfKUMARRISHAV37
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introductionXuan-Chao Huang
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherJanBask Training
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsshrey mehrotra
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesDavid Tjahjono,MD,MBA(UK)
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 

Ähnlich wie Understanding Big Data and Hadoop Architecture (20)

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
BDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfBDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdf
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Hadoop
HadoopHadoop
Hadoop
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for Fresher
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
paper
paperpaper
paper
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
HADOOP
HADOOPHADOOP
HADOOP
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Big data
Big dataBig data
Big data
 

Kürzlich hochgeladen

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 

Kürzlich hochgeladen (20)

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 

Understanding Big Data and Hadoop Architecture

  • 1. Naveen P.N Trainer NPN TrainingTraining is the essence of success and we are committed to it. www.npntraining.com Module 01 - Understanding Big Data and Hadoop Includes (Hadoop 1.x & 2.x Architecture)
  • 2. Topics for the Module ` What is Big Data OLTP VS OLAP Limitation of existing Data Analytics Moving Data into Code Moving Code into Data Hadoop 1.0 / 2.0 Core Components Hadoop 2.0 Core Components Hadoop Master Slave Architecture After completing the module, you will be able to understand: File Blocks Rack Awareness Anatomy of File Read and Write Hadoop 1.x Challenges Scala REPL Scala Variable Types
  • 3. Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process it using Traditional data processing applications. What is Big Data www.npntraining.com/masters-program/big-data-architect-training.php
  • 4. 12+ TBs of tweet data every day 25+ TBs of log data every day ?TBsof dataevery day 2+ billion people on the Web by end 2011 30 billion RFID tags today (1.3B in 2005) 4.6 billion camera phones world wide 100s of millions of GPS enabled devices sold annually 76 million smart meters in 2009… 200M by 2014 Where Is This “Big Data” Coming From ? www.npntraining.com/masters-program/big-data-architect-training.php
  • 5. About RDBMS Why do I need RDBMS  For quick response  It enables relation between data elements to be defined and managed.  It enables one database to be utilized for all applications. Presently the data is stored in RDBMS, then what is the problem why the problem of BigData come www.npntraining.com/masters-program/big-data-architect-training.php
  • 6. OLTP VS OLAP We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it www.npntraining.com/masters-program/big-data-architect-training.php
  • 7. "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization” Big Data spans three dimensions (3Vs) www.npntraining.com/masters-program/big-data-architect-training.php
  • 8. Storage only Grid (SAN) (Raw Data) ETL Compute Grid RDBMS (Aggregated Data) 1. Can’t explore original high fidelity raw data 3. Premature data death 90% of data is Archived A meagre 10% of Data is available for BI Limitation of Existing Data Analytics Architecture www.npntraining.com/masters-program/big-data-architect-training.php
  • 9. BI Reports + Interactive Apps Solution: A Combined Storage Compute Layer Hadoop: Storage + Compute Grid RDBMS (Aggregated Data) Scalable throughput for ETL & aggregation Data Exploration & Advanced analytics No Data Archiving Keep data alive forever Both Storage And Compute Grid together Entire Data is available for processing www.npntraining.com/masters-program/big-data-architect-training.php
  • 10. Processing 1TB of Data 1 Machine 4 I/O Channels Each Channel – 100 MB/s 45 Minutes Limitation Traditional Approach Processing Data in Enterprise www.npntraining.com/masters-program/big-data-architect-training.php
  • 11. Processing 1TB of Data 10 Machine 4 I/O Channels Each Channel – 100 MB/s 4.3 Minutes Hadoop Approach Processing Data in DFS www.npntraining.com/masters-program/big-data-architect-training.php
  • 12. What is Apache Hadoop Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model To solve the Big Data problem a new framework has evolved that is Hadoop. Hadoop provides.  Commodity Hardware  Big Cluster  Map Reduce  Failover  Data distribution  Moving code to data  Heterogeneous Hardware  Scalable Hadoop is based on work done by Google in the early 2000s  Google File System (GFS) paper published in 2003  MapReduce paper published in 2004 It is an architecture that can scale with huge volumes, variety and speed requirements of Big Data by distributing the work across dozens, hundreds, or even thousands of commodity servers that process the data in parallel. www.npntraining.com/masters-program/big-data-architect-training.php
  • 13. Moving data into code Contd... Terabyte Wants to Analyze the data Traditional data processing architecture  Nodes are broken up into separate processing and storage nodes connected by high capacity link  Many data intensive applications are CPU demanding causing bottle neck in networks.  Latency in transferring data. `` www.npntraining.com/masters-program/big-data-architect-training.php
  • 14. Moving code to data Wants to Analyze the data Map Reduce Client writes MapReduce Jobs Jobs Jobs Jobs Jobs Jobs Hadoop takes radically new approach to the problem of distributed computing. Distribute the data to multiple nodes. Distribute the program for computation to these multiple nodes. Individual nodes then work on data stay in their nodes. No data transfer over the network is required for initial processing. Additional nodes can be added for scalability. ``
  • 15. Distribution Vendors Cloudera Distribution for Hadoop (CDH) MapR Distribution. Hortonworks Data Platform Apache BigTop Distribution. `` www.npntraining.com/masters-program/big-data-architect-training.php
  • 16. Hadoop 1.0 Core Components Hadoop has two main components : 1. HDFS – Hadoop Distributed File System (Storage) 2. MapReduce ( Processing) Hadoop HDFS MapReduce Responsible to store the data in chunks(by splitting into blocks of 64MB each) To process the data in a massive parallel manner. Daemons  Name Node  Data Node  Secondary Name Node  Job Tracker  Task Tracker HDFS NameNode (Master) DataNode Secondary NameNode MapReduce JobTracker TaskTracker Storage Processing www.npntraining.com/masters-program/big-data-architect-training.php
  • 17. Hadoop 2.0 Core Components Hadoop 2.0 has two main components : 1. HDFS – Hadoop Distributed File System (Storage) 2. MapReduce ( Processing) Hadoop HDFS YARN/MRv2 Responsible to store the data in chunks(by splitting into blocks of 128MB each) To process the data in a massive parallel manner. Daemons  Name Node  Data Node  Secondary Name Node  ResourceManager  NodeManager HDFS NameNode (Master) DataNode(Slave) Secondary NameNode MapReduce ResourceManager(Master) NodeManager(Slave) Storage Processing www.npntraining.com/masters-program/big-data-architect-training.php
  • 18. HDFS – Hadoop Distributed File System HDFS is a distributed and scalable file system designed for storing very large files with streaming data access patterns, running clusters on commodity hardware. HDFS Architecture follows a Master/Slave Architecture, where a cluster comprises of a single NameNode (Master node) and a number of DataNodes (Slave nodes). Each and every file in the File System is divided into blocks of size 512 Bytes File System www.npntraining.com/masters-program/big-data-architect-training.php
  • 19. File Blocks By default, block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x Why block size is large?  The main reason for having the HDFS blocks in large size is due to cost of seek time.  The large block size is to account for proper usage of storage space while considering the limit on the memory of NameNode. www.npntraining.com/masters-program/big-data-architect-training.php
  • 20. 1.0 Master Slave Architecture – Simple cluster setup with Hadoop Daemons …… MasterSlave NameNode JobTracker Secondary NameNode Single Box Single Box Optional to have in two boxes Single Box In Separate Box ( Many) TaskTracker TaskTracker TaskTracker Slave1 Slave2 Slave3 DataNode DataNode DataNode 1 2 3 4 5 website : www.npntraining.com
  • 21. 2.0 Master Slave Architecture – Simple cluster setup with Hadoop Daemons …… MasterSlave NameNode Active ResourceManager NameNode Standby Single Box Single Box Optional to have in two boxes Single Box In Separate Box ( Many) NodeManager NodeManager NodeManager Slave1 Slave2 Slave3 DataNode DataNode DataNode 1 2 3 4 5 Secondary NameNode Single Box In Separate Box ( Many) 3 website : www.npntraining.com
  • 22. Master Slave Architecture – Simple cluster setup with Hadoop Daemons NodeManager Slave1 Slave2 Slave3 DataNode DataNode DataNode ResourceManager NameNode MapReduce HDFS NodeManager NodeManager www.npntraining.com/masters-program/big-data-architect-training.php
  • 23. Hadoop Cluster: A Typical Use Case www.npntraining.com/masters-program/big-data-architect-training.php
  • 24. File Blocks in HDFS Master Node Communicates Wants to save 400 MB of data into cluster/HDFS Decides which nodes to write the data to First copy is always stored in nodes which is in close proximity to the client 128 MB 128 MB 16 MB 128 MB 128 MB 128 MB 128 MB 128 MB 128 MB 16 MB 128 MB 16 MB In HDFS the data is broken into blocks of size 64 MB Hadoop creates a 3 replication by default which is configurable and achieves fault tolerance `` website : www.npntraining.com
  • 25. File Blocks in HDFS Contd… NameNode 200 MB – weblog.dat 128 72 b1r1 b1r2 b1r3 b2r1 b2r2 b2r3 Namespace b1 r1 s1 b1 r2 s2 b1 r3 s3 b2 r1 s1 b2 r2 s2 b2 r3 s3 b1r1 b2r1 b1r2 b2r2 b1r3 b2r3 `` website : www.npntraining.com
  • 26. Rack Awareness & Replication Factor www.npntraining.com/masters-program/big-data-architect-training.php
  • 27. NameNode NameNode does not store the files but only files metadata. NameNode keeps track of all the file system related information(Metadata) such as:  Block Locations  Information about file permissions and ownership.  Last access time for the file.  User permission like which user have access to the file. NameNode oversees the health DataNode and coordinates access to the data stored in DataNode. The entire metadata is in main memory. `` www.npntraining.com/masters-program/big-data-architect-training.php
  • 28. NameNode Metadata The entire metadata is in main memory. No demand paging of FS meta-data. NameNode maintains two files 1. fsimage and 2. edit log The fsimage is a file that represents a point-in-time snapshot of the filesystem’s metadata. However, while the fsimage file format is very efficient to read It’s unsuitable for making small incremental updates like renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability. `` www.npntraining.com/masters-program/big-data-architect-training.php
  • 29. Secondary NameNode or CheckPoint Node NameNode Secondary NameNode -fsImage Edit logs Not a hot standby for the NameNode Connects to NameNode every hour. Housekeeping, backup of NameNode metadata Saved metadata can build a failed NameNode Pulls metadata `` www.npntraining.com/masters-program/big-data-architect-training.php
  • 30. Hadoop Components Contd… ResourceManager Name Node NodeManager Data Node NodeManager Data Node NodeManager Data Node Master Node Slave Node Maintains and manages the blocks present on the Slave Nodes Periodically receives a Heartbeat and a Block report from each of the data nodes in the cluster Heartbeat implies that the DataNode is functioning properly, once every 3 seconds The HDFS architecture is built in such a way that the user data is never stored in the NameNode, it only stored metadata. It records the metadata of all the files stored in the cluster, e.g. the location, the size of the files, permissions, hierarchy, etc DataNodes perform the low-level read and write requests from the file system’s clients. Responsible for creating blocks, deleting blocks and replicating the same based on the decisions taken by the NameNode. Secondary Name Node
  • 31. Anatomy of File Write – High Level A user wants to write data to Hadoop hdfs dfs –put 2016-apache-logs.txt / Client “cuts” input file into chunks of “block size” Client then contacts the NameNode to request write operation.  Sends No of blocks  Replication Factor NameNode responds with pipeline of DataNodes for replication to write. Clients reaches out to first DataNode in pipeline + Performs write * No actual data transfer will take place from NameNode Client takes the request “splits” input file into chunks of “block size” Client writes blocks in parallel  all the blocks are written at a time  not one by one www.npntraining.com/masters-program/big-data-architect-training.php
  • 32. Anatomy of File Write – Full Example 2016-apache-logs.txt 200 MB file Block size : 64 MB Replication Factor : 3 44 MB Hadoop Client 128 MB 128 MB NameNode –put request Write pipeline blk_000 to DN1,DN5,DN6 blk_001 to DN4,DN8,DN9 blk_002 to DN7,DN3,DN3 DataNode1 DataNode2 DataNode3 Rack01 DataNode4 DataNode5 DataNode6 Rack01 DataNode7 DataNode8 DataNode9 Rack01 website : www.npntraining.com
  • 33. Anatomy of File Read – Full Example 2016-apache-logs.txt 200 MB file Block size : 64 MB Replication Factor : 3 44 MB Hadoop Client 128 MB 128 MB NameNode –get request Write pipeline blk_000 to DN1 blk_001 to DN4 blk_002 to DN7 DataNode1 DataNode2 DataNode3 Rack01 DataNode4 DataNode5 DataNode6 Rack01 DataNode7 DataNode8 DataNode9 Rack01 website : www.npntraining.com
  • 34. In HDFS, blocks of a file are written in parallel, however replication of the blocks are done sequentially. a) True b) False Hadoop is a framework that allows for the distributed processing of : a) Small Data sets b) Large Data sets A file of 400 MB is being copied to HDFS. The System has finished copying 250 MB . What happens if a client tries to access that file. a) Can read up to block that’s successfully written. b) Can read up to last bit successfully written c) Will throw an exception d) Cannot see that file until its finished copying www.npntraining.com/masters-program/big-data-architect-training.php
  • 36. What could be the limitation of Hadoop 1 / Gen 1 Hadoop 1.x cluster can it have multiple HDFS Namespaces Which of the following are significant disadvantage in Hadoop 1.0 a) Single Point of Failure on NameNode b) Too much burden on JobTracker Hadoop 1.x cluster can it have multiple HDFS Namespaces Can you use other than MapReduce for processing in Hadoop 1.x www.npntraining.com/masters-program/big-data-architect-training.php
  • 37. Hadoop 1.x - Challenges NameNode – No Horizontal Scalability Single NameNode and single Namespaces, limited by NameNode RAM NameNode – No High Availability (HA) NameNode is Single Point of Failure, Need manual recovery using Secondary NameNode in case of failure. Job Tracker - Overburdened Spends significant portion of time and effort managing the life cycle of applications. MRv1 – Only Map & Reduce tasks HumongousData stored in HDFS remains unutilized and cannot be used for other workloads such as Graph processing etc. www.npntraining.com/masters-program/big-data-architect-training.php
  • 38. Single NameNode running and managing Single Namespace. Maintains metadata in RAM 100 slaves/ 1000 slaves --> Managed by Single NameNode Max tested till --> 4000 servers --> Single NameNode --> Single NameSpace Lets assume we have /VOICE directory with too many files and folders we configure separate NameNode for this directory /VOICE/...  NameNode01 /SMS/...  NameNode02 /Data/...  NameNode03 So based on the directory structure we can configure NameNode, so in Hadoop 2 we can configure 10000 servers can be configured because because NameNode separately managing directory structure that's why we call it as Federation Limitation 1 – No Horizontal Scalability www.npntraining.com/masters-program/big-data-architect-training.php
  • 39. Hadoop 2.x Architecture - Federation NameNode1 NameNode2 NameNode3 Hadoop 1.x Hadoop 2.x `` www.npntraining.com/masters-program/big-data-architect-training.php
  • 40. How does HDFS Federation help HDFS scale horizontally? Reduces the load on any single NameNode by using the multiple, independent NameNode to manage individual parts of the file system namespace. You have configured two name nodes to manage /voice and /sms respectively. What will happen if you try to put a file in /lte directory? Put will fail. None of the namespace will manage the file and you will get an IOException with a no such file or directory error www.npntraining.com/masters-program/big-data-architect-training.php
  • 41. If you loose Namenode you will loose the Cluster details.Manual intervention should be there to start new NameNode and copy backup from SecondaryNN Problem 10am --> backup to SNN 10:45am --> NameNode breakdown --> You can get data till 10:00am from SNN ( Problem in Gen 1 ) Solution ========== HighAvailability : Active and Standby Namenodes manage same data at given point of time. --> In case Active NameNode fails Standby NameNode will act as Active and serves request Limitation 2 – No High Availability www.npntraining.com/masters-program/big-data-architect-training.php
  • 42. Hadoop 2.x Architecture - HA https://hadoop.apache.org/docs/r2.5.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html `` www.npntraining.com/masters-program/big-data-architect-training.php
  • 43. HDFDS HA was developed to overcome the following disadvantage in Hadoop 1.0 a) Single Point of Failure of NameNode b) Only one version can be run in classic Map-Reduce c) Too much burden on JobTracker
  • 45. YARN – Yet Another Resource Negotiator YARN is the core component of Hadoop 2 and is added to improve performance in Hadoop Hadoop 1.x MapReduce Cluster Resource Management & Data Processing HDFS (File Storage) Hadoop 2.x YARN Cluster Resource Management HDFS (File Storage) MapReduce (Data Processing) Others (Data Processing) It is the next generation computing platform which offer various advantage when compared to classic MapReduce It is a layer that separates the resource management layer and the processing components layer. MapReduce2 moves Resource management (like infrastructure to monitor nodes, allocate resources and schedule jobs) into YARN.
  • 46. MapReduce 1.x Execution Framework TaskTracker JobTracker Tasks Client NameNode 128 MB 72 MB DataNode Tasks TaskTracker Tasks 128 MB 72 MB DataNode Tasks TaskTracker Tasks 128 MB 72 MB DataNode Tasks Program Job Job
  • 47. www.npntraining.com/courses/big-data-and-hadoop.php MapReduce 1.x Execution Framework Contd… DataNode2 4GB DataNode4 DataNode1 4GB RAM DataNode3 8GB RAM Map Map Map Map Map 8GB RAM Each Map/ Reduce task takes 1GB of RAM. Resource is not properly utilized.
  • 48. YARN Components YARN consists of 3 components 1. ResourceManager i. Scheduler ii. Application Manager 2. NodeManager 3. Application Master www.npntraining.com/masters-program/big-data-architect-training.php
  • 49. YARN Architecture ResourceManager DN1NodeManager Client DN1NodeManager DN1NodeManager 1. Client submits job to Resource Manager 2. RM will contact any of the NodeManager Application Master 3. Node Manager will create a daemon by Name Application Master on the same Node, its one per job 4. AM will communicate to the ResourceManager to find where the data is. Container www.npntraining.com/masters-program/big-data-architect-training.php
  • 50. Sinble JobTracker to manage thousands of jobs Problem Jobtracker was overburdend Solution ========== YARN with multiple deamons like ResourceManager, NodeManger, ApplicationMaster(one per Application) Container --> variable resources allocated per task(in slave m/c) --> cpu,memory,disk,network 1. Resource Manager --> Entire Cluster Lever 2. NodeManager --> Per Node/Slave/machine/server 3. App Master --> life cycle of job ( App Master one per job ) Limitation 3 – Job Tracker Overburden www.npntraining.com/masters-program/big-data-architect-training.php
  • 51. Introduction to the new YARN layer in Hadoop 2.0. YARN (Yet Another Resource Negotiator) is a new component added in Hadoop 2.0. [Diagram: Hadoop 1.x stacks MapReduce (Cluster Resource Management & Data Processing) on HDFS (File Storage); Hadoop 2.x stacks YARN (Cluster Resource Management), with MapReduce and other engines (Data Processing), on HDFS (File Storage).] www.npntraining.com/masters-program/big-data-architect-training.php
  • 52. Key Takeaways: Hard work beats talent when talent fails to work hard.

Editor's Notes

  1. Facebook, Google+, LinkedIn, and Twitter generate huge volumes of data every day. Facebook recently unveiled some statistics on the amount of data its systems process and store: according to Facebook, its data systems process 2.5 million pieces of content each day, amounting to 500+ terabytes of data daily; Facebook generates 2.7 billion Like actions per day, and 300 million new photos are uploaded daily.
  2. Presently the data is stored in an RDBMS, so why does the Big Data problem arise? What is the limitation of an RDBMS, and why do I need an RDBMS? We go online and get a response immediately: that is the concept of a DBMS, or an OLTP application.
  3. IBM's Definition – Big Data Characteristics. Velocity: the rate at which data is generated; e.g. CDRs (Call Detail Records) are used to understand customer churn, i.e. customers leaving a service provider. Variety: images, e.g. MRI scans.
  4. Latency in transferring data.
  5. One channel reads 100 MB in 1 sec, so 6,000 MB in 60 sec; 4 channels together read 24,000 MB per minute. 1 TB = 1,024 GB = 1,048,576 MB, and 1,048,576 MB ÷ 24,000 MB/min ≈ 43.69 minutes.
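The same arithmetic, written out as a small Java check (all figures come from the note above):

```java
public class ScanTime {
    public static void main(String[] args) {
        final double MB_PER_SEC_PER_CHANNEL = 100.0;
        final int CHANNELS = 4;
        final double TOTAL_MB = 1024.0 * 1024.0; // 1 TB = 1,048,576 MB

        double mbPerMinute = MB_PER_SEC_PER_CHANNEL * 60 * CHANNELS; // 24,000 MB/min
        double minutes = TOTAL_MB / mbPerMinute;                     // ~43.69
        System.out.printf("One machine scans 1 TB in %.2f minutes%n", minutes);
    }
}
```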
  6. The advantage of a shared-nothing architecture is that it can scale easily, simply by adding another node. A shared-nothing architecture (SN) is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage.
  7. Latency in transferring data.
  8. Latency in transferring data.
  9. Processing coupled with data: in Hadoop we send jobs towards the data.
  10. There are some programs which manage the Hadoop components; these programs are known as Daemons. Daemons take care of the components in Hadoop.
  11. There are some programs which manage the Hadoop components; these programs are known as Daemons. Daemons take care of the components in Hadoop.
  12. HDFS is a block-structured file system designed to store very large files, where each file is divided into blocks of a predetermined size. These blocks are stored across a cluster of one or several commodity machines.
  13. HDFS is a block-structured file system designed to store very large files, where each file is divided into blocks of a predetermined size. These blocks are stored across a cluster of one or several commodity machines (see the client sketch below).
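A minimal client sketch of the block idea, in Java: the NameNode address and file paths are placeholders, while the per-file block size and replication arguments to FileSystem.create are standard Hadoop API parameters.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // illustrative address
        FileSystem fs = FileSystem.get(conf);

        // Ask for 128 MB blocks and 3 replicas for this one file; a 200 MB
        // file would then be split into one 128 MB block and one 72 MB block.
        long blockSize = 128L * 1024 * 1024;
        try (FSDataOutputStream out = fs.create(
                new Path("/data/sample.bin"), true, 4096, (short) 3, blockSize)) {
            out.write(new byte[]{1, 2, 3});
        }
    }
}
```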
  14. Tell: http://wiki.apache.org/hadoop/PoweredBy
  15. Fault tolerance: Hadoop will not fail even if one or more slaves fail.
  16. Fault tolerance: Hadoop will not fail even if one or more slaves fail.
  17. Rack: a group of servers placed in a single location. Hadoop writes one replica in one rack and the other replicas in a different rack; an administrator can even change this. Hadoop thus provides rack-level fault tolerance as well (see the sketch below).
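As a small related sketch (an illustration with assumed names, not part of the slides): the replication factor of an existing file can be changed through the FileSystem API, while the rack placement of those replicas is decided by the cluster's topology configuration, not by this call.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // illustrative address
        FileSystem fs = FileSystem.get(conf);

        // With the default factor of 3 and rack awareness enabled, HDFS places
        // replicas across racks; here we lower this file's factor to 2.
        boolean scheduled = fs.setReplication(new Path("/data/sample.bin"), (short) 2);
        System.out.println("Replication change scheduled: " + scheduled);
    }
}
```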
  18. This way, if the NameNode crashes, it can restore its state by first loading the fsimage.
  19. This way, if the NameNode crashes, it can restore its state by first loading the fsimage and then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage.
  20. The network between the client and the cluster will be slower compared to the network within the cluster.
  21. The network between the client and the cluster will be slower compared to the network within the cluster.
  22. In HDFS, blocks of a file are written in parallel; however, replication of the blocks is done sequentially. Answer: True. A file is divided into blocks; these blocks are written in parallel, but block replication happens in sequence. A file of 400 MB is being copied to HDFS, and the system has finished copying 250 MB. What happens if a client tries to access that file? Answer: (a) It can read up to the last block that was successfully written.
  23. There are lots of self-standing software packages built on top of the Hadoop framework, each addressing a particular set of problems. The software built on top of the Hadoop framework is called the Hadoop Eco-System. Flume is used to stream data from non-HDFS sources into HDFS, e.g. Twitter.
  24. Each NameNode need not coordinate with the others, so it is called Federated.
  25. In HDFS, blocks of a file are written in parallel; however, replication of the blocks is done sequentially. Answer: True. A file is divided into blocks; these blocks are written in parallel, but block replication happens in sequence. A file of 400 MB is being copied to HDFS, and the system has finished copying 250 MB. What happens if a client tries to access that file? Answer: (a) It can read up to the last block that was successfully written.
  26. Each NameNode need not coordinate with the others, so it is called Federated.
  27. HDFS HA was developed to overcome which of the following disadvantages in Hadoop 1.0? Answer: (a).
  28. Let's say a client submits a program. The program communicates with the JobTracker; in Hadoop terminology the program is considered a Job, and in the job we will have mentioned which data to process. The JobTracker communicates with the NameNode to get the DataNodes which have the data. In a nutshell, the responsibilities of the JobTracker: it accepts the job; figures out where the data is; invokes all the TaskTrackers and assigns them the job; monitors all the tasks (TaskTracker crashes) and the job life cycle. The JobTracker gets overburdened because in production thousands of jobs will be running, and after a certain time the JobTracker becomes slow.
  29. In Hadoop 1.x, MapReduce is the only programming model for processing the data stored in HDFS. In MapReduce, work is divided into 2 phases, the Map phase and the Reduce phase, and each Map task takes 1 GB of resources for processing. In Hadoop 2.x the processing is taken care of by YARN, where the minimum memory allocation for a Map task is 1 GB (a minimal sketch of the two phases follows).
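To make the two phases concrete, here is the canonical WordCount skeleton in Java (the class names and token-splitting details are conventional choices, not taken from the slides): the Map phase emits (word, 1) pairs and the Reduce phase sums them per word.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts that arrive for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
}
```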
  30. http://sivansasidharan.me/blog/Hadoop_YARN/
  31. Whenever a job is submitted, it communicates with the ResourceManager. The ResourceManager then contacts any NodeManager, not necessarily one that has the data, and tells it there is a job. That NodeManager launches a daemon called the ApplicationMaster on the same node. The ApplicationMaster is per job; it is the responsibility of the ApplicationMaster to run the job. The ApplicationMaster can contact the NodeManagers as well as the ResourceManager; by contacting the ResourceManager, the ApplicationMaster comes to know where the data is, contacts that node, and launches a Container. Containers are simply Java processes (JVMs), and the actual program executes inside the container. The advantage of such an architecture is that if DN2 requires more resources for processing, the ApplicationMaster can contact the ResourceManager to allocate more; the RM is thus a global entity that manages resources, while the entire life cycle of the application (creating, monitoring, etc.) is managed by the ApplicationMaster. The NodeManager keeps track of the resources present on its DataNode and reports them to the ResourceManager. In this architecture the resources are not managed by the DataNode: if any machine has spare resources, the RM can communicate with its NodeManager, which will create a Container, and the data will be copied there and executed. http://sivansasidharan.me/blog/Hadoop_YARN/