Slide 1

www.edureka.in/hadoop
How It Works…
 LIVE On-Line classes
 Class recordings in Learning Management System (LMS)
 Module wise Quizzes, Coding Assignments
 24x7 on-demand technical support
 Project work on large Datasets
 Online certification exam

 Lifetime access to the LMS

Complimentary Java Classes

Slide 2

Course Topics


Module 1
Understanding Big Data
Hadoop Architecture

Module 2
Hadoop Cluster Configuration
Data loading Techniques
Hadoop Project Environment

Module 3
Hadoop MapReduce framework
Programming in Map Reduce

Module 4
Advance MapReduce
MRUnit testing framework

Module 5
Analytics using Pig
Understanding Pig Latin

Module 6
Analytics using Hive
Understanding HIVE QL

Module 7
Advance Hive
NoSQL Databases and HBASE

Module 8
Advance HBASE
Zookeeper Service

Module 9
Hadoop 2.0 – New Features
Programming in MRv2

Module 10
Apache Oozie
Real world Datasets and Analysis
Project Discussion

Slide 3

Topics for Today
 What is Big Data?
 Limitations of the existing solutions
 Solving the problem with Hadoop
 Introduction to Hadoop

 Hadoop Eco-System
 Hadoop Core Components
 HDFS Architecture
 MapReduce Job execution
 Anatomy of a File Write and Read
 Hadoop 2.0 (YARN or MRv2) Architecture

Slide 4

What Is Big Data?
 Lots of Data (Terabytes or Petabytes)
 Big data is the term for a collection of data sets so large and complex that it becomes difficult to
process using on-hand database management tools or traditional data processing applications. The
challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
 Systems and enterprises generate huge amounts of data, ranging from terabytes to petabytes of
information.

NYSE generates about one terabyte of new trade data
per day, which is used for stock trading analytics to determine
trends for optimal trades.

Slide 5

Un-Structured Data is Exploding

Slide 6

IBM’s Definition

Characteristics of Big Data
 Volume
 Velocity
 Variety

Slide 7

Annie’s Introduction

Hello There!!
My name is Annie.
I love quizzes and
puzzles and I am here to
make you guys think and
answer my questions.

Slide 8

Annie’s Question
Map the following to the corresponding data type:
- XML Files
- Word Docs, PDF files, Text files
- E-Mail body
- Data from Enterprise systems (ERP, CRM etc.)

Slide 9

Annie’s Answer
XML Files -> Semi-structured data
Word Docs, PDF files, Text files -> Unstructured Data
E-Mail body -> Unstructured Data
Data from Enterprise systems (ERP, CRM etc.) -> Structured Data

Slide 10

Further Reading
 More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
 Hadoop Cluster Configuration Files
http://www.edureka.in/blog/hadoop-cluster-configuration-files/
 Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
 Big Data
http://en.wikipedia.org/wiki/Big_Data

Slide 11

Common Big Data Customer Scenarios
 Web and e-tailing
 Recommendation Engines
 Ad Targeting
 Search Quality
 Abuse and Click Fraud Detection

 Telecommunications
 Customer Churn Prevention
 Network Performance Optimization
 Call Detail Record (CDR) Analysis
 Analyzing Network to Predict Failure

http://wiki.apache.org/hadoop/PoweredBy

Slide 12

Common Big Data Customer Scenarios (Contd.)
 Government
 Fraud Detection And Cyber Security
 Welfare schemes
 Justice

 Healthcare & Life Sciences
 Health information exchange
 Gene sequencing
 Serialization
 Healthcare service quality improvements
 Drug Safety





http://wiki.apache.org/hadoop/PoweredBy

Slide 13

Common Big Data Customer Scenarios (Contd.)
 Banks and Financial services
 Modeling True Risk
 Threat Analysis
 Fraud Detection
 Trade Surveillance
 Credit Scoring And Analysis

 Retail
 Point of sales Transaction Analysis
 Customer Churn Analysis
 Sentiment Analysis

http://wiki.apache.org/hadoop/PoweredBy

Slide 14

Hidden Treasure
Case Study: Sears Holdings Corporation

 Insight into data can provide Business Advantage.
 Some key early indicators can mean Fortunes to Business.
 More Precise Analysis with more data.

*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process
customer activity and sales data.

Slide 15

Limitations of Existing Data Analytics Architecture
Architecture layers (top to bottom):
BI Reports + Interactive Apps
RDBMS (Aggregated Data): a meagre 10% of the ~2PB Data is available for BI
ETL Compute Grid (Processing)
Storage only Grid (original Raw Data) (Storage): 90% of the ~2PB Archived
Mostly Append
Collection
Instrumentation

Limitations:
1. Can't explore original high fidelity raw data
2. Moving data to compute doesn't scale
3. Premature data death

http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
Slide 16

Solution: A Combined Storage Compute Layer

Architecture layers (top to bottom):
BI Reports + Interactive Apps
RDBMS (Aggregated Data)
Hadoop : Storage + Compute Grid (Both Storage And Processing)
Mostly Append
Collection
Instrumentation

No Data Archiving: Entire ~2PB Data is available for processing

1. Data Exploration & Advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever

*Sears moved to a 300-Node Hadoop cluster to keep 100% of its data available for processing rather
than a meagre 10% as was the case with existing Non-Hadoop solutions.
Slide 17

Hadoop Differentiating Factors
Accessible

Simple

Differentiating
Factors

Robust

Scalable

Slide 18

Hadoop – It’s about Scale And Structure
RDBMS, EDW, MPP RDBMS        |              | HADOOP, NoSQL
Structured                   | Data Types   | Multi and Unstructured
Limited, No Data Processing  | Processing   | Processing coupled with Data
Standards & Structured       | Governance   | Loosely Structured
Required On Write            | Schema       | Required On Read
Reads are Fast               | Speed        | Writes are Fast
Software License             | Cost         | Support Only
Known Entity                 | Resources    | Growing, Complexities, Wide
Interactive OLAP Analytics,  | Best Fit Use | Data Discovery,
Complex ACID Transactions,   |              | Processing Unstructured Data,
Operational Data Store       |              | Massive Storage/Processing

Slide 19

Why DFS?
Read 1 TB Data

1 Machine
4 I/O Channels
Each Channel – 100 MB/s
45 Minutes

10 Machines
4 I/O Channels
Each Channel – 100 MB/s
4.5 Minutes
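
The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and that every channel reads in parallel at full speed; the function name is illustrative only. Under those assumptions the exact results are a little under the rounded figures quoted on the slide, but the tenfold speed-up from ten machines holds.

```python
def read_time_minutes(data_bytes, machines, channels_per_machine, bytes_per_second):
    """Minutes to scan data_bytes when every channel on every machine reads in parallel."""
    aggregate_throughput = machines * channels_per_machine * bytes_per_second
    return data_bytes / aggregate_throughput / 60


TB = 10**12  # assumption: decimal terabyte
MB = 10**6   # assumption: decimal megabyte

one_machine = read_time_minutes(1 * TB, machines=1, channels_per_machine=4,
                                bytes_per_second=100 * MB)
ten_machines = read_time_minutes(1 * TB, machines=10, channels_per_machine=4,
                                 bytes_per_second=100 * MB)

print(f"1 machine:   {one_machine:.1f} minutes")
print(f"10 machines: {ten_machines:.1f} minutes")
```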
What Is Hadoop?
 Apache Hadoop is a framework that allows for the distributed processing of large data sets
across clusters of commodity computers using a simple programming model.
 It is an open-source data management framework with scale-out storage and distributed processing.

Slide 23

Hadoop Key Characteristics
Reliable

Flexible

Hadoop
Features

Economical

Scalable

Slide 24

Annie’s Question
Hadoop is a framework that allows for the distributed processing of:
- Small Data Sets
- Large Data Sets

Slide 25

Annie’s Answer
Large Data Sets. Hadoop can also process small data sets;
however, to experience its true power you need data in the
terabyte range, because this is where an RDBMS takes hours or
fails, whereas Hadoop can do the same work in a couple of minutes.

Slide 26

Hadoop Eco-System
Apache Oozie (Workflow)
Hive (DW System)
Pig Latin (Data Analysis)
Mahout (Machine Learning)
MapReduce Framework
HBase
HDFS (Hadoop Distributed File System)
Flume (Import Or Export of Unstructured or Semi-Structured data)
Sqoop (Import Or Export of Structured Data)

Slide 27

Machine Learning with Mahout
Write intelligent applications using Apache Mahout
LinkedIn Recommendations

Hadoop and
MapReduce magic in
action

https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
Slide 28

Hadoop Core Components
Hadoop is a system for large scale data processing.
It has two main components:
 HDFS – Hadoop Distributed File System (Storage)
 Distributed across “nodes”
 Natively redundant
 NameNode tracks locations.
 MapReduce (Processing)

 Splits a task across processors “near” the data & assembles results
 Self-Healing, High Bandwidth Clustered storage
 JobTracker manages the TaskTrackers
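
The "split the task, process near the data, assemble results" idea can be illustrated with a word count simulated in plain Python. This is only a single-process sketch, not Hadoop's API: the function names and the two sample splits are made up for illustration, and a real job would run the map calls on different nodes.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    for line in split:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input: each inner list stands for one split stored on a different node.
splits = [
    ["big data", "hadoop stores big data"],
    ["hadoop processes data"],
]

mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```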
Slide 29

Hadoop Core Components (Contd.)

MapReduce Engine: Job Tracker, Task Trackers
HDFS Cluster: Name node, Data Nodes
Admin Node: Job Tracker, Name node
Other nodes (four shown): Task Tracker, Data Node

Slide 30

HDFS Architecture

NameNode: Metadata (Name, replicas,…): /home/foo/data, 3,…
Client and NameNode: Metadata ops
NameNode and Datanodes: Block ops
Clients and Datanodes: Read, Write
Between Datanodes: Replication of Blocks
Datanodes are placed in Rack 1 and Rack 2

Slide 31

Main Components Of HDFS
 NameNode:
 master of the system
 maintains and manages the blocks which are present on the
DataNodes

 DataNodes:
 slaves which are deployed on each machine and provide the
actual storage
 responsible for serving read and write requests for the clients

Slide 32

NameNode Metadata
 Meta-data in Memory
 The entire metadata is in main memory
 No demand paging of FS meta-data
 Types of Metadata
 List of files
 List of Blocks for each file
 List of DataNodes for each block
 File attributes, e.g. access time, replication factor
 A Transaction Log
 Records file creations, file deletions, etc.

Slide 33

Name Node
(Stores metadata only)
METADATA:
/user/doug/hinfo -> 1 3 5
/user/doug/pdetail -> 4 2

Name Node:
Keeps track of overall file directory
structure and the placement of Data
Block
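
A minimal sketch of these in-memory lookups, assuming two plain dictionaries. The file paths and block IDs come from the slide; the DataNode names (dn1 to dn4) and the `locate` helper are hypothetical and only illustrate the shape of the data, not the NameNode's real structures.

```python
# File namespace: path -> ordered list of block IDs (values taken from the slide).
file_to_blocks = {
    "/user/doug/hinfo": [1, 3, 5],
    "/user/doug/pdetail": [4, 2],
}

# Block map: block ID -> DataNodes holding a replica (hypothetical node names).
block_to_datanodes = {
    1: ["dn1", "dn2", "dn3"],
    2: ["dn2", "dn3", "dn4"],
    3: ["dn1", "dn3", "dn4"],
    4: ["dn1", "dn2", "dn4"],
    5: ["dn2", "dn3", "dn4"],
}


def locate(path):
    """Return (block ID, replica locations) for each block of a file, in file order."""
    return [(block, block_to_datanodes[block]) for block in file_to_blocks[path]]


for block, nodes in locate("/user/doug/hinfo"):
    print(block, nodes)
```

The NameNode stores only this kind of mapping; the block contents themselves live on the DataNodes.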

Secondary Name Node
NameNode: Single Point of Failure
Secondary NameNode: "You give me metadata every hour, I will make it secure"

 Secondary NameNode:
 Not a hot standby for the NameNode
 Connects to NameNode every hour*
 Housekeeping, backup of NameNode metadata
 Saved metadata can build a failed NameNode
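
The housekeeping step can be pictured as replaying the transaction log on top of the last saved snapshot. The sketch below assumes a dictionary as the snapshot and tuples as log entries; these formats and the `apply_edits` name are illustrative only and do not match the real on-disk files.

```python
def apply_edits(fsimage, edits):
    """Replay logged operations on a copy of the snapshot and return the new checkpoint."""
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for operation, path, blocks in edits:
        if operation == "create":
            namespace[path] = list(blocks)
        elif operation == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return namespace


# Hypothetical snapshot and log, reusing the file names shown on the metadata slide.
fsimage = {"/user/doug/hinfo": [1, 3, 5]}
edits = [
    ("create", "/user/doug/pdetail", [4, 2]),
    ("delete", "/user/doug/hinfo", []),
]

checkpoint = apply_edits(fsimage, edits)
print(checkpoint)
```

A failed NameNode rebuilt from such a checkpoint loses any changes logged after the last merge, which is one reason the Secondary NameNode is not a hot standby.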

Secondary
NameNode

metadata

Slide 34

Annie’s Question
NameNode?
a) is the "Single Point of Failure" in a cluster
b) runs on 'Enterprise-class' hardware
c) stores meta-data
d) All of the above

Slide 35

Annie’s Answer

All of the above. The NameNode stores meta-data and runs on reliable,
high-quality hardware because it is a Single Point of Failure in a
Hadoop cluster.

Slide 36

Annie’s Question
When the NameNode fails, Secondary NameNode takes over
instantly and prevents Cluster Failure:
a) TRUE
b) FALSE

Slide 37

Annie’s Answer

False. Secondary NameNode is used for creating NameNode
checkpoints. NameNode can be manually recovered using 'edits'
and 'FSImage' stored in Secondary NameNode.

Slide 38

JobTracker
1. Copy Input Files
2. Submit Job
3. Get Input Files' Info
4. Create Splits
5. Upload Job Information
6. Submit Job

Components: User, Client, DFS (Input Files, Job.xml, Job.jar), Job Tracker

Slide 39

JobTracker (Contd.)

6. Submit Job
7. Initialize Job
8. Read Job Files
9. Create maps and reduces (as many maps as splits)

Components: Client, Job Tracker, DFS (Input Splits, Job.xml, Job.jar), Maps, Reduces

Slide 40

Job Queue
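
The "as many maps as splits" rule can be sketched as follows. The 64 MB split size and the 200 MB file are assumptions chosen for illustration, and the helper name is hypothetical; real input formats apply additional rules when computing splits.

```python
def make_splits(file_size, split_size):
    """Cut a file of file_size bytes into (offset, length) splits of at most split_size bytes."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


MB = 1024 * 1024

# Assumption: a 200 MB input file and a 64 MB split size.
splits = make_splits(200 * MB, 64 * MB)

# One map task is created per split.
map_tasks = [f"map-{index}" for index, _ in enumerate(splits)]

print(len(splits), "splits ->", map_tasks)
```

The last split is shorter than the others because the file size is not a multiple of the split size.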

JobTracker (Contd.)

10. Heartbeat (sent by each Task Tracker to the Job Tracker)
11. Picks Tasks (Data Local if possible)
12. Assign Tasks

Slide 41

Job Tracker
Job Queue: H1, H3, H4, H5
Task Tracker H1
Task Tracker H2
Task Tracker H3
Task Tracker H4
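
Steps 10 to 12 can be sketched as a simple data-local task picker. This is a simplification under stated assumptions: each pending task lists the hosts that hold its input split, the host names reuse the H1 to H5 labels from the diagram, and `pick_task` is a hypothetical helper rather than the JobTracker's actual scheduler.

```python
def pick_task(pending, host):
    """On a heartbeat from host, prefer a task whose input is local; otherwise take any task."""
    for index, task in enumerate(pending):
        if host in task["hosts"]:
            return pending.pop(index)  # data-local assignment
    return pending.pop(0) if pending else None  # fall back to a non-local task


# Hypothetical job queue: each task records which hosts store its input split.
pending = [
    {"id": "map-0", "hosts": {"H3"}},
    {"id": "map-1", "hosts": {"H1"}},
    {"id": "map-2", "hosts": {"H5"}},
]

assignments = {}
for host in ["H1", "H2", "H4", "H3"]:  # order in which heartbeats arrive
    task = pick_task(pending, host)
    assignments[host] = task["id"] if task else None

print(assignments)
```

In this run H1 receives the task whose data it stores, H2 and H4 receive non-local tasks, and H3 finds the queue already empty.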

Annie’s Question
Which of the following daemons does the Hadoop framework
use for scheduling a task?
a) namenode
b) datanode
c) task tracker
d) job tracker

Slide 42

Annie’s Answer

JobTracker takes care of all the job scheduling and
assigns tasks to TaskTrackers.

Slide 43

Anatomy of A File Write

1. Create
2. Create
3. Write
4. Write Packet
5. Ack Packet
6. Close
7. Complete

Components: HDFS Client, Distributed File System, NameNode

Slide 44

Pipeline of Data nodes:
DataNode
DataNode
DataNode
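
Steps 4 and 5 can be sketched as packets travelling down a pipeline of DataNodes, with acknowledgements coming back in reverse order. Here each DataNode is modelled as a plain list and the packet size of 4 bytes is arbitrary; both are assumptions for illustration and not how HDFS stores data.

```python
def write_packet(packet, pipeline):
    """Forward one packet through every DataNode, then collect acks in reverse order."""
    for datanode in pipeline:          # step 4: write packet along the pipeline
        datanode.append(packet)
    acks = [datanode[-1] == packet for datanode in reversed(pipeline)]  # step 5: ack packet
    return all(acks)


def write_block(data, packet_size, pipeline):
    """Cut data into packets and send each one through the pipeline."""
    for start in range(0, len(data), packet_size):
        packet = data[start:start + packet_size]
        if not write_packet(packet, pipeline):
            raise IOError("pipeline did not acknowledge packet")


# Three in-memory lists stand in for the three DataNodes of the pipeline.
pipeline = [[], [], []]
write_block(b"hello hdfs pipeline", packet_size=4, pipeline=pipeline)

stored = [b"".join(datanode) for datanode in pipeline]
print(stored)
```

Every DataNode in the pipeline ends up holding the same bytes, which is how a single client write produces several replicas.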

Anatomy of A File Read

1. Open
2. Get Block locations
3. Read
4. Read
5. Read

Components: HDFS Client, Distributed File System, NameNode, DataNodes

Slide 45

Replication and Rack Awareness

Slide 46

Annie’s Question

In HDFS, blocks of a file are written in parallel; however,
the replication of the blocks is done sequentially:
a) TRUE
b) FALSE

Slide 47

Annie’s Answer

True. A file is divided into blocks. These blocks are
written in parallel, but the block replication happens in
sequence.

Slide 48

Annie’s Question
A file of 400MB is being copied to HDFS. The system
has finished copying 250MB. What happens if a client
tries to access that file:
a) Can read up to the block that's successfully written.
b) Can read up to the last bit successfully written.
c) Will throw an exception.
d) Cannot see that file until it's finished copying.

Slide 49

Annie’s Answer

The client can read up to the last successfully written data block.
Answer is (a)
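
As a worked example of answer (a), the readable portion is the number of fully written blocks times the block size. The slide does not state a block size, so the 64 MB and 128 MB values below are assumptions, and `readable_bytes` is a hypothetical helper.

```python
def readable_bytes(bytes_written, block_size):
    """Bytes visible to a reader when only completely written blocks can be read."""
    complete_blocks = bytes_written // block_size
    return complete_blocks * block_size


MB = 1024 * 1024

# Assumption: 64 MB blocks. 250 MB written means 3 complete blocks.
readable = readable_bytes(250 * MB, 64 * MB)
print(readable // MB, "MB readable with 64 MB blocks")

# Assumption: 128 MB blocks. 250 MB written means 1 complete block.
print(readable_bytes(250 * MB, 128 * MB) // MB, "MB readable with 128 MB blocks")
```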

Slide 50

Hadoop 2.x (YARN or MRv2)
HDFS
Active NameNode: All name space edits logged to shared NFS storage; single writer (fencing)
Shared edit logs
Standby NameNode: Reads edit logs and applies to its own namespace
Secondary Name Node
Data Nodes
Client

YARN
Resource Manager
Node Managers, shown with Containers and App Masters alongside the Data Nodes

Slide 51

Further Reading
 Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
 Apache Hadoop HDFS Architecture
http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
 Hadoop 2.0 and YARN
http://www.edureka.in/blog/apache-hadoop-2-0-and-yarn/

Slide 52

Module-2 Pre-work
 Set up the Hadoop development environment using the documents present in the LMS.
 Hadoop Installation – Set up Cloudera CDH3 Demo VM
 Hadoop Installation – Set up Cloudera CDH4 QuickStart VM
 Execute Linux Basic Commands
 Execute HDFS Hands On commands
 Attempt the Module-1 Assignments present in the LMS.

Slide 53

Thank You
See You in Class Next Week

Tableau Tutorial for Data Science | Edureka
 
Python Programming Tutorial | Edureka
Python Programming Tutorial | EdurekaPython Programming Tutorial | Edureka
Python Programming Tutorial | Edureka
 
Top 5 PMP Certifications | Edureka
Top 5 PMP Certifications | EdurekaTop 5 PMP Certifications | Edureka
Top 5 PMP Certifications | Edureka
 
Top Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | EdurekaTop Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | Edureka
 
Linux Mint Tutorial | Edureka
Linux Mint Tutorial | EdurekaLinux Mint Tutorial | Edureka
Linux Mint Tutorial | Edureka
 
How to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| EdurekaHow to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| Edureka
 
Importance of Digital Marketing | Edureka
Importance of Digital Marketing | EdurekaImportance of Digital Marketing | Edureka
Importance of Digital Marketing | Edureka
 
RPA in 2020 | Edureka
RPA in 2020 | EdurekaRPA in 2020 | Edureka
RPA in 2020 | Edureka
 
Email Notifications in Jenkins | Edureka
Email Notifications in Jenkins | EdurekaEmail Notifications in Jenkins | Edureka
Email Notifications in Jenkins | Edureka
 
EA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | EdurekaEA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | Edureka
 
Cognitive AI Tutorial | Edureka
Cognitive AI Tutorial | EdurekaCognitive AI Tutorial | Edureka
Cognitive AI Tutorial | Edureka
 
AWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | EdurekaAWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | Edureka
 
Blue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | EdurekaBlue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | Edureka
 
Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka
 
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | EdurekaA star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
 
Kubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | EdurekaKubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | Edureka
 
Introduction to DevOps | Edureka
Introduction to DevOps | EdurekaIntroduction to DevOps | Edureka
Introduction to DevOps | Edureka
 

Kürzlich hochgeladen

MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 

Kürzlich hochgeladen (20)

MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 

Learn Hadoop

  • 2. How It Works: live online classes; class recordings in the Learning Management System (LMS); module-wise quizzes and coding assignments; 24x7 on-demand technical support; project work on large datasets; online certification exam; lifetime access to the LMS; complimentary Java classes. www.edureka.in/hadoop
  • 3. Course Topics: Module 1: Understanding Big Data, Hadoop Architecture. Module 2: Hadoop Cluster Configuration, Data Loading Techniques, Hadoop Project Environment. Module 3: Hadoop MapReduce Framework, Programming in MapReduce. Module 4: Advanced MapReduce, MRUnit Testing Framework. Module 5: Analytics Using Pig, Understanding Pig Latin. Module 6: Analytics Using Hive, Understanding Hive QL. Module 7: Advanced Hive, NoSQL Databases and HBase. Module 8: Advanced HBase, ZooKeeper Service. Module 9: Hadoop 2.0 New Features, Programming in MRv2. Module 10: Apache Oozie, Real-World Datasets and Analysis, Project Discussion.
  • 4. Topics for Today: What is Big Data?; limitations of the existing solutions; solving the problem with Hadoop; introduction to Hadoop; Hadoop ecosystem; Hadoop core components; HDFS architecture; MapReduce job execution; anatomy of a file write and read; Hadoop 2.0 (YARN or MRv2) architecture.
  • 5. What Is Big Data? Lots of data (terabytes or petabytes). Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. Systems and enterprises generate huge amounts of data, from terabytes to petabytes of information. The NYSE generates about one terabyte of new trade data per day, which is analyzed to determine trends for optimal trades.
  • 6. Unstructured Data Is Exploding
  • 7. IBM's Definition: Characteristics of Big Data: Volume, Velocity, Variety.
  • 8. Annie's Introduction: Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
  • 9. Annie's Question: Map the following to the corresponding data type: XML files; Word docs, PDF files, text files; e-mail body; data from enterprise systems (ERP, CRM, etc.).
  • 10. Annie's Answer: XML files -> semi-structured data. Word docs, PDF files, text files -> unstructured data. E-mail body -> unstructured data. Data from enterprise systems (ERP, CRM, etc.) -> structured data.
  • 11. Further Reading: More on Big Data: http://www.edureka.in/blog/the-hype-behind-big-data/ ; Hadoop Cluster Configuration Files: http://www.edureka.in/blog/hadoop-cluster-configuration-files/ ; Opportunities in Hadoop: http://www.edureka.in/blog/jobs-in-hadoop/ ; Big Data: http://en.wikipedia.org/wiki/Big_Data
  • 12. Common Big Data Customer Scenarios. Web and e-tailing: recommendation engines; ad targeting; search quality; abuse and click fraud detection. Telecommunications: customer churn prevention; network performance optimization; Calling Data Record (CDR) analysis; analyzing the network to predict failure. http://wiki.apache.org/hadoop/PoweredBy
  • 13. Common Big Data Customer Scenarios (Contd.). Government: fraud detection and cyber security; welfare schemes; justice. Healthcare and life sciences: health information exchange; gene sequencing; serialization; healthcare service quality improvements; drug safety. http://wiki.apache.org/hadoop/PoweredBy
  • 14. Common Big Data Customer Scenarios (Contd.). Banks and financial services: modeling true risk; threat analysis; fraud detection; trade surveillance; credit scoring and analysis. Retail: point-of-sale transaction analysis; customer churn analysis; sentiment analysis. http://wiki.apache.org/hadoop/PoweredBy
  • 15. Hidden Treasure. Case study: Sears Holding Corporation. Insight into data can provide a business advantage. Some key early indicators can mean fortunes to a business. More data allows more precise analysis. *Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process its customer activity and sales data.
  • 16. Limitations of Existing Data Analytics Architecture. Diagram layers, top to bottom: BI reports and interactive apps; RDBMS (aggregated data); ETL compute grid (processing); storage-only grid holding the original raw data (storage, mostly append); collection; instrumentation. A meagre 10% of the ~2 PB of data is available for BI; 90% of the ~2 PB is archived. Problems: 1. Can't explore the original high-fidelity raw data. 2. Moving data to compute doesn't scale. 3. Premature data death. http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
  • 17. Solution: A Combined Storage and Compute Layer. Diagram layers, top to bottom: BI reports and interactive apps; RDBMS (aggregated data); Hadoop as a storage + compute grid (both storage and processing, mostly append); collection; instrumentation. No data archiving: the entire ~2 PB of data is available for processing. Benefits: 1. Data exploration and advanced analytics. 2. Scalable throughput for ETL and aggregation. 3. Keep data alive forever. *Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with its existing non-Hadoop solutions.
  • 19. Hadoop: It's About Scale and Structure. Comparison of RDBMS / EDW / MPP with Hadoop / NoSQL. Data types: structured vs. multi- and unstructured. Processing: limited, no data processing vs. processing coupled with data. Governance: standards and structured vs. loosely structured. Schema: required on write vs. required on read. Speed: reads are fast vs. writes are fast. Cost: software license vs. support only. Resources: known entity vs. growing, complexities, wide. Best-fit use: interactive OLAP analytics, complex ACID transactions and operational data stores vs. data discovery, processing unstructured data and massive storage/processing.
  • 20. Why DFS? Task: read 1 TB of data. 1 machine: 4 I/O channels, each channel 100 MB/s. 10 machines: 4 I/O channels per machine, each channel 100 MB/s.
  • 21. Why DFS? Task: read 1 TB of data. 1 machine (4 I/O channels, each channel 100 MB/s): 45 minutes. 10 machines (4 I/O channels per machine, each channel 100 MB/s).
  • 22. Why DFS? Task: read 1 TB of data. 1 machine (4 I/O channels, each channel 100 MB/s): 45 minutes. 10 machines (4 I/O channels per machine, each channel 100 MB/s): 4.5 minutes.
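The slide's figures can be checked with simple arithmetic. The sketch below assumes a decimal terabyte (1,000,000 MB) and that every channel reads in parallel at its full 100 MB/s; under those assumptions a single machine needs roughly 42 minutes, which the slide rounds to 45, and ten machines need one tenth of that. The function name is illustrative only.

```python
def read_time_minutes(data_mb: float, machines: int,
                      channels_per_machine: int = 4,
                      mb_per_s_per_channel: float = 100.0) -> float:
    """Idealised scan time when all channels on all machines read in parallel."""
    aggregate_mb_per_s = machines * channels_per_machine * mb_per_s_per_channel
    return data_mb / aggregate_mb_per_s / 60.0


ONE_TB_MB = 1_000_000  # assumption: decimal terabyte

single = read_time_minutes(ONE_TB_MB, machines=1)
cluster = read_time_minutes(ONE_TB_MB, machines=10)
print(f"1 machine: {single:.1f} min, 10 machines: {cluster:.1f} min")
# prints "1 machine: 41.7 min, 10 machines: 4.2 min"
```

The point of the slide is the ratio rather than the absolute numbers: spreading the same data over ten machines divides the read time by ten.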
  • 23. What Is Hadoop? Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is an open-source data management framework with scale-out storage and distributed processing.
  • 25. Annie's Question: Hadoop is a framework that allows for the distributed processing of: small data sets, or large data sets?
  • 26. Annie's Answer: Large data sets. Hadoop can also process small data sets, but to experience its true power one needs terabytes of data, because this is where an RDBMS takes hours or fails, whereas Hadoop does the same work in a couple of minutes.
  • 27. Hadoop Ecosystem: Apache Oozie (workflow); Hive (DW system); Pig Latin (data analysis); Mahout (machine learning); MapReduce framework; HBase; HDFS (Hadoop Distributed File System); Flume (import of unstructured or semi-structured data); Sqoop (import or export of structured data).
  • 28. Machine Learning with Mahout: Write intelligent applications using Apache Mahout. Example: LinkedIn recommendations, Hadoop and MapReduce magic in action. https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
  • 29. Hadoop Core Components: Hadoop is a system for large-scale data processing. It has two main components. HDFS, the Hadoop Distributed File System (storage): self-healing, high-bandwidth clustered storage; distributed across "nodes"; natively redundant; the NameNode tracks locations. MapReduce (processing): splits a task across processors "near" the data and assembles the results; the JobTracker manages the TaskTrackers.
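The "split a task, process it near the data, assemble results" idea can be illustrated with an in-memory word count. This is a plain-Python sketch, not the Hadoop API: each inner list stands in for one input split, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as happens between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(splits: List[List[str]]) -> Dict[str, int]:
    """Run map over every split, then shuffle and reduce the results."""
    mapped = [pair for split in splits for line in split for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count([["big data", "Big cluster"], ["data node"]])
print(counts)
```

In a real cluster each split would be mapped on a machine that already stores that block, which is what "near the data" refers to.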
  • 30. Hadoop Core Components (Contd.). MapReduce engine: a JobTracker on the admin node and a TaskTracker on each worker. HDFS cluster: a NameNode on the admin node and a DataNode on each worker.
  • 31. HDFS Architecture. Diagram: a client sends metadata operations to the NameNode, which holds the metadata (name, replicas, ...), e.g. /home/foo/data, 3, ...; the NameNode sends block operations to the DataNodes; clients read from and write blocks to the DataNodes directly; blocks are replicated between DataNodes in Rack 1 and Rack 2.
  • 32. Main Components of HDFS. NameNode: master of the system; maintains and manages the blocks which are present on the DataNodes. DataNodes: slaves which are deployed on each machine and provide the actual storage; responsible for serving read and write requests for the clients.
  • 33. NameNode Metadata. Metadata in memory: the entire metadata is in main memory; there is no demand paging of FS metadata. Types of metadata: list of files; list of blocks for each file; list of DataNodes for each block; file attributes, e.g. access time, replication factor. A transaction log records file creations, file deletions, etc. Example (the NameNode stores metadata only): /user/doug/hinfo -> 1 3 5; /user/doug/pdetail -> 4 2. The NameNode keeps track of the overall file directory structure and the placement of data blocks.
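The metadata types listed on this slide map naturally onto two in-memory dictionaries. The following is a toy model only: the class and method names are hypothetical, the file-to-block entries are taken from the slide's example, and the DataNode names (`dn-a`, `dn-b`, `dn-c`) are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class NameNodeMetadata:
    """Toy in-memory model: which blocks make up a file, and where each block lives."""
    file_to_blocks: Dict[str, List[int]] = field(default_factory=dict)
    block_to_datanodes: Dict[int, List[str]] = field(default_factory=dict)

    def add_file(self, path: str, block_ids: List[int]) -> None:
        self.file_to_blocks[path] = list(block_ids)

    def report_block(self, block_id: int, datanode: str) -> None:
        """Record that a DataNode holds a replica of the given block."""
        self.block_to_datanodes.setdefault(block_id, []).append(datanode)

    def locate(self, path: str) -> List[Tuple[int, List[str]]]:
        """Return each block of the file with the DataNodes known to hold it."""
        return [(b, self.block_to_datanodes.get(b, [])) for b in self.file_to_blocks[path]]


meta = NameNodeMetadata()
meta.add_file("/user/doug/hinfo", [1, 3, 5])   # example from the slide
meta.add_file("/user/doug/pdetail", [4, 2])    # example from the slide
meta.report_block(1, "dn-a")                   # hypothetical DataNode names
meta.report_block(1, "dn-b")
meta.report_block(3, "dn-c")
print(meta.locate("/user/doug/hinfo"))
```

Note that the model holds no file contents at all, which mirrors the slide's remark that the NameNode stores metadata only.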
  • 34. Secondary NameNode. The NameNode is a single point of failure. The Secondary NameNode is not a hot standby for the NameNode. It connects to the NameNode every hour* ("You give me metadata every hour, I will make it secure"), performs housekeeping and keeps a backup of the NameNode metadata. The saved metadata can be used to rebuild a failed NameNode.
  • 35. Annie's Question: The NameNode: a) is the "single point of failure" in a cluster; b) runs on "enterprise-class" hardware; c) stores metadata; d) all of the above.
  • 36. Annie's Answer: All of the above. The NameNode stores metadata and runs on reliable, high-quality hardware because it is a single point of failure in a Hadoop cluster.
  • 37. Annie's Question: When the NameNode fails, the Secondary NameNode takes over instantly and prevents cluster failure: a) TRUE b) FALSE
  • 38. Annie's Answer: False. The Secondary NameNode is used for creating NameNode checkpoints. The NameNode can be recovered manually using the "edits" and "FSImage" stored in the Secondary NameNode.
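A checkpoint can be thought of as replaying the transaction log ("edits") on top of the last saved namespace image ("FSImage"). The sketch below assumes a heavily simplified model in which the image is a dictionary from path to block IDs and each edit is a `("create", path, blocks)` or `("delete", path)` tuple; the real on-disk formats are different, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

Edit = Tuple  # ("create", path, [block ids]) or ("delete", path); simplified assumption


def checkpoint(fsimage: Dict[str, List[int]], edits: List[Edit]) -> Dict[str, List[int]]:
    """Return a new image with every logged edit applied in order."""
    merged = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edits:
        if edit[0] == "create":
            _, path, blocks = edit
            merged[path] = list(blocks)
        elif edit[0] == "delete":
            merged.pop(edit[1], None)
        else:
            raise ValueError(f"unknown edit operation: {edit[0]!r}")
    return merged


old_image = {"/logs/day1": [1, 2]}
edit_log = [("create", "/logs/day2", [3]), ("delete", "/logs/day1")]
new_image = checkpoint(old_image, edit_log)
print(new_image)
```

Once the merged image is saved, the replayed portion of the log is no longer needed, which is the housekeeping role described above.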
  • 39. JobTracker. Participants: user, client, DFS, JobTracker. Steps: 1. Copy input files to DFS. 2. Submit job. 3. Get input files' info. 4. Create splits. 5. Upload job information (Job.xml, Job.jar) to DFS. 6. Submit job to the JobTracker.
  • 40. JobTracker (Contd.). Steps: 6. Submit job to the JobTracker. 7. Initialize job (job queue). 8. Read job files (input splits, Job.xml, Job.jar) from DFS. 9. Create maps and reduces: as many maps as splits.
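"As many maps as splits" means the number of map tasks follows from the input size. The sketch below assumes that one split corresponds to one 64 MB block; the block size is not stated on the slide, and the real input-format code applies further rules that are ignored here. `num_map_tasks` is a hypothetical helper.

```python
import math
from typing import Iterable

SPLIT_MB = 64  # assumption: one split per 64 MB block


def num_map_tasks(file_sizes_mb: Iterable[float], split_mb: float = SPLIT_MB) -> int:
    """One map task per split; every non-empty file contributes at least one split."""
    return sum(math.ceil(size / split_mb) for size in file_sizes_mb if size > 0)


print(num_map_tasks([400]))          # 7: six full splits plus one partial split
print(num_map_tasks([400, 64, 10]))  # 9
```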
  • 41. JobTracker (Contd.). Steps: 10. Each TaskTracker (on hosts H1 to H4) sends a heartbeat to the JobTracker. 11. The JobTracker picks tasks from the job queue (data-local if possible). 12. The JobTracker assigns the tasks to the TaskTrackers.
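Step 11, "data-local if possible", can be sketched as a small selection rule applied whenever a TaskTracker heartbeat arrives. This is an illustration under simplifying assumptions, not the actual scheduler: `pending` maps a hypothetical task ID to the set of hosts that store its input split, and the fallback simply takes the oldest pending task.

```python
from typing import Dict, Optional, Set


def pick_task(pending: Dict[str, Set[str]], tracker_host: str) -> Optional[str]:
    """Choose a task for the TaskTracker that just sent a heartbeat.

    Prefer a task whose input split is stored on the same host;
    otherwise fall back to the first pending task, if any.
    """
    local = next((task for task, hosts in pending.items() if tracker_host in hosts), None)
    chosen = local if local is not None else next(iter(pending), None)
    if chosen is not None:
        pending.pop(chosen)  # the task is now assigned
    return chosen


pending_tasks = {"map-1": {"H1", "H3"}, "map-2": {"H4"}, "map-3": {"H5"}}
first = pick_task(pending_tasks, "H4")   # data-local: map-2 lives on H4
second = pick_task(pending_tasks, "H2")  # nothing local to H2, so the oldest task is used
print(first, second)
```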
  • 42. Annie's Question: Which of the following daemons does the Hadoop framework use for scheduling a task? a) NameNode b) DataNode c) TaskTracker d) JobTracker
  • 43. Annie's Answer: The JobTracker takes care of all the job scheduling and assigns tasks to the TaskTrackers.
  • 44. Anatomy of a File Write. Participants: HDFS client, Distributed File System, NameNode, pipeline of DataNodes. Steps: 1. Create (HDFS client to Distributed File System). 2. Create (Distributed File System to NameNode). 3. Write. 4. Write packet (forwarded along the pipeline of DataNodes). 5. Ack packet (returned along the pipeline). 6. Close. 7. Complete (reported to the NameNode).
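Steps 4 and 5 describe a pipeline: each packet is forwarded from one DataNode to the next, and acknowledgements travel back in the opposite direction. The sketch below models only that ordering, with in-memory lists standing in for DataNode storage. The names and the 4-byte toy packet size are assumptions chosen to keep the example small.

```python
from typing import Dict, List


def write_packet(packet: bytes, pipeline: List[str],
                 stored: Dict[str, List[bytes]]) -> List[str]:
    """Forward one packet down the pipeline and return the order in which acks travel back."""
    for datanode in pipeline:            # step 4: each node stores the packet, then forwards it
        stored.setdefault(datanode, []).append(packet)
    return list(reversed(pipeline))      # step 5: acks flow from the last node back to the first


def write_file(data: bytes, pipeline: List[str], packet_size: int = 4) -> Dict[str, bytes]:
    """Split data into packets, push each through the pipeline, return what each node holds."""
    stored: Dict[str, List[bytes]] = {}
    for offset in range(0, len(data), packet_size):
        write_packet(data[offset:offset + packet_size], pipeline, stored)
    return {node: b"".join(packets) for node, packets in stored.items()}


nodes = ["dn1", "dn2", "dn3"]  # hypothetical DataNode names
replicas = write_file(b"hello hdfs!", nodes)
print(replicas)
```

Every DataNode in the pipeline ends up with a full copy of the data, which is how a single client write produces all replicas.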
  • 45. Anatomy of a File Read. Participants: HDFS client, Distributed File System, NameNode, DataNodes. Steps: 1. Open (HDFS client to Distributed File System). 2. Get block locations (from the NameNode). 3. Read. 4. Read (from the first DataNode). 5. Read (from the next DataNode).
  • 46. Replication and Rack Awareness
  • 47. Annie's Question: In HDFS, blocks of a file are written in parallel; however, the replication of the blocks is done sequentially: a) TRUE b) FALSE
  • 48. Annie's Answer: True. A file is divided into blocks; these blocks are written in parallel, but the block replication happens in sequence.
  • 49. Annie's Question: A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file? a) It can read up to the last block that was successfully written. b) It can read up to the last bit successfully written. c) It will throw an exception. d) It cannot see the file until it has finished copying.
  • 50. Annie's Answer: The client can read up to the last successfully written data block. The answer is (a).
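Following the slide's answer, the amount a reader can see is the copied data rounded down to a whole number of blocks. The sketch below assumes a 64 MB block size, which the slide does not state; `readable_mb` is a hypothetical helper.

```python
BLOCK_MB = 64  # assumption: the slide does not state the block size


def readable_mb(copied_mb: int, block_mb: int = BLOCK_MB) -> int:
    """Data visible to a reader if only completely written blocks can be read."""
    return (copied_mb // block_mb) * block_mb


visible = readable_mb(250)
print(visible)  # 192: three complete 64 MB blocks; the fourth block is still being written
```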
  • 51. Hadoop 2.x (YARN or MRv2). HDFS: an Active NameNode and a Standby NameNode with shared edit logs; all namespace edits are logged to shared NFS storage by a single writer (fencing); the Standby NameNode reads the edit logs and applies them to its own namespace; the diagram also shows a Secondary NameNode, the DataNodes and a client. YARN: a Resource Manager, plus a Node Manager alongside each DataNode running containers and App Masters.
  • 52. Further Reading: Apache Hadoop and HDFS: http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/ ; Apache Hadoop HDFS Architecture: http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/ ; Hadoop 2.0 and YARN: http://www.edureka.in/blog/apache-hadoop-2-0-and-yarn/
  • 53. Module-2 Pre-work: Set up the Hadoop development environment using the documents present in the LMS: Hadoop Installation (set up the Cloudera CDH3 Demo VM); Hadoop Installation (set up the Cloudera CDH4 QuickStart VM); execute basic Linux commands; execute the HDFS hands-on commands. Attempt the Module-1 assignments present in the LMS.
  • 54. Thank You. See you in class next week.

Editor's Notes

  1. Web and e-tailing: eBay is using Hadoop technology and the HBase database, which supports real-time analysis of Hadoop data, to build a new search engine for its auction site. eBay has more than 97 million active buyers and sellers and over 200 million items for sale in 50,000 categories. The site handles close to 2 billion page views, 250 million search queries and tens of billions of database calls daily. The company has 9 petabytes of data stored on Hadoop and Teradata clusters, and the amount is growing quickly. Telecommunications: China Mobile runs a data mining platform for the telecom industry, handling 5-8 TB/day of CDR and network signaling data. Current solutions such as Oracle DB, SAS (data mining), Unix servers and SAN aren't sufficient to store and process such a vast amount of data. Faster data processing is needed for precision marketing, network optimization, service optimization and log processing.
  2. Government: AADHAR by the Govt. of India; 5 MB of data per resident, which maps to about 10-15 PB of raw data. The Hadoop stack: HDFS (Hadoop Distributed File System) is used to provide high data read/write throughput in the order of many terabytes per day. The distributed architecture enables scale-out as needed. Hive is used for building the UIDAI data warehouse, HBase for indexed lookup of records across millions of rows, ZooKeeper as a distributed coordination service for server instances, and Pig as an ETL (extract, transform and load) solution for loading data into Hive. Healthcare and life sciences: Life sciences research firm NextBio uses Hadoop and HBase to help big pharmaceutical companies conduct genomic research. The company embraced Hadoop in 2009 to make the sheer scale of genomic data analysis more affordable. The company's core 100-node Hadoop cluster, which has processed as much as 100 terabytes of data, is used to compare data from drug studies to publicly available genomics data. Given that there are tens of thousands of such studies and 3.2 billion base pairs behind each of the hundreds of genomes that NextBio studies, it's clearly a big-data challenge.
  3. Banks and financial services: 3 of the top 5 banks run Cloudera Hadoop. JPMorgan Chase uses Hadoop technology for a growing number of purposes, including fraud detection, IT risk management and self service. With over 150 petabytes of data stored online, 30,000 databases and 3.5 billion log-ins to user accounts, data is the lifeblood of JPMorgan Chase. Retail: Sears is an American multinational mid-range department store chain headquartered in Hoffman Estates, Illinois. It moved to Hadoop from Teradata and SAS to avoid archiving and deleting its meaningful sales and other customer activity data. A 300-node Hadoop cluster helps Sears keep 100% of its data (~2 PB) available to BI, rather than the meager 10% available with non-Hadoop solutions. Walmart migrated data from its existing Oracle, Netezza and Greenplum gear to its 250-node Hadoop cluster.
  4. Why Oracle, HP, IBM and other Enterprise Technology giants are in Red on growth. Sears:Sears wanted to personalize marketing campaigns, coupons, and offers down to the individual customer, but the existing legacy systems were incapable of supporting that.Sears' process for analyzing marketing campaigns for loyalty club members used to take six weeks on mainframe, Teradata, and SAS servers. The new process running on Hadoop can be completed weekly. For certain online and mobile commerce scenarios, Sears can now perform daily analyses. What's more, targeting is more granular, in some cases down to the individual customer. Moving up the stack, Sears is consolidating its databases to MySQL, InfoBright, and Teradata--EMC Greenplum, Microsoft SQL Server, and Oracle (including four Exadata boxes) are on their way out.Sears routinely replaces legacy Unix systems with Linux rather than upgrade them, and it has retired most of its Sun and HP-UX servers. Microsoft server and development technologies are also on the way out.
  5. - 2 PB of data, mostly structured and unstructured data such as customer transaction, point of sale, and supply chain data.
     - Because of archiving needs, 90% of the ~2 PB of data is not available for BI.
  6. A 300-node Hadoop cluster lets Sears keep 100% of its data (~2 PB) available to BI, rather than the meager 10% that was the case with non-Hadoop solutions. Sears now keeps all of its data down to individual transactions (rather than aggregates) and years of history (rather than imposing quarterly windows on certain data, as it did previously). That's raw data, which Sears can refactor and combine as needed, quickly and efficiently, within Hadoop.
     To give a sense of how early Sears was to Hadoop development, Wal-Mart divulged early this year that it was scaling out an experimental 10-node Hadoop cluster for e-commerce analysis. Sears passed that size in 2010.
     Sears has its own Hadoop solutions subsidiary, MetaScale, which provides Hadoop services to other retail companies along the lines of AWS.
  7. Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon EC2.
     Robust: Because Hadoop runs on commodity clusters, it is designed with the assumption of frequent hardware failure. It handles such failures gracefully, and computation doesn't stop because of a few failed devices or systems.
     Scalable: Hadoop scales linearly to handle larger data by adding more slave nodes to the cluster.
     Simple: It's easy to write efficient parallel programs with Hadoop.
  8. We will cover other Hadoop Components in detail in future sessions of this course.
  9. 1. Data is transferred from the DataNode to the MapTask process. DBlk is the file data block; CBlk is the file checksum block. File data are transferred to the client through Java NIO transferTo (aka the UNIX sendfile syscall). Checksum data are first fetched into a DataNode JVM buffer and then pushed to the client (details are not shown). Both file data and checksum data are bundled in an HDFS packet (typically 64KB) in the format: {packet header | checksum bytes | data bytes}.
     2. Data received from the socket are buffered in a BufferedInputStream, presumably to reduce the number of syscalls to the kernel. This actually involves two buffer-copies: first, data are copied from kernel buffers into a temporary direct buffer in JDK code; second, data are copied from the temporary direct buffer to the byte[] buffer owned by the BufferedInputStream. The size of the byte[] in BufferedInputStream is controlled by the configuration property "io.file.buffer.size" and defaults to 4K. In our production environment, this parameter is customized to 128K.
     3. Through the BufferedInputStream, the checksum bytes are saved into an internal ByteBuffer (whose size is roughly (PacketSize / 512 * 4) or 512B), and file bytes (compressed data) are deposited into the byte[] buffer supplied by the decompression layer. Since the checksum calculation requires a full 512-byte chunk while a user's request may not be aligned with a chunk boundary, a 512B byte[] buffer is used to align the input before copying partial chunks into the user-supplied byte[] buffer. Also note that data are copied to the buffer in 512-byte pieces (as required by the FSInputChecker API). Finally, all checksum bytes are copied to a 4-byte array for FSInputChecker to perform checksum verification. Overall, this step involves an extra buffer-copy.
     4. The decompression layer uses a byte[] buffer to receive data from the DFSClient layer. The DecompressorStream copies the data from the byte[] buffer to a 64K direct buffer, calls the native library code to decompress the data, and stores the uncompressed bytes in another 64K direct buffer. This step involves two buffer-copies.
     5. LineReader maintains an internal buffer to absorb data from the downstream. From the buffer, line separators are discovered and line bytes are copied to form Text objects. This step requires two buffer-copies.
     The client creates the file by calling create() on DistributedFileSystem (step 1). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file.
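     The checksum buffer size quoted in the read-path notes above, roughly PacketSize / 512 * 4, can be checked with a little arithmetic. The sketch below is plain Python, not Hadoop code. It assumes one 4-byte CRC32 per 512-byte chunk (the 4 in the formula suggests this; zlib.crc32 is only a stand-in for whatever checksum the DataNode actually computes) and treats the whole 64KB packet as data, as the formula does. The helper name chunk_checksums is hypothetical.

```python
import struct
import zlib

BYTES_PER_CHECKSUM = 512   # chunk size from the notes
CHECKSUM_SIZE = 4          # assumed: one 32-bit CRC per chunk
PACKET_SIZE = 64 * 1024    # "typically 64KB" in the notes


def chunk_checksums(data: bytes) -> bytes:
    """Concatenate a 4-byte CRC32 for each 512-byte chunk (hypothetical helper)."""
    out = bytearray()
    for offset in range(0, len(data), BYTES_PER_CHECKSUM):
        chunk = data[offset:offset + BYTES_PER_CHECKSUM]
        out += struct.pack(">I", zlib.crc32(chunk) & 0xFFFFFFFF)
    return bytes(out)


# Size predicted by the formula in the notes: PacketSize / 512 * 4
expected_checksum_bytes = PACKET_SIZE // BYTES_PER_CHECKSUM * CHECKSUM_SIZE

packet_data = bytes(PACKET_SIZE)              # dummy payload of zero bytes
checksum_bytes = chunk_checksums(packet_data)

print(expected_checksum_bytes, len(checksum_bytes))
```

     Both numbers come out to 512, which matches the "or 512B" figure in step 3: 128 chunks of 512 bytes, each contributing 4 checksum bytes.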
  10. The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1). DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
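     The lookup in step 2 can be pictured with a toy in-memory model. This is a sketch in plain Python, not the HDFS client API: ToyNameNode, get_block_locations, the file path, and the datanode addresses are all made up for illustration, and the first_n argument is a hypothetical stand-in for "the first few blocks".

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyNameNode:
    """Hypothetical stand-in for the namenode's block map (not a Hadoop class)."""
    # path -> list of blocks; each block is the list of datanodes holding a replica
    block_map: Dict[str, List[List[str]]] = field(default_factory=dict)

    def get_block_locations(self, path: str, first_n: int) -> List[List[str]]:
        """Return datanode addresses for the first `first_n` blocks of `path`."""
        if path not in self.block_map:
            raise FileNotFoundError(path)
        return self.block_map[path][:first_n]


# A file with three blocks, each replicated on three made-up datanodes.
namenode = ToyNameNode({
    "/logs/app.log": [
        ["dn1:50010", "dn2:50010", "dn3:50010"],
        ["dn2:50010", "dn4:50010", "dn5:50010"],
        ["dn1:50010", "dn3:50010", "dn5:50010"],
    ]
})

# Step 2: the client asks only for the first few blocks of the file.
locations = namenode.get_block_locations("/logs/app.log", first_n=2)
for index, datanodes in enumerate(locations):
    print(f"block {index}: {datanodes}")
```

     The point of the model is the shape of the answer: one list of datanode addresses per block, so the client can pick a replica for each block before reading any data.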