2. Learning Objectives and Learning Outcomes
Introduction to Hadoop
1. To study the features of Hadoop.
2. To learn the basic concepts of HDFS and
MapReduce Programming.
3. To study HDFS Architecture.
4. To study the MapReduce Programming Model.
5. To study Hadoop Ecosystem.
a) To comprehend the reasons behind the popularity of Hadoop.
b) To be able to perform HDFS operations.
c) To comprehend the MapReduce framework.
d) To understand read and write operations in HDFS.
e) To be able to understand the Hadoop Ecosystem.
3. Agenda
Hadoop - An Introduction
RDBMS versus Hadoop
Distributed Computing Challenges
History of Hadoop
Hadoop Overview
Key Aspects of Hadoop
Hadoop Components
High Level Architecture of Hadoop
Use Case for Hadoop: ClickStream Data
Hadoop Distributors
HDFS
HDFS Daemons
Anatomy of File Read
Anatomy of File Write
Replica Placement Strategy
Working with HDFS Commands
Special Features of HDFS
4. Agenda
Processing Data with Hadoop
What is MapReduce Programming?
How Does MapReduce Work?
MapReduce Word Count Example
Managing Resources and Applications with Hadoop YARN
Limitations of Hadoop 1.0 Architecture
Hadoop 2 YARN: Taking Hadoop Beyond Batch
Hadoop Ecosystem
Pig
Hive
Sqoop
HBase
5. Hadoop – An Introduction
Hadoop is an open-source distributed computing framework used for storing and processing large volumes of data.
It is designed to run on a cluster of commodity hardware, and its main components include a distributed file system (the Hadoop Distributed File System, or HDFS) and a parallel processing framework (MapReduce).
Its capability to handle massive amounts of data of different categories is a key reason for its popularity.
6. What is Hadoop?
Hadoop is an open-source, Java-based framework from Apache used for storing, processing, and analyzing very large volumes of data.
Hadoop is used for batch/offline processing.
It is a collection of software utilities that uses a network of many computers to solve problems involving large amounts of data and computation.
12. History of Hadoop
Hadoop was created by Doug Cutting and Mike Cafarella in 2005, inspired by Google's MapReduce and Google File System (GFS) technologies.
13. Is there any full form of HADOOP?
No. Hadoop is not an acronym; the name came from a toy elephant belonging to Doug Cutting's son.
Doug used the name for his open-source project because it was relatively easy to spell and pronounce, meaningless, and not used elsewhere.
16. Hadoop Components
HBase is (mostly) a distributed key-value store; Hive is a system for executing SQL-like queries on data stored in Hadoop; Pig provides a high-level scripting language for analyzing big data. Apache Sqoop is a tool that is extensively used to transfer large amounts of data between Hadoop and relational database servers, in both directions.
22. Hadoop High Level Architecture
Every Hadoop cluster consists of a single master and multiple worker nodes.
In a small cluster, the master node runs the JobTracker, TaskTracker, NameNode, and DataNode, while a slave (worker) node acts as both a DataNode and a TaskTracker.
It is also possible to have data-only and compute-only worker nodes.
23. Modules of Hadoop
The Hadoop framework is composed of the following modules:
Hadoop Distributed File System (HDFS): Files are broken into blocks and stored across the nodes of a distributed architecture. Using a distributed file system provides very high aggregate bandwidth across the cluster.
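The block-splitting idea above can be sketched as follows. This is a simplified illustration, not the actual HDFS implementation; the 128 MB figure is HDFS's default block size, and the function name is hypothetical.

```python
# Sketch: how a file is logically divided into fixed-size blocks,
# as HDFS does before distributing the blocks across DataNodes.
# Illustration only -- not the real HDFS code.

DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Each (offset, length) pair would then be stored on several DataNodes according to the replication factor, which is why aggregate read bandwidth scales with the cluster.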
24. Modules of Hadoop
The Hadoop framework is composed of the following modules :
Hadoop Distributed File System (HDFS)
Hadoop YARN (Yet Another Resource Negotiator): Used for job scheduling and managing the computing resources in clusters.
25. Modules of Hadoop
The Hadoop framework is composed of the following modules :
Hadoop Distributed File System (HDFS)
Hadoop Yarn (Yet Another Resource Negotiator)
Hadoop MapReduce: A programming model that splits a task into small pieces, distributes those pieces to many computers joined over the network, and assembles all the results to form the final output dataset.
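The split-distribute-assemble flow described above can be sketched as a local word-count simulation. This is plain Python run on one machine, not the actual Hadoop Java API; the phase function names are illustrative.

```python
from collections import defaultdict

# Sketch of the MapReduce flow on a single machine: the map phase
# emits (word, 1) pairs, the shuffle phase groups pairs by key, and
# the reduce phase sums each group. In a real Hadoop job these
# phases run distributed across the cluster's nodes.

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in an input line."""
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hello hadoop", "hello mapreduce"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle_phase(pairs))
# counts == {"hello": 2, "hadoop": 1, "mapreduce": 1}
```

The same three phases appear in the classic Hadoop word-count example, where the mapper and reducer are written as Java classes and the framework performs the shuffle between them.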
26. Modules of Hadoop
The Hadoop framework is composed of the following modules :
Hadoop Distributed File System (HDFS)
Hadoop Yarn (Yet Another Resource Negotiator)
Hadoop MapReduce
Hadoop Common: Includes the Java libraries used to start Hadoop and the utilities needed by the other Hadoop modules.
27. ClickStream Data Analysis
ClickStream data (records of users' mouse clicks) helps you understand the purchasing behavior of customers.
ClickStream analysis helps online marketers optimize their product web pages, promotional content, etc., to improve their business.
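As a minimal sketch of the kind of aggregation such an analysis performs, the snippet below counts page views per URL. The record format and field names here are hypothetical, and a real clickstream job would run this aggregation as a MapReduce job over HDFS rather than in memory.

```python
from collections import Counter

# Sketch: count page views per URL from clickstream records.
# Each record is a (user_id, page_url) pair; the format is illustrative.
clicks = [
    ("u1", "/product/42"),
    ("u2", "/product/42"),
    ("u1", "/cart"),
    ("u3", "/product/7"),
]

def views_per_page(clicks):
    """Aggregate raw click records into per-page view counts."""
    return Counter(page for _user, page in clicks)

views = views_per_page(clicks)
# views["/product/42"] == 2
```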