2. Learning Objectives and Learning Outcomes
Introduction to Hadoop
1. To study the features of Hadoop.
2. To learn the basic concepts of HDFS and
MapReduce Programming.
3. To study HDFS Architecture.
4. To study the MapReduce Programming Model.
5. To study Hadoop Ecosystem.
a) To comprehend the reasons behind the popularity of Hadoop.
b) To be able to perform HDFS operations.
c) To comprehend the MapReduce framework.
d) To understand read and write operations in HDFS.
e) To be able to understand the Hadoop Ecosystem.
3. Agenda
Hadoop - An Introduction
RDBMS versus Hadoop
Distributed Computing Challenges
History of Hadoop
Hadoop Overview
Key Aspects of Hadoop
Hadoop Components
High Level Architecture of Hadoop
Use Case for Hadoop: ClickStream Data
Hadoop Distributors
HDFS
HDFS Daemons
Anatomy of File Read
Anatomy of File Write
Replica Placement Strategy
Working with HDFS Commands
Special Features of HDFS
4. Agenda
Processing Data with Hadoop
What is MapReduce Programming?
How Does MapReduce Work?
MapReduce Word Count Example
Managing Resources and Applications with Hadoop YARN
Limitations of Hadoop 1.0 Architecture
Hadoop 2 YARN: Taking Hadoop Beyond Batch
Hadoop Ecosystem
Pig
Hive
Sqoop
HBase
5. Hadoop – An Introduction
Hadoop is an open-source distributed computing framework used for storing and processing large volumes of data.
It is designed to run on a cluster of commodity hardware, and its main components include a distributed file system (the Hadoop Distributed File System, or HDFS) and a parallel processing framework (MapReduce).
Its capability to handle massive amounts of data of different categories is a key reason for its popularity.
6. What is Hadoop?
Hadoop is an open-source, Java-based framework from Apache used for storing, processing, and analyzing very large volumes of data.
Hadoop is used for batch/offline processing.
It is a collection of software utilities that uses a network of many computers to solve problems involving large amounts of data and computation.
12. History of Hadoop
Hadoop was created by Doug Cutting and Mike Cafarella in 2005, inspired by Google's MapReduce and Google File System (GFS) technologies.
13. Is there any full form of HADOOP?
No. Hadoop is not an acronym; the name came from a toy elephant belonging to Doug Cutting's son.
Doug used the name for his open-source project because it was relatively easy to spell and pronounce, meaningless, and not used elsewhere.
16. Hadoop Components
HBase is (mostly) a distributed key-value store; Hive is a system for executing SQL-like queries on data stored in Hadoop; Pig provides a high-level scripting language for analyzing big data. Apache Sqoop is a tool that is extensively used to transfer large amounts of data between Hadoop and relational database servers, in both directions.
22. Hadoop High Level Architecture
Every Hadoop cluster consists of a single master and multiple worker nodes.
In a small cluster, the master node runs the JobTracker, TaskTracker, NameNode, and DataNode, while a slave (worker) node acts as both a DataNode and a TaskTracker.
It is also possible to have data-only and compute-only worker nodes.
23. Modules of Hadoop
The Hadoop framework is composed of the following modules:
Hadoop Distributed File System (HDFS): Files are broken into blocks and stored across the nodes of a distributed architecture. Using a distributed file system provides very high aggregate bandwidth across the cluster.
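The block-splitting idea above can be sketched as follows. This is a simplified illustration, not the actual HDFS implementation; the 128 MB figure is HDFS's default block size, and the function name is hypothetical.

```python
# Sketch: how a file is logically divided into fixed-size blocks,
# as HDFS does before distributing the blocks across DataNodes.
# Illustration only -- not the real HDFS code.

DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Each (offset, length) pair would then be stored on several DataNodes according to the replication factor, which is why aggregate read bandwidth scales with the cluster.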
24. Modules of Hadoop
The Hadoop framework is composed of the following modules :
Hadoop Distributed File System (HDFS)
Hadoop YARN (Yet Another Resource Negotiator): Used for job scheduling and managing the computing resources in clusters.
25. Modules of Hadoop
The Hadoop framework is composed of the following modules :
Hadoop Distributed File System (HDFS)
Hadoop Yarn (Yet Another Resource Negotiator)
Hadoop MapReduce: A programming model that splits a task into small pieces, distributes those pieces to many computers joined over the network, and assembles all the results to form the final output dataset.
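The split-distribute-assemble flow described above can be sketched as a local word-count simulation. This is plain Python run on one machine, not the actual Hadoop Java API; the phase function names are illustrative.

```python
from collections import defaultdict

# Sketch of the MapReduce flow on a single machine: the map phase
# emits (word, 1) pairs, the shuffle phase groups pairs by key, and
# the reduce phase sums each group. In a real Hadoop job these
# phases run distributed across the cluster's nodes.

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in an input line."""
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hello hadoop", "hello mapreduce"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle_phase(pairs))
# counts == {"hello": 2, "hadoop": 1, "mapreduce": 1}
```

The same three phases appear in the classic Hadoop word-count example, where the mapper and reducer are written as Java classes and the framework performs the shuffle between them.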
26. Modules of Hadoop
The Hadoop framework is composed of the following modules :
Hadoop Distributed File System (HDFS)
Hadoop Yarn (Yet Another Resource Negotiator)
Hadoop MapReduce
Hadoop Common: Includes the Java libraries used to start Hadoop and the utilities needed by the other Hadoop modules.
27. ClickStream Data Analysis
ClickStream data (records of users' mouse clicks) helps you understand the purchasing behavior of customers.
ClickStream analysis helps online marketers optimize their product web pages, promotional content, etc., to improve their business.
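As a minimal sketch of the kind of aggregation such an analysis performs, the snippet below counts page views per URL. The record format and field names here are hypothetical, and a real clickstream job would run this aggregation as a MapReduce job over HDFS rather than in memory.

```python
from collections import Counter

# Sketch: count page views per URL from clickstream records.
# Each record is a (user_id, page_url) pair; the format is illustrative.
clicks = [
    ("u1", "/product/42"),
    ("u2", "/product/42"),
    ("u1", "/cart"),
    ("u3", "/product/7"),
]

def views_per_page(clicks):
    """Aggregate raw click records into per-page view counts."""
    return Counter(page for _user, page in clicks)

views = views_per_page(clicks)
# views["/product/42"] == 2
```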