Hadoop Ecosystem

HADOOP ECOSYSTEM
Sandip K. Darwade
MNIT Jaipur
May 27, 2014

Outline
Hadoop
Hadoop Ecosystem
HDFS
MapReduce
YARN
Avro
Pig
Hive
HBase
Mahout
Sqoop
ZooKeeper
Chukwa
HCatalog
References

What is Hadoop?
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using simple programming
models.
Hadoop is best known for MapReduce, its distributed
filesystem (HDFS), and large-scale data processing.

What is the Hadoop Ecosystem?
This deck introduces Hadoop and its core related
software projects. There are countless commercial
Hadoop-integrated products focused on making Hadoop
more usable and layman-accessible, but the projects here
were chosen because they provide core functionality and
speed in Hadoop. Together they form the so-called
Hadoop Ecosystem.

Hadoop Ecosystem
Figure : Hadoop Ecosystem Architecture

HDFS
Hadoop Distributed File System.
Files stored in HDFS are divided into blocks, which
are then copied to multiple DataNodes.
A Hadoop cluster contains one NameNode and many
DataNodes.
Data blocks are replicated for high availability and fast
access.
Figure : HDFS Architecture

HDFS
NameNode
Runs on a separate machine.
Manages the file system namespace and controls access by external
clients.
Stores file system metadata in memory:
file information, the blocks that make up each file, and the
DataNodes holding each block.
DataNode
Runs on a separate machine and is the basic unit of file storage.
Periodically sends a report of all its existing blocks to the NameNode.
Serves read and write requests from file system clients, and
carries out block creation, deletion, and replication commands
from the NameNode.
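
The block splitting and replication described above can be sketched in a few lines of Python. This is only a toy model, not HDFS code: the function names are hypothetical, the round-robin placement is an assumption made for illustration (real HDFS placement is rack-aware), the 64 MB block size is the default quoted later in these slides, and the replication factor of 3 is assumed.

```python
from itertools import cycle

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted in these slides
REPLICATION = 3                # assumed replication factor for this sketch


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block of a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: block index -> list of distinct DataNodes."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    ring = cycle(datanodes)
    return {b: [next(ring) for _ in range(replication)]
            for b in range(num_blocks)}


if __name__ == "__main__":
    blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
    print(len(blocks), "blocks")
    print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

The dictionary returned by place_replicas plays the role of the block-to-DataNode mapping that the NameNode keeps in memory.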

MapReduce
Programming model for data processing.
Hadoop can run MapReduce programs written in various
languages, such as Java and Python.
MapReduce programs are inherently parallel, which makes
MapReduce suitable for very large-scale data analysis.
The mapper produces intermediate results.
The reducer aggregates the results.

MapReduce
Files are split into fixed-size blocks and stored on
DataNodes (default 64 MB in Hadoop 1.x, 128 MB in
Hadoop 2.x).
Programs written in this model can be processed in
parallel on distributed clusters.
The input is a set of key/value pairs, and the output is
also a set of key/value pairs.
There are two main phases: Map and Reduce.

MapReduce (continued)
Figure : MapReduce Process Architecture

MapReduce (continued)
Map
Processes each block separately, in parallel.
Generates a set of intermediate key/value pairs.
The results from these logical blocks are then reassembled.
Reduce
Accepts an intermediate key and its related values.
Processes the intermediate key and values.
Merges them to form a relatively small set of values.
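
The two phases can be illustrated with the classic word-count example. The following is a minimal single-process sketch in plain Python, not Hadoop API code: the function names are hypothetical, each input string stands in for one block, and the shuffle step that groups values by key is written out explicitly.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit an intermediate (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group intermediate values by key, as happens between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(blocks):
    """Run Map on every block, then Reduce once per distinct key."""
    intermediate = [pair for block in blocks for pair in map_phase(block)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog the end"]))
```

On a real cluster the calls to map_phase would run on different nodes, one per block, and only the grouped intermediate pairs would travel over the network to the reducers.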

YARN
YARN (Yet Another Resource Negotiator).
MapReduce 1.0 had issues with scalability, memory usage
and synchronization.
YARN addresses problems with MapReduce 1.0’s
architecture, specifically with the JobTracker service.
YARN splits up the two major functionalities of the
JobTracker, resource management and job
scheduling/monitoring, into separate daemons.
Rather than burdening a single node with handling
scheduling and resource management for the entire
cluster, YARN now distributes this responsibility across
the cluster.
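
The split of responsibilities can be sketched as two small classes. This is a toy model under stated assumptions, not YARN code: the class names borrow YARN's ResourceManager and ApplicationMaster terminology, but the methods, the notion of a container as a simple counter, and the one-task-per-container rounds are all invented for illustration.

```python
class ResourceManager:
    """Cluster-wide daemon: only tracks and hands out containers."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        """Grant as many containers as are free, up to the request."""
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        """Return containers to the pool once their tasks finish."""
        self.free += count


class ApplicationMaster:
    """Per-job daemon: schedules and monitors the tasks of one job."""

    def __init__(self, name, tasks):
        self.name = name
        self.pending = list(tasks)
        self.done = []

    def run_round(self, rm):
        """Ask for containers, run that many tasks, then give them back."""
        granted = rm.allocate(len(self.pending))
        self.done += self.pending[:granted]
        self.pending = self.pending[granted:]
        rm.release(granted)
        return granted


if __name__ == "__main__":
    rm = ResourceManager(total_containers=2)
    am = ApplicationMaster("wordcount", ["map-1", "map-2", "reduce-1"])
    while am.pending:
        print("ran", am.run_round(rm), "tasks;", len(am.pending), "pending")
```

The point of the design is that job-level bookkeeping (pending and done tasks) lives in one object per job, while the cluster-wide object only counts resources.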

YARN (continued)
Figure : YARN Architecture (via Apache)

Avro
Avro is a framework for performing remote procedure
calls and data serialization.
It can be used to pass data from one program or language
to another, e.g. from C to Pig.
It is suited for use with scripting languages such as Pig
because data is always stored with its schema in Avro and
therefore the data is self-describing.
Avro can also handle changes in schema while still
preserving access to the data.
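
The idea of self-describing data that survives a schema change can be sketched without the Avro library. The code below is an illustration only: it uses JSON instead of Avro's binary encoding, the function names are hypothetical, and the resolution rule (take the stored value, else the reader's default) is a simplified stand-in for Avro's schema resolution.

```python
import json


def write_record(schema, record):
    """Store the writer's schema next to the data (self-describing payload)."""
    return json.dumps({"schema": schema, "data": record})


def read_record(payload, reader_schema):
    """Resolve the stored record against a possibly newer reader schema."""
    data = json.loads(payload)["data"]
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in data:
            result[name] = data[name]
        elif "default" in field:
            result[name] = field["default"]  # field added after the write
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return result


if __name__ == "__main__":
    old = {"name": "User", "fields": [{"name": "id", "type": "int"},
                                      {"name": "name", "type": "string"}]}
    new = {"name": "User", "fields": [{"name": "id", "type": "int"},
                                      {"name": "email", "type": "string",
                                       "default": ""}]}
    payload = write_record(old, {"id": 1, "name": "ann"})
    print(read_record(payload, new))
```

Data written with the old schema stays readable under the new one: the dropped field is ignored and the added field takes its default.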

Pig
Pig is a framework consisting of a high-level scripting
language (Pig Latin) and a run-time environment that
allows users to execute MapReduce on a Hadoop cluster.
Like HiveQL in Hive, Pig Latin is a higher-level language
that compiles to MapReduce.
Pig is more flexible than Hive with respect to possible
data formats.
Pig’s data model is similar to the relational data model,
except that tuples (a.k.a. records or rows) can be nested.
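
Nesting is easiest to see in the result of a grouping operation, where each output tuple holds a key together with a bag of the original tuples. The sketch below mimics that shape in plain Python. It is not Pig Latin, and the function name and sample records are made up for illustration.

```python
from collections import defaultdict


def group_by(records, key_index):
    """Return (key, bag) tuples, where each bag is a list of whole records."""
    bags = defaultdict(list)
    for record in records:
        bags[record[key_index]].append(record)
    return [(key, bag) for key, bag in sorted(bags.items())]


if __name__ == "__main__":
    grades = [("alice", "math", 90), ("bob", "math", 70), ("alice", "art", 85)]
    for key, bag in group_by(grades, 0):
        print(key, bag)
```

A flat relational table could not hold the bag in a single column value, which is the difference from the relational model that the slide points out.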

Hive
Apache Hive is a data warehouse infrastructure built on
top of Hadoop for providing data summarization, query
and analysis.
Using Hadoop was not easy for end users who were
not familiar with the MapReduce framework.
A Hive query is converted to MapReduce tasks.
Figure : Hive Architecture

Hive (continued)
Building blocks of Hive.
Metastore stores the system catalog and metadata about tables,
columns, partitions, etc.
Driver manages the lifecycle of a HiveQL statement as it moves
through Hive.
Query Compiler compiles HiveQL into a directed acyclic graph of
MapReduce tasks.
Execution Engine executes the tasks produced by the compiler in
proper dependency order.
Hive Server provides a Thrift interface and a JDBC/ODBC server.
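
Executing a directed acyclic graph of tasks "in proper dependency order" amounts to a topological sort. The following sketch uses Python's standard graphlib module (Python 3.9 or later). The task names are hypothetical and do not come from Hive, and the real execution engine does far more than order tasks.

```python
from graphlib import TopologicalSorter


def execution_order(dag):
    """dag maps each task to the set of tasks it depends on."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    # Hypothetical plan for a join followed by an aggregation.
    plan = {
        "scan_orders": set(),
        "scan_users": set(),
        "join": {"scan_orders", "scan_users"},
        "aggregate": {"join"},
    }
    print(execution_order(plan))
```

Any order returned this way runs both scans before the join and the join before the aggregation. The relative order of the two independent scans is not fixed.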

HBase
HBase is a distributed column-oriented database built on
top of HDFS.
HBase is not relational and does not support SQL, but
given the proper problem space it is able to do what an
RDBMS cannot.
HBase is modeled with an HBase master node
orchestrating a cluster of one or more regionserver slaves.
HBase master is responsible for bootstrapping a virgin
install, for assigning regions to registered regionservers,
and for recovering from regionserver failures.
HBase manages a ZooKeeper instance as the authority on
cluster state.
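
The master's two duties of assigning regions and recovering from regionserver failures can be sketched as follows. This is a toy model, not HBase code: the class and method names are hypothetical, and the least-loaded assignment rule is an assumption chosen to keep the example short.

```python
class Master:
    """Toy master that tracks which regionserver holds which regions."""

    def __init__(self, regionservers):
        self.assignments = {rs: [] for rs in regionservers}

    def assign(self, region):
        """Give the region to the currently least-loaded regionserver."""
        target = min(self.assignments,
                     key=lambda rs: len(self.assignments[rs]))
        self.assignments[target].append(region)
        return target

    def recover(self, failed):
        """Reassign every region of a failed regionserver to the survivors."""
        orphaned = self.assignments.pop(failed)
        for region in orphaned:
            self.assign(region)
        return orphaned


if __name__ == "__main__":
    master = Master(["rs1", "rs2"])
    for region in ["r1", "r2", "r3"]:
        master.assign(region)
    master.recover("rs1")
    print(master.assignments)
```

In this sketch the assignments dictionary lives inside the master object. In the architecture the slides describe, ZooKeeper is the authority on cluster state.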

HBase (continued)
Figure : HBase Architecture

Mahout
Mahout is a scalable machine-learning and data mining
library.
There are currently four main groups of algorithms in
Mahout.
Recommendations, a.k.a. collaborative filtering.
Classification, a.k.a. categorization.
Clustering.
Frequent itemset mining, a.k.a. parallel frequent pattern mining.
Mahout is not simply a collection of pre-existing
algorithms.
Algorithms in the Mahout library belong to the subset
that can be executed in a distributed fashion, and have
been written to be executable in MapReduce.

Mahout (continued)
Figure : Mahout Architecture

Sqoop
Sqoop allows easy import and export of data from
structured data stores.
Command-line tool to import any JDBC-supported
database into Hadoop.
Generates Writables for use in MapReduce jobs.
High-performance connectors for some RDBMSs.
Distributed, reliable, available service for efficiently moving
large amounts of data as it is produced.
Suited for gathering logs from multiple systems and
inserting them into HDFS as they are generated.
Design goals: reliability, scalability, manageability,
extensibility.

Sqoop (continued)
Figure : Sqoop Architecture

ZooKeeper
ZooKeeper is a distributed, open-source coordination
service for distributed applications.
Coordination services are especially prone to errors such
as race conditions and deadlock.
The motivation behind ZooKeeper is to relieve distributed
applications of the responsibility of implementing
coordination services from scratch.
ZooKeeper allows distributed processes to coordinate
with each other through a shared hierarchical namespace.
The namespace consists of data registers called znodes,
and these are similar to files and directories.
ZooKeeper data is kept in-memory, which means it can
achieve high throughput and low latency numbers.
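
A hierarchical namespace of znodes kept in memory can be modelled with a dictionary keyed by path. The sketch below is an illustration only, not the ZooKeeper client API: the class and method names are hypothetical, and it leaves out watches, versions, ephemeral nodes, and replication.

```python
class ZNodeTree:
    """Toy in-memory namespace: each path maps to a small data register."""

    def __init__(self):
        self.nodes = {"/": b""}

    def create(self, path, data=b""):
        """Create a znode; like a file system, the parent must exist."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent!r} does not exist")
        if path in self.nodes:
            raise KeyError(f"{path!r} already exists")
        self.nodes[path] = data

    def get(self, path):
        """Return the data stored at a znode."""
        return self.nodes[path]

    def children(self, path):
        """List the names of the direct children of a znode."""
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):] for p in self.nodes
            if p.startswith(prefix) and p != prefix
            and "/" not in p[len(prefix):]
        )


if __name__ == "__main__":
    tree = ZNodeTree()
    tree.create("/app", b"config-v1")
    tree.create("/app/lock")
    tree.create("/app/members")
    print(tree.get("/app"), tree.children("/app"))
```

Unlike a plain directory, a znode such as /app both holds data and has children, which is why the slide compares znodes to files and directories at once.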

ZooKeeper (continued)
Figure : ZooKeeper Architecture

Chukwa
Chukwa is a Hadoop subproject devoted to large-scale log
collection and analysis.
Chukwa is built on top of HDFS and the MapReduce
framework and inherits Hadoop's scalability and
robustness.
Four Components of Chukwa.
Agents that run on each machine and emit data.
Collectors that receive data from the agents and write it to stable storage.
MapReduce jobs for parsing and archiving the data.
HICC, Hadoop Infrastructure Care Center; a web-portal style interface
for displaying data.

Chukwa (continued)
Figure : Chukwa Architecture

HCatalog
An incubator-level project at Apache.
HCatalog is a metadata and table storage management
service for HDFS.
HCatalog depends on the Hive metastore and exposes it
to other services such as MapReduce and Pig.
HCatalog’s goal is to simplify the user’s interaction with
HDFS data.
It enables data sharing between tools and execution
platforms.

Bibliography I
G. Yang, “The application of MapReduce in the cloud computing,” Intelligence
Information Processing and Trusted Computing (IPTC) 2011, vol. 9,
pp. 154–156, Oct 2011.
T. White, Hadoop: The Definitive Guide, Third Edition.
1005 Gravenstein Highway North, Sebastopol, CA 95472: O'Reilly Media, Inc.,
2012.
