HADOOP ECOSYSTEM
Sandip K. Darwade
MNIT Jaipur
May 27, 2014
Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 1 / 29
Outline
Hadoop
Hadoop Ecosystem
HDFS
MapReduce
YARN
Avro
Pig
Hive
HBase
Mahout
Sqoop
ZooKeeper
Chukwa
HCatalog
References
What is Hadoop?
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using simple programming
models.
Hadoop is best known for MapReduce and its distributed
filesystem (HDFS), and for large-scale data processing.
What is the Hadoop Ecosystem?
The Hadoop ecosystem is Hadoop together with its core
related software projects. There are countless commercial
Hadoop-integrated products focused on making Hadoop
more usable and accessible to non-specialists, but the
projects covered here were chosen because they provide
core functionality and speed within Hadoop.
Hadoop Ecosystem
Figure : Hadoop Ecosystem Architecture
HDFS
Hadoop Distributed File System.
Files stored in HDFS are divided into blocks, which
are then copied to multiple DataNodes.
A basic Hadoop cluster contains a single NameNode and many
DataNodes.
Data blocks are replicated for high availability and fast
access.
Figure : HDFS Architecture
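The block splitting and replication described on this slide can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names, the replication factor of 3, and the round-robin placement are assumptions made for illustration (real HDFS uses a rack-aware placement policy).

```python
from itertools import cycle

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size quoted later in this deck
REPLICATION = 3                # assumed replication factor for this sketch


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Round-robin is a stand-in for the real placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    ring = cycle(datanodes)
    return {
        block_id: [next(ring) for _ in range(replication)]
        for block_id in range(num_blocks)
    }
```

For example, a 200 MB file yields three full 64 MB blocks plus one 8 MB block, and each block ends up on three different DataNodes.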
HDFS
NameNode
Runs on a separate machine.
Manages the file system namespace and controls access by external
clients.
Stores file system metadata in memory:
file information, the blocks that make up each file, and the
DataNodes that hold each block.
DataNode
Runs on a separate machine and is the basic unit of file storage.
Periodically reports all of its existing blocks to the NameNode.
Serves read and write requests from clients,
and carries out the create, delete, and copy (replicate) block
commands issued by the NameNode.
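A minimal sketch of the in-memory metadata kept by the NameNode and of how periodic block reports from DataNodes update it. `ToyNameNode` and its method names are hypothetical and do not correspond to the Hadoop API.

```python
from collections import defaultdict


class ToyNameNode:
    """Keeps file -> blocks and block -> DataNodes maps in memory."""

    def __init__(self):
        self.file_blocks = {}                    # path -> ordered list of block ids
        self.block_locations = defaultdict(set)  # block id -> DataNodes holding a replica

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Record the full list of blocks a DataNode says it stores."""
        # Forget what was previously believed about this DataNode ...
        for holders in self.block_locations.values():
            holders.discard(datanode)
        # ... then trust the fresh report.
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return [(block id, sorted DataNodes)] so a client knows where to read."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]
```

A client would ask the NameNode for locations only; the block contents themselves would be read from the listed DataNodes.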
MapReduce
Programming model for data processing.
Hadoop can run MapReduce programs written in various
languages, such as Java and Python.
Its parallel processing makes MapReduce suitable for very
large-scale data analysis.
Mappers produce intermediate results.
Reducers aggregate the results.
MapReduce
Files are split into fixed-size blocks and stored on
DataNodes (64 MB by default in Hadoop 1.x).
Programs written in this model can run on distributed clusters in
parallel.
The input is a set of key/value pairs, and the output is also
a set of key/value pairs.
There are two main phases: Map and Reduce.
MapReduce (continued)
Figure : MapReduce Process Architecture
MapReduce (continued)
Map
Processes each block separately, in parallel.
Generates a set of intermediate key/value pairs.
The results from these blocks are then regrouped by key.
Reduce
Accepts an intermediate key and its related values.
Processes the intermediate key and values.
Produces a relatively small set of output values.
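The two phases can be illustrated with the classic word-count example, written here as plain single-process Python rather than as a Hadoop job. The function names are illustrative; in Hadoop the grouping step (`shuffle` below) is performed by the framework between the Map and Reduce phases.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit an intermediate (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single, smaller result."""
    return key, sum(values)


def word_count(blocks):
    """Run map over every block, group by key, then reduce each group."""
    intermediate = [pair for block in blocks for pair in map_phase(block)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
```

Each block is mapped independently, which is what allows the Map phase to run in parallel across the cluster.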
YARN
YARN (Yet Another Resource Negotiator).
MapReduce 1.0 had issues with scalability, memory usage
and synchronization.
YARN addresses problems with MapReduce 1.0’s
architecture, specifically with the JobTracker service.
YARN splits up the two major functionalities of the
JobTracker, resource management and job
scheduling/monitoring, into separate daemons.
Rather than burdening a single node with handling
scheduling and resource management for the entire
cluster, YARN now distributes this responsibility across
the cluster.
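The split of responsibilities can be sketched as two small classes. In YARN the cluster-wide daemon is called the ResourceManager and the per-application one the ApplicationMaster; everything else below (method names, the memory-only resource model, the first-fit allocation) is a simplifying assumption, not YARN's actual scheduler.

```python
class ResourceManager:
    """Cluster-wide resource bookkeeping only; it knows nothing about job logic."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)  # node name -> free memory in MB

    def allocate(self, memory_mb):
        """Grant a container on the first node with enough free memory, else None."""
        for node, free in self.free.items():
            if free >= memory_mb:
                self.free[node] = free - memory_mb
                return (node, memory_mb)
        return None

    def release(self, container):
        node, memory_mb = container
        self.free[node] += memory_mb


class ApplicationMaster:
    """Per-application scheduling and monitoring, one instance per job."""

    def __init__(self, rm, tasks, memory_per_task_mb):
        self.rm = rm
        self.pending = list(tasks)
        self.memory = memory_per_task_mb
        self.finished = []

    def run(self):
        while self.pending:
            container = self.rm.allocate(self.memory)
            if container is None:
                raise RuntimeError("cluster has no room for another container")
            task = self.pending.pop(0)
            self.finished.append((task, container[0]))  # pretend the task ran there
            self.rm.release(container)
        return self.finished
```

Because each job brings its own ApplicationMaster, job-level scheduling no longer runs inside a single cluster-wide service as it did in the JobTracker.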
YARN (continued)
Figure : YARN Architecture (via Apache)
Avro
Avro is a framework for performing remote procedure
calls and data serialization.
It can be used to pass data from one program or language
to another, e.g. from C to Pig.
Suited for use with scripting languages such as Pig
because data is always stored with its schema in Avro and
therefore the data is self-describing.
Avro can also handle changes in schema while still preserving
access to the data.
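The idea of self-describing data and schema evolution can be sketched without the Avro library. The schema dictionaries follow Avro's JSON record-schema layout, but the `write` and `read` helpers are hypothetical, and plain JSON stands in for Avro's compact binary encoding.

```python
import json

# Schema used when the data was written.
WRITER_SCHEMA = {
    "type": "record", "name": "User",
    "fields": [{"name": "name", "type": "string"}],
}

# Newer schema used by the reader: one added field with a default value.
READER_SCHEMA = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}


def write(record, schema):
    """Store the schema next to the data, so the payload is self-describing."""
    return json.dumps({"schema": schema, "data": record})


def read(payload, reader_schema):
    """Resolve the stored record against the reader's (possibly newer) schema."""
    stored = json.loads(payload)
    record = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in stored["data"]:
            record[name] = stored["data"][name]
        elif "default" in field:
            record[name] = field["default"]  # field added after the data was written
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return record
```

Old data stays readable after the schema gains a field, because the reader fills the missing field from its default.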
Pig
Pig is a framework consisting of a high-level scripting
language (Pig Latin) and a
run-time environment that allows users to execute
MapReduce on a Hadoop cluster.
Like HiveQL in Hive, Pig Latin is a higher-level language
that compiles to MapReduce.
Pig is more flexible than Hive with respect to the data
formats it can handle.
Pig’s data model is similar to the relational data model,
except that tuples (a.k.a. records or rows) can be nested.
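The nesting mentioned above can be illustrated in Python. The sketch imitates what a grouping operation such as Pig Latin's GROUP produces: one tuple per key that contains a bag of the original tuples. The `visits` data and the `group_by` helper are made up for illustration.

```python
from collections import defaultdict

# A flat relation of (user, url) tuples.
visits = [("amy", "a.com"), ("bob", "b.com"), ("amy", "c.com")]


def group_by(relation, key_index):
    """One output tuple per key, holding a bag (list) of the input tuples."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key_index]].append(row)
    return [(key, bag) for key, bag in sorted(bags.items())]


grouped = group_by(visits, 0)
# Each row now nests a bag of tuples, which a flat relational row cannot hold.
counts = [(user, len(bag)) for user, bag in grouped]
```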
Hive
Apache Hive is a data warehouse infrastructure built on
top of Hadoop for providing data summarization, query
and analysis.
Using Hadoop directly was not easy for end users who were
not familiar with the MapReduce framework.
A Hive query is converted to MapReduce tasks.
Figure : Hive Architecture
Hive (continued)
Building blocks of Hive.
Metastore stores the system catalog and metadata about tables,
columns, partitions, etc.
Driver manages the lifecycle of a HiveQL statement as it moves
through Hive.
Query Compiler compiles HiveQL into a directed acyclic graph for
MapReduce tasks.
Execution Engine executes the tasks produced by the compiler in
proper dependency order.
Hive Server provides a Thrift interface and a JDBC/ODBC server.
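The compiler and execution engine steps can be sketched as a dependency graph that is executed in topological order. The plan below is a hypothetical example for a join followed by a GROUP BY; the task names are invented, and a real engine would submit MapReduce jobs instead of appending to a list.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on.
PLAN = {
    "scan_emp": set(),
    "scan_dept": set(),
    "join": {"scan_emp", "scan_dept"},
    "group_by": {"join"},
    "write_result": {"group_by"},
}


def execution_order(plan):
    """Return the tasks in an order where every dependency comes first."""
    return list(TopologicalSorter(plan).static_order())


def run(plan):
    """Execute the tasks one by one, checking dependencies as we go."""
    done = []
    for task in execution_order(plan):
        if not plan[task] <= set(done):
            raise RuntimeError(f"dependency of {task} not finished")
        done.append(task)  # stand-in for submitting and awaiting a job
    return done
```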
HBase
HBase is a distributed column-oriented database built on
top of HDFS.
HBase is not relational and does not support SQL, but
given the proper problem space, it is able to do what an
RDBMS cannot.
HBase is modeled with an HBase master node
orchestrating a cluster of one or more regionserver workers.
The HBase master is responsible for bootstrapping a fresh
install, for assigning regions to registered regionservers,
and for recovering from regionserver failures.
HBase depends on ZooKeeper and, by default, manages a
ZooKeeper instance as the authority on cluster state.
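A toy model of regions and the column-oriented cell layout, assuming three regions with made-up start keys and a fixed assignment to two regionservers. `ToyTable` is hypothetical and unrelated to the HBase client API.

```python
from bisect import bisect_right

# Regions are contiguous, sorted row-key ranges; each is served by one regionserver.
REGION_START_KEYS = ["", "g", "p"]      # region i covers [start_i, start_{i+1})
REGION_SERVERS = ["rs1", "rs2", "rs1"]  # assignment decided by the master


def region_for(row_key):
    """Index of the region whose key range contains row_key."""
    return bisect_right(REGION_START_KEYS, row_key) - 1


class ToyTable:
    """Cells are addressed by (row key, 'family:qualifier') and versioned by timestamp."""

    def __init__(self):
        self.regions = [{} for _ in REGION_START_KEYS]

    def put(self, row_key, column, value, timestamp):
        cells = self.regions[region_for(row_key)].setdefault(row_key, {})
        cells.setdefault(column, {})[timestamp] = value

    def get(self, row_key, column):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self.regions[region_for(row_key)].get(row_key, {}).get(column)
        return versions[max(versions)] if versions else None
```

Looking up a row only touches the one region (and therefore the one regionserver) whose key range contains the row key.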
HBase (continued)
Figure : HBase Architecture
Mahout
Mahout is a scalable machine-learning and data mining
library.
There are currently four main groups of algorithms in
Mahout:
Recommendations, a.k.a. collaborative filtering.
Classification, a.k.a. categorization.
Clustering.
Frequent itemset mining, a.k.a. parallel frequent pattern mining.
Mahout is not simply a collection of pre-existing
algorithms.
Algorithms in the Mahout library belong to the subset
that can be executed in a distributed fashion, and have
been written to be executable in MapReduce.
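To show what "executable in MapReduce" can mean for a learning algorithm, here is a small item co-occurrence recommender phrased as a map step and a reduce step. It is a generic illustration in plain Python, not Mahout code; the basket data and function names are invented.

```python
from collections import Counter
from itertools import permutations

# Hypothetical user -> items data.
BASKETS = {
    "u1": ["milk", "bread", "eggs"],
    "u2": ["milk", "bread"],
    "u3": ["bread", "jam"],
}


def map_cooccurrences(items):
    """Map: emit ((item_a, item_b), 1) for every ordered pair in one user's basket."""
    return [(pair, 1) for pair in permutations(sorted(set(items)), 2)]


def reduce_counts(pairs):
    """Reduce: sum the counts per item pair."""
    totals = Counter()
    for pair, count in pairs:
        totals[pair] += count
    return totals


def recommend(item, totals, top_n=2):
    """Items most often seen together with `item`, best first, ties by name."""
    scores = {b: c for (a, b), c in totals.items() if a == item}
    return sorted(scores, key=lambda other: (-scores[other], other))[:top_n]


pairs = [p for items in BASKETS.values() for p in map_cooccurrences(items)]
totals = reduce_counts(pairs)
```

Each basket is processed independently in the map step, so the work can be spread over many machines before the counts are summed.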
Mahout (continued)
Figure : Mahout Architecture
Sqoop
Sqoop allows easy import and export of data between Hadoop and
structured data stores.
Command-line tool to import from any JDBC-supported
database into Hadoop.
Generates Writables for use in MapReduce jobs.
High-performance connectors for some RDBMSs.
Flume, a companion ingestion tool, is a distributed, reliable,
available service for efficiently moving
large amounts of data as it is produced.
Suited for gathering logs from multiple systems and
inserting them into HDFS as they are generated.
Design goals: reliability, scalability, manageability,
extensibility.
Sqoop (continued)
Figure : Sqoop Architecture
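Sqoop parallelizes an import by dividing the source table into key ranges, one per map task. The sketch below shows that idea only; the boundary arithmetic, the helper names, and the `orders`/`id` placeholders are assumptions and do not reproduce Sqoop's actual split logic.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks."""
    total = max_id - min_id + 1
    num_mappers = min(num_mappers, total)
    base, extra = divmod(total, num_mappers)
    ranges, lo = [], min_id
    for i in range(num_mappers):
        # The first `extra` chunks take one additional key each.
        hi = lo + base + (1 if i < extra else 0) - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


def split_queries(table, column, ranges):
    """One bounded SELECT per mapper; table and column names are placeholders."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} <= {hi}"
        for lo, hi in ranges
    ]
```

Each map task would then run its own bounded query and write its slice of rows into HDFS.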
ZooKeeper
ZooKeeper is a distributed, open-source coordination
service for distributed applications.
Coordination services are hard to get right: they are
especially prone to errors such as race
conditions and deadlock.
The goal of ZooKeeper is to relieve distributed applications of the
responsibility of implementing coordination services from
scratch.
ZooKeeper allows distributed processes to coordinate
with each other through a shared hierarchical namespace.
The name space consists of data registers called znodes,
and these are similar to files and directories.
ZooKeeper data is kept in-memory, which means it can
achieve high throughput and low latency numbers.
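The hierarchical namespace of znodes can be modelled as a dictionary keyed by path. `ToyZNodeTree` is a hypothetical in-memory stand-in; it omits what makes ZooKeeper useful in practice (replication across servers, watches, ephemeral and sequential nodes).

```python
class ToyZNodeTree:
    """An in-memory tree of znodes; every node may hold data and have children."""

    def __init__(self):
        self.data = {"/": b""}

    @staticmethod
    def _parent(path):
        return path.rsplit("/", 1)[0] or "/"

    def create(self, path, value=b""):
        if path in self.data:
            raise KeyError(f"{path} already exists")
        if self._parent(path) not in self.data:
            raise KeyError(f"parent of {path} does not exist")
        self.data[path] = value

    def get(self, path):
        return self.data[path]

    def children(self, path):
        """Names of the direct children of path, sorted."""
        return sorted(
            p.rsplit("/", 1)[1]
            for p in self.data
            if p != "/" and self._parent(p) == path
        )
```

Unlike a plain file system, a znode such as `/app` can hold data and have children at the same time.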
ZooKeeper (continued)
Figure : ZooKeeper Architecture
Chukwa
Chukwa is a Hadoop subproject devoted to large-scale log
collection and analysis.
Chukwa is built on top of HDFS and the MapReduce
framework and inherits Hadoop's scalability and
robustness.
Chukwa has four main components:
Agents that run on each machine and emit data.
Collectors that receive data from the agent and write to a stable storage.
MapReduce jobs for parsing and archiving the data.
HICC, Hadoop Infrastructure Care Center; a web-portal style interface
for displaying data.
Chukwa (continued)
Figure : Chukwa Architecture
HCatalog
An incubator-level project at Apache.
HCatalog is a metadata and table storage management
service for HDFS.
HCatalog depends on the Hive metastore and exposes it
to other services such as MapReduce and Pig.
HCatalog’s goal is to simplify the user’s interaction with
HDFS data.
It enables data sharing between tools and execution
platforms.
Bibliography
G. Yang, “The application of mapreduce in the cloud computing,” Intelligence
Information Processing and Trusted Computing (IPTC) 2011, vol. 9,
pp. 154–156, Oct 2011.
T. White, Hadoop: The Definitive Guide, Third Edition.
Sebastopol, CA: O'Reilly Media, Inc., 2012.

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Simplilearn
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDr. C.V. Suresh Babu
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 

Was ist angesagt? (20)

Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
BIGDATA ANALYTICS LAB MANUAL final.pdf
BIGDATA  ANALYTICS LAB MANUAL final.pdfBIGDATA  ANALYTICS LAB MANUAL final.pdf
BIGDATA ANALYTICS LAB MANUAL final.pdf
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File System
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 

Ähnlich wie Hadoop Ecosystem

Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoopManoj Jangalva
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methodspaperpublications3
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISONMAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISONijcsit
 
Understanding hadoop
Understanding hadoopUnderstanding hadoop
Understanding hadoopRexRamos9
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxUttara University
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop BasicsSonal Tiwari
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionChirag Ahuja
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoopOmar Jaber
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopVigen Sahakyan
 

Ähnlich wie Hadoop Ecosystem (20)

Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
Hadoop map reduce
Hadoop map reduceHadoop map reduce
Hadoop map reduce
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISONMAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON
 
Understanding hadoop
Understanding hadoopUnderstanding hadoop
Understanding hadoop
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptx
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
The solution for big data
The solution for big dataThe solution for big data
The solution for big data
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoop
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 

Kürzlich hochgeladen

原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 

Kürzlich hochgeladen (20)

原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 

Hadoop Ecosystem

  • 1. HADOOP ECOSYSTEM Sandip K. Darwade MNIT Jaipur May 27, 2014 Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 1 / 29
  • 3. What is Hadoop ? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is best known for MapReduce and its distributed filesystem (HDFS),and large-scale data processing. Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 3 / 29
  • 4. What is Hadoop Ecosystem ? Introduction to the world of Hadoop and the core related software projects. There are countless commercial Hadoop-integrated products focused on making Hadoop more usable and layman-accessible, but the ones here were chosen because they provide core functionality and speed in Hadoop so called Hadoop Ecosystem. Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 4 / 29
  • 5. Hadoop Ecosystem Figure : Hadoop Ecosystem Architecture Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 5 / 29
  • 6. HDFS Hadoop Distributed File System. Files are stored in HDFS and divided into blocks, which are then copied to multiple Data Nodes. Hadoop cluster contains only one NameNode and many DataNodes. Data blocks are replicated for High Availability and fast access. Figure : HDFS Architecture Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 6 / 29
  • 7. HDFS NameNode Run on a separate machine. Manage the file system namespace,and control access of external clients. Store file system Meta-data in memory. File information, each block information of files, and every file block information in Data Node . DataNode Run on Separate machine,which is the basic unit of file storage. Sent all messages of existing Blocks periodically to Name Node. Data Node response read and write request from the Name Node,and also respond, create, delete, and copy the block command from Name Node. Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 7 / 29
  • 8. MapReduce Programming model for data processing. Hadoop can run MapReduce programs written in various languages Java,Python. Parallel Processing,put Mapreduce in very large-scale data analysis. Mapper produce intermediate results. Reducer aggregates the results. Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 8 / 29
  • 9. MapReduce Files are split into fixed sized blocks and stored on data nodes (Default 64MB). Programs written, can process on distributed clusters in parallel. Input data is a set of key/value pairs, the output is also the key/value pairs. Mainly Two Phase Map and Reduce. Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 9 / 29
  • 10. MapReduce (continue...) Figure : MapReduce Process Architecture Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 10 / 29
  • 11. MapReduce (continue...) Map Map process each block separately in parallel. Generate an intermediate key/value pairs set. Results of these logic blocks are reassembled. Reduce Accepts an intermediate key and related value. Processed the intermediate key and value. Form a set of relatively small value set. Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 11 / 29
  • 12. YARN YARN (Yet Another Resource Negotiator). MapReduce 1.0 had issues with scalability, memory usage and synchronization. YARN addresses problems with MapReduce 1.0’s architecture, specifically with the JobTracker service. YARN splits up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. Rather than burdening a single node with handling scheduling and resource management for the entire cluster, YARN now distributes this responsibility across the cluster. Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 12 / 29
  • 13. YARN (continue...) Figure : Yarn Architecture Via Apache Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 13 / 29
  • 14. Avro Avro is a framework for performing remote procedure calls and data serialization. It can be used to pass data from one program or language to another, e.g. from C to Pig. Suited for use with scripting languages such as Pig because data is always stored with its schema in Avro and therefore the data is self-describing. Avro can also handle changes in schema still preserving access to the data. Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 14 / 29
  • 15. Pig Pig is a framework consisting of a high-level scripting language (Pig Latin). Run-time environment that allows users to execute MapReduce on a Hadoop cluster. Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce. Pig is more flexible than Hive with respect to possible data format. Pig’s data model is similar to the relational data model, except that tuples (a.k.a. records or rows) can be nested. Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 15 / 29
  • 16. Hive Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. Using Hadoop was not easy for end users those who were not familiar with MapReduce framework. A Hive query is converted to MapReduce tasks. Figure : Hive Architecture Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 16 / 29
  • 17. Hive (continued) Building blocks of Hive. Metastore: stores the system catalog and metadata about tables, columns, partitions, etc. Driver: manages the lifecycle of a HiveQL statement as it moves through Hive. Query Compiler: compiles HiveQL into a directed acyclic graph of MapReduce tasks. Execution Engine: executes the tasks produced by the compiler in proper dependency order. Hive Server: provides a Thrift interface and a JDBC/ODBC server.
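Running tasks "in proper dependency order" amounts to a topological sort of the compiler's directed acyclic graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the stage names and the plan are hypothetical and do not reflect Hive's actual plan format.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on.
plan = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}


def execution_order(task_graph):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(task_graph).static_order())
```

Several valid orders may exist (the two scans are independent of each other), so only the relative position of dependent tasks is guaranteed.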
  • 18. HBase HBase is a distributed column-oriented database built on top of HDFS. HBase is not relational and does not support SQL, but given the proper problem space it can do what an RDBMS cannot. HBase is modeled with an HBase master node orchestrating a cluster of one or more regionserver worker nodes. The HBase master is responsible for bootstrapping a fresh install, for assigning regions to registered regionservers, and for recovering from regionserver failures. HBase manages a ZooKeeper instance as the authority on cluster state.
  • 19. HBase (continued) Figure: HBase architecture.
  • 20. Mahout Mahout is a scalable machine-learning and data-mining library. There are currently four main groups of algorithms in Mahout: recommendations, a.k.a. collaborative filtering; classification, a.k.a. categorization; clustering; and frequent itemset mining, a.k.a. parallel frequent pattern mining. Mahout is not simply a collection of pre-existing algorithms. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion, and have been written to be executable in MapReduce.
  • 21. Mahout (continued) Figure: Mahout architecture.
  • 22. Sqoop Sqoop allows easy import and export of data from structured data stores. It is a command-line tool that can import data from any JDBC-supported database into Hadoop. It generates Writables for use in MapReduce jobs and offers high-performance connectors for some RDBMSs. For streaming data, the related tool Apache Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced. Flume is suited to gathering logs from multiple systems and inserting them into HDFS as they are generated. Its design goals are reliability, scalability, manageability and extensibility.
  • 23. Sqoop (continued) Figure: Sqoop architecture.
  • 24. ZooKeeper ZooKeeper is a distributed, open-source coordination service for distributed applications. Coordination services are especially prone to errors such as race conditions and deadlock. ZooKeeper aims to relieve distributed applications of the responsibility of implementing coordination services from scratch. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace. The namespace consists of data registers called znodes, which are similar to files and directories. ZooKeeper data is kept in memory, which means it can achieve high throughput and low latency.
  • 25. ZooKeeper (continued) Figure: ZooKeeper architecture.
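The hierarchical namespace of znodes can be sketched as an in-memory tree keyed by path. This is a toy model, not the ZooKeeper client API: the `ZNodeTree` class and its method names are assumptions, and it omits ephemeral nodes, watches, versions and replication. Unlike a plain file system, every znode here can hold data and also have children.

```python
class ZNodeTree:
    """Toy in-memory namespace: every path is a znode that holds bytes."""

    def __init__(self):
        self._data = {"/": b""}

    def create(self, path, data=b""):
        # As in a file system, the parent must exist before a child is created.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._data:
            raise KeyError(f"parent znode {parent} does not exist")
        if path in self._data:
            raise KeyError(f"znode {path} already exists")
        self._data[path] = data

    def get(self, path):
        return self._data[path]

    def children(self, path):
        # Direct children only: one extra path segment below `path`.
        prefix = path if path.endswith("/") else path + "/"
        names = []
        for candidate in self._data:
            if candidate != path and candidate.startswith(prefix):
                rest = candidate[len(prefix):]
                if "/" not in rest:
                    names.append(rest)
        return sorted(names)
```

Processes that agree on a path such as a hypothetical `/app/lock` can coordinate simply by creating, reading or listing znodes under it.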
  • 26. Chukwa Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on top of HDFS and the MapReduce framework and inherits Hadoop's scalability and robustness. Chukwa has four components: agents that run on each machine and emit data; collectors that receive data from the agents and write it to stable storage; MapReduce jobs for parsing and archiving the data; and HICC, the Hadoop Infrastructure Care Center, a web-portal-style interface for displaying data.
  • 27. Chukwa (continued) Figure: Chukwa architecture.
  • 28. HCatalog HCatalog is an incubator-level project at Apache. It is a metadata and table storage management service for HDFS. HCatalog depends on the Hive metastore and exposes it to other services such as MapReduce and Pig. HCatalog’s goals are to simplify the user’s interaction with HDFS data and to enable data sharing between tools and execution platforms.
  • 29. Bibliography G. Yang, “The application of MapReduce in the cloud computing,” Intelligence Information Processing and Trusted Computing (IPTC) 2011, vol. 9, pp. 154–156, Oct. 2011. T. White, Hadoop: The Definitive Guide, Third Edition. Sebastopol, CA: O'Reilly Media, Inc., 2012.