HADOOP ECOSYSTEM
Sandip K. Darwade
MNIT Jaipur
May 27, 2014
Sandip K. Darwade (MNIT) HADOOP ECOSYSTEM May 27, 2014 1 / 29
Outline
Hadoop
Hadoop Ecosystem
HDFS
MapReduce
YARN
Avro
Pig
Hive
HBase
Mahout
Sqoop
ZooKeeper
Chukwa
HCatalog
References
What is Hadoop?
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using simple programming
models.
Hadoop is best known for MapReduce, its distributed
filesystem (HDFS), and large-scale data processing.
What is the Hadoop Ecosystem?
The Hadoop Ecosystem is Hadoop together with the core
related software projects built around it. There are
countless commercial Hadoop-integrated products focused on
making Hadoop more usable and accessible to non-specialists,
but the projects covered here were chosen because they
provide core functionality and speed in Hadoop.
Hadoop Ecosystem
Figure : Hadoop Ecosystem Architecture
HDFS
Hadoop Distributed File System.
Files stored in HDFS are divided into blocks, which
are then copied to multiple DataNodes.
A Hadoop cluster contains a single active NameNode and many
DataNodes.
Data blocks are replicated for high availability and fast
access.
Figure : HDFS Architecture
HDFS
NameNode
Runs on a separate machine.
Manages the file system namespace and controls access by external
clients.
Stores file system metadata in memory:
file information, the blocks that make up each file, and the
DataNodes that hold each block.
DataNode
Runs on a separate machine and is the basic unit of file storage.
Periodically reports all of its existing blocks to the NameNode.
Serves read and write requests from clients, and carries out
block create, delete, and copy (replicate) commands from the NameNode.
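The block splitting and replication described above can be sketched in a few lines of Python. This is a toy model only: the tiny block size, the DataNode names, and the round-robin placement are assumptions made for illustration, and real HDFS chooses replica locations with a rack-aware policy.

```python
BLOCK_SIZE = 4    # bytes per block; tiny on purpose (real HDFS blocks are tens of MB)
REPLICATION = 3   # number of copies kept of each block


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, datanodes: list[str],
                   replication: int = REPLICATION) -> dict[int, list[str]]:
    """Build the NameNode-style map {block_id: [DataNodes holding a copy]}.

    Round-robin placement is an assumption of this sketch, not HDFS's policy.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(datanodes)
        placement[block_id] = [
            datanodes[(start + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"hello hadoop")
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block_id, holders in block_map.items():
    print(block_id, blocks[block_id], holders)
```

The dictionary returned by place_replicas plays the role of the metadata the NameNode keeps in memory; the DataNodes themselves would hold only the block bytes.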
MapReduce
Programming model for data processing.
Hadoop can run MapReduce programs written in various
languages, such as Java and Python.
Parallel processing makes MapReduce suitable for very
large-scale data analysis.
The Mapper produces intermediate results.
The Reducer aggregates the results.
MapReduce
Files are split into fixed-size blocks and stored on
DataNodes (default 64 MB in Hadoop 1.x; 128 MB in Hadoop 2.x).
Programs written in this model are processed in parallel
across a distributed cluster.
Input data is a set of key/value pairs, and the output is
also a set of key/value pairs.
Two main phases: Map and Reduce.
MapReduce (continued)
Figure : MapReduce Process Architecture
MapReduce (continued)
Map
Processes each block separately, in parallel.
Generates a set of intermediate key/value pairs.
The results from these logical blocks are then reassembled.
Reduce
Accepts an intermediate key and its related values.
Processes the intermediate key and values.
Forms a relatively small set of output values.
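The Map and Reduce steps above can be illustrated with the classic word count, written here as a single-process Python sketch. The function names and the in-memory shuffle are assumptions for illustration; on a real cluster the framework runs many mappers and reducers in parallel and performs the grouping between them.

```python
from collections import defaultdict


def map_phase(block: str):
    """Map: emit an intermediate (word, 1) pair for every word in one block."""
    for word in block.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Reassemble the mappers' output: group all values by intermediate key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(blocks):
    # Each block is mapped independently; here that happens sequentially.
    intermediate = [pair for block in blocks for pair in map_phase(block)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```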
YARN
YARN (Yet Another Resource Negotiator).
MapReduce 1.0 had issues with scalability, memory usage
and synchronization.
YARN addresses problems with MapReduce 1.0’s
architecture, specifically with the JobTracker service.
YARN splits up the two major functionalities of the
JobTracker, resource management and job
scheduling/monitoring, into separate daemons.
Rather than burdening a single node with handling
scheduling and resource management for the entire
cluster, YARN now distributes this responsibility across
the cluster.
YARN (continued)
Figure : YARN Architecture (via Apache)
Avro
Avro is a framework for performing remote procedure
calls and data serialization.
It can be used to pass data from one program or language
to another, e.g. from C to Pig.
It is well suited for use with scripting languages such as Pig
because data is always stored with its schema in Avro and
is therefore self-describing.
Avro can also handle changes in schema while still preserving
access to the data.
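The idea of self-describing data that survives a schema change can be sketched as follows. This is not the Avro library or its binary encoding: plain JSON stands in for the wire format, and write_record and read_record are hypothetical helpers. Only the schema shape (a record with named fields and optional defaults) follows Avro's JSON schema style.

```python
import json


def write_record(record: dict, schema: dict) -> str:
    """Store the record together with the schema it was written with."""
    return json.dumps({"schema": schema, "data": record})


def read_record(payload: str, reader_schema: dict) -> dict:
    """Resolve stored data against the reader's (possibly newer) schema."""
    data = json.loads(payload)["data"]
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in data:
            result[name] = data[name]
        elif "default" in field:
            # Field added after the data was written: fall back to its default.
            result[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return result


writer_schema = {
    "type": "record", "name": "User",
    "fields": [{"name": "name", "type": "string"}],
}
reader_schema = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int", "default": -1},
    ],
}

payload = write_record({"name": "Asha"}, writer_schema)
user = read_record(payload, reader_schema)
print(user)
```

Because the old data carries its own schema and the new field declares a default, a reader using the newer schema can still access records written before the change.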
Pig
Pig is a framework consisting of a high-level scripting
language (Pig Latin) and a run-time environment that
allows users to execute MapReduce on a Hadoop cluster.
Like HiveQL in Hive, Pig Latin is a higher-level language
that compiles to MapReduce.
Pig is more flexible than Hive with respect to possible
data formats.
Pig’s data model is similar to the relational data model,
except that tuples (a.k.a. records or rows) can be nested.
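The nested data model can be pictured with a small Python sketch. It mimics the shape that a grouping operation produces in Pig, where each output row pairs a key with a collection of the original rows. The sample data and the group_by helper are assumptions for illustration, and Python tuples stand in for Pig's tuples and bags.

```python
from collections import defaultdict

# Flat, relational-style rows: (user, site)
visits = [
    ("alice", "news.example"),
    ("bob", "shop.example"),
    ("alice", "mail.example"),
]


def group_by(records, key_index):
    """Return rows of the form (key, (row, row, ...)): a tuple nested in a tuple."""
    bags = defaultdict(list)
    for record in records:
        bags[record[key_index]].append(record)
    return [(key, tuple(bag)) for key, bag in sorted(bags.items())]


grouped = group_by(visits, 0)
for key, bag in grouped:
    print(key, bag)
```

A flat relational table could not hold the second column of each grouped row, since it is itself a collection of rows.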
Hive
Apache Hive is a data warehouse infrastructure built on
top of Hadoop for providing data summarization, query
and analysis.
Using Hadoop was not easy for end users who were
not familiar with the MapReduce framework.
A Hive query is converted to MapReduce tasks.
Figure : Hive Architecture
Hive (continued)
Building blocks of Hive:
Metastore stores the system catalog and metadata about tables,
columns, partitions, etc.
Driver manages the lifecycle of a HiveQL statement as it moves
through Hive.
Query Compiler compiles HiveQL into a directed acyclic graph of
MapReduce tasks.
Execution Engine executes the tasks produced by the compiler in
proper dependency order.
Hive Server provides a Thrift interface and a JDBC/ODBC server.
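Running tasks "in proper dependency order" amounts to a topological sort of the compiler's directed acyclic graph. The sketch below uses Python's standard graphlib module (Python 3.9+); the stage names are hypothetical and do not come from Hive itself.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each stage maps to the set of stages it depends on.
stage_dependencies = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}


def execution_order(dependencies):
    """Return one valid order in which every stage follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(stage_dependencies)
print(order)
```

The two scan stages have no dependency on each other, so an engine is free to run them in parallel; only the relative order of dependent stages is fixed.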
HBase
HBase is a distributed column-oriented database built on
top of HDFS.
HBase is not relational and does not support SQL, but
given the proper problem space, it is able to do what an
RDBMS cannot.
HBase is modeled with an HBase master node
orchestrating a cluster of one or more regionserver workers.
The HBase master is responsible for bootstrapping a fresh
install, for assigning regions to registered regionservers,
and for recovering from regionserver failures.
HBase manages a ZooKeeper instance as the authority on
cluster state.
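Region assignment can be sketched as follows, assuming a table whose sorted row keys are divided into contiguous regions, each identified by its start key. The start keys, server names, and round-robin assignment are assumptions for illustration; the real master balances regions by its own policy.

```python
from bisect import bisect_right

# Each region covers row keys from its start key up to the next start key.
region_start_keys = ["", "g", "n", "t"]   # "" marks the first region
region_servers = ["rs1", "rs2"]


def assign_regions(start_keys, servers):
    """Master-style assignment of regions to servers (round-robin here)."""
    return {start: servers[i % len(servers)] for i, start in enumerate(start_keys)}


def locate(row_key, start_keys, assignment):
    """Find the region containing row_key and the server that hosts it."""
    index = bisect_right(start_keys, row_key) - 1
    start = start_keys[index]
    return start, assignment[start]


assignment = assign_regions(region_start_keys, region_servers)
print(locate("apple", region_start_keys, assignment))
print(locate("hadoop", region_start_keys, assignment))
print(locate("zoo", region_start_keys, assignment))
```

If a regionserver fails, recovery in this model is a matter of rewriting the affected entries in the assignment map to point at a surviving server.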
HBase (continued)
Figure : HBase Architecture
Mahout
Mahout is a scalable machine-learning and data mining
library.
There are currently four main groups of algorithms in
Mahout:
Recommendations, a.k.a. collaborative filtering.
Classification, a.k.a. categorization.
Clustering.
Frequent itemset mining, a.k.a. parallel frequent pattern mining.
Mahout is not simply a collection of pre-existing
algorithms.
Algorithms in the Mahout library belong to the subset
that can be executed in a distributed fashion, and have
been written to be executable in MapReduce.
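To give a feel for the recommendation group, here is a minimal item co-occurrence recommender in plain Python. It is not Mahout's API and runs in a single process; the sample baskets and function names are assumptions. The counting step is the kind of work that splits naturally into map (emit item pairs per user) and reduce (sum the pairs), which is why such algorithms suit MapReduce.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical purchase histories: user -> items
baskets = {"u1": ["a", "b", "c"], "u2": ["a", "b"], "u3": ["b", "c"]}


def cooccurrence(baskets):
    """Count how often each ordered pair of items appears for the same user."""
    counts = defaultdict(Counter)
    for items in baskets.values():
        for x, y in permutations(set(items), 2):
            counts[x][y] += 1
    return counts


def recommend(user, baskets, counts, top_n=1):
    """Score unseen items by their co-occurrence with the user's items."""
    owned = set(baskets[user])
    scores = Counter()
    for item in owned:
        for other, n in counts[item].items():
            if other not in owned:
                scores[other] += n
    return [item for item, _ in scores.most_common(top_n)]


counts = cooccurrence(baskets)
print(recommend("u2", baskets, counts))
print(recommend("u3", baskets, counts))
```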
Mahout (continued)
Figure : Mahout Architecture
Sqoop
Sqoop allows easy import and export of data from
structured data stores.
Command-line tool to import any JDBC-supported
database into Hadoop.
Generates Writable classes for use in MapReduce jobs.
High-performance connectors for some RDBMSs.
Apache Flume (a related ingestion tool):
Distributed, reliable, available service for efficiently moving
large amounts of data as it is produced.
Suited for gathering logs from multiple systems and
inserting them into HDFS as they are generated.
Design goals: reliability, scalability, manageability,
extensibility.
Sqoop (continued)
Figure : Sqoop Architecture
ZooKeeper
ZooKeeper is a distributed, open-source coordination
service for distributed applications.
Coordination services are hard to get right: they are
especially prone to errors such as race conditions and deadlock.
ZooKeeper aims to relieve distributed applications of the
responsibility of implementing coordination services from
scratch.
ZooKeeper allows distributed processes to coordinate
with each other through a shared hierarchical namespace.
The namespace consists of data registers called znodes,
which are similar to files and directories.
ZooKeeper data is kept in memory, which means it can
achieve high throughput and low latency.
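The hierarchical namespace of znodes can be modelled with a small in-memory tree. This is a toy, single-process illustration and not the ZooKeeper client API: the ZNodeTree class, its method names, and the sample paths are assumptions, and it omits replication, watches, and ephemeral nodes entirely.

```python
class ZNodeTree:
    """Toy namespace: every path is a znode that can hold data AND have children."""

    def __init__(self):
        self._data = {"/": b""}   # kept in memory, keyed by full path

    def create(self, path: str, data: bytes = b"") -> None:
        parent = path.rsplit("/", 1)[0] or "/"
        if path in self._data:
            raise KeyError(f"node exists: {path}")
        if parent not in self._data:
            raise KeyError(f"no parent: {parent}")
        self._data[path] = data

    def get(self, path: str) -> bytes:
        return self._data[path]

    def get_children(self, path: str) -> list[str]:
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self._data
            if p != path and p.startswith(prefix) and "/" not in p[len(prefix):]
        )


tree = ZNodeTree()
tree.create("/app", b"config-v1")
tree.create("/app/workers")
tree.create("/app/leader", b"node-1")
print(tree.get("/app"), tree.get_children("/app"))
```

Unlike a file system, where an entry is either a file or a directory, the "/app" znode here both stores data and has children.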
ZooKeeper (continued)
Figure : ZooKeeper Architecture
Chukwa
Chukwa is a Hadoop subproject devoted to large-scale log
collection and analysis.
Chukwa is built on top of HDFS and the MapReduce
framework and inherits Hadoop’s scalability and
robustness.
Four components of Chukwa:
Agents that run on each machine and emit data.
Collectors that receive data from the agents and write it to stable storage.
MapReduce jobs for parsing and archiving the data.
HICC, the Hadoop Infrastructure Care Center: a web-portal style interface
for displaying data.
Chukwa (continued)
Figure : Chukwa Architecture
HCatalog
An incubator-level project at Apache.
HCatalog is a metadata and table storage management
service for HDFS.
HCatalog depends on the Hive metastore and exposes it
to other services such as MapReduce and Pig.
HCatalog’s goal is to simplify the user’s interaction with
HDFS data and to enable data sharing between tools and
execution platforms.
Bibliography
G. Yang, “The application of MapReduce in the cloud computing,” Intelligence
Information Processing and Trusted Computing (IPTC) 2011, vol. 9,
pp. 154–156, Oct 2011.
T. White, Hadoop: The Definitive Guide, Third Edition.
Sebastopol, CA: O’Reilly Media, Inc., 2012.