Hadoop
Abhishek Agarwal, SA
Hadoop Ecosystem
• HDFS: Hadoop Distributed File System is a distributed file system that can be installed on commodity servers. HDFS offers a way to store large data files across numerous machines and is designed to be fault tolerant through data replication.
• YARN: Yet Another Resource Negotiator, aka MapReduce v2. It's a framework for job scheduling and cluster resource management.
• Flume: A tool/service for collecting, aggregating, and moving large amounts of log data into and out of Hadoop.
• Zookeeper: A framework that enables highly reliable distributed coordination of nodes in the cluster.
• Sqoop: "SQL-to-Hadoop" is a tool for efficient transfer between Hadoop and structured data sources, i.e. relational databases, or other Hadoop data stores, e.g. Hive or HBase.
• Oozie: A workflow scheduler system to manage Hadoop jobs. The jobs may include non-MapReduce jobs.
• Pig: Initially developed at Yahoo!, Pig is a framework consisting of a high-level scripting language (Pig Latin) along with a runtime environment that allows users to run MapReduce on a Hadoop cluster.
• Mahout: A scalable machine learning and data mining library.
• R Connectors: Used for generating statistics on the nodes in a cluster.
• Hive: An open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
• HBase: A column-oriented, non-relational database management system that runs on top of HDFS.
• Ambari: Used for provisioning, managing, and monitoring the Hadoop cluster.
Hadoop HDFS
• A Hadoop cluster for storage; data is replicated over multiple machines.
• Master/slave architecture (1 master, n slaves) => NameNode/DataNodes.
• Designed for large files (64 MB default block size) and for streaming large volumes of data.
• A file to be written is split into blocks. Each block is replicated 3 times by default and distributed across the cluster (node, rack, datacenter), so placement follows the network topology for efficient access.
• Lost replicas are re-created automatically when a node fails, maintaining failover.
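The split-and-replicate behavior above can be sketched in a few lines. This is a simplified model (round-robin placement, not the real rack-aware policy); the block size and replication factor mirror the defaults mentioned on the slide.

```python
# Sketch: how HDFS splits a file into blocks and places replicas.
# Simplified placement (round-robin over DataNodes, not rack-aware).

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the classic HDFS default
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) byte range of each block."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes."""
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
plan = place_replicas(len(blocks), nodes)
print(len(blocks))   # 4 blocks: 64 + 64 + 64 + 8 MB
print(plan[0])       # ['dn1', 'dn2', 'dn3']
```

If any of the three DataNodes holding a block dies, the NameNode schedules a new copy elsewhere, which is the re-replication behavior described above.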
Projects
• Data format: Avro, Parquet
• Data ingestion: Flume, Sqoop
• Data processing: Pig, Hive, Crunch, Spark
• Data storage: HBase
• Coordination: ZooKeeper
Data Formats
• Data is extracted from Hadoop nodes by a Hadoop client.
• For MapReduce, a data format needs to be:
• Splittable
• Searchable with a dynamic schema
• Binary-encodable
• Compressible
• Able to encode and decode
• Serializable and de-serializable
• Usable for RPC
• Language independent (neutral)
• Frameworks in this space: Thrift, Google Protocol Buffers (and Avro, covered next).
Avro
• Schemas are defined in JSON; data is encoded in a compact binary format.
• A data serialization framework supporting compression and splittability for MapReduce.
• Used for RPC communication between Hadoop nodes and from client programs to Hadoop services.
• Language-neutral data serialization system: write and read in languages such as C, C#, C++, Java, JavaScript, PHP, Python, and Ruby.
• The schema is language independent, so code generation is optional and the encoding is compact. It has rich schema resolution capability.
• Data types -- primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed).
• Java supports specific, generic, and reflect (slow) mappings of Java types to Avro data types.
• Avro supports in-memory serialization and deserialization.
• A data file has a header containing metadata (the Avro schema and a sync marker), followed by a series of blocks containing the serialized Avro objects.
• Ref: Avro Documentation
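Since Avro schemas are plain JSON, a record schema can be shown with nothing but the standard library. The `User` record and its fields below are illustrative, not from the slides; real serialization would go through the `avro` package (e.g. parsing this schema and writing records via a data file writer, which adds the header and sync markers described above).

```python
import json

# An Avro record schema is ordinary JSON. Field names here are
# illustrative. The union ["null", "string"] makes `email` optional.
schema_json = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""
schema = json.loads(schema_json)
print(schema["name"])                          # User
print([f["name"] for f in schema["fields"]])   # ['id', 'name', 'email']
```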
Parquet
• Columnar storage for nested data => helps file size and query performance.
• Good for nested data storage.
• ORCFile (Optimized Row Columnar file) is a similar format used in the Hive project.
• In-memory data models can be used to read and write Parquet files.
• Data types -- primitive types (boolean, int32, int64, int96, float, double, binary, fixed_len_byte_array) and logical types (UTF8, ENUM, DECIMAL, DATE, LIST, MAP).
• Parquet doesn't need a sync marker as Avro does, because block boundaries are stored in the footer metadata.
• Nested structure is encoded by two integers per value: the definition level and the repetition level.
• A file has a header and blocks; blocks contain row groups of column chunks, which in turn contain many pages.
• Supported compression codecs: Snappy, gzip, LZO.
• Default block size equals the HDFS block size of 128 MB. Default page size is 1 MB; a page is the smallest unit of storage.
• Parquet files can be processed using Hive, Impala, and Pig.
• Ref: Parquet Documentation
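The core idea of columnar storage is just a row-to-column transposition, sketched below in plain Python. Storing each column contiguously is what enables per-column encoding/compression and reading only the columns a query touches; real Parquet adds row groups, pages, and the definition/repetition levels for nesting.

```python
# Sketch: the row-to-columnar shuffle at the heart of Parquet.
rows = [
    {"id": 1, "city": "NYC", "temp": 21.5},
    {"id": 2, "city": "NYC", "temp": 22.0},
    {"id": 3, "city": "SFO", "temp": 18.0},
]

def to_columns(rows):
    """Transpose a list of records into one array per column."""
    cols = {}
    for row in rows:
        for key, value in row.items():
            cols.setdefault(key, []).append(value)
    return cols

columns = to_columns(rows)
# Runs of repeated values ('NYC', 'NYC') are why columnar data
# compresses well, and a query on `temp` never reads `city`.
print(columns["city"])   # ['NYC', 'NYC', 'SFO']
```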
Flume
• High-volume ingestion into Hadoop (event-based data such as web/app server log files and JMS messages). Flume agents are configured through properties.
• A Flume agent runs sources and sinks connected via channels. Sinks can be HDFS, HBase, or Solr; the type of each component can be logger, file, directory, etc.
• Fan-out lets a source deliver to multiple channels (e.g. one file and one memory channel) in the same agent.
• Load balancing is achieved by having one agent's sink send to two downstream Flume agents for processing.
• Source categories: Avro, exec, HTTP, JMS, netcat, sequence generator, spooling directory, syslog, Thrift, Twitter. Sink categories: Avro, Elasticsearch, file roll, HBase, HDFS, IRC, logger, morphline (Solr), null, Thrift. Channels: file, JDBC, memory. Interceptors: host, morphline, regex filtering, static, timestamp, UUID.
• Ref: Flume Documentation
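A minimal single-agent configuration illustrating the source -> channel -> sink wiring above. The agent name and paths are placeholders; the example uses a spooling-directory source, a memory channel, and an HDFS sink.

```properties
# Illustrative Flume agent config (agent name and paths are placeholders).
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Spooling-directory source: ingest files dropped into a directory
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/log/incoming
agent1.sources.src1.channels = ch1

# Memory channel: fast, but events are lost if the agent dies
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# HDFS sink: write events out, bucketed by date
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events/%Y-%m-%d
agent1.sinks.sink1.channel = ch1
```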
Sqoop
• Extracts data from structured data
sources such as an RDBMS; MapReduce
and Hive are used under the hood.
• Best learned from the documentation,
as it is mostly CLI-based commands.
• Ref:Sqoop Documentation
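A typical import invocation looks like the template below; the connect string, credentials, and table name are placeholders, not values from the slides.

```shell
# Illustrative Sqoop import: RDBMS table -> HDFS directory,
# parallelized across 4 map tasks. All values are placeholders.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username dbuser \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```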
HBase
• A non-relational, distributed, column-oriented database.
• Integrates with MapReduce; provides a REST API and a Java client API, bulk
imports, block cache and Bloom filters for real-time queries, and
replication across clusters / backup options.
• A single table is partitioned into regions. Regions are assigned to
region servers across the cluster.
• RDBMS mapping: entity (table), attributes (columns), relations (FKs),
*-to-* (junction tables), natural keys vs. artificial IDs.
• Rows are split into regions; each region is hosted on one server. Writes
go to memory first and are flushed to local files (a trade-off); reads
merge rows from memory and the flushed files. Reads and writes are
consistent at the row level.
• Table: the design space. Row: an atomic key/value container. Column:
a key in the K/V container inside a row, with a value and a timestamp.
Column family: divides columns into separate physical files.
• HBase is good for large datasets, sparse datasets, loosely coupled
(denormalized) records, and many concurrent clients.
• NoSQL API: get, put, append, increment, scan, delete,
checkAndPut, checkAndMutate, checkAndDelete, batch.
• Use cases: monitoring device logs.
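The table/row/column-family/timestamp shape described above can be modeled in a few lines. This is a toy in-memory stand-in, not the HBase API; the table and column names ("user#001", "info:name") are illustrative.

```python
# Toy model of HBase's storage shape: a table maps
# row key -> "family:qualifier" -> [(timestamp, value), ...] newest first.

class MiniHBaseTable:
    def __init__(self):
        self.rows = {}

    def put(self, row, column, value, ts):
        cells = self.rows.setdefault(row, {}).setdefault(column, [])
        cells.append((ts, value))
        cells.sort(reverse=True)            # newest version first

    def get(self, row, column):
        """Return the newest value for row/column, like an HBase Get."""
        cells = self.rows.get(row, {}).get(column)
        return cells[0][1] if cells else None

    def scan(self, start, stop):
        """Rows are kept sorted by key; scan a key range, like HBase Scan."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, {c: v[0][1] for c, v in self.rows[key].items()}

t = MiniHBaseTable()
t.put("user#001", "info:name", "Ada", ts=1)
t.put("user#001", "info:name", "Ada L.", ts=2)   # newer version wins
print(t.get("user#001", "info:name"))            # Ada L.
```

Row keys being sorted is what makes range scans cheap, and why row-key design matters so much in real HBase schemas.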
Crunch
• A high-level API for writing and testing complex MapReduce pipelines. It uses
a serializable-type data model and is good for non-tuple data such as images,
audio, and seismic data.
• It composes processing steps into pipelines, i.e. a DAG.
• Pipeline implementations: MapReduce, in-memory, Spark.
• Input sources / output targets: Avro, Parquet, sequence files, HBase, HFiles,
CSV, JDBC, text.
• Three interfaces for distributed datasets: PCollection<T>, PTable, PGroupedTable.
Works with Spark and Hadoop.
• It works with arbitrary objects and complex data types, supports an in-memory
execution engine, and fits data-flow patterns.
• Ref: Crunch Documentation
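The flavor of the Crunch model, chained transformations over a PCollection-like wrapper, can be sketched in plain Python. This is a stand-in in the spirit of parallelDo/filter/count, not the actual Crunch API.

```python
# Stand-in for Crunch's PCollection: each transformation returns a new
# collection, so a pipeline is a chain (DAG) of small steps.

class PCollection:
    def __init__(self, data):
        self.data = list(data)

    def parallel_do(self, fn):
        """Apply fn to every element (Crunch's parallelDo, in spirit)."""
        return PCollection(fn(x) for x in self.data)

    def filter(self, pred):
        return PCollection(x for x in self.data if pred(x))

    def count(self):
        counts = {}
        for x in self.data:
            counts[x] = counts.get(x, 0) + 1
        return counts

lines = PCollection(["a b", "b c", "c"])
words = PCollection(w for line in lines.data for w in line.split())
print(words.filter(lambda w: w != "c").count())   # {'a': 1, 'b': 2}
```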
Spark
• In-memory data processing; an alternative to
MapReduce for certain applications.
• 10x (on disk) to 100x (in memory) faster for
algorithms that repeatedly access data.
• Suited to iterative machine learning and iterative
data mining. APIs for Java, Scala, Python.
• Spark SQL (structured data processing), MLlib
(machine learning algorithms), GraphX (graph
processing), Spark Streaming (live data streams).
• A cluster manager is used (YARN, Mesos).
• Stage: each job is divided into smaller sets of tasks.
SparkContext: the connection to a Spark cluster, used
to create RDDs, accumulators, and broadcast variables
on that cluster.
• RDD (Resilient Distributed Dataset) is the core
abstraction in Spark: a fault-tolerant, immutable,
partitioned collection of elements operated on in
parallel. Operations include map, filter, persist.
Supported inputs: text files, sequence files, any
Hadoop InputFormat.
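A minimal stand-in for the RDD idea: an immutable, partitioned collection whose transformations return new collections and whose actions pull results back. Illustrative only; real RDDs add laziness, lineage-based fault tolerance, and a cluster behind the SparkContext.

```python
from functools import reduce as _reduce

class MiniRDD:
    """Toy RDD: data lives in partitions; map/filter return new RDDs."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        parts = [[] for _ in range(num_partitions)]
        for i, x in enumerate(data):
            parts[i % num_partitions].append(x)
        return cls(parts)

    def map(self, fn):                      # transformation
        return MiniRDD([[fn(x) for x in p] for p in self.partitions])

    def filter(self, pred):                 # transformation
        return MiniRDD([[x for x in p if pred(x)] for p in self.partitions])

    def reduce(self, fn):                   # action: per-partition, then merge
        partials = [_reduce(fn, p) for p in self.partitions if p]
        return _reduce(fn, partials)

    def collect(self):                      # action
        return [x for p in self.partitions for x in p]

rdd = MiniRDD.parallelize(range(10), num_partitions=3)
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).reduce(lambda a, b: a + b)
print(result)   # 120  (0 + 4 + 16 + 36 + 64)
```

Note how `reduce` combines within each partition before merging across partitions; this is the same shape the real engine uses to keep work local to each node.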
ZooKeeper
• Distributed computing pitfalls: unreliable networks, latency, limited bandwidth,
insecure networks, topology changes, multiple administrators, heterogeneous
networks, transport cost.
• It allows distributed processes to coordinate through a hierarchical namespace
of data registers (znodes), organized like a file system.
• Services provided: naming, configuration, locking & synchronization, group services.
• A leader is elected on service startup.
• Configuration stores: data location, transaction log, cluster membership info,
the myid file.
• CLI/API operations: create, delete, exists, setData, getData, getChildren, sync.
• Features: atomicity, notifications (watches), ordering, versioned writes,
sequential nodes, HA, ephemeral nodes (session-scoped, no children).
• Ref: Apache Zookeeper
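The hierarchical namespace above is just a tree of small data registers addressed by slash-separated paths. A toy model of that data model, mirroring the create/getData/exists/getChildren operations (paths and values are illustrative):

```python
# Toy model of ZooKeeper's data model: znodes in a path hierarchy,
# each holding a small blob of data.

class ZNodeTree:
    def __init__(self):
        self.nodes = {"/": b""}

    def create(self, path, data=b""):
        """Create a znode; its parent must already exist (as in ZooKeeper)."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError("parent does not exist: " + parent)
        self.nodes[path] = data

    def exists(self, path):
        return path in self.nodes

    def get_data(self, path):
        return self.nodes[path]

    def get_children(self, path):
        """Direct children only, like the getChildren operation."""
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.nodes
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

zk = ZNodeTree()
zk.create("/config")
zk.create("/config/db_host", b"10.0.0.5")
print(zk.get_children("/config"))        # ['db_host']
print(zk.get_data("/config/db_host"))    # b'10.0.0.5'
```

Storing shared configuration under well-known paths like this, with watches for change notification, is exactly the coordination pattern the slide describes.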
Storm
• Real-time streaming, key to the Lambda architecture (batch plus real-time data querying).
• Handles 1M+ messages per second per node. Fast, fault-tolerant, scalable, parallel streaming.
• Tuples (key/value, immutable) and streams. Spouts (analogous to Flume sources) emit tuples; bolts compute on tuples.
• Topology: a DAG of spouts and bolts, i.e. a streaming computation.
• Stream groupings: shuffle, localOrShuffle, fields grouping.
• Architecture: Nimbus generates and controls tasks; ZooKeeper, supervisors, and workers run them.
• Trident: built on top of Storm for merges and joins, aggregation, grouping, functions, and filters. Stateful, for incremental
processing; stream-oriented API; micro-batch oriented.
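The spout -> bolt -> bolt topology above can be sketched as chained generators: a sentence spout emits tuples, a split bolt fans each sentence into word tuples, and a count bolt aggregates. This runs synchronously; real Storm distributes each component across parallel workers.

```python
# Sketch of a Storm word-count topology as a chain of Python generators.
# Sample sentences are illustrative.

def sentence_spout():
    """Spout: emits (stream_id, value) tuples."""
    for sentence in ["the cat", "the dog"]:
        yield ("sentence", sentence)

def split_bolt(stream):
    """Bolt: one sentence tuple in, many word tuples out."""
    for _, sentence in stream:
        for word in sentence.split():
            yield ("word", word)

def count_bolt(stream):
    """Bolt: terminal aggregation over the word stream."""
    counts = {}
    for _, word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Topology wiring: spout -> split -> count (a tiny DAG)
totals = count_bolt(split_bolt(sentence_spout()))
print(totals)   # {'the': 2, 'cat': 1, 'dog': 1}
```

In real Storm, a fields grouping on the word would route all tuples for the same word to the same count-bolt task, so the partial counts never need merging.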
For data migration [ fundamentals ]
• Kappa architecture: a simplification of Lambda without the batch layer.
• Lambda architecture: an immutable sequence of data handled via both streaming and batch processing.
• CAP theorem: a distributed database cannot guarantee consistency, availability, and partition tolerance at the same time; NoSQL systems choose which guarantee to relax.
Big Data Landscape

More Related Content

What's hot

BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache HadoopOleksiy Krotov
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013WANdisco Plc
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Data model for analysis of scholarly documents in the MapReduce paradigm
Data model for analysis of scholarly documents in the MapReduce paradigm Data model for analysis of scholarly documents in the MapReduce paradigm
Data model for analysis of scholarly documents in the MapReduce paradigm Adam Kawa
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataCyanny LIANG
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemVaibhav Jain
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalogAdam Muise
 
Hadoop - HDFS
Hadoop - HDFSHadoop - HDFS
Hadoop - HDFSKavyaGo
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopRan Ziv
 

What's hot (20)

BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache Hadoop
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Data model for analysis of scholarly documents in the MapReduce paradigm
Data model for analysis of scholarly documents in the MapReduce paradigm Data model for analysis of scholarly documents in the MapReduce paradigm
Data model for analysis of scholarly documents in the MapReduce paradigm
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
HDFS
HDFSHDFS
HDFS
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big data
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog
 
Hadoop - HDFS
Hadoop - HDFSHadoop - HDFS
Hadoop - HDFS
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
HUG slides on NFS and ODBC
HUG slides on NFS and ODBCHUG slides on NFS and ODBC
HUG slides on NFS and ODBC
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
 

Similar to Hadoop

Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiridatastack
 
Hadoop storage
Hadoop storageHadoop storage
Hadoop storageSanSan149
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquetNAVER D2
 
Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1Sandeep Kunkunuru
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.MaharajothiP
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...inside-BigData.com
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad ranaData Con LA
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillMapR Technologies
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsNetajiGandi1
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big DataJoe Alex
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheSandeepTaksande
 

Similar to Hadoop (20)

Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Drill at the Chicago Hug
Drill at the Chicago HugDrill at the Chicago Hug
Drill at the Chicago Hug
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Hadoop storage
Hadoop storageHadoop storage
Hadoop storage
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquet
 
Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data Analytics
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs Apache
 

Recently uploaded

Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineeringssuserb3a23b
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 

Recently uploaded (20)

Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineering
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 

Hadoop

  • 4. Hadoop Ecosystem • HDFS: Hadoop Distributed File System is a distributed file system which can be installed on commodity servers. HDFS offers a way to store large data files on numerous machines and is designed to be fault tolerant due its data replication feature. • YARN: Yet Another Resource Negotiator aka MapReduce V2. Its a framework for job scheduling and managing resources on cluster. • Flume: Its a tool or service that is used for aggregating, collecting and moving large amount of log data in and out of Hadoop. • Zookeeper: Its a framework that enables highly reliable distributed coordination of nodes in the cluster. • Sqoop: “SQL-to-Hadoop” or Sqoop is a tool for efficient transfer between Hadoop and structured data sources i.e Relational Database or other Hadoop data stores, e.g. Hive or HBase. • Oozie: Workflow scheduler system to manage Hadoop jobs. The jobs may include non MapReduce jobs. • Pig: Initially developed at Yahoo!, Pig is a framework consisting of high level scripting language i.e Pig Latin along with a run time environment to which allows user to run MapReduce on Hadoop cluster. • Mahout: Mahout is a scalable machine learning and data mining library. • R Connectors: R Connectors are used for generating statistics of the nodes in a cluster. • Hive: Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. • HBase: Its a column-oriented non-rational database management system that runs on top of HDFS. • Amabri: This component of Hadoop ecosystem is used for provisioning, managing and monitoring the Hadoop cluster.
  • 5. Hadoop HDFS • Hadoop cluster for storage. Data replicated over multiple machines. • Master/Slave Architecture ( 1 master n slaves )=> nameNode/DataNode. • Designed for largerfiles ( 64Mb), but handles for streaming of volume of data. • A file to be written in split into blocks. Each block by default is replicated 3 times by default, distributed across cluster ( node, rack, datacenter) hence maintaining network topology for productive search. • Nodes are self replicated to maintain failover.
  • 6. Projects Data Format Avro Parquet Data Ingestion Flume Sqoop Data Processing Pig Hive Crunch Spark Data Storage HBase Coordination Zookeeper
  • 7. Data Formats • Data extraction from Hadoop nodes by Hadoop client. • MapReduce needs • Splittable • Dynamic schema searchable • Binary format able • Compress able • Encoding and decoding • Serializable and de-serializable • RPC • Language independent ( Neutral ) THRIFT Google Protocol Buffers
• 8. Avro
• Schemas are defined in JSON; data is encoded in a compact binary format.
• A data serialization framework supporting compression and splittability for MapReduce.
• RPC communication between Hadoop nodes, and from client programs to Hadoop services.
• Language-neutral data serialization system — write and read in languages such as C, C#, C++, Java, JavaScript, PHP, Python, Ruby.
• Language-independent schema, hence code generation is optional and the encoding is compact. It has rich schema resolution capability.
• Data types: primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed).
• Java supports specific, generic, and reflect (slow) mappings of types to the Avro data types, avoiding compilation problems.
• Avro serializes and deserializes in memory.
• A data file has a header containing metadata — the Avro schema and a sync marker — followed by a series of blocks containing the serialized Avro objects.
• Ref: Avro Documentation
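A minimal example of the JSON schema declaration Avro uses for a record type — the record name and fields here are illustrative, and the union with null makes the email field optional:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```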
• 9. Parquet
• Columnar storage that can store nested data => helps file size and query performance.
• Good for nested data storage.
• ORCFile (Optimized Row Columnar file) is a similar format used in the Hive project.
• In-memory data models can be used to read and write Parquet files.
• Data types: primitive types (boolean, int32, int64, int96, float, double, binary, fixed_len_byte_array) and logical types (UTF8, ENUM, DECIMAL, DATE, LIST, MAP).
• Parquet doesn't need a sync marker as in Avro, because block boundaries are stored in the footer metadata.
• Nested structure is encoded by 2 integers: the definition level and the repetition level.
• A file has a header and blocks; blocks contain row groups of column chunks, which in turn contain many pages.
• Compression codecs supported: Snappy, gzip, LZO.
• Default block size is the same as the HDFS block size of 128 MB. Default page size is 1 MB; the page is the smallest unit of storage.
• Parquet files are processed using Hive, Impala, Pig.
• Ref: Parquet Documentation
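As one of the tools listed above, Hive can write Parquet directly; a minimal DDL fragment (table and column names are illustrative):

```sql
CREATE TABLE events (
  user_id    BIGINT,
  event_type STRING,
  ts         TIMESTAMP
)
STORED AS PARQUET;
```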
• 10. Flume
• High-volume ingestion into Hadoop (event-based data such as web/app server log files, JMS messages). Flume agents are configured through properties files.
• A Flume agent runs sources and sinks connected via channels. Sinks include HDFS, HBase, and Solr; other sink types include logger, file, and directory.
• Fan-out lets one source deliver to multiple channels (e.g. one file channel and one memory channel) in the same agent.
• Load balancing is catered for by having one agent's sink send to two downstream Flume agents for processing.
• Source categories: avro, exec, HTTP, JMS, netcat, sequence generator, spooling directory, syslog, thrift, twitter. Sink categories: avro, elasticsearch, file roll, hbase, hdfs, irc, logger, morphline (Solr), null, thrift. Channels: file, jdbc, memory. Interceptors: host, morphline, regex filtering, static, timestamp, UUID.
• Ref: Flume Documentation
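A minimal agent properties file of the kind described above — one netcat source wired to a logger sink via a memory channel (the agent name `a1` and component names are illustrative):

```properties
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source: listen for lines on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Sink: log events at INFO level
a1.sinks.k1.type = logger

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```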
• 11. Sqoop
• Extracts data from a structured data source such as an RDBMS. MapReduce and Hive are used under the hood.
• Mostly driven through CLI commands; see the documentation for the full command set.
• Ref: Sqoop Documentation
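An illustrative import command of the kind Sqoop's CLI provides — the connection string, table, and target directory are placeholders:

```
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --table EMPLOYEES \
  --target-dir /user/hadoop/employees \
  --num-mappers 4
```

The `--num-mappers` flag controls how many parallel MapReduce tasks perform the transfer.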
• 12. HBase
• A non-relational, distributed, column-oriented DB.
• Integrates with MapReduce; offers a REST API and a Java client API, bulk imports, block cache and Bloom filters for real-time queries, and replication across clusters / backup options.
• A single table is partitioned into regions. Regions are assigned to region servers across the cluster.
• RDBMS concepts for comparison: entity (table), attributes (columns), relations (foreign keys), *-to-* (junction table), natural keys (vs artificial IDs).
• Rows are split into regions; a region is hosted on one server; writes go to memory and are flushed to disk; reads merge rows from the in-memory store and the flushed local files; reads and writes are consistent at the row level.
• Table — the design space. Row — atomic key/value container. Column — a key in the K/V container inside a row, with a value and a timestamp. Column family — divides columns into separate physical files.
• HBase is good for large datasets, sparse datasets, loosely coupled (denormalized) records, and many concurrent clients.
• NoSQL API: get, put, append, increment, scan, delete, checkAndPut, checkAndMutate, checkAndDelete, batch.
• Use cases: monitoring device logs.
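A few HBase shell commands illustrating the table / column-family / cell model above (the table and column names are made up):

```
create 'devices', 'metrics'
put 'devices', 'sensor-1', 'metrics:temp', '21.5'
get 'devices', 'sensor-1'
scan 'devices', {LIMIT => 10}
delete 'devices', 'sensor-1', 'metrics:temp'
```

Note the cell is addressed by row key, then `family:qualifier`; each put creates a new timestamped version.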
• 13. Crunch
• A high-level API for writing and testing complex MapReduce pipelines. It uses a serializable-type data model and is good for non-tuple data types such as images, audio, and seismic data.
• Composes processing pipelines as a DAG.
• Pipeline execution engines: MapReduce, in-memory, Spark.
• Input sources / output targets: Avro, Parquet, sequence files, HBase, HFiles, CSV, JDBC, text.
• 3 interfaces for distributed datasets: PCollection<T>, PTable<K, V>, PGroupedTable<K, V>. Works with Spark and Hadoop.
• Uses arbitrary objects, suiting complex data types. Supports an in-memory execution engine. Use Crunch when your processing follows a data-flow pattern.
• Ref: Crunch Documentation
• 14. Spark
• In-memory data processing; an alternative to MapReduce for certain applications.
• 10x (on disk) to 100x (in-memory) faster for algorithms that reuse data.
• Good for iterative machine learning and iterative data mining algorithms. APIs for Java, Scala, Python.
• Spark SQL (structured data processing), MLlib (machine learning algorithms), GraphX (graph processing), Spark Streaming (live data streams).
• A cluster manager is used (YARN, Mesos).
• Stage — each job is divided into smaller sets of tasks. SparkContext — the connection to a Spark cluster, used to create RDDs, accumulators, and broadcast variables on that cluster.
• RDD (Resilient Distributed Dataset) is the core abstraction in Spark: a fault-tolerant, immutable, partitioned collection of elements operated on in parallel. Operations include map, filter, persist. Supported input types include text files, sequence files, and any Hadoop InputFormat.
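The RDD transformation style can be sketched in plain Python — this is the classic word-count pattern (flatMap, map, reduceByKey), not the actual Spark API, which would start from `sc.textFile(...)`:

```python
# Plain-Python sketch of the RDD word-count pattern.
lines = ["to be or not to be", "to do or not to do"]

words = [w for line in lines for w in line.split()]   # flatMap: line -> words
pairs = [(w, 1) for w in words]                       # map: word -> (word, 1)

counts = {}
for w, n in pairs:                                    # reduceByKey: sum the 1s
    counts[w] = counts.get(w, 0) + n

print(counts["to"])  # 4
```

In Spark each of these steps would be a lazy transformation on a partitioned RDD, executed in parallel across the cluster only when an action (e.g. `collect`) is called.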
• 16. ZooKeeper
• Distributed computing pitfalls it helps address — unreliable networks, latency, limited bandwidth, insecure networks, topology changes, many administrators, heterogeneous networks, transport cost.
• Allows distributed processes to coordinate using a hierarchical namespace of data registers (znodes), organized like a file system.
• Services catered for: naming, configuration, locks & synchronization, group services.
• A leader is elected on service startup.
• Configuration covers — data location, transaction log, cluster membership info, the myid file.
• CLI/API operations — create, delete, exists, set data, get data, get children, sync.
• Features — atomicity, notifications (watches), ordering, versioned writes, sequential nodes, HA, ephemeral nodes (lifecycle tied to the session, no children).
• Ref: Apache Zookeeper
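The hierarchical namespace of data registers can be modeled as paths mapped to byte values — a toy in-memory sketch, not a real client (applications would use a library such as kazoo):

```python
# Toy model of ZooKeeper's hierarchical znode namespace.
znodes = {"/": b""}

def create(path, data=b""):
    # A znode can only be created under an existing parent, as in ZooKeeper.
    parent = path.rsplit("/", 1)[0] or "/"
    if parent not in znodes:
        raise ValueError("parent znode does not exist: " + parent)
    znodes[path] = data

def exists(path):
    return path in znodes

def get_data(path):
    return znodes[path]

create("/config")
create("/config/db_host", b"10.0.0.5")
print(exists("/config/db_host"))    # True
print(get_data("/config/db_host"))  # b'10.0.0.5'
```

The real service adds what this sketch omits: replication across an ensemble, ordered atomic updates, watches, and ephemeral/sequential node flavors.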
• 17. Storm
• Real-time streaming, key for the Lambda architecture [batch plus real-time data querying].
• 1M+ messages per second per node. Fast, fault-tolerant, scalable, parallel streaming.
• Tuples [key/value, immutable] form streams. Spouts [like a Flume source, emit tuples]. Bolts [computation on tuples].
• Topology — a DAG of spouts and bolts; a streaming computation.
• Stream groupings — shuffle, localOrShuffle, fields grouping.
• Architecture — uses Nimbus to generate/control tasks, ZooKeeper, supervisors, workers.
• Trident — built on top of Storm for merge and join, aggregate, grouping, functions, filters. Stateful, for incremental processing. Stream-oriented, micro-batch oriented API. For data migration [fundamentals].
• Kappa architecture — a simplification of Lambda without batch processing [picture on left].
• Lambda architecture — an immutable sequence of events processed via both streaming and batch [picture on right].
• CAP theorem — states a database cannot guarantee consistency, availability, and partition tolerance all at the same time; hence the trade-offs made by NoSQL systems.
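The spout-to-bolt dataflow can be sketched with plain Python generators — this illustrates the topology idea only, not the actual Storm API:

```python
# Sketch of a spout -> bolt -> bolt pipeline: a spout emits tuples,
# bolts transform or aggregate them.
def sentence_spout():
    # Emits a stream of sentences (the "tuples" here are plain strings).
    yield from ["storm processes streams", "streams of tuples"]

def split_bolt(stream):
    # Splits each sentence tuple into word tuples.
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    # Aggregates word tuples into running counts.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

result = count_bolt(split_bolt(sentence_spout()))
print(result["streams"])  # 2
```

In a real topology each spout and bolt would run as many parallel tasks, with a stream grouping (shuffle, fields, etc.) deciding which task receives each tuple.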

Editor's Notes

  1. http://www.slideshare.net/PhilippeJulio/hadoop-architecture/43-HIGH_AVAILABILITY_SOLUTIONS_NameNode_JobTracker -- reference ( big data analytics with Hadoop , Philippe Julio)
http://www.sagarjain.com/the-hadoop-ecosystem-in-a-nutshell/