SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
1Page:
Effectively Deploying
Hadoop to the Cloud
By Avinash Ramineni
2Page:
Agenda
• Hadoop - 101
• Understanding the needs
• Amazon Web Services - 101
• Compute and Storage Types
• HA and DR options
• Run your own cluster ?
• Questions
3Page
Clairvoyant
4Page:
Clairvoyant Services
5Page:
Hadoop - Key Concepts to
understand
● Unlike CPU, RAM improvements the disk speed has not changed much in the last
10 years
● Hardware fails! Redundancy
● Data locality
● Commodity hardware: Hadoop prefers more average nodes than few supernodes
● Hadoop is a giant I/O platform: To counter disk slowness > more spindles per node
for parallelism > more network traffic > more data per node > longer re-replication
times
6Page:
Understanding your needs
Technical Use Case Hive (MR2 or Spark) Spark Hbase Impala
Data Processing o x x
Machine Learning x o
Operational Database (Fast read write
lookups)
o
Analytic Database (Ad-hoc queries) x x o
Most Common: o Common: x
7Page:
Understanding your needs
Processing Frameworks CPU Memory IO-Throughput IO-Latency
Hive / MR2 Heavy Light/Medium Most important but tends to
be cpu bound
Spark Medium Heavy ML - Low, ETL - High Streaming use cases
HBase Low Medium Medium Most important factor
Impala Medium (concurrency
will need more cores)
Medium Full scans Throughput matters over
latency for almost all use
cases
8Page:
Common Architectural Patterns
● Long-Running Clusters Vs Transient Clusters
● Patterns based on Storage Types
○ S3
○ EBS
○ Ephemeral
9Page:
Transient and Long-Running Clusters
10Page:
AWS 101
- AWS Compute Services
- EC2
- AWS Storage Services
- EBS, S3, EFS
- Instance/Ephemeral Storage
- AWS Relational Database services
- AWS Networking
- VPC , DirectConnect
11Page:
Storage Types - AWS, Azure, GCP
12Page:
Cloud Storage: Cost vs Latency vs Throughput
Image: Courtesy Cloudera
13Page:
Hadoop Clusters on AWS - Storage - Instance Storage
● Instance Storage / Ephemeral Storage
○ Temporary Storage added to certain EC2 instances (d2,i2..)
○ Storage physically attached to the host computer
○ Attached at instance launch time
○ Data is reset when instance is halted, stopped, terminated
○ Faster than EBS Volumes (IOPS and throughput)
○ No specific storage cost
● Best suited for fault-tolerant architectures like Hadoop ( Long running / transient clusters -
with S3 backed)
14Page:
Hadoop Clusters on AWS - Storage - EBS Volumes
● Elastic Block Storage
○ Storage persists independently from the life of EC2 instance
○ Network attached disks
○ Multiple types (IOPs are dependent upon size of the disk)
■ General Purpose SSD (gp2)
■ Provisioned IOPS SSD (io1)
■ Throughput Optimized (st1)
■ Cold HDD (sc1)
■ Magnetic (standard)
○ Support for snapshots.
○ Encrypted to protect the data in-transit and at-rest
15Page:
Hadoop Clusters on AWS - Storage - EBS Volumes
● EBS- Optimized instances
○ Dedicated bandwidth to EBS
● Minimum dedicated EBS Bandwidth - 1000 Mbps (125 MB / s)
○ 1 TB st1 / 3.2 TB sc1 : 40 MB/s
○ 4 - 1 TB st1 volumes for instance ?
● No more than 25 EBS volumes per instance
● Increase read-ahead for high throughput (sudo blockdev --settra 2048 /dev/<device>)
● EBS Volumes Baseline , Burst performance and Burst Credit bucket
16Page:
Hadoop Clusters on AWS - Instance Types
● Attach Ephemeral disks if they are available for the instances
● Pick instance types with enhanced networking like C4,C3,R3,R4,M4 instances
● CPU
○ 2 vCPU for OS ,1 vCPU for each master service
● Placement groups
● Support for different types of instances for worker nodes ?
17Page:
Sizing the cluster
● Sizing the cluster based on IOPs
○ data to be scanned at a time for the workloads
● Sizing the cluster based on CPU Usage
○ Executors/Mappers/reducers/parallelism required for the processing
● Sizing based on data
○ compression ratio (Snappy, LZOP, etc)
○ Replication factor
○ Initial size of data
○ Intermediate data factor, for temporary storage
○ Rate of data increase
● Sizing based on Memory
18Page:
Hadoop Clusters on AWS - Recommendations
● Master
○ Use gp2 volumes for master nodes
○ 100 GB volume for each HDFS metadata , Zookeeper
○ 1 vCPU for each master service
● Worker
○ 1 vCPU for each worker service
○ Use st1 volumes for data storage
● Recommended to use HVM AMIs over PV (paravirtualization)
● Cloud Watch Monitoring for disk usage
○ burst balance, volume read/write bytes, volume read/write ops, volume queue length
19Page:
Hadoop Clusters on AWS - HA , DR
● HA
○ HDFS HA, YARN HA, HBase HA, Hive HA
● AWS Regions
○ Hadoop nodes across Regions ?
● AWS Availability Zones
○ Hadoop nodes across Availability zones ?
● Recovery Point Objective and Recovery Time Objective
○ Backup data / metadata to S3
○ Separate DR cluster
■ Parallel Ingest
■ Parallel Compute
■ Distcp/ Cloudera BDR
20Page:
Hadoop Clusters on AWS - Security
● Hadoop Security
○ Kerberos
○ Sentry / Ranger / Navigator / Atlas
● S3
○ IAM Roles
○ Server side encryption
● EBS volume encryption
● HDFS transparent encryption
○ encryption zones
○ navigator encrypt
21Page:
Typical Deployment Layout on AWS
22Page:
Tiny / Small / Medium / Large?
Tiny: 1-9 Small: 10-99 Medium: 100-999 Large: 1000+
MINIMAL HA CLUSTER
Cloudera Manager - (1) Master Nodes A - (2) Master Node B - (1) Worker Nodes -
(4)
Security Node A - (1) Security Node B - (1)
Alert Publisher
Event Server
Host Monitor
Navigator Audit Server
Navigator Metadata
Server
Reports Manager
Service Monitor
CM Database
ZooKeeper
HDFS JournalNode
HDFS Failover Controller
HDFS NameNode
YARN Resource
Manager
Hive Metastore
HiveServer2
Hue
Oozie
HBase Master
HDFS Gateway
YARN Gateway
Spark Gateway
Hive Gateway
Sqoop Gateway
ZooKeeper
HDFS JournalNode
HDFS Balancer
Impala Catalog Server
Impala StateStore
YARN History Server
Spark History Server
HBase/SOLR Key Value
Indexer
Sentry
Key Trustee KMS
HDFS Gateway
YARN Gateway
Spark Gateway
Hive Gateway
Sqoop Gateway
HDFS DataNode
YARN
NodeManager
Impala Impalad
HBase
RegionServer
SOLR Server
Kafka Broker
Flume Agent
KDC
LDAP
Private Certificate Authority
Active Key Trustee Server
Key Trustee Server Active
Database
KDC
LDAP
Passive Key Trustee
Server
Key Trustee Server
Passive Database
Cloudera licensed services / Orange is for Client Gateway roles / Blue is for Ingest services /Purple is for RDBMS
database.
Hadoop Cluster Security Cluster
23Page:
Pricing
24Page:
Run your own cluster?
● Amazon S3, Kinesis, Lambda, Athena, Databricks
● GCP -- Google Data flow
● Azure
● Cloudera Altus
● Databricks
Image: Courtesy Amazon
25Page:
AWS Big Data
Image: Courtesy Amazon
26Page:
Azure Big Data
Image: Courtesy Microsoft
27Page:
Google Big Data
Image: Courtesy Google
28Page:
It’s a tricky job
29Page:
Feedback & Questions
Effectively Deploying Hadoop to the Cloud
30Page:
Thank You!
Avinash Ramineni
Principal, Clairvoyant
avinash@clairvoyantsoft.com
https://www.linkedin.com/in/avinashramineni/

Weitere ähnliche Inhalte

Was ist angesagt?

Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDBMongoDB
 
Ceph Object Storage at Spreadshirt
Ceph Object Storage at SpreadshirtCeph Object Storage at Spreadshirt
Ceph Object Storage at SpreadshirtJens Hadlich
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News AggregatorMário Almeida
 
Propelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive AnalyticsPropelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive AnalyticsSingleStore
 
Using Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider DataUsing Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider DataRob Gardner
 
Hadoop over rgw
Hadoop over rgwHadoop over rgw
Hadoop over rgwzhouyuan
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
An intro to Ceph and big data - CERN Big Data Workshop
An intro to Ceph and big data - CERN Big Data WorkshopAn intro to Ceph and big data - CERN Big Data Workshop
An intro to Ceph and big data - CERN Big Data WorkshopPatrick McGarry
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series WorldMapR Technologies
 
Kafka website activity architecture
Kafka website activity architectureKafka website activity architecture
Kafka website activity architectureOmid Vahdaty
 
Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...
Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...
Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...TomBarron
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...
Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...
Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...Lviv Startup Club
 
Ceph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud worldCeph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud worldSage Weil
 
New Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference ArchitecturesNew Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference ArchitecturesKamesh Pemmaraju
 

Was ist angesagt? (18)

Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDB
 
Ceph Research at UCSC
Ceph Research at UCSCCeph Research at UCSC
Ceph Research at UCSC
 
Ceph Object Storage at Spreadshirt
Ceph Object Storage at SpreadshirtCeph Object Storage at Spreadshirt
Ceph Object Storage at Spreadshirt
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News Aggregator
 
Propelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive AnalyticsPropelling IoT Innovation with Predictive Analytics
Propelling IoT Innovation with Predictive Analytics
 
Devopsconf 2015 sebamontini
Devopsconf 2015 sebamontiniDevopsconf 2015 sebamontini
Devopsconf 2015 sebamontini
 
Using Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider DataUsing Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider Data
 
Hadoop over rgw
Hadoop over rgwHadoop over rgw
Hadoop over rgw
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
An intro to Ceph and big data - CERN Big Data Workshop
An intro to Ceph and big data - CERN Big Data WorkshopAn intro to Ceph and big data - CERN Big Data Workshop
An intro to Ceph and big data - CERN Big Data Workshop
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series World
 
Kafka website activity architecture
Kafka website activity architectureKafka website activity architecture
Kafka website activity architecture
 
Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...
Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...
Practical CephFS with nfs today using OpenStack Manila - Ceph Day Berlin - 12...
 
RubiX
RubiXRubiX
RubiX
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...
Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...
Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...
 
Ceph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud worldCeph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud world
 
New Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference ArchitecturesNew Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference Architectures
 

Ähnlich wie Effectively deploying hadoop to the cloud

AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data Omid Vahdaty
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)Rick Hwang
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanNarayana B
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...Amazon Web Services
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudEdureka!
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Junping Du
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
Aerospike Architecture
Aerospike ArchitectureAerospike Architecture
Aerospike ArchitecturePeter Milne
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to SchoolAdam Doyle
 
Backup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Backup management with Ceph Storage - Camilo Echevarne, Félix BarbeiraBackup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Backup management with Ceph Storage - Camilo Echevarne, Félix BarbeiraCeph Community
 
OpenStack Cinder, Implementation Today and New Trends for Tomorrow
OpenStack Cinder, Implementation Today and New Trends for TomorrowOpenStack Cinder, Implementation Today and New Trends for Tomorrow
OpenStack Cinder, Implementation Today and New Trends for TomorrowEd Balduf
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingSam Ng
 
Red Hat Gluster Storage Performance
Red Hat Gluster Storage PerformanceRed Hat Gluster Storage Performance
Red Hat Gluster Storage PerformanceRed_Hat_Storage
 
OSDC 2015: John Spray | The Ceph Storage System
OSDC 2015: John Spray | The Ceph Storage SystemOSDC 2015: John Spray | The Ceph Storage System
OSDC 2015: John Spray | The Ceph Storage SystemNETWAYS
 

Ähnlich wie Effectively deploying hadoop to the cloud (20)

AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
In-memory database
In-memory databaseIn-memory database
In-memory database
 
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Aerospike Architecture
Aerospike ArchitectureAerospike Architecture
Aerospike Architecture
 
Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices Workshop
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
Backup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Backup management with Ceph Storage - Camilo Echevarne, Félix BarbeiraBackup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
Backup management with Ceph Storage - Camilo Echevarne, Félix Barbeira
 
OpenStack Cinder, Implementation Today and New Trends for Tomorrow
OpenStack Cinder, Implementation Today and New Trends for TomorrowOpenStack Cinder, Implementation Today and New Trends for Tomorrow
OpenStack Cinder, Implementation Today and New Trends for Tomorrow
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data Processing
 
Red Hat Gluster Storage Performance
Red Hat Gluster Storage PerformanceRed Hat Gluster Storage Performance
Red Hat Gluster Storage Performance
 
OSDC 2015: John Spray | The Ceph Storage System
OSDC 2015: John Spray | The Ceph Storage SystemOSDC 2015: John Spray | The Ceph Storage System
OSDC 2015: John Spray | The Ceph Storage System
 

Mehr von Avinash Ramineni

Simplifying the data privacy governance quagmire building automated privacy ...
Simplifying the data privacy governance quagmire  building automated privacy ...Simplifying the data privacy governance quagmire  building automated privacy ...
Simplifying the data privacy governance quagmire building automated privacy ...Avinash Ramineni
 
Winning the war on data breaches in a changing data landscape
Winning the war on data breaches in a changing data landscapeWinning the war on data breaches in a changing data landscape
Winning the war on data breaches in a changing data landscapeAvinash Ramineni
 
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...Avinash Ramineni
 
Building zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafkaBuilding zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafkaAvinash Ramineni
 
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...Avinash Ramineni
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014Avinash Ramineni
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015Avinash Ramineni
 
Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniAvinash Ramineni
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaAvinash Ramineni
 
Event Driven Architectures
Event Driven ArchitecturesEvent Driven Architectures
Event Driven ArchitecturesAvinash Ramineni
 

Mehr von Avinash Ramineni (10)

Simplifying the data privacy governance quagmire building automated privacy ...
Simplifying the data privacy governance quagmire  building automated privacy ...Simplifying the data privacy governance quagmire  building automated privacy ...
Simplifying the data privacy governance quagmire building automated privacy ...
 
Winning the war on data breaches in a changing data landscape
Winning the war on data breaches in a changing data landscapeWinning the war on data breaches in a changing data landscape
Winning the war on data breaches in a changing data landscape
 
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
Autonomous Security: Using Big Data, Machine Learning and AI to Fix Today's S...
 
Building zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafkaBuilding zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafka
 
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015
 
Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash Ramineni
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
 
Event Driven Architectures
Event Driven ArchitecturesEvent Driven Architectures
Event Driven Architectures
 

Kürzlich hochgeladen

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Kürzlich hochgeladen (20)

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Effectively deploying hadoop to the cloud

  • 1. 1Page: Effectively Deploying Hadoop to the Cloud By Avinash Ramineni
  • 2. 2Page: Agenda • Hadoop - 101 • Understanding the needs • Amazon Web Services - 101 • Compute and Storage Types • HA and DR options • Run your own cluster ? • Questions
  • 5. 5Page: Hadoop - Key Concepts to understand ● Unlike CPU, RAM improvements the disk speed has not changed much in the last 10 years ● Hardware fails! Redundancy ● Data locality ● Commodity hardware: Hadoop prefers more average nodes than few supernodes ● Hadoop is a giant I/O platform: To counter disk slowness > more spindles per node for parallelism > more network traffic > more data per node > longer re-replication times
  • 6. 6Page: Understanding your needs Technical Use Case Hive (MR2 or Spark) Spark Hbase Impala Data Processing o x x Machine Learning x o Operational Database (Fast read write lookups) o Analytic Database (Ad-hoc queries) x x o Most Common: o Common: x
  • 7. 7Page: Understanding your needs Processing Frameworks CPU Memory IO-Throughput IO-Latency Hive / MR2 Heavy Light/Medium Most important but tends to be cpu bound Spark Medium Heavy ML - Low, ETL - High Streaming use cases HBase Low Medium Medium Most important factor Impala Medium (concurrency will need more cores) Medium Full scans Throughput matters over latency for almost all use cases
  • 8. 8Page: Common Architectural Patterns ● Long-Running Clusters Vs Transient Clusters ● Patterns based on Storage Types ○ S3 ○ EBS ○ Ephemeral
  • 10. 10Page: AWS 101 - AWS Compute Services - EC2 - AWS Storage Services - EBS, S3, EFS - Instance/Ephemeral Storage - AWS Relational Database services - AWS Networking - VPC , DirectConnect
  • 11. 11Page: Storage Types - AWS, Azure, GCP
  • 12. 12Page: Cloud Storage: Cost vs Latency vs Throughput Image: Courtesy Cloudera
  • 13. 13Page: Hadoop Clusters on AWS - Storage - Instance Storage ● Instance Storage / Ephemeral Storage ○ Temporary Storage added to certain EC2 instances (d2,i2..) ○ Storage physically attached to the host computer ○ Attached at instance launch time ○ Data is reset when instance is halted, stopped, terminated ○ Faster than EBS Volumes (IOPS and throughput) ○ No specific storage cost ● Best suited for fault-tolerant architectures like Hadoop ( Long running / transient clusters - with S3 backed)
  • 14. 14Page: Hadoop Clusters on AWS - Storage - EBS Volumes ● Elastic Block Storage ○ Storage persists independently from the life of EC2 instance ○ Network attached disks ○ Multiple types (IOPs are dependent upon size of the disk) ■ General Purpose SSD (gp2) ■ Provisioned IOPS SSD (io1) ■ Throughput Optimized (st1) ■ Cold HDD (sc1) ■ Magnetic (standard) ○ Support for snapshots. ○ Encrypted to protect the data in-transit and at-rest
  • 15. 15Page: Hadoop Clusters on AWS - Storage - EBS Volumes ● EBS- Optimized instances ○ Dedicated bandwidth to EBS ● Minimum dedicated EBS Bandwidth - 1000 Mbps (125 MB / s) ○ 1 TB st1 / 3.2 TB sc1 : 40 MB/s ○ 4 - 1 TB st1 volumes for instance ? ● No more than 25 EBS volumes per instance ● Increase read-ahead for high throughput (sudo blockdev --settra 2048 /dev/<device>) ● EBS Volumes Baseline , Burst performance and Burst Credit bucket
  • 16. 16Page: Hadoop Clusters on AWS - Instance Types ● Attach Ephemeral disks if they are available for the instances ● Pick instance types with enhanced networking like C4,C3,R3,R4,M4 instances ● CPU ○ 2 vCPU for OS ,1 vCPU for each master service ● Placement groups ● Support for different types of instances for worker nodes ?
  • 17. 17Page: Sizing the cluster ● Sizing the cluster based on IOPs ○ data to be scanned at a time for the workloads ● Sizing the cluster based on CPU Usage ○ Executors/Mappers/reducers/parallelism required for the processing ● Sizing based on data ○ compression ratio (Snappy, LZOP, etc) ○ Replication factor ○ Initial size of data ○ Intermediate data factor, for temporary storage ○ Rate of data increase ● Sizing based on Memory
  • 18. 18Page: Hadoop Clusters on AWS - Recommendations ● Master ○ Use gp2 volumes for master nodes ○ 100 GB volume for each HDFS metadata , Zookeeper ○ 1 vCPU for each master service ● Worker ○ 1 vCPU for each worker service ○ Use st1 volumes for data storage ● Recommended to use HVM AMIs over PV (paravirtualization) ● Cloud Watch Monitoring for disk usage ○ burst balance, volume read/write bytes, volume read/write ops, volume queue length
  • 19. 19Page: Hadoop Clusters on AWS - HA , DR ● HA ○ HDFS HA, YARN HA, HBase HA, Hive HA ● AWS Regions ○ Hadoop nodes across Regions ? ● AWS Availability Zones ○ Hadoop nodes across Availability zones ? ● Recovery Point Objective and Recovery Time Objective ○ Backup data / metadata to S3 ○ Separate DR cluster ■ Parallel Ingest ■ Parallel Compute ■ Distcp/ Cloudera BDR
  • 20. 20Page: Hadoop Clusters on AWS - Security ● Hadoop Security ○ Kerberos ○ Sentry / Ranger / Navigator / Atlas ● S3 ○ IAM Roles ○ Server side encryption ● EBS volume encryption ● HDFS transparent encryption ○ encryption zones ○ navigator encrypt
  • 22. 22Page: Tiny / Small / Medium / Large? Tiny: 1-9 Small: 10-99 Medium: 100-999 Large: 1000+ MINIMAL HA CLUSTER Cloudera Manager - (1) Master Nodes A - (2) Master Node B - (1) Worker Nodes - (4) Security Node A - (1) Security Node B - (1) Alert Publisher Event Server Host Monitor Navigator Audit Server Navigator Metadata Server Reports Manager Service Monitor CM Database ZooKeeper HDFS JournalNode HDFS Failover Controller HDFS NameNode YARN Resource Manager Hive Metastore HiveServer2 Hue Oozie HBase Master HDFS Gateway YARN Gateway Spark Gateway Hive Gateway Sqoop Gateway ZooKeeper HDFS JournalNode HDFS Balancer Impala Catalog Server Impala StateStore YARN History Server Spark History Server HBase/SOLR Key Value Indexer Sentry Key Trustee KMS HDFS Gateway YARN Gateway Spark Gateway Hive Gateway Sqoop Gateway HDFS DataNode YARN NodeManager Impala Impalad HBase RegionServer SOLR Server Kafka Broker Flume Agent KDC LDAP Private Certificate Authority Active Key Trustee Server Key Trustee Server Active Database KDC LDAP Passive Key Trustee Server Key Trustee Server Passive Database Cloudera licensed services / Orange is for Client Gateway roles / Blue is for Ingest services /Purple is for RDBMS database. Hadoop Cluster Security Cluster
  • 24. 24Page: Run your own cluster? ● Amazon S3, Kinesis, Lambda, Athena, Databricks ● GCP -- Google Data flow ● Azure ● Cloudera Altus ● Databricks Image: Courtesy Amazon
  • 25. 25Page: AWS Big Data Image: Courtesy Amazon
  • 26. 26Page: Azure Big Data Image: Courtesy Microsoft
  • 29. 29Page: Feedback & Questions Effectively Deploying Hadoop to the Cloud
  • 30. 30Page: Thank You! Avinash Ramineni Principal, Clairvoyant avinash@clairvoyantsoft.com https://www.linkedin.com/in/avinashramineni/