© Hortonworks Inc. 2011–2018. All rights reserved
Reimagine Apache Hadoop
on Google Cloud Platform
Siddharth Seth (Hortonworks)
Christopher Crosbie (Google)
Outline of the session
1 Google Cloud Storage Overview
2 Under the hood: Cloud Storage vs HDFS
3 Hadoop on Google Cloud - Deployment Patterns
4 Hadoop on Google Cloud - Resource Recommendations
More than 60 Google Cloud Platform services
• Compute: App Engine, Compute Engine, Container Engine, Container Registry, Cloud Functions
• Networking: Cloud DNS, Virtual Private Cloud, Cloud Load Balancing, Cloud CDN, Cloud Interconnect, Cloud Router, VPN, Firewall, External IP
• Big Data: BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab, Cloud Pub/Sub, Genomics
• Storage and Databases: Cloud Bigtable, Cloud Storage, Cloud Datastore, Cloud SQL, Cloud Spanner, Persistent Disk
• Identity & Security: Cloud IAM, Cloud Resource Manager, Cloud Security Scanner, BeyondCorp, Data Loss Prevention, Identity-Aware Proxy, Security Key Enforcement, Key Management Service
• Machine Learning: Cloud Machine Learning, Cloud Vision API, Cloud Speech API, Cloud Natural Language API, Cloud Translation API, Cloud Jobs API
More than 60 Google Cloud Platform services (continued)
• Management Tools: Stackdriver (Monitoring, Logging, Error Reporting, Trace, Debugger), Cloud Deployment Manager, Cloud Endpoints, Cloud Console
• Developer Tools: Cloud SDK, Cloud Deployment Manager, Cloud Source Repositories, Cloud Tools for Android Studio, Cloud Tools for IntelliJ, Cloud Tools for PowerShell, Cloud Tools for Visual Studio, Google Plug-in for Eclipse, Cloud Test Lab, Cloud Shell, Cloud Mobile App, Cloud Billing API, Cloud APIs
Google Cloud Platform Customers
Cloud Storage at Google Scale
● Unstructured object storage
● Stores exabytes of Google products’ data on the same backend (Google Docs, Photos, Gmail)
● Each of our large external customers downloads/uploads petabytes of data daily
● Each of them performs billions of operations daily
● We have plenty of space and scale for your data
Hadoop FileSystem Abstraction
org.apache.hadoop.fs.FileSystem
Why an abstraction for a distributed file system?
• A file can be larger than any single disk in the network
• Having the abstraction at the block level simplifies the storage subsystem
• A damaged block can be replicated from another source
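The three points above can be made concrete with a toy model. The Python sketch below splits a file into fixed-size blocks, keeps several replicas of each block, and restores a damaged replica from a healthy copy. Every name in it (BlockStore, BLOCK_SIZE, REPLICATION) is hypothetical and is not part of org.apache.hadoop.fs.FileSystem; it illustrates the idea only and is not how HDFS is implemented.

```python
"""Toy model of block-level storage. All names are illustrative, not Hadoop APIs."""
from typing import Dict, List, Optional

BLOCK_SIZE = 4    # bytes per block (tiny on purpose, for demonstration)
REPLICATION = 3   # number of copies kept of each block


class BlockStore:
    def __init__(self) -> None:
        # file name -> list of blocks; each block is a list of replicas
        self.files: Dict[str, List[List[Optional[bytes]]]] = {}

    def write(self, name: str, data: bytes) -> None:
        """Split data into blocks and store REPLICATION copies of each."""
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        self.files[name] = [[block] * REPLICATION for block in blocks]

    def damage(self, name: str, block_index: int, replica: int) -> None:
        """Simulate losing one replica of one block."""
        self.files[name][block_index][replica] = None

    @staticmethod
    def _healthy(replicas: List[Optional[bytes]]) -> bytes:
        for replica in replicas:
            if replica is not None:
                return replica
        raise IOError("all replicas of a block were lost")

    def repair(self, name: str) -> int:
        """Re-create missing replicas from a healthy copy; return how many were fixed."""
        repaired = 0
        for replicas in self.files[name]:
            good = self._healthy(replicas)
            for i, replica in enumerate(replicas):
                if replica is None:
                    replicas[i] = good
                    repaired += 1
        return repaired

    def read(self, name: str) -> bytes:
        """Reassemble the file from any healthy replica of each block."""
        return b"".join(self._healthy(replicas) for replicas in self.files[name])
```

Because reads and repairs work block by block, the file stays readable while a replica is missing, and no single disk ever needs to hold the whole file.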
Google Cloud Storage alongside HDFS
[Diagram: Google Cloud Storage serving Spark, Hive for Analysts, MapReduce ETL, Business Reporting, and Hive for IT]
More Benefits of the Cloud Storage Connector
• Direct data access
• HDFS compatibility
• Cloud interoperability
• Data accessibility
• High data availability
• No storage management overhead
• Quick startup
• Compatibility with existing code
• Google IAM security
Take Advantage of Storage Classes for Hadoop
Regional Storage
• Cost: $0.026 - $0.02 per GB per month
• Universal cloud storage for any workload. Can be Multi-Regional or Regional.
• Use for interactive Hive/Spark analysis or batch jobs that run more than once a month
Nearline Storage
• Cost: $0.01 per GB per month
• Cloud storage for long-term, less frequently accessed content.
• Use for batch jobs that only need the data for historical reporting/aggregations (at most once a month)
Coldline Storage
• Cost: $0.007 per GB per month
• Cloud storage for use cases that don't require high availability.
• Use for post-processed data that you don’t expect to use again (no more than once per year)
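The access-frequency guidance for the three storage classes can be written as a small rule. The Python sketch below is a hedged illustration: the function names are hypothetical, the prices are the slide's figures read as USD per GB per month (with Regional taken at the lower end of its quoted range), and retrieval or access charges are ignored.

```python
# Slide prices, understood here as USD per GB per month (an assumption).
# "regional" uses the lower end of the quoted $0.026 - $0.02 range.
PRICE_PER_GB_MONTH = {"regional": 0.02, "nearline": 0.01, "coldline": 0.007}


def choose_storage_class(accesses_per_year: float) -> str:
    """Map how often a dataset is read to the class suggested on the slide."""
    if accesses_per_year > 12:   # more than once a month
        return "regional"
    if accesses_per_year > 1:    # at most once a month
        return "nearline"
    return "coldline"            # no more than once per year


def monthly_storage_cost(size_gb: float, storage_class: str) -> float:
    """Storage-only cost estimate; access charges are not modelled."""
    return size_gb * PRICE_PER_GB_MONTH[storage_class]
```

For example, a table queried weekly maps to "regional", while an archive read once a year maps to "coldline".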
Google Cloud Storage in Action
Simple Name Substitutions
Hive
hive> create table dataworksdemo (col_value STRING)
location 'gs://CONFIGBUCKET/dir/file';
hive> show create table dataworksdemo;
Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read a tab-separated file directly from GCS
df_testfile = (spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "\t")
    .option("inferSchema", "true")
    .load("gs://test_data_folder/test_tab_file.tsv"))
df_testfile.show()
Hadoop Shell
hadoop fs -ls gs://CONFIGBUCKET/dir/file
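The "simple name substitution" in the Hive, Spark, and Hadoop shell examples amounts to replacing the hdfs:// scheme and NameNode authority with gs:// and a bucket name. A minimal Python sketch of that rewrite follows; the helper name to_gcs_uri is hypothetical and is not part of the connector.

```python
from urllib.parse import urlsplit


def to_gcs_uri(hdfs_uri: str, bucket: str) -> str:
    """Rewrite hdfs://namenode:port/path as gs://bucket/path."""
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    # The NameNode host and port are dropped; the bucket takes their place.
    return f"gs://{bucket}{parts.path}"
```

A call such as to_gcs_uri("hdfs://namenode:8020/dir/file", "CONFIGBUCKET") yields the gs:// form used above.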
GCS is NOT the Hadoop File System
Getting Apache Hadoop to work with GCS
Google Cloud Storage Connector in HDP 2.6.5
• Place a single jar (gcs-shaded) in the Hadoop classpath, Tez tar, etc.
• GCS repo link: https://github.com/GoogleCloudPlatform/bigdata-interop
• Release 1.9 works with Hadoop 2.x and Hadoop 3.x
• Configs
• Credential distribution, and configuration in core-site
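The core-site configuration mentioned in the last bullet can be sketched as follows. This Python snippet renders a core-site.xml fragment. The property names are the ones the connector's 1.x documentation describes, as best I recall, so check them against the release in use. The function name, project ID, and key file path are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def core_site_fragment(project_id: str, keyfile_path: str) -> str:
    """Render GCS connector settings as a core-site.xml <configuration> element."""
    props = {
        # Bind the gs:// scheme to the connector's FileSystem classes.
        "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
        "fs.gs.project.id": project_id,
        # Service-account credentials; the key file must exist on every node.
        "google.cloud.auth.service.account.enable": "true",
        "google.cloud.auth.service.account.json.keyfile": keyfile_path,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")
```

Distributing the key file itself is the "credential distribution" step and is outside this sketch.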
[Diagram: clusters with HDFS and with no HDFS, each reaching Google Cloud Storage through the GCS Connector]
Hand off to Sid
Deployment Architectures
Typical On-Premise Deployment
Multi-User, Shared Hadoop Cluster
• Data (HDFS)
• Temp Data (HDFS)
• Metadata (Hive metastore, RDBMS)
• AuthZ Policies, Audit (Ranger, RDBMS)
• Compute: YARN (Hive, Spark, MR, etc.)
• AuthN (Kerberos, LDAP)
• Kafka, Storm, etc.
Typical On-Premise Deployment (continued)
[Diagram: Multi user, secure Hadoop Cluster. Labels: Data; Temp Data; Metadata; AuthZ, Policies, Audit; Compute; AuthN]
• Multi-tenant cluster
• Resources per org – YARN queues
• Data/Temp Data – HDFS
• Metadata, Policies etc – defined once, accessible to cluster users
Cloud Deployment Models
Simple Compute Only Clusters
[Diagram: Single User Ephemeral Cluster. Labels: Data (GCS); Temp Data (HDFS); Metadata; AuthZ, Policies, Audit; Compute]
• Optional Hadoop security
• Data – GCS
• No persistent metadata or policies
• Redefine metadata / policies for each new cluster
• Run select components only
• Can be scaled easily
• Ad-Hoc Jobs
Shared Services - Externalize Metadata
[Diagram: Single User Ephemeral Cluster. Labels: Data (GCS); Temp Data (HDFS); Metadata; AuthZ, Policies, Audit; Compute]
• Optional Hadoop Security
• Data – GCS
• Metadata – Cloud SQL
• No persistent policies
• Redefine policies for each new cluster
• Metadata available across cluster launches
• Metadata can be shared
• Run select components only
• Can be scaled easily
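Externalizing metadata typically means pointing each cluster's Hive metastore at a database that outlives the cluster. The Python sketch below builds the relevant hive-site properties, assuming a MySQL-flavoured Cloud SQL instance. The property keys are standard Hive settings. The function name, host, database, user, and bucket are hypothetical, and the password is deliberately left out.

```python
def metastore_properties(sql_host: str, database: str, user: str,
                         warehouse_bucket: str) -> dict:
    """hive-site settings for a metastore backed by an external MySQL database."""
    return {
        # JDBC connection to the external database (assumed to be Cloud SQL for MySQL).
        "javax.jdo.option.ConnectionURL": f"jdbc:mysql://{sql_host}:3306/{database}",
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": user,
        # Table data lives in GCS, so it also survives cluster teardown.
        "hive.metastore.warehouse.dir": f"gs://{warehouse_bucket}/warehouse",
    }
```

Every ephemeral cluster launched with the same properties sees the same table definitions, which is what makes the metadata shareable across launches.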
Shared Services – Externalize Policies, Audit
[Diagram: Single User Ephemeral Cluster. Labels: Data (GCS); Temp Data (HDFS); Metadata; Compute; AuthZ, Policies, Audit; AuthN]
• Hadoop Security
• Data – GCS
• Metadata – Cloud SQL
• AuthZ Policies, Audit – Shared cluster, Cloud SQL
• Audit Logs to GCS
• AuthZ Policies, Metadata available across cluster launches
• AuthZ Policies, Metadata can be shared
• Run select components only
• Can be scaled easily
Shared Services - Recap
[Diagram: Long Running, Highly Available, Shared Services Cluster. Labels: Data (GCS); Metadata (Hive MetaStore, Cloud SQL); AuthZ, Policies, Audit (Ranger); AuthN; Ephemeral Compute clusters; Long Running Compute clusters]
Shared Service Cluster
• Long Running
• Highly Available
• Shared by multiple ephemeral/long running Compute clusters
Advantages
• Agility
• Capacity on Demand
• New Software Versions
• Test Instances
• Shared Metadata
• Define Policies Once
Lift and Shift (Data on HDFS)
[Diagram: Multi user, secure Hadoop Cluster. Labels: Data (HDFS); Temp Data; Metadata; AuthZ, Policies, Audit; Compute; AuthN]
• Simple* to reason about
• Scaling – difficult since HDFS data is on all nodes
• Complex HDFS configuration
  • Persistent Disk – supports re-attach
  OR
  • Local SSDs – span zones in a region, model zones as racks
• Doesn’t provide much agility
AVOID THIS
Lift and Shift (Data on GCS) - Long Running
[Diagram: Multi user, secure Hadoop Cluster. Labels: Data (GCS); Temp Data (HDFS); Metadata; AuthZ, Policies, Audit; Compute; AuthN]
• Typically Highly Available Masters
• Scaling
  • Few HDFS nodes - easy
  • A single task on a node makes the node difficult to remove
• Data is persisted and available across restarts / additional clusters
Cloud Deployment Models
• Ephemeral Clusters – standalone
• Ephemeral Clusters – with Shared Metadata
• Ephemeral Clusters – with Shared Metadata and Security Policies
• Long Running Clusters (HDFS) – with or without Shared Services
• Long Running Clusters (GCS) – with or without Shared Services
Cluster Shape / Resources
[Diagram: Master Node(s); Compute + Temp HDFS nodes (at least 3); Compute Only Nodes; nodes divided between Preemptible VMs and Non Preemptible VMs]
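The cluster shape can be captured as a small data structure with sanity checks. The Python sketch below is illustrative only: the class and field names are hypothetical, and it assumes that only compute-only nodes run on preemptible VMs, which the slide's diagram suggests but does not state in text.

```python
from dataclasses import dataclass
from typing import List

MIN_HDFS_NODES = 3  # "at least 3" Compute + Temp HDFS nodes, per the slide


@dataclass
class ClusterShape:
    masters: int
    hdfs_workers: int             # Compute + Temp HDFS nodes
    compute_only_workers: int     # Compute Only Nodes
    preemptible_workers: int = 0  # assumed to be a subset of compute_only_workers

    def problems(self) -> List[str]:
        """Return a list of violated rules; an empty list means the shape is valid."""
        issues = []
        if self.masters < 1:
            issues.append("need at least one master node")
        if self.hdfs_workers < MIN_HDFS_NODES:
            issues.append(f"need at least {MIN_HDFS_NODES} Compute + Temp HDFS nodes")
        if self.preemptible_workers > self.compute_only_workers:
            issues.append("preemptible VMs should only back compute-only nodes")
        return issues
```

Keeping temporary HDFS off preemptible VMs means that losing a preempted node costs running tasks but not stored blocks.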
Resource Recommendations
Disk/Storage Configuration
• General
  • Disk performance increases with disk size
  • Attached disks count towards network bandwidth
• Shuffle Data (Compute)
  • Locally attached SSDs preferable, if the data is small enough (3 TB per node)
  • Configure VMs not to auto-migrate (the management tool would need to replace VMs)
  • Avoid the boot volume for shuffle data
  • Configure independent disks
• Temporary HDFS – a few Standard Persistent Disks should be sufficient
Disk/Storage Configuration (continued)
• LLAP Cache
  • Local SSDs will give the best performance
  • Combine multiple disks into a single volume
• GCS
  • Same Region as Compute Cluster preferred
  • Multiple Buckets can be accessed from the same cluster – plan data accordingly
Network Configuration
• General
• 2 Gbps per vcore, up to a maximum of 16 Gbps
• Persistent Disks count towards the network limit
• Standard Network Tier should be sufficient
• Query processing requires high bandwidth, but the traffic stays within the cluster
• Not much is sent out of the cluster
• Use private IP addresses for network traffic within the cluster
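The bandwidth rule above is simple arithmetic: 2 Gbps per vcore, capped at 16 Gbps, with Persistent Disk traffic drawn from the same budget. A minimal Python sketch follows, with hypothetical function names and the slide's 2018 figures hard-coded as constants.

```python
GBPS_PER_VCORE = 2   # per-vcore allowance quoted on the slide
MAX_GBPS = 16        # per-VM ceiling quoted on the slide


def network_cap_gbps(vcores: int) -> int:
    """Network bandwidth cap for a VM with the given number of vcores."""
    if vcores < 1:
        raise ValueError("a VM needs at least one vcore")
    return min(GBPS_PER_VCORE * vcores, MAX_GBPS)


def bandwidth_left_for_shuffle(vcores: int, persistent_disk_gbps: float) -> float:
    """Bandwidth remaining after Persistent Disk traffic, which shares the cap."""
    return max(network_cap_gbps(vcores) - persistent_disk_gbps, 0.0)
```

By this rule the cap stops growing at 8 vcores, so larger nodes gain compute but not network headroom.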
Node Size
• GCP is very flexible in terms of what is available
• Costs typically scale linearly with the number of vcores used
• High-memory nodes are available if required
• Dedicated instances are available if required
• Master nodes can be smaller than workers
• Keep in mind
  • Network bandwidth is linked to the number of vcores
  • The number of disks is tied to the number of vcores
Cluster Operations / Maintenance
• General
  • Monitoring and Logging for Hadoop components can be part of Shared Services
  • Google Stackdriver for everything else
  • Tagging to track users, organizations, etc - metadata
• Shared Service Clusters
  • Highly Available
  • Should be upgradable
• Ephemeral Clusters
  • Single Master
  • Tear down and spin up new cluster if upgrade required
Putting it all Together
• Cloudbreak-2.7 supports Shared services
• HDP-2.6.5 includes the Google connector
Useful Resources
• GCS Repo: https://github.com/GoogleCloudPlatform/bigdata-interop
• GCS-1.9 release available for Hadoop-2.x and Hadoop-3.x
• Cloudbreak Repo: https://github.com/hortonworks/cloudbreak
Questions

Running Apache Hadoop on the Google Cloud Platform

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved Reimagine Apache Hadoop on Google Cloud Platform Siddharth Seth (Hortonworks) Christopher Crosbie (Google)
  • 2. 2 © Hortonworks Inc. 2011–2018. All rights reserved Outline of the session 4 Hadoop on Google Cloud - Resource Recommendations 3 Hadoop on Google Cloud - Deployment Patterns 2 Under the hood: Cloud Storage vs HDFS 1 Google Cloud Storage Overview
  • 3.
  • 4.
  • 5. More than 60 Google Cloud Platform services. Compute: App Engine, Compute Engine, Container Engine, Container Registry, Cloud Functions. Networking: Cloud DNS, Virtual Private Cloud, Cloud Load Balancing, Cloud CDN, Cloud Interconnect, Cloud Router, VPN, Firewall, External IP. Big Data: BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab, Cloud Pub/Sub, Genomics. Storage and Databases: Cloud Bigtable, Cloud Storage, Cloud Datastore, Cloud SQL, Cloud Spanner, Persistent Disk. Identity & Security: Cloud IAM, Cloud Resource Manager, Cloud Security Scanner, BeyondCorp, Data Loss Prevention, Identity-Aware Proxy, Security Key Enforcement, Key Management Service. Machine Learning: Cloud Machine Learning, Cloud Vision API, Cloud Speech API, Cloud Natural Language API, Cloud Translation API, Cloud Jobs API.
  • 6. More than 60 Google Cloud Platform services (continued). Management Tools: Stackdriver Monitoring, Logging, Error Reporting, Trace, Debugger, Cloud Deployment Manager, Cloud Endpoints, Cloud Console. Developer Tools: Cloud SDK, Cloud Deployment Manager, Cloud Source Repositories, Cloud Tools for Android Studio, Cloud Tools for IntelliJ, Cloud Tools for PowerShell, Cloud Tools for Visual Studio, Google Plug-in for Eclipse, Cloud Test Lab, Cloud Shell, Cloud Mobile App, Cloud Billing API, Cloud APIs.
  • 8. Cloud Storage at Google Scale ● Unstructured object storage ● Stores exabytes of Google products’ data on the same backend (Google Docs, Photos, Gmail) ● Each of our large external customers downloads/uploads petabytes of data daily ● Each of them does billions of operations daily ● We have plenty of space and scale for your data
  • 9. Hadoop FileSystem Abstraction (org.apache.hadoop.fs.FileSystem). Why an abstraction for a distributed file system? • A file can be larger than any single disk in the network • Having the abstraction at block level simplifies the storage subsystem • A damaged block can be re-replicated from another copy
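The three points on slide 9 can be illustrated with a toy model. This is only a sketch of the block idea, not how HDFS or `org.apache.hadoop.fs.FileSystem` is implemented; `ToyBlockStore`, the 4-byte block size, and the round-robin placement are all hypothetical choices made for readability.

```python
from dataclasses import dataclass, field
from typing import Dict, List

BLOCK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut a file into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


@dataclass
class ToyBlockStore:
    """Hypothetical block store: each block is copied to `replication` disks."""
    disks: int
    replication: int = 2
    # placement[block_index] -> {disk_id: block bytes}
    placement: Dict[int, Dict[int, bytes]] = field(default_factory=dict)

    def write(self, data: bytes) -> None:
        # Round-robin placement, so a file can span more than one disk.
        for index, block in enumerate(split_into_blocks(data)):
            targets = [(index + r) % self.disks for r in range(self.replication)]
            self.placement[index] = {disk: block for disk in targets}

    def damage(self, index: int, disk: int) -> None:
        """Simulate losing one replica of one block."""
        del self.placement[index][disk]

    def repair(self, index: int) -> None:
        """Copy a surviving replica onto other disks until replication is restored."""
        replicas = self.placement[index]
        source_block = next(iter(replicas.values()))  # any surviving copy
        for disk in range(self.disks):
            if len(replicas) >= self.replication:
                break
            replicas.setdefault(disk, source_block)

    def read(self) -> bytes:
        """Reassemble the file from any one replica of each block."""
        return b"".join(
            next(iter(self.placement[i].values())) for i in sorted(self.placement)
        )
```

Because the unit of storage and repair is the block rather than the file, losing one replica only requires copying that one block again.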
  • 10. Google Cloud Storage alongside HDFS. Diagram: Google Cloud Storage serving Spark, Hive for Analysts, MapReduce ETL, Business Reporting, and Hive for IT.
  • 11. More Benefits of the Cloud Storage Connector. Benefits: direct data access; HDFS compatibility; cloud interoperability; data accessibility; high data availability; no storage management overhead; quick startup; compatibility with existing code; Google IAM security.
  • 12. Take Advantage of Storage Classes for Hadoop. Cloud storage for long term, less frequently accessed content. Cloud storage for use cases that don't require high availability. Regional Storage (cost $0.026 - $0.02): universal cloud storage for any workload; can be Multi-Regional or Regional; use for interactive Hive/Spark analysis or batch jobs that occur more than once a month. Nearline Storage (cost $0.01): use for batch jobs that only need the data in historical reporting/aggregations (at most once a month). Coldline Storage (cost $0.007): use for post-processed data that you don’t expect to use again (no more than once per year).
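The access-frequency guidance on slide 12 can be written down as a small rule. This is a rough sketch: the function names are hypothetical, the prices are the slide's 2018 figures, and the assumptions that they are US dollars per GB per month and that the lower end of the first range ($0.02) is the Regional price are mine, not stated on the slide.

```python
# Slide figures, ASSUMED to be USD per GB per month (the slide gives no unit).
PRICE_PER_GB_MONTH = {
    "Regional": 0.02,   # assumption: lower end of the "$0.026 - $0.02" range
    "Nearline": 0.01,
    "Coldline": 0.007,
}


def suggest_storage_class(reads_per_year: float) -> str:
    """Map how often a job reads the data to the class the slide suggests."""
    if reads_per_year > 12:   # more than once a month
        return "Regional"
    if reads_per_year > 1:    # at most once a month
        return "Nearline"
    return "Coldline"         # no more than once per year


def monthly_storage_cost(storage_class: str, size_gb: float) -> float:
    """Storage-only cost; retrieval and operation charges are not modelled."""
    return PRICE_PER_GB_MONTH[storage_class] * size_gb
```

For example, a dataset read by a weekly Hive job (about 52 reads a year) maps to Regional, while a quarterly aggregation (4 reads a year) maps to Nearline.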
  • 13. Google Cloud Storage in Action: Simple Name Substitutions
  • 14. Google Cloud Storage in Action: Simple Name Substitutions. Hive: hive> create table dataworksdemo (col_value STRING) location 'gs://CONFIGBUCKET/dir/file'; hive> show create table dataworksdemo; Spark: sc = SparkContext(conf=conf); df_testfile = spark.read.format("csv").option("header", "true").option("delimiter", "\t").option("inferSchema", "true").load("gs://test_data_folder/test_tab_file.tsv"); df_testfile.show() Hadoop Shell: hadoop fs -ls gs://CONFIGBUCKET/dir/file
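"Simple name substitution" means that the only change in each example is the URI scheme and authority: an `hdfs://` location becomes a `gs://BUCKET` location with the same path. A minimal sketch of that rewrite follows; `to_gcs_uri` is a hypothetical helper, not part of the connector, and it assumes the data has already been copied to the bucket under the same path.

```python
from urllib.parse import urlsplit


def to_gcs_uri(hdfs_uri: str, bucket: str) -> str:
    """Rewrite hdfs://<namenode>/<path> as gs://<bucket>/<path>.

    Hypothetical helper: only the name is rewritten; no data is moved.
    """
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    return f"gs://{bucket}{parts.path}"
```

With the slide's placeholder bucket, `to_gcs_uri("hdfs://namenode:8020/dir/file", "CONFIGBUCKET")` yields the `gs://CONFIGBUCKET/dir/file` location used in the Hive and Hadoop shell examples.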
  • 15. Google Cloud Storage in Action: GCS is NOT the Hadoop File System
  • 16. Getting Apache Hadoop to work with GCS • Place a single jar (gcs-shaded) in the Hadoop classpath, Tez tar, etc. • GCS repo link: https://github.com/GoogleCloudPlatform/bigdata-interop • Release 1.9 works with Hadoop 2.x and Hadoop 3.x • Configs: credential distribution, and configuration in core-site
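The "configuration in core-site" step on slide 16 could look roughly like the sketch below, which renders a set of properties into `core-site.xml` form. The property names are the ones I recall from the connector's 1.x install notes and should be checked against the release actually deployed; the project id and keyfile path are placeholders, and `render_core_site` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Property names as recalled from the GCS connector 1.x install notes;
# verify them against the connector release you deploy.
GCS_PROPERTIES: Dict[str, str] = {
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    "fs.gs.project.id": "my-project",  # placeholder project id
    "google.cloud.auth.service.account.enable": "true",
    # Placeholder path; the key file must exist on every node.
    "google.cloud.auth.service.account.json.keyfile": "/etc/hadoop/conf/gcs-key.json",
}


def render_core_site(properties: Dict[str, str]) -> str:
    """Render properties as a Hadoop <configuration> XML document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")
```

"Credential distribution" then amounts to getting the referenced key file onto each node, which is why the keyfile property points at a local path.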
  • 17. Google Cloud Storage Connector in HDP 2.6.5. Diagram: several clusters, some with HDFS and some with no HDFS, each reaching Google Cloud Storage through a GCS Connector.
  • 18. Hand off to Sid
  • 19. Deployment Architectures
  • 20. Typical On-Premise Deployment: Multi-User, Shared Hadoop Cluster • Data (HDFS) • Temp Data (HDFS) • Metadata (Hive metastore, RDBMS) • AuthZ Policies, Audit (Ranger, RDBMS) • Compute: YARN, Hive, Spark, MR, etc. • AuthN: Kerberos, LDAP • Kafka, Storm, etc.
  • 21. Typical On-Premise Deployment (continued): Multi user, secure Hadoop Cluster. Components: Data, Temp Data, Metadata, AuthZ/Policies/Audit, Compute, AuthN • Multi-tenant cluster • Resources per org – YARN queues • Data/Temp Data – HDFS • Metadata, Policies etc – defined once, accessible to cluster users
  • 22. Cloud Deployment Models
  • 23. Simple Compute Only Clusters (Single User Ephemeral Cluster) • Optional Hadoop security • Data – GCS • No persistent metadata or policies • Redefine metadata / policies for each new cluster • Run select components only • Can be scaled easily • Ad-Hoc Jobs. Diagram: Data (GCS), Temp Data (HDFS), Metadata, AuthZ/Policies/Audit, Compute.
  • 24. Shared Services - Externalize Metadata (Single User Ephemeral Cluster plus Shared Services) • Optional Hadoop Security • Data – GCS • Metadata – Cloud SQL • No persistent policies • Redefine policies for each new cluster • Metadata available across cluster launches • Metadata can be shared • Run select components only • Can be scaled easily. Diagram: Data (GCS), Temp Data (HDFS), Metadata, AuthZ/Policies/Audit, Compute.
  • 25. Shared Services – Externalize Policies, Audit (Single User Ephemeral Cluster plus Shared Services) • Hadoop Security • Data – GCS • Metadata – Cloud SQL • AuthZ Policies, Audit – Shared cluster, Cloud SQL • Audit Logs to GCS • AuthZ Policies, Metadata available across cluster launches • AuthZ Policies, Metadata can be shared • Run select components only • Can be scaled easily. Diagram: Data (GCS), Temp Data (HDFS), Metadata, Compute, AuthZ/Policies/Audit, AuthN.
  • 26. Shared Services - Recap. Diagram: a Long Running, Highly Available, Shared Services Cluster holding Data (GCS), Metadata (Hive MetaStore, Cloud SQL), AuthZ/Policies/Audit (Ranger), and AuthN, used by several Ephemeral Compute and Long Running Compute clusters. Shared Service Cluster • Long Running • Highly Available • Shared by multiple ephemeral/long running Compute clusters. Advantages • Agility • Capacity on Demand • New Software Versions • Test Instances • Shared Metadata • Define Policies Once
  • 27. Lift and Shift (Data on HDFS): Multi user, secure Hadoop Cluster. Components: Data (HDFS), Temp Data, Metadata, AuthZ/Policies/Audit, Compute, AuthN • Simple* to reason about • Scaling – difficult since HDFS data is on all nodes • Complex HDFS configuration: Persistent Disk – support re-attach, OR Local SSDs - span zones in a region, model zones as racks • Doesn’t provide much agility • AVOID THIS
  • 28. Lift and Shift (Data on GCS) - Long Running: Multi user, secure Hadoop Cluster. Components: Data (GCS), Temp Data (HDFS), Metadata, AuthZ/Policies/Audit, Compute, AuthN • Typically Highly Available Masters • Scaling: few HDFS nodes - easy; a single task on a node makes it difficult to remove • Data is persisted and available across restarts / additional clusters
  • 29. Cloud Deployment Models • Ephemeral Clusters – standalone • Ephemeral Clusters – with Shared Metadata • Ephemeral Clusters – with Shared Metadata and Security Policies • Long Running Clusters (HDFS) – with or without Shared Services • Long Running Clusters (GCS) – with or without Shared Services
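The three ephemeral models summarised on slide 29 differ only in where metadata and policies live, which determines what has to be redefined each time a cluster is launched. A small sketch of that comparison, built from slides 23 to 25; the class, the model names, and the `"per-cluster"` marker are hypothetical labels rather than Cloudbreak or HDP terms.

```python
from dataclasses import dataclass
from typing import List

PER_CLUSTER = "per-cluster"  # hypothetical marker: state is lost at teardown


@dataclass(frozen=True)
class DeploymentModel:
    name: str
    data: str       # where table data lives
    metadata: str   # where the Hive metastore database lives
    policies: str   # where AuthZ policies and audit live


MODELS = [
    DeploymentModel("ephemeral-standalone", "GCS", PER_CLUSTER, PER_CLUSTER),
    DeploymentModel("ephemeral-shared-metadata", "GCS", "Cloud SQL", PER_CLUSTER),
    DeploymentModel(
        "ephemeral-shared-metadata-and-policies",
        "GCS", "Cloud SQL", "shared services cluster",
    ),
]


def must_redefine_on_launch(model: DeploymentModel) -> List[str]:
    """List the concerns that need to be set up again for every new cluster."""
    redefine = []
    if model.metadata == PER_CLUSTER:
        redefine.append("metadata")
    if model.policies == PER_CLUSTER:
        redefine.append("policies")
    return redefine
```

In all three models the data itself stays on GCS, so tearing a cluster down never loses it; only the surrounding metadata and policies vary.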
  • 30. Cluster Shape / Resources. Diagram: Master Node(s); Compute + Temp HDFS nodes (at least 3); Compute Only Nodes. Legend: Preemptible VMs, Non Preemptible VMs.
  • 31. Resource Recommendations
  • 32. Disk/Storage Configuration • General: disk performance increases with disk size; attached disks count towards network bandwidth • Shuffle Data (Compute): locally attached SSDs preferable if the data is small enough (3 TB per node); configure VMs not to auto-migrate (the management tool would need to replace VMs); avoid the boot volume for shuffle data; configure independent disks • Temporary HDFS: a few Standard Persistent Disks should be sufficient
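The "3 TB per node" limit on slide 32 matches eight local SSDs of 375 GB each, which is what a single VM could attach at the time of the talk. Under that assumption (both constants are mine, not from the slide, and current limits may differ), a rough sizing check looks like this; `local_ssds_needed` is a hypothetical helper.

```python
import math
from typing import Optional

LOCAL_SSD_GB = 375           # assumed size of one local SSD device
MAX_LOCAL_SSDS_PER_VM = 8    # assumed per-VM limit; 8 * 375 GB = 3 TB


def local_ssds_needed(shuffle_gb_per_node: float) -> Optional[int]:
    """Return how many local SSDs hold the shuffle data, or None if it won't fit."""
    count = max(1, math.ceil(shuffle_gb_per_node / LOCAL_SSD_GB))
    if count > MAX_LOCAL_SSDS_PER_VM:
        return None  # too large for local SSD; fall back to other disks
    return count
```

A `None` result corresponds to the slide's "if data is small enough" caveat: beyond the per-node limit, shuffle data has to go on other storage or be spread over more nodes.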
  • 33. Disk/Storage Configuration (continued) • LLAP Cache: local SSDs will give the best performance; combine multiple disks into a single volume • GCS: same region as the compute cluster preferred; multiple buckets can be accessed from the same cluster, so plan data accordingly
  • 34. Network Configuration • General: 2 Gbps per vcore, maxes out at 16 Gbps; Persistent Disks count towards the network limit • Standard Network Tier should be sufficient: query processing requires high bandwidth, but limited to within the cluster; not a lot is sent out of the cluster • Use private IP addresses for within-cluster network traffic
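The bandwidth rule on slide 34 is simple arithmetic: 2 Gbps per vcore, capped at 16 Gbps, with Persistent Disk traffic drawn from the same budget. A minimal sketch using the slide's 2018 figures; the function names are hypothetical and the Persistent Disk throughput is supplied by the caller rather than modelled.

```python
GBPS_PER_VCORE = 2.0   # figure from the slide
MAX_GBPS = 16.0        # cap from the slide


def network_cap_gbps(vcores: int) -> float:
    """Per-VM network cap: 2 Gbps per vcore, at most 16 Gbps."""
    return min(GBPS_PER_VCORE * vcores, MAX_GBPS)


def bandwidth_left_for_shuffle(vcores: int, persistent_disk_gbps: float) -> float:
    """Bandwidth remaining after Persistent Disk I/O, which shares the cap."""
    return max(0.0, network_cap_gbps(vcores) - persistent_disk_gbps)
```

The cap is reached at 8 vcores, so a larger node gains CPU and memory but no additional network bandwidth under these figures.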
  • 35. Node Size • GCP is very flexible in terms of what is available • Costs typically scale linearly with the vcores used • High-memory nodes available if required • Dedicated instances available if required • Master nodes can be smaller than workers • Keep in mind: network bandwidth is linked to the number of vcores; the number of disks is associated with the number of vcores
  • 36. Cluster Operations / Maintenance • General: monitoring and logging for Hadoop components can be part of Shared Services; Google Stackdriver for everything else; tagging to track users, organizations, etc. (metadata) • Shared Service Clusters: highly available; should be upgradable • Ephemeral Clusters: single master; tear down and spin up a new cluster if an upgrade is required
  • 37. Putting it all Together • Cloudbreak-2.7 supports Shared Services • HDP-2.6.5 includes the Google connector
  • 38. Useful Resources • GCS Repo: https://github.com/GoogleCloudPlatform/bigdata-interop • GCS-1.9 release available for Hadoop-2.x and Hadoop-3.x • Cloudbreak Repo: https://github.com/hortonworks/cloudbreak
  • 39. Questions