This talk will cover various aspects of running Apache Hadoop and ecosystem projects on cloud platforms, with a focus on the Google Cloud Platform (GCP). We will compare HDFS with cloud-based object storage services for storing unstructured data. We will look under the hood of the Google Cloud Storage (GCS) Connector to better understand how cloud connectors implement the Hadoop file system interface, which allows them to plug easily into Apache Hive, Apache Spark, and other Hadoop ecosystem components.
These cloud storage connectors are key to freeing Apache Hadoop deployments from data locality restrictions, enabling scale-out and freedom from monolithic clusters. However, cloud object stores are not file systems, and this can cause challenges for organizations as they migrate to the cloud. This talk will discuss alternative deployment architectures for running Apache Hadoop and ecosystem projects in the cloud that work well with cloud storage and cloud security, and that take advantage of the agility that moving to the cloud brings. SIDDHARTH SETH, Principal Software Engineer, Hortonworks, and CHRISTOPHER CROSBIE, Cloud Partner Engineering, Google
5. Compute
● Compute: App Engine, Compute Engine, Container Engine, Container Registry, Cloud Functions
● Networking: Cloud DNS, Virtual Private Cloud, Cloud Load Balancing, Cloud CDN, Cloud Interconnect, Cloud Router, VPN, Firewall, External IP
● Big Data: BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab, Cloud Pub/Sub, Genomics
● Storage and Databases: Cloud Bigtable, Cloud Storage, Cloud Datastore, Cloud SQL, Cloud Spanner, Persistent Disk
● Identity & Security: Cloud IAM, Cloud Resource Manager, Cloud Security Scanner, BeyondCorp, Data Loss Prevention, Identity-Aware Proxy, Security Key Enforcement, Key Management Service
● Machine Learning: Cloud Machine Learning, Cloud Vision API, Cloud Speech API, Cloud Natural Language API, Cloud Translation API, Cloud Jobs API
More than 60 Google Cloud Platform services
6. Management Tools
● Management Tools: Stackdriver (Monitoring, Logging, Error Reporting, Trace, Debugger), Cloud Deployment Manager, Cloud Endpoints, Cloud Console
● Developer Tools: Cloud SDK, Cloud Source Repositories, Cloud Tools for Android Studio, Cloud Tools for IntelliJ, Cloud Tools for PowerShell, Cloud Tools for Visual Studio, Google Plug-in for Eclipse, Cloud Test Lab, Cloud Shell, Cloud Mobile App, Cloud Billing API, Cloud APIs
8. Cloud Storage at Google Scale
● Unstructured object storage
● Stores exabytes of Google products’ data on the same backend (Google Docs, Photos, Gmail)
● Each of our large external customers downloads/uploads petabytes of data daily
● Each of them performs billions of operations daily
● We have plenty of space and scale for your data
9. Hadoop FileSystem Abstraction
org.apache.hadoop.fs.FileSystem
Why an abstraction for the distributed file system?
• A file can be larger than any single disk in the network
• Having the abstraction at the block level simplifies the storage subsystem
• A damaged block can be re-replicated from another copy
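In Hadoop itself the abstraction is the Java class org.apache.hadoop.fs.FileSystem, with backends selected by URI scheme. The following Python sketch (all class and function names hypothetical, chosen only for illustration) shows the same idea: client code targets one interface, and a scheme-to-implementation registry picks the backend, which is what lets a connector like the GCS Connector slot in behind existing jobs.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse

class FileSystem(ABC):
    """Minimal stand-in for Hadoop's FileSystem abstraction."""
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class InMemoryFS(FileSystem):
    """Toy backend; in Hadoop, HDFS and the GCS connector are real ones."""
    def __init__(self):
        self._store = {}
    def read(self, path):
        return self._store[path]
    def write(self, path, data):
        self._store[path] = data

# Registry keyed by URI scheme, analogous to Hadoop's fs.<scheme>.impl config.
_REGISTRY = {"hdfs": InMemoryFS(), "gs": InMemoryFS()}

def get_filesystem(uri: str) -> FileSystem:
    """Resolve a backend from the URI scheme, as FileSystem.get() does in Java."""
    return _REGISTRY[urlparse(uri).scheme]

# The same client code works against either backend; only the URI changes.
fs = get_filesystem("gs://bucket/data.txt")
fs.write("gs://bucket/data.txt", b"hello")
```

Because the caller never names a concrete backend, swapping hdfs:// for gs:// paths requires no code changes, which is the property the connector relies on.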
10. Google Cloud Storage alongside HDFS
[Diagram: Spark, Hive for Analysts, Hive for IT, and MapReduce ETL / business reporting workloads all sharing Google Cloud Storage as a common data layer]
11. More Benefits of the Cloud Storage Connector
● Direct data access
● HDFS compatibility
● Cloud interoperability
● Data accessibility
● High data availability
● No storage management overhead
● Quick startup
● Compatibility with existing code
● Google IAM security
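HDFS compatibility in practice comes down to registering the connector as the handler for the gs:// scheme in Hadoop's configuration. A typical core-site.xml entry looks roughly like the following (property names as used by the open-source GCS connector; verify them against the connector version you deploy):

```xml
<!-- core-site.xml: register the GCS connector for the gs:// scheme -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
```

With this in place, existing Hive and Spark jobs can be pointed at gs:// paths instead of hdfs:// paths without code changes.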
12. Take Advantage of Storage Classes for Hadoop
● Regional / Multi-Regional Storage — $0.026 - $0.02 per GB/month. Universal cloud storage for any workload; Regional suits use cases that don't require high availability. Use for interactive Hive/Spark analysis or batch jobs that run more than once a month.
● Nearline Storage — $0.01 per GB/month. Cloud storage for long-term, less frequently accessed content. Use for batch jobs that only need the data in historical reporting/aggregations (at most once a month).
● Coldline Storage — $0.007 per GB/month. Use for post-processed data you don't expect to use again (no more than once per year).
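The per-class rates above can be turned into a rough monthly storage cost estimate. The sketch below assumes the slide's prices are per GB per month and deliberately ignores retrieval, early-deletion, and operation fees, which can dominate for Nearline and Coldline if the data is read often:

```python
# Rough monthly storage cost by class, using the slide's per-GB/month rates.
# Storage-only: retrieval, early-deletion, and operation charges are ignored.
RATES_PER_GB_MONTH = {
    "multi_regional": 0.026,
    "regional": 0.02,
    "nearline": 0.01,
    "coldline": 0.007,
}

def monthly_storage_cost(gb: float, storage_class: str) -> float:
    """Storage-only cost in USD for one month of holding `gb` gigabytes."""
    return gb * RATES_PER_GB_MONTH[storage_class]

# e.g. 10 TB of post-processed historical data:
print(round(monthly_storage_cost(10_000, "coldline"), 2))  # 70.0
print(round(monthly_storage_cost(10_000, "regional"), 2))  # 200.0
```

A comparison like this makes the trade-off concrete: keeping rarely-read batch output in Coldline rather than Regional cuts the storage bill by roughly two thirds, at the cost of higher access charges.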