This talk will cover various aspects of running Apache Hadoop and ecosystem projects on cloud platforms, with a focus on the Google Cloud Platform (GCP). We will compare HDFS with cloud-based object storage services for storing unstructured data. We will look under the hood of the Google Cloud Storage (GCS) Connector to better understand how cloud connectors implement the Hadoop file system interface, which allows them to plug easily into Apache Hive, Apache Spark, and other Hadoop ecosystem components.
These cloud storage connectors are key to freeing Apache Hadoop deployments from data locality restrictions, enabling scale-out and freedom from monolithic clusters. However, cloud object stores are not file systems, and this can cause challenges for organizations as they migrate to the cloud. This talk will discuss alternative deployment architectures for running Apache Hadoop and ecosystem projects in the cloud that work well with cloud storage and cloud security, and that take advantage of the agility that moving to the cloud brings. SIDDHARTH SETH, Principal Software Engineer, Hortonworks, and CHRISTOPHER CROSBIE, Cloud Partner Engineering, Google
5. Compute
● Compute: App Engine, Compute Engine, Container Engine, Container Registry, Cloud Functions
● Networking: Cloud DNS, Virtual Private Cloud, Cloud Load Balancing, Cloud CDN, Cloud Interconnect, Cloud Router, VPN, Firewall, External IP
● Big Data: BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab, Cloud Pub/Sub, Genomics
● Storage and Databases: Cloud Bigtable, Cloud Storage, Cloud Datastore, Cloud SQL, Cloud Spanner, Persistent Disk
● Identity & Security: Cloud IAM, Cloud Resource Manager, Cloud Security Scanner, BeyondCorp, Data Loss Prevention, Identity-Aware Proxy, Security Key Enforcement, Key Management Service
● Machine Learning: Cloud Machine Learning, Cloud Vision API, Cloud Speech API, Cloud Natural Language API, Cloud Translation API, Cloud Jobs API
More than 60 Google Cloud Platform services
6. Management Tools
● Management Tools: Stackdriver (Monitoring, Logging, Error Reporting, Trace, Debugger), Cloud Deployment Manager, Cloud Endpoints, Cloud Console
● Developer Tools: Cloud SDK, Cloud Source Repositories, Cloud Tools for Android Studio, Cloud Tools for IntelliJ, Cloud Tools for PowerShell, Cloud Tools for Visual Studio, Google Plug-in for Eclipse, Cloud Test Lab, Cloud Shell, Cloud Mobile App, Cloud Billing API, Cloud APIs
8. Cloud Storage at Google Scale
● Unstructured object storage
● Stores exabytes of Google products’ data on the same backend (Google Docs, Photos, Gmail)
● Each of our large external customers downloads/uploads petabytes of data daily
● Each of them performs billions of operations daily
● We have plenty of space and scale for your data
9. Hadoop FileSystem Abstraction
org.apache.hadoop.fs.FileSystem
Why an abstraction for the distributed file system?
• A file can be larger than any single disk in the network
• Having the abstraction at the block level simplifies the storage subsystem
• A damaged block can be re-replicated from another copy
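In Hadoop itself the abstraction is the Java class org.apache.hadoop.fs.FileSystem, with backends selected by URI scheme. The following Python sketch (all class and function names hypothetical, chosen only for illustration) shows the same idea: client code targets one interface, and a scheme-to-implementation registry picks the backend, which is what lets a connector like the GCS Connector slot in behind existing jobs.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse

class FileSystem(ABC):
    """Minimal stand-in for Hadoop's FileSystem abstraction."""
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class InMemoryFS(FileSystem):
    """Toy backend; in Hadoop, HDFS and the GCS connector are real ones."""
    def __init__(self):
        self._store = {}
    def read(self, path):
        return self._store[path]
    def write(self, path, data):
        self._store[path] = data

# Registry keyed by URI scheme, analogous to Hadoop's fs.<scheme>.impl config.
_REGISTRY = {"hdfs": InMemoryFS(), "gs": InMemoryFS()}

def get_filesystem(uri: str) -> FileSystem:
    """Resolve a backend from the URI scheme, as FileSystem.get() does in Java."""
    return _REGISTRY[urlparse(uri).scheme]

# The same client code works against either backend; only the URI changes.
fs = get_filesystem("gs://bucket/data.txt")
fs.write("gs://bucket/data.txt", b"hello")
```

Because the caller never names a concrete backend, swapping hdfs:// for gs:// paths requires no code changes, which is the property the connector relies on.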
10. Google Cloud Storage alongside HDFS
[Diagram: Spark, Hive for Analysts, Hive for IT, and MapReduce ETL / business reporting workloads all sharing Google Cloud Storage as a common data layer]
11. More Benefits of the Cloud Storage Connector
● Direct data access
● HDFS compatibility
● Cloud interoperability
● Data accessibility
● High data availability
● No storage management overhead
● Quick startup
● Compatibility with existing code
● Google IAM security
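HDFS compatibility in practice comes down to registering the connector as the handler for the gs:// scheme in Hadoop's configuration. A typical core-site.xml entry looks roughly like the following (property names as used by the open-source GCS connector; verify them against the connector version you deploy):

```xml
<!-- core-site.xml: register the GCS connector for the gs:// scheme -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
```

With this in place, existing Hive and Spark jobs can be pointed at gs:// paths instead of hdfs:// paths without code changes.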
12. Take Advantage of Storage Classes for Hadoop
● Regional / Multi-Regional Storage — $0.026 - $0.02 per GB/month. Universal cloud storage for any workload; Regional suits use cases that don't require high availability. Use for interactive Hive/Spark analysis or batch jobs that run more than once a month.
● Nearline Storage — $0.01 per GB/month. Cloud storage for long-term, less frequently accessed content. Use for batch jobs that only need the data in historical reporting/aggregations (at most once a month).
● Coldline Storage — $0.007 per GB/month. Use for post-processed data you don't expect to use again (no more than once per year).
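The per-class rates above can be turned into a rough monthly storage cost estimate. The sketch below assumes the slide's prices are per GB per month and deliberately ignores retrieval, early-deletion, and operation fees, which can dominate for Nearline and Coldline if the data is read often:

```python
# Rough monthly storage cost by class, using the slide's per-GB/month rates.
# Storage-only: retrieval, early-deletion, and operation charges are ignored.
RATES_PER_GB_MONTH = {
    "multi_regional": 0.026,
    "regional": 0.02,
    "nearline": 0.01,
    "coldline": 0.007,
}

def monthly_storage_cost(gb: float, storage_class: str) -> float:
    """Storage-only cost in USD for one month of holding `gb` gigabytes."""
    return gb * RATES_PER_GB_MONTH[storage_class]

# e.g. 10 TB of post-processed historical data:
print(round(monthly_storage_cost(10_000, "coldline"), 2))  # 70.0
print(round(monthly_storage_cost(10_000, "regional"), 2))  # 200.0
```

A comparison like this makes the trade-off concrete: keeping rarely-read batch output in Coldline rather than Regional cuts the storage bill by roughly two thirds, at the cost of higher access charges.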