Bursting Apache Spark Workloads to the Cloud on Remote Data

Office Hour: Bursting Apache Spark Workloads to the Cloud on Remote Data
2020/03/10 Office Hour
Bin Fan | Founding Engineer | Alluxio

Big data journey & innovation for enterprises
▪ Co-located: compute & HDFS on the same cluster (MR / Hive on HDFS)
▪ Disaggregated: compute & HDFS on the same cluster (Hive on HDFS)
▪ Disaggregated: burst HDFS data in the cloud, public or private (HDFS for Hybrid Cloud)
▪ Enable & accelerate access to big data across data centers
▪ Support analytics across data centers

Challenge: Data Gets Increasingly Remote from Compute
▪ Challenging Scenarios
▪ Data-driven initiatives in need of more compute
▪ Hadoop system on-prem, but it’s remote
▪ Object data growth in a cloud region, but it’s remote
▪ How to make remote data local to the compute without copies?
▪ Business benefits
▪ Data immediately available for quicker data-driven insights
▪ More cloud computing power to solve problems quicker
▪ Up to 80% lower egress costs

Solution: “Zero-copy” bursting to scale to the cloud
▪ Accelerate big data frameworks on the public cloud: Spark and Alluxio in the same instance / container
▪ Burst big data workloads in hybrid cloud environments: Spark and Alluxio in the same instance / container, with data on premise
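One way to picture the "zero-copy" idea is as a read-through cache sitting next to the compute: the first read of a file goes to the remote store, later reads are served locally, and nothing is copied ahead of time. The sketch below is a toy model in plain Python, not Alluxio code. `RemoteStore`, `ReadThroughCache`, and the file path are hypothetical names invented for illustration.

```python
class RemoteStore:
    """Stand-in for on-premise HDFS reached over a WAN link (hypothetical)."""

    def __init__(self, files):
        self._files = dict(files)
        self.remote_reads = 0  # counts how often the slow path is taken

    def read(self, path):
        self.remote_reads += 1
        return self._files[path]


class ReadThroughCache:
    """Serves reads locally when possible, fetching from remote on a miss."""

    def __init__(self, remote):
        self._remote = remote
        self._local = {}

    def read(self, path):
        if path not in self._local:
            # Cache miss: fetch on demand; no bulk copy was made beforehand.
            self._local[path] = self._remote.read(path)
        return self._local[path]


remote = RemoteStore({"/sales/2020-03-10.parquet": b"row-data"})
cache = ReadThroughCache(remote)

first = cache.read("/sales/2020-03-10.parquet")   # goes to the remote store
second = cache.read("/sales/2020-03-10.parquet")  # served from the local cache
print(remote.remote_reads)
```

Only the files a job actually touches cross the network, which is the sense in which the deck later speaks of a working set rather than the full data set.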

Alluxio is Open-Source Data Orchestration
Data Orchestration for the Cloud
Client interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Storage drivers: HDFS Driver, GCS Driver, S3 Driver, Azure Driver

Zero-Copy Burst: View the I/O Stack
FAST: 10^4 - 10^5 MB/s
MODERATE: 10^3 - 10^4 MB/s
SLOW: 10 - 10^3 MB/s
Access frequency labels: Often, Limited, Only when necessary
Storage media: Mem, SSD, HDD
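To get a feel for what these throughput bands mean in practice, the following back-of-the-envelope sketch computes how long a full scan would take in each band. It reads the slide's figures as 10^4 to 10^5, 10^3 to 10^4, and 10 to 10^3 MB/s. The 1 TB data set size is an assumption chosen only for illustration.

```python
# Throughput ranges in MB/s, taken from the slide's FAST / MODERATE / SLOW labels.
TIERS = {
    "FAST": (10**4, 10**5),
    "MODERATE": (10**3, 10**4),
    "SLOW": (10, 10**3),
}


def scan_seconds(dataset_mb, throughput_mb_s):
    """Time to read dataset_mb megabytes at a steady throughput."""
    return dataset_mb / throughput_mb_s


DATASET_MB = 10**6  # assumed 1 TB data set, using 1 TB = 10**6 MB

# For each band: (best case at the high end, worst case at the low end).
scan_times = {
    name: (scan_seconds(DATASET_MB, high), scan_seconds(DATASET_MB, low))
    for name, (low, high) in TIERS.items()
}

for name, (best, worst) in scan_times.items():
    print(f"{name}: {best:,.0f} s to {worst:,.0f} s")
```

Under these assumptions the same scan ranges from seconds in the fast band to many hours at the low end of the slow band, which is why the slow path is best reserved for when it is necessary.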

The Alluxio Story
2013: Originated as the Tachyon project at UC Berkeley AMPLab by Ph.D. student Haoyuan (H.Y.) Li, now Alluxio CTO
2015: Open source project established & company to commercialize Alluxio founded
Goal: Orchestrate data at memory speed for the cloud, for data-driven apps such as big data analytics, ML, and AI
2018, 2019
2019: Top 10 Big Data
2019: Top 10 Cloud Software

Fast-growing Open Source Community
4000+ GitHub Stars
1000+ Contributors
Join the community on Slack
(FAQ for this office hour)
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio

Alluxio – Key innovations
▪ Data Locality with Intelligent Multi-tiering: Accelerate big data workloads with transparent tiered local data
▪ Data Accessibility for popular APIs & API translation: Run Spark, Hive, Presto, ML workloads on your data located anywhere
▪ Data Elasticity with a unified namespace: Abstract data silos & storage systems to independently scale data on-demand with compute

Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot: RAM | Warm: SSD | Cold: HDD
Read & Write Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
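The policies named above can be illustrated with a small toy model: data enters the fastest tier, the least recently used unpinned entry is demoted when a tier is full, and any access promotes the entry back to the top. This is a sketch under simplifying assumptions, not Alluxio's implementation. `TieredStore` and its methods are hypothetical, capacities are counted in entries rather than bytes, and TTL is left out for brevity.

```python
from collections import OrderedDict


class TieredStore:
    """Toy multi-tier cache: tier 0 is fastest. Not Alluxio's implementation."""

    def __init__(self, capacities):
        # One LRU-ordered dict per tier, e.g. capacities = [2, 2] entries.
        self.capacities = capacities
        self.tiers = [OrderedDict() for _ in capacities]
        self.pinned = set()

    def pin(self, key):
        """Pinned keys are never chosen as demotion victims."""
        self.pinned.add(key)

    def tier_of(self, key):
        for level, tier in enumerate(self.tiers):
            if key in tier:
                return level
        return None

    def put(self, key, value, level=0):
        if level >= len(self.tiers):
            return  # fell off the last tier: evicted from the cache
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        while len(tier) > self.capacities[level]:
            # Demotion: push the least recently used unpinned entry down a tier.
            victim = next((k for k in tier if k not in self.pinned), None)
            if victim is None:
                break  # everything here is pinned; allow temporary overflow
            self.put(victim, tier.pop(victim), level + 1)

    def get(self, key):
        level = self.tier_of(key)
        if level is None:
            return None
        value = self.tiers[level].pop(key)
        self.put(key, value, 0)  # promotion: accessed data moves to the top tier
        return value


store = TieredStore([2, 2])
store.pin("a")
for k in "abc":
    store.put(k, k.upper())

demoted_level = store.tier_of("b")  # "b" was demoted because "a" is pinned
value = store.get("b")              # reading "b" promotes it and demotes "c"
print(demoted_level, value, store.tier_of("b"), store.tier_of("c"))
```

The application only calls `get` and `put`; which tier holds the data is decided underneath, which is the sense of "Transparent to App" on the slide.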

Data Accessibility via popular APIs and API Translation
Convert from Client-side Interface to native Storage Interface
Client interfaces: Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
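The translation from a client-side interface to a native storage interface can be sketched as a thin routing layer: the application makes one kind of call, and the layer hands it to whichever driver is registered for the target store. The code below is an illustrative toy, not Alluxio's API. `InMemoryDriver`, `TranslationLayer`, and the URIs are hypothetical, and real drivers would speak the HDFS or S3 protocols instead of reading from a dictionary.

```python
from urllib.parse import urlparse


class InMemoryDriver:
    """Hypothetical stand-in for a storage driver (HDFS, S3, ...)."""

    def __init__(self, name, objects):
        self.name = name
        self._objects = objects

    def read(self, key):
        return self._objects[key]


class TranslationLayer:
    """Routes one client-side call to the driver registered for a URI scheme."""

    def __init__(self):
        self._drivers = {}

    def register(self, scheme, driver):
        self._drivers[scheme] = driver

    def read(self, uri):
        parsed = urlparse(uri)
        driver = self._drivers[parsed.scheme]
        # The driver sees only its own native key, not the client-side URI.
        return driver.read(parsed.netloc + parsed.path)


layer = TranslationLayer()
layer.register("hdfs", InMemoryDriver("hdfs", {"nn:8020/data/a.txt": b"from-hdfs"}))
layer.register("s3", InMemoryDriver("s3", {"bucket/key.csv": b"from-s3"}))

hdfs_bytes = layer.read("hdfs://nn:8020/data/a.txt")
s3_bytes = layer.read("s3://bucket/key.csv")
print(hdfs_bytes, s3_bytes)
```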

Data Elasticity via Unified Namespace
Enables effective data management across different Under Store
- Uses Mounting with Transparent Naming
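Mounting with transparent naming can be thought of as a table that maps paths in one logical namespace onto locations in different under stores. The sketch below resolves a logical path by longest-prefix match. It is a simplified model rather than Alluxio's mount logic, and `MountTable`, the host name, and the bucket name are made up for illustration.

```python
class MountTable:
    """Maps paths in one logical namespace to URIs in under stores (toy model)."""

    def __init__(self):
        self._mounts = {}

    def mount(self, mount_point, under_uri):
        self._mounts[mount_point.rstrip("/") or "/"] = under_uri.rstrip("/")

    def resolve(self, path):
        # Longest-prefix match so nested mounts win over their parents.
        for mount_point in sorted(self._mounts, key=len, reverse=True):
            prefix = "" if mount_point == "/" else mount_point
            if path == mount_point or path.startswith(prefix + "/"):
                return self._mounts[mount_point] + path[len(prefix):]
        raise KeyError(f"no mount covers {path}")


table = MountTable()
table.mount("/", "hdfs://onprem-nn:8020/warehouse")
table.mount("/logs", "s3://example-bucket/logs")

logs_uri = table.resolve("/logs/2020/03/10/app.log")
sales_uri = table.resolve("/sales/q1.parquet")
print(logs_uri)
print(sales_uri)
```

An application addresses `/logs/...` and `/sales/...` the same way, without knowing that one lives in an object store and the other in HDFS.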

Unified Namespace: Global Data Accessibility
Transparent access to under storage makes all enterprise data available locally
SUPPORTS
• HDFS
• NFS
• OpenStack
• Ceph
• Amazon S3
• Azure
• Google Cloud
IT OPS FRIENDLY
• Storage mounted into Alluxio by central IT
• Security in Alluxio mirrors source data
• Authentication through LDAP/AD
• Wireline encryption
Diagram labels: HDFS #1, Object Store, NFS, HDFS #2

Use case | Cloud bursting on-premise data
Leading Hedge Fund
Fastest growing big hedge fund managing $46 billion for investors
Diagram labels: Spark, HDFS, Data Orchestration, Public Cloud
▪ Compute scales elastically independent of storage
▪ Faster time to insights with seamless data orchestration
▪ Accelerated workloads with memory-first data approach

Machine Learning Case Study
Challenge –
Gain end-to-end view of business with large volume of data
Queries were slow / not interactive, resulting in operational inefficiency
Solution –
ETL data from Teradata to Alluxio
Impact –
Faster Time to Market – “Now we don’t have to work Sundays”
Use Case: http://bit.ly/2oMx95W
Diagram labels: Spark, Teradata

Walmart Use case
Why Walmart chose Alluxio’s “Zero-Copy” burst solution:
• No requirement to persist data into the cloud
• Improved query performance and no network hops on recurrent queries
• Lower costs without the need for creating data copies

Enterprises moving towards independent compute & storage

Incredible Open Source Momentum with growing community
1000+ contributors & growing
4.5K+ GitHub Stars
Apache 2.0 Licensed
Hundreds of thousands of downloads
Join the conversation on Slack
slackin.alluxio.io

Questions?
Join the Alluxio Community
www.alluxio.io | Twitter: @alluxio

Problem: HDFS cluster is compute-bound & complex to maintain
Diagram labels: AWS Public Cloud IaaS; On-Premises Datacenter; Connectivity; Spark, Presto, Hive, TensorFlow; Alluxio Data Orchestration and Control Service
Barrier 1: Prohibitive network
latency and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Orchestrates compute access to on-prem data
• Working set of data, not FULL set of data
• Local performance
• Scales elastically
• On-Prem Cluster Offload (both Compute & I/O)
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switchover, migrate at your own pace
• Moves the data per policy – e.g. last 7 days
Solution: “Zero-copy” bursting to scale to the cloud
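Step 2's policy-driven migration can be illustrated with a small selection function. The slide gives "last 7 days" only as an example and does not say exactly how the window is applied, so this sketch assumes the policy selects files modified within the last seven days. `select_for_migration`, the paths, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def select_for_migration(files, now, max_age_days=7):
    """Return paths whose last-modified time falls within the policy window."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(path for path, modified in files.items() if modified >= cutoff)


now = datetime(2020, 3, 10)
files = {
    "/sales/2020-03-09.parquet": datetime(2020, 3, 9),
    "/sales/2020-03-03.parquet": datetime(2020, 3, 3),
    "/sales/2020-02-01.parquet": datetime(2020, 2, 1),
}

to_migrate = select_for_migration(files, now)
print(to_migrate)
```

Running such a selection on a schedule, rather than moving everything at once, matches the slide's point about migrating at your own pace instead of a hard switchover.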

Use case | Data orchestration for agility
China Unicom
Leading Chinese Telco serving 320 million subscribers
Diagram labels: Spark, ETL, Kubernetes, Data Orchestration, HDFS, Object, HBase
▪ Single namespace to access & address all data
▪ Data local to compute accelerates workloads

Analytics Use Case – Top Retailer
Challenge –
Bottleneck in trend analysis of mission-critical daily sales and inventory management
Queries were slow / not interactive, resulting in operational inefficiency
Solution –
With Alluxio, data queries are 10X faster
Impact –
Higher operational efficiency
Use case: http://bit.ly/2ook8Nh
Diagram labels: Spark, HDFS

Customer Insights Use Case – Top Telecom
Challenge –
Desired a central view of consumer information in near real time for proactive support.
Many HDFS clusters, different distributions, many incompatible versions.
On-prem & cloud. Integration through heavy ETL.
Solution –
Alluxio integrates data into a central catalog for fast access to consumer interaction records.
Impact –
Reduced integration time
Faster data speed & freshness
Diagram labels: Hadoop, ML, ETL, HDP, CDH, MapR, HDFS
