2. The Alluxio Story
Originated as Tachyon project, at the UC Berkeley’s AMP Lab
by then Ph.D. student & now Alluxio CTO, Haoyuan (H.Y.) Li.
2015
Open Source project established & company to
commercialize Alluxio founded
Goal: Orchestrate Data for the Cloud for data driven apps
such as Big Data Analytics, ML and AI.
Focus: Accelerating modern app frameworks running on
HDFS/S3-based data lakes or warehouses
Hot top 10 Big Data
2020
Impact 50
2019
Trend-setting product
2019
Trend-setting product
2019
3. Consumer Travel & TransportationTelco & Media
Alluxio: Data-Driven Innovation Across Industries
Learn more
TechnologyFinancial Services Retail & Entertainment Data & Analytics Services
4. 4 big trends driving the need for a new architecture
Separation of
Compute &
Storage
Hybrid – Multi
cloud
environments
Self-service
data across the
enterprise
Rise
of the object
store
5. Data Orchestration for the Cloud
Java File API HDFS Interface S3 Interface REST APIPOSIX Interface
HDFS Driver Swift Driver S3 Driver NFS Driver
Enable innovation with any frameworks
running on data stored anywhere
Data Analyst
Data Engineer
Storage Ops
Data Scientist
Lines of Business
6. Alluxio Data Orchestration for the Cloud
Structured
Data Catalog
Intelligent
Caching
Data
Transformation
Data
Management
Global
Namespace
7. Data Orchestration for the Cloud
Cross-platform Security & Governance
Authentication
Kerberos, Delegation token, LDAP, AD
Authorization
FS security model, AWS IAM model, Ranger integration
Encryption
On the wire with TLS, at rest with client-side encryption
Audit Logging
Track accesses to all data
8. Where are you in the cloud journey?
“I’m all in the cloud”
“I want a hybrid cloud”
“I want to migrate”“Hadoop in the DC”
| EMR w/ S3
| EC2 installed
| Dataproc w/ GCS
| GCE installed
| HDInsights w/ Blob
| VM installed
“Separate Compute &
Storage Tiers”
9. Public Cloud IaaS
Spark Presto Hive TensorFlow
Alluxio Data Orchestration and Control Service
Alluxio enables compute!
Alluxio Cloud Data Orchestration
Alluxio Data Orchestration and Control Service
Solution: Consistent High Performance
• Performance increases range from 1.5X
to 10X
• Data orchestration of multiple S3 buckets
• AWS EMR & Google Dataproc
integrations
• Fewer copies of data means lower costs
Problem: Object Stores have
inconsistent performance for analytics
and AI workloads
§ SLAs are hard to achieve
§ S3 metadata operations are expensive
§ Copied data storage costs add up making
the solution expensive
§ S3 is eventually consistent making it hard
to predict query results
10. Takeaways
• Nearly 2x performance
reduction for small range
queries
• Much more concurrency
with Alluxio
• This means ½ the
compute costs or 2x
more capacity with the
same environment
11. Using Alluxio with AWS EMR
Presto Hive
Instances
Metadata &
Data cache
Presto Hive
Metadata &
Data cache
HDFS HDFSEMRFS EMRFS
Compute-driven
Continuous sync
Compute-driven
Continuous sync
12. Using Alluxio with Google Dataproc
Presto Hive
Metadata &
Data cache
Presto Hive
Metadata &
Data cache
Compute-driven
Continuous sync
Compute-driven
Continuous sync
Google
Dataproc
Cluster
Google Cloud Store Google Cloud Store
Single command initialization action brings up Alluxio in dataproc
Alluxio Initialization Action - https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/alluxio
14. Goal: Enable data workloads in the cloud on existing
on-prem data
Restrictions
§ Data cannot be persisted in a public cloud
§ Additional I/O capacity cannot be added to existing Hadoop infrastructure
§ On-prem level security needs to be maintained
§ Network bandwidth utilization needs to be minimal
Alternatives
Lift and Shift
Data copy by
workload
“Zero-copy” Bursting
15. Problem: HDFS cluster is compute-
bound & complex to maintain
AWS Public Cloud IaaS
Spark Presto Hive TensorFlow
Alluxio Data Orchestration and Control Service
On Premises
Connectivity
Datacenter
Spark Presto Hive
Tensor
Flow
Alluxio Data Orchestration and Control Service
Barrier 1: Prohibitive network latency
and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Orchestrates compute access to on-prem data
• Working set of data, not FULL set of data
• Local performance
• Scales elastically
• On-Prem Cluster Offload (both Compute & I/O)
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with less dependencies
• Instead of hard switch over, migrate at own pace
• Moves the data per policy – e.g. last 7 days
“Zero-copy” bursting to scale to the cloud
16. AWS
Alluxio Cloud Data Orchestration
Datacenter
GCP Azure
Step 3:.
Multicloud On-Demand Data Platform
• Orchestrates compute access to on-
prem and cloud data
• High performance
• Scales elastically
19. Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot Warm Cold
RAM SSD HDD
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion,TTL
21. Spark Presto Hive TensorFlow
RAM
SSD
Disk
Framework
Read file /trades/us
Bucket Trades Bucket Customers
Data requests
Feature Highlight: Data Caching for faster compute
Read file /trades/us again Read file /trades/top
Read file /trades/top
Variable latency
with throttling
Read file /trades/us again Read file /trades/top
Read file /trades/top
Read file /trades/us again Read file /trades/top
Read file /trades/top
Read file /trades/us again Read file /trades/top
Read file /trades/top
Read file /trades/us again
22. Spark Presto Hive TensorFlow
RAM
Framework
Read file /trades/us
Trades Directory Customers Directory
Data requests
”Zero-copy” bursting under the hood
Read file /trades/us again Read file /trades/top
Read file /trades/top
Variable latency
with throttling
Read file /trades/us again Read file /trades/top
Read file /trades/top
Read file /trades/us again Read file /trades/top
Read file /trades/top
Read file /trades/us again Read file /trades/top
Read file /trades/top
Read file /trades/us again
23. Spark Presto Hive TensorFlow
RAM
SSD
Disk
Framework
Bucket Trades Bucket Customers
Data requests
Feature Highlight - Intelligent Tiering for resource efficiency
Read file /customers/145
Out of memory
Variable latency
with throttling
Data moved to another tier
24. Spark Presto Hive TensorFlow
RAM
SSD
Disk
Framework
New Trades
Policy Defined Move data > 90 days old to
Feature Highlight – Policy-driven Data Management
S3 Standard
Policy interval : Every day
Policy applied everyday
25. Alluxio Structured Data Management Preview
26
Presto
Alluxio Caching
Service
Alluxio Catalog
Service
Alluxio Transformation
Service
Hive
Connector
Alluxio
Connector
Hive
Metastore
Storage
26. Alluxio Catalog Service
27
Alluxio Catalog Service
Hive Metastore
Hive Under Database
Functionality
Manages metadata for structured data
Abstracts other database catalogs as
Under Database (UDB)
Benefits
Schema-aware optimizations
Simple deployment
27. 28
How to Use Alluxio Catalog CLI
alluxio table attachdb <udb type> <udb uri> <udb db name>
associate an Alluxio database with an UDB database
alluxio table detachdb <db name>
remove the association from “attach”
alluxio table ls [<db name> [<table name>]]
display information in the catalog
alluxio table sync <db name>
synchronize the Alluxio catalog with the UDB metadata
28. 29
Alluxio Presto Connector
Tighter integration with Presto
New plugin based on the Presto Hive connector
Available in Alluxio 2.1.0 distribution
Future: Merge connector into Presto codebase
30. 31
How to Use Transformation CLI
alluxio table transform <db name> <table name>
initiate a transformation on a table
alluxio table transformStatus <transform id>
display the status for a transformation
31. 32
Example
2 isolated AWS 10-node clusters
Presto + Hive Metastore + S3 Data
Presto + Alluxio + Hive Metastore + S3 Data
TPCDS dataset on S3
CSV format
~10,000 files
32. 33
Example Results
Attached existing Hive database into Alluxio Catalog
Alluxio Catalog served table metadata for Presto
Transformed store_sales by coalescing and converting CSV to Parquet
Presto Without
Alluxio
20s
Alluxio
Transformations
7s
Alluxio Transformations
With Caching
3s
34. Data Elasticity
with a unified
namespace
Abstract data silos & storage
systems to independently scale
data on-demand with compute
Run Spark, Hive, Presto, ML
workloads on your data
located anywhere
Accelerate big data
workloads with transparent
tiered local data
Data Accessibility
for popular APIs &
API translation
Data Locality
with Intelligent
Multi-tiering
Alluxio – Key innovations
35. Use Cases Data Orchestration Enables
Hive
Alluxio
Burst big data workloads in
hybrid cloud environments
On premise
Same instance
/ container
Alluxio
On-premise
PrestoSpark
Alluxio
Accelerate big data frameworks
on the public cloud
Same instance
/ container
Dramatically speed-up big data
on object stores on premise
Same container
/ machine
or or
In the cloud
36. § S3 performance is variable and consistent
query SLAs are hard to achieve
§ S3 metadata operations are expensive making
workloads run longer
§ S3 egress costs add up making the
solution expensive
§ S3 is eventually consistent making it hard
to predict query results
Challenges with running Big Data workloads on S3 & Alluxio Solution
Compute caching for S3
Spark
Alluxio
Accelerate big data frameworks
on the public cloud
Same instance
/ container
37. § Accessing data over WAN too slow
§ Copying data to compute cloud time
consuming and complex
§ Using another storage system like S3
means expensive application changes
§ Using S3 via HDFS connector leads
to extremely low performance
Challenges with Hybrid Cloud & Alluxio Solution
HDFS for Hybrid Cloud
Hive
Alluxio
Burst big data workloads in
hybrid cloud environments
On premise
Same instance
/ container
In the cloud
3
Solution Benefits
§ Same performance as local
§ Same end-user experience
§ 100% of I/O is offloaded
38. Challenges with supporting more frameworks & Alluxio Solution
§ Running new frameworks on existing an HDFS cluster
can dramatically affect performance of existing
workloads
§ In a disaggregate environment, copying data to multiple
compute clouds time consuming and error prone
§ Migrating applications for new storage systems is
complex & time consuming
§ Storing and managing multiple copies of the data
becomes expensive
Support more frameworks
Any object store or HDFS
Same data
center / region
Presto
Enable big data on object stores
across single or multiple clouds
or
Spark
Alluxio Alluxio
4
39. Challenges running Big Data on Object Stores & Alluxio Solution
§ Object stores performance for big
data workloads can be very poor
§ No native support for popular
frameworks
§ Expensive metadata operations
reduce performance even more
§ No support for hybrid environments
directly
Transition to Object store
Alluxio
On-premise
Presto
Dramatically speed-up big data
on object stores on premise
Same container
/ machine
or or
5
Solution Benefits
§ Same performance as HDFS
§ Uses HDFS APIs
§ Same end-user experience
§ Storage at fraction of the
cost of HDFS