Alluxio Community Office Hour
Apr 28, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Adit Madan
Bin Fan
Today’s conventional wisdom states that network latency across the two ends of a hybrid cloud prevents you from running analytic workloads in the cloud with the data on-prem. As a result, most companies copy their data into a cloud environment and maintain that duplicate data. All of this means that it is challenging to make on-prem HDFS data accessible from the cloud while still achieving the desired application performance.
In this talk, we will show you how to leverage any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) to scale analytics workloads directly on on-prem data without copying and synchronizing the data into the cloud.
In this Office Hour, we will go over:
- A strategy to embrace the hybrid cloud, including an architecture for running ephemeral compute clusters using on-prem HDFS.
- An example of running on-demand Presto, Spark, and Hive with Alluxio in the public cloud.
- An analysis of experiments with TPC-DS to demonstrate the benefits of the given architecture.
2. Open source project started at UC Berkeley's AMPLab, with strong open source momentum and a growing community
▪ 1,000+ contributors & growing
▪ 4,000+ Git Stars
▪ Apache 2.0 Licensed
▪ Millions of downloads
3. Alluxio Use Cases
▪ Burst big data workloads in hybrid cloud environments
▪ Dramatically speed up big data on object stores on premise
▪ Accelerate big data frameworks on the public cloud
[Diagrams: Presto or Spark deployed with Alluxio on the same instance / container or machine, on-premise or in the public cloud]
4. Intro to EMR
▪ AWS Provided and Managed Hadoop Services
▪ Spark, HDFS, Presto, Hive
▪ Easy to configure and onboard
▪ Does the work for you
▪ Elastic and Flexible
5. EMR Service Integration: Bootstrap Actions
▪ EMR hooks into the main configuration files for Hadoop Services:
▪ hive-site.xml, core-site.xml, hadoop-env.sh, hive.properties
▪ Bootstrap Actions
▪ Alluxio can be deployed using a bootstrap action
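As a minimal sketch of what a bootstrap action looks like in practice, the snippet below builds one entry in the shape that the EMR API (for example boto3's `run_job_flow`, via its `BootstrapActions` parameter) accepts. The script location `s3://example-bucket/alluxio-emr.sh` and the single argument (a root under-storage URI) are hypothetical placeholders, not values from the talk; the actual script and its arguments should be taken from the Alluxio EMR documentation.

```python
def alluxio_bootstrap_action(script_path, root_ufs_uri):
    """Build one entry for the BootstrapActions list of an EMR cluster request.

    script_path: location of the bootstrap script (hypothetical here).
    root_ufs_uri: argument passed to the script; assumed to be the storage
                  that Alluxio should mount as its root.
    """
    return {
        "Name": "Install Alluxio",
        "ScriptBootstrapAction": {
            "Path": script_path,
            "Args": [root_ufs_uri],
        },
    }


# Hypothetical values, for illustration only.
action = alluxio_bootstrap_action(
    "s3://example-bucket/alluxio-emr.sh",
    "hdfs://ns/",
)
print(action)
```

The resulting dictionary would be passed in a list as `BootstrapActions` when the cluster is created, so the script runs on every node before the Hadoop services start.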
7. A. Metadata Locality with Active Sync
Synchronize Metadata for On-premise Mutations
▪ HDFS iNotify Based Metadata Synchronization
▪ Mutation on-premises: the old file at path /file1 is replaced by a new file at path /file1
▪ Alluxio Master: policies for pinning, promotion/demotion, TTL
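The idea of event-driven metadata synchronization can be illustrated with a toy model. This is not Alluxio's implementation and does not use the HDFS iNotify API; the `FileMeta` and `MetadataCache` names and the event kinds are assumptions made up for this sketch. It only shows how a cached view of `/file1` stays current when a mutation event arrives from the on-premises side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    path: str
    length: int
    mtime: int


class MetadataCache:
    """Toy stand-in for a master that keeps file metadata in sync via events."""

    def __init__(self):
        self._entries = {}

    def apply_event(self, kind, meta):
        # Create and modify events overwrite the cached entry for the path.
        if kind in ("create", "modify"):
            self._entries[meta.path] = meta
        # Delete events drop the cached entry if one exists.
        elif kind == "delete":
            self._entries.pop(meta.path, None)
        else:
            raise ValueError(f"unknown event kind: {kind}")

    def lookup(self, path):
        return self._entries.get(path)


cache = MetadataCache()
old_file = FileMeta("/file1", length=100, mtime=1)
new_file = FileMeta("/file1", length=250, mtime=2)

cache.apply_event("create", old_file)
# A mutation on-premises replaces the file at the same path.
cache.apply_event("modify", new_file)
print(cache.lookup("/file1"))
```

After the second event, a lookup of `/file1` returns the new metadata, so a reader in the cloud would not see the stale file.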
8. B. Data Locality with Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
▪ Tiers: Hot (RAM), Warm (SSD), Cold (HDD)
▪ Read & Write Buffering
▪ Transparent to App
▪ Policies for pinning, promotion/demotion, TTL
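Promotion and demotion across tiers can be sketched with a small in-memory model. This is an illustrative simplification, not Alluxio's tiered storage code: the `TieredCache` class, the least-recently-used demotion rule, and the one-block capacities are all assumptions chosen for the example, and pinning and TTL are left out.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier cache: reads promote to the top tier, overflow demotes."""

    def __init__(self, capacities):
        # One LRU-ordered dict per tier, fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # Fell off the coldest tier; data remains in remote storage.
        _, cap, blocks = self.tiers[level]
        blocks[key] = value
        blocks.move_to_end(key)
        if len(blocks) > cap:
            # Demote the least recently used block to the next tier down.
            old_key, old_value = blocks.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def _remove(self, key):
        for _, _, blocks in self.tiers:
            if key in blocks:
                return blocks.pop(key)
        return None

    def put(self, key, value):
        self._remove(key)
        self._insert(0, key, value)

    def get(self, key):
        # A hit in any tier promotes the block back to the top tier.
        value = self._remove(key)
        if value is not None:
            self._insert(0, key, value)
        return value

    def tier_of(self, key):
        for name, _, blocks in self.tiers:
            if key in blocks:
                return name
        return None


cache = TieredCache([("RAM", 1), ("SSD", 1), ("HDD", 1)])
for block in ("a", "b", "c"):
    cache.put(block, block.encode())

print({b: cache.tier_of(b) for b in "abc"})  # "a" has been demoted to HDD
cache.get("a")                               # reading "a" promotes it
print({b: cache.tier_of(b) for b in "abc"})
```

The application only calls `get` and `put`; which tier serves the block is decided inside the cache, which is the sense in which tiering is transparent to the app.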
9. C. No Hive Table Redefinitions in the Public Cloud
Feature Highlight - Use Transparent URI
• Begin with data and metadata on-premises
  • HDFS has data on-premises
  • Hive Metastore has metadata with location hdfs://ns/table
• Launch a cluster in the public cloud
  • Presto Catalog on EMR points to Hive Metastore on-premises
  • Configure Catalog to use Alluxio Transparent URI
  • Alluxio intercepts Presto calls to hdfs://ns/table
• Start querying in the public cloud
  • Accesses to HDFS on-premises are now served by Alluxio
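The effect of the steps above is that a table location such as hdfs://ns/table stays unchanged in the Hive Metastore while reads are redirected. The sketch below illustrates only that redirection idea; it is not how the Transparent URI feature is implemented. The `resolve` function, the mount table, and the `/onprem` mount point are hypothetical names introduced for this example.

```python
from urllib.parse import urlsplit


def resolve(uri, mounts):
    """Map an on-premises URI to a path in the cache namespace.

    mounts: dict from "scheme://authority" to a mount point (hypothetical).
    Returns None when the URI is not covered by any mount.
    """
    parts = urlsplit(uri)
    authority = f"{parts.scheme}://{parts.netloc}"
    mount_point = mounts.get(authority)
    if mount_point is None:
        return None
    return mount_point.rstrip("/") + parts.path


# Assumed mount table: the on-premises namespace "ns" appears under /onprem.
mounts = {"hdfs://ns": "/onprem"}

print(resolve("hdfs://ns/table", mounts))
print(resolve("hdfs://other/table", mounts))
```

Because the mapping happens below the query engine, the table definition keeps its original hdfs:// location and no Hive table needs to be redefined in the cloud.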
10. Benchmark Report - TPC-DS
▪ Setup
▪ 10+1 r5.4xlarge instances in both clusters
▪ Latency: 175ms
12. Benchmark Report - TPC-DS
▪ Maximum Improvement - All Queries
▪ q9 (7.1x)
▪ Maximum Improvement - By Class
▪ Reporting: q27 (3.1x)
▪ Interactive: q73 (3.9x)
▪ Deep Analytics: q34 (4.2x)
13. Additional Resources
▪ “Zero-Copy” Hybrid Cloud for Data Analytics - Strategy, Architecture and Benchmark Report
https://www.alluxio.io/resources/whitepapers/zero-copy-hybrid-cloud-for-data-analytics-strategy-architecture-and-benchmark-report/
▪ Running Presto with Alluxio
https://docs.alluxio.io/os/user/stable/en/compute/Presto.html
▪ Using Transparent URI
https://docs.alluxio.io/ee/user/stable/en/operation/Transparent-Uri.html
▪ Top 5 performance tips running Presto with Alluxio
https://www.alluxio.io/blog/top-5-performance-tuning-tips-for-running-presto-on-alluxio-1
▪ Getting Started with EMR and Alluxio
https://docs.alluxio.io/os/user/stable/en/cloud/AWS-EMR.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html
14. Questions?
How are you using EMR?
You are welcome to join the Alluxio Open Source Community!
www.alluxio.io | @alluxio