Alluxio Tech Talk
January 21, 2020
Speakers:
Matt Fuller, Starburst
Dipti Borkar, Alluxio
With the advent of public clouds and data increasingly siloed across many locations, both on premises and in the public cloud, enterprises are looking for more flexible, higher-performance approaches to analyzing their structured data.
Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly concurrent, low-latency analytics platform. This stack provides a strong solution for running fast SQL across multiple storage systems, including HDFS, S3, and others, in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about:
- The architecture of Presto, an open source distributed SQL engine
- How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics
- Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted
4. Company Overview
Founded 2017
• Team includes the creators of Presto and many of the largest committers, contributors, and community members of Presto
• Former Facebook, Teradata, Vertica, Netezza, and Ab Initio
Enterprise Presto Offering
• AWS, Azure, GCP, On Premises
• Kubernetes
5. Why Presto?
Speed: Fast federated ANSI SQL engine
● Proven scalability
● High concurrency
● Cost-based query optimization
Efficiency: Separation of storage & compute
● Scale storage & compute independently
● No ETL required
● SQL-on-anything
Freedom: Open source; no vendor lock-in
● No Hadoop vendor lock-in
● No storage vendor lock-in
● No cloud vendor lock-in
● Community driven
6. Why Starburst?
Even Faster Speed: Starburst Distro performs faster
● Fully tested, stable releases
● Curated by the Presto creators
● Most up-to-date cost-based query optimizer
Enterprise-Grade Features: Security, automation & connectors
● RBAC + data encryption
● Automated cluster deployment
● Auto scaling + graceful shutdown
● 36+ connectors
24x7 Support: From the Presto experts
● 24x7 we’ve got your back
● Hot fixes + security patches
● Access to customer success team of data architects
8. Presto Extensibility with Connectors
Presto Coordinator
• Metadata SPI
• Data Statistics SPI
Presto Worker
• Data Stream SPI
• Data Location SPI
Connectors shown behind each SPI: Distributed, Cassandra, Kafka, Teradata, Snowflake
9. Starburst Product Offerings
Starburst Presto Community
Free version of Starburst Presto that includes limited additional features.
Starburst Presto Enterprise
Starburst Presto built for the enterprise that includes additional features & connectors, security integrations, premium 24x7 support, rigorous testing, patch releases/hotfixes, long term support, additional tooling, and cloud integrations.
10. Distributed Storage Connector
• Access data stored in scalable and cost-effective storage
○ HDFS
○ AWS S3
○ Google GCS
○ Azure Blob & ADLS
○ S3-compatible (e.g., MinIO, Ceph)
• Schema information stored in Hive Metastore or AWS Glue Catalog
• Uses “Hive-Style” Table format
• Partitions and Bucketing are recognized and used
• Does not use Hive runtime to perform execution
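To make the partitioning point concrete, here is a minimal sketch of how Hive-style partition directories (`key=value` path segments) let a query engine skip files that cannot match a filter. It illustrates the idea only and is not Presto's implementation; the paths, the `region` column, and the function names are hypothetical.

```python
def parse_partition(path: str) -> dict:
    """Extract key=value partition columns from a Hive-style path."""
    parts = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def prune(paths: list, column: str, wanted: str) -> list:
    """Keep only the paths whose partition column equals the wanted value."""
    return [p for p in paths if parse_partition(p).get(column) == wanted]


if __name__ == "__main__":
    # Hypothetical table layout: one directory per value of the partition column.
    paths = [
        "/warehouse/trades/region=us/part-0.orc",
        "/warehouse/trades/region=eu/part-0.orc",
    ]
    # A filter such as WHERE region = 'us' only needs the first file.
    print(prune(paths, "region", "us"))
```

Only the files under the matching partition directory are read, which is why partition-aware connectors can reduce the amount of data scanned.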
11. Relational Database Connectivity
• Query relational data through Presto as the consumption layer
• Federate over multiple data sources
• MySQL
• PostgreSQL
• Redshift
• SQL Server
• Google BigQuery
• Oracle
• DB2
• Teradata
• Snowflake
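As a rough illustration of federation, the sketch below builds one Presto SQL statement that joins a table from a Hive-style catalog with a table from a MySQL catalog. Presto addresses tables as `catalog.schema.table`; the catalog names (`hive`, `mysql`), schemas, tables, and columns used here are hypothetical and depend on how a given cluster is configured.

```python
# Sketch only: catalog, schema, table, and column names are hypothetical.

def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query() -> str:
    """Build one SQL statement that joins tables from two different catalogs."""
    orders = qualified_name("hive", "sales", "orders")       # e.g. files in S3 or HDFS
    customers = qualified_name("mysql", "crm", "customers")  # e.g. a MySQL database
    return (
        "SELECT c.name, SUM(o.total) AS revenue\n"
        f"FROM {orders} AS o\n"
        f"JOIN {customers} AS c ON o.customer_id = c.id\n"
        "GROUP BY c.name"
    )


if __name__ == "__main__":
    print(build_federated_query())
```

The statement would be submitted through whatever Presto client is in use; no data needs to be copied between the two systems beforehand.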
12. Non Relational Data Sources
• Apache Accumulo
• Apache Cassandra
• Apache Phoenix
• Elasticsearch
• Apache Kafka
• Apache Kudu
• MongoDB
• Redis
13. The Alluxio Story
Originated as the Tachyon project at UC Berkeley’s AMPLab, started by then Ph.D. student and now Alluxio CTO, Haoyuan (H.Y.) Li.
2015: Open source project established & company to commercialize Alluxio founded
Goal: Orchestrate data for the cloud for data-driven apps such as big data analytics, ML, and AI.
Focus: Accelerating modern app frameworks running on HDFS/S3-based data lakes or warehouses
Recognition: Hot top 10 Big Data (2020); Impact 50 (2019); Trend-setting product (2019)
14. Alluxio: Data-Driven Innovation Across Industries
Consumer, Travel & Transportation, Telco & Media, Technology, Financial Services, Retail & Entertainment, Data & Analytics Services
15. Data Orchestration for the Cloud
Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Enable innovation with any framework running on data stored anywhere
Data Analyst
Data Engineer
Storage Ops
Data Scientist
Lines of Business
16. Alluxio Data Orchestration for the Cloud
• Structured Data Catalog
• Intelligent Caching
• Data Transformation
• Data Management
• Global Namespace
17. Where are you in the cloud journey?
“I’m all in the cloud”
“I want a hybrid cloud”
“I want to migrate”
“Hadoop in the DC”
• EMR w/ S3 | EC2 installed
• Dataproc w/ GCS | GCE installed
• HDInsight w/ Blob | VM installed
“Separate Compute & Storage Tiers”
18. Public Cloud IaaS
Alluxio Cloud Data Orchestration: Alluxio enables compute!
Spark, Presto, Hive, TensorFlow
Alluxio Data Orchestration and Control Service
Problem: Object stores have inconsistent performance for analytics and AI workloads
§ SLAs are hard to achieve
§ S3 metadata operations are expensive
§ Copied data storage costs add up, making the solution expensive
Solution: Consistent high performance
• Performance increases range from 1.5X to 10X
• Operational costs dramatically reduced, by up to 80%
19. Takeaways
• Nearly 2x reduction in query time for small range queries
• Much more concurrency with Alluxio
• This means ½ the compute costs or 2x more capacity with the same environment
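The arithmetic behind the last takeaway can be sketched as follows, assuming compute cost scales linearly with query runtime on a fixed cluster. That linearity is an assumption for illustration, not a measured result from the talk, and the function names are hypothetical.

```python
def cost_fraction(speedup: float) -> float:
    """Fraction of the original compute cost needed for the same workload."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def capacity_multiple(speedup: float) -> float:
    """How many times more work fits on the same cluster in the same time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return speedup


if __name__ == "__main__":
    s = 2.0  # "nearly 2x" from the slide, rounded up for illustration
    print(f"cost fraction: {cost_fraction(s):.2f}")            # cost fraction: 0.50
    print(f"capacity multiple: {capacity_multiple(s):.1f}x")   # capacity multiple: 2.0x
```

With a speedup below 2x, the same formulas give proportionally smaller savings, for example a 1.5x speedup corresponds to about two thirds of the original cost.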
20. Now Available: Starburst Presto + Alluxio on
▪ AWS AMI pre-configured to speed up Presto queries using Alluxio caching
▪ 2x - 5x performance boost depending on dataset and workload
▪ Tutorial: https://www.alluxio.io/products/aws/starburst-alluxio-cft-tutorial/
https://aws.amazon.com/marketplace/pp/Starburst-Starburst-Enterprise-Presto-with-Caching/B07ZTHJ9YF
22. Goal: Enable data workloads in the cloud on existing on-prem data
Restrictions
§ Data cannot be persisted in a public cloud
§ Additional I/O capacity cannot be added to existing Hadoop infrastructure
§ On-prem level security needs to be maintained
§ Network bandwidth utilization needs to be minimal
Alternatives
§ Lift and Shift
§ Data copy by workload
§ “Zero-copy” Bursting
23. Problem: HDFS cluster is compute-bound & complex to maintain
[Diagram: AWS Public Cloud IaaS (Spark, Presto, Hive, TensorFlow on the Alluxio Data Orchestration and Control Service) linked by Connectivity to the On Premises Datacenter (Spark, Presto, Hive, TensorFlow on the Alluxio Data Orchestration and Control Service)]
Barrier 1: Prohibitive network latency and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Orchestrates compute access to on-prem data
• Working set of data, not FULL set of data
• Local performance
• Scales elastically
• On-Prem Cluster Offload (both Compute & I/O)
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switchover, migrate at your own pace
• Moves the data per policy – e.g. last 7 days
“Zero-copy” bursting to scale to the cloud
26. Feature Highlight: Data Caching for faster compute
[Diagram: Frameworks (Spark, Presto, Hive, TensorFlow) issue data requests through the RAM, SSD, and Disk tiers to Bucket Trades and Bucket Customers, which respond with variable latency with throttling. Request sequence: Read file /trades/us; Read file /trades/us again; Read file /trades/top]
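The caching behavior in this slide can be sketched as a read-through cache: the first read of a file goes to the bucket, and repeated reads are served from memory. This is a minimal illustration of the idea, not Alluxio's API; the class name, the dictionary standing in for a bucket, and the counter are hypothetical.

```python
class ReadThroughCache:
    """Serve repeated reads from memory; fetch from the slow store only on a miss."""

    def __init__(self, backing_store: dict):
        self.backing_store = backing_store  # stands in for an object-store bucket
        self.cache = {}                     # stands in for the RAM tier
        self.remote_reads = 0               # number of reads that reached the store

    def read(self, path: str) -> bytes:
        if path in self.cache:              # cache hit: no remote request
            return self.cache[path]
        self.remote_reads += 1              # cache miss: go to the bucket
        data = self.backing_store[path]
        self.cache[path] = data
        return data


if __name__ == "__main__":
    bucket = {"/trades/us": b"us trades", "/trades/top": b"top trades"}
    cache = ReadThroughCache(bucket)
    cache.read("/trades/us")    # miss: fetched from the bucket
    cache.read("/trades/us")    # hit: served from memory
    cache.read("/trades/top")   # miss: fetched from the bucket
    print(cache.remote_reads)   # 2
```

Three reads produce only two remote requests, so the repeated read avoids the variable latency and throttling of the underlying store.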
27. “Zero-copy” bursting under the hood
[Diagram: Frameworks (Spark, Presto, Hive, TensorFlow) issue data requests through the RAM tier to the Trades Directory and Customers Directory, which respond with variable latency with throttling. Request sequence: Read file /trades/us; Read file /trades/us again; Read file /trades/top]
28. Feature Highlight: Intelligent Tiering for resource efficiency
[Diagram: Frameworks (Spark, Presto, Hive, TensorFlow) issue data requests through the RAM, SSD, and Disk tiers to Bucket Trades and Bucket Customers, which respond with variable latency with throttling. Sequence: Read file /customers/145; Out of memory; Data moved to another tier]
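One way to picture tiering is a two-level cache in which a full RAM tier demotes its least recently used entry to SSD instead of discarding it. The sketch below assumes an LRU eviction order and counts capacity in entries rather than bytes; both are simplifications chosen for illustration, and the class is hypothetical rather than Alluxio's implementation.

```python
from collections import OrderedDict


class TieredCache:
    """Two tiers: when RAM is full, the least recently used entry moves to SSD."""

    def __init__(self, ram_capacity: int):
        self.ram_capacity = ram_capacity
        self.ram = OrderedDict()   # fast tier, ordered from least to most recently used
        self.ssd = {}              # slower tier; unbounded here for simplicity

    def put(self, path: str, data: bytes) -> None:
        self.ssd.pop(path, None)               # keep each path in one tier only
        self.ram[path] = data
        self.ram.move_to_end(path)             # mark as most recently used
        while len(self.ram) > self.ram_capacity:
            old_path, old_data = self.ram.popitem(last=False)  # evict LRU entry
            self.ssd[old_path] = old_data      # demote instead of discarding

    def tier_of(self, path: str) -> str:
        if path in self.ram:
            return "RAM"
        if path in self.ssd:
            return "SSD"
        return "not cached"


if __name__ == "__main__":
    cache = TieredCache(ram_capacity=2)
    cache.put("/trades/us", b"...")
    cache.put("/trades/top", b"...")
    cache.put("/customers/145", b"...")      # RAM is full, so /trades/us is demoted
    print(cache.tier_of("/trades/us"))       # SSD
    print(cache.tier_of("/customers/145"))   # RAM
```

Demoted data stays local and remains faster to reach than the remote bucket, which is the resource-efficiency point of the slide.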
29. Feature Highlight: Policy-driven Data Management
[Diagram: Frameworks (Spark, Presto, Hive, TensorFlow) over the RAM, SSD, and Disk tiers, with New Trades data arriving]
Policy defined: Move data > 90 days old to S3 Standard
Policy interval: Every day
Policy applied every day
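A policy like the one on this slide can be sketched as a selection step that runs once per interval and picks the files older than the threshold. The sketch assumes that "age" means time since last modification and uses hypothetical paths and a hypothetical function name; it does not show how Alluxio defines or executes policies.

```python
from datetime import datetime, timedelta


def select_for_migration(files: dict, now: datetime, max_age_days: int = 90) -> list:
    """Return paths whose last-modified time is more than max_age_days before now."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(path for path, modified in files.items() if modified < cutoff)


if __name__ == "__main__":
    now = datetime(2020, 1, 21)
    files = {
        "/trades/2019-09-01": datetime(2019, 9, 1),   # 142 days old
        "/trades/2020-01-15": datetime(2020, 1, 15),  # 6 days old
    }
    # A scheduler would run this once per policy interval (every day on the slide).
    print(select_for_migration(files, now))  # ['/trades/2019-09-01']
```

The selected paths would then be handed to whatever mechanism actually moves the data to the target storage class.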
30. Alluxio Structured Data Management Preview
[Diagram: Presto with Hive Connector and Alluxio Connector; Alluxio Caching Service, Alluxio Catalog Service, and Alluxio Transformation Service; Hive Metastore; Storage]
31. Starburst Presto + Alluxio AMI & CFT
AMI & CFT:
https://aws.amazon.com/marketplace/pp/Starburst-Starburst-Enterprise-Presto-with-Caching/B07ZTHJ9YF
Documentation:
https://docs.starburstdata.com/latest/aws/deploy_caching.html
Tutorial:
https://www.alluxio.io/products/aws/starburst-alluxio-cft-tutorial/