Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack

Alluxio Tech Talk
January 21, 2020

Speakers:
Matt Fuller, Starburst
Dipti Borkar, Alluxio

With the advent of public clouds, and with data increasingly siloed across many locations, both on premises and in the public cloud, enterprises are looking for more flexible and higher-performance ways to analyze their structured data.

Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly concurrent, low-latency analytics platform. This stack provides a strong solution for running fast SQL across multiple storage systems, including HDFS, S3, and others, in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about:

- The architecture of Presto, an open source distributed SQL engine
- How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics
- Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted
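
The cross-job caching idea in the last bullet can be illustrated with a toy read-through cache. This is a minimal sketch under stated assumptions, not Alluxio's API: the class `ReadThroughCache`, the function `slow_object_store_read`, the file paths, and the LRU eviction rule are all hypothetical and chosen only to show why repeated reads from different jobs stop hitting the object store.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Toy read-through cache with LRU eviction (illustration only, not Alluxio's API)."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int) -> None:
        self._fetch = fetch            # hypothetical slow read from the under store (e.g. S3)
        self._capacity = capacity      # maximum number of files kept in the cache
        self._entries = OrderedDict()  # path -> bytes, least recently used first
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._entries:
            # Cache hit: serve locally and mark the entry as most recently used.
            self._entries.move_to_end(path)
            self.hits += 1
            return self._entries[path]
        # Cache miss: fetch from the under store and keep a copy for later readers.
        self.misses += 1
        data = self._fetch(path)
        self._entries[path] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used file
        return data


def slow_object_store_read(path: str) -> bytes:
    """Stand-in for a remote object store read (assumption for the example)."""
    return f"contents of {path}".encode()


cache = ReadThroughCache(slow_object_store_read, capacity=2)
cache.read("/trades/us")   # first job: miss, data is fetched and cached
cache.read("/trades/us")   # second job: hit, no object store access
cache.read("/trades/top")  # miss
```

In this run the second read of `/trades/us` is served from the cache, so `cache.hits` is 1 and `cache.misses` is 2.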

  1. Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack. Matt Fuller | Co-founder & VP, Engineering; Dipti Borkar | VP, Product
  2. About Me: Matt Fuller, Co-Founder at Starburst. matt@starburstdata.com, www.linkedin.com/in/mfuller/
  3. Starburst: SQL on Anything. Query anything, anywhere.
  4. Company Overview
     Founded 2017
     • Team includes the creators of Presto and many of the largest committers, contributors, and community members of Presto
     • Former Facebook, Teradata, Vertica, Netezza, and Ab Initio
     Enterprise Presto Offering
     • AWS, Azure, GCP, On Premises
     • Kubernetes
  5. Why Presto?
     Speed: Fast federated ANSI SQL engine
     • Proven scalability
     • High concurrency
     • Cost-based query optimization
     Efficiency: Separation of storage & compute
     • Scale storage & compute independently
     • No ETL required
     • SQL-on-anything
     Freedom: Open Source; No vendor lock-in
     • No Hadoop vendor lock-in
     • No storage vendor lock-in
     • No cloud vendor lock-in
     • Community driven
  6. Why Starburst?
     Even Faster Speed: Starburst Distro performs faster
     • Fully tested, stable releases
     • Curated by the Presto creators
     • Most up-to-date cost-based query optimizer
     Enterprise-Grade Features: Security, automation & connectors
     • RBAC + data encryption
     • Automated cluster deployment
     • Auto scaling + graceful shutdown
     • 36+ connectors
     24x7 Support: From the Presto experts
     • 24x7 we’ve got your back
     • Hot fixes + security patches
     • Access to customer success team of data architects
  7. Presto Architecture [diagram labels: Coordinator (Parser, Optimizer, Scheduler); Workers; Processors; Data Sources, including Azure SQL Database]
  8. Presto Extensibility with Connectors [diagram: the Presto Coordinator and Presto Worker reach connectors through four SPIs (Metadata, Data Statistics, Data Stream, Data Location); each SPI is shown with the Distributed, Cassandra, Kafka, Teradata, and Snowflake connectors]
  9. Starburst Product Offerings
     • Starburst Presto Community: Free version of Starburst Presto that includes limited additional features.
     • Starburst Presto Enterprise: Starburst Presto built for the enterprise that includes additional features & connectors, security integrations, premium 24x7 support, rigorous testing, patch releases/hotfixes, long term support, additional tooling, and cloud integrations.
  10. Distributed Storage Connector
      • Access data stored in scalable and cost-effective storage
        ○ HDFS
        ○ AWS S3
        ○ Google GCS
        ○ Azure Blob & ADLS
        ○ S3-compatible (e.g. Minio, Ceph)
      • Schema information stored in Hive Metastore or AWS Glue Catalog
      • Uses “Hive-style” table format
      • Partitions and bucketing are recognized and used
      • Does not use the Hive runtime to perform execution
  11. Relational Database Connectivity
      • Query relational data through Presto as the consumption layer
      • Federate over multiple data sources
      • Supported sources: MySQL, PostgreSQL, Redshift, SQL Server, Google BigQuery, Oracle, DB2, Teradata, Snowflake
  12. Non Relational Data Sources
      • Apache Accumulo
      • Apache Cassandra
      • Apache Phoenix
      • Elasticsearch
      • Apache Kafka
      • Apache Kudu
      • MongoDB
      • Redis
  13. The Alluxio Story
      • Originated as the Tachyon project at UC Berkeley’s AMP Lab by then Ph.D. student & now Alluxio CTO, Haoyuan (H.Y.) Li
      • 2015: Open Source project established & company to commercialize Alluxio founded
      • Goal: Orchestrate Data for the Cloud for data driven apps such as Big Data Analytics, ML and AI
      • Focus: Accelerating modern app frameworks running on HDFS/S3-based data lakes or warehouses
      [award badges: Hot Top 10 Big Data 2020; Impact 50 2019; Trend-setting product 2019]
  14. Alluxio: Data-Driven Innovation Across Industries [logo groups: Consumer; Travel & Transportation; Telco & Media; Technology; Financial Services; Retail & Entertainment; Data & Analytics Services]
  15. Data Orchestration for the Cloud: Enable innovation with any frameworks running on data stored anywhere
      • Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
      • Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
      • Users: Data Analyst, Data Engineer, Storage Ops, Data Scientist, Lines of Business
  16. Alluxio Data Orchestration for the Cloud: Structured Data Catalog, Intelligent Caching, Data Transformation, Data Management, Global Namespace
  17. Where are you in the cloud journey?
      • “I’m all in the cloud”
      • “I want a hybrid cloud”
      • “I want to migrate”
      • “Hadoop in the DC”
      Options shown: EMR w/ S3 | EC2 installed; Dataproc w/ GCS | GCE installed; HDInsights w/ Blob | VM installed; “Separate Compute & Storage Tiers”
  18. Alluxio Cloud Data Orchestration: Alluxio enables compute!
      Problem: Object stores have inconsistent performance for analytics and AI workloads
      • SLAs are hard to achieve
      • S3 metadata operations are expensive
      • Copied data storage costs add up, making the solution expensive
      Solution: Consistent high performance
      • Performance increases range from 1.5X to 10X
      • Dramatically reduced operational costs, up to 80%
      [diagram: Spark, Presto, Hive, and TensorFlow on public cloud IaaS over the Alluxio Data Orchestration and Control Service]
  19. Takeaways
      • Nearly 2x performance improvement for small range queries
      • Much more concurrency with Alluxio
      • This means ½ the compute costs or 2x more capacity with the same environment
  20. Now Available: Starburst Presto + Alluxio on
      • AWS AMI pre-configured to speed up Presto queries using Alluxio caching
      • 2x - 5x performance boost depending on dataset and workload
      • Tutorial: https://www.alluxio.io/products/aws/starburst-alluxio-cft-tutorial/
      • AMI & CFT: https://aws.amazon.com/marketplace/pp/Starburst-Starburst-Enterprise-Presto-with-Caching/B07ZTHJ9YF
  21. Data Engineers not efficient as data not available [diagram: Compute and Storage are each elastic (2–5 mins); obtaining a dataset is not elastic (2–4 weeks) and involves Request Data, Request Review, Find Dataset, Code Script/Job, Run ETL jobs, Grant Permissions]
  22. Goal: Enable data workloads in the cloud on existing on-prem data
      Restrictions
      • Data cannot be persisted in a public cloud
      • Additional I/O capacity cannot be added to existing Hadoop infrastructure
      • On-prem level security needs to be maintained
      • Network bandwidth utilization needs to be minimal
      Alternatives
      • Lift and Shift
      • Data copy by workload
      • “Zero-copy” Bursting
  23. “Zero-copy” bursting to scale to the cloud
      Problem: HDFS cluster is compute-bound & complex to maintain
      Barrier 1: Prohibitive network latency and bandwidth limits
      • Makes hybrid analytics unfeasible
      Barrier 2: Copying data to cloud
      • Difficult to maintain copies
      • Data security and governance
      • Costs of another silo
      Step 1: Hybrid Cloud for Burst Compute Capacity
      • Orchestrates compute access to on-prem data
      • Working set of data, not FULL set of data
      • Local performance
      • Scales elastically
      • On-Prem Cluster Offload (both Compute & I/O)
      Step 2: Online Migration of Data Per Policy
      • Flexible timing to migrate, with fewer dependencies
      • Instead of hard switch over, migrate at own pace
      • Moves the data per policy, e.g. last 7 days
      [diagram: Spark, Presto, Hive, and TensorFlow over the Alluxio Data Orchestration and Control Service in the AWS public cloud (IaaS), connected to the same stack in the on-premises datacenter]
  24. High Level Architecture
  25. Alluxio Reference Architecture [diagram labels: Applications with Alluxio Clients; Alluxio Master and Standby Master (Zookeeper / RAFT); Alluxio Workers (RAM / SSD / HDD); WAN; Under Store 1; Under Store 2]
  26. Feature Highlight: Data Caching for faster compute [diagram: Spark, Presto, Hive, and TensorFlow send data requests through the framework to RAM, SSD, and Disk tiers in front of the buckets Trades and Customers; the files /trades/us and /trades/top are read repeatedly; variable latency with throttling]
  27. “Zero-copy” bursting under the hood [diagram: Spark, Presto, Hive, and TensorFlow send data requests through the framework to a RAM tier in front of the Trades and Customers directories; the files /trades/us and /trades/top are read repeatedly; variable latency with throttling]
  28. Feature Highlight: Intelligent Tiering for resource efficiency [diagram: Spark, Presto, Hive, and TensorFlow send data requests through the framework to RAM, SSD, and Disk tiers in front of the buckets Trades and Customers; read file /customers/145; out of memory; data moved to another tier; variable latency with throttling]
  29. Feature Highlight: Policy-driven Data Management
      • New Trades policy defined: move data > 90 days old to S3 Standard
      • Policy interval: every day
      • Policy applied every day
      [diagram: Spark, Presto, Hive, and TensorFlow over the framework with RAM, SSD, and Disk tiers]
  30. Alluxio Structured Data Management Preview [diagram labels: Presto; Hive Connector; Alluxio Connector; Alluxio Caching Service; Alluxio Catalog Service; Alluxio Transformation Service; Hive Metastore; Storage]
  31. Starburst Presto + Alluxio AMI & CFT
      • AMI & CFT: https://aws.amazon.com/marketplace/pp/Starburst-Starburst-Enterprise-Presto-with-Caching/B07ZTHJ9YF
      • Documentation: https://docs.starburstdata.com/latest/aws/deploy_caching.html
      • Tutorial: https://www.alluxio.io/products/aws/starburst-alluxio-cft-tutorial/
  32. Questions? Matt Fuller | matt@starburstdata.com; Dipti Borkar | dipti@alluxio.com
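
The policy on the policy-driven data management slide (move data more than 90 days old to S3 Standard, evaluated every day) can be sketched as a simple selection step. This is an illustration under assumptions, not Alluxio's policy engine: `FileInfo`, `select_for_migration`, the file paths, and the use of a last-modified date to measure age are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class FileInfo:
    path: str            # hypothetical file path
    last_modified: date  # assumption: age is measured from last modification


def select_for_migration(files, today, max_age_days=90):
    """Return paths of files strictly older than max_age_days as of `today`."""
    cutoff = today - timedelta(days=max_age_days)
    return [f.path for f in files if f.last_modified < cutoff]


files = [
    FileInfo("/trades/2019-09-01.parquet", date(2019, 9, 1)),
    FileInfo("/trades/2020-01-15.parquet", date(2020, 1, 15)),
]

# One daily policy run: only files older than 90 days are picked for the move.
to_move = select_for_migration(files, today=date(2020, 1, 21))
```

A scheduler would call `select_for_migration` once per policy interval and hand the resulting paths to whatever component performs the move; that part is left out because the slides do not describe it.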
