Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Presto: Query Anything - Data Engineer’s perspective
Speakers:
Kamil Bajda-Pawlikowski, Starburst, Presto Company
Martin Traverso, Presto Software Foundation
For more Alluxio events: https://www.alluxio.io/events/
1. Query Anything - Data Engineer’s perspective
Kamil Bajda-Pawlikowski
Co-founder / CTO
@prestosql @starburstdata
Data Orchestration Summit
Nov 2019 @ Mountain View
Martin Traverso
Creator of Presto
2. Why Presto?
Community-driven open source project
High performance ANSI SQL engine
• Cost-Based Query Optimizer
• Proven scalability
• High concurrency
Separation of compute and storage
• Scale storage and compute independently
• No ETL or data integration necessary to get to insights
• SQL-on-anything
No vendor lock-in
• No Hadoop distro vendor lock-in
• No storage engine vendor lock-in
• No cloud vendor lock-in
3. Built for Performance
● MPP-style pipelined in-memory execution
● Multi-threaded multi-core execution
● Columnar and vectorized data processing
● Runtime query bytecode compilation
● Memory efficient data structures
● Optimized readers for columnar formats (ORC and Parquet)
● Predicate and column projection pushdown
● Cost-Based Optimizer
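To make "predicate and column projection pushdown" concrete, here is a toy sketch in Python. It is not Presto's reader code; the `RowGroup` class and `scan` function are hypothetical names invented for illustration. The idea it mimics is that columnar formats such as ORC and Parquet keep per-chunk min/max statistics, so a reader can skip whole chunks that cannot match a filter and materialize only the requested columns.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for an ORC stripe or Parquet row group."""
    columns: dict  # column name -> list of values, one list per column

    def stats(self, column):
        # Real formats store min/max in file metadata; here we compute it on demand.
        values = self.columns[column]
        return min(values), max(values)


def scan(row_groups, wanted_columns, filter_column, lower_bound):
    """Return rows with filter_column >= lower_bound, reading only wanted_columns."""
    rows = []
    groups_read = 0
    for group in row_groups:
        _, max_value = group.stats(filter_column)
        if max_value < lower_bound:
            continue  # predicate pushdown: no row in this group can match
        groups_read += 1
        matching = [i for i, v in enumerate(group.columns[filter_column])
                    if v >= lower_bound]
        for i in matching:
            # projection pushdown: only the requested columns are materialized
            rows.append({c: group.columns[c][i] for c in wanted_columns})
    return rows, groups_read


groups = [
    RowGroup({"year": [1999, 2001], "title": ["a", "b"], "rating": [1, 2]}),
    RowGroup({"year": [2015, 2019], "title": ["c", "d"], "rating": [3, 4]}),
]
rows, groups_read = scan(groups, ["title"], "year", 2010)
print(rows, groups_read)
```

In this sketch the first group is skipped entirely because its maximum year is below the bound, and the `rating` column is never touched.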
5. Example - Join multiple sources
SELECT
country,
approx_percentile(date_diff('year', birthdate, now()), array[0.25, 0.5, 0.75])
FROM
elasticsearch.default."movies: overview:space~ +fiction" movies
JOIN hive.default.views USING (movie_id)
JOIN mysql.default.users USING (user_id)
GROUP BY ROLLUP(country)
Per-country age distribution of people who watched space fiction movies
6. Example - Join historical with recent data
CREATE VIEW visits AS
TABLE hive.visits_historical
UNION ALL
TABLE mysql.visits_recent;
SELECT city, count(*) total
FROM visits
GROUP BY city
ORDER BY total DESC;
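The view-over-two-sources pattern can be tried locally without a Presto cluster. The sketch below is a stand-in, not Presto: it uses Python's built-in `sqlite3` module and attaches two in-memory databases named `hive` and `mysql` purely to imitate two catalogs. The sample rows are made up, and SQLite needs `SELECT ... FROM` where Presto accepts the `TABLE` shorthand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two attached in-memory databases play the role of two Presto catalogs.
conn.execute("ATTACH DATABASE ':memory:' AS hive")
conn.execute("ATTACH DATABASE ':memory:' AS mysql")

conn.execute("CREATE TABLE hive.visits_historical (city TEXT)")
conn.execute("CREATE TABLE mysql.visits_recent (city TEXT)")

# Illustrative sample data only.
conn.executemany("INSERT INTO hive.visits_historical VALUES (?)",
                 [("Paris",), ("Paris",), ("Oslo",)])
conn.executemany("INSERT INTO mysql.visits_recent VALUES (?)",
                 [("Oslo",), ("Paris",), ("Lima",)])

# SQLite only lets TEMP views span attached databases.
conn.execute("""
    CREATE TEMP VIEW visits AS
    SELECT city FROM hive.visits_historical
    UNION ALL
    SELECT city FROM mysql.visits_recent
""")

totals = conn.execute("""
    SELECT city, count(*) AS total
    FROM visits
    GROUP BY city
    ORDER BY total DESC, city
""").fetchall()
print(totals)
```

Consumers of `visits` never need to know which rows came from the historical store and which from the recent one, which is the point of the slide.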
8. Presto Software Foundation
“An independent, non-profit organization with the mission of supporting a community
of passionate users and developers devoted to the advancement of the Presto
distributed SQL query engine for big data.”
“It is dedicated to preserving the vision of high quality, performant, and dependable
software.”
“Ensuring the project remains open, collaborative and independent for decades to
come.”
10. Recent Improvements (last ~10 months)
● FETCH FIRST … WITH TIES syntax
● OFFSET syntax
● COMMENT ON <table> IS …
● [LEFT/RIGHT/FULL] JOIN LATERAL (…) ON
● IGNORE NULLS for window functions
● .* for ROW expressions
● Pass-through security (client-provided credentials)
● Impersonation for Hive Metastore
● Kerberos security improvements
● Support for Hadoop KMS
● Role-based security
● Secure query results in client API
● Current user security mode for views
● Support for Azure Data Lake
● Hive Bucketing V2
● Docker image
● Spill-to-disk improvements
● CLI output formats
● Syntax highlighting in CLI
● UUID type and functions
● format(), combinations() functions
● ORC bloom filters (non-legacy)
● Connector-provided view definitions
● Elasticsearch Connector
● Google Sheets Connector
● Amazon Kinesis Connector
● Apache Phoenix Connector
● LZ4/ZSTD support for ORC/Parquet
● More type mappings for various connectors
● Performance improvements for GCS and S3
● Performance improvements for UNNEST
… and more! https://prestosql.io/docs/current/release.htm
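Of the syntax additions above, FETCH FIRST … WITH TIES is perhaps the least self-explanatory. The following Python sketch illustrates its semantics only: it is not how Presto implements it, and `fetch_first_with_ties` is a hypothetical helper. After ordering, the first n rows are returned together with any further rows whose sort key equals that of the n-th row.

```python
def fetch_first_with_ties(rows, n, key):
    """Return the first n rows by key, plus any rows tied with the n-th row."""
    ordered = sorted(rows, key=key)  # stable sort, like ORDER BY on one key
    if n <= 0:
        return []
    if n >= len(ordered):
        return ordered
    cutoff = key(ordered[n - 1])
    result = ordered[:n]
    for row in ordered[n:]:
        if key(row) != cutoff:
            break  # first row past the tie group: stop
        result.append(row)
    return result


scores = [("ann", 10), ("bob", 20), ("cat", 20), ("dan", 30)]
top_two = fetch_first_with_ties(scores, 2, key=lambda r: r[1])
print(top_two)
```

Here a plain "first 2 rows" would arbitrarily drop either `bob` or `cat`; with ties, both are kept because they share the cutoff value.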
13. Starburst: SQL on Anything, Anywhere
Data Orchestration with caching, even with remote data
A dozen more orchestrated cloud data sources
14. Available Soon: Starburst Presto + Alluxio on AWS
▪ AWS AMI pre-configured to speed up Presto queries using Alluxio caching
▪ Start in minutes: AWS CloudFormation Template to create a Presto Alluxio cluster
▪ Seamless Hive Metastore / AWS Glue integration, no location / path changes needed
▪ Tutorial: https://www.alluxio.io/products/aws/starburst-alluxio-cft-tutorial/
15. Administrative challenges
● Configuring and managing clusters
● Autotuning properties based on the hardware provisioned
● High Availability for Presto Coordinator
● Scaling cluster elastically based on query load
● Gracefully decommissioning Presto Workers to avoid killing queries
● Monitoring of hardware and software layers
https://www.starburstdata.com/technical-blog/presto-on-kubernetes/
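As one illustration of the "gracefully decommissioning" item, Presto documents a graceful shutdown that is triggered by a `PUT` to a worker's `/v1/info/state` endpoint with the JSON string `"SHUTTING_DOWN"` as the body. The sketch below only builds that request using the Python standard library. The host name, port, and user are placeholders, and the `X-Presto-User` header is an assumption that may or may not be required depending on version and security setup.

```python
import json
import urllib.request


def build_shutdown_request(worker_url, user="admin"):
    """Build (but do not send) a graceful-shutdown request for one worker."""
    # The body is a JSON string literal, quotes included.
    body = json.dumps("SHUTTING_DOWN").encode("utf-8")
    return urllib.request.Request(
        url=worker_url.rstrip("/") + "/v1/info/state",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "X-Presto-User": user,  # assumption: may be needed on some versions
        },
    )


# Hypothetical worker address, for illustration only.
req = build_shutdown_request("http://worker-1.example.internal:8080")
print(req.get_method(), req.full_url, req.data)

# To actually send it against a real worker:
# urllib.request.urlopen(req)
```

A worker that accepts the request is expected to stop taking new tasks and exit once its running tasks finish, which is why an orchestrator would call this before removing the node rather than killing the process.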
16. Presto on Kubernetes (K8S)
https://docs.starburstdata.com/latest/kubernetes.html
[Diagram components: Presto Operator (K8s Operator), Horizontal Pod Autoscaler (HPA), Presto Service, Presto Coordinator pod, Presto Worker pods (three shown), Hive Metastore Service pod, Hadoop / Hive, RDBMS]
● Red Hat OpenShift
● Google (GKE)
● Azure (AKS)
● Amazon (EKS)
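For readers unfamiliar with the Horizontal Pod Autoscaler, its documented core rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with no change when the ratio is within a tolerance of 1.0 (0.1 by default in Kubernetes). The Python sketch below restates that rule for a pool of workers. It is a simplification: the metric values, the replica bounds, and the function name are illustrative assumptions, and nothing here reflects how Starburst configures its deployment.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Simplified form of the Kubernetes HPA scaling rule."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (illustrative defaults).
    return max(min_replicas, min(max_replicas, desired))


# Example: 4 workers averaging 90% CPU against a 60% target.
scale_up = desired_replicas(4, 90, 60)
# Example: 4 workers averaging 62% CPU, within tolerance of the target.
steady = desired_replicas(4, 62, 60)
# Example: 4 workers averaging 15% CPU.
scale_down = desired_replicas(4, 15, 60)
print(scale_up, steady, scale_down)
```

Scaling down is where graceful worker decommissioning matters, since removing a worker abruptly would fail the queries running on it.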