Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has seen unprecedented growth in popularity over the last few years, in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
6. Starburst
Enterprise Grade Security
On-Prem or Cloud
Rapid Time to Insights
Low Cost of Ownership
24x7 Expert Support
ANSI SQL MPP Query Engine
High Concurrency
Our Platform
Named Open Source Startup to Watch 2020
600% Growth YoY
100+ Enterprise Customers
NPS Score 80+
Massive Scale
7. Starburst Enterprise Presto
Performance
▪ From petabytes to exabytes: query data from disparate sources using SQL, with high concurrency
▪ Control your price/performance with the latest cost-based optimizer
▪ Caching available for frequently accessed data
Connectivity
▪ 30+ supported enterprise connectors
▪ High performance parallel connectors for Oracle, Teradata, Snowflake and more
Security
▪ Kerberos & LDAP integration
▪ Global Security for fine-grained Access Control
▪ Data encryption
▪ Data masking
▪ Query auditing
Management
▪ Configuration
▪ Autoscaling
▪ High availability
▪ Monitoring
▪ Deploy anywhere
Support
▪ The largest team of Presto experts in the world
▪ Fully-tested, stable releases, curated by the Presto creators
▪ Hot fixes & security patches
▪ 24x7 support, 365 days a year. We’ve got your back
10. Why Delta Lake?
▪ ACID properties over data lake
▪ Open source table format
▪ Stored as Parquet files
▪ Object storage support
▪ Schema evolution
▪ Time travel feature
▪ Metadata & statistics
▪ Data skipping & z-ordering
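To make the time travel bullet more concrete, here is a minimal sketch of how a reader could reconstruct the set of live Parquet files at a given table version by replaying `add` and `remove` actions from the Delta transaction log. The function name `live_files` and the in-memory `log` dictionary are hypothetical; a real reader would also handle checkpoints, metadata and protocol actions, which are ignored here.

```python
import json


def live_files(log_lines_by_version, as_of_version):
    """Replay add/remove actions up to as_of_version (inclusive).

    log_lines_by_version maps a commit version to the JSON lines of that
    commit. This is a simplified stand-in for the files under _delta_log/.
    """
    files = set()
    for version in sorted(log_lines_by_version):
        if version > as_of_version:
            break  # later commits are invisible to a time travel read
        for line in log_lines_by_version[version]:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# Hypothetical three-commit history: two appends, then a rewrite of a.parquet.
log = {
    0: ['{"add": {"path": "a.parquet"}}'],
    1: ['{"add": {"path": "b.parquet"}}'],
    2: ['{"remove": {"path": "a.parquet"}}', '{"add": {"path": "c.parquet"}}'],
}
```

Reading the table "as of" version 1 uses the files known after the first two commits, while the latest version no longer includes the rewritten file.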
11. Native Presto Delta Lake Reader
Supports data skipping & dynamic filtering
Optimizes query using file statistics
Supports reading the Delta transaction log
Native connector written from scratch
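Data skipping with file statistics can be illustrated with a small sketch: if each data file carries a minimum and maximum value for a column, a reader can skip every file whose range cannot match the query predicate. The `FileStats` class and `files_to_scan` function below are hypothetical names for illustration and are not the connector's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_value: int  # smallest value of the filtered column in this file
    max_value: int  # largest value of the filtered column in this file


def files_to_scan(files, lower, upper):
    """Keep only files whose [min, max] range overlaps [lower, upper]."""
    return [
        f.path
        for f in files
        if f.max_value >= lower and f.min_value <= upper
    ]


# Hypothetical statistics for three Parquet files of one table.
files = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]
```

For a predicate such as `BETWEEN 150 AND 180`, only `part-1.parquet` needs to be opened. The tighter the per-file ranges (which is what z-ordering aims for), the more files can be skipped.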
12. Native Delta Lake Reader Performance
Standard TPC-H benchmark:
▪ 2x average speedup across 22 queries
▪ 6x best query speedup
Feedback from customers:
▪ “What we have here is game changing for our industry. Especially now that the native Delta reader works as fast as it does. We have people lining up to now use this data”
▪ “We have queries that were running in 10 minutes that are now running in 47 seconds”
14. Starburst Platform
Data Scientists, Data Analysts, Finance, Marketers
The Data Consumption Layer: existing analytics tools
Global Security: Data Masking, Column + Row-level permissions, Query Auditing, Fine-grained access control, Data Encryption
Data Lakes, Relational Databases, NoSQL Stores, Publish/Subscribe (Azure Event Hub)
15. Different SQL Technologies In Your Toolbelt
Streaming Ingestion
Machine Learning
Data Investigation
Large Batch Jobs
Fast Federated Queries
High Concurrency SQL Engine
High Performance Ad Hoc Reporting/Analytics
Optionality
Cloud Data Warehouse
Rapid Ad Hoc Reporting/Analytics
Fast, but everything must live in Snowflake (ETL/ELT is required)
Vendor and data lock-in
19. Data Flow Diagram
Using a combination of Databricks and Starburst Presto to bring a full data ingestion and analytical environment to life
20. Data Ingestion and Transformation
● Real-time ingestion of event data into Delta tables
● Customer and inventory data ingested every hour
● Modified customer information merged into Delta Lake table
● Data marts created using streaming and batch data
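The merge step for modified customer information follows upsert semantics: rows that match an existing key are updated, and rows with a new key are inserted. The sketch below shows only that logic on in-memory rows. The function `merge_customers` and the column names are hypothetical, and in the pipeline described here the merge would run transactionally against the Delta Lake table rather than on Python lists.

```python
def merge_customers(target, updates, key="customer_id"):
    """Upsert: update rows whose key matches, insert rows whose key is new."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        # Matched: overwrite changed columns. Not matched: insert the row.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return sorted(merged.values(), key=lambda r: r[key])


# Hypothetical current table contents and one hourly batch of changes.
target = [
    {"customer_id": 1, "email": "a@example.com", "region": "US"},
    {"customer_id": 2, "email": "b@example.com", "region": "EU"},
]
updates = [
    {"customer_id": 2, "email": "b.new@example.com"},
    {"customer_id": 3, "email": "c@example.com", "region": "US"},
]
result = merge_customers(target, updates)
```

Customer 2 keeps its region but gets the new email address, customer 3 is appended, and customer 1 is left untouched.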
21. Query-time Data Federation
● Single point of access to numerous data sources
● Query Delta Lake and federate with legacy databases as well as many NoSQL data stores
● Enforce table, column and row level policies to ensure maximum data security
● Mask column data for different groups and users
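Row-level policies and column masking can be pictured as a filter and a transform applied to every result before it reaches the user. The following is a minimal sketch under assumed names: the `POLICIES` table, the group names and the `mask` rule (keep the last four characters) are all hypothetical and do not reflect the product's configuration format.

```python
def mask(value):
    """Replace all but the last four characters with asterisks."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


# Hypothetical per-group policies: which rows are visible, which columns masked.
POLICIES = {
    "analyst_us": {
        "row_filter": lambda row: row["region"] == "US",
        "masked": {"phone"},
    },
    "admin": {
        "row_filter": lambda row: True,
        "masked": set(),
    },
}


def read_customers(rows, group):
    """Apply the group's row filter first, then mask its restricted columns."""
    policy = POLICIES[group]
    visible = [row for row in rows if policy["row_filter"](row)]
    return [
        {col: mask(val) if col in policy["masked"] else val
         for col, val in row.items()}
        for row in visible
    ]


rows = [
    {"id": 1, "region": "US", "phone": "5551234567"},
    {"id": 2, "region": "EU", "phone": "5559876543"},
]
```

With these sample rows, a member of `analyst_us` sees only the US customer with a masked phone number, while `admin` sees both rows unchanged.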
22. Data Consumption & Analytics
BI Reporting Tools
SQL Query Tools
• Connect using a variety of BI and SQL tools, including Looker, Tableau, Power BI and DBeaver
• Connect via JDBC, ODBC and many client libraries, including Python, R and Java
-- One query spanning three catalogs: delta, snowflake and elastic
SELECT c.id, COUNT(*), SUM(active_seconds)
FROM delta.iot.events e
JOIN snowflake.sales.customer c ON (e.customer_id = c.id)
WHERE e.event_date >= current_date
  AND c.region = 'US'
  AND c.id IN
    (SELECT l.customer_id
     FROM elastic.web.logs l
     WHERE l.visit_date >= date '2020-01-01')
GROUP BY c.id;
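The shape of this federated query can be tried out locally with a toy emulation. The sketch below uses Python's built-in `sqlite3` module and three attached in-memory databases that merely stand in for the `delta`, `snowflake` and `elastic` catalogs. It is not how Presto executes the query, the sample rows are invented, the table names are shortened to two parts because SQLite has no catalog.schema.table naming, and the date cutoff is passed as a parameter so the result is reproducible.

```python
import sqlite3


def build_toy_catalogs():
    """Create three in-memory databases standing in for the three catalogs."""
    conn = sqlite3.connect(":memory:")
    for catalog in ("delta", "snowflake", "elastic"):
        conn.execute(f"ATTACH DATABASE ':memory:' AS {catalog}")
    conn.execute("CREATE TABLE delta.events "
                 "(customer_id INTEGER, event_date TEXT, active_seconds INTEGER)")
    conn.execute("CREATE TABLE snowflake.customer (id INTEGER, region TEXT)")
    conn.execute("CREATE TABLE elastic.logs (customer_id INTEGER, visit_date TEXT)")
    conn.executemany("INSERT INTO delta.events VALUES (?, ?, ?)", [
        (1, "2020-06-01", 30), (1, "2020-06-02", 45),
        (2, "2020-06-01", 10), (3, "2020-06-01", 99),
        (1, "2019-01-01", 500),
    ])
    conn.executemany("INSERT INTO snowflake.customer VALUES (?, ?)",
                     [(1, "US"), (2, "EU"), (3, "US")])
    conn.executemany("INSERT INTO elastic.logs VALUES (?, ?)",
                     [(1, "2020-03-15"), (3, "2019-12-31")])
    return conn


# Same structure as the slide's query, adapted to SQLite syntax.
QUERY = """
SELECT c.id, COUNT(*), SUM(e.active_seconds)
FROM delta.events e
JOIN snowflake.customer c ON e.customer_id = c.id
WHERE e.event_date >= ?
  AND c.region = 'US'
  AND c.id IN (SELECT l.customer_id
               FROM elastic.logs l
               WHERE l.visit_date >= '2020-01-01')
GROUP BY c.id
"""


def run_federated_query(conn, since):
    """Run the join across the three stand-in catalogs."""
    return conn.execute(QUERY, (since,)).fetchall()
```

With the sample rows and a cutoff of `2020-06-01`, only customer 1 qualifies: customer 2 is outside the US and customer 3 has no web visit on or after 2020-01-01.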