Presto @ Netflix: Interactive Queries at Petabyte Scale
Nezih Yigitbasi and Zhenxiao Luo
Outline
Our big data platform
Presto @ Netflix
Netflix integration
Our contributions
What’s next?
Netflix Data Pipeline
[Diagram. Event data (500 bn/day, 15m): Cloud Apps, Suro/Kafka, Ursula, S3. Dimension data (daily): Cassandra, SSTables, Aegisthus, S3.]
Our Big Data Platform
[Diagram labels: Data Warehouse, Service, Tools, Gateways, Clients, Clusters (Prod, Test, Query), Big Data API/Portal, Metacat]
Our Use Cases
Batch jobs (Pig, Hive)
ETL jobs
reporting and other analysis
Interactive jobs
Presto @ Netflix
What is Presto?
An open source distributed SQL engine for running interactive queries against large datasets
Why we love Presto?
Fast
[Chart: query completion time in seconds (axis 0 to 800), Presto vs. Hive, for three query types: Group By, Join + Group By, Needle in Haystack]
Scalable
ANSI SQL
Open source
Works well on AWS
Hadoop friendly
presto-cli, Python, R, BI tools (ODBC/JDBC), etc.
Our Deployment
v 0.100
Java 8™
1 coordinator (r3.4xlarge)
~220 workers (r3.4xlarge)
Clients
15+ PB total data size
2.5K queries/day
300+ Presto users
Data Size
[Chart: % of queries (axis 0 to 100) by data size: 100MB, 1GB, 1TB, 10TB]
Query Runtime
[Chart: % of queries (axis 0 to 100) by query runtime: 4s, 1m, 5m, 10m]
Netflix Integration
[Diagram labels: S3, Suro, Atlas, Sidecar, Presto, Amazon EMR, Amazon RDS, HCat Server, Coordinator, Worker, BI Tools]
Data Lineage: query completion events
Monitoring: metrics
Our Contributions
S3 Filesystem: multipart upload, instance credentials, role support, reliability
Query Optimizer: single distinct => group by, joins with similar subqueries
Parquet File Format: schema evolution, Parquet 1.6
Complex Types: various new functions, comparability
[Diagram: Presto architecture. Clients: presto-cli, ODBC/JDBC, other clients. Coordinator: Parser, Optimizer, Distributed Planner, Scheduler, Functions, Type System. Workers. S3. Steps numbered 1 to 7.]
Single Distinct => Group By
Original query:
select count(distinct c)
from t
Rewritten query:
select count(*)
from (select c
      from t
      group by c)
Original plan: Output <- Count Aggregation (masks = {column$distinct}) <- Distinct (marker = column$distinct) <- Table Scan
Rewritten plan: Output <- Count Aggregation (masks = {}) <- Group By Aggregation (count) <- Table Scan
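The rewrite above can be checked on a toy table. What follows is a minimal sketch that uses Python's built-in sqlite3 module rather than Presto itself; the table contents and the helper name distinct_counts are made up for illustration. It also shows a caveat: count(distinct c) ignores NULLs, while group by c produces a group for NULL, so the two forms as written on the slide only agree when c has no NULLs. Counting c instead of * in the outer query is one way to restore agreement.

```python
import sqlite3


def distinct_counts(values):
    """Return (original, rewritten, null_safe) counts for a one-column table t(c).

    Hypothetical helper for illustration; runs on SQLite, not Presto.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (c INTEGER)")
    conn.executemany("INSERT INTO t (c) VALUES (?)", [(v,) for v in values])

    # Original form from the slide.
    original = conn.execute(
        "SELECT count(DISTINCT c) FROM t"
    ).fetchone()[0]

    # Rewritten form from the slide: group first, then count the groups.
    rewritten = conn.execute(
        "SELECT count(*) FROM (SELECT c FROM t GROUP BY c)"
    ).fetchone()[0]

    # Variant that skips the NULL group (an assumption, not shown on the slide).
    null_safe = conn.execute(
        "SELECT count(c) FROM (SELECT c FROM t GROUP BY c)"
    ).fetchone()[0]

    conn.close()
    return original, rewritten, null_safe
```

Calling distinct_counts([1, 1, 2, 3, 3, 3]) gives the same count for all three forms, while a list containing None makes the plain count(*) rewrite count one extra group.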
Joins with Similar Subqueries
Original query:
select *
from (select k, agg1, agg2
      from t
      group by k) a
join (select k, agg3, agg4
      from t
      group by k) b
on (a.k = b.k)
Original plan: Output <- Join (key = k) <- two Group By Aggregations (key = k; one computes agg1, agg2, the other agg3, agg4), each <- Table Scan (table = t)
Rewritten query:
select k, agg1, agg2, agg3, agg4
from t
group by k
Rewritten plan: Output <- Group By Aggregation (key = k; agg1, agg2, agg3, agg4) <- Table Scan (table = t)
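As a rough check of this rewrite, the sketch below runs both query shapes on a small in-memory table with Python's built-in sqlite3 module, not Presto. The aggregates agg1 to agg4 are placeholders on the slide; sum, count, min and max are stand-ins chosen here, and the column v, the explicit column list in place of select *, and the helper name run_both are likewise assumptions. The join form drops rows whose key is NULL while the single group by keeps them, so the comparison assumes non-NULL keys.

```python
import sqlite3

# Two aggregations over the same table and key, joined on that key.
JOINED_SQL = """
SELECT a.k, a.agg1, a.agg2, b.agg3, b.agg4
FROM (SELECT k, sum(v) AS agg1, count(v) AS agg2 FROM t GROUP BY k) a
JOIN (SELECT k, min(v) AS agg3, max(v) AS agg4 FROM t GROUP BY k) b
  ON a.k = b.k
ORDER BY a.k
"""

# The same result computed with one scan and one aggregation.
MERGED_SQL = """
SELECT k, sum(v) AS agg1, count(v) AS agg2, min(v) AS agg3, max(v) AS agg4
FROM t
GROUP BY k
ORDER BY k
"""


def run_both(rows):
    """Run the joined and merged queries over rows of (k, v) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (k TEXT, v INTEGER)")
    conn.executemany("INSERT INTO t (k, v) VALUES (?, ?)", rows)
    joined = conn.execute(JOINED_SQL).fetchall()
    merged = conn.execute(MERGED_SQL).fetchall()
    conn.close()
    return joined, merged
```

For rows such as [("a", 1), ("a", 5), ("b", 2)], both queries return the same list of tuples, which is the property the optimizer relies on to replace two table scans and a join with a single scan.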
Complex Type Support
map<K,V>: map_agg(), map_keys(), map_values(); =, !=
row(F T): bug fixes
array<T>: array_join(), sort_array(), concat(); =, !=, <, >
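For readers unfamiliar with these functions, here is a rough illustration of what each one does, written as plain Python stand-ins that reuse the names from the slide. This is not Presto's implementation, and details such as NULL handling, duplicate map keys, and result ordering are assumptions that may differ in Presto.

```python
def map_agg(pairs):
    """Aggregate (key, value) rows into a single map."""
    return dict(pairs)


def map_keys(m):
    """Return the keys of a map as an array."""
    return list(m.keys())


def map_values(m):
    """Return the values of a map as an array."""
    return list(m.values())


def array_join(items, delimiter):
    """Join array elements into one string, skipping missing (None) elements."""
    return delimiter.join(str(x) for x in items if x is not None)


def sort_array(items):
    """Return the array in ascending order."""
    return sorted(items)


def concat(left, right):
    """Concatenate two arrays."""
    return list(left) + list(right)
```

Python lists already compare element by element, which is a reasonable mental model for the =, !=, < and > operators listed for array<T>.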
[Diagram: Presto architecture (clients, Coordinator, Workers, S3), with an S3 Filesystem component shown in the Workers]
Presto S3 FileSystem
(multipart upload, instance/static credentials, assume role, reliability, etc.)
[Diagram: filesystem calls open(), seek(), list() shown against the S3 requests Get Object, Get Object Metadata, List Objects]
[Diagram: Presto architecture (clients, Coordinator, Workers, S3), with S3 Filesystem and Parquet Cursor components shown in the Workers]
[Diagram: Parquet file layout]
RowGroup: Column Chunks (x3), each containing Pages
RowGroup Metadata: codec, encoding, etc.
Footer: schema, version, etc.
Column Metadata (x3): value count, size, min, max
What’s next?
Parquet optimizations
vectorized reader
predicate pushdown
lazy load
lazy decompression/decoding
Better resource management
Better BI tool integration
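Of the Parquet optimizations listed above, predicate pushdown is the easiest to illustrate: per-column min/max statistics let a reader skip whole row groups that cannot match a filter. The sketch below is a simplified model under stated assumptions, not Presto's or Parquet's reader; the classes ColumnStats and RowGroup and the function scan_between are hypothetical names, and each row group holds a single integer column.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ColumnStats:
    min_value: int
    max_value: int


@dataclass
class RowGroup:
    stats: ColumnStats   # min/max of the filtered column, as kept in metadata
    values: List[int]    # the column's values (would need I/O and decoding)


def may_contain(stats: ColumnStats, lower: int, upper: int) -> bool:
    """True if the [min, max] range overlaps the requested [lower, upper]."""
    return stats.max_value >= lower and stats.min_value <= upper


def scan_between(row_groups: List[RowGroup], lower: int, upper: int) -> Tuple[List[int], int]:
    """Return matching values and how many row groups were actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        if not may_contain(group.stats, lower, upper):
            continue  # skipped using metadata only, no values touched
        groups_read += 1
        matches.extend(v for v in group.values if lower <= v <= upper)
    return matches, groups_read
```

A row group that passes the min/max check may still contain no matching rows, so the statistics only prune work; the filter itself is still applied to the values that are read.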
THANK YOU

Editor's notes

  1. Meeting Notes (6/3/15 10:31): more storytelling here about why we chose Presto