aniketmokashi@
June 2018
Procella: A Fast Versatile SQL Query Engine
About me
● Tech Lead on YouTube Data Team
○ Procella
○ YouTube Data Warehouse
○ YouTube Analytics Backend
○ Data Quality: Anomaly Detection
● Prior to Google
○ Data Infra teams at Twitter, Netflix
○ Apache Parquet PMC
○ Apache Pig PMC
○ Contributor: Hadoop, Hive
Agenda
● Products at YouTube and Google
● Use cases and features
● Architecture
● Key features deep dive
● Q&A
YouTube Analytics (youtube.com/analytics)
YouTube Music Insights (youtube.com/artists)
And more...
● YouTube Metrics
● Firebase Performance Console
● Google Play Console
● YouTube Internal Analytics
● ...
About Procella
A widely used SQL query engine at Google
● Fully featured: Most of SQL++: structured data, joins, set ops, analytic functions
● Super fast: Most queries are sub-second.
● Highly scalable: Petabytes of data, thousands of tables, trillions of rows, ...
● High QPS: Millions of point queries @ msec, thousands of reporting queries @ sub-second, hundreds of ad-hoc queries @ seconds
● Designed for real-time: Millions of QPS instant ingestion, highly efficient, …
● Easy to use: SQL, DDL, Virtual tables, …
● Compatible: Capacitor, Dremel, ACLs, Encryption, …
External Reporting/Analytics
● Use cases
○ YouTube Analytics
○ Firebase Perf Monitoring
○ Google Play Console
○ Data Studio
● Properties
○ High QPS, low-latency ingestion and queries
○ Hundreds of metrics across dozens of dimensions
○ SQL
● Technology
○ On-the-fly aggregations
○ Indexing & partitioning
○ Real-time ingestion
○ Batch & real-time stitching
■ Lambda architecture (sketched below)
○ Caching
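To make the lambda stitching concrete, here is a minimal sketch of the idea: batch data is authoritative up to a watermark, real-time data covers everything after it, so each event is counted exactly once. The table layout, column names, and watermark variable are illustrative assumptions, not Procella's actual schema.

```python
from collections import Counter

def stitched_views(batch_rows, realtime_rows, batch_watermark):
    """Stitch batch and real-time data: batch rows are authoritative
    below the watermark, real-time rows at or above it, so each event
    is counted exactly once."""
    totals = Counter()
    for video_id, event_ts, views in batch_rows:
        if event_ts < batch_watermark:        # complete, compacted data
            totals[video_id] += views
    for video_id, event_ts, views in realtime_rows:
        if event_ts >= batch_watermark:       # freshly ingested rows
            totals[video_id] += views
    return totals

# Batch is complete for ts < 100; real-time fills in everything after.
batch = [("v1", 10, 500), ("v1", 99, 20), ("v2", 50, 300)]
realtime = [("v1", 100, 5), ("v2", 120, 7), ("v2", 99, 1)]  # ts 99 is batch's
print(stitched_views(batch, realtime, batch_watermark=100))
# Counter({'v1': 525, 'v2': 307})
```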
Internal Reporting/Dashboards
● Use cases
○ Dashboards
○ Experiments Analysis
○ Custom reporting
● Properties
○ Low QPS
○ Speed & scale
○ SQL
● Technology
○ Stateful caching
○ Columnar-optimized data format (Artus)
○ Vectorized evaluation
○ Virtual tables
○ Tools (Dremel) and format (Capacitor) compatibility
Real Time Insights/Monitoring
● Use cases
○ YTA Real Time (External)
○ YTA Insights (External)
○ YT Site Health (Internal)
● Properties
○ Native time-series support
○ Complex compaction, complex metrics
○ High scan rate
○ Very high real-time ingestion rate
● Technology
○ TTL
○ SQL-based compaction
○ Approximate algorithms (quantiles, top-k, ...)
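As one concrete example of the approximate algorithms just mentioned, here is a sketch of quantile estimation over a bounded uniform sample (reservoir sampling). It illustrates only the accuracy-for-memory trade-off; the slides do not say which sketch Procella actually uses.

```python
import random

class ReservoirQuantiles:
    """Approximate quantiles over a bounded uniform sample
    (reservoir sampling): fixed memory, bounded error."""

    def __init__(self, capacity=1024):
        self.capacity, self.sample, self.seen = capacity, [], 0

    def add(self, value):
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(value)
        else:
            j = random.randrange(self.seen)   # keep value w.p. capacity/seen
            if j < self.capacity:
                self.sample[j] = value

    def quantile(self, q):
        ordered = sorted(self.sample)
        return ordered[min(int(q * len(ordered)), len(ordered) - 1)]

r = ReservoirQuantiles()
for latency_ms in range(10_000):
    r.add(latency_ms)
print(r.quantile(0.99))  # close to 9900, within sampling error
```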
Real Time Stats Serving
● Use cases
○ YouTube metrics (subscriptions, likes, etc.) on various YT pages
● Properties
○ Millions of rows ingested per sec
○ Millions of queries per sec
○ Milliseconds response time
○ Simple point queries
○ Many replicas
● Technology
○ Indexable columnar format (Artus); point lookups sketched below
○ Heavily optimized serving path
○ Pubsub-based ingestion
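A minimal sketch of why an indexable columnar format makes millisecond point queries cheap: if a tablet's key column is kept sorted, a lookup is a binary search over keys plus a positional fetch from the value column. The column contents and key names here are invented for illustration.

```python
import bisect

def point_lookup(sorted_keys, values, key):
    """Point query against a tablet whose key column is sorted:
    O(log n) binary search to find the row, O(1) positional fetch
    from the requested value column."""
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return values[i]
    return None  # key not in this tablet

video_ids = ["v001", "v002", "v005", "v009"]  # sorted key column
view_counts = [525, 307, 12, 9001]            # parallel value column
print(point_lookup(video_ids, view_counts, "v005"))  # 12
```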
Ad-hoc Analytics
● Use cases
○ Data Science
○ Analysts
● Properties
○ Complex SQL queries
○ Multi-stage queries
○ Semi-structured and complex data (e.g., user profiles)
○ Joins
● Technology
○ Efficient RDMA shuffle
○ Efficient joins: broadcast, lookup, pipelined, shuffled, co-partitioned (broadcast variant sketched below)
○ Multi-stage queries
○ Optimizer (adaptive)
○ UDFs
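Of the join strategies listed above, the broadcast variant is the simplest to sketch: the small right side is replicated to every worker, each of which builds a local hash table and probes it while streaming its shard of the left side. This is a generic illustration of the technique, not Procella's implementation; the rows and column names are made up.

```python
def broadcast_hash_join(left_rows, right_rows, key):
    """Broadcast join: the small right side is replicated to every
    worker; each worker builds a local hash table from it and probes
    while streaming its shard of the large left side."""
    build = {}
    for row in right_rows:                    # build phase (small side)
        build.setdefault(row[key], []).append(row)
    for row in left_rows:                     # probe phase (large side)
        for match in build.get(row[key], []):
            yield {**row, **match}

views = [{"video_id": "v1", "views": 525}, {"video_id": "v2", "views": 307}]
channels = [{"video_id": "v1", "channel": "music"},
            {"video_id": "v2", "channel": "gaming"}]
for joined in broadcast_hash_join(views, channels, "video_id"):
    print(joined)
```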
Data Pipelines
● Use cases
○ Pipelines
○ Logs Processing
● Properties
○ Complex business logic
○ Multi-stage queries
○ Large shuffles
○ Joins
○ UDFs
● Technology
○ Efficient RDMA shuffle
○ Efficient joins: broadcast, lookup, pipelined, shuffled, co-partitioned
○ Multi-stage queries
○ Optimizer (adaptive)
○ UDFs
Architecture
Procella Scale
| Property | Value |
| --- | --- |
| Queries executed | 450+ million per day |
| Queries on real-time data | 200+ million per day |
| Leaf plans executed | 90+ billion per day |
| Rows scanned | 20+ quadrillion per day |
| Rows returned | 100+ billion per day |
| Peak scan rate per query | 100+ billion rows per second |
| Schema size | 200+ metrics and dimensions |
| Tables | 100+ |
Procella Latencies
| Property | p50 | p99 | p99.9 |
| --- | --- | --- | --- |
| E2E latency | 25 ms | 450 ms | 3000 ms |
| Rows scanned | 17 M | 270 M | 1000 M |
| MDS latency | 6 ms | 120 ms | 250 ms |
| DS latency | 1 ms | 75 ms | 220 ms |
| Query size | 300 B | 4.7 KB | 10 KB |
Metadata Server Latencies
| Percentile | Latency (ms) | Tablets pruned | Tablets scanned |
| --- | --- | --- | --- |
| p50 | 6 | 110,000 | 60 |
| p90 | 11 | 900,000 | 250 |
| p95 | 15 | 1,000,000 | 310 |
| p99 | 120 | 33,000,000 | 4000 |
| p99.9 | 250 | 38,000,000 | 7200 |
| p99.99 | 5100 | 38,050,000 | 11,500 |
Procella Secret Sauce
● Stateful Caching
○ Cached indexes: zone maps, dictionaries, Bloom filters, etc.
○ File handle and metadata caching
○ Data segment caching
○ Data segments have data server affinity (primary DS, secondary DS, ...)
○ Cached expressions
● Tail Latency Reduction
○ Backup RPCs, issued after waiting roughly the 90th-percentile RPC latency (sketched below)
○ Minibatch backup RPCs: one per n RPCs, serving as a backup for the slowest of those n
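A single-process sketch of the backup-RPC idea: if the primary replica has not answered within roughly its p90 latency, fire a second request to another replica and take whichever answer lands first. The replica behavior and latencies below are simulated assumptions, and the minibatch variant is not shown.

```python
import concurrent.futures as cf
import random
import time

def call_replica(name, payload):
    time.sleep(random.uniform(0.01, 0.5))     # simulated variable latency
    return f"{name} answered: {payload}"

def rpc_with_backup(payload, hedge_after_s=0.1):
    """Hedged request: if the primary replica has not answered within
    roughly its p90 latency, fire a backup RPC to a second replica and
    return whichever response arrives first."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(call_replica, "primary", payload)]
        done, _ = cf.wait(futures, timeout=hedge_after_s)
        if not done:                          # primary is slow: hedge
            futures.append(pool.submit(call_replica, "backup", payload))
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()

print(rpc_with_backup("scan tablet 42"))
```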
Procella Secret Sauce
● Indexing
○ Separate metadata server optimized for compact storage of zone maps (pruning sketched below)
○ Dynamic range partitioning
○ Use of dictionaries and Bloom filters
○ Posting lists for repeated data
● Distributed Execution
○ N-tier execution, thousands-way parallel
● Vectorized Evaluation: Superluminal
○ Performs block evaluation
○ Natively understands RLE and dictionary encodings
○ Columnar evaluation
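A sketch of the zone-map pruning the metadata server performs: each tablet records a (min, max) per column, and a range predicate can skip every tablet whose range cannot overlap it. This is how millions of tablets get pruned down to the handfuls shown in the latency table above; the toy layout below is an assumption for illustration.

```python
def prune_tablets(zone_maps, column, lo, hi):
    """Zone-map pruning: each tablet stores (min, max) per column, so a
    range predicate can skip every tablet whose range cannot overlap it.
    Only the survivors are sent to data servers for scanning."""
    survivors = []
    for tablet_id, zones in zone_maps.items():
        zmin, zmax = zones[column]
        if zmax >= lo and zmin <= hi:         # ranges overlap: must scan
            survivors.append(tablet_id)
    return survivors

zone_maps = {
    "tablet-0": {"date": (20180101, 20180131)},
    "tablet-1": {"date": (20180201, 20180228)},
    "tablet-2": {"date": (20180301, 20180331)},
}
# WHERE date BETWEEN 20180210 AND 20180215 touches only one tablet.
print(prune_tablets(zone_maps, "date", 20180210, 20180215))  # ['tablet-1']
```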
Procella Secret Sauce
● Artus File Format
○ Multi-pass adaptive encoding selection to pick the best encoding for each column (sketched below)
○ Favors adaptive encodings over generalized compression
○ O(log n) lookups, O(1) seeks
○ Length-based representation for repeated data types (arrays)
○ Exposes dictionary and RLE information to the execution engine
○ Allows for rich metadata, e.g., inverted indexes
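A toy sketch of multi-pass encoding selection: one pass gathers column statistics, then a cost model picks the cheapest of a few candidate encodings. The bit-cost formulas below are invented for illustration; Artus's real encodings and cost model are more elaborate.

```python
def pick_encoding(column):
    """First pass: gather stats. Second pass (implied): encode with the
    cheapest candidate. The bit-cost formulas are deliberately naive."""
    n = len(column)
    distinct = len(set(column))
    runs = 1 + sum(1 for a, b in zip(column, column[1:]) if a != b)
    plain_bits = n * 64                                   # raw 64-bit values
    dict_bits = distinct * 64 + n * max(1, (distinct - 1).bit_length())
    rle_bits = runs * (64 + 32)                           # (value, run length)
    candidates = [("PLAIN", plain_bits), ("DICT", dict_bits), ("RLE", rle_bits)]
    return min(candidates, key=lambda c: c[1])[0]

print(pick_encoding([7] * 1000))          # one long run        -> RLE
print(pick_encoding([1, 2, 3] * 333))     # few distinct values -> DICT
print(pick_encoding(list(range(1000))))   # high cardinality    -> PLAIN
```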
Procella Secret Sauce
● Real-time Ingestion (lifecycle sketched below)
○ Data ingested into memory buffers
○ Periodically checkpointed to files
○ Small files are periodically aggregated and compacted into large files
● Virtual Tables
○ Index-aware aggregate selection
○ Stitched queries
○ Lambda architecture awareness
○ Normalization friendly (dimensional joins)
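A minimal sketch of the real-time write path described above: rows land in a memory buffer, buffers are checkpointed to small immutable files, and small files are compacted into large ones, with all three tiers queryable at every moment. The thresholds and in-memory structures are toy assumptions.

```python
class IngestionBuffer:
    """Write path sketch: rows land in a memory buffer, buffers are
    checkpointed to small immutable files, and small files are compacted
    into large ones. All three tiers stay queryable throughout."""

    def __init__(self, checkpoint_rows=4, compact_after=3):
        self.buffer, self.small_files, self.large_files = [], [], []
        self.checkpoint_rows, self.compact_after = checkpoint_rows, compact_after

    def ingest(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.checkpoint_rows:      # checkpoint to a file
            self.small_files.append(list(self.buffer))
            self.buffer.clear()
        if len(self.small_files) >= self.compact_after:   # compact small files
            self.large_files.append([r for f in self.small_files for r in f])
            self.small_files.clear()

    def scan(self):
        """Queries see buffers + small files + large compacted files."""
        yield from self.buffer
        for f in self.small_files + self.large_files:
            yield from f

ib = IngestionBuffer()
for i in range(13):
    ib.ingest({"event": i})
print(sum(1 for _ in ib.scan()))  # 13: every ingested row is servable
```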
TPC-H queries*
* Run of TPC-H queries on:
● Static Artus data
● A large shared instance
● Manually optimized queries
Geomean: 9.6 seconds
Questions?

Editor's Notes
1. Hi everyone, I'm Aniket Mokashi, a tech lead at Google. Let me give you a little background about myself. I work on YouTube's Data team, where I primarily contribute to Procella, which is what this talk is about. In addition, I work on the YouTube Analytics backend and on a variety of projects under the YouTube Data Warehouse. Prior to Google, I worked on data infra teams at Twitter and Netflix. I'm an Apache Parquet PMC member; I've contributed a few encodings and the Pig integration to Parquet, and I was responsible for rolling out the first production use of Parquet at Twitter. I'm also an Apache Pig PMC member; my main contributions there have been support for UDFs in scripting languages, native MapReduce, scalar value support, auto local mode, and jar caching. I enjoy working on large-scale data processing systems, and in this talk I will cover Procella, a versatile analytics engine we've built at YouTube to serve most of our analytics needs.
2. We will cover a number of topics in this talk. First, we will look at products at YouTube and Google that are powered by Procella. Next, we will explore the variety of use cases enabled by Procella and discuss the features required to support those use cases. After that, I will give you an overview of Procella's architecture. Lastly, we will spend some time on a deep dive into some of the key features that make Procella work so well. In the interest of time, I will take questions at the end.
  3. Let’s look at some products that are powered by Procella. But, before we get started, let me ask you all a few questions: So, raise of hands, How many of you use YouTube regularly like at least once a week… Perfect - that makes all of you. How many of you are Youtube Creators - that is you have uploaded at least one video to Youtube? Those of you who haven’t done that yet, I will highly encourage you to try it out. Especially to get rich analytics... :) How many of you have used youtube analytics for tracking their videos or at least seen youtube analytics before? Excellent. So, Youtube Analytics is this amazingly rich analytics dashboard that lets Youtube content creators or content owners analyze usage and popularity of their videos and channels on the Youtube platform. It lets them track all of success metrics such as number of views, amount of watch time, revenue from monetized videos across various dimensions such as demographics, geo, devices, playback location etc. We also have a mobile application, that you see on the right, called creator studio which has similar details. Some of this information is also available in real-time. Three years back, we started building Procella primarily to serve as a backend for Youtube analytics which is often referred as YTA. We wanted to make sure that backend of YTA will continue to scale horizontally as Youtube grows and as we add more features to the product. We had a few design goals in mind when we started working on Procella. First, we wanted to provide widely understood SQL interface so that navigating through this data is easy and integrating various frontends with the backend was less cumbersome. Second, we also wanted to provide flexibility to extend the product without a lot of involvement of the backend team, so we wanted to build as generic product as possible. We launched Procella as a backend for Youtube Analytics about 2 and half years ago. Since then, we have been able to make several changes to the the product seamlessly. For example, recently, we started showing impressions and impressions CTR in YTA and it required almost no involvement of the Procella team.
4. In addition to YouTube Analytics, over the last few years Procella has come to power several other internal and external facing analytics dashboards and products. One example is the YouTube Artists product that you can see on the slide. It lets everyone track the popularity of artists and their music videos in a region on the YouTube platform. If you haven't already, I recommend checking it out.
5. A few other fairly large products at Google are powered by Procella. For example, the Firebase Performance Console lets app developers track various growth- and success-related metrics for apps integrated with the Firebase platform, and the Google Play Console lets Android application developers track the performance and popularity of their apps. The various metric numbers, like subscriptions, likes, and dislikes, that you see on the youtube.com website or app are now also powered by Procella. Given the popularity of YouTube, as you can imagine, this was a significant achievement in terms of scale. Later in the talk, I will describe the system properties that allowed us to achieve it.
6. Now that you have looked at these interesting products that are using Procella, let's look at Procella itself. What is Procella? Procella is a widely used SQL query engine at Google. It supports most of the standard SQL functionality, including joins, set operations, and analytic functions. It also supports queries on top of complex or structured data types like arrays, structs, and maps. What makes it unique is that it's built to be super fast at scale: most of the queries running on Procella have sub-second latencies. It's highly scalable, working on petabytes of data across thousands of tables storing trillions of rows. It can handle a high QPS of queries: millions of QPS of point lookup queries at millisecond latencies, thousands of QPS of dashboard queries at sub-second latencies, and hundreds of ad-hoc queries running in seconds. It is also designed to serve data ingested in real time, supporting millions of QPS of row ingestion. It's easy to use, with a SQL interface that supports DDL statements to create tables and partitions. It also supports virtual tables, which are used to hide the complexity of the data model required to serve a use case. Procella was developed to be compatible with existing tools at Google so that it can be adopted easily: it exposes the query interface of Dremel, a popular query system at Google, and it can process files in the Capacitor file format, Dremel's native file format.
7. Procella is a composable and versatile system. It powers a variety of use cases, ranging from external facing metrics reporting applications to data pipelines. Let's go through these use cases one by one to understand the motivations behind the architecture and the various features in Procella. External reporting applications are public facing products like YouTube Analytics, the Firebase Performance Console, and the Google Play Console that I showed before. Being public facing, these applications have high QPS and blink-of-an-eye latency requirements. These products typically surface over a hundred metrics, sliced and diced by a few dozen dimensions, for a given entity or customer. Having a SQL interface instead of an API makes it easy to build the backends and frontends independently. These applications usually work on large datasets. Precomputing all the metric numbers required to power these dashboards is expensive or sometimes even infeasible, so the ability to perform fast aggregations of these metrics on the fly is important. To enable these applications, Procella provides efficient indexing and tablet pruning functionality, as most of these queries need to process a small number of rows out of a very large amount of data. These applications also need the ability to ingest real-time data to enable fast insights, and in most cases the ability to stitch between real-time and batch datasets using a lambda architecture is crucial. These applications can leverage the temporal locality of data, so caching at different layers helps performance.
8. Another use case is internal reporting and dashboards. Some examples of internal reporting are product dashboards, experiment analysis dashboards, and custom reports. Internal reporting has much lower QPS and relatively less aggressive latency requirements compared to external reporting. However, internal dashboards typically work on much larger and more complex datasets, so data scan efficiency and the ability to handle complex data types are important for serving their queries. To enable these use cases, we have developed features such as fast optimized data formats and a vectorized evaluation library.
9. Real-time time-series analysis and monitoring is another use case. YTA provides insights in real time, and we also track YouTube site health using Procella. These applications require native support for the time dimension, especially to filter on different time boundaries like the last 60 minutes or the last 2 days.
10. As I mentioned previously, using Procella to power metrics on various pages of the YouTube platform is one of the most exciting use cases. YouTube gets millions of activity updates per second, so this requires reliable real-time ingestion of millions of rows per second. This is mainly done through pubsub, a persistent message queue. Updates are replicated to multiple ingestion servers for reliability. On the serving side, this use case requires the ability to serve millions of simple point lookup queries per second. These queries take only a few milliseconds to run on the ingested data. To make this possible, we have heavily optimized the query serving path with additional features that can look up and scan the required data efficiently.
11. Ad-hoc analytics essentially covers interactive querying at a low QPS. Data scientists and analysts make use of tables stored in Procella to identify trends, growth factors, and other important custom business metrics. This typically requires querying complex data types like arrays and nested structs, and the ability to join arbitrary datasets. We support a number of join strategies required for these queries to work. In particular, we support broadcast joins, which broadcast the right side of the join to all the leaf nodes so that they can construct local hash maps for lookups during the join. We support remote lookup joins that leverage our lookup-friendly data format. We support pipelined joins that run in stages: they first compute a small amount of join data from the right side so that it can be used as a filter on the left side in subsequent stages. Shuffled joins are essentially reduce-side joins that can scale to arbitrarily large datasets. Lastly, co-partitioned joins leverage the partitioning of the left and right sides to efficiently merge them during the join.
12. Supporting SQL-based data pipelines requires the ability to handle large amounts of data and process it with complex business logic. These are typically multi-stage queries that join various data sources, and these joins and aggregations require large shuffles. For logic that cannot be expressed in SQL, UDF support is necessary. To enable this use case, we have a pluggable shuffle component and we leverage the efficient RDMA shuffle service available across Google. We have also implemented an adaptive cost-based optimizer that can optimize parts of these queries on the fly.
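As a toy illustration of adaptive optimization, not Procella’s actual planner, here is the shape of a runtime decision the optimizer might make: once the right side’s size is observed, pick a broadcast join when it is small and fall back to a shuffled join otherwise. The threshold and flags are invented.

```python
# Toy sketch of an adaptive join-strategy choice (threshold is assumed).
BROADCAST_LIMIT_ROWS = 100_000

def choose_join_strategy(right_side_rows, right_is_lookup_indexed=False):
    if right_side_rows <= BROADCAST_LIMIT_ROWS:
        return "broadcast"   # small build side: ship it to every leaf
    if right_is_lookup_indexed:
        return "lookup"      # probe the lookup-friendly format remotely
    return "shuffled"        # scalable reduce-side join as the fallback

print(choose_join_strategy(10_000))       # -> broadcast
print(choose_join_strategy(10_000_000))   # -> shuffled
```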
13. All of the categories of use cases I mentioned so far are powered by Procella in production at YouTube. Procella’s unique scalable, composable architecture makes that possible. Let’s dive into the architecture now. This diagram roughly shows the architecture of Procella. All rectangular boxes in the diagram are essentially independent services in the Procella query engine.

Towards your left are the metadata server, the metadata store and the registration server. Users register their tables with Procella using DDL commands like CREATE/ALTER TABLE or programmatically using RPCs. This defines the schemas of the tables. Then users set up upstream batch processes to periodically create partitions of datasets. These datasets are stored on Google’s distributed file system, Colossus. After datasets are generated, users register them with Procella using ALTER TABLE commands or programmatically with RPCs. The datasets are typically stored in columnar file formats such as Capacitor or Artus. Each dataset consists of many blocks of rows called tablets; a tablet typically has tens of thousands to millions of rows. Data is generally range or hash partitioned to achieve clustering by the columns that are most frequently filtered on. During data registration, the registration server extracts metadata for all tablets, such as stats, zone maps, dictionaries and bloom filters, and stores it in the metadata store in a highly compact and organized way.

In the query path, shown on your right with yellow arrows, dashboard clients or human users send their SQL queries as strings. These are compiled into distributed multi-level query plans to be executed on the data servers. The root server is responsible for compiling the query and coordinating its execution; it also runs the last stage of query execution before returning results to the users. The metadata server is primarily responsible for pruning tablets based on the partition metadata in the metadata store. To make this efficient, metadata is held in an LRU cache on the metadata servers. We use zone maps, dictionaries and bloom filters to prune down to the tablets required for the query. Once the tablets to be processed are identified, the distributed query plan is executed on the data servers for those tablets. Data servers load the columns required for query execution on demand and cache them in large local caches on every data server. Data shuffles between stages of the query are handled by the remote RDMA shuffle service.

In real-time ingestion, shown with blue arrows, data producers send rows of data directly via RPC or through a persistent pubsub queue. Based on the pre-configured partitioning on selected columns, each row is ingested into two ingestion servers, a primary and a secondary. This data is held in lookup-efficient memory buffers and periodically checkpointed to Colossus. The checkpointed files are eventually compacted or aggregated into larger files for query efficiency. Data ingested into Procella is servable from the moment it enters the system, so queries on real-time tables are served from the buffers, the small files and the large compacted files. The lifecycle of these buffers, small files and large files is coordinated with transactions on the metadata store via the registration server.
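To make the registration step concrete, here is a minimal sketch of the kind of per-tablet metadata extraction described above: computing per-column zone maps (min/max) that the metadata server can later use for pruning. The row/tablet representation is invented for illustration; the real extraction also builds dictionaries and bloom filters.

```python
# Illustrative sketch: compute per-column zone maps for one tablet at
# registration time (data layout is assumed, not Procella's).
def build_zone_maps(tablet_rows):
    """tablet_rows: list of row dicts. Returns {column: (min, max)}."""
    zone_maps = {}
    for row in tablet_rows:
        for col, val in row.items():
            lo, hi = zone_maps.get(col, (val, val))
            zone_maps[col] = (min(lo, val), max(hi, val))
    return zone_maps

tablet = [{"date": 20180601, "views": 3}, {"date": 20180602, "views": 9}]
print(build_zone_maps(tablet))  # {'date': (20180601, 20180602), 'views': (3, 9)}
```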
14. Now that we have looked at the architecture, let’s look at some scale and latency numbers. These are numbers from one of our instances serving analytics on YouTube. On this instance, we execute over 450 million queries per day, of which over 200 million are queries on real-time datasets. This results in over 90 billion leaf plans that run on the data servers. In total, this scans over 20 quadrillion rows per day and returns over 100 billion rows. We can achieve a peak scan rate of over 100 billion rows per second. This instance has over 100 tables with more than 200 columns.
15. Let’s look at the latency numbers for our analytics instance. As you can see, our queries have a median end-to-end latency of just 25 ms. Our slowest queries take about 3 seconds, likely because of query size and cache misses. The median query scans about 17 million rows.
16. On range-partitioned data, the Procella metadata server can prune a large number of tablets from the scan. It does this by using efficient data structures to store tablet boundaries in memory and then binary searching through them to identify the tablets that satisfy a given filter clause.
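Here is a minimal sketch of that binary search, assuming contiguous tablets whose sorted lower bounds are held in memory; a filter like `key BETWEEN lo AND hi` then maps to two binary searches. The layout is illustrative, not Procella’s exact structure.

```python
# Illustrative sketch: prune range-partitioned tablets with binary search.
import bisect

def prune_tablets(boundaries, lo, hi):
    """boundaries[i] is the inclusive lower bound of tablet i; tablets are
    contiguous, so tablet i covers [boundaries[i], boundaries[i+1])."""
    first = bisect.bisect_right(boundaries, lo) - 1  # tablet containing lo
    last = bisect.bisect_right(boundaries, hi) - 1   # tablet containing hi
    return range(max(first, 0), last + 1)

boundaries = [0, 100, 200, 300]                   # four tablets
print(list(prune_tablets(boundaries, 150, 250)))  # -> [1, 2]
```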
17. Now, let’s deep dive into some of the features that make Procella work so well. These few features are essentially the secret sauce that makes Procella so fast and efficient.

Stateful caching: for fast and efficient processing, we cache at various layers in Procella. On the metadata server, constraints, zone maps and other indexes are cached in an LRU cache to avoid going to the metadata store. These are stored in a versioned cache so that invalidation is easy. As I mentioned before, data processed by Procella is stored remotely on Colossus, the distributed file system. To allow efficient fetches, we cache file handles, which essentially hold the location of the remote Colossus server and metadata such as the number, size and location of chunks. We also cache column-level metadata, including the sizes of the columns and their offsets in the file. In addition, in a separate cache, we store the blocks of columns required during query processing, using a smart age- and cost-aware caching policy. Data segments have data-server affinity to achieve a high cache hit rate: any data segment is loaded and processed by only two data servers, a primary and a secondary for that segment. Any request to process a data segment first goes to the primary, and if the primary does not respond fast enough, the secondary handles the request. In addition, we support expression caching, which allows users to cache expensive expressions in the query.

Now, let’s talk about tail-latency reduction. To reduce tail latency, we use speculative execution for data server RPCs. During execution of any query, after 90% of the query execution completes, we send backup RPCs to secondary data servers for the remaining 10% of incomplete RPCs. We also send mini-batch backup RPCs at regular intervals during query execution: that is, one backup RPC is sent for the slowest RPC in every set of n RPCs.
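Here is a toy sketch of the backup-RPC (hedged request) pattern: give the primary a head start, and only fan out to the secondary if it has not answered in time. The timeouts and server stubs are invented; Procella’s triggers are the completion-percentage and mini-batch rules described above.

```python
# Toy sketch of a hedged request: primary first, secondary as backup.
import asyncio

async def call_server(name, delay, payload):
    await asyncio.sleep(delay)  # stand-in for a real data-server RPC
    return f"{name} handled {payload}"

async def hedged_request(payload, hedge_after=0.05):
    primary = asyncio.ensure_future(call_server("primary", 0.2, payload))
    try:
        # shield() keeps the primary running if the head start expires.
        return await asyncio.wait_for(asyncio.shield(primary), hedge_after)
    except asyncio.TimeoutError:
        secondary = asyncio.ensure_future(call_server("secondary", 0.01, payload))
        done, pending = await asyncio.wait(
            {primary, secondary}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()  # drop the loser
        return done.pop().result()

print(asyncio.run(hedged_request("scan tablet 42")))
```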
18. Indexing is one of the things that makes Procella architecturally different from other query engines. Procella uses a separate metadata server to store indexing information and performs a separate, efficient pass over it before query execution to prune and select tablets. Procella uses dynamic range partitioning, for both batch and real-time tables, to achieve uniform distribution of data across tablets. Dictionary pruning is helpful when there is a relatively small number of distinct values compared to the number of rows, and bloom filters help significantly for needle-in-the-haystack queries, for example getting metrics for less popular videos, like views for the videos on my channel over the last 28 days. Our query execution is fully distributed and massively parallel, so we can leverage the power of all the data servers during the execution of a query. We also have a vectorized evaluation engine called Superluminal that works on blocks of values and leverages modern CPU instructions during evaluation. Superluminal is a columnar evaluation library that natively understands and leverages RLE and dictionary encodings to make group-by and filtering operations much cheaper to execute.
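To illustrate why dictionary awareness makes filtering cheap, here is a minimal sketch, with invented data, of evaluating a predicate once per distinct dictionary entry instead of once per row; selecting rows then reduces to cheap integer comparisons. This shows the general technique, not Superluminal’s actual internals.

```python
# Illustrative sketch: filtering a dictionary-encoded column.
def filter_dictionary_encoded(codes, dictionary, predicate):
    # Step 1: one predicate evaluation per distinct value, not per row.
    matching = {i for i, value in enumerate(dictionary) if predicate(value)}
    # Step 2: a tight loop over small integer codes.
    return [row for row, code in enumerate(codes) if code in matching]

dictionary = ["US", "IN", "BR", "JP"]  # distinct values in the column
codes = [0, 1, 1, 2, 0, 3, 1]          # per-row dictionary indexes
print(filter_dictionary_encoded(codes, dictionary, lambda c: c == "IN"))
# -> rows [1, 2, 6]
```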
19. Procella supports querying data stored in two columnar file formats: Capacitor and Artus. Capacitor is Dremel’s native file format, which is widely used at Google. It supports RLE- and dictionary-encoded columns and can store bloom filters for cheaply checking whether a value exists in a column. We have also developed another file format, Artus, which, in addition to the features in Capacitor, provides efficient lookup and seek APIs on columns to make Procella’s evaluation significantly more efficient. In Artus, we use multi-pass adaptive encoding to select the best encoding for each column. Artus gets its space savings from these adaptive encodings rather than from the generalized compression used by other file formats, which avoids the cost of fully decompressing columns. It allows lookups using binary search and O(1) seek APIs to move between rows. It also simplifies the handling of complex data types by using a length-based representation for arrays instead of repetition and definition levels. Artus exposes dictionary and RLE information to the execution engine so that operations like filtering and group-bys can be made more efficient.
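As a small illustration of operating directly on RLE data, here is a sketch of a count aggregation where each run contributes its length in a single step, with no per-row decoding. The (value, run_length) layout is assumed for the example; Artus’s actual encodings are more sophisticated.

```python
# Illustrative sketch: RLE-aware count aggregation over an encoded column.
from collections import Counter

def count_by_value_rle(runs):
    """runs: list of (value, run_length) pairs for an RLE-encoded column."""
    counts = Counter()
    for value, run_length in runs:
        counts[value] += run_length  # one addition covers the whole run
    return counts

runs = [("US", 1_000_000), ("IN", 500_000), ("US", 250_000)]
print(count_by_value_rle(runs))  # Counter({'US': 1250000, 'IN': 500000})
```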
20. We’ve already talked about real-time ingestion, but one more thing to add is that once the data is persisted to files, queries on those files are served by the data servers; ingestion servers only manage the memory buffers and the queries on them. Lastly, Procella supports virtual tables, which serve several purposes. First, to simplify development and allow flexibility, they enable efficient aggregate selection: users can write their queries to select metrics and dimensions from the virtual table without knowing which physical aggregates contain them, and Procella will rewrite the query against the most efficient aggregate that can correctly serve it. This also allows backend teams to add necessary aggregates and change backend data models independently, without modifying any incoming queries. Second, virtual tables provide automatic stitching between batch and real-time tables: for pipelines using a lambda architecture, as the batch-processed data arrives, queries automatically add the necessary filters on real-time data to show a stitched view of the datasets. And lastly, virtual tables are normalization friendly: they present a denormalized view of the tables to the users and add the necessary joins during the rewrite to perform denormalization on the fly. I haven’t covered all the features here, but these key features are primarily responsible for making Procella a success.
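Here is a toy sketch of the aggregate-selection idea: given the dimensions a query needs, pick the smallest registered physical aggregate whose dimension set covers them. The table names, sizes and selection rule are hypothetical; the real rewrite also has to check metric availability and freshness.

```python
# Toy sketch of aggregate selection for a virtual table (tables assumed).
AGGREGATES = [
    {"name": "daily_by_video",         "dims": {"date", "video_id"},            "rows": 10**9},
    {"name": "daily_by_video_country", "dims": {"date", "video_id", "country"}, "rows": 10**11},
]

def select_aggregate(query_dims):
    # Keep only aggregates whose dimensions cover the query's dimensions.
    candidates = [a for a in AGGREGATES if query_dims <= a["dims"]]
    if not candidates:
        raise ValueError("no physical aggregate can serve this query")
    return min(candidates, key=lambda a: a["rows"])  # cheapest covering table

print(select_aggregate({"date", "video_id"})["name"])  # -> daily_by_video
```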
21. Here are some results of running TPC-H benchmark queries on Procella. As you can see, using the techniques discussed before, we are able to achieve some of the best results for this benchmark compared to other query engines. Note that these are slightly older numbers from running queries on a static Artus TPC-H dataset, using a large instance, with queries manually rewritten to add join hints, etc.
22. With that, let me conclude my talk by summarizing what we learned. In this talk, we learned about Procella, a composable, scalable and versatile query engine we have built at YouTube. We learned about its architecture and what makes it work so well. If this is something that excites you, feel free to come talk to me later to learn about opportunities. I can take questions now. But before that, I would like to mention that I will not be sharing any numbers other than the ones mentioned on the slides in this talk. Also, I will not comment on comparisons of Procella with any other existing Google products.