A REAL TIME DATA QUERY ENGINE
Michael Natkovich & Nate Speidel
Allow Myself to Introduce . . . Myself
■ Nate Speidel
● nspeidel@oath.com
● Software Engineer
● 2+ years of solving data problems at Yahoo
Allow Myself to Introduce . . . Myself
■ Michael Natkovich
● mln@oath.com
● Director, Software Dev Engineering
● 10+ years of causing data problems at Yahoo
Motivation: Cycle of Sadness
■ Instrumentation validation is unbearably slow
● Needs to be seconds not hours
● Needs to be easy to query
● Needs programmatic access
Typical Query Engine
[Diagram labels: Data Flow, Persistence, Queries]
Look Forward Query Engine
[Diagram labels: Data Flow, Query Engine, Old Un-Queryable Data, Current Queryable Data, Future Queryable Data, Query Results]
Typical Streaming Query Cost
Queries: Storm Query 1, Storm Query 2, Storm Query 3, Spark Query 1
Input: 2MM events/sec
Throughput: 1K events/sec/core
Resources: 2K cores/query
Total: 8K cores
Bullet Query Cost
Queries: Bullet Query 1, Bullet Query 2, Bullet Query 3, Bullet Query 4
Input: 2MM events/sec
Throughput: 1K events/sec/core
Resources: 2K cores
Total: 2K cores
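The arithmetic behind the two cost slides can be sketched as below. The function names are illustrative, and the model assumes, as the slides imply, that each conventional streaming query reads the full input on its own while Bullet reads the input once for all queries.

```python
import math


def cores_to_read(input_events_per_sec, events_per_sec_per_core):
    """Cores needed to read the whole stream once."""
    return math.ceil(input_events_per_sec / events_per_sec_per_core)


def typical_total_cores(num_queries, input_rate, per_core_rate):
    # Assumption: every query is a separate job that reads the full stream.
    return num_queries * cores_to_read(input_rate, per_core_rate)


def bullet_total_cores(num_queries, input_rate, per_core_rate):
    # Assumption: one shared read; the per-query cost is treated as negligible,
    # so num_queries does not change the total.
    return cores_to_read(input_rate, per_core_rate)


INPUT_RATE = 2_000_000  # 2MM events/sec
PER_CORE = 1_000        # 1K events/sec/core

print(typical_total_cores(4, INPUT_RATE, PER_CORE))  # 8000
print(bullet_total_cores(4, INPUT_RATE, PER_CORE))   # 2000
```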
Bullet
■ Retrieves data that arrives after query submission
● Look Forward!
■ No persistence layer
■ Light-weight, fast, and scalable
■ UI for Ad-Hoc queries
■ API for programmatic querying
■ Pluggable interface to integrate with streaming data
What It’s For
■ Single stream, multiple consumers
■ Ad-hoc interactive usage
■ Programmatic short-lived queries
What It’s Not For
■ Repeatable queries
■ Currently no joins
■ Not meant for ETL
Querying in Bullet
■ Supports filtering and logical operators on typed data
■ Supports aggregations
● Group By, Count Distincts, Top K, Distributions
● DataSketches based
■ Queries have life spans
● All queries run for a specified duration (or infinitely)
■ Results are Windowed
● Windows can be time or record based
● Raw record or aggregate based
Streaming Aggregations
■ Motivation
● Calculating cardinality
● Getting live latency distributions
● Validate experimentation bucket sizes
■ Aggregations are Hard
● Data skew
● Intermediate results are large and expensive to move
● The longer you run, the more memory you need
● Incremental results can’t be combined
Count Distinct: Naive
■ Problem: overwhelms the single combiner
1. Read Input
2. Round Robin
3. Extract Field
4. Send to Combiner
5. Count Distincts
Count Distinct: Typical
■ Problem: vulnerable to data skew
1. Read Input
2. Round Robin
3. Extract Field
4. Hash Partition
5. Count Distincts
6. Send Count
7. Combine Counts
Count Distinct: Sketches
1. Read Input
2. Round Robin
3. Build Sketch
4. Send to Combiner
5. Merge Sketches
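The build-then-merge steps above can be illustrated with a toy mergeable sketch. This is a simplified K-Minimum-Values sketch written for illustration only; it is not the DataSketches Theta Sketch that Bullet uses, and the class name, `k`, and the user IDs are all assumptions.

```python
import hashlib


class KMVSketch:
    """Toy K-Minimum-Values sketch: keeps the k smallest 64-bit hashes seen."""

    SPACE = 2 ** 64

    def __init__(self, k=1024):
        self.k = k
        self.hashes = set()
        self.threshold = self.SPACE  # largest retained hash once the sketch is full

    @staticmethod
    def _hash(item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def _trim(self):
        if len(self.hashes) > self.k:
            self.hashes = set(sorted(self.hashes)[: self.k])
        if len(self.hashes) == self.k:
            self.threshold = max(self.hashes)

    def update(self, item):
        h = self._hash(item)
        if h >= self.threshold:
            return  # cannot be among the k smallest hashes
        self.hashes.add(h)
        self._trim()

    def merge(self, other):
        # The k smallest hashes of the union are enough to describe the union,
        # so duplicates across workers are not counted twice.
        merged = KMVSketch(self.k)
        merged.hashes = self.hashes | other.hashes
        merged._trim()
        return merged

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # small input: the count is exact
        return (self.k - 1) * self.SPACE / self.threshold


# Two workers each build a sketch over their share of the stream (step 3),
# then the combiner merges them (step 5). Users 4000..5999 appear on both.
worker_a, worker_b, whole = KMVSketch(), KMVSketch(), KMVSketch()
for u in range(0, 6000):
    worker_a.update(f"user-{u}")
    whole.update(f"user-{u}")
for u in range(4000, 10000):
    worker_b.update(f"user-{u}")
    whole.update(f"user-{u}")

merged = worker_a.merge(worker_b)
print(round(merged.estimate()))  # an estimate close to the true 10000 distinct users
```

Each worker ships at most `k` hashes to the combiner regardless of how many records it saw, which is the property that keeps the combiner from being overwhelmed.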
Data Sketches
■ Sketches are a class of stochastic streaming algorithms
■ Provide approximate results (if data is too large)
■ Provable error bounds
■ Fixed memory footprint
■ Mergeable, allowing for parallel processing
Data Sketches in Streams
■ Accurate to a Point
● Sketches sized correctly will be 100% accurate
● Error rate is inversely proportional to size of a Sketch
■ Fixed Memory Ceiling
● Maximum Sketch size is configured in advance
● Memory cost of a query is thus known in advance
■ Allows Non-additive Operations to be Additive
● Sketches can be merged into a single Sketch without overcounting
● Allows tasks to be parallelized and cheaply combined later
● Allows results to be combined across windows of execution
Bullet’s Use of Data Sketches
Data Sketch              Query Type
Theta Sketch             Count Distinct
Tuple Sketch             Group By
Quantile Sketch          Distributions
Frequent Items Sketch    Top K
Windowing
■ A way of breaking up an endless stream into digestible components
■ Typically broken using time or records
■ Needed for incremental results
■ A window is the unit of incrementation
Windowing
■ Tumbling Windows*
● Contiguous non-overlapping windows at regular intervals
■ Hopping Windows
● Contiguous (possibly) overlapping windows at regular intervals
■ Sliding Windows*
● Event based windows looking back at regular event intervals
■ Cascading Windows
● Sliding windows that reset at regular intervals too
■ Session Windows
● Sliding windows that reset if distance between events is exceeded
Why Windowing
■ Example: Number of distinct users in the next 60 seconds
■ Option 1: Wait 60 secs to get results
● No feedback :(
■ Option 2: Every 5 secs, get current state until end
● Continuous feedback with same final results
● Stop queries early (sufficient information gleaned, query bad, etc.)
● Quickly iterate queries
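Option 2 can be sketched as a loop that emits the running distinct count at every 5-second boundary. The event timestamps and user IDs below are made up for illustration, and an exact Python set stands in for the sketch a real engine would use.

```python
def incremental_distinct(events, duration=60, every=5):
    """events: iterable of (timestamp_seconds, user_id), sorted by time.

    Returns [(window_end, distinct_users_so_far), ...], one entry per
    `every` seconds until `duration` is reached.
    """
    seen = set()
    results = []
    next_emit = every
    for ts, user in events:
        if ts >= duration:
            break  # the query's life span is over
        while ts >= next_emit:
            results.append((next_emit, len(seen)))
            next_emit += every
        seen.add(user)
    # Emit any remaining boundaries, including the final result.
    while next_emit <= duration:
        results.append((next_emit, len(seen)))
        next_emit += every
    return results


# Hypothetical stream: one event every 2 seconds, cycling through 7 users.
events = [(t, f"user-{t % 7}") for t in range(0, 90, 2)]
results = incremental_distinct(events)
for window_end, count in results:
    print(f"t={window_end:>2}s distinct users so far: {count}")
```

The last emitted value is the same number Option 1 would return after waiting the full 60 seconds; the earlier values are the feedback that lets a query be stopped or revised early.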
Tumbling Window (10 second window)
[Figure: timeline from 0 to 30 seconds carrying records 1 to 9; the three windows contain records 1 2 3 4 5, then 6 7, then 8 9]
Tumbling Window (3 record window)
[Figure: timeline from 0 to 30 seconds carrying records 1 to 9; the three windows contain records 1 2 3, then 4 5 6, then 7 8 9]
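Sliding Window (3 record window, 1 record slide)
[Figure: timeline from 0 to 10 seconds carrying records 1 to 5; successive windows contain 1, then 1 2, then 1 2 3, then 2 3 4, then 3 4 5]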
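The window examples on these slides can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the timestamps in the time-based example are hypothetical values chosen to match the grouping on the slide.

```python
def tumbling_by_time(events, width):
    """events: list of (timestamp_seconds, record). Empty windows are skipped."""
    windows = {}
    for ts, record in events:
        windows.setdefault(int(ts // width), []).append(record)
    return [windows[key] for key in sorted(windows)]


def tumbling_by_count(records, size):
    """Contiguous, non-overlapping windows of `size` records."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def sliding_by_count(records, size, slide=1):
    """After every `slide` records, look back at up to `size` records."""
    return [records[max(0, end - size):end]
            for end in range(slide, len(records) + 1, slide)]


# Hypothetical arrival times for records 1..9 over 30 seconds.
timed = [(1, 1), (3, 2), (5, 3), (7, 4), (9, 5), (13, 6), (18, 7), (22, 8), (27, 9)]

print(tumbling_by_time(timed, 10))
# [[1, 2, 3, 4, 5], [6, 7], [8, 9]]
print(tumbling_by_count(list(range(1, 10)), 3))
# [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(sliding_by_count([1, 2, 3, 4, 5], 3, 1))
# [[1], [1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
```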
[Diagram: Bullet query flow]
■ Components: Bullet WS, Request Processor, Data Processor, Combiner
■ Bullet Data Stream: Performance Stats, Sensor Data, User Activity, IoT Data
■ Flow labels: Query, Query & ID, Data Records, Matching Events & ID, Results
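One plausible reading of the diagram labels is sketched below as a single-process toy. It is not Bullet's implementation: the class, its method names, and the record fields are all hypothetical, and the three roles that the diagram shows as separate components are collapsed into one object.

```python
from collections import defaultdict


class LookForwardEngine:
    def __init__(self):
        self.queries = {}                 # query ID -> predicate
        self.results = defaultdict(list)  # query ID -> matching events
        self._next_id = 0

    def submit(self, predicate):
        """Request Processor role: assign an ID and register the query."""
        query_id = self._next_id
        self._next_id += 1
        self.queries[query_id] = predicate
        return query_id

    def process(self, record):
        """Data Processor role: read each record once, test every active query."""
        for query_id, predicate in self.queries.items():
            if predicate(record):
                self.combine(query_id, record)

    def combine(self, query_id, record):
        """Combiner role: collect matching events under their query ID."""
        self.results[query_id].append(record)

    def finish(self, query_id):
        """End the query's life span and hand back its results."""
        del self.queries[query_id]
        return self.results.pop(query_id, [])


engine = LookForwardEngine()

# This record arrives before any query exists, so it is never queryable.
engine.process({"page": "home", "latency_ms": 900})

slow = engine.submit(lambda r: r["latency_ms"] > 500)
home = engine.submit(lambda r: r["page"] == "home")

for record in [
    {"page": "home", "latency_ms": 120},
    {"page": "search", "latency_ms": 700},
    {"page": "home", "latency_ms": 650},
]:
    engine.process(record)

slow_results = engine.finish(slow)
home_results = engine.finish(home)
print(slow_results)
print(home_results)
```

Nothing is stored for later: a record is read once, checked against whatever queries are live at that moment, and then dropped. Adding a query adds one more predicate check per record rather than a second read of the stream.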
Core Design Principles
■ No persistence
● Tradeoff: Query Speed, Low Storage Cost > Repeatability
■ Scale for data and queries
● Each query cost is fixed and negligible, relative to data ingestion
■ Pluggable everything
● Run on top of any stream processor (Spark, Storm, etc.)
● Read from any data source (Kafka, Kinesis, etc.)
● Choose an implementation of the PubSub (Kafka, REST, etc.)
■ Tune everything
● Example: Sketch size vs Sketch accuracy
Overall Architecture
Backend Layer Detailed Architecture: Storm
Backend Layer Detailed Architecture: Spark
Performance: Linearly Scales for Data
Performance: Linearly Scales for Queries
Demos
■ Bullet Reddit
● https://youtu.be/p6rOy9F7K8U
■ Bullet Finance
● https://youtu.be/RMMT4Phdhr8
In Summary
■ Bullet is a lightweight and cheap stream query engine
■ It offers raw record and OLAP style queries
■ Leverages the power of Data Sketches
■ Only need enough hardware to read data
● Queries are basically free!
■ Abstraction layer that can sit on any Stream Framework
● Implementations available for Storm and Spark
■ Pluggable, allowing for consumption from any data source
■ Fully open sourced!!
Future Work
■ BQL: SQL-like interface support (already supported in WS)
■ More stream processor support (Flink)
■ All the Windows!
■ More aggregations (Group By Count Distinct)
Links
■ Documentation: https://bullet-db.github.io/
■ Github: https://github.com/bullet-db
■ Contact Us
● Developers: bullet-dev@googlegroups.com
● Users: bullet-users@googlegroups.com
■ Data Sketches: https://datasketches.github.io/
■ Reddit API: https://www.reddit.com/dev/api/
QUESTIONS

Bullet - Open Source Real-Time Data Query Engine, Michael Natkovich, Director Software Dev Engineering & Nate Speidel, Software Engineer, Oath
