Streaming SQL for Data Engineers: The Next Big Thing?
Streaming SQL Products
Open source and vendor solutions
● Apache Flink
● Apache Spark
● Apache Beam
● AWS Kinesis
● Google Cloud Dataflow
● Databricks
● ksqlDB
● …
Companies building internal platforms
● Meta
● LinkedIn
● Pinterest
● DoorDash
● Alibaba
● …
👋 Hi, I’m Yaroslav
● Principal Software Engineer @ Goldsky
● Staff Data Engineer @ Shopify
● Software Architect @ Activision
● …
❤ Apache Flink
🤔
TableEnvironment tableEnv = TableEnvironment.create(/*…*/);
Table revenue = tableEnv.sqlQuery(
    "SELECT cID, cName, SUM(revenue) AS revSum " +
    "FROM Orders " +
    "WHERE cCountry = 'FRANCE' " +
    "GROUP BY cID, cName"
);
… but why SQL?
Why SQL?
● Wide adoption
● Declarative transformation model
● Planner!
● Common type system
What instead of How
Imperative Style: User → Intention → Execution (Runtime)
Declarative SQL Style: User → Intention → Planning (Planner) → Execution (Runtime)
Declarative Transformation Model
SQL API:
SELECT * FROM Orders
INNER JOIN Product
  ON Orders.productId = Product.id
DataStream API:
● LOTS of code!
● Create an operator to connect two streams
● Define and accumulate state
● Implement a mechanism for emitting the latest value per key
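To make the DataStream-side bullets concrete, here is a minimal sketch in plain Java (deliberately not the Flink DataStream API) of the bookkeeping a hand-written join needs: two inputs, state accumulated per key, and the latest product value kept per key. The class and method names (`StreamingJoinSketch`, `onOrder`, `onProduct`) are hypothetical, and the sketch ignores checkpoints, state expiry, and retractions, all of which a production operator would also have to handle.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a hand-written two-input join operator.
// It is NOT the Flink API: no checkpoints, no state expiry, no retractions.
final class StreamingJoinSketch {
    // State accumulated per join key (productId).
    private final Map<String, List<String>> ordersByProduct = new HashMap<>();
    private final Map<String, String> latestProductById = new HashMap<>();
    // Collected output; a real operator would emit downstream instead.
    private final List<String> output = new ArrayList<>();

    // Called for every event on the Orders input.
    void onOrder(String orderId, String productId) {
        ordersByProduct.computeIfAbsent(productId, k -> new ArrayList<>()).add(orderId);
        String product = latestProductById.get(productId);
        if (product != null) {
            output.add(orderId + "|" + product);
        }
    }

    // Called for every event on the Product input; keeps only the latest value per key.
    void onProduct(String productId, String name) {
        latestProductById.put(productId, name);
        for (String orderId : ordersByProduct.getOrDefault(productId, List.of())) {
            output.add(orderId + "|" + name);
        }
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        StreamingJoinSketch join = new StreamingJoinSketch();
        join.onOrder("o1", "p1");        // buffered: product not seen yet
        join.onProduct("p1", "Laptop");  // emits o1|Laptop
        join.onOrder("o2", "p1");        // emits o2|Laptop
        System.out.println(join.output());
    }
}
```

Even this toy version needs explicit state and two event handlers; the SQL statement expresses the same intent in three lines and leaves those decisions to the planner.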
Declarative Transformation Model
SQL API:
SELECT * FROM Orders
INNER JOIN Product
  ON Orders.productId = Product.id
Why not Table API?
val orders = tEnv.from("Orders")
  .select($"productId", $"a", $"b")
val products = tEnv.from("Products")
  .select($"id", $"c", $"d")
val result = orders
  .join(products)
  .where($"productId" === $"id")
  .select($"a", $"b", $"c")
Declarative Transformation Model
Top-N Query
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ticker
    ORDER BY price DESC) AS row_num
  FROM stock_table)
WHERE row_num <= 10;
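As a rough illustration of what the Top-N query asks the engine to maintain, the following plain-Java sketch keeps the N highest prices per ticker as rows arrive. `TopNSketch` and its methods are hypothetical names, and this is only one simple way to do it, not how any particular engine implements Top-N.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: maintains the N highest prices per ticker as rows arrive,
// mirroring ROW_NUMBER() OVER (PARTITION BY ticker ORDER BY price DESC) <= N.
final class TopNSketch {
    private final int n;
    private final Map<String, List<Double>> topByTicker = new HashMap<>();

    TopNSketch(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
    }

    // Processes one row; returns true if the top-N for that ticker changed.
    boolean add(String ticker, double price) {
        List<Double> top = topByTicker.computeIfAbsent(ticker, k -> new ArrayList<>());
        if (top.size() == n && price <= top.get(top.size() - 1)) {
            return false; // not high enough to enter the top-N
        }
        top.add(price);
        top.sort(Collections.reverseOrder()); // highest price first
        if (top.size() > n) {
            top.remove(top.size() - 1); // evict the row that fell out of the top-N
        }
        return true;
    }

    List<Double> top(String ticker) {
        return topByTicker.getOrDefault(ticker, List.of());
    }

    public static void main(String[] args) {
        TopNSketch topN = new TopNSketch(2);
        topN.add("ACME", 10.0);
        topN.add("ACME", 30.0);
        topN.add("ACME", 20.0);
        System.out.println(topN.top("ACME")); // [30.0, 20.0]
    }
}
```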
Row Pattern Recognition in SQL
(ISO/IEC TR 19075-5:2016)
SELECT *
FROM stock_table
MATCH_RECOGNIZE(
    PARTITION BY ticker
    ORDER BY event_time
    MEASURES
        A.event_time AS initialPriceTime,
        C.event_time AS dropTime,
        A.price - C.price AS dropDiff,
        A.price AS initialPrice,
        C.price AS lastPrice
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (A B* C) WITHIN INTERVAL '10' MINUTES
    DEFINE
        B AS B.price > A.price - 500
)
Planner Decoupling
Figure: Flink Planner Migration (from https://www.ververica.com/blog/a-journey-to-beating-flinks-sql-performance)
Planner Optimizations & Query Rewrite
● Predicate push down
● Projection push down
● Join rewrite
● Join elimination
● Constant inlining
● …
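To show why a rewrite such as predicate push down matters, here is a small plain-Java sketch. Both methods return the same rows, but the rewritten one applies the country filter before the join and so performs fewer join comparisons. All names (`PredicatePushDownSketch`, `joinThenFilter`, `filterThenJoin`) are hypothetical; a real planner performs this rewrite on a relational plan, not on hand-written loops.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of predicate push down: both plans return the same
// rows, but the rewritten plan filters Orders before joining.
final class PredicatePushDownSketch {
    static final class Order {
        final int productId;
        final String country;
        Order(int productId, String country) {
            this.productId = productId;
            this.country = country;
        }
    }

    static final class Product {
        final int id;
        final String name;
        Product(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    static int comparisons; // counts join comparisons to make the saving visible

    // Plan as written: evaluate the join for every pair, then apply the filter.
    static List<String> joinThenFilter(List<Order> orders, List<Product> products, String country) {
        List<String> out = new ArrayList<>();
        for (Order o : orders) {
            for (Product p : products) {
                comparisons++;
                if (o.productId == p.id && o.country.equals(country)) {
                    out.add(p.name);
                }
            }
        }
        return out;
    }

    // Rewritten plan: the country predicate is pushed below the join.
    static List<String> filterThenJoin(List<Order> orders, List<Product> products, String country) {
        List<String> out = new ArrayList<>();
        for (Order o : orders) {
            if (!o.country.equals(country)) {
                continue; // filtered out before any join work happens
            }
            for (Product p : products) {
                comparisons++;
                if (o.productId == p.id) {
                    out.add(p.name);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(
                new Order(1, "FRANCE"), new Order(2, "CANADA"), new Order(1, "CANADA"));
        List<Product> products = List.of(new Product(1, "Laptop"), new Product(2, "Phone"));

        comparisons = 0;
        List<String> naive = joinThenFilter(orders, products, "FRANCE");
        System.out.println(naive + " after " + comparisons + " comparisons");

        comparisons = 0;
        List<String> pushed = filterThenJoin(orders, products, "FRANCE");
        System.out.println(pushed + " after " + comparisons + " comparisons");
    }
}
```

With SQL the user writes the query once and the planner chooses the cheaper shape; with an imperative API the author has to notice and apply the rewrite by hand.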
Common Type System
DataStream API:
val postgresSink: SinkFunction[Envelope] = JdbcSink.sink(
  "INSERT INTO table " +
    "(id, number, timestamp, author, difficulty, size, vid, block_range) " +
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?) " +
    "ON CONFLICT (id) DO UPDATE SET " +
    "number = excluded.number, " +
    "timestamp = excluded.timestamp, " +
    "author = excluded.author, " +
    "difficulty = excluded.difficulty, " +
    "size = excluded.size, " +
    "vid = excluded.vid, " +
    "block_range = excluded.block_range " +
    "WHERE excluded.vid > table.vid",
  new JdbcStatementBuilder[Envelope] {
    override def accept(statement: PreparedStatement, record: Envelope): Unit = {
      val payload = record.payload
      payload.id.foreach { id => statement.setString(1, id) }
      payload.number.foreach { number => statement.setBigDecimal(2, new java.math.BigDecimal(number)) }
      payload.timestamp.foreach { timestamp => statement.setBigDecimal(3, new java.math.BigDecimal(timestamp)) }
      payload.author.foreach { author => statement.setString(4, author) }
      payload.difficulty.foreach { difficulty => statement.setBigDecimal(5, new java.math.BigDecimal(difficulty)) }
      payload.size.foreach { size => statement.setBigDecimal(6, new java.math.BigDecimal(size)) }
      payload.vid.foreach { vid => statement.setLong(7, vid.toLong) }
      payload.block_range.foreach { block_range => statement.setObject(8, new PostgresIntRange(block_range), Types.O
    }
  },
SQL API:
CREATE TABLE `table` (
  id BIGINT,
  number INTEGER,
  `timestamp` TIMESTAMP,
  author STRING,
  difficulty STRING,
  size INTEGER,
  vid BIGINT,
  block_range STRING,
  PRIMARY KEY (vid) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'table-name' = 'table'
);
😱
When you start using SQL, you get access to decades of advancements in database design
When NOT to use
● Complex serialization / deserialization logic
● Low-level optimizations, especially with state and timers
● Not always debugging-friendly
Dealing with Complexity
UDFs for heavy lifting
● Calling 3rd-party libraries
● External calls
● Enrichments
Templating
● Control structures
● dbt-style macros and references
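As a sketch of what dbt-style references involve, the plain-Java snippet below replaces `{{ ref('model') }}` placeholders with physical table names before a statement is submitted. The class name `RefTemplateSketch`, the `render` method, and the table name `analytics.market_orders` are assumptions made for illustration; this is not dbt's implementation, which is built on Jinja and also resolves dependencies between models.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of dbt-style references: replaces {{ ref('model') }}
// with the physical table name registered for that model.
final class RefTemplateSketch {
    private static final Pattern REF =
            Pattern.compile("\\{\\{\\s*ref\\('([A-Za-z0-9_]+)'\\)\\s*\\}\\}");

    static String render(String sql, Map<String, String> tables) {
        Matcher m = REF.matcher(sql);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String table = tables.get(m.group(1));
            if (table == null) {
                // Fail fast instead of submitting SQL with a dangling reference.
                throw new IllegalArgumentException("Unknown model: " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(table));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sql = "SELECT symbol FROM {{ ref('market_orders') }}";
        System.out.println(render(sql, Map.of("market_orders", "analytics.market_orders")));
    }
}
```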
Convinced? Let’s use it!
Ways to use
● Structured Statements
● dbt-style Project
● Notebooks
● Managed Runtime
Requirements
● Version control
● Code organization
● Testability
● CI/CD
● Observability
Structured Statements
def revenueByCountry(country: String): Table = {
  tEnv.sqlQuery(
    s"""
       |SELECT name, SUM(revenue) AS totalRevenue
       |FROM Orders
       |WHERE country = '${country}'
       |GROUP BY name""".stripMargin
  )
}
✅ structure
✅ mock/stub for testing
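One way to get the testability mentioned above is to keep statement construction as a pure function, separate from the call that executes it, so a unit test can assert on the generated SQL without a running cluster. The sketch below is a Java rendering of the same idea as `revenueByCountry`; the class name, the `revenueByCountrySql` method, and the validation rule are assumptions. The validation is there because interpolating a raw string into SQL would otherwise allow injection.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: the SQL text is built by a pure function, so it can be
// unit tested (or stubbed) independently of the table environment that runs it.
final class RevenueStatements {
    // Assumed rule for this sketch: country values are letters and spaces only.
    private static final Pattern COUNTRY = Pattern.compile("[A-Za-z ]{1,64}");

    static String revenueByCountrySql(String country) {
        if (country == null || !COUNTRY.matcher(country).matches()) {
            throw new IllegalArgumentException("Unexpected country value: " + country);
        }
        return String.join("\n",
                "SELECT name, SUM(revenue) AS totalRevenue",
                "FROM Orders",
                "WHERE country = '" + country + "'",
                "GROUP BY name");
    }

    public static void main(String[] args) {
        System.out.println(revenueByCountrySql("FRANCE"));
    }
}
```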
Structured Statements
● Treat them like code
● Only make sense when Table API is not available
● Mix with other API flavours
● SQL also has style guides
● Otherwise it’s a typical streaming application!
Structured Statements
● Version control: 🟢
● Code organization: 🟢
● Testability: 🟡
● CI/CD: 🟡
● Observability: 🟢
dbt-style Project
➔ models
  ◆ common
    ● users.sql
    ● users.yml
  ◆ sales.sql
  ◆ sales.yml
  ◆ …
➔ tests
  ◆ …
✅ structured
✅ schematized
✅ testable
dbt-style Project
SELECT
    ((text::jsonb)->>'bid_price')::FLOAT AS bid_price,
    (text::jsonb)->>'order_quantity' AS order_quantity,
    (text::jsonb)->>'symbol' AS symbol,
    (text::jsonb)->>'trade_type' AS trade_type,
    to_timestamp(((text::jsonb)->'timestamp')::BIGINT) AS ts
FROM {{ ref('market_orders_raw') }}

{{ config(materialized='materializedview') }}
SELECT symbol,
       AVG(bid_price) AS avg
FROM {{ ref('market_orders') }}
GROUP BY symbol
dbt-style Project
● Works well for heavy analytical use-cases
● Could write tests in Python/Scala/etc.
● Probably needs more tooling than you think (state management, observability, etc.)
● Check dbt adapter from Materialize!
dbt-style Project
● Version control: 🟢
● Code organization: 🟢
● Testability: 🟡
● CI/CD: 🟡
● Observability: 🟡
Notebooks
Apache Zeppelin
Notebooks
● Great UX
● Ideal for exploratory analysis and BI
● Complements all other patterns really well
● Way more important for realtime workloads
Notebooks
We don't recommend productionizing notebooks and instead encourage empowering data scientists to build production-ready code with the right programming frameworks
https://www.thoughtworks.com/en-ca/radar/techniques/productionizing-notebooks
Notebooks
● Version control: 🟡
● Code organization: 🔴
● Testability: 🔴
● CI/CD: 🔴
● Observability: 🔴
Managed Runtime
decodable
Managed Runtime
● Managed ≈ “Serverless”
● Auto-scaling
● Automated deployments, rollbacks, etc.
● Testing for different layers is decoupled (runtime vs jobs)
Managed Runtime
Reference Architecture
● Clients: UI, CLI
● Control Plane: API, Reconciler
● Data Plane: Streaming Job
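The slide does not spell out what the Reconciler does, so the following is only a sketch under a common assumption: the control plane stores the desired version of each job, and a reconciler compares it with what is running in the data plane and derives the actions needed to converge. `ReconcilerSketch`, the action strings, and the version-number model are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of a control-plane reconciler: compares desired job
// versions with the versions actually running and lists the actions to converge.
final class ReconcilerSketch {
    static List<String> reconcile(Map<String, Integer> desired, Map<String, Integer> actual) {
        List<String> actions = new ArrayList<>();
        // Visit jobs in name order so the result is deterministic.
        TreeSet<String> jobs = new TreeSet<>(desired.keySet());
        jobs.addAll(actual.keySet());
        for (String job : jobs) {
            Integer want = desired.get(job);
            Integer have = actual.get(job);
            if (have == null) {
                actions.add("START " + job + " v" + want);
            } else if (want == null) {
                actions.add("STOP " + job);
            } else if (!want.equals(have)) {
                actions.add("UPGRADE " + job + " v" + have + " -> v" + want);
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        Map<String, Integer> desired = Map.of("sales", 2, "shipping", 1);
        Map<String, Integer> actual = Map.of("sales", 1, "marketing", 1);
        System.out.println(reconcile(desired, actual));
    }
}
```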
Any managed runtime requires excellent developer experience to succeed
Managed Runtime: Ideal Developer Experience
● Notebooks UX
● Version Control Integration
● dbt-style Project Structure
  ➔ models
    ◆ common
    ◆ sales
    ◆ shipping
    ◆ marketing
    ◆ …
● Versioning
  ◆ Version 1
  ◆ Version 2
  ◆ Version 3
  ◆ …
● Previews
  User    Count
  Irene   100
  Alex    53
  Josh    12
  Jane    1
Managed Runtime
● Version control: 🟢
● Code organization: 🟢
● Testability: 🟡
● CI/CD: 🟢
● Observability: 🟢
Summary
                    Structured Statements   dbt-style Project   Notebooks   Managed Runtime
Version Control     🟢                      🟢                  🟡          🟢
Code Organization   🟢                      🟢                  🔴          🟢
Testability         🟡                      🟡                  🔴          🟡
CI/CD               🟡                      🟡                  🔴          🟢
Observability       🟢                      🟡                  🔴          🟢
Complexity          🟢                      🟡                  🟡          🔴
General Guidelines
● Long-running streaming apps require special attention to state management
● Try to avoid mutability: every change is a new version
● Integration testing > unit testing
● Embrace the SRE mentality
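The "every change is a new version" guideline can be sketched as an append-only store: publishing never edits an existing statement, and a rollback simply re-activates an earlier version. The class and method names below (`StatementVersionsSketch`, `publish`, `activate`, `activeSql`) are hypothetical and the storage is in-memory only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: statements are stored append-only, so every change
// becomes a new version and a rollback is just re-activating an older one.
final class StatementVersionsSketch {
    private final List<String> versions = new ArrayList<>();
    private int active = 0; // 0 means nothing is deployed yet

    // Appends a new version and returns its 1-based version number.
    int publish(String sql) {
        versions.add(sql);
        return versions.size();
    }

    // Points the deployment at an existing version without modifying it.
    void activate(int version) {
        if (version < 1 || version > versions.size()) {
            throw new IllegalArgumentException("Unknown version: " + version);
        }
        active = version;
    }

    String activeSql() {
        if (active == 0) {
            throw new IllegalStateException("Nothing deployed");
        }
        return versions.get(active - 1);
    }

    public static void main(String[] args) {
        StatementVersionsSketch store = new StatementVersionsSketch();
        int v1 = store.publish("SELECT name FROM Orders");
        int v2 = store.publish("SELECT name, country FROM Orders");
        store.activate(v2);
        store.activate(v1); // rollback: version 1 is still intact
        System.out.println(store.activeSql());
    }
}
```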
Really dislike SQL?
Malloy PRQL
Questions?
@sap1ens

More Related Content

What's hot

Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
larsgeorge
 

What's hot (20)

Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Airbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stackAirbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stack
 

Similar to Streaming SQL for Data Engineers: The Next Big Thing?

Similar to Streaming SQL for Data Engineers: The Next Big Thing? (20)

Shaping serverless architecture with domain driven design patterns - py web-il
Shaping serverless architecture with domain driven design patterns - py web-ilShaping serverless architecture with domain driven design patterns - py web-il
Shaping serverless architecture with domain driven design patterns - py web-il
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 
Sprint 58
Sprint 58Sprint 58
Sprint 58
 
Serverless in-action
Serverless in-actionServerless in-action
Serverless in-action
 
Shaping serverless architecture with domain driven design patterns
Shaping serverless architecture with domain driven design patternsShaping serverless architecture with domain driven design patterns
Shaping serverless architecture with domain driven design patterns
 
Shaping serverless architecture with domain driven design patterns
Shaping serverless architecture with domain driven design patternsShaping serverless architecture with domain driven design patterns
Shaping serverless architecture with domain driven design patterns
 
Google Cloud Dataflow
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud Dataflow
 
Sprint 45 review
Sprint 45 reviewSprint 45 review
Sprint 45 review
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
Sprint 55
Sprint 55Sprint 55
Sprint 55
 
Advanced Code Flow, Notes From the Field
Advanced Code Flow, Notes From the FieldAdvanced Code Flow, Notes From the Field
Advanced Code Flow, Notes From the Field
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
 
Modular Web Applications With Netzke
Modular Web Applications With NetzkeModular Web Applications With Netzke
Modular Web Applications With Netzke
 
How Level Infinite Implemented CQRS and Event Sourcing on Top of Apache Pulsa...
How Level Infinite Implemented CQRS and Event Sourcing on Top of Apache Pulsa...How Level Infinite Implemented CQRS and Event Sourcing on Top of Apache Pulsa...
How Level Infinite Implemented CQRS and Event Sourcing on Top of Apache Pulsa...
 
GraphQL the holy contract between client and server
GraphQL the holy contract between client and serverGraphQL the holy contract between client and server
GraphQL the holy contract between client and server
 
Sprint 59
Sprint 59Sprint 59
Sprint 59
 
Revealing ALLSTOCKER
Revealing ALLSTOCKERRevealing ALLSTOCKER
Revealing ALLSTOCKER
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentation
 
stackconf 2020 | The path to a Serverless-native era with Kubernetes by Paolo...
stackconf 2020 | The path to a Serverless-native era with Kubernetes by Paolo...stackconf 2020 | The path to a Serverless-native era with Kubernetes by Paolo...
stackconf 2020 | The path to a Serverless-native era with Kubernetes by Paolo...
 

More from Yaroslav Tkachenko

Dynamic Change Data Capture with Flink CDC and Consistent Hashing
Dynamic Change Data Capture with Flink CDC and Consistent HashingDynamic Change Data Capture with Flink CDC and Consistent Hashing
Dynamic Change Data Capture with Flink CDC and Consistent Hashing
Yaroslav Tkachenko
 
Быстрая и безболезненная разработка клиентской части веб-приложений
Быстрая и безболезненная разработка клиентской части веб-приложенийБыстрая и безболезненная разработка клиентской части веб-приложений
Быстрая и безболезненная разработка клиентской части веб-приложений
Yaroslav Tkachenko
 

More from Yaroslav Tkachenko (18)

Dynamic Change Data Capture with Flink CDC and Consistent Hashing
Dynamic Change Data Capture with Flink CDC and Consistent HashingDynamic Change Data Capture with Flink CDC and Consistent Hashing
Dynamic Change Data Capture with Flink CDC and Consistent Hashing
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at Shopify
 
Storing State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your Analytics
 
It's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureIt's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda Architecture
 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to StreamingBravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
 
Apache Kafka: New Features That You Might Not Know About
Apache Kafka: New Features That You Might Not Know AboutApache Kafka: New Features That You Might Not Know About
Apache Kafka: New Features That You Might Not Know About
 
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lesson...
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lesson...Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lesson...
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lesson...
 
Designing Scalable and Extendable Data Pipeline for Call Of Duty Games
Designing Scalable and Extendable Data Pipeline for Call Of Duty GamesDesigning Scalable and Extendable Data Pipeline for Call Of Duty Games
Designing Scalable and Extendable Data Pipeline for Call Of Duty Games
 
10 tips for making Bash a sane programming language
10 tips for making Bash a sane programming language10 tips for making Bash a sane programming language
10 tips for making Bash a sane programming language
 
Actors or Not: Async Event Architectures
Actors or Not: Async Event ArchitecturesActors or Not: Async Event Architectures
Actors or Not: Async Event Architectures
 
Kafka Streams: the easiest way to start with stream processing
Kafka Streams: the easiest way to start with stream processingKafka Streams: the easiest way to start with stream processing
Kafka Streams: the easiest way to start with stream processing
 
Building Stateful Microservices With Akka
Building Stateful Microservices With AkkaBuilding Stateful Microservices With Akka
Building Stateful Microservices With Akka
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
 
Akka Microservices Architecture And Design
Akka Microservices Architecture And DesignAkka Microservices Architecture And Design
Akka Microservices Architecture And Design
 
Why Actor-Based Systems Are The Best For Microservices
Why Actor-Based Systems Are The Best For MicroservicesWhy Actor-Based Systems Are The Best For Microservices
Why Actor-Based Systems Are The Best For Microservices
 
Why actor-based systems are the best for microservices
Why actor-based systems are the best for microservicesWhy actor-based systems are the best for microservices
Why actor-based systems are the best for microservices
 
Building Eventing Systems for Microservice Architecture
Building Eventing Systems for Microservice Architecture  Building Eventing Systems for Microservice Architecture
Building Eventing Systems for Microservice Architecture
 
Быстрая и безболезненная разработка клиентской части веб-приложений
Быстрая и безболезненная разработка клиентской части веб-приложенийБыстрая и безболезненная разработка клиентской части веб-приложений
Быстрая и безболезненная разработка клиентской части веб-приложений
 

Recently uploaded

Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 

Recently uploaded (20)

7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 

Streaming SQL for Data Engineers: The Next Big Thing?

  • 1. Streaming SQL for Data Engineers: The Next Big Thing?
  • 2.
  • 4. ● Apache Flink ● Apache Spark ● Apache Beam ● AWS Kinesis ● Google Cloud Dataflow ● Databricks ● ksqlDB ● … ● Meta ● LinkedIn ● Pinterest ● DoorDash ● Alibaba ● … Companies building internal platforms Open source and vendor solutions
  • 5.
  • 6. 👋 Hi, I’m Yaroslav
  • 7. 👋 Hi, I’m Yaroslav ● Principal Software Engineer @ Goldsky ● Staff Data Engineer @ Shopify ● Software Architect @ Activision ● …
  • 8. 👋 Hi, I’m Yaroslav ● Principal Software Engineer @ Goldsky ● Staff Data Engineer @ Shopify ● Software Architect @ Activision ● … ❤ Apache Flink
  • 9. 🤔 TableEnvironment tableEnv = TableEnvironment.create(/*…*/); Table revenue = tableEnv.sqlQuery( "SELECT cID, cName, SUM(revenue) AS revSum " + "FROM Orders " + "WHERE cCountry = 'FRANCE' " + "GROUP BY cID, cName" );
  • 10. … but why SQL?
  • 11. Why SQL? ● Wide adoption ● Declarative transformation model ● Planner! ● Common type system
  • 15. SELECT * FROM Orders INNER JOIN Product ON Orders.productId = Product.id ● LOTS of code! ● Create an operator to connect two streams ● Define and accumulate state ● Implement a mechanism for emitting the latest value per key SQL API DataStream API Declarative Transformation Model
  • 16. SELECT * FROM Orders INNER JOIN Product ON Orders.productId = Product.id SQL API Why not Table API? val orders = tEnv.from("Orders") .select($"productId", $"a", $"b") val products = tEnv.from("Products") .select($"id", $"c", $"d") val result = orders .join(products) .where($"productId" === $"id") .select($"a", $"b", $"c") Declarative Transformation Model
  • 17. SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY ticker ORDER BY price DESC) AS row_num FROM stock_table) WHERE row_num <= 10; Top-N Query Declarative Transformation Model
  • 18. Row Pattern Recognition in SQL (ISO/IEC TR 19075-5:2016)
       SELECT * FROM stock_table
       MATCH_RECOGNIZE(
         PARTITION BY ticker
         ORDER BY event_time
         MEASURES
           A.event_time AS initialPriceTime,
           C.event_time AS dropTime,
           A.price - C.price AS dropDiff,
           A.price AS initialPrice,
           C.price AS lastPrice
         ONE ROW PER MATCH
         AFTER MATCH SKIP PAST LAST ROW
         PATTERN (A B* C) WITHIN INTERVAL '10' MINUTES
         DEFINE
           B AS B.price > A.price - 500
       )
  • 19. Planner Decoupling: Flink Planner Migration (from https://www.ververica.com/blog/a-journey-to-beating-flinks-sql-performance)
  • 20. Planner Optimizations & Query Rewrite ● Predicate push down ● Projection push down ● Join rewrite ● Join elimination ● Constant inlining ● …
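
Predicate push down, the first rewrite listed on this slide, can be illustrated with a toy logical plan in Python. The tuple-based plan format and the `push_down_filter` function are assumptions made for this sketch and are unrelated to Flink's actual planner. The sketch also assumes inner joins, equality filters on a single column, and column names that are unique across tables.

```python
# Hypothetical plan nodes:
#   ("scan", table_name, column_names)
#   ("join", left_plan, right_plan, condition_text)   # inner join
#   ("filter", child_plan, column, value)             # column = value


def columns_of(plan):
    """Collect the column names a plan node produces."""
    kind = plan[0]
    if kind == "scan":
        return set(plan[2])
    if kind == "filter":
        return columns_of(plan[1])
    if kind == "join":
        return columns_of(plan[1]) | columns_of(plan[2])
    raise ValueError(f"unknown node: {kind}")


def push_down_filter(plan):
    """Move filters below inner joins when they reference only one input."""
    kind = plan[0]
    if kind == "scan":
        return plan
    if kind == "join":
        return ("join", push_down_filter(plan[1]), push_down_filter(plan[2]), plan[3])
    if kind != "filter":
        raise ValueError(f"unknown node: {kind}")
    _, child, column, value = plan
    child = push_down_filter(child)
    if child[0] == "join":
        _, left, right, cond = child
        if column in columns_of(left):
            return ("join", push_down_filter(("filter", left, column, value)), right, cond)
        if column in columns_of(right):
            return ("join", left, push_down_filter(("filter", right, column, value)), cond)
    return ("filter", child, column, value)
```

After the rewrite the filter runs before the join, so the join sees fewer rows and, in a streaming setting, accumulates less state.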
  • 21. Common Type System
       DataStream API:
           val postgresSink: SinkFunction[Envelope] = JdbcSink.sink(
             "INSERT INTO table " +
               "(id, number, timestamp, author, difficulty, size, vid, block_range) " +
               "VALUES (?, ?, ?, ?, ?, ?, ?, ?) " +
               "ON CONFLICT (id) DO UPDATE SET " +
               "number = excluded.number, " +
               "timestamp = excluded.timestamp, " +
               "author = excluded.author, " +
               "difficulty = excluded.difficulty, " +
               "size = excluded.size, " +
               "vid = excluded.vid, " +
               "block_range = excluded.block_range " +
               "WHERE excluded.vid > table.vid",
             new JdbcStatementBuilder[Envelope] {
               override def accept(statement: PreparedStatement, record: Envelope): Unit = {
                 val payload = record.payload
                 payload.id.foreach { id => statement.setString(1, id) }
                 payload.number.foreach { number => statement.setBigDecimal(2, new java.math.BigDecimal(number)) }
                 payload.timestamp.foreach { timestamp => statement.setBigDecimal(3, new java.math.BigDecimal(timestamp)) }
                 payload.author.foreach { author => statement.setString(4, author) }
                 payload.difficulty.foreach { difficulty => statement.setBigDecimal(5, new java.math.BigDecimal(difficulty)) }
                 payload.size.foreach { size => statement.setBigDecimal(6, new java.math.BigDecimal(size)) }
                 payload.vid.foreach { vid => statement.setLong(7, vid.toLong) }
                 payload.block_range.foreach { block_range => statement.setObject(8, new PostgresIntRange(block_range), Types.O
               }
             },
       SQL API:
           CREATE TABLE TABLE (
             id BIGINT,
             number INTEGER,
             timestamp TIMESTAMP,
             author STRING,
             difficulty STRING,
             size INTEGER,
             vid BIGINT,
             block_range STRING,
             PRIMARY KEY (vid) NOT ENFORCED
           ) WITH (
             'connector' = 'jdbc',
             'table-name' = 'table'
           );
       😱
  • 22. When you start using SQL, you get access to decades of advancements in database design
  • 23. When NOT to use ● Complex serialization / deserialization logic ● Low-level optimizations, especially with state and timers ● Not always debugging-friendly
  • 24. Dealing with Complexity
       UDFs for heavy lifting: ● Calling 3rd-party libraries ● External calls ● Enrichments
       Templating: ● Control structures ● dbt-style macros and references
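
As a small illustration of the UDF idea, the sketch below registers a Python function and calls it from SQL. It uses SQLite from the Python standard library purely as a stand-in, since a streaming engine would register functions through its own API. The `mask_email` function and the `users` table are hypothetical.

```python
import sqlite3


def mask_email(value):
    """Hypothetical enrichment: hide most of the local part of an e-mail address."""
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name mask_email (one argument).
conn.create_function("mask_email", 1, mask_email)
conn.execute("CREATE TABLE users (email TEXT)")
conn.execute("INSERT INTO users VALUES ('irene@example.com')")

masked = conn.execute("SELECT mask_email(email) FROM users").fetchone()[0]
```

The query stays declarative while the imperative logic lives in one function that can be unit tested on its own.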
  • 26. Ways to use ● Structured Statements ● dbt-style Project ● Notebooks ● Managed Runtime
  • 27. Requirements ● Version control ● Code organization ● Testability ● CI/CD ● Observability
  • 29. Structured Statements
       def revenueByCountry(country: String): Table = {
         tEnv.sqlQuery(
           s"""
             |SELECT name, SUM(revenue) AS totalRevenue
             |FROM Orders
             |WHERE country = '${country}'
             |GROUP BY name""".stripMargin
         )
       }
       ✅ structure ✅ mock/stub for testing
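
The testing benefit can be sketched in Python with an in-memory SQLite database standing in for the table environment. `revenue_by_country` mirrors the slide's function, while `make_fixture` and its sample rows are hypothetical. Two details differ from the slide and are assumptions of this sketch: the country is passed as a bound parameter instead of being interpolated into the string, and an `ORDER BY` is added so the result is deterministic.

```python
import sqlite3


def revenue_by_country(conn, country):
    """One named, parameterized statement wrapped in a function."""
    return conn.execute(
        """
        SELECT name, SUM(revenue) AS totalRevenue
        FROM Orders
        WHERE country = ?
        GROUP BY name
        ORDER BY name
        """,
        (country,),
    ).fetchall()


def make_fixture():
    """Hypothetical stub: a tiny Orders table that replaces the real source in tests."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Orders (name TEXT, country TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO Orders VALUES (?, ?, ?)",
        [
            ("Alex", "FRANCE", 10.0),
            ("Alex", "FRANCE", 5.0),
            ("Jane", "FRANCE", 1.0),
            ("Josh", "CANADA", 7.0),
        ],
    )
    return conn
```

Because the statement is an ordinary function, a test can swap the source for a fixture and assert on the rows it returns.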
  • 30. Structured Statements ● Treat them like code ● Only make sense when Table API is not available ● Mix with other API flavours ● SQL also has style guides ● Otherwise it’s a typical streaming application!
  • 31. Structured Statements ● Version control: 🟢 ● Code organization: 🟢 ● Testability: 🟡 ● CI/CD: 🟡 ● Observability: 🟢
  • 33. dbt-style Project ➔ models ◆ common ● users.sql ● users.yml ◆ sales.sql ◆ sales.yml ◆ … ➔ tests ◆ … ✅ structured ✅ schematized ✅ testable
  • 34. dbt-style Project
       SELECT
         ((text::jsonb)->>'bid_price')::FLOAT AS bid_price,
         (text::jsonb)->>'order_quantity' AS order_quantity,
         (text::jsonb)->>'symbol' AS symbol,
         (text::jsonb)->>'trade_type' AS trade_type,
         to_timestamp(((text::jsonb)->'timestamp')::BIGINT) AS ts
       FROM {{ ref('market_orders_raw') }}
       {{ config(materialized='materializedview') }}
       SELECT symbol, AVG(bid_price) AS avg
       FROM {{ ref('market_orders') }}
       GROUP BY symbol
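
The references between models are what let a dbt-style tool work out the order in which statements must be created. The Python sketch below shows that idea only: it extracts `ref('...')` calls with a regular expression and sorts models topologically. It is not how dbt itself is implemented; `build_order`, `render` and the `avg_bid` model name are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def build_order(models):
    """Return model names so that every model comes after the models it references."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


def render(sql):
    """Replace each ref(...) call with the plain relation name."""
    return REF_PATTERN.sub(lambda match: match.group(1), sql)


models = {
    "avg_bid": "SELECT symbol, AVG(bid_price) AS avg "
               "FROM {{ ref('market_orders') }} GROUP BY symbol",
    "market_orders": "SELECT * FROM {{ ref('market_orders_raw') }}",
}
order = build_order(models)
```

Here `order` lists `market_orders_raw` first and `avg_bid` last, which is the sequence in which the relations would need to exist.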
  • 35. dbt-style Project ● Works well for heavy analytical use-cases ● Could write tests in Python/Scala/etc. ● Probably needs more tooling than you think (state management, observability, etc.) ● Check dbt adapter from Materialize!
  • 36. dbt-style Project ● Version control: 🟢 ● Code organization: 🟢 ● Testability: 🟡 ● CI/CD: 🟡 ● Observability: 🟡
  • 39. Notebooks ● Great UX ● Ideal for exploratory analysis and BI ● Complements all other patterns really well ● Way more important for realtime workloads
  • 40. Notebooks “We don't recommend productionizing notebooks and instead encourage empowering data scientists to build production-ready code with the right programming frameworks” https://www.thoughtworks.com/en-ca/radar/techniques/productionizing-notebooks
  • 41. Notebooks ● Version control: 🟡 ● Code organization: 🔴 ● Testability: 🔴 ● CI/CD: 🔴 ● Observability: 🔴
  • 43. Managed Runtime ● Managed ≈ “Serverless” ● Auto-scaling ● Automated deployments, rollbacks, etc. ● Testing for different layers is decoupled (runtime vs jobs)
  • 44. Managed Runtime: Reference Architecture [diagram with components: Control Plane, Data Plane, API, Reconciler, Streaming Job, UI, CLI]
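
The slide only names the components, so the following Python sketch is an assumption about what a reconciler in such a control plane typically does: compare the desired set of jobs with what is running and derive actions. The `reconcile` function, the action names and the job names are all hypothetical.

```python
def reconcile(desired, actual):
    """Return the actions needed to move `actual` towards `desired`.

    Both arguments map a job name to the version that should run / is running.
    """
    actions = []
    for name, version in sorted(desired.items()):
        if name not in actual:
            actions.append(("start", name, version))
        elif actual[name] != version:
            actions.append(("upgrade", name, version))
    # Anything running that is no longer desired gets stopped.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("stop", name, actual[name]))
    return actions
```

Running such a function in a loop makes deployments and rollbacks the same operation: change the desired state and let the loop converge.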
  • 45. Any managed runtime requires excellent developer experience to succeed
  • 46. Managed Runtime: Ideal Developer Experience: Notebooks UX (SELECT * … SELECT * …)
  • 47. Managed Runtime: Ideal Developer Experience: Version Control Integration (SELECT * … SELECT * …)
  • 48. Managed Runtime: Ideal Developer Experience: dbt-style Project Structure (SELECT * … SELECT * …) ➔ models ◆ common ◆ sales ◆ shipping ◆ marketing ◆ …
  • 49. Managed Runtime: Ideal Developer Experience: Versioning (SELECT * … SELECT * …) ● Version 1 ● Version 2 ● Version 3 ● …
  • 50. Managed Runtime: Ideal Developer Experience: Previews (SELECT * … SELECT * …) User / Count: Irene 100, Alex 53, Josh 12, Jane 1
  • 51. Managed Runtime ● Version control: 🟢 ● Code organization: 🟢 ● Testability: 🟡 ● CI/CD: 🟢 ● Observability: 🟢
  • 52. Summary
                          | Structured Statements | dbt-style Project | Notebooks | Managed Runtime
       Version Control    | 🟢 | 🟢 | 🟡 | 🟢
       Code Organization  | 🟢 | 🟢 | 🔴 | 🟢
       Testability        | 🟡 | 🟡 | 🔴 | 🟡
       CI/CD              | 🟡 | 🟡 | 🔴 | 🟢
       Observability      | 🟢 | 🟡 | 🔴 | 🟢
       Complexity         | 🟢 | 🟡 | 🟡 | 🔴
  • 53. General Guidelines ● Long-running streaming apps require special attention to state management ● Try to avoid mutability: every change is a new version ● Integration testing > unit testing ● Embrace the SRE mentality
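
One way to read "every change is a new version" is an append-only registry of statements, sketched below in Python. `StatementRegistry`, `publish` and `get` are hypothetical names, and the content-hash check is an assumption of this sketch rather than something the slides prescribe.

```python
import hashlib


class StatementRegistry:
    """Append-only store: publishing changed SQL adds a version, nothing is overwritten."""

    def __init__(self):
        # name -> list of (version_number, digest, sql), oldest first
        self._versions = {}

    def publish(self, name, sql):
        """Store `sql` under `name`; return the version number that now holds it."""
        digest = hashlib.sha256(sql.encode("utf-8")).hexdigest()[:12]
        history = self._versions.setdefault(name, [])
        if history and history[-1][1] == digest:
            return history[-1][0]  # unchanged text: keep the current version
        history.append((len(history) + 1, digest, sql))
        return len(history)

    def get(self, name, version):
        """Fetch the exact SQL text of an earlier or current version."""
        return self._versions[name][version - 1][2]
```

Keeping old versions addressable makes a rollback a matter of pointing the job at an earlier version number, although whether its state is still compatible is a separate question.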