Advances in Stream Analytics:
Google Cloud Dataflow and Apache Beam
Kyiv, October 5th, 2019
Sergei Sokolenko
Google
Your choices for doing Streaming Processing in Google Cloud
Separating State Storage from Compute
Autoscaling
Making Streaming Easy
Session overview
Google Cloud Platform: Our global infrastructure
[Figure: world map of the Google Cloud network showing submarine cables, edge points of presence, CDN nodes, Dedicated Interconnect locations, and current and future regions with their number of zones]
Submarine cables: Unity (US, JP) 2010; SJC (JP, HK, SG) 2013; Faster (US, JP, TW) 2016; Monet (US, BR) 2017; Junior (Rio, Santos) 2018; Tannat (BR, UY, AR) 2018; PLCN (HK, LA) 2019; Indigo (SG, ID, AU) 2019; HK-G (HK, GU) 2019; JGA (AU, GU, JP) 2019; Curie (CL, US) 2019; Havfrue (US, IE, DK) 2019; Dunant (US, FR) 2020
Regions shown: Mumbai, Singapore, Jakarta, Sydney, Tokyo, Osaka, Hong Kong, Taiwan, Seoul, Oregon, Los Angeles, Salt Lake City, Iowa, S. Carolina, N. Virginia, Montréal, São Paulo, Finland, Frankfurt, Zurich, Belgium, London, Netherlands
A comprehensive Big Data platform, not just infrastructure
Capabilities: data ingestion at any scale, reliable streaming data pipeline, advanced analytics, data warehousing and data lake
Products shown: Apache Beam, Cloud Pub/Sub, Cloud Dataflow, Cloud Dataproc, BigQuery, Cloud Storage, Data Transfer Service, Cloud Composer, Cloud IoT Core, Cloud Dataprep, Cloud AI Services, Google Data Studio, TensorFlow, Sheets, Storage Transfer Service, Data Catalog, Data Fusion
Google’s data processing timeline
Timeline axis: 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016
Systems along the timeline: GFS, MapReduce, Bigtable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow, Apache Beam
Why FlumeJava
MapReduce
MAP MAP MAP MAP MAP
RED RED RED
(K,V)
(K,V*)
(K,W)
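The (K,V) to (K,V*) to (K,W) flow on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration of the three phases, not how MapReduce or FlumeJava is implemented; the names `map_reduce`, `count_words`, and `total` are made up for the example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a single in-memory map, shuffle, and reduce pass."""
    groups = defaultdict(list)
    for record in records:
        # MAP: each input record yields zero or more (K, V) pairs.
        for key, value in mapper(record):
            # SHUFFLE: collect every value of a key into one (K, V*) group.
            groups[key].append(value)
    # REDUCE: each (K, V*) group collapses into a single (K, W) result.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(line):
    # Hypothetical mapper: emit (word, 1) for every word in a line.
    for word in line.split():
        yield word.lower(), 1


def total(key, values):
    # Hypothetical reducer: sum the values collected for one key.
    return sum(values)


lines = ["beam dataflow beam", "dataflow shuffle", "beam"]
word_counts = map_reduce(lines, count_words, total)
```

Chaining many such passes by hand is what leads to pipelines with dozens of stages, which is the motivation for the DAG abstraction that follows.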
MapReduce can quickly get out of hand
One Google pipeline had
116 stages!
DAGs offer a better abstraction from execution
Filter
Filter
Join
Group
Filter
Filter
fs://
Database
fs://
Database
By 2025, more than a quarter of data created in the global datasphere will be real time in nature. *IDC
[Figure: three events with event time 8:00 arriving at different points on a processing-time axis that runs from 8:00 to 14:00]
Data Streams and Late Arriving Data
Goal: Grouping by Event Time into Time Windows
[Figure: input elements plotted on an event-time axis and a processing-time axis, each running from 9:00 to 14:00, with the output grouped into windows by event time]
MillWheel - low-latency, accurate data-processing
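A minimal sketch of why grouping by event time differs from grouping by processing time, assuming one-hour fixed windows and a handful of made-up events (the window size, field names, and values are all assumptions for illustration):

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # one-hour fixed windows (an assumption for this sketch)


def clock(hour, minute=0):
    """Seconds since midnight for a wall-clock time."""
    return hour * 3600 + minute * 60


def sum_per_window(events, time_field):
    """Sum values per fixed window, keyed by the window's start time."""
    totals = defaultdict(int)
    for event in events:
        timestamp = event[time_field]
        window_start = timestamp - timestamp % WINDOW_SECONDS
        totals[window_start] += event["value"]
    return dict(totals)


# Hypothetical events: two of the 8:00-hour events arrive late.
events = [
    {"event_time": clock(8, 10), "processing_time": clock(8, 11), "value": 5},
    {"event_time": clock(8, 50), "processing_time": clock(9, 5), "value": 7},
    {"event_time": clock(8, 30), "processing_time": clock(13, 0), "value": 3},
    {"event_time": clock(9, 20), "processing_time": clock(9, 21), "value": 4},
]

by_event_time = sum_per_window(events, "event_time")
by_processing_time = sum_per_window(events, "processing_time")
```

Grouped by event time, all three 8:00-hour events land in the same window regardless of when they arrived; grouped by processing time, they are scattered across the 8:00, 9:00, and 13:00 windows.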
Common steps in Stream Analytics
End-user
apps
Cloud Composer
Orchestrate
IoT
Events
Cloud Pub/Sub Dataflow Streaming
DBs
Ingest & distribute
Aggregate,
enrich, detect
Backfill,
reprocess
Cloud AI
Platform
Bigtable Dataflow Batch
Action
Reference architecture of Streaming Processing in GCP
BigQuery
BigQuery Streaming API
Machine Learning
Data Warehousing
What is Beam and Dataflow?
Open source programming model
Unified batch and streaming
Top Apache project by dev@ activity
Runner and language portability
Cloud
Dataflow
Automatic optimizations scale to millions of QPS
Serverless, fully managed data processing
State storage in Shuffle and Streaming Engine
Exactly-once streaming semantics
SDK
The Beam Vision
input.apply(Sum.integersPerKey())
Java
input | Sum.PerKey()
Python
stats.Sum(s, input)
Go
SELECT key, SUM(value)
FROM input GROUP BY key
SQL
Cloud Dataflow
Apache Spark
Apache Flink
Apache Apex
Gearpump
Apache Samza
Apache Nemo
(incubating)
IBM Streams
Sum Per Key
Lessons Learned While Building Cloud Dataflow
● Separating compute from state storage
● Automatic scaling
● Building Streaming systems can be hard, but it does not have to be
Separating compute from state storage to improve scalability
Traditional Distributed Data Processing Architecture
User code
VM
User code
VM
User code
VM
User code
VM
State storage
● Jobs executed on
clusters of VMs
● Job state stored on
network-attached
volumes
● Control plane
orchestrates data plane
Network
Control plane
VM
State storage State storage State storage
Traditional Architecture works well ...
Filter
Filter
Join
Group
Filter
Filter
fs://
Database
fs://
Database
… except for Joins and
Group By’s
Shuffling key-value pairs
● Unsorted Data Elements
<key1, record>
<key5, record>
<key3, record>
<key8, record>
<key4, record>
...
<key5, record>
<key5, record>
<key2, record>
<key3, record>
<key8, record>
...
<key3, record>
<key3, record>
<key8, record>
<key3, record>
<key6, record>
...
<key2, record>
<key1, record>
<key5, record>
<key8, record>
<key4, record>
...
● Unsorted data elements
● Goal: sort data elements
by key
<key1, record>
<key1, record>
<key2, record>
<key2, record>
<key2, record>
...
<key3, record>
<key3, record>
<key3, record>
<key3, record>
<key3, record>
<key4, record>
...
<key5, record>
<key5, record>
<key5, record>
<key5, record>
<key6, record>
...
<key7, record>
<key8, record>
<key8, record>
<key8, record>
...
Shuffling key-value pairs
● Unsorted data elements
● Goal: sort data elements
by key
● KV pairs need to be
exchanged between
nodes
● Until everything is sorted
Shuffling key-value pairs
[Figure: the sorted data elements split across four nodes by key range: key1 to key2, key3 to key4, key5 to key6, key7 to key8]
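The shuffle described on these slides can be sketched as a range-partitioned exchange: every node sends each key-value pair to the node that owns the key's range, and each node then sorts what it received. The split points below mirror the four key ranges on the slide; the function names and the in-memory "nodes" are assumptions made for illustration.

```python
from bisect import bisect_right

# Split points between the four key ranges shown on the slide:
# node 0 owns key1-key2, node 1 key3-key4, node 2 key5-key6, node 3 key7-key8.
SPLIT_POINTS = ["key3", "key5", "key7"]


def target_node(key):
    """Index of the node whose key range contains this key."""
    return bisect_right(SPLIT_POINTS, key)


def shuffle(nodes):
    """Exchange pairs between nodes, then sort each node's pairs by key."""
    received = [[] for _ in range(len(SPLIT_POINTS) + 1)]
    for local_pairs in nodes:
        for key, record in local_pairs:
            # In a real system this append is a network transfer.
            received[target_node(key)].append((key, record))
    return [sorted(pairs, key=lambda pair: pair[0]) for pairs in received]


# Two nodes holding unsorted pairs, following the first slide of the sequence.
unsorted_nodes = [
    [("key1", "a"), ("key5", "b"), ("key3", "c"), ("key8", "d"), ("key4", "e")],
    [("key5", "f"), ("key5", "g"), ("key2", "h"), ("key3", "i"), ("key8", "j")],
]
sorted_nodes = shuffle(unsorted_nodes)
sorted_keys = [[key for key, _ in pairs] for pairs in sorted_nodes]
```

Every pair may have to cross the network once, which is why the shuffle step dominates the cost of joins and group-bys.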
Traditional Architecture Requires Manual Tuning
User code
VM
User code
VM
User code
VM
User code
VM
State storage
● When data volumes
exceed dozens of TBs
Network
Control plane
VM
State storage State storage State storage
Distributed in-memory Shuffle in batch Cloud Dataflow
Compute
Petabit
network
Dataflow Shuffle
Region
Zone ‘a’ Zone ‘b’ Zone ‘c’
Distributed
in-memory
file system
Distributed
on-disk
file system
Shuffle
proxy
Autozone placement
No tuning required
Faster Processing
Dataflow Shuffle is usually faster than worker-based shuffle, including worker-based shuffle that uses SSD persistent disks (SSD-PD).
[Chart: runtime of shuffle, in minutes]
Supporting larger datasets
Dataflow Shuffle has been used to shuffle datasets of more than 200 TB.
[Chart: dataset size of shuffle, in TB]
Storing state
What about streaming pipelines?
Streaming shuffle
Just like in batch, need to group and join
streams
Distributed streaming shuffle
Window data elements
Late Arriving Data requires buffering
time window data
Accumulate elements until triggering
conditions occur
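One way to picture "accumulate elements until triggering conditions occur" is a buffer that holds values per event-time window and emits a window's sum once a watermark passes the window's end plus an allowed lateness. This is a simplified sketch, not Dataflow's trigger implementation: it fires each window exactly once, and the class name, the watermark handling, and the numbers are assumptions.

```python
from collections import defaultdict


class WindowBuffer:
    """Buffers values per fixed event-time window until the watermark passes."""

    def __init__(self, window_size, allowed_lateness=0):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.buffers = defaultdict(list)  # window start -> buffered values
        self.watermark = 0

    def _is_expired(self, window_start):
        end = window_start + self.window_size + self.allowed_lateness
        return end <= self.watermark

    def add(self, event_time, value):
        """Buffer a value; return False if its window has already expired."""
        window_start = event_time - event_time % self.window_size
        if self._is_expired(window_start):
            return False  # too late, the window was already emitted
        self.buffers[window_start].append(value)
        return True

    def advance_watermark(self, watermark):
        """Move the watermark forward and emit (window_start, sum) pairs."""
        self.watermark = max(self.watermark, watermark)
        fired = []
        for window_start in sorted(self.buffers):
            if self._is_expired(window_start):
                fired.append((window_start, sum(self.buffers.pop(window_start))))
        return fired


buffer = WindowBuffer(window_size=60, allowed_lateness=30)
buffer.add(10, 1)
buffer.add(50, 2)
buffer.add(70, 5)
first = buffer.advance_watermark(80)    # window [0, 60) is still open
buffer.add(55, 4)                       # late, but within allowed lateness
second = buffer.advance_watermark(95)   # window [0, 60) fires now
too_late = buffer.add(20, 9)            # arrives after the window fired
```

The buffered values are exactly the window state that has to be stored somewhere while a window is open, which is what the next slide is about.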
Goal: Grouping by Event Time into Time Windows
[Figure: input elements plotted on an event-time axis and a processing-time axis, each running from 9:00 to 14:00, with the output grouped into windows by event time]
Even more state to store on disks in streaming
User code
VM
User code
VM
User code
VM
User code
VM
Shuffle data elements
● Key ranges are assigned
to workers
● Data elements of these
keys are stored on
Persistent Disks
State storage State storage State storage State storage
key 0000 ...
… key 1234
key 1235 ...
… key ABC2
key ABC3 ...
… key DEF5
key DEF6 ...
… key GHI2
Time window data
● Also assigned to workers
● When time windows
close, data processed on
workers
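The key-range assignment above can be sketched as a lookup over the upper bound of each worker's range. The bounds are the ones printed on the slide; treating keys as fixed-width strings compared lexicographically, and the function name `worker_for`, are assumptions for this sketch.

```python
from bisect import bisect_left

# Inclusive upper bound of each worker's key range, as shown on the slide:
# worker 0: 0000..1234, worker 1: 1235..ABC2,
# worker 2: ABC3..DEF5, worker 3: DEF6..GHI2.
RANGE_ENDS = ["1234", "ABC2", "DEF5", "GHI2"]


def worker_for(key):
    """Return the index of the worker whose key range contains the key."""
    index = bisect_left(RANGE_ENDS, key)
    if index == len(RANGE_ENDS):
        raise KeyError(f"key {key!r} is beyond the last assigned range")
    return index


assignments = {key: worker_for(key) for key in ["0000", "1234", "1235", "ABC3", "GHI2"]}
```

Because each worker also keeps the state for its ranges on its own disks in this design, changing the number of workers means moving that state along with the range assignments.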
Dataflow Streaming Engine
Benefits
● Better supportability
● Fewer worker resources
● Smoother autoscaling
User code
Streaming engine
Worker
User code
Worker
User code
Worker
User code
Worker
Window state storage Streaming shuffle
Autoscaling: Even better with separate Compute and State Storage
User code
Streaming engine
Worker
User code
Worker
Window state storage Streaming shuffle
Dataflow with Streaming Engine
User code
VM
User code
VM
State storage State storage
key 0000 ...
… key 1234
key 1235 ...
… key ABC2
Dataflow without Streaming Engine
Dataflow with Streaming Engine Dataflow without Streaming Engine
Streaming can be hard, but does not have to be
We’ve set out to make Streaming as accessible as Batch.
Easy Stream Analytics in SQL
[Diagram: Input1 and Input2 feed a Join, followed by a Group by, producing Output]
SELECT input1.*, input2.*
FROM input1 LEFT OUTER JOIN input2
ON input1.Id = input2.Id
Use Dataflow SQL from BigQuery UI:
● Join Pub/Sub Streams with Files or Tables
● Write into BigQuery for dashboarding
● Store Pub/Sub schema in Data Catalog
● Use SQL skills for streaming data processing
Demo
Demo: Streaming Analytics with SQL
Transactions
PubSub
Dataflow BigQuery
SELECT
  sr.sales_region,
  TUMBLE_START("INTERVAL 5 SECOND") AS period_start,
  SUM(tr.payload.amount) AS amount
FROM `pubsub.dataflow-sql.transactions` AS tr
INNER JOIN
  `bigquery.dataflow-sql.opsdb.us_state_salesregions` AS sr
  ON tr.payload.state = sr.state_code
GROUP BY
  sr.sales_region,
  TUMBLE(tr.event_timestamp, "INTERVAL 5 SECOND")
PubSub topic
Streaming SQL
pipeline
Table
Table
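To make the semantics of the demo query concrete, here is a plain-Python sketch of what it computes: an inner join of each transaction against a state-to-region lookup, followed by a sum per sales region and 5-second tumbling window. The sample transactions and the contents of the lookup table are invented for illustration; in the demo they come from the Pub/Sub topic and the BigQuery table.

```python
from collections import defaultdict

WINDOW_SECONDS = 5  # matches "INTERVAL 5 SECOND" in the query

# Hypothetical stand-in for the us_state_salesregions table.
SALES_REGION_BY_STATE = {"CA": "West", "WA": "West", "NY": "East"}

# Hypothetical messages: (event_timestamp in seconds, state, amount).
transactions = [
    (0, "CA", 10.0),
    (3, "NY", 5.0),
    (4, "WA", 2.5),
    (7, "CA", 1.0),
    (8, "TX", 99.0),  # no matching state_code, dropped by the INNER JOIN
]


def sales_by_region(transactions):
    """Sum amounts per (sales_region, period_start) tumbling window."""
    totals = defaultdict(float)
    for event_timestamp, state, amount in transactions:
        region = SALES_REGION_BY_STATE.get(state)
        if region is None:
            continue  # INNER JOIN: skip rows without a matching region
        # TUMBLE: align the event timestamp to the start of its window.
        period_start = event_timestamp - event_timestamp % WINDOW_SECONDS
        totals[(region, period_start)] += amount
    return dict(totals)


result = sales_by_region(transactions)
```

The streaming pipeline produces the same kind of rows continuously, emitting each window's totals as the window closes instead of computing them over a finished list.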
Main takeaways
Google Cloud offers both infrastructure-as-a-service and fully managed services
Separating compute from state storage helps make stream and batch processing scalable
SQL greatly reduces the complexity of Streaming Processing
Thank you!

Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Google Cloud Dataflow deep-dive"

  • 1. Advances in Stream Analytics: Google Cloud Dataflow and Apache Beam. Kyiv, October 5th, 2019. Sergei Sokolenko, Google
  • 2. Session overview: Your choices for doing Streaming Processing in Google Cloud; Separating State Storage from Compute; Autoscaling; Making Streaming Easy
  • 3. Google Cloud Platform: Our global infrastructure. (Map of network and regions.) Submarine cables: PLCN (HK, LA) 2019; Faster (US, JP, TW) 2016; Unity (US, JP) 2010; Dunant (US, FR) 2020; Monet (US, BR) 2017; Junior (Rio, Santos) 2018; Tannat (BR, UY, AR) 2018; SJC (JP, HK, SG) 2013; Indigo (SG, ID, AU) 2019; HK-G (HK, GU) 2019; JGA (AU, GU, JP) 2019; Curie (CL, US) 2019; Havfrue (US, IE, DK) 2019. Legend: Network; Edge points of presence; CDN nodes; Dedicated Interconnect; Current regions and number of zones; Future regions and number of zones. Cities labeled on the map: Mumbai, Singapore, Kuala Lumpur, Sydney, Tokyo, Chennai, Taipei, Seattle, San Francisco, Montréal, Hamburg, Zurich, Madrid, Paris, London, Hong Kong, Osaka, Toronto, Chicago, Los Angeles, Denver, Dallas, Miami, Atlanta, Washington DC, New York, Rio de Janeiro, São Paulo, Buenos Aires, Munich, Milan, Marseille, Amsterdam, Stockholm, Frankfurt. Regions labeled on the map: Mumbai, Singapore, Jakarta, Sydney, Tokyo, Osaka, Hong Kong, Taiwan, Oregon, Los Angeles, Iowa, S. Carolina, N. Virginia, Montréal, São Paulo, Finland, Frankfurt, Zurich, Belgium, London, Netherlands, Seoul, Salt Lake City. Zone counts shown: 3 for most regions, 4 for one.
  • 4. A comprehensive Big Data platform, not just infrastructure. Categories: Data ingestion at any scale; Reliable streaming data pipeline; Advanced analytics; Data warehousing and data lake. Products shown: Apache Beam, Cloud Pub/Sub, Cloud Dataflow, Cloud Dataproc, BigQuery, Cloud Storage, Data Transfer Service, Cloud Composer, Cloud IoT Core, Cloud Dataprep, Cloud AI Services, Google Data Studio, Tensorflow, Sheets, Storage Transfer Service, Data Catalog, Data Fusion
  • 5. Google’s data processing timeline. (Timeline axis: 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016.) Systems shown: MapReduce, GFS, Big Table, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow, Apache Beam
  • 6. Why FlumeJava. (Diagram: MapReduce, with five MAP tasks and three RED tasks; data is labeled (K,V), (K,V*), (K,W).)
  • 7. MapReduce can quickly get out of hand One Google pipeline had 116 stages!
  • 8. DAGs offer a better abstraction from execution. (Diagram: a DAG of Filter, Join, and Group steps connected to fs:// and Database endpoints.)
  • 9. By 2025, more than a quarter of data created in the global datasphere will be real time in nature. *IDC
  • 11. Goal: Grouping by Event Time into Time Windows. (Figure: Input and Output shown against an event-time axis and a processing-time axis, each labeled 9:00 to 14:00.)
  • 12. MillWheel - low-latency, accurate data-processing
  • 13. Common steps in Stream Analytics: reference architecture of Streaming Processing in GCP. Steps: Ingest & distribute; Aggregate, enrich, detect; Backfill, reprocess; Orchestrate; Action. (Diagram labels: IoT, Events, End-user apps, DBs, Cloud Pub/Sub, Dataflow Streaming, Dataflow Batch, Cloud Composer, Cloud AI Platform, Bigtable, BigQuery, BigQuery Streaming API, Machine Learning, Data Warehousing.)
  • 14. What is Beam and Dataflow? Beam (SDK): Open source programming model; Unified batch and streaming; Top Apache project by dev@ activity; Runner and language portability. Cloud Dataflow: Automatic optimizations, scale to millions of QPS; Serverless, fully managed data processing; State storage in Shuffle and Streaming Engine; Exactly-once streaming semantics
  • 15. The Beam Vision: Sum Per Key. Java: input.apply(Sum.integersPerKey()). Python: input | Sum.PerKey(). Go: stats.Sum(s, input). SQL: SELECT key, SUM(value) FROM input GROUP BY key. Runners: Cloud Dataflow, Apache Spark, Apache Flink, Apache Apex, Gearpump, Apache Samza, Apache Nemo (incubating), IBM Streams
  • 16. Lessons Learned While Building Cloud Dataflow ● Separating compute from state storage ● Automatic scaling ● Building Streaming systems can be hard, but it does not have to be
  • 17. Separating compute from state storage to improve scalability
  • 18. Traditional Distributed Data Processing Architecture ● Jobs executed on clusters of VMs ● Job state stored on network-attached volumes ● Control plane orchestrates data plane (Diagram: four user-code VMs, each with its own state storage, connected over the network to a control plane.)
  • 19. Traditional Architecture works well ... except for Joins and Group By’s. (Diagram: a DAG of Filter, Join, and Group steps connected to fs:// and Database endpoints.)
  • 20. Shuffling key-value pairs ● Unsorted Data Elements <key1, record> <key5, record> <key3, record> <key8, record> <key4, record> ... <key5, record> <key5, record> <key2, record> <key3, record> <key8, record> ... <key3, record> <key3, record> <key8, record> <key3, record> <key6, record> ... <key2, record> <key1, record> <key5, record> <key8, record> <key4, record> ...
  • 21. Shuffling key-value pairs ● Unsorted data elements ● Goal: sort data elements by key <key1, record> <key1, record> <key2, record> <key2, record> <key2, record> ... <key3, record> <key3, record> <key3, record> <key3, record> <key3, record> <key4, record> ... <key5, record> <key5, record> <key5, record> <key5, record> <key6, record> ... <key7, record> <key8, record> <key8, record> <key8, record> ...
  • 22. Shuffling key-value pairs ● Unsorted data elements ● Goal: sort data elements by key ● KV pairs need to be exchanged between nodes <key1, record> <key5, record> <key3, record> <key8, record> <key4, record> ... <key5, record> <key5, record> <key2, record> <key3, record> <key8, record> ... <key3, record> <key3, record> <key8, record> <key3, record> <key6, record> ... <key2, record> <key1, record> <key5, record> <key8, record> <key4, record> ...
  • 23. Shuffling key-value pairs ● Unsorted data elements ● Goal: sort data elements by key ● KV pairs need to be exchanged between nodes ● Until everything is sorted <key1, record> <key1, record> <key2, record> <key2, record> <key2, record> ... <key3, record> <key3, record> <key3, record> <key3, record> <key3, record> <key4, record> ... <key5, record> <key5, record> <key5, record> <key5, record> <key6, record> ... <key7, record> <key8, record> <key8, record> <key8, record> ... Key ranges per node: key1, key2; key3, key4; key5, key6; key7, key8
  • 24. Traditional Architecture Requires Manual Tuning ● When data volumes exceed dozens of TBs (Diagram: four user-code VMs, each with its own state storage, connected over the network to a control plane.)
  • 25. Distributed in-memory Shuffle in batch Cloud Dataflow. (Diagram labels: Compute, Petabit network, Dataflow Shuffle, Shuffle proxy, Distributed in-memory file system, Distributed on-disk file system, Autozone placement; a Region containing Zone ‘a’, Zone ‘b’, and Zone ‘c’.)
  • 26. Faster Processing: No tuning required. Dataflow Shuffle is usually faster than worker-based shuffle, including those using SSD-PD. (Chart: Runtime of shuffle, in minutes.)
  • 27. Supporting larger datasets: Shuffle 200TB+. Dataflow Shuffle has been used to shuffle 200TB+ datasets. (Chart: Dataset size of shuffle, in TB.)
  • 28. What about streaming pipelines? Storing state. Streaming shuffle: just like in batch, need to group and join streams (distributed streaming shuffle). Window data elements: accumulate elements until triggering conditions occur; late-arriving data requires buffering time window data
  • 29. Goal: Grouping by Event Time into Time Windows. (Figure: Input and Output shown against an event-time axis and a processing-time axis, each labeled 9:00 to 14:00.)
  • 30. Even more state to store on disks in streaming. Shuffle data elements ● Key ranges are assigned to workers ● Data elements of these keys are stored on Persistent Disks. Time window data ● Also assigned to workers ● When time windows close, data is processed on workers (Diagram: four user-code VMs, each with its own state storage holding one key range: key 0000 ... key 1234; key 1235 ... key ABC2; key ABC3 ... key DEF5; key DEF6 ... key GHI2.)
  • 31. Dataflow Streaming Engine Benefits ● Better supportability ● Fewer worker resources ● Smoother autoscaling (Diagram: four workers running user code, with window state storage and streaming shuffle held in the Streaming Engine.)
  • 32. Autoscaling: Even better with separate Compute and State Storage. (Diagram, Dataflow with Streaming Engine: workers running user code, with window state storage and streaming shuffle in the Streaming Engine. Diagram, Dataflow without Streaming Engine: VMs running user code, each with its own state storage holding a key range, such as key 0000 ... key 1234 and key 1235 ... key ABC2.)
  • 33. (Comparison: Dataflow with Streaming Engine vs. Dataflow without Streaming Engine.)
  • 34. Streaming can be hard, but does not have to be
  • 35. We’ve set out to make Streaming as accessible as Batch.
  • 36. Easy Stream Analytics in SQL (Diagram: Input1 and Input2 pass through Join and Group by to Output.) Query: SELECT input1.*, input2.* FROM input1 LEFT OUTER JOIN input2 ON input1.Id = input2.Id Use Dataflow SQL from BigQuery UI: ● Join Pub/Sub Streams with Files or Tables ● Write into BigQuery for dashboarding ● Store Pub/Sub schema in Data Catalog ● Use SQL skills for streaming data processing
  • 37. Demo
  • 38. Demo: Streaming Analytics with SQL (Pipeline: Transactions, PubSub topic, Dataflow streaming SQL pipeline, BigQuery tables.) Query: SELECT sr.sales_region, TUMBLE_START("INTERVAL 5 SECOND") AS period_start, SUM(tr.payload.amount) as amount FROM `pubsub.dataflow-sql.transactions` AS tr INNER JOIN `bigquery.dataflow-sql.opsdb.us_state_salesregions` AS sr ON tr.payload.state = sr.state_code GROUP BY sr.sales_region, TUMBLE(tr.event_timestamp, "INTERVAL 5 SECOND")
  • 39. Main takeaways: Google Cloud offers both infrastructure-as-a-service and fully managed services. Separating compute from state storage helps make stream and batch processing scalable. SQL brings the complexity of Streaming Processing way down.
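
The demo query on slide 38 joins each transaction to a sales region, assigns it to a 5-second tumbling window by event time, and sums amounts per region and window. The grouping logic can be sketched in plain Python as below. This is an illustration only, not Beam or Dataflow SQL code: the `STATE_TO_REGION` table, the record fields (`state`, `event_ts`, `amount`), and both function names are hypothetical, timestamps are assumed to be whole seconds, and watermarks, triggers, and late-data handling are left out.

```python
from collections import defaultdict

WINDOW_SECONDS = 5  # same width as the 5-second TUMBLE in the demo query

# Hypothetical lookup table standing in for the sales-region table in the demo.
STATE_TO_REGION = {"CA": "West", "WA": "West", "NY": "East"}


def window_start(event_ts, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains event_ts (seconds)."""
    return event_ts - (event_ts % size)


def sum_by_region_and_window(transactions):
    """Sum amounts per (sales region, window start), grouping by event time."""
    totals = defaultdict(float)
    for tx in transactions:
        region = STATE_TO_REGION.get(tx["state"])
        if region is None:
            continue  # inner join: transactions without a matching region are dropped
        totals[(region, window_start(tx["event_ts"]))] += tx["amount"]
    return dict(totals)


# Records are listed in arrival (processing) order; event_ts is the event time.
transactions = [
    {"state": "CA", "event_ts": 1, "amount": 10.0},
    {"state": "WA", "event_ts": 4, "amount": 5.0},
    {"state": "CA", "event_ts": 6, "amount": 2.0},
    {"state": "NY", "event_ts": 3, "amount": 7.0},  # arrives after a later event
    {"state": "TX", "event_ts": 2, "amount": 9.0},  # no region mapping
]

result = sum_by_region_and_window(transactions)
for (region, start), amount in sorted(result.items()):
    print(region, start, amount)
```

Because the window is derived from the event timestamp rather than from arrival order, the New York record with `event_ts` 3 still lands in the window starting at 0 even though it arrives after an event from the next window. A real streaming engine additionally has to decide how long to keep each window's state open for such late arrivals, which is the buffering cost slides 28 to 30 describe.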