In this session, Sergei Sokolenko, the Google product manager for Cloud Dataflow, will share the implementation details of many of the unique features of Apache Beam and Cloud Dataflow, including:
- autoscaling of resources based on input data;
- separating compute from state storage for better scaling of resources;
- simultaneous grouping and joining of hundreds of terabytes in a hybrid in-memory/on-disk file system;
- dynamic rebalancing of work items away from overutilized worker nodes, and many others.
Customers benefit from these advances through faster job execution, resource savings, and a fully managed data processing environment that runs in the cloud and removes the need to manage infrastructure.
Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Google Cloud Dataflow deep-dive"
1. Advances in Stream Analytics:
Google Cloud Dataflow and Apache Beam
Kyiv, October 5th, 2019
Sergei Sokolenko
Google
2. Session overview
● Your choices for doing streaming processing in Google Cloud
● Separating State Storage from Compute
● Autoscaling
● Making Streaming Easy
3. Google Cloud Platform: Our global infrastructure
[Map slide: current and future regions with their zone counts, network edge points of presence, CDN nodes, Dedicated Interconnect locations, and submarine cables such as Curie (CL, US, 2019), Dunant (US, FR, 2020), and Havfrue (US, IE, DK, 2019).]
4. A comprehensive Big Data platform, not just infrastructure
● Data ingestion at any scale: Cloud Pub/Sub, Cloud IoT Core, Data Transfer Service, Storage Transfer Service
● Reliable streaming data pipeline: Apache Beam, Cloud Dataflow, Cloud Dataproc, Cloud Composer, Cloud Dataprep, Data Fusion
● Data warehousing and data lake: BigQuery, Cloud Storage, Data Catalog
● Advanced analytics: Cloud AI Services, TensorFlow, Google Data Studio, Sheets
13. Common steps in Stream Analytics
Reference architecture of streaming processing in GCP:
● Ingest & distribute: events from IoT devices, end-user apps, and DBs flow into Cloud Pub/Sub
● Aggregate, enrich, detect: Dataflow Streaming
● Backfill, reprocess: Dataflow Batch
● Action: BigQuery via the BigQuery Streaming API (data warehousing), Cloud AI Platform (machine learning), Bigtable
● Orchestrate: Cloud Composer
14. What is Beam and Dataflow?
Apache Beam (SDK):
● Open source programming model
● Unified batch and streaming
● Top Apache project by dev@ activity
● Runner and language portability
Cloud Dataflow:
● Automatic optimizations scale to millions of QPS
● Serverless, fully managed data processing
● State storage in Shuffle and Streaming Engine
● Exactly-once streaming semantics
15. The Beam Vision
The same "Sum per key" transform, expressed in each SDK:
● Java: input.apply(Sum.integersPerKey())
● Python: input | Sum.PerKey()
● Go: stats.Sum(s, input)
● SQL: SELECT key, SUM(value) FROM input GROUP BY key
Runners: Cloud Dataflow, Apache Spark, Apache Flink, Apache Apex, Gearpump, Apache Samza, Apache Nemo (incubating), IBM Streams
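All four snippets above compute the same thing. As a minimal sketch of those shared semantics in plain Python (no Beam dependency; `sum_per_key` is an illustrative name, not a Beam API):

```python
from collections import defaultdict

def sum_per_key(pairs):
    """Group (key, value) pairs and sum the values per key,
    mirroring Sum.integersPerKey() / SUM(value) GROUP BY key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# e.g. sum_per_key([("a", 1), ("b", 2), ("a", 3)]) -> {"a": 4, "b": 2}
```

In Beam, each runner is free to distribute and optimize this grouping; the model only fixes the semantics.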
16. Lessons Learned While Building Cloud Dataflow
● Separating compute from state storage
● Automatic scaling
● Building streaming systems can be hard, but it does not have to be
18. Traditional Distributed Data Processing Architecture
● Jobs are executed on clusters of VMs running user code
● Job state is stored on network-attached volumes (one state storage volume per VM)
● A control plane orchestrates the data plane over the network
19. Traditional Architecture works well ...
[Pipeline diagram: sources (fs://, database) → Filter → Join / Group → Filter → sinks (fs://, database)]
… except for Joins and Group By's.
22. Shuffling key-value pairs
● Data elements start out unsorted across nodes (e.g. <key1, record>, <key5, record>, <key3, record>, ...)
● Goal: sort data elements by key
● KV pairs need to be exchanged between nodes
23. Shuffling key-value pairs (continued)
● KV pairs are exchanged between nodes until everything is sorted
● Each node ends up owning a contiguous key range (key1–key2, key3–key4, key5–key6, key7–key8)
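The exchange step above can be sketched in plain Python. This is an illustrative hash-partitioned shuffle (the slides show the final layout as contiguous key ranges; real shuffle implementations also differ in how they spill and merge):

```python
def shuffle_by_key(records, num_nodes):
    """Exchange (key, value) pairs between nodes so that all records
    with the same key land on the same node, then sort each node's
    partition by key."""
    partitions = [[] for _ in range(num_nodes)]
    for key, value in records:
        # Route each record to a node based on its key.
        partitions[hash(key) % num_nodes].append((key, value))
    # Sort within each node's partition so equal keys are adjacent.
    return [sorted(p) for p in partitions]
```

After this step, grouping and joining become local operations on each node, which is why shuffle sits at the heart of Group By and Join.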
24. Traditional Architecture Requires Manual Tuning
● When data volumes exceed dozens of TBs, worker-attached state storage requires manual tuning of the cluster
25. Distributed in-memory Shuffle in batch Cloud Dataflow
● Compute workers connect over a petabit network to the Dataflow Shuffle service
● Shuffle proxies front a distributed in-memory file system, backed by a distributed on-disk file system
● Automatic zone placement across zones 'a', 'b', and 'c' within a region
26. Faster Processing: no tuning required
Dataflow Shuffle is usually faster than worker-based shuffle, including shuffles that use SSD persistent disks.
[Chart: shuffle runtime in minutes]
27. Supporting larger datasets: Shuffle 200TB+
Dataflow Shuffle has been used to shuffle 200TB+ datasets.
[Chart: shuffle dataset size in TB]
28. Storing state: what about streaming pipelines?
● Streaming shuffle: just like in batch, streams need to be grouped and joined, which requires a distributed streaming shuffle
● Windowing: late-arriving data requires buffering time-window data; elements accumulate until triggering conditions occur
29. Goal: Grouping by Event Time into Time Windows
[Diagram: input elements plotted by processing time (9:00–14:00) versus event time (9:00–14:00); output grouped into event-time windows.]
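Assigning an element to its event-time window is a pure function of its timestamp. A minimal sketch for fixed (tumbling) windows, using epoch arithmetic (illustrative only; Beam's windowing also handles late data, triggers, and other window shapes):

```python
from datetime import datetime, timezone

def window_start(event_time, window_secs=3600):
    """Return the start of the fixed (tumbling) event-time window
    that `event_time` falls into, e.g. 10:42 -> the 10:00 window."""
    ts = int(event_time.timestamp())
    # Round the timestamp down to the nearest window boundary.
    return datetime.fromtimestamp(ts - ts % window_secs, tz=timezone.utc)
```

Because the window depends only on the event timestamp, elements can arrive in any processing-time order and still be grouped correctly, once buffering and triggering are taken care of.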
30. Even more state to store on disks in streaming
Shuffle data elements:
● Key ranges are assigned to workers (e.g. key 0000 … key 1234, key 1235 … key ABC2, key ABC3 … key DEF5, key DEF6 … key GHI2)
● Data elements for these keys are stored on Persistent Disks attached to the workers
Time window data:
● Also assigned to workers
● When time windows close, the data is processed on the workers
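The key-range assignment above can be sketched as a sorted-bounds lookup. This is an illustrative model (bounds and inclusivity are assumptions, not Dataflow's actual routing code):

```python
import bisect

def worker_for_key(key, range_bounds):
    """Given the sorted inclusive upper bound of each worker's key
    range (one entry per worker), return the index of the worker
    that owns `key`."""
    return bisect.bisect_left(range_bounds, key)

# Upper bounds matching the slide's four ranges.
BOUNDS = ["1234", "ABC2", "DEF5", "GHI2"]
```

With state pinned to workers like this, scaling the worker pool means physically moving key ranges and their Persistent Disk state, which is exactly what the Streaming Engine on the next slide avoids.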
31. Dataflow Streaming Engine
Streaming shuffle and window state storage move out of the workers into the Streaming Engine service; workers run only user code.
Benefits:
● Better supportability
● Fewer worker resources
● Smoother autoscaling
32. Autoscaling: Even better with separate Compute and State Storage
● Dataflow without Streaming Engine: each VM runs user code and holds a slice of the key space (key 0000 … key 1234, key 1235 … key ABC2) in local state storage, so scaling requires moving state between VMs
● Dataflow with Streaming Engine: workers run only user code, while window state storage and streaming shuffle live in the Streaming Engine, so workers scale independently of state
35. We’ve set out to make Streaming as accessible as Batch.
36. Easy Stream Analytics in SQL
A join of two inputs (followed by a group-by), expressed in SQL:

SELECT input1.*, input2.*
FROM input1 LEFT OUTER JOIN input2
ON input1.Id = input2.Id

Use Dataflow SQL from the BigQuery UI:
● Join Pub/Sub streams with files or tables
● Write into BigQuery for dashboarding
● Store Pub/Sub schemas in Data Catalog
● Use SQL skills for streaming data processing
38. Demo: Streaming Analytics with SQL
A streaming SQL pipeline in Dataflow reads transactions from a Pub/Sub topic, joins them with a BigQuery table, and writes 5-second aggregates back to a BigQuery table:

SELECT
  sr.sales_region,
  TUMBLE_START("INTERVAL 5 SECOND") AS period_start,
  SUM(tr.payload.amount) AS amount
FROM `pubsub.dataflow-sql.transactions` AS tr
INNER JOIN
  `bigquery.dataflow-sql.opsdb.us_state_salesregions` AS sr
ON tr.payload.state = sr.state_code
GROUP BY
  sr.sales_region,
  TUMBLE(tr.event_timestamp, "INTERVAL 5 SECOND")
39. Main takeaways
● Google Cloud offers both infrastructure-as-a-service and fully managed services
● Separating compute from state storage helps make stream and batch processing scalable
● SQL dramatically lowers the complexity of streaming processing