Overview:
● Google Cloud Dataflow is a fully managed service for defining data processing pipelines that run batch or streaming computations.
● The Dataflow programming model defines pipelines as directed graphs of transformations on collections of data elements, which gives flexibility in how computations are expressed across batch and streaming workloads.
● The Dataflow service handles graph optimization, worker scaling, and job monitoring to execute user-defined pipelines efficiently on Google Cloud Platform.
William Vambenepe – Google Cloud Dataflow and Flink: Stream Processing by Default
1. Stream processing by default
Modern processing for Big Data, as offered
by Google Cloud Dataflow and Flink
William Vambenepe
Lead Product Manager for Data Processing
Google Cloud Platform
@vambenepe / vbp@google.com
19. FlumeJava: Easy and Efficient MapReduce Pipelines
● Higher-level API with simple data processing abstractions.
○ Focus on what you want to do to your data, not on what the underlying system supports.
● A graph of transformations is automatically optimized into an efficient series of MapReduces (sketched below).
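In code, a FlumeJava pipeline reads as ordinary Java over parallel collections. A minimal sketch in the style of the FlumeJava paper (operation names like parallelDo and count follow the paper; readTextFile and splitIntoWords are illustrative stand-ins, not real library calls):

    // Build a deferred graph of transformations; nothing runs yet.
    PCollection<String> lines = readTextFile("hamlet.txt");
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      public void process(String line, EmitFn<String> emitFn) {
        for (String word : splitIntoWords(line)) {
          emitFn.emit(word);
        }
      }
    }, collectionOf(strings()));
    PTable<String, Long> wordCounts = words.count();
    // Only now is the graph optimized and executed as a series of MapReduces.
    FlumeJava.run();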
24. MillWheel: Streaming Computations
● Framework for building low-latency data-processing applications.
● The user provides a DAG of computations to be performed.
● The system manages state and the persistent flow of elements (see the sketch below).
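The user-facing contract is small; here it is paraphrased from the MillWheel paper, rendered in Java for consistency with the rest of the deck (the original API is C++, and the signatures are simplified):

    // Each node in the user's DAG implements a computation like this.
    // The system persists state and timers and retries delivery so the
    // callbacks observe effectively-once processing.
    interface Computation {
      // Invoked once per incoming record.
      void processRecord(Record data);
      // Invoked when a timer fires, e.g. when the low watermark
      // passes a chosen point in event time.
      void processTimer(Timer timer);
    }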
31. What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
32. What are you computing?
• A Pipeline represents a graph of data processing transformations.
• PCollections flow through the pipeline.
• Optimized and executed as a unit for efficiency.
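As a concrete example, a word-count pipeline in the Dataflow Java SDK of that era; this is a hedged sketch (class names as I recall the 1.x SDK, gs:// paths are placeholders):

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.Count;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.values.KV;

    public class WordCount {
      public static void main(String[] args) {
        // The Pipeline is the graph; nothing executes until run(), which
        // lets the service optimize and run the whole graph as a unit.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(TextIO.Read.from("gs://my-bucket/input-*.txt"))   // placeholder path
         .apply(Count.<String>perElement())
         .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
           @Override
           public void processElement(ProcessContext c) {
             c.output(c.element().getKey() + ": " + c.element().getValue());
           }
         }))
         .apply(TextIO.Write.to("gs://my-bucket/counts"));        // placeholder path
        p.run();
      }
    }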
33. What are you computing?
• A PCollection<T> is a collection of data of type T.
• May be bounded or unbounded in size.
• Each element has an implicit timestamp.
• Initially created from backing data stores.
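For instance, bounded and unbounded PCollections are created from different backing stores (a hedged sketch; the bucket and topic names are placeholders, API names from the 1.x Java SDK):

    // Bounded: a fixed-size PCollection read from files in Cloud Storage.
    PCollection<String> batch =
        p.apply(TextIO.Read.from("gs://my-bucket/logs-*.txt"));

    // Unbounded: grows without end as messages arrive on a Pub/Sub topic.
    // Each element's implicit timestamp is visible as c.timestamp() in a DoFn.
    PCollection<String> stream =
        p.apply(PubsubIO.Read.topic("projects/my-project/topics/events"));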
34. What are you computing?
PTransforms transform PCollections into other PCollections.
What Where When How
Element-Wise Aggregating Composite
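One hedged example of each kind, in the Java SDK's terms (the input collections words and keyedInts are assumed):

    // Element-wise: a ParDo transforms each element independently.
    PCollection<Integer> lengths = words.apply(ParDo.of(
        new DoFn<String, Integer>() {
          @Override
          public void processElement(ProcessContext c) {
            c.output(c.element().length());
          }
        }));

    // Aggregating: combines all values for a key into a single result.
    PCollection<KV<String, Integer>> totals =
        keyedInts.apply(Sum.<String>integersPerKey());

    // Composite: Count.perElement() is itself composed of element-wise
    // and aggregating pieces, but applies as a single transform.
    PCollection<KV<String, Long>> counts =
        words.apply(Count.<String>perElement());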
38. PCollection<KV<String, Integer>> output = input
  .apply(Window.into(FixedWindows.of(Minutes(2))))
  .apply(Sum.integersPerKey());
What Where When How
Example: Fixed 2-minute Windows
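The deck's snippets use the talk's shorthand (e.g. Minutes(2)). In the actual Java SDK the same pipeline reads roughly as follows, using Joda-Time Durations; this mapping is my best effort, not from the deck:

    PCollection<KV<String, Integer>> output = input
        .apply(Window.<KV<String, Integer>>into(
            FixedWindows.of(Duration.standardMinutes(2))))
        .apply(Sum.<String>integersPerKey());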
40. What Where When How
When in Processing Time?
• Triggers control when results are emitted.
• Triggers are often relative to the watermark.
[Figure: results plotted over processing time vs. event time, with the watermark marking how far event time has progressed]
41. PCollection<KV<String, Integer>> output = input
  .apply(Window.into(FixedWindows.of(Minutes(2)))
             .trigger(AtWatermark()))
  .apply(Sum.integersPerKey());
What Where When How
Example: Triggering at the Watermark
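In the real SDK, AtWatermark() corresponds to AfterWatermark.pastEndOfWindow(), and specifying a trigger also obliges you to pick an allowed-lateness bound and an accumulation mode; a hedged rendering:

    PCollection<KV<String, Integer>> output = input
        .apply(Window.<KV<String, Integer>>into(
                FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)   // assumption: late data is dropped
            .discardingFiredPanes())
        .apply(Sum.<String>integersPerKey());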
43. What Where When How
Example: Triggering for Speculative & Late Data
PCollection<KV<String, Integer>> output = input
  .apply(Window.into(FixedWindows.of(Minutes(2)))
             .trigger(AtWatermark()
                 .withEarlyFirings(AtPeriod(Minutes(1)))
                 .withLateFirings(AtCount(1))))
  .apply(Sum.integersPerKey());
44. What Where When How
Example: Triggering for Speculative & Late Data
45. What Where When How
How do Refinements Relate?
• How should multiple outputs per window accumulate?
• The appropriate choice depends on the consumer.

Firing           Elements   Discarding   Accumulating   Acc. & Retracting
Speculative      3          3            3              3
Watermark        5, 1       6            9              9, -3
Late             2          2            11             11, -9
Total Observed   11         11           23             11
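The first two columns correspond to the SDK's accumulation modes (retractions were part of the model but, to my knowledge, not exposed in the public SDK at the time). A hedged sketch of the speculative-and-late pipeline with the mode made explicit; the lateness bound is an assumption:

    PCollection<KV<String, Integer>> output = input
        .apply(Window.<KV<String, Integer>>into(
                FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(10))  // assumed bound
            // Discarding emits the deltas 3, 6, 2 from the table; swapping in
            // accumulatingFiredPanes() would emit 3, 9, 11 instead.
            .discardingFiredPanes())
        .apply(Sum.<String>integersPerKey());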
46. PCollection<KV<String, Integer>> output = input
  .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
             .trigger(AtWatermark()
                 .withEarlyFirings(AtPeriod(Minutes(1)))
                 .withLateFirings(AtCount(1)))
             .accumulatingAndRetracting())
  .apply(new Sum());
What Where When How
Example: Add Newest, Remove Previous
48. Customizing What Where When How
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
What Where When How
49. Dataflow improvements over the Lambda architecture
● Low-latency, approximate results
● Complete, correct results as soon as possible
● One system: less to manage, fewer resources, one set of bugs
● Tools for explicit reasoning about time
= Power + Flexibility + Clarity
Never re-architect a working pipeline for operational reasons.
50. Open Source SDKs
● Used to construct a Dataflow pipeline.
● Java available now; Python in the works.
● Pipelines can run… (runner selection sketched below)
○ On your development machine
○ On the Dataflow service on Google Cloud Platform
○ On third-party environments like Spark (batch only) or Flink (streaming coming soon)
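Runner selection in the 1.x Java SDK looked roughly like this (class names as I recall them; the third-party Spark and Flink runners plugged in the same way via their own runner classes):

    // Run locally while developing:
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DirectPipelineRunner.class);

    // Or hand the identical pipeline to the managed service:
    // options.setRunner(DataflowPipelineRunner.class);

    Pipeline p = Pipeline.create(options);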
52. Fully Managed Dataflow Service
Runs the pipeline on Google Cloud Platform. Includes:
● Graph optimization: modular code, efficient execution
● Smart workers: lifecycle management, autoscaling, and smart task rebalancing
● Easy monitoring: Dataflow UI, RESTful API and CLI, integration with Cloud Logging, etc.
53. Cloud Dataflow as a No-op Cloud service
[Diagram: user code built with the SDK is handed to the managed service on Google Cloud Platform; a job manager performs graph optimization, a work manager handles deploy & schedule, and progress & logs feed the monitoring UI]
55. Cloud Dataflow is part of a broader data platform
[Diagram: the Google Cloud Platform data pipeline, from Capture to Store to Process (batch and stream) to Analyze]
● Capture: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub
● Store: Cloud Storage (files), Cloud Datastore, Cloud Bigtable (NoSQL), BigQuery Storage (tables)
● Process: Cloud Dataflow, Cloud Dataproc, Flink via bdutil
● Analyze: BigQuery Analytics (SQL), Cloud Datalab; real-time analytics and alerts via Cloud Bigtable and Cloud Monitoring
57. Google Cloud Datalab (fresh off the press!)
● Jupyter notebooks created in one click.
● Direct BigQuery integration.
● Automatically stored in a git repo on GCP.
58. Learn More
● The Dataflow Model @VLDB 2015
http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
● Dataflow SDK for Java
https://github.com/GoogleCloudPlatform/DataflowJavaSDK
● Google Cloud Dataflow on Google Cloud Platform
http://cloud.google.com/dataflow (Free Trial!)
● Contact me: vbp@google.com or on Twitter @vambenepe