Batch and Streaming Data Processing and Vizualize 300Tb in 5 Seconds meetup on April 18th, 2016 (http://www.meetup.com/Big-things-are-happening-here/events/229532500)
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Dataflow - A Unified Model for Batch and Streaming Data Processing
1. Section Slide Template Option 2
Put your subtitle here. Feel free to pick from the handful of pretty Google colors available to you.
Make the subtitle something clever. People will think it’s neat.
Dataflow
A Unified Model for Batch and Streaming
Data Processing
Vadim Solovey
Google Developer Expert & Trainer
vadim@doit-intl.com
Shahar Frank
Cloud Solutions Architect
srfrnk@doit-intl.com
2. Goals:
Write interesting
computations
Run in both batch &
streaming
Use custom timestamps
Handle late data
https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg
3. Data Shapes
Google’s Data Processing Story
The Dataflow Model
Agenda
Demo: Google Cloud Dataflow
1
2
3
4
18. FlumeJava: Easy and Efficient MapReduce Pipelines
● Higher-level API with simple data
processing abstractions.
○ Focus on what you want to do to
your data, not what the
underlying system supports.
● A graph of transformations is
automatically transformed into an
optimized series of MapReduces.
23. MillWheel: Streaming Computations
● Framework for building low-latency
data-processing applications
● User provides a DAG of
computations to be performed
● System manages state and
persistent flow of elements
31. What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
32. What are you computing?
• A Pipeline represents a graph
of data processing
transformations
• PCollections flow through the
pipeline
• Optimized and executed as a
unit for efficiency
33. What are you computing?
• A PCollection<T> is a collection
of data of type T
• Maybe be bounded or unbounded
in size
• Each element has an implicit
timestamp
• Initially created from backing data
stores
34. What are you computing?
PTransforms transform PCollections into other
PCollections.
What Where When How
Element-Wise Aggregating Composite
35. Example: Computing Integer Sums
// Collection of raw log lines
PCollection<String> raw = ...;
// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
raw.apply(ParDo.of(new ParseFn()))
// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> output = input
.apply(Sum.integersPerKey());
What Where When How
39. PCollection<KV<String, Integer>> output = input
.apply(Window.into(FixedWindows.of(Minutes(2))))
.apply(Sum.integersPerKey());
What Where When How
Example: Fixed 2-minute Windows
41. What Where When How
When in Processing Time?
• Triggers control
when results are
emitted.
• Triggers are often
relative to the
watermark.
ProcessingTime
Event Time
Watermark
42. PCollection<KV<String, Integer>> output = input
.apply(Window.into(FixedWindows.of(Minutes(2)))
.trigger(AtWatermark())
.apply(Sum.integersPerKey());
What Where When How
Example: Triggering at the Watermark
44. What Where When How
Example: Triggering for Speculative & Late Data
PCollection<KV<String, Integer>> output = input
.apply(Window.into(FixedWindows.of(Minutes(2)))
.trigger(AtWatermark()
.withEarlyFirings(AtPeriod(Minutes(1)))
.withLateFirings(AtCount(1))))
.apply(Sum.integersPerKey());
45. What Where When How
Example: Triggering for Speculative & Late Data
46. What Where When How
How do Refinements Relate?
• How should multiple outputs per window
accumulate?
• Appropriate choice depends on consumer.
Firing Elements
Speculative 3
Watermark 5, 1
Late 2
Total Observ 11
Discarding
3
6
2
11
Accumulating
3
9
11
23
Acc. & Retracting
3
9, -3
11, -9
11
47. PCollection<KV<String, Integer>> output = input
.apply(Window.into(Sessions.withGapDuration(Minutes(1)))
.trigger(AtWatermark()
.withEarlyFirings(AtPeriod(Minutes(1)))
.withLateFirings(AtCount(1)))
.accumulatingAndRetracting())
.apply(new Sum());
What Where When How
Example: Add Newest, Remove Previous
49. 1. Classic Batch 2. Batch with Fixed
Windows
3. Streaming 5. Streaming with
Retractions
4. Streaming with
Speculative + Late Data
Customizing What Where When How
What Where When How
51. Cloud DataflowA fully-managed cloud service and
programming model for batch and
streaming big data processing.
Google Cloud Dataflow
52. Open Source SDKs
● Used to construct a Dataflow pipeline.
● Java available now. Python in the open beta.
● Pipelines can run…
○ On your development machine
○ On the Dataflow Service on Google Cloud Platform
○ On third party environments like Spark or Flink.
53. Fully Managed Dataflow Service
Runs the pipeline on Google Cloud Platform. Includes:
● Graph optimization: Modular code, efficient execution
● Smart Workers: Lifecycle management, Autoscaling, and
Smart task rebalancing
● Easy Monitoring: Dataflow UI, Restful API and CLI,
Integration with Cloud Logging, etc.
56. Wrote Reusable
PTransforms
Ran the PTransform in
streaming mode
Used event time, not
processing time
Easily adjusted windowing
and late data settings
57. Learn More!
● The Dataflow Model @VLDB 2015
http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
● Dataflow SDK for Java
https://github.com/GoogleCloudPlatform/DataflowJavaSDK
● Google Cloud Dataflow on Google Cloud Platform
http://cloud.google.com/dataflow (Free Trial!)