SlideShare ist ein Scribd-Unternehmen logo
1 von 57
Downloaden Sie, um offline zu lesen
Section Slide Template Option 2
Put your subtitle here. Feel free to pick from the handful of pretty Google colors available to you.
Make the subtitle something clever. People will think it’s neat.
Dataflow
A Unified Model for Batch and Streaming
Data Processing
Vadim Solovey
Google Developer Expert & Trainer
vadim@doit-intl.com
Shahar Frank
Cloud Solutions Architect
srfrnk@doit-intl.com
Goals:
Write interesting
computations
Run in both batch &
streaming
Use custom timestamps
Handle late data
https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg
Data Shapes
Google’s Data Processing Story
The Dataflow Model
Agenda
Demo: Google Cloud Dataflow
1
2
3
4
Data Shapes1
Data...
...can be big...
...really, really big...
Tuesday
Wednesday
Thursday
...maybe even infinitely big...
9:008:00 14:0013:0012:0011:0010:002:001:00 7:006:005:004:003:00
… with unknown delays.
9:008:00 14:0013:0012:0011:0010:00
8:00
8:008:00
8:00
1 + 1 = 2
Completeness Latency Cost
$$$
Data Processing Tradeoffs
Requirements: Billing Pipeline
Completeness Low Latency Low Cost
Important
Not Important
Requirements: Live Cost Estimate Pipeline
Completeness Low Latency Low Cost
Important
Not Important
Requirements: Abuse Detection Pipeline
Completeness Low Latency Low Cost
Important
Not Important
Requirements: Abuse Detection Backfill Pipeline
Completeness Low Latency Low Cost
Important
Not Important
Google’s Data Processing Story2
20122002 2004 2006 2008 2010
MapReduce
GFS Big Table
Dremel
Pregel
FlumeJava
Colossus
Spanner
2014
MillWheel
Dataflow
2016
Google’s Data-Related Papers
(Produce)
MapReduce: Batch Processing
(Prepare)
(Shuffle)
Map
Reduce
FlumeJava: Easy and Efficient MapReduce Pipelines
● Higher-level API with simple data
processing abstractions.
○ Focus on what you want to do to
your data, not what the
underlying system supports.
● A graph of transformations is
automatically transformed into an
optimized series of MapReduces.
MapReduce
Batch Patterns: Creating Structured Data
MapReduce
Batch Patterns: Repetitive Runs
Tuesday
Wednesday
Thursday
MapReduce
Tuesday [11:00 - 12:00)
[12:00 - 13:00)
[13:00 - 14:00)
[14:00 - 15:00)
[15:00 - 16:00)
[16:00 - 17:00)
[18:00 - 19:00)
[19:00 - 20:00)
[21:00 - 22:00)
[22:00 - 23:00)
[23:00 - 0:00)
Batch Patterns: Time Based Windows
MapReduce
TuesdayWednesday
Batch Patterns: Sessions
Jose
Lisa
Ingo
Asha
Cheryl
Ari
WednesdayTuesday
MillWheel: Streaming Computations
● Framework for building low-latency
data-processing applications
● User provides a DAG of
computations to be performed
● System manages state and
persistent flow of elements
Streaming Patterns: Element-wise transformations
13:00 14:008:00 9:00 10:00 11:00 12:00
Processing
Time
Streaming Patterns: Aggregating Time Based Windows
13:00 14:008:00 9:00 10:00 11:00 12:00
Processing
Time
11:0010:00 15:0014:0013:0012:00Event Time
11:0010:00 15:0014:0013:0012:00
Processing
Time
Input
Output
Streaming Patterns: Event Time Based Windows
Streaming Patterns: Session Windows
Event Time
Processing
Time
11:0010:00 15:0014:0013:0012:00
11:0010:00 15:0014:0013:0012:00
Input
Output
ProcessingTime
Event Time
Skew
Event-Time Skew
Watermark
Watermarks describe event time
progress.
"No timestamp earlier than the
watermark will be seen"
Often heuristic-based.
Too Slow? Results are delayed.
Too Fast? Some data is late.
Streaming or Batch?
1 + 1 = 2 $$$
Completeness Latency Cost
Why not both?
The Dataflow Model3
What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
What are you computing?
• A Pipeline represents a graph
of data processing
transformations
• PCollections flow through the
pipeline
• Optimized and executed as a
unit for efficiency
What are you computing?
• A PCollection<T> is a collection
of data of type T
• Maybe be bounded or unbounded
in size
• Each element has an implicit
timestamp
• Initially created from backing data
stores
What are you computing?
PTransforms transform PCollections into other
PCollections.
What Where When How
Element-Wise Aggregating Composite
Example: Computing Integer Sums
// Collection of raw log lines
PCollection<String> raw = ...;
// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
raw.apply(ParDo.of(new ParseFn()))
// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> output = input
.apply(Sum.integersPerKey());
What Where When How
Example: Computing Integer Sums
What Where When How
What Where When How
Example: Computing Integer Sums
Key
2
Key
1
Key
3
1
Fixed
2
3
4
Key
2
Key
1
Key
3
Sliding
1
2
3
5
4
Key
2
Key
1
Key
3
Sessions
2
4
3
1
Where in Event Time?
• Windowing divides
data into event-
time-based finite
chunks.
• Required when
doing aggregations
over unbounded
data.
What Where When How
PCollection<KV<String, Integer>> output = input
.apply(Window.into(FixedWindows.of(Minutes(2))))
.apply(Sum.integersPerKey());
What Where When How
Example: Fixed 2-minute Windows
What Where When How
Example: Fixed 2-minute Windows
What Where When How
When in Processing Time?
• Triggers control
when results are
emitted.
• Triggers are often
relative to the
watermark.
ProcessingTime
Event Time
Watermark
PCollection<KV<String, Integer>> output = input
.apply(Window.into(FixedWindows.of(Minutes(2)))
.trigger(AtWatermark())
.apply(Sum.integersPerKey());
What Where When How
Example: Triggering at the Watermark
What Where When How
Example: Triggering at the Watermark
What Where When How
Example: Triggering for Speculative & Late Data
PCollection<KV<String, Integer>> output = input
.apply(Window.into(FixedWindows.of(Minutes(2)))
.trigger(AtWatermark()
.withEarlyFirings(AtPeriod(Minutes(1)))
.withLateFirings(AtCount(1))))
.apply(Sum.integersPerKey());
What Where When How
Example: Triggering for Speculative & Late Data
What Where When How
How do Refinements Relate?
• How should multiple outputs per window
accumulate?
• Appropriate choice depends on consumer.
Firing Elements
Speculative 3
Watermark 5, 1
Late 2
Total Observ 11
Discarding
3
6
2
11
Accumulating
3
9
11
23
Acc. & Retracting
3
9, -3
11, -9
11
PCollection<KV<String, Integer>> output = input
.apply(Window.into(Sessions.withGapDuration(Minutes(1)))
.trigger(AtWatermark()
.withEarlyFirings(AtPeriod(Minutes(1)))
.withLateFirings(AtCount(1)))
.accumulatingAndRetracting())
.apply(new Sum());
What Where When How
Example: Add Newest, Remove Previous
What Where When How
Example: Add Newest, Remove Previous
1. Classic Batch 2. Batch with Fixed
Windows
3. Streaming 5. Streaming with
Retractions
4. Streaming with
Speculative + Late Data
Customizing What Where When How
What Where When How
Demo: Google Cloud Dataflow4
Cloud DataflowA fully-managed cloud service and
programming model for batch and
streaming big data processing.
Google Cloud Dataflow
Open Source SDKs
● Used to construct a Dataflow pipeline.
● Java available now. Python in the open beta.
● Pipelines can run…
○ On your development machine
○ On the Dataflow Service on Google Cloud Platform
○ On third party environments like Spark or Flink.
Fully Managed Dataflow Service
Runs the pipeline on Google Cloud Platform. Includes:
● Graph optimization: Modular code, efficient execution
● Smart Workers: Lifecycle management, Autoscaling, and
Smart task rebalancing
● Easy Monitoring: Dataflow UI, Restful API and CLI,
Integration with Cloud Logging, etc.
Demo Time!
user11_BisqueEmu,BisqueEmu,11,1460980305000,2016-04-18 04:59:27.936
user0_BisqueQuokka,BisqueQuokka,15,1460980768000,2016-04-18 04:59:28.473
user11_ArmyGreenNumbat,ArmyGreenNumbat,5,1460980768000,2016-04-18 04:59:28.473
user4_AquaEmu,AquaEmu,19,1460980768000,2016-04-18 04:59:28.473
user5_BeigeDingo,BeigeDingo,16,1460980768000,2016-04-18 04:59:28.473
user1_ArmyGreenKookaburra,ArmyGreenKookaburra,9,1460980768000,2016-04-18 04:59:28.473
user7_AquaDingo,AquaDingo,18,1460980768000,2016-04-18 04:59:28.473
user7_AmberEmu,AmberEmu,11,1460980768000,2016-04-18 04:59:28.473
user1_BeigeDingo,BeigeDingo,6,1460980768000,2016-04-18 04:59:28.473
user2_ArmyGreenKookaburra,ArmyGreenKookaburra,8,1460980768000,2016-04-18 04:59:28.473
user10_AzureBandicoot,AzureBandicoot,12,1460980768000,2016-04-18 04:59:28.473
user1_AzureBandicoot,AzureBandicoot,12,1460980768000,2016-04-18 04:59:28.473
Robot-12,BeigeQuokka,12,1460980768000,2016-04-18 04:59:28.473
user3_MagentaAntechinus,MagentaAntechinus,5,1460980768000,2016-04-18 04:59:28.473
user6_AmberEmu,AmberEmu,15,1460980768000,2016-04-18 04:59:28.473
Data
Wrote Reusable
PTransforms
Ran the PTransform in
streaming mode
Used event time, not
processing time
Easily adjusted windowing
and late data settings
Learn More!
● The Dataflow Model @VLDB 2015
http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
● Dataflow SDK for Java
https://github.com/GoogleCloudPlatform/DataflowJavaSDK
● Google Cloud Dataflow on Google Cloud Platform
http://cloud.google.com/dataflow (Free Trial!)

Weitere ähnliche Inhalte

Was ist angesagt?

Sourabh Bajaj - Big data processing with Apache Beam
Sourabh Bajaj - Big data processing with Apache BeamSourabh Bajaj - Big data processing with Apache Beam
Sourabh Bajaj - Big data processing with Apache Beam
PyData
 

Was ist angesagt? (20)

Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better One
 
Google cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache FlinkGoogle cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache Flink
 
Kenneth Knowles - Apache Beam - A Unified Model for Batch and Streaming Data...
Kenneth Knowles -  Apache Beam - A Unified Model for Batch and Streaming Data...Kenneth Knowles -  Apache Beam - A Unified Model for Batch and Streaming Data...
Kenneth Knowles - Apache Beam - A Unified Model for Batch and Streaming Data...
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
DataFlow & Beam
DataFlow & BeamDataFlow & Beam
DataFlow & Beam
 
Sourabh Bajaj - Big data processing with Apache Beam
Sourabh Bajaj - Big data processing with Apache BeamSourabh Bajaj - Big data processing with Apache Beam
Sourabh Bajaj - Big data processing with Apache Beam
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
 
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud DataflowHow to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Databricks clusters in autopilot mode
Databricks clusters in autopilot modeDatabricks clusters in autopilot mode
Databricks clusters in autopilot mode
 
Macy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-FlightMacy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-Flight
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
 
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...
 
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
Riddles of Streaming - Code Puzzlers for Fun & Profit (Nick Dearden, Confluen...
Riddles of Streaming - Code Puzzlers for Fun & Profit (Nick Dearden, Confluen...Riddles of Streaming - Code Puzzlers for Fun & Profit (Nick Dearden, Confluen...
Riddles of Streaming - Code Puzzlers for Fun & Profit (Nick Dearden, Confluen...
 
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache FlinkGelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
 

Andere mochten auch

Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
DataWorks Summit
 

Andere mochten auch (11)

DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming AnalyticsDEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
 
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsWill it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing Applications
 
Kubernetes @ Nanit by Chen Fisher
Kubernetes @ Nanit by Chen FisherKubernetes @ Nanit by Chen Fisher
Kubernetes @ Nanit by Chen Fisher
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
Latency-aware Elastic Scaling for Distributed Data Stream Processing Systems
Latency-aware Elastic Scaling for Distributed Data Stream Processing SystemsLatency-aware Elastic Scaling for Distributed Data Stream Processing Systems
Latency-aware Elastic Scaling for Distributed Data Stream Processing Systems
 
Auto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream ProcessingAuto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream Processing
 
Adaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream ProcessingAdaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream Processing
 
Google Cloud Dataflow
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud Dataflow
 
AWS Cyber Security Best Practices
AWS Cyber Security Best PracticesAWS Cyber Security Best Practices
AWS Cyber Security Best Practices
 

Ähnlich wie Dataflow - A Unified Model for Batch and Streaming Data Processing

William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
Flink Forward
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Flink Forward
 

Ähnlich wie Dataflow - A Unified Model for Batch and Streaming Data Processing (20)

William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
 
Kafka streams windowing behind the curtain
Kafka streams windowing behind the curtain Kafka streams windowing behind the curtain
Kafka streams windowing behind the curtain
 
Kafka Streams Windows: Behind the Curtain
Kafka Streams Windows: Behind the CurtainKafka Streams Windows: Behind the Curtain
Kafka Streams Windows: Behind the Curtain
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
Flink Forward San Francisco 2019: Real-time Processing with Flink for Machine...
Flink Forward San Francisco 2019: Real-time Processing with Flink for Machine...Flink Forward San Francisco 2019: Real-time Processing with Flink for Machine...
Flink Forward San Francisco 2019: Real-time Processing with Flink for Machine...
 
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
 
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014OrientDB - Time Series and Event Sequences - Codemotion Milan 2014
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 
Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019 Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019
 
Big Data Warsaw
Big Data WarsawBig Data Warsaw
Big Data Warsaw
 
Stream processing with Apache Flink - Maximilian Michels Data Artisans
Stream processing with Apache Flink - Maximilian Michels Data ArtisansStream processing with Apache Flink - Maximilian Michels Data Artisans
Stream processing with Apache Flink - Maximilian Michels Data Artisans
 
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...
 
Data Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksData Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and Frameworks
 
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
 
About time
About timeAbout time
About time
 
Have your cake and eat it too, further dispelling the myths of the lambda arc...
Have your cake and eat it too, further dispelling the myths of the lambda arc...Have your cake and eat it too, further dispelling the myths of the lambda arc...
Have your cake and eat it too, further dispelling the myths of the lambda arc...
 
Solving some of the scalability problems at booking.com
Solving some of the scalability problems at booking.comSolving some of the scalability problems at booking.com
Solving some of the scalability problems at booking.com
 
SAIMA Session - Salesforce performance considerations
SAIMA Session - Salesforce performance considerationsSAIMA Session - Salesforce performance considerations
SAIMA Session - Salesforce performance considerations
 
Streaming SQL Foundations: Why I ❤ Streams+Tables
Streaming SQL Foundations: Why I ❤ Streams+TablesStreaming SQL Foundations: Why I ❤ Streams+Tables
Streaming SQL Foundations: Why I ❤ Streams+Tables
 

Mehr von DoiT International

Mehr von DoiT International (16)

Terraform Modules Restructured
Terraform Modules RestructuredTerraform Modules Restructured
Terraform Modules Restructured
 
GAN training with Tensorflow and Tensor Cores
GAN training with Tensorflow and Tensor CoresGAN training with Tensorflow and Tensor Cores
GAN training with Tensorflow and Tensor Cores
 
Orchestrating Redis & K8s Operators
Orchestrating Redis & K8s OperatorsOrchestrating Redis & K8s Operators
Orchestrating Redis & K8s Operators
 
K8s best practices from the field!
K8s best practices from the field!K8s best practices from the field!
K8s best practices from the field!
 
An Open-Source Platform to Connect, Manage, and Secure Microservices
An Open-Source Platform to Connect, Manage, and Secure MicroservicesAn Open-Source Platform to Connect, Manage, and Secure Microservices
An Open-Source Platform to Connect, Manage, and Secure Microservices
 
Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?
 
Applying ML for Log Analysis
Applying ML for Log AnalysisApplying ML for Log Analysis
Applying ML for Log Analysis
 
GCP for AWS Professionals
GCP for AWS ProfessionalsGCP for AWS Professionals
GCP for AWS Professionals
 
Google Cloud Spanner Preview
Google Cloud Spanner PreviewGoogle Cloud Spanner Preview
Google Cloud Spanner Preview
 
Amazon Athena Hands-On Workshop
Amazon Athena Hands-On WorkshopAmazon Athena Hands-On Workshop
Amazon Athena Hands-On Workshop
 
AWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL QueriesAWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL Queries
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s New
 
Running Production-Grade Kubernetes on AWS
Running Production-Grade Kubernetes on AWSRunning Production-Grade Kubernetes on AWS
Running Production-Grade Kubernetes on AWS
 
Scaling Jenkins with Kubernetes by Ami Mahloof
Scaling Jenkins with Kubernetes by Ami MahloofScaling Jenkins with Kubernetes by Ami Mahloof
Scaling Jenkins with Kubernetes by Ami Mahloof
 
CI Implementation with Kubernetes at LivePerson by Saar Demri
CI Implementation with Kubernetes at LivePerson by Saar DemriCI Implementation with Kubernetes at LivePerson by Saar Demri
CI Implementation with Kubernetes at LivePerson by Saar Demri
 
Kubernetes - State of the Union (Q1-2016)
Kubernetes - State of the Union (Q1-2016)Kubernetes - State of the Union (Q1-2016)
Kubernetes - State of the Union (Q1-2016)
 

Kürzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Kürzlich hochgeladen (20)

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 

Dataflow - A Unified Model for Batch and Streaming Data Processing

  • 1. Section Slide Template Option 2 Put your subtitle here. Feel free to pick from the handful of pretty Google colors available to you. Make the subtitle something clever. People will think it’s neat. Dataflow A Unified Model for Batch and Streaming Data Processing Vadim Solovey Google Developer Expert & Trainer vadim@doit-intl.com Shahar Frank Cloud Solutions Architect srfrnk@doit-intl.com
  • 2. Goals: Write interesting computations Run in both batch & streaming Use custom timestamps Handle late data https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg
  • 3. Data Shapes Google’s Data Processing Story The Dataflow Model Agenda Demo: Google Cloud Dataflow 1 2 3 4
  • 8. ...maybe even infinitely big... 9:008:00 14:0013:0012:0011:0010:002:001:00 7:006:005:004:003:00
  • 9. … with unknown delays. 9:008:00 14:0013:0012:0011:0010:00 8:00 8:008:00 8:00
  • 10. 1 + 1 = 2 Completeness Latency Cost $$$ Data Processing Tradeoffs
  • 11. Requirements: Billing Pipeline Completeness Low Latency Low Cost Important Not Important
  • 12. Requirements: Live Cost Estimate Pipeline Completeness Low Latency Low Cost Important Not Important
  • 13. Requirements: Abuse Detection Pipeline Completeness Low Latency Low Cost Important Not Important
  • 14. Requirements: Abuse Detection Backfill Pipeline Completeness Low Latency Low Cost Important Not Important
  • 16. 20122002 2004 2006 2008 2010 MapReduce GFS Big Table Dremel Pregel FlumeJava Colossus Spanner 2014 MillWheel Dataflow 2016 Google’s Data-Related Papers
  • 18. FlumeJava: Easy and Efficient MapReduce Pipelines ● Higher-level API with simple data processing abstractions. ○ Focus on what you want to do to your data, not what the underlying system supports. ● A graph of transformations is automatically transformed into an optimized series of MapReduces.
  • 20. MapReduce Batch Patterns: Repetitive Runs Tuesday Wednesday Thursday
  • 21. MapReduce Tuesday [11:00 - 12:00) [12:00 - 13:00) [13:00 - 14:00) [14:00 - 15:00) [15:00 - 16:00) [16:00 - 17:00) [18:00 - 19:00) [19:00 - 20:00) [21:00 - 22:00) [22:00 - 23:00) [23:00 - 0:00) Batch Patterns: Time Based Windows
  • 23. MillWheel: Streaming Computations ● Framework for building low-latency data-processing applications ● User provides a DAG of computations to be performed ● System manages state and persistent flow of elements
  • 24. Streaming Patterns: Element-wise transformations 13:00 14:008:00 9:00 10:00 11:00 12:00 Processing Time
  • 25. Streaming Patterns: Aggregating Time Based Windows 13:00 14:008:00 9:00 10:00 11:00 12:00 Processing Time
  • 26. 11:0010:00 15:0014:0013:0012:00Event Time 11:0010:00 15:0014:0013:0012:00 Processing Time Input Output Streaming Patterns: Event Time Based Windows
  • 27. Streaming Patterns: Session Windows Event Time Processing Time 11:0010:00 15:0014:0013:0012:00 11:0010:00 15:0014:0013:0012:00 Input Output
  • 28. ProcessingTime Event Time Skew Event-Time Skew Watermark Watermarks describe event time progress. "No timestamp earlier than the watermark will be seen" Often heuristic-based. Too Slow? Results are delayed. Too Fast? Some data is late.
  • 29. Streaming or Batch? 1 + 1 = 2 $$$ Completeness Latency Cost Why not both?
  • 31. What are you computing? Where in event time? When in processing time? How do refinements relate?
  • 32. What are you computing? • A Pipeline represents a graph of data processing transformations • PCollections flow through the pipeline • Optimized and executed as a unit for efficiency
  • 33. What are you computing? • A PCollection<T> is a collection of data of type T • Maybe be bounded or unbounded in size • Each element has an implicit timestamp • Initially created from backing data stores
  • 34. What are you computing? PTransforms transform PCollections into other PCollections. What Where When How Element-Wise Aggregating Composite
  • 35. Example: Computing Integer Sums // Collection of raw log lines PCollection<String> raw = ...; // Element-wise transformation into team/score pairs PCollection<KV<String, Integer>> input = raw.apply(ParDo.of(new ParseFn())) // Composite transformation containing an aggregation PCollection<KV<String, Integer>> output = input .apply(Sum.integersPerKey()); What Where When How
  • 36. Example: Computing Integer Sums What Where When How
  • 37. What Where When How Example: Computing Integer Sums
  • 38. Key 2 Key 1 Key 3 1 Fixed 2 3 4 Key 2 Key 1 Key 3 Sliding 1 2 3 5 4 Key 2 Key 1 Key 3 Sessions 2 4 3 1 Where in Event Time? • Windowing divides data into event- time-based finite chunks. • Required when doing aggregations over unbounded data. What Where When How
  • 39. PCollection<KV<String, Integer>> output = input .apply(Window.into(FixedWindows.of(Minutes(2)))) .apply(Sum.integersPerKey()); What Where When How Example: Fixed 2-minute Windows
  • 40. What Where When How Example: Fixed 2-minute Windows
  • 41. What Where When How When in Processing Time? • Triggers control when results are emitted. • Triggers are often relative to the watermark. ProcessingTime Event Time Watermark
  • 42. PCollection<KV<String, Integer>> output = input .apply(Window.into(FixedWindows.of(Minutes(2))) .trigger(AtWatermark()) .apply(Sum.integersPerKey()); What Where When How Example: Triggering at the Watermark
  • 43. What Where When How Example: Triggering at the Watermark
  • 44. What Where When How Example: Triggering for Speculative & Late Data PCollection<KV<String, Integer>> output = input .apply(Window.into(FixedWindows.of(Minutes(2))) .trigger(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1)))) .apply(Sum.integersPerKey());
  • 45. What Where When How Example: Triggering for Speculative & Late Data
  • 46. What Where When How How do Refinements Relate? • How should multiple outputs per window accumulate? • Appropriate choice depends on consumer. Firing Elements Speculative 3 Watermark 5, 1 Late 2 Total Observ 11 Discarding 3 6 2 11 Accumulating 3 9 11 23 Acc. & Retracting 3 9, -3 11, -9 11
  • 47. PCollection<KV<String, Integer>> output = input .apply(Window.into(Sessions.withGapDuration(Minutes(1))) .trigger(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1))) .accumulatingAndRetracting()) .apply(new Sum()); What Where When How Example: Add Newest, Remove Previous
  • 48. What Where When How Example: Add Newest, Remove Previous
  • 49. 1. Classic Batch 2. Batch with Fixed Windows 3. Streaming 5. Streaming with Retractions 4. Streaming with Speculative + Late Data Customizing What Where When How What Where When How
  • 50. Demo: Google Cloud Dataflow4
  • 51. Cloud DataflowA fully-managed cloud service and programming model for batch and streaming big data processing. Google Cloud Dataflow
  • 52. Open Source SDKs ● Used to construct a Dataflow pipeline. ● Java available now. Python in the open beta. ● Pipelines can run… ○ On your development machine ○ On the Dataflow Service on Google Cloud Platform ○ On third party environments like Spark or Flink.
  • 53. Fully Managed Dataflow Service Runs the pipeline on Google Cloud Platform. Includes: ● Graph optimization: Modular code, efficient execution ● Smart Workers: Lifecycle management, Autoscaling, and Smart task rebalancing ● Easy Monitoring: Dataflow UI, Restful API and CLI, Integration with Cloud Logging, etc.
  • 55. user11_BisqueEmu,BisqueEmu,11,1460980305000,2016-04-18 04:59:27.936 user0_BisqueQuokka,BisqueQuokka,15,1460980768000,2016-04-18 04:59:28.473 user11_ArmyGreenNumbat,ArmyGreenNumbat,5,1460980768000,2016-04-18 04:59:28.473 user4_AquaEmu,AquaEmu,19,1460980768000,2016-04-18 04:59:28.473 user5_BeigeDingo,BeigeDingo,16,1460980768000,2016-04-18 04:59:28.473 user1_ArmyGreenKookaburra,ArmyGreenKookaburra,9,1460980768000,2016-04-18 04:59:28.473 user7_AquaDingo,AquaDingo,18,1460980768000,2016-04-18 04:59:28.473 user7_AmberEmu,AmberEmu,11,1460980768000,2016-04-18 04:59:28.473 user1_BeigeDingo,BeigeDingo,6,1460980768000,2016-04-18 04:59:28.473 user2_ArmyGreenKookaburra,ArmyGreenKookaburra,8,1460980768000,2016-04-18 04:59:28.473 user10_AzureBandicoot,AzureBandicoot,12,1460980768000,2016-04-18 04:59:28.473 user1_AzureBandicoot,AzureBandicoot,12,1460980768000,2016-04-18 04:59:28.473 Robot-12,BeigeQuokka,12,1460980768000,2016-04-18 04:59:28.473 user3_MagentaAntechinus,MagentaAntechinus,5,1460980768000,2016-04-18 04:59:28.473 user6_AmberEmu,AmberEmu,15,1460980768000,2016-04-18 04:59:28.473 Data
  • 56. Wrote Reusable PTransforms Ran the PTransform in streaming mode Used event time, not processing time Easily adjusted windowing and late data settings
  • 57. Learn More! ● The Dataflow Model @VLDB 2015 http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf ● Dataflow SDK for Java https://github.com/GoogleCloudPlatform/DataflowJavaSDK ● Google Cloud Dataflow on Google Cloud Platform http://cloud.google.com/dataflow (Free Trial!)