Aljoscha Krettek – Notions of Time

Flink Forward

Flink Forward 2015

Notions of Time
Aljoscha Krettek
aljoscha@apache.org
@aljoscha
How Apache Flink™ Handles Time and Windows

In Streaming:
Arriving data never stops!

Solution:
Put elements into buckets,
these are called windows

Window (5 min)
Count #Hashtags
Just saw #Trump on
#CNN, super cool. :D
Trump: 2394
Cheese: 12984
Money: 42
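
What the window computes on this slide can be sketched in plain Java. This is only an illustration, not Flink code: the class name HashtagCountSketch and the method countHashtags are hypothetical, and the sketch assumes a hashtag is "#" followed by word characters.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashtagCountSketch {
    // Assumption: a hashtag is '#' followed by one or more word characters.
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    // Counts every hashtag that occurs in the tweets of one window.
    static Map<String, Integer> countHashtags(List<String> tweetsInWindow) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String tweet : tweetsInWindow) {
            Matcher m = HASHTAG.matcher(tweet);
            while (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> window = Arrays.asList("Just saw #Trump on #CNN, super cool. :D");
        System.out.println(countHashtags(window));
    }
}
```

For the single tweet on the slide, this yields one count each for Trump and CNN.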

What I didn’t mention
• tweets have a timestamp,
their event time
• tweets from across the globe
arrive with delay
=> tweets with different
timestamps arrive out-of-order

Window (5 min)
Count #Hashtags
12:34 (13.10.2015):
Just saw #Trump on
#CNN, super cool. :D
Trump: 2394
Cheese: 12984
Money: 42
These arrive with
3 minutes slack
Form windows based
on processing time
of the machine.
Processing Time != Event Time
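
To make the mismatch concrete, here is a minimal sketch in plain Java (not Flink API; WindowAssignmentSketch and windowStart are hypothetical names). It assigns the tweet from the slide to a 5-minute window twice: once by its event time (12:34) and once by the time it reaches the machine, assuming the 3 minutes of slack mentioned above and windows aligned to the full hour.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

final class WindowAssignmentSketch {
    static final int WINDOW_MINUTES = 5;

    // Start of the 5-minute window that contains the given time
    // (assumption: windows are aligned to the full hour).
    static LocalDateTime windowStart(LocalDateTime time) {
        LocalDateTime minute = time.truncatedTo(ChronoUnit.MINUTES);
        return minute.minusMinutes(minute.getMinute() % WINDOW_MINUTES);
    }

    public static void main(String[] args) {
        LocalDateTime eventTime = LocalDateTime.of(2015, 10, 13, 12, 34);
        // Assumption: the tweet arrives with 3 minutes of slack.
        LocalDateTime processingTime = eventTime.plusMinutes(3);
        System.out.println("event-time window starts at      " + windowStart(eventTime));
        System.out.println("processing-time window starts at " + windowStart(processingTime));
    }
}
```

By event time the tweet belongs to the window starting at 12:30; by processing time it is counted in the window starting at 12:35.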

Why do people use this?
• easy to implement
• low latency
• this is what systems give you
(Spark Streaming, Apex,
Samza, Storm)*
*not Google Cloud Dataflow

Window (5 min)
Correlate Tweets
and News
something...
These still have 3 min slack.
These have 8 min slack.
12:33 (13.10.2015):
Donald Trump speaks
at Cheese conference.
Processing Time != Event Time

Processing Time != Event Time
=> Mismatch in the
timespace continuum

Use cases
• out-of-order elements
• sources with delay
• recovery/fault-tolerance
• “catching up” with a stream
Who does it?
• Google Cloud Dataflow
• Apache Flink

We need a
Global Clock
that runs on
event time
instead of
processing time.

[Diagram: numbered items (0, 1, 2) flow from the source to the window operator]
This is a source
This is our window operator
This is the current event time
This is a watermark.
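
The idea of an event-time clock driven by watermarks can be sketched in plain Java. This is a simplified illustration, not Flink's internals: WatermarkSketch and its methods are hypothetical, there is a single input, and a watermark t is assumed to mean that no element with a timestamp of t or less will follow. Late elements are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WatermarkSketch {
    private final long windowSize;
    // Open windows keyed by window start; the value is the element count so far.
    private final TreeMap<Long, Integer> openWindows = new TreeMap<>();
    // Current event time of this operator; only watermarks move it forward.
    private long currentEventTime = Long.MIN_VALUE;

    WatermarkSketch(long windowSize) {
        this.windowSize = windowSize;
    }

    // An element is placed by its own timestamp, regardless of arrival order.
    void onElement(long timestamp) {
        long start = timestamp - Math.floorMod(timestamp, windowSize);
        openWindows.merge(start, 1, Integer::sum);
    }

    // Advances the event-time clock and returns "start:count" for every
    // window whose last timestamp is covered by the watermark.
    List<String> onWatermark(long watermark) {
        currentEventTime = Math.max(currentEventTime, watermark);
        List<String> fired = new ArrayList<>();
        while (!openWindows.isEmpty()
                && openWindows.firstKey() + windowSize - 1 <= currentEventTime) {
            Map.Entry<Long, Integer> w = openWindows.pollFirstEntry();
            fired.add(w.getKey() + ":" + w.getValue());
        }
        return fired;
    }

    public static void main(String[] args) {
        WatermarkSketch op = new WatermarkSketch(4);
        for (long ts : new long[] {1, 0, 2, 5, 3}) {
            op.onElement(ts); // elements arrive out of order
        }
        System.out.println(op.onWatermark(2)); // window 0-3 is not complete yet
        System.out.println(op.onWatermark(3)); // window 0-3 fires with 4 elements
        System.out.println(op.onWatermark(7)); // window 4-7 fires with 1 element
    }
}
```

The point of the design is that a window fires when the watermark says its event-time range is complete, not when the wall clock of the machine passes.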

Processing Time
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// Windows are formed from the clock of the processing machine.
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
DataStream<Tweet> text = env.addSource(new TwitterSrc());
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new ExtractHashtags())
    .keyBy("name")
    // 5-minute windows per hashtag
    .timeWindow(Time.of(5, TimeUnit.MINUTES))
    .apply(new HashtagCounter());

Event Time
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// Windows are formed from the timestamps carried by the elements.
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
DataStream<Tweet> text = env.addSource(new TwitterSrc());
// Extract each tweet's event timestamp.
text = text.assignTimestamps(new MyTimestampExtractor());
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new ExtractHashtags())
    .keyBy("name")
    // 5-minute windows per hashtag
    .timeWindow(Time.of(5, TimeUnit.MINUTES))
    .apply(new HashtagCounter());

TL;DL*
• stream data is infinite
• windows are helpful
• event-time != processing time
• watermarks to the rescue
• Flink can do it
*too long, didn’t listen

Tumbling Windows of 4 Seconds
Windows: 0-3, 4-7, 8-11, 20-23, 24-27, 32-35
[Diagram: elements sorted by timestamp into these windows]
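
How a timestamp maps to one of these 4-second tumbling windows can be sketched as follows. The class and method names are hypothetical, and the sample timestamps are chosen for illustration rather than taken from the slide.

```java
final class TumblingWindowSketch {
    // Label of the tumbling window (inclusive range, e.g. "8-11")
    // that contains the given timestamp.
    static String windowFor(long timestampSeconds, long sizeSeconds) {
        long start = timestampSeconds - Math.floorMod(timestampSeconds, sizeSeconds);
        return start + "-" + (start + sizeSeconds - 1);
    }

    public static void main(String[] args) {
        for (long ts : new long[] {1, 4, 9, 20, 26, 35}) {
            System.out.println(ts + " -> " + windowFor(ts, 4));
        }
    }
}
```

Each timestamp belongs to exactly one window, and windows with no elements (here 12-15, 16-19 and 28-31) never appear.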

Aljoscha Krettek – Notions of Time

2. Adventures in Timespace

3. Why Windows*? (*not Microsoft Windows…)

4. That’s why…

5. Streaming, Batch

12. Let’s look at a more complex example.

16. How can we do this?

19. Now, show me the API!

23. flink.apache.org @ApacheFlink

Editor's notes

Slack is the amount of time by which elements arrive late.
When catching up with a stream, for example with elements stored in Kafka, you would still want correct windows based on the timestamps in the elements.