SlideShare ist ein Scribd-Unternehmen logo
1 von 25
Marton Balassi – data Artisans
Gyula Fora - SICS
Flink committers
mbalassi@apache.org / gyfora@apache.org
Real-time Stream Processing
with Apache Flink
Stream Processing
2
 Data stream: Infinite sequence of data arriving in a continuous fashion.
 Stream processing: Analyzing and acting on real-time streaming data,
using continuous queries
Streaming landscape
3
Apache Storm
•True streaming, low latency - lower throughput
•Low level API (Bolts, Spouts) + Trident
Spark Streaming
•Stream processing on top of batch system, high throughput - higher latency
•Functional API (DStreams), restricted by batch runtime
Apache Samza
•True streaming built on top of Apache Kafka, state is first class citizen
•Slightly different stream notion, low level API
Apache Flink
•True streaming with adjustable latency-throughput trade-off
•Rich functional API exploiting streaming runtime; e.g. rich windowing semantics
Apache Storm
4
 True streaming, low latency - lower throughput
 Low level API (Bolts, Spouts) + Trident
 At-least-once processing guarantees Issues
 Costly fault tolerance
 Serialization
 Low level API
Spark Streaming
5
 Stream processing emulated on a batch system
 High throughput - higher latency
 Functional API (DStreams)
 Exactly-once processing guarantees Issues
 Restricted streaming
semantics
 Windowing
 High latency
Apache Samza
6
 True streaming built on top of Apache Kafka
 Slightly different stream notion, low level API
 At-least-once processing guarantees with state
Issues
 High disk IO
 Low level API
Apache Flink
7
 True streaming with adjustable latency and throughput
 Rich functional API exploiting streaming runtime
 Flexible windowing semantics
 Exactly-once processing guarantees with (small) state
Issues
 Limited state size
 HA issue
Apache Flink
8
What is Flink
9
A "use-case complete" framework to unify
batch and stream processing
Event logs
Historic data
ETL
Relational
Graph analysis
Machine learning
Streaming analysis
Flink
Historic data
Kafka, RabbitMQ, ...
HDFS, JDBC, ...
ETL, Graphs,
Machine Learning
Relational, …
Low latency
windowing,
aggregations, ...
Event logs
Real-time data
streams
What is Flink
An engine that puts equal emphasis
to streaming and batch
10
Flink stack
11
Python
Gelly
Table
FlinkML
SAMOA
Batch Optimizer
DataSet (Java/Scala) DataStream (Java/Scala)Hadoop
M/R
Flink Runtime
Local Remote Yarn Tez Embedded
Dataflow
Dataflow
*current Flink master + few PRs
Streaming Optimizer
Flink Streaming
12
Overview of the API
 Data stream sources
• File system
• Message queue connectors
• Arbitrary source functionality
 Stream transformations
• Basic transformations: Map, Reduce, Filter, Aggregations…
• Binary stream transformations: CoMap, CoReduce…
• Windowing semantics: Policy based flexible windowing (Time, Count, Delta…)
• Temporal binary stream operators: Joins, Crosses…
• Native support for iterations
 Data stream outputs
 For the details please refer to the programming guide:
• http://flink.apache.org/docs/latest/streaming_guide.html
13
Reduce
Merge
Filter
Sum
Map
Src
Sink
Src
Use-case: Financial analytics
14
 Reading from multiple inputs
• Merge stock data from various sources
 Window aggregations
• Compute simple statistics over windows of data
 Data driven windows
• Define arbitrary windowing semantics
 Combine with sentiment analysis
• Enrich your analytics with social media feeds (Twitter)
 Streaming joins
• Join multiple data streams
 Detailed explanation and source code on our blog
• http://flink.apache.org/news/2015/02/09/streaming-example.html
Reading from multiple inputs
case class StockPrice(symbol : String, price : Double)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val socketStockStream = env.socketTextStream("localhost", 9999)
.map(x => { val split = x.split(",")
StockPrice(split(0), split(1).toDouble) })
val SPX_Stream = env.addSource(generateStock("SPX")(10) _)
val FTSE_Stream = env.addSource(generateStock("FTSE")(20) _)
val stockStream = socketStockStream.merge(SPX_Stream, FTSE_STREAM) 15
(1)
(2)
(4)
(3)
(1)
(2)
(3)
(4)
"HDP, 23.8"
"HDP, 26.6"
StockPrice(SPX, 2113.9)
StockPrice(FTSE, 6931.7)
StockPrice(SPX, 2113.9)
StockPrice(FTSE, 6931.7)
StockPrice(HDP, 23.8)
StockPrice(HDP, 26.6)
Window aggregations
val windowedStream = stockStream
.window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))
val lowest = windowedStream.minBy("price")
val maxByStock = windowedStream.groupBy("symbol").maxBy("price")
val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _)
16
(1)
(2)
(4)
(3)
(1)
(2)
(4)
(3)
StockPrice(SPX, 2113.9)
StockPrice(FTSE, 6931.7)
StockPrice(HDP, 23.8)
StockPrice(HDP, 26.6)
StockPrice(HDP, 23.8)
StockPrice(SPX, 2113.9)
StockPrice(FTSE, 6931.7)
StockPrice(HDP, 26.6)
StockPrice(SPX, 2113.9)
StockPrice(FTSE, 6931.7)
StockPrice(HDP, 25.2)
Data-driven windows
case class Count(symbol : String, count : Int)
val priceWarnings = stockStream.groupBy("symbol")
.window(Delta.of(0.05, priceChange, defaultPrice))
.mapWindow(sendWarning _)
val warningsPerStock = priceWarnings.map(Count(_, 1)) .groupBy("symbol")
.window(Time.of(30, SECONDS))
.sum("count") 17
(1)
(2) (4)
(3)
(1)
(2)
(4)
(3)
StockPrice(SPX, 2113.9)
StockPrice(FTSE, 6931.7)
StockPrice(HDP, 23.8)
StockPrice(HDP, 26.6)
Count(HDP, 1)StockPrice(HDP, 23.8)
StockPrice(HDP, 26.6)
Combining with a Twitter stream
val tweetStream = env.addSource(generateTweets _)
val mentionedSymbols = tweetStream.flatMap(tweet => tweet.split(" "))
.map(_.toUpperCase())
.filter(symbols.contains(_))
val tweetsPerStock = mentionedSymbols.map(Count(_, 1)).groupBy("symbol")
.window(Time.of(30, SECONDS))
.sum("count")
18
"hdp is on the rise!"
"I wish I bought more
YHOO and HDP stocks"
Count(HDP, 2)
Count(YHOO, 1)(1)
(2)
(4)
(3)
(1)
(2)
(4)
(3)
Streaming joins
val tweetsAndWarning = warningsPerStock.join(tweetsPerStock)
.onWindow(30, SECONDS)
.where("symbol")
.equalTo("symbol"){ (c1, c2) => (c1.count, c2.count) }
val rollingCorrelation = tweetsAndWarning
.window(Time.of(30, SECONDS))
.mapWindow(computeCorrelation _)
19
Count(HDP, 2)
Count(YHOO, 1)
Count(HDP, 1)
(1,2)
(1) (2)
(1)
(2)
0.5
Fault tolerance
 Exactly once semantics
• Asynchronous barrier snapshotting
• Checkpoint barriers streamed from the sources
• Operator state checkpointing + source backup
• Pluggable backend for state management
20
1
1
2 3
JM
SM
State manager
Job manager
Operator
Snapshot barrier
Event channel
Data channel
Checkpoint
JM
SM
Performance
21
 Performance optimizations
• Effective serialization due to strongly typed topologies
• Operator chaining (thread sharing/no serialization)
• Different automatic query optimizations
 Competitive performance
• ~ 1.5m events / sec / core
• As a comparison Storm promises ~ 1m tuples / sec / node
Roadmap
22
 Persistent, high-throughput state backend
 Job manager high availability
 Application libraries
• General statistics over streams
• Pattern matching
• Machine learning pipelines library
• Streaming graph processing library
 Integration with other frameworks
• Zeppelin (Notebook)
• SAMOA (Online ML)
Summary
 Flink is a use-case complete framework to unify batch
and stream processing
 True streaming runtime with high-level APIs
 Flexible, data-driven windowing semantics
 Competitive performance
 We are just getting started!
23
Flink Community
24
0
20
40
60
80
100
120
Jul-09 Nov-10 Apr-12 Aug-13 Dec-14 May-16
Unique git contributors
flink.apache.org
@ApacheFlink

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream ProcessingSuneel Marthi
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsFlink Forward
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Apache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupApache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupStephan Ewen
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 

Was ist angesagt? (20)

Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream Processing
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Flink Streaming
Flink StreamingFlink Streaming
Flink Streaming
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Apache flink
Apache flinkApache flink
Apache flink
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Apache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupApache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink Meetup
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 

Andere mochten auch

[246] foursquare데이터라이프사이클 설현준
[246] foursquare데이터라이프사이클 설현준[246] foursquare데이터라이프사이클 설현준
[246] foursquare데이터라이프사이클 설현준NAVER D2
 
[115] clean fe development_윤지수
[115] clean fe development_윤지수[115] clean fe development_윤지수
[115] clean fe development_윤지수NAVER D2
 
Large scale data processing pipelines at trivago
Large scale data processing pipelines at trivago Large scale data processing pipelines at trivago
Large scale data processing pipelines at trivago Clemens Valiente
 
[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림NAVER D2
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelinesLars Albertsson
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Stormviirya
 
[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민NAVER D2
 
Building a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSBuilding a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSAmazon Web Services
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Brandon O'Brien
 
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...Roberto Hashioka
 
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민NAVER D2
 
[112]rest에서 graph ql과 relay로 갈아타기 이정우
[112]rest에서 graph ql과 relay로 갈아타기 이정우[112]rest에서 graph ql과 relay로 갈아타기 이정우
[112]rest에서 graph ql과 relay로 갈아타기 이정우NAVER D2
 
[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영NAVER D2
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkGuido Schmutz
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 

Andere mochten auch (17)

[246] foursquare데이터라이프사이클 설현준
[246] foursquare데이터라이프사이클 설현준[246] foursquare데이터라이프사이클 설현준
[246] foursquare데이터라이프사이클 설현준
 
[115] clean fe development_윤지수
[115] clean fe development_윤지수[115] clean fe development_윤지수
[115] clean fe development_윤지수
 
Large scale data processing pipelines at trivago
Large scale data processing pipelines at trivago Large scale data processing pipelines at trivago
Large scale data processing pipelines at trivago
 
[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
 
[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민
 
Building a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSBuilding a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWS
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
 
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
 
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
 
[112]rest에서 graph ql과 relay로 갈아타기 이정우
[112]rest에서 graph ql과 relay로 갈아타기 이정우[112]rest에서 graph ql과 relay로 갈아타기 이정우
[112]rest에서 graph ql과 relay로 갈아타기 이정우
 
[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 

Ähnlich wie Real-time Stream Processing with Apache Flink

Real-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitReal-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitGyula Fóra
 
Flink Streaming Berlin Meetup
Flink Streaming Berlin MeetupFlink Streaming Berlin Meetup
Flink Streaming Berlin MeetupMárton Balassi
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Robert Metzger
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseKostas Tzoumas
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureRobert Metzger
 
Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsStephan Ewen
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Kostas Tzoumas
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Robert Metzger
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteFlink Forward
 
Counting Elements in Streams
Counting Elements in StreamsCounting Elements in Streams
Counting Elements in StreamsJamie Grier
 
Meet the squirrel @ #CSHUG
Meet the squirrel @ #CSHUGMeet the squirrel @ #CSHUG
Meet the squirrel @ #CSHUGMárton Balassi
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
 
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupApache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupRobert Metzger
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 Databricks
 
Strtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineStrtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineMyung Ho Yun
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CARobert Metzger
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 

Ähnlich wie Real-time Stream Processing with Apache Flink (20)

Real-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitReal-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop Summit
 
Flink Streaming Berlin Meetup
Flink Streaming Berlin MeetupFlink Streaming Berlin Meetup
Flink Streaming Berlin Meetup
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
 
Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and Friends
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward Keynote
 
Counting Elements in Streams
Counting Elements in StreamsCounting Elements in Streams
Counting Elements in Streams
 
Meet the squirrel @ #CSHUG
Meet the squirrel @ #CSHUGMeet the squirrel @ #CSHUG
Meet the squirrel @ #CSHUG
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupApache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya Meetup
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
Strtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineStrtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP Engine
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Real-time Stream Processing with Apache Flink

  • 1. Marton Balassi – data Artisans Gyula Fora - SICS Flink committers mbalassi@apache.org / gyfora@apache.org Real-time Stream Processing with Apache Flink
  • 2. Stream Processing 2  Data stream: Infinite sequence of data arriving in a continuous fashion.  Stream processing: Analyzing and acting on real-time streaming data, using continuous queries
  • 3. Streaming landscape 3 Apache Storm •True streaming, low latency - lower throughput •Low level API (Bolts, Spouts) + Trident Spark Streaming •Stream processing on top of batch system, high throughput - higher latency •Functional API (DStreams), restricted by batch runtime Apache Samza •True streaming built on top of Apache Kafka, state is first class citizen •Slightly different stream notion, low level API Apache Flink •True streaming with adjustable latency-throughput trade-off •Rich functional API exploiting streaming runtime; e.g. rich windowing semantics
  • 4. Apache Storm 4  True streaming, low latency - lower throughput  Low level API (Bolts, Spouts) + Trident  At-least-once processing guarantees Issues  Costly fault tolerance  Serialization  Low level API
  • 5. Spark Streaming 5  Stream processing emulated on a batch system  High throughput - higher latency  Functional API (DStreams)  Exactly-once processing guarantees Issues  Restricted streaming semantics  Windowing  High latency
  • 6. Apache Samza 6  True streaming built on top of Apache Kafka  Slightly different stream notion, low level API  At-least-once processing guarantees with state Issues  High disk IO  Low level API
  • 7. Apache Flink 7  True streaming with adjustable latency and throughput  Rich functional API exploiting streaming runtime  Flexible windowing semantics  Exactly-once processing guarantees with (small) state Issues  Limited state size  HA issue
  • 9. What is Flink 9 A "use-case complete" framework to unify batch and stream processing Event logs Historic data ETL Relational Graph analysis Machine learning Streaming analysis
  • 10. Flink Historic data Kafka, RabbitMQ, ... HDFS, JDBC, ... ETL, Graphs, Machine Learning Relational, … Low latency windowing, aggregations, ... Event logs Real-time data streams What is Flink An engine that puts equal emphasis to streaming and batch 10
  • 11. Flink stack 11 Python Gelly Table FlinkML SAMOA Batch Optimizer DataSet (Java/Scala) DataStream (Java/Scala)Hadoop M/R Flink Runtime Local Remote Yarn Tez Embedded Dataflow Dataflow *current Flink master + few PRs Streaming Optimizer
  • 13. Overview of the API  Data stream sources • File system • Message queue connectors • Arbitrary source functionality  Stream transformations • Basic transformations: Map, Reduce, Filter, Aggregations… • Binary stream transformations: CoMap, CoReduce… • Windowing semantics: Policy based flexible windowing (Time, Count, Delta…) • Temporal binary stream operators: Joins, Crosses… • Native support for iterations  Data stream outputs  For the details please refer to the programming guide: • http://flink.apache.org/docs/latest/streaming_guide.html 13 Reduce Merge Filter Sum Map Src Sink Src
  • 14. Use-case: Financial analytics 14  Reading from multiple inputs • Merge stock data from various sources  Window aggregations • Compute simple statistics over windows of data  Data driven windows • Define arbitrary windowing semantics  Combine with sentiment analysis • Enrich your analytics with social media feeds (Twitter)  Streaming joins • Join multiple data streams  Detailed explanation and source code on our blog • http://flink.apache.org/news/2015/02/09/streaming-example.html
  • 15. Reading from multiple inputs case class StockPrice(symbol : String, price : Double) val env = StreamExecutionEnvironment.getExecutionEnvironment val socketStockStream = env.socketTextStream("localhost", 9999) .map(x => { val split = x.split(",") StockPrice(split(0), split(1).toDouble) }) val SPX_Stream = env.addSource(generateStock("SPX")(10) _) val FTSE_Stream = env.addSource(generateStock("FTSE")(20) _) val stockStream = socketStockStream.merge(SPX_Stream, FTSE_STREAM) 15 (1) (2) (4) (3) (1) (2) (3) (4) "HDP, 23.8" "HDP, 26.6" StockPrice(SPX, 2113.9) StockPrice(FTSE, 6931.7) StockPrice(SPX, 2113.9) StockPrice(FTSE, 6931.7) StockPrice(HDP, 23.8) StockPrice(HDP, 26.6)
  • 16. Window aggregations val windowedStream = stockStream .window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS)) val lowest = windowedStream.minBy("price") val maxByStock = windowedStream.groupBy("symbol").maxBy("price") val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _) 16 (1) (2) (4) (3) (1) (2) (4) (3) StockPrice(SPX, 2113.9) StockPrice(FTSE, 6931.7) StockPrice(HDP, 23.8) StockPrice(HDP, 26.6) StockPrice(HDP, 23.8) StockPrice(SPX, 2113.9) StockPrice(FTSE, 6931.7) StockPrice(HDP, 26.6) StockPrice(SPX, 2113.9) StockPrice(FTSE, 6931.7) StockPrice(HDP, 25.2)
  • 17. Data-driven windows case class Count(symbol : String, count : Int) val priceWarnings = stockStream.groupBy("symbol") .window(Delta.of(0.05, priceChange, defaultPrice)) .mapWindow(sendWarning _) val warningsPerStock = priceWarnings.map(Count(_, 1)) .groupBy("symbol") .window(Time.of(30, SECONDS)) .sum("count") 17 (1) (2) (4) (3) (1) (2) (4) (3) StockPrice(SPX, 2113.9) StockPrice(FTSE, 6931.7) StockPrice(HDP, 23.8) StockPrice(HDP, 26.6) Count(HDP, 1)StockPrice(HDP, 23.8) StockPrice(HDP, 26.6)
  • 18. Combining with a Twitter stream val tweetStream = env.addSource(generateTweets _) val mentionedSymbols = tweetStream.flatMap(tweet => tweet.split(" ")) .map(_.toUpperCase()) .filter(symbols.contains(_)) val tweetsPerStock = mentionedSymbols.map(Count(_, 1)).groupBy("symbol") .window(Time.of(30, SECONDS)) .sum("count") 18 "hdp is on the rise!" "I wish I bought more YHOO and HDP stocks" Count(HDP, 2) Count(YHOO, 1)(1) (2) (4) (3) (1) (2) (4) (3)
  • 19. Streaming joins val tweetsAndWarning = warningsPerStock.join(tweetsPerStock) .onWindow(30, SECONDS) .where("symbol") .equalTo("symbol"){ (c1, c2) => (c1.count, c2.count) } val rollingCorrelation = tweetsAndWarning .window(Time.of(30, SECONDS)) .mapWindow(computeCorrelation _) 19 Count(HDP, 2) Count(YHOO, 1) Count(HDP, 1) (1,2) (1) (2) (1) (2) 0.5
  • 20. Fault tolerance  Exactly once semantics • Asynchronous barrier snapshotting • Checkpoint barriers streamed from the sources • Operator state checkpointing + source backup • Pluggable backend for state management 20 1 1 2 3 JM SM State manager Job manager Operator Snapshot barrier Event channel Data channel Checkpoint JM SM
  • 21. Performance 21  Performance optimizations • Effective serialization due to strongly typed topologies • Operator chaining (thread sharing/no serialization) • Different automatic query optimizations  Competitive performance • ~ 1.5m events / sec / core • As a comparison Storm promises ~ 1m tuples / sec / node
  • 22. Roadmap 22  Persistent, high-throughput state backend  Job manager high availability  Application libraries • General statistics over streams • Pattern matching • Machine learning pipelines library • Streaming graph processing library  Integration with other frameworks • Zeppelin (Notebook) • SAMOA (Online ML)
  • 23. Summary  Flink is a use-case complete framework to unify batch and stream processing  True streaming runtime with high-level APIs  Flexible, data-driven windowing semantics  Competitive performance  We are just getting started! 23
  • 24. Flink Community 24 0 20 40 60 80 100 120 Jul-09 Nov-10 Apr-12 Aug-13 Dec-14 May-16 Unique git contributors

Hinweis der Redaktion

  1. 3 main components to the system: Connectors(sources), Operators(trafos), Sinks (outputs) The source interface is as general as it gets + preimplemented connectors Rich set of operators, designed for true streaming analytics (long-standing, stateful, windowing) Sinks are very general, same as sources. Simple interfaces + pre-implemented
  2. -The goal is to showcase the main features of the api on a “real world” example -Use-case: Analyze streams of stock market data, which consists of (Stock symbol, Stock price) pairs -Sentiment analysis: Combine the market information with information aquired from social media feeds (in this case the nr of times the stock symbol was mentioned in the twitter stream) -Use stream joins for this ^
  3. - As a first step we need to connect to our data streams, and parse our inputs Here we use a simple socket stream and convert it to a case-class -> Could have used kafka or any other message queue + or more advanced stock representation
  4. Talk about policy based windowing -> eviction(window size), trigger (slide size) Window operations: reduce, mapWindow Grouped vs non-grouped windowing
  5. Flexible windowing -> awesome features Delta policy Case-class Count for convenience Mention other cool use-cases: detecting user sessions
  6. Simple windowed word count on the filtered tweet stream for each symbol