DBIS, INSTITUT FÜR INFORMATIK
HUMBOLDT-UNIVERSITÄT ZU BERLIN
A Tale of Squirrels and Storms
Flink Forward 2015
Matthias J. Sax
mjsax@{informatik.hu-berlin.de|apache.org}
@MatthiasJSax
Humboldt-Universität zu Berlin
Department of Computer Science
October 13th, 2015
– Matthias J. Sax – Squirrels and Storms
1/22
About Me
Ph. D. student in CS, DBIS Group, HU Berlin
involved in Stratosphere research project
working on data stream processing and optimization
Aeolus: built on top of Apache Storm
(https://github.com/mjsax/aeolus)
Committer at Apache Flink
Similarities of Flink and Storm
true stream processing engines (no micro-batching)
low latencies (roughly 100 ms)
executing data flow programs
parallel and distributed
fault-tolerant
cloud or cluster environment
Trident:
similar Java API
exactly-once processing
Flink vs. Storm
Advantages of Storm:
super low latency (< 10ms)
very robust:
stateless JVM for easy restart on failure
Zookeeper manages cluster state
isolation of topology
dynamic scaling (to some extent)
multi-language protocol (for experts only)
distributed RPC
Flink vs. Storm
Advantages of Flink [1]:
richer API
Java and Scala
type-safe programs
system is aware of multiple input streams
ordered stream processing
system and user timestamps
count/time and customized windows
stateful processing
lightweight fault tolerance
Chandy-Lamport distributed snapshots
[1] http://data-artisans.com/real-time-stream-processing-the-next-step-for-apache-flink/
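To make the "count windows" item above concrete, here is a minimal sketch in plain Java of a tumbling count window that sums every N consecutive elements. It does not use the Flink API; the class name `CountWindowSketch` and the method `tumblingSums` are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a tumbling count window over an in-memory "stream".
// Not Flink code; names are hypothetical.
final class CountWindowSketch {

    // Emits one sum per full window of `windowSize` elements.
    // A trailing, incomplete window is not emitted.
    static List<Integer> tumblingSums(List<Integer> stream, int windowSize) {
        List<Integer> sums = new ArrayList<>();
        int sum = 0;
        int count = 0;
        for (int value : stream) {
            sum += value;
            count++;
            if (count == windowSize) {
                sums.add(sum); // window is full: emit and reset
                sum = 0;
                count = 0;
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        // Windows [1,2,3] and [4,5,6] are complete; 7 stays buffered.
        System.out.println(tumblingSums(List.of(1, 2, 3, 4, 5, 6, 7), 3));
    }
}
```

A time window would follow the same pattern, except that the trigger condition compares timestamps instead of counting elements.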
Flink vs. Storm
Advantages of Flink (cont.):
provides exactly-once sinks
native flow control (back pressure) [2]
higher throughput (> 100x) [3]
no lambda or kappa architecture necessary
native support for iterations (cyclic data flows)
managed memory
[2] http://data-artisans.com/how-flink-handles-backpressure/
[3] http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
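The general idea behind back pressure can be sketched with a bounded buffer between a producer and a consumer: when the buffer is full, the producer blocks until the consumer catches up. This is only a loose analogy in plain Java and not how Flink is implemented internally; the class name `BackPressureSketch` and the method `run` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: a bounded queue slows the producer down to the
// consumer's pace. Names are hypothetical; this is not Flink code.
final class BackPressureSketch {

    static List<Integer> run(int items, int capacity) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
        List<Integer> received = new ArrayList<>();

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) {
                    received.add(buffer.take()); // blocks while the buffer is empty
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        try {
            for (int i = 0; i < items; i++) {
                buffer.put(i); // blocks while the buffer is full
            }
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(run(10, 2).size());
    }
}
```

No element is dropped and the buffer never holds more than `capacity` elements, so a slow consumer throttles the producer instead of causing unbounded queue growth.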
Topology Deployment: Storm
per default: round-robin scheduling
high overhead due to intra JVM and/or network communication
localOrShuffle connection pattern poorly exploited
isolation of topologies
custom scheduler possible (for experts only)
[Figure: example topology with tasks Src, T1, T2, F1, F2, C1, C2, Sk]
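The interplay between round-robin placement and a local-or-shuffle connection can be sketched as follows. This is a simplified model in plain Java, not Storm's scheduler: it assumes each worker is a separate JVM, that tasks are placed strictly in list order, and that the task names from the figure form a pipeline in which T1 feeds F1 and F2 (the actual edges are not given in the slides). The class `RoundRobinSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: simplified round-robin placement and a
// local-or-shuffle target choice. Names are hypothetical.
final class RoundRobinSketch {

    // Place task i on worker (i mod numWorkers).
    static Map<String, Integer> assign(List<String> tasks, int numWorkers) {
        Map<String, Integer> placement = new LinkedHashMap<>();
        for (int i = 0; i < tasks.size(); i++) {
            placement.put(tasks.get(i), i % numWorkers);
        }
        return placement;
    }

    // Prefer consumer tasks in the sender's worker; if there are none,
    // fall back to shuffling over all consumer tasks.
    static List<String> localOrShuffleTargets(Map<String, Integer> placement,
                                              String sender,
                                              List<String> consumers) {
        int worker = placement.get(sender);
        List<String> local = new ArrayList<>();
        for (String consumer : consumers) {
            if (placement.get(consumer) == worker) {
                local.add(consumer);
            }
        }
        return local.isEmpty() ? consumers : local;
    }

    public static void main(String[] args) {
        List<String> tasks = List.of("Src", "T1", "T2", "F1", "F2", "C1", "C2", "Sk");
        Map<String, Integer> placement = assign(tasks, 4);
        System.out.println(placement);
        System.out.println(localOrShuffleTargets(placement, "T1", List.of("F1", "F2")));
    }
}
```

With four workers in this model, T1 shares a worker with neither F1 nor F2, so the local option never applies and every tuple crosses a worker boundary. With two workers, T1 and F1 happen to be co-located and the local path is used.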